APPLICATIONS OF CLUSTERING TO INFORMATION SYSTEM DESIGN

LARRY E. STANFEL
Department of Management and Marketing, Clarkson College, Potsdam, NY 13676, U.S.A.

(Received for publication 5 May 1982)
Abstract-Given the difficulty of designing and creating information systems of many components and interconnections, it is commonplace to find these tasks accomplished by means of a partition into subsystems. Later the distinct subsystems are made to interface with one another and an overall system is achieved. The purpose of the present paper is to point out the availability of methods for effecting the partition in optimal or approximately optimal ways. Clustering algorithms for the specific case of information systems are obtained and exemplified.

1. INTRODUCTION
It is unnecessary to review at length the relative inadequacy of analysis and design techniques when confronted with the task of analyzing or designing complicated systems. Even during the period when it was fashionable to advocate "total systems design," no one was so bold as to presume that, excepting very special cases of structural simplicity, the state of the art had proceeded to the point of making that objective feasible. This is not to derogate the nobility of the idea, for, indeed, if it is to be possible to identify an optimal system and one does not take into account all the existing interactions in the total system, then how can the result be optimal? Surely an amalgamation of parts of a system is doomed to suboptimality. Being a particular sort of system, an information system is subject to the same limitations, of course, and a vast quantity has been written upon the topic of the design of information systems.

It is instructive to notice how authors with widely divergent backgrounds, points of view, and reference disciplines arrive, logically, at the conclusion that information system design, if it is to succeed, must take form as an interconnected collection of subsystems. Four brief references illustrate this fact. Within a framework of qualitative postulates, Langefors [13] adduced, via a series of theorems in the same spirit, that an iterative design scheme based upon a subsystem structure was, in fact, the only way an effective complicated system could be designed. Typical texts on management information systems advocate design in terms of subsystems, though their approach tends to be less formal. Senn [21], for example, makes mention of design strategies, all of which involve subsystems and which resemble, in addition, the hierarchical strategy of Langefors. As the foundation of an automated design aid, ISMS, various partitioning schemes based upon elementary graph theory are given by Hansen et al. [6]. These allow a partitioning of system elements into subsets called levels, and within levels more closely related subsets identifiable as circuits, for example, are salient prospects for constituting subsystems. Mentioning system structure, Spang [23] wrote "A good rule of thumb indicates that a system should be divided into functional parts in such a way that there exists maximum independence with well-defined simple interfaces and a minimum of required communication." Finally, of course, we may cite empirical evidence as experienced by anyone who has ever worked on a large complicated problem, system or not. Feasibility seems to demand a decomposition into subproblems, and the conclusions of these and a great number of authors appear, after at most a little reflection, to embody a sort of natural principle: the divide and conquer of humankind. Langefors, it will be noted, conjectured that evolution itself might have transpired according to such an iterative scheme as he described.
To summarize, then, assuming we had a knowledge of the system components and how these were interrelated, a valid problem is to decide how to partition these (optimally, if we assume we are capable of measuring partition goodness) into subsystems. We must comment at greater length upon the question of measurement, but two conflicting criteria make the problem interesting. First, the smaller the subsystems, the more interfaces we realize, and as the latter necessitate coordination and compatibility efforts, the system design task burgeons. Next, however, the larger the subsystems, the greater the task of completing any one of them, and consequently, the less the advantage gained by the decomposition; perhaps the amount of work remains prohibitive.

Since considerable research interest has been shown in the general problem of partitioning sets of objects into subsets, it seems natural to take that approach to the problem we have been mentioning, and that is the topic of the present paper. Before examining applications to problems in information systems, it is necessary to mention briefly several general concepts and approaches to the problems of partitioning. The following section contains that material. Afterward two quite different applications to the system design problem will be derived.

2. CLUSTERING PROBLEMS

A clustering or partitioning problem is simply that of separating a finite collection of objects into subsets so as to satisfy some criteria.
The criteria may stipulate a best partition, there being given a way to measure the goodness of any one; or they may include such properties as are synonymous with "acceptable," so that one stops searching when the criteria are deemed satisfied. Naturally, "acceptable" may be equated to optimal, so that the two instances are, after all, indistinct.
A number of books provide admirable surveys of the variations of the problem, mathematical formulations, the diversity of solution approaches, and informative bibliographies. See, for instance, [1, 2, 7, 22]. Some formulations, for example, fix the number of subsets in advance. Interest here will be restricted to locating optimal partitions in cases where the number of subsets is unconstrained.

It is assumed invariably that there is defined a distance between each pair of objects or, equivalently, a proximity or similarity. Here, we prefer to think in terms of distance, and whatever makes sense in the context of the problem is a reasonable distance measure. If the objects were cities, Euclidean distance in the plane would seem appropriate; for locations within a city, rectangular distances may be the proper choice; to cluster the planets in the solar system, the difference between mean distances from the sun may be desirable; to cluster a collection of students based on their performance on a battery of ten tests, the objects are the ten-dimensional vectors of scores, and distance might be Euclidean in the ten-dimensional space.

A measure of the goodness of a particular partition, then, must be in terms of these distances, and references previously cited contain numerous examples. The diversity of possibilities, in fact, is a symptom of a complicating fact to which we must return a little later.
Let us agree to denote by a within group distance (wgd) a distance between two objects within the same cluster, and by a between group distance (bgd) a distance between two objects in different clusters. We shall write d_ij for the distance between objects i and j. Several objectives that have been used in clustering problems, then, are the minimization of the within group sum of squared distances; the minimization of the within group sum of squared distances to the cluster centroid; the minimization of the maximum cluster diameter; and the minimization of the difference between the average wgd and the average bgd. There are many, many others. Since the last example function above is related to the work in the sequel, we illustrate it for the specific partition in Fig. 1.
Fig. 1. A small set partitioned.
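To make the last objective concrete, the following is a minimal sketch of its evaluation for a given partition. The code is our illustration only, not part of the original study, and the names are invented for the purpose.

    from itertools import combinations

    def objective_d(dist, labels):
        # f = (average wgd) - (average bgd) for one candidate partition.
        # dist[i][j] is the symmetric distance between objects i and j;
        # labels[i] is the cluster assigned to object i.  The value is
        # undefined (division by zero) when there are no wgd's or no bgd's.
        wgd, bgd = [], []
        for i, j in combinations(range(len(labels)), 2):
            (wgd if labels[i] == labels[j] else bgd).append(dist[i][j])
        return sum(wgd) / len(wgd) - sum(bgd) / len(bgd)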
The optimization problem would be to find that partition which yields the smallest value of the objective function. Observing the existence of a choice of objective functions, one is naturally led to asking how to make that choice. If we interpret Fig. 1 as portraying Euclidean distances, and if we were told the solution exhibited is best, we should wonder what best means, because, intuitively, it does not appear that we have clustered the objects naturally. Intuition, in fact, may well suffice to provide optimal clusterings in easy problems, such as in Fig. 2, but what if the objects are as in Fig. 3, or lying in a space of dimension we cannot illustrate? The motivation for an objective function is obvious, then, but much less obvious is how to transform the intuitional sentiments into an objective function that mirrors them perfectly. This accounts for the variety of objective functions, in general, but we will find the path a little easier in problems dealing with clustering in information systems.
Fig. 2. Naturally clustered data.
Fig. 3. Unstructured data.
It may be instructive to present a brief, hopefully logical, sequence by which one arrives at an objective function. In order to be designated a cluster, it seems that objects within a subset should all be relatively close to one another. Furthermore, if we can distinguish between two clusters, it seems they should be relatively far removed. These two criteria should combine to cause the f formerly referenced to be small. Therefore, by making f small, one hopes to achieve the two criteria. The reader will appreciate, of course, that even if the two criteria above are exactly what should be sought, there remains latitude as to how to measure both subset homogeneity and pairwise subset heterogeneity. To reiterate, however, whenever a problem may be tied to a real, physical process, this dilemma will not be so onerous, and we can be more confident in the measurement process. We shall benefit from this fact in our work here.

As a final preparatory remark, we must comment on computational difficulty. Mathematical formulations of these optimization problems tend to be integer programs, and quite often the best one can do in reasonable computation time is the achievement of an approximate solution. We shall examine a method that solves clustering problems exactly, but which works upon an approximate problem. One of our tasks is to generate surrogate problems that are not too far removed from that intended.

3. CLUSTERING APPLIED TO INFORMATION SYSTEMS
What we hope to accomplish is the partition of a set of system components into subsystems so as to optimize the creation process. Thus we assume given a list of components and a
description of all interactions among them. This information could take the form of a block diagram of the system, as in Fig. 4, where the circles represent components and the arrows, interaction.
Fig. 4. Digraph representation of a system.
Another alternative is a matrix description, where rows and columns represent system components and a 1 in the (i, j)th position means that component i acts upon component j. Transforming the structure of Fig. 4 accordingly, we would obtain the corresponding adjacency matrix.
It should be apparent at what stage of system development our interest is focused: Fig. 4 we would interpret as output of the design process. Following it must come the work of creating the system represented, i.e. building the components and coordinating their interfaces. As we know, a typical design process may evolve a number of alternative structures, and we might well wish to explore in advance aspects of the synthesis of each of them.

When partitioning a conceptual system for the work to follow, there are two considerations: the manageability or quantity of work involved in each of the subsystem tasks, and the work involved in providing interfaces between different working groups. The idea is that each cluster would represent a subsystem and that different groups build different subsystems. An option for the system in Fig. 4 would be to assign each component to a different cluster. Each subsystem task would be as small as possible in that case, but seven interfaces would result. Another alternative would be to group 1, 2, 3 in one cluster and 4, 5, 6 in a second one. Each subsystem contains three components (which may have varying sizes, of course), but there are two connections between subsystems: 3-4 and 2-6. Whether these be considered one interface or two is a matter of choice. Let us arbitrarily decide two, and in so doing define a rule for counting interfaces.

The reader will have perceived that components are our objects to cluster and that our notion of distance must be related to the connections between components. But then we have a conflict if we decide, for example, that in Fig. 4 component 2 is closer to 1 than 1 is to 2. Clustering problems demand symmetric distances or similarities, so we must interpret "being acted upon by" to represent the same extent of closeness as "acting upon". In short, we must interpret a representation such as Fig. 4 to be that of Fig. 5.
Fig. 5. A system represented by an undirected graph.
Certainly the notion of interface is not damaged, and the relationship of one subsystem to another seems, after all, independent of direction. The lines between pairs of objects in the graphical representation will be called edges.
The adjacency matrix for Fig. 5 is

        1  2  3  4  5  6
    1   0  1  1  0  0  0
    2   1  0  1  0  0  1
    3   1  1  0  1  0  0
    4   0  0  1  0  1  1
    5   0  0  0  1  0  0
    6   0  1  0  1  0  0
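As a small check of this representation and of the interface-counting rule adopted above, the following sketch (hypothetical code, not part of the original paper) counts the interfaces for the partition {1, 2, 3}, {4, 5, 6}:

    # Adjacency matrix of Fig. 5 (components 1..6 mapped to indices 0..5).
    A = [[0, 1, 1, 0, 0, 0],
         [1, 0, 1, 0, 0, 1],
         [1, 1, 0, 1, 0, 0],
         [0, 0, 1, 0, 1, 1],
         [0, 0, 0, 1, 0, 0],
         [0, 1, 0, 1, 0, 0]]

    def count_interfaces(adj, cluster_of):
        # Each edge joining components in different subsystems counts as
        # one interface, per the rule adopted in the text.
        n = len(adj)
        return sum(adj[i][j] for i in range(n) for j in range(i + 1, n)
                   if cluster_of[i] != cluster_of[j])

    print(count_interfaces(A, [0, 0, 0, 1, 1, 1]))  # -> 2 (edges 3-4 and 2-6)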
(a) A heuristic method for general systems
We may represent mathematically the finding of a partition in a number of ways. We may, for example, define

    X_ij = 1, if objects i and j are assigned to the same cluster;
         = 0, otherwise.

If we decide to group i and j together and to group j and k together, then we must have decided to group i and k together. Consequently, the constraints

    X_ij + X_jk + X_ik ≠ 2, all i < j < k    (1)
are equivalent to locating a partition of the elements. While not in the typical constraint form, the inequalities (1) may be converted to ≤ form by the introduction of additional (integer) variables and constraints. The clustering problem may then be written as an integer programming problem once an objective function has been chosen.

Let us think particularly in terms of information systems and suppose, for example, that we wish to minimize the total number of interfaces. Using the matrix I, we define

    l_ij = the number of interfaces (0 or 1) between components i and j.

Our problem becomes

    minimize Σ_{i<j} (1 - X_ij) l_ij    (2)
    s.t. X_ij + X_jk + X_ik ≠ 2, i < j < k
         X_ij = 0 or 1.
To minimize the average number of interfaces between components in different subsystems, we solve

    minimize [Σ_{i<j} (1 - X_ij) l_ij] / [Σ_{i<j} (1 - X_ij)]    (3)

subject to the same constraints.
Presumably, designing a subsystem in which components are closely related would be a simpler task than one in which they are not. For example, the subsystem in Fig. 6 would seem easier to construct than that in Fig. 7. As a result of this consideration, we may prefer an objective function which attributes weight to the homogeneity of subsystems as well as to the interfaces between distinct subsystems.
Fig. 6. A subsystem of a system.

Fig. 7. A subsystem of a system.
A more general distance measure may then be required, because two components joined by a path of interfaces of arbitrary length must be considered to be related to some extent. A possibility is to define the distance d_ij between components i and j to be equal to the length of a shortest path between them. With reference to Fig. 5, for example, d_15 = 3, d_12 = d_13 = d_23 = 1, and d_25 = 3. This measure is common in graph related problems and is a metric. It should be emphasized that the distance measures mentioned here are offered as possibilities. A user is free to define distance in whatever way seems most appropriate.

For an objective, then, one might wish to
(a) maximize the total distance between subsystems;
(b) minimize the total distance within subsystems;
(c) minimize the average squared distance within subsystems;
(d) minimize the difference between the average distance within subsystems and the average distance between subsystems:

    f = [Σ_{i<j} d_ij X_ij] / [Σ_{i<j} X_ij] - [Σ_{i<j} d_ij (1 - X_ij)] / [Σ_{i<j} (1 - X_ij)]
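Returning for a moment to the shortest-path distance proposed above, it is easily computed by breadth-first search from each component. The sketch below is ours, assuming unit-length edges as in the text:

    from collections import deque

    def shortest_path_distances(adj):
        # All-pairs shortest-path lengths by BFS over unit-length edges.
        n = len(adj)
        dist = [[None] * n for _ in range(n)]
        for s in range(n):
            dist[s][s] = 0
            queue = deque([s])
            while queue:
                u = queue.popleft()
                for v in range(n):
                    if adj[u][v] and dist[s][v] is None:
                        dist[s][v] = dist[s][u] + 1
                        queue.append(v)
        return dist

    # For the Fig. 5 matrix this yields d_15 = 3 and d_12 = d_13 = d_23 = 1.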
Finally, if we must constrain the size of subsystems, as mentioned previously, additional constraints would be added to each of the foregoing problems. Suppose component i has size m_i, and that no subsystem may be larger than M. It suffices then to add a constraint

    Σ_{i≠j} m_i X_ij + m_j ≤ M    (4)

for each value of j = 1, 2, ..., n, there being n components.

We intend to describe first a heuristic for solving the problem (d) with the constraints (4) added. The procedure, for arbitrary kinds of objects and without constraints on cluster size, was mentioned in [24], where it was discussed rather thoroughly, and in [26], where a counter-example is given along with further discussion. A description of the adapted version with subsystem weight constraints follows.
First, all inter-component distances must be specified or calculated. Components will be selected and assigned to subsystems one at a time. The set of unassigned components after k have been removed is denoted S_k; S_0 = the initial, given set. Denote by f the objective function.

Step 0. Set k = 0.
Step 1. Compute D_i = the sum of distances from component i in S_k to all other components in S_k. Compute D_j = max D_i over i in S_k; component j is the next to be assigned.
Step 2. Add M_j to the current size of each existing subsystem. If the sum exceeds M, omit that subsystem from consideration in Step 3.
Step 3. Tentatively assign object j to each subsystem surviving Step 2, and as the initial object in a new subsystem. For each trial compute the value of f.
Step 4. Find the minimum of the f values computed in Step 3. Let s_0 be the minimizing subsystem.
Step 5. Place component j in subsystem s_0. Store the now increased weight of s_0.
Step 6. If the new f value is lower than the previous best, store the new solution. (Otherwise, continue to build upon the current solution, anyway.)
Step 7. Remove component j from S_k.
Step 8. If S_{k+1} = ∅, take the last stored solution as best. Otherwise, return to Step 1 with k = k + 1.
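A compact rendering of Steps 0-8 follows. It is a sketch under our reading of the procedure: in particular, the farthest-first selection in Step 1 is our reconstruction rather than a quotation of [24], and the objective f is assumed to accept a list of subsystems, with the unassigned remainder counted as one subsystem (see the elaboration below).

    def cluster_heuristic(dist, sizes, M, f):
        # Sketch of Steps 0-8.  `f` scores a list of subsystems (sets of
        # component indices); a partial solution counts the unassigned
        # remainder S_k - {j} as one subsystem.  The farthest-first rule
        # in Step 1 is our assumption, not a quotation of [24].
        unassigned = set(range(len(dist)))                  # S_k
        subs = []
        best_val, best_sol = float("inf"), None

        def score(candidate):
            rest = [set(unassigned)] if unassigned else []
            return f([set(s) for s in candidate] + rest)

        while unassigned:
            # Step 1: the unassigned component farthest from the rest of S_k
            j = max(unassigned,
                    key=lambda i: sum(dist[i][h] for h in unassigned))
            unassigned.remove(j)                            # Step 7
            # Steps 2-3: trial placements respecting the size limit M,
            # plus the option of starting a new subsystem with j alone
            options = [k for k, s in enumerate(subs)
                       if sum(sizes[i] for i in s) + sizes[j] <= M]
            options.append(len(subs))

            def placed(k):
                trial = [set(s) for s in subs] + [set()]
                trial[k].add(j)
                return [s for s in trial if s]

            # Step 4: the placement yielding the least objective value
            k_star = min(options, key=lambda k: score(placed(k)))
            subs = placed(k_star)                           # Step 5
            val = score(subs)                               # Step 6
            if val < best_val:
                best_val, best_sol = val, [set(s) for s in subs]
        return best_sol, best_val                           # Step 8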
To elaborate slightly upon Step 6, we point out that a partial solution includes S_k - {j} as a subsystem, of course. For details of computational efficiency and theoretical aspects, the reader is referred to [24, 26], but the method works in time at most n^3 for n components and seems often, though it has been known to fail in contrived circumstances, to provide optimal solutions. Given the number of components in a typical information system, we would expect a solution to be obtained very rapidly. The heuristic has "solved" problems with n = 100 in about five seconds of IBM 370 time. For an example, we consider the system displayed in Fig. 8.
Fig. 8. Subsystems obtained by clustering.
In Fig. 8, the uncircled integers are the identifying labels of the components, whereas the circled ones are the components' sizes. The distance between two components was taken to be the length of a shortest path connecting them. Fifty was the maximum subset size allowed. The dashed lines in Fig. 8 portray the solution discovered. Since the notion of distance and the objective function were arbitrary, the solution may or may not appeal to a given observer. It should be remembered that the importance of interface is somewhat alloyed within the objective function (d). Naturally, the solution was obtained in less than one second, there being so few objects to cluster.
(b) An exact method for specialized systems
We propose next a radically different method for clustering problems and apply it to systems having a specific structure. It is necessary first to examine clustering in a different light. Assuming there are n system elements, then any subset has an easy representation in terms of an n-component vector of 0's and 1's: there is a 1 in component i if and only if system element i belongs to the subset. With n elements there are 2^n - 1 = m such vectors (excluding the one of all zeros). We denote the jth such column vector a_j and the matrix of all these by A. We define variables X_1, ..., X_m and set

    X_j = 1, if subset j is taken as a cluster;
        = 0, otherwise.

If 1 is a vector of n 1's, finding a partition of system elements is equivalent to finding a solution to

    AX = 1
    X_j = 0 or 1, all j = 1, ..., m    (5)
Thus do we arrive at the constraints of a set partitioning problem (SPP), a representation for clustering problems available at least since [18]. If we allowed elements to belong to more than one cluster, we would have written

    AX ≥ 1
    X_j = 0 or 1    (6)

and obtained the constraints of the set covering problem (SCP).

Let us assume we have decided upon some intercomponent distance measure, denote those distances d_ij, and write the problem of minimizing the number of interfaces. Embedding our problem in the SPP formulation, we would calculate

    d_j = the sum of distances from elements in subset j to elements not in subset j.

Next, if m_i is the size of component i, we assume size is additive and define

    M_j = the size of subsystem j = Σ_{i in subset j} m_i.

If M = the maximum subsystem size permitted, we may either (i) delete all columns a_j whose total weight exceeds M or (ii) add constraints to the SPP constraints, for example

    M_j X_j ≤ M, all j = 1, ..., m.
In (i) we would obtain an exact SPP; viz.

    min Σ_j d_j X_j
    s.t. AX = 1
         X_j = 0 or 1, all j

where j now excludes all subsets that are too large. In (ii) we obtain an SPP with additional constraints. For good reason, we intend not to pursue either direction. The reader will appreciate the magnitude of the task of generating explicitly all 2^n - 1 columns, in the first place, and will also note the absence of methods for solving SPP's efficiently [10], they belonging to the class of hard problems.

For purposes of illustration, let us take as an objective the f of the example problem in section (a), preceding. Its nonlinearity presents the most complicated case and also that most demanding of computational effort. Easier cases will be mentioned later.
In the notation of section (a),

    f = [Σ_{i<j} d_ij X_ij] / [Σ_{i<j} X_ij] - [Σ_{i<j} d_ij (1 - X_ij)] / [Σ_{i<j} (1 - X_ij)]    (7)

where all sums extend over the range i < j, as before. Making the assumption that Σ_{i<j} X_ij = K, a constant, (7) becomes

    f = K_1 Σ_{i<j} d_ij X_ij - K_2    (8)
with K_1, K_2 nonnegative constants, so that fixing the number of within group distances provides an equivalent problem with a linear objective function. Equation (8), in fact, is intuitively satisfying, because it says we minimize the restricted function by locating the K smallest distances for wgd's consistent with an actual partition. Neglecting momentarily restrictions on subsystem size, let us write our linearized problem in the SPP context. It says

    min Σ_j d_j X_j
    s.t. AX = 1
         Σ_j w_j X_j = K
         X_j = 0 or 1    (9)

where w_j = the number of wgd's provided by subsystem j, and d_j = the total of the wgd's there.
Were we to solve a problem (9) for every feasible value of K between 1 and n(n-1)/2 - 1, we could select the best and know we had solved the given clustering problem. (Notice that K = 0 and K = n(n-1)/2, corresponding to each item in a different subset and every item in the same subset, respectively, cause the nonlinear f to be undefined.)

Next, we make a stringent assumption, restrictive for a system, from which we may obtain useful results. We assume that the system components may be numbered in such a way that the only subsystems of possible inclusion in an optimal solution would contain consecutively numbered components. A system where this appears straightforward, for example, is in Fig. 9.
Fig. 9. Collinear subsystems.

The matrix A in (9) achieves a special structure, then; it is that in each column, the 1's are consecutive. This property is sufficient to make A unimodular [4]; that is, its square submatrices have determinant only ±1 or 0, which in turn guarantees integer solutions under the simplex method. In other words, were it not for Σ_j w_j X_j = K in (9), the integer problem could be solved by the simplex method. Unfortunately, that constraint precludes unimodularity of the coefficient matrix in (9). But there is no actual difficulty, because of the necessity of solutions for a range of
K. Lagrangian relaxation fits our needs precisely. We treat

    min Σ_j (d_j + λ w_j) X_j
    s.t. AX = 1    (10)

varying λ until Σ_j w_j X_j(λ) assumes all values of K, until we have found an optimal solution, or until we have found an approximation of known quality. There is one problem to be solved per λ value. In [25] this method is described relative to arbitrary clustering problems in which the objects are collinear. Several results there merit mention at this point.

(1) The sequence of optimal function values f*(K), in practice, is nearly unimodal; that is, they decrease to a point and then increase, with only relatively small fluctuations realized to spoil that appearance. Thus, a search over λ rather than an exhaustive enumeration is attractive.
(2) The method will solve the clustering problem exactly, up to the existence of duality gaps; that is, values of K which no choice of λ will generate. In [3], Everett showed that simple linear approximations can provide bounds on the possible error resulting from gaps, and rather wide computational experience [27] teaches that sub-ranges of K values generally produce optimal solutions. The effect of gaps is thus attenuated, if not abrogated.
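In outline, the search over the multiplier might be organized as below. The routine solve_spp stands for an assumed black-box solver of the set-partitioning part; the whole fragment is our illustration rather than the implementation of [25].

    def lagrangian_sweep(d, w, solve_spp, lambdas):
        # Sweep the multiplier of problem (10).  `solve_spp` is an assumed
        # black-box that, given column costs c, returns an optimal 0/1
        # vector X with AX = 1.  d[j] and w[j] are the total wgd and the
        # number of wgd's contributed by candidate subsystem j.
        results = {}
        for lam in lambdas:
            X = solve_spp([dj + lam * wj for dj, wj in zip(d, w)])
            K = sum(wj for wj, xj in zip(w, X) if xj)
            wgd_total = sum(dj for dj, xj in zip(d, X) if xj)
            # K values never produced by any lam are the duality gaps
            results.setdefault(K, (wgd_total, X))
        return results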
For n objects, A will have n(n+1)/2 - 1 columns in the consecutive component case, where we exclude the column of all 0's and the column of all 1's. Thus explicit storage of the A matrix is prohibited unless n is small. In [27], a column-generating scheme is defined and implemented. Finding a best column for LP basis entry is accomplished by constructing one in a dynamic programming routine. It is possible to do this rapidly and without dimension difficulties as a result of the unit dimensionality of the components. The process has n stages and no more than n states per stage. At each stage a 0 or 1 is selected as the next component of a column of A, and the stage returns are defined so that the total return, which is minimal, is the column's price in the LP. The details are inappropriate for present usage and the interested reader is referred to [27]. It is within the column (subsystem) generation process that the subsystem size restrictions are enforced. In the DP, of course, we exclude any decision at a stage which leads to the creation of a column weightier than the maximum, M.

As mentioned, the work may be considerably diminished if the objective function is more accommodating. In our one-dimensional systems, for example, suppose the only concern is the number of interfaces. Knowing the total number of interfaces in the entire system, we define

    d_j = the number of connections within subsystem j,

and then solve

    max Σ_j d_j X_j
    s.t. AX = 1    (11)
and subsystem size constraints. By maximizing the number of connections within subsystems one, of course, minimizes the number of between-subsystem connections, i.e. the number of interfaces. The simplification, of course, is that there is no λ in this problem. There is exactly one problem to solve, and it would be amenable to the LP/DP combination to which we alluded above. For our purposes, then, it would be best if the system elements could be represented as lying on a line and be such that the distances between them were still accurately portrayed.
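For contrast with the 2^n - 1 general columns, the consecutive-component case admits only the intervals of the numbering, which may be enumerated directly. The sketch (ours) also discards overweight columns, mirroring the exclusion performed inside the DP generator of [27]:

    def interval_columns(n, sizes, M):
        # The n(n+1)/2 - 1 consecutive-component subsets, filtered by size.
        cols = []
        for i in range(n):
            for j in range(i, n):
                if i == 0 and j == n - 1:
                    continue                  # exclude the column of all 1's
                if sum(sizes[k] for k in range(i, j + 1)) <= M:
                    cols.append(tuple(range(i, j + 1)))
        return cols

    print(len(interval_columns(5, [1] * 5, 5)))   # 14 = 5 * 6 / 2 - 1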
Consequently, we must seek a way or ways of mapping elements so that they become collinear without intolerable distortion of the true distances. For a complicated system it may be necessary to accomplish several reasonable numberings so as to feel confident regarding the results. As mentioned previously, it is possible to solve a problem exactly, which is unusual for clustering problems, but the problem solved may be an approximate one. Thus, while transforming the problem in this way is challenging and invites further attention, we mention one important class of examples which fits our assumption naturally. As a particular instance let us consider a system with a specific and commonly occurring structure: that of a tree. Let Fig. 10 serve as an example system.
Fig. 10. A system with a tree structure.
The interpretation of such a diagram is typically a little different from our previous examples. Lines in these graphs mean "has the subsystem" or "is a subsystem of" accordingly as we assume a downward or upward orientation, respectively. Thus, the entire system is represented by vertex 8, which has three subsystems, 1, 9, and 10; etc. The lowest levels of design, the end subsystems, are 1, 2, 3, 4, 5, 6, 7. So that the lowest design level always corresponds to the lowest tree level, it is conventional, and not disruptive, to extend subsystems lacking offspring as in Fig. 11.
Fig. 11. Transformation of a tree system.

The concept of distance or interface must be modified somewhat, because the lines on the graph no longer represent interfaces. A distance concept of much use in trees [9, 15] may be defined as follows:

    d_ij = (the number of the lowest tree level) - (the number of the lowest level at which i and j belonged to the same subsystem).
In our example, d_46 = 3 - 2 = 1.
It turns out that not only is d_ij a metric, it satisfies the further property that d_ij ≤ max[d_ik, d_kj] for all i, j, k, and is known as an ultrametric. From what we know of system design, the definition seems appropriate to our purposes: the greater d_ij, the more complicated the task of design if both i and j were required in one subsystem.
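Given parent pointers and levels for the tree, this ultrametric may be computed by climbing to the common ancestor. The encoding below is hypothetical, chosen only for illustration:

    def ultrametric_distance(parent, depth, lowest_level, i, j):
        # d_ij = lowest tree level minus the level of the lowest vertex
        # containing both end systems i and j (their common ancestor).
        # parent[v] is the parent of vertex v (the root maps to None);
        # depth[v] is the level of v, with the root at level 0.
        a, b = i, j
        while a != b:                 # climb until the common ancestor is met
            if depth[a] >= depth[b]:
                a = parent[a]
            else:
                b = parent[b]
        return lowest_level - depth[a]

    # If end systems 4 and 6 share a parent at level 2 and the lowest level
    # is 3 (one reading of Fig. 10), the call returns 3 - 2 = 1, as in the text.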
In Fig. 10, {4, 5, 6} seems a more reasonable design task than, say, {1, 3, 7}. The distance matrix for the sample system is given in Table 1.

Table 1.
         1  2  3  4  5  6  7
    1    0  3  3  3  3  3  3
    2       0  2  3  3  3  3
    3          0  3  3  3  3
    4             0  1  1  2
    5                0  1  2
    6                   0  2
    7                      0
Now our end systems were intentionally labeled from left to right to emphasize the collinearity we may assume in the case of trees. The matrix in Table 1 shows a property we could not otherwise preserve along a line: the distances from any point to the points to its right form a non-decreasing sequence, but unlike geometrically distinct points, the sequence is not strictly increasing. Still we may consider only subsets composed of consecutive subsystems on the lowest level. The system in Fig. 11 is sufficiently small that one could generate the full complement of columns and solve the linear program(s) according to the dictates of the objective function of interest. In Table 2 below is found the entire LP formulation for the problem requiring the Lagrange multiplier.
Table 2. The LP formulation over the consecutive-subsystem columns, priced d_j + λ w_j.
Though it would be a tedious task, the problem for fixed λ could be solved by hand. At any rate, one begins with any convenient basis, and the identity matrix is convenient. Since the problem is a minimization, the identity matrix will be an optimal basis for all λ ≥ 0. Hence one varies λ only over negative values, and the optimal basis for one value may be used as an initial basis for the next λ value. In practice, as mentioned previously, optimal solutions persist over ranges of λ values. For our small sample problem, we would simply purge from Table 2 any columns exceeding the subsystem size constraint. In the column-generating mode we would check the weights of columns being synthesized and terminate those about to become overweight.
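The computation just outlined can be mimicked with a standard LP code. The sketch below is our illustration, using scipy.optimize.linprog (which played no part in the original work): it builds the interval columns over the seven end systems, prices them with the Table 1 distances, and sweeps negative multiplier values.

    import numpy as np
    from itertools import combinations
    from scipy.optimize import linprog

    # Symmetric form of the Table 1 distances (end systems 1..7 -> 0..6).
    D = np.array([[0, 3, 3, 3, 3, 3, 3],
                  [3, 0, 2, 3, 3, 3, 3],
                  [3, 2, 0, 3, 3, 3, 3],
                  [3, 3, 3, 0, 1, 1, 2],
                  [3, 3, 3, 1, 0, 1, 2],
                  [3, 3, 3, 1, 1, 0, 2],
                  [3, 3, 3, 2, 2, 2, 0]])
    n = len(D)

    # Consecutive-subsystem columns, excluding the column of all 1's.
    cols = [tuple(range(i, j + 1)) for i in range(n) for j in range(i, n)
            if not (i == 0 and j == n - 1)]
    A = np.array([[1 if r in c else 0 for c in cols] for r in range(n)])
    d = [sum(D[a, b] for a, b in combinations(c, 2)) for c in cols]  # total wgd
    w = [len(c) * (len(c) - 1) // 2 for c in cols]                   # # of wgd's

    for lam in [-3.0, -2.0, -1.5, -1.0, -0.5, 0.0]:   # negative multipliers
        cost = [dj + lam * wj for dj, wj in zip(d, w)]
        res = linprog(cost, A_eq=A, b_eq=np.ones(n), bounds=(0, 1))
        X = np.round(res.x).astype(int)   # unimodular A: vertex solutions are 0/1
        K = sum(wj for wj, xj in zip(w, X) if xj)
        print(lam, K, [c for c, xj in zip(cols, X) if xj])

At λ = 0 the singleton columns (zero cost) are optimal, giving K = 0; increasingly negative λ values draw larger subsystems, and therefore larger K, into the solution, exactly the sweep described in the text.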
Clustering methods have found previous application in the study of information systems. The clustering of index terms [5], the creation of files [20], and the classification of a collection of documents [17] are familiar areas in this respect. What we have attempted to illustrate is that there are applications fundamental to the very design of information systems. So far as objectives, constraints, and notions of similarity are concerned, we have offered what seem to be reasonable examples, but, clearly, there is great latitude for additional work.
We must also emphasize what is probably well-known to the reader, and that is that a good many other methods exist for solving these kinds of problems. An exhaustive review is beyond the intentions of this paper, but a sample of the diversity of techniques is mentioned. Kernighan and Lin [11] discussed the partitioning of graphs with weighted edges so as to minimize the total weight of those edges cut. A heuristic was developed which found solutions possessing a locally optimal character; viz. interchanges of pairs of points in different clusters produced no better solution. In [16] Mulvey and Crowder devised a subgradient method, which was compatible with the minimax objective they chose. Rao and Umesh [19] defined inter-object distance to involve other objects in the neighborhood of the two, selected a rather unusual objective function, and derived an exact optimization method. The method seems best suited for problems of a pattern recognition sort, and no indications of computation time were provided. Lefkovitch [14] approached the problem in SPP terms, generating first a collection of columns meant to include all the most likely cluster candidates. There have been suggested methods using dynamic programming [8], branch and bound [12], and a variety of heuristics. The user, therefore, has a broad choice of methodologies to attack a particular clustering problem. What has been intended here, as well as displaying methods that seem particularly well-suited, is the establishment of a worthwhile area for application. There are available a variety of both practical and theoretical research lines that promise interesting results and knowledge relevant to the design of information systems.
REFERENCES

[1] M. R. ANDERBERG, Cluster Analysis for Applications. Academic Press, New York (1973).
[2] B. S. DURAN and P. ODELL, Cluster Analysis: A Survey. Springer-Verlag, Berlin (1974).
[3] H. EVERETT, Generalized Lagrange multiplier method for solving problems of optimal allocation of resources. Ops Res. 1963, 11, 399-417.
[4] R. GARFINKEL and G. NEMHAUSER, Integer Programming. Wiley, New York (1972).
[5] C. GOTLIEB and S. KUMAR, Semantic clustering of index terms. J. ACM 1968, 15, 493-513.
[6] S. V. HANSEN et al., ISMS: Computer-aided analysis for design of decision-support systems. Management Science 1979, 25, 1069-1081.
[7] J. A. HARTIGAN, Clustering Algorithms. Wiley, New York (1975).
[8] R. JENSEN, A dynamic programming algorithm for cluster analysis. Ops Res. 1969, 17, 1034-1057.
[9] S. C. JOHNSON, Hierarchical clustering schemes. Psychometrika 1967, 32, 241-254.
[10] R. M. KARP, Reducibility among combinatorial problems. Complexity of Computer Computations (Edited by MILLER and THATCHER). Plenum, New York (1972).
[11] B. KERNIGHAN and S. LIN, An efficient heuristic procedure for partitioning graphs. Bell System Tech. J. 1970, 49, 291-307.
[12] W. KOONTZ et al., A branch and bound clustering algorithm. IEEE Trans. Comput. 1975, 908-915.
[13] B. LANGEFORS, Theoretical Analysis of Information Systems, 3rd edn, Vol. 2. Studentlitteratur, Lund, Sweden (1970).
[14] L. LEFKOVITCH, Conditional clustering. Biometrics 1980, 36, 43-58.
[15] S. LUSZCZEWSKA-ROMAHNOWA, Classification as a kind of distance function. Natural classifications. Studia Logica 1961, XII.
[16] J. MULVEY and H. CROWDER, Cluster analysis: an application of Lagrangian relaxation. Management Sci. 1979, 25, 329-340.
[17] N. PRICE and S. SCHIMINOVICH, A clustering experiment: first step towards a computer-generated classification scheme. Inform. Storage Retrieval 1968, 4, 271-280.
[18] M. R. RAO, Cluster analysis and mathematical programming. J. Am. Statist. Assoc. 1971, 66, 622-626.
[19] V. RAO and R. UMESH, An optimization clustering algorithm. Proc. 4th Int. Joint Conf. on Pattern Recognition, pp. 296-300. Kyoto, Japan, IEEE (1979).
[20] G. SALTON and A. WONG, Generation and search of clustered files. ACM Trans. Database Systems 1978, 3.
[21] J. SENN, Information Systems in Management. Wadsworth, Belmont, California (1978).
[22] P. SNEATH and R. SOKAL, Numerical Taxonomy. Freeman, San Francisco (1973).
[23] H. SPANG, Distributed computer systems for control. General Electric Report No. 76 CRD 049, Schenectady, NY (1976).
[24] L. E. STANFEL, Experiments with a very efficient heuristic for clustering problems. Inform. Systems 1979, 4, 285-292.
[25] L. E. STANFEL, A Lagrangian treatment of certain nonlinear clustering problems. European J. Ops Res. 1981, 7, 121-132.
[26] L. E. STANFEL and J. Y. CHEN, A counter-example to a clustering heuristic: a letter to the editor. Inform. Systems 1981, 6, 3.
[27] L. E. STANFEL, An algorithm using Lagrangian relaxation and column generation for one-dimensional clustering problems. Optimization in Statistics (Edited by S. ZANAKIS). Institute of Management Science, Providence, Rhode Island.