APPLICATIONS OF CLUSTERING TO INFORMATION SYSTEM DESIGN

LARRY E. STANFEL
Department of Management and Marketing, Clarkson College, Potsdam, NY 13676, U.S.A.

(Received for publication 5 May 1982)

Abstract-Given the difficulty of designing and creating information systems of many components and interconnections, it is commonplace to find these tasks accomplished by means of a partition into subsystems. Later the distinct subsystems are made to interface with one another and an overall system is achieved. The purpose of the present paper is to point out the availability of methods for effecting the partition in optimal or approximately optimal ways. Clustering algorithms for the specific case of information systems are obtained and exemplified.

1. INTRODUCTION

It is unnecessary to review at length the relative inadequacy of analysis and design techniques when confronted with the task of analyzing or designing complicated systems. Even during the period when it was fashionable to advocate "total systems design," no one was so bold as to presume that, excepting very special cases of structural simplicity, the state of the art had proceeded to the point of making that objective feasible. This is not to derogate the nobility of the idea, for, indeed, if it is possible to identify an optimal system, then unless one takes into account all the existing interactions in the total system, how can the result be optimal? Surely an amalgamation of parts of a system is doomed to suboptimality. Being a particular sort of system, an information system is subject to the same limitations, of course, and a vast quantity has been written upon the topic of the design of information systems. It is instructive to notice how authors with widely divergent backgrounds, points of view, and reference disciplines arrive, logically, at the conclusion that information system design, if it is to succeed, must take form as an interconnected collection of subsystems. Four brief references illustrate this fact. Within a framework of qualitative postulates, Langefors [13] adduced, via a series of theorems in the same spirit, that an iterative design scheme based upon a subsystem structure was, in fact, the only way an effective complicated system could be designed. Typical texts on management information systems advocate design in terms of subsystems, though their approach tends to be less formal. Senn [21], for example, makes mention of design strategies, all of which involve subsystems and which resemble, in addition, the hierarchical strategy of Langefors. As the foundation of an automated design aid, ISMS, various partitioning schemes based upon elementary graph theory are given by Hansen et al. [6].
These allow a partitioning of system elements into subsets called levels, and within levels more closely related subsets identifiable as circuits, for example, are salient prospects for constituting subsystems. Mentioning system structure, Spang [23] wrote "A good rule of thumb indicates that a system should be divided into functional parts in such a way that there exists maximum independence with well-defined simple interfaces and a minimum of required communication." Finally, of course, we may cite empirical evidence as experienced by anyone who has ever worked on a large complicated problem, system or not. Feasibility seems to demand a decomposition into subproblems, and the conclusions of these and a great number of authors appear, after at most a little reflection, to embody a sort of natural principle, the familiar divide and conquer of humankind. Langefors, it will be noted, conjectured that evolution itself might have transpired according to such an iterative scheme as he described.


To summarize, then, assuming we had a knowledge of the system components and how these were interrelated, a valid problem is to decide how to partition these (optimally, if we assume we are capable of measuring partition goodness) into subsystems. We must comment at greater length upon the question of measurement, but two conflicting criteria make the problem interesting. First, the smaller the subsystems, the more interfaces we realize, and as the latter necessitate coordination and compatibility efforts, the system design task burgeons. Next, however, the larger the subsystems, the greater the task of completing any one of them, and consequently the less advantage gained by the decomposition; perhaps the amount of work remains prohibitive.

Since considerable research interest has been shown in the general problem of partitioning sets of objects into subsets, it seems natural to take that approach to the problems we have been mentioning, and that is the topic of the present paper. Before examining applications to problems in information systems, it is necessary to mention general concepts and approaches to the partitioning problem. The following section contains that material. Afterward, several applications to the system design problem will be mentioned briefly, and two quite different clustering algorithms will be derived.

2. CLUSTERING PROBLEMS

A clustering or partitioning problem is simply that of separating a finite collection of objects into subsets so as to satisfy some criteria. The criteria may stipulate a best partition, there being given a way to measure the goodness of any one; or they may include such properties as are synonymous with "acceptable," so that one stops searching when the criteria are deemed satisfied. Naturally, "acceptable" may be equated to optimal, so that the two instances are, after all, indistinct.

A number of books provide admirable surveys of the variations of the problem, mathematical formulations, the diversity of solution approaches, and informative bibliographies. See, for instance, [1, 2, 7, 23]. Some formulations, for example, fix the number of subsets in advance. Interest here will be restricted to locating optimal partitions in cases where the number of subsets is unconstrained. It is assumed invariably that there is defined a distance between each pair of objects or, equivalently, a proximity or similarity. Here, we prefer to think in terms of distance, and whatever makes sense in the context of the problem is a reasonable distance measure. If the objects were cities, Euclidean distance in the plane would seem appropriate; for locations within a city, rectangular distances may be the proper choice; to cluster the planets in the solar system, the difference between mean distances from the sun may be desirable; to cluster a collection of students based on their performance on a battery of ten tests, the objects are the ten-dimensional vectors of scores, and distance might be Euclidean in the ten-dimensional space.

A measure of the goodness of a particular partition, then, must be in terms of these distances, and references previously cited contain numerous examples. The diversity of possibilities, in fact, is a symptom of a complicating fact to which we must return a little later. Let us agree to denote by a within group distance (wgd) a distance between two objects within the same cluster and by a between group distance (bgd) a distance between two objects in different clusters. We shall write d_ij for the distance between objects i and j. Several objectives that have been used in clustering problems, then, are the minimization of the within group sum of squared distances; the minimization of the within group sum of squared distances to the cluster centroid; the minimization of maximum cluster diameter; and the minimization of the difference between average wgd and average bgd. There are many, many others. Since the last example function above is related to the work in the sequel, we illustrate it for the specific partition in Fig. 1.

Fig. 1. A small set partitioned.
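The last objective above, the difference between average wgd and average bgd, can be evaluated directly from pairwise distances. A minimal sketch, with hypothetical planar points and Euclidean distance (the function name and data are illustrative, not the paper's):

```python
from itertools import combinations
from math import dist

def avg_wgd_minus_avg_bgd(points, labels):
    """Average within-group distance minus average between-group
    distance, taken over all pairs of objects (lower is better)."""
    wgd, bgd = [], []
    for i, j in combinations(range(len(points)), 2):
        d_ij = dist(points[i], points[j])  # Euclidean distance
        (wgd if labels[i] == labels[j] else bgd).append(d_ij)
    return sum(wgd) / len(wgd) - sum(bgd) / len(bgd)

# Hypothetical data: two well-separated clusters in the plane.
pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
good = avg_wgd_minus_avg_bgd(pts, [0, 0, 0, 1, 1, 1])
bad = avg_wgd_minus_avg_bgd(pts, [0, 1, 0, 1, 0, 1])
assert good < bad  # the natural clustering scores lower (better)
```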


The optimization problem would be to find that partition which yields the smallest value of the objective function. Observing the existence of a choice of objective functions, one is naturally led to ask how to make that choice. If we interpret Fig. 1 to portray Euclidean distances and are told the solution exhibited is best, we should wonder what best means, because, intuitively, it does not appear that we have clustered the objects naturally. Intuition, in fact, may well suffice to provide optimal clusterings in easy problems, such as in Fig. 2, but what if the objects are as in Fig. 3, or lying in a space of dimension we cannot illustrate? The motivation for an objective function is obvious, then, but much less obvious is how to transform the intuitional sentiments into an objective function that mirrors them perfectly. This accounts for the variety of objective functions in general, but we will find the path a little easier in problems dealing with clustering in information systems.

Fig. 2. Naturally clustered data.

Fig. 3. Unstructured data.

It may be instructive to present a brief, hopefully logical, sequence by which one arrives at an objective function. In order to be designated a cluster, it seems that objects within a subset should all be relatively close to one another. Furthermore, if we can distinguish between two clusters, it seems they should be relatively far removed. These two criteria should combine to cause the f formerly referenced to be small; therefore, by minimizing f one hopes to achieve both criteria. The reader will appreciate, of course, that even if the two criteria above are exactly what should be sought, there remains latitude as to how to measure both subset homogeneity and pairwise subset heterogeneity. To reiterate, however, whenever a problem may be tied to a real, physical process, this dilemma will not be so onerous, and we can be more confident in the measurement process. We shall benefit from this fact in our work here. As a final preparatory remark, we must comment on computational difficulty. Mathematical formulations of these optimization problems tend to be integer programs, and quite often the best one can do in reasonable computation time is the achievement of an approximate solution. We shall examine a method that solves clustering problems exactly, but which works upon an approximate problem. One of our tasks is to generate surrogate problems that are not too far removed from that intended.

3. CLUSTERING APPLIED TO INFORMATION SYSTEMS

What we hope to accomplish is the partition of a set of system components into subsystems so as to optimize the creation process. Thus we assume given a list of components and a description of all interactions among them. This information could take the form of a block diagram of the system, as in Fig. 4, where the circles represent components and the arrows, interactions.

Fig. 4. Digraph representation of a system.

Another alternative is a matrix description, where rows and columns represent system components and a 1 in the (i, j)th position means that component i acts upon component j. Transforming the structure of Fig. 4 accordingly, we would obtain the corresponding 0-1 matrix.

It should be apparent at what stage of system development our interest is focused: Fig. 4 we would interpret as output of the design process. Following it must come the work of creating the system represented, i.e. building the components and coordinating their interfaces. As we know, a typical design process may evolve a number of alternative structures, and we might well wish to explore in advance aspects of the synthesis of each of them. When partitioning a conceptual system for the work to follow, there are two considerations: the manageability, or quantity of work involved in each of the subsystem tasks, and the work involved in providing interfaces between different working groups. The idea is that each cluster would represent a subsystem and that different groups build different subsystems. An option for the system in Fig. 4 would be to assign each component to a different cluster. Each subsystem task would be as small as possible in that case, but seven interfaces would result. Another alternative would be to group 1, 2, 3 in one cluster and 4, 5, 6 in a second one. Each subsystem contains three components (which may have varying sizes, of course), but there are two connections between subsystems. Whether these be considered one interface or two is a matter of choice. Let us arbitrarily decide two and in so doing define a rule for counting interfaces. The reader will have perceived that components are our objects to cluster and that our notion of distance must be related to the connections between components. But then we have a conflict if we decide, for example, that in Fig. 4, 2 is closer to 1 than 1 is to 2. Clustering problems demand symmetric distances or similarities, so we must interpret "being acted upon by" to represent the same extent of closeness as "acting upon". In short, we must interpret a representation such as Fig. 4 to be that of Fig. 5.

Fig. 5. A system represented by an undirected graph.

Certainly the notion of interface is not damaged, and the relationship of one subsystem to another seems, after all, independent of direction. The lines between pairs of objects in the graphical representation will be called edges.


The adjacency matrix for Fig. 5 is

      1 2 3 4 5 6
  1   0 1 1 0 0 0
  2   1 0 1 0 0 1
  3   1 1 0 1 0 0
  4   0 0 1 0 1 1
  5   0 0 0 1 0 0
  6   0 1 0 1 0 0


(a) A heuristic method for general systems

We may represent mathematically the finding of a partition in a number of ways. We may, for example, define

X_ij = 1, if objects i and j are assigned to the same cluster;
     = 0, otherwise.

If we decide to group i and j together and to group j and k together, then we must have decided to group i and k together. Consequently, the constraints
X_ij + X_jk + X_ik ≠ 2, all i < j < k    (1)

are equivalent to locating a partition of the elements. While not in the typical constraint form, the inequalities (1) may be converted to ≤ form by the introduction of additional (integer) variables and constraints. The clustering problem may then be written as an integer programming problem once an objective function has been chosen. Let us think particularly in terms of information systems and suppose, for example, that we wish to minimize the total number of interfaces. Using the matrix I, we define I_ij = the number of interfaces (0 or 1) between components i and j. Our problem becomes

minimize Σ Σ (1 − X_ij) I_ij, i < j    (2)

s.t. X_ij + X_jk + X_ik ≠ 2, all i < j < k
     X_ij = 0 or 1.
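Problem (2) can be checked by brute force on a small instance: enumerating the set partitions directly satisfies the transitivity requirement behind (1), and the objective counts interacting pairs split across subsystems. A sketch with a hypothetical 5-component interface matrix; since the single-subsystem partition is trivially optimal without size restrictions, a cap of three components per subsystem is imposed here:

```python
from itertools import combinations

# Hypothetical symmetric interface matrix: I[i][j] = 1 if components
# i and j interact (not the paper's example system).
I = [[0, 1, 1, 0, 0],
     [1, 0, 1, 0, 0],
     [1, 1, 0, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 0, 0, 1, 0]]

def partitions(items):
    """Generate every set partition of a list of items."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for k in range(len(part)):                  # join an existing block
            yield part[:k] + [[first] + part[k]] + part[k + 1:]
        yield [[first]] + part                      # or start a new block

def interfaces(part):
    """Objective of (2): number of interacting pairs in different blocks."""
    block = {i: b for b, blk in enumerate(part) for i in blk}
    return sum(I[i][j] for i, j in combinations(range(5), 2)
               if block[i] != block[j])

feasible = (p for p in partitions(list(range(5))) if max(map(len, p)) <= 3)
best = min(feasible, key=interfaces)
assert interfaces(best) == 1  # one interface is unavoidable here
```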

To minimize the average number of interfaces between components in different subsystems, we solve

min [Σ Σ (1 − X_ij) I_ij] / [Σ Σ (1 − X_ij)], i < j    (3)

subject to the same constraints. Presumably, designing a subsystem in which components are closely related would be a simpler task than one in which they are not. For example, the subsystem in Fig. 6 would seem easier to construct than that in Fig. 7. As a result of this consideration, we may prefer an objective function which attributes weight to the homogeneity of subsystems as well as to the interfaces between distinct

Fig. 6. A subsystem of a system.

Fig. 7. A subsystem of a system.

subsystems. A more general distance measure may then be required, because two components joined by a path of interfaces of arbitrary length must be considered to be related to some extent. A possibility is to define the distance d_ij between components i and j to be equal to the length of a shortest path between them. With reference to Fig. 5, for example, d_12 = d_13 = 1, d_14 = 2, and d_15 = 3. This measure is common in graph related problems and is a metric. It should be emphasized that the distance measures mentioned here are offered as possibilities. A user is free to define distance in whatever way seems most appropriate. For an objective, then, one might wish to

(a) maximize the total distance between subsystems: Σ Σ d_ij (1 − X_ij);

(b) minimize the total distance within subsystems: Σ Σ d_ij X_ij;

(c) minimize the average squared distance within subsystems: Σ Σ d_ij² X_ij / Σ Σ X_ij;

(d) minimize the difference between the average distance within subsystems and the average distance between subsystems:

Σ Σ d_ij X_ij / Σ Σ X_ij − Σ Σ d_ij (1 − X_ij) / Σ Σ (1 − X_ij).
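The shortest-path distance d_ij just defined can be computed by breadth-first search from each component. A sketch, using a small undirected adjacency matrix consistent with the system of Fig. 5 (the matrix entries are an assumption of this sketch):

```python
from collections import deque

def bfs_distances(adj):
    """All-pairs shortest-path lengths d_ij in an unweighted,
    undirected graph given by a 0/1 adjacency matrix."""
    n = len(adj)
    dist = [[None] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in range(n):
                if adj[u][v] and dist[s][v] is None:
                    dist[s][v] = dist[s][u] + 1
                    q.append(v)
    return dist

# Assumed adjacency matrix of the undirected system (components 1-6
# mapped to indices 0-5).
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 1],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 0],
     [0, 1, 0, 1, 0, 0]]
d = bfs_distances(A)
assert d[0][1] == d[0][2] == 1 and d[0][3] == 2 and d[0][4] == 3
```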

Finally, if we must constrain the size of subsystems, as mentioned previously, additional constraints would be added to each of the foregoing problems. Suppose component i has size m_i, and that no subsystem may be larger than M. It suffices then to add a constraint

m_j + Σ_{i≠j} m_i X_ij ≤ M    (4)

for each value of j = 1, 2, ..., n, there being n components. We intend to describe first a heuristic for solving the problem (d) with the constraints (4) added. The procedure, for arbitrary kinds of objects and without constraints on cluster size, was mentioned in [24], where it was discussed rather thoroughly, and in [26], where a counter-example is given along with further discussion. A description of the adapted version with subsystem weight constraints follows.


First, all inter-component distances must be specified or calculated. Components will be selected and assigned to subsystems one at a time. The set of unassigned components after k have been removed is denoted S_k; S_0 = the initial, given set. Denote by f the objective function.
Step 0. Set k = 0.
Step 1. Compute D_i = the sum of distances from component i in S_k to all other components in S_k. Compute D_j = max_i D_i, thereby selecting the next component j.

Step 2. Add m_j to the current size of each existing subsystem. If the sum exceeds M, omit that subsystem from consideration in Step 3.
Step 3. Tentatively assign object j to each subsystem surviving Step 2 and as the initial object in a new subsystem. For each trial compute the value of f.
Step 4. Find the minimum of the f values computed in Step 3. Let S' be the minimizing subsystem.
Step 5. Place component j in subsystem S'. Store the now increased weight of S'.
Step 6. If the new f value is lower than the previous best, store the new solution. (Otherwise, continue to build upon the current solution anyway.)
Step 7. Remove component j from S_k.
Step 8. If S_{k+1} = ∅, take the last stored solution as best. Otherwise, return to Step 1 with k = k + 1.

To elaborate slightly upon Step 6, we point out that a partial solution includes S_k − {j} as a subsystem, of course. For details of computational efficiency and theoretical aspects, the reader is referred to [24, 26], but the method works in time at most n³ for n components and seems often, though it has been known to fail in contrived circumstances, to provide optimal solutions. Given the number of components in a typical information system, we would expect a solution to be obtained very rapidly. The heuristic has "solved" problems with n = 100 in about five seconds of IBM 370 time. For an example, we consider the system displayed in Fig. 8.
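The heuristic's main loop can be sketched as follows. This is a simplification: the partial-solution bookkeeping of Steps 6 and 8 is omitted and the final partition built is returned; the chain system, unit sizes, and interface-count objective in the example are hypothetical:

```python
def greedy_cluster(dist, size, M, f):
    """Sketch of the Steps 0-8 heuristic. `f` scores a (partial)
    partition; lower is better. Returns the final partition built."""
    n = len(dist)
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Step 1: the unassigned component farthest, in total, from
        # the remaining unassigned components (ties -> lowest index).
        j = max(sorted(unassigned),
                key=lambda i: sum(dist[i][k] for k in unassigned))
        # Steps 2-3: tentatively place j in each subsystem that can
        # still hold it under the size cap M, and in a new subsystem.
        trials = [[c + [j] if t == k else list(c)
                   for t, c in enumerate(clusters)]
                  for k, c in enumerate(clusters)
                  if sum(size[i] for i in c) + size[j] <= M]
        trials.append([list(c) for c in clusters] + [[j]])
        # Steps 4-5: keep the cheapest placement.
        clusters = min(trials, key=f)
        unassigned.remove(j)  # Step 7
    return clusters

# Hypothetical chain system 0-1-2-3-4-5, unit component sizes,
# d_ij = path length, f = interfaces among assigned components.
dist = [[abs(i - j) for j in range(6)] for i in range(6)]
edges = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)}

def cut_edges(part):
    block = {i: b for b, c in enumerate(part) for i in c}
    return sum(1 for u, v in edges
               if u in block and v in block and block[u] != block[v])

parts = greedy_cluster(dist, [1] * 6, 3, cut_edges)
assert sorted(sorted(c) for c in parts) == [[0, 1, 2], [3, 4, 5]]
```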

Fig. 8. Subsystems obtained by clustering.

In Fig. 8, the uncircled integers are the identifying labels of the components, whereas the circled ones are the components' sizes. The distance between two components was taken to be the length of a shortest path connecting them. Fifty was the maximum subset size allowed. The dashed lines in Fig. 8 portray the solution discovered. Since the notion of distance and the objective function were arbitrary, the solution may or may not appeal to a given observer. It should be remembered that the importance of interface is somewhat alloyed within the objective function (d). Naturally, the solution was obtained in less than one second, there being so few objects to cluster.

(b) An exact method for specialized systems

We propose next a radically different method for clustering problems and apply it to systems having a specific structure. It is necessary first to examine clustering in a different light. Assuming there are n system elements, then any subset has an easy representation in terms of an n-component vector of 0's and 1's: there is a 1 in component i if and only if system element i belongs to the subset. With n elements there are 2^n − 1 = m such vectors (excluding the one of all zeros). We denote the jth such column vector a_j and the matrix of all these by A. We define variables X_1, ..., X_m and set

X_j = 1, if subset j is taken as a cluster;
    = 0, otherwise.

If 1 is a vector of n 1's, finding a partition of system elements is equivalent to finding a solution to

AX = 1
X_j = 0 or 1, all j = 1, ..., m.    (5)
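The construction of A can be made concrete: the columns are the indicator vectors of the 2^n − 1 nonempty subsets, and a choice of columns is a partition exactly when the chosen columns sum to the vector 1. A small sketch:

```python
from itertools import combinations

def spp_columns(n):
    """All 2^n - 1 nonempty subsets of n elements as 0/1 indicator
    columns a_j; together they form the matrix A of (5)."""
    cols = []
    for r in range(1, n + 1):
        for subset in combinations(range(n), r):
            cols.append([1 if i in subset else 0 for i in range(n)])
    return cols

A = spp_columns(3)
assert len(A) == 2 ** 3 - 1  # m = 7 columns for n = 3

# Choosing the columns for subsets {0} and {1, 2} satisfies AX = 1:
# every row (element) is covered exactly once, i.e. it is a partition.
chosen = [c for c in A if c in ([1, 0, 0], [0, 1, 1])]
assert [sum(col[i] for col in chosen) for i in range(3)] == [1, 1, 1]
```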

Thus do we arrive at the constraints of a set partitioning problem (SPP), a representation for clustering problems available at least since [18]. If we allowed elements to belong to more than one cluster we would have written

AX ≥ 1
X_j = 0 or 1, all j = 1, ..., m    (6)

and obtained the constraints of the set covering problem (SCP). Let us assume we have decided upon some intercomponent distance measure, denote those distances d_ij, and write the problem of minimizing the number of interfaces. Embedding our problem in the SPP formulation, we would calculate d_j = the sum of distances from elements in subset j to elements not in subset j. Next, if m_i is the size of component i, we assume size is additive and define M_j = the size of subsystem j = Σ_{i in subset j} m_i. If M = the maximum subsystem size permitted, we may either (i) delete all columns a_j whose total weight exceeds M, or (ii) add constraints to the SPP constraints, for example

M_j X_j ≤ M, all j = 1, ..., m.

In (i) we would obtain an exact SPP; viz.

min Σ_j d_j X_j
s.t. AX = 1
X_j = 0 or 1, all j

where j now excludes all subsets that are too large. In (ii) we obtain a SPP with additional constraints. For good reason, we intend not to pursue either direction. The reader will appreciate the magnitude of the task of generating explicitly all 2^n − 1 columns, in the first place, and will also note the absence of methods for solving SPP's efficiently [10], they belonging to the class of hard problems. For purposes of illustration, let us take as an objective the f of the example problem in section (a), preceding. Its nonlinearity presents the most complicated case and also that most demanding of computational effort. Easier cases will be mentioned later.

In the notation of section (a),

f = Σ Σ d_ij X_ij / Σ Σ X_ij − Σ Σ d_ij (1 − X_ij) / Σ Σ (1 − X_ij)    (7)

where all sums extend over the range i < j, as before. Making the assumption that Σ Σ X_ij = K, a constant, (7) becomes

f = K_1 Σ Σ d_ij X_ij − K_2    (8)

with K_1, K_2 nonnegative constants, so that fixing the number of within group distances provides an equivalent problem with a linear objective function. Equation (8), in fact, is intuitively satisfying, because it says we minimize the restricted function by locating the K smallest distances for wgd's consistent with an actual partition. Neglecting momentarily restrictions on subsystem size, let us write our linearized problem in the SPP context. It says
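The linearization can be verified numerically: among partitions sharing the same K, f of (7) is an affine function of the total within-group distance, with K_1 = 1/K + 1/(N − K) and K_2 = D/(N − K), where N is the number of object pairs and D the sum of all distances. A sketch with hypothetical distances on four objects:

```python
from itertools import combinations
from math import isclose

# Hypothetical distances d_ij on four collinear objects.
d = {(i, j): abs(i - j) for i, j in combinations(range(4), 2)}
N, D = len(d), sum(d.values())

def f(part):
    """The nonlinear objective (7): avg wgd minus avg bgd."""
    block = {i: b for b, c in enumerate(part) for i in c}
    wgd = [d[p] for p in d if block[p[0]] == block[p[1]]]
    bgd = [d[p] for p in d if block[p[0]] != block[p[1]]]
    return sum(wgd) / len(wgd) - sum(bgd) / len(bgd)

# Every partition of 4 objects into two pairs has K = 2 wgd's; check
# that f agrees with the affine form K1 * S - K2 of (8) on each.
for part in ([[0, 1], [2, 3]], [[0, 2], [1, 3]], [[0, 3], [1, 2]]):
    K = 2
    S = sum(d[p] for p in d
            if {p[0], p[1]} <= set(part[0]) or {p[0], p[1]} <= set(part[1]))
    K1, K2 = 1 / K + 1 / (N - K), D / (N - K)
    assert isclose(f(part), K1 * S - K2)
```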

min Σ_j d_j X_j
s.t. AX = 1
Σ_j w_j X_j = K
X_j = 0 or 1    (9)

where w_j = the number of wgd's provided by subsystem j, and d_j = the total of the wgd's there.

Were we to solve a problem (9) for every feasible value of K between 1 and n(n − 1)/2 − 1, we could select the best and know we had solved the given clustering problem. (Notice that K = 0 and K = n(n − 1)/2, corresponding to each item in a different subset and every item in the same subset, respectively, cause the nonlinear f to be undefined.) Next, we make a stringent assumption, restrictive for a system, from which we may obtain useful results. We assume that the system components may be numbered in such a way that the only subsystems of possible inclusion in an optimal solution would contain consecutively numbered components. A system where this appears straightforward, for example, is in Fig. 9.

Fig. 9. Collinear subsystems.

The matrix A in (9) achieves a special structure, then; it is that in each column, the 1's are consecutive. This property is sufficient to make A unimodular [4]; that is, its square submatrices have determinant only ±1 or 0, which in turn guarantees integer solutions under the simplex method. In other words, were it not for Σ w_j X_j = K in (9), the integer problem could be solved by the simplex method. Unfortunately, that constraint precludes unimodularity of the coefficient matrix in (9). But there is no actual difficulty: because of the necessity of solutions for a range of K, Lagrangian relaxation fits our needs precisely. We treat

min Σ_j (d_j + λ w_j) X_j
s.t. AX = 1
X_j = 0 or 1    (10)

varying λ until Σ_j w_j X_j(λ) assumes all values of K, until we have found an optimal clustering, or until we have found an approximation of known quality.

There is one problem to be solved per λ value. In [25] this method is described relative to arbitrary clustering problems in which the objects are collinear. Several results there merit mention at this point.

(1) The sequence of optimal function values f*(K) realized in practice is nearly unimodal; that is, the values decrease to a point then increase, with relatively small fluctuations to spoil that appearance. Thus, a search over λ rather than an exhaustive enumeration is attractive.

(2) The method will solve the clustering problem exactly, up to the existence of duality gaps; that is, values of K which no choice of λ will generate. In [3], Everett showed that simple linear approximations can provide bounds on the possible error resulting from gaps, and computational experience [27] teaches that rather wide sub-ranges of K values generally produce optimal solutions. The effect of gaps is thus attenuated, if not abrogated.

For n objects, A will have n(n + 1)/2 − 1 columns in the consecutive component case, where we exclude the column of all 0's and the column of all 1's. Thus explicit storage of the A matrix is prohibited unless n is small. In [27], a column-generating scheme is defined and implemented. Finding a best column for LP basis entry is accomplished by constructing one in a dynamic programming routine. It is possible to do this rapidly and without dimension difficulties as a result of the unit dimensionality of the components. The process has n stages and no more than n states per stage. At each stage a 0 or 1 is selected as the next component of a column of A, and the stage returns are defined so that the total return, which is minimal, is the column's price in the LP. The details are inappropriate for present usage, and the interested reader is referred to [27]. It is within the column (subsystem) generation process that the subsystem size restrictions are enforced. In the DP, of course, we exclude any decision at a stage which leads to the creation of a column weightier than the maximum, M. As mentioned, the work may be considerably diminished if the objective function is more accommodating. In our one-dimensional systems, for example, suppose the only concern is the number of interfaces. Knowing the total number of interfaces in the entire system, we define d_j = the number of connections within subsystem j, and then

max Σ_j d_j X_j
s.t. AX = 1    (11)

and subsystem size constraints. By maximizing the number of connections within subsystems one, of course, minimizes the number of between-subsystem connections, i.e. the number of interfaces. The simplification, of course, is that there is no λ in this problem. There is exactly one problem to solve, and it would be amenable to the LP/DP combination to which we alluded above. For our purposes, then, it would be best if the system elements could be represented as lying on a line and be such that the distances between them were still accurately portrayed.
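When subsystems must consist of consecutively numbered components, a problem such as (10) need not be carried over an explicit A at all: an optimal partition into intervals follows from a simple dynamic program over break points, in the spirit of the LP/DP combination cited. A sketch, with hypothetical collinear distances, interval costs priced as d_j + λw_j, and λ fixed at −2:

```python
def best_interval_partition(n, cost, M, size):
    """best[j] = minimum total cost of partitioning components 0..j-1
    into consecutive intervals, each of total size at most M."""
    INF = float("inf")
    best = [0.0] + [INF] * n
    choice = [None] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):  # the last interval is components i..j-1
            if sum(size[i:j]) <= M and best[i] + cost(i, j) < best[j]:
                best[j] = best[i] + cost(i, j)
                choice[j] = i
    parts, j = [], n            # recover the chosen intervals
    while j > 0:
        parts.append((choice[j], j - 1))
        j = choice[j]
    return best[n], parts[::-1]

# Hypothetical pricing: d_j = total wgd inside the interval,
# w_j = number of wgd's there, lambda fixed at -2.
d = [[abs(a - b) for b in range(6)] for a in range(6)]

def cost(i, j, lam=-2.0):
    pairs = [(a, b) for a in range(i, j) for b in range(a + 1, j)]
    return sum(d[a][b] for a, b in pairs) + lam * len(pairs)

total, parts = best_interval_partition(6, cost, 3, [1] * 6)
assert parts == [(0, 2), (3, 5)]  # two triples, within the size cap
```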


Consequently, we must seek a way or ways of mapping elements so that they become collinear without intolerable distortion of the true distances. For a complicated system it may be necessary to accomplish several reasonable numberings so as to feel confident regarding the results. As mentioned previously, it is possible to solve a problem exactly, which is unusual for clustering problems, but the problem solved may be an approximate one. Thus, while transforming the problem in this way is challenging and invites further attention, we mention one important class of examples which fits our assumption naturally. As a particular instance let us consider a system with a specific and commonly occurring structure-that of a tree. Let Fig. 10 serve as an example system.

Fig. 10. A system with a tree structure.

The interpretation of such a diagram is typically a little different from our previous examples. Lines in these graphs mean "has the subsystem" or "is a subsystem of" accordingly as we assume a downward or upward orientation, respectively. Thus, the entire system is represented by vertex 8, which has three subsystems, 1, 9, and 10; etc. The lowest levels of design, the end subsystems, are 1, 2, 3, 4, 5, 6. So that the lowest design level always corresponds to the lowest tree level, it is conventional, and not disruptive, to extend subsystems lacking offspring as in Fig. 11.

Fig. 11. Transformation of a tree system.

The concept of distance or interface must be modified somewhat, because the lines on the graph no longer represent interfaces. A distance concept of much use in trees [9, 15] may be defined as follows:

d_ij = the number of the lowest tree level minus the number of the lowest level at which i and j belonged to the same subsystem.

In our example, d_46 = 3 − 2 = 1.

It turns out that not only is d_ij a metric, it satisfies the further property that d_ij ≤ max [d_ik, d_kj] for all i, j, k, and is known as an ultrametric. From what we know of system design, the definition seems appropriate to our purposes: the greater d_ij, the more complicated the task of design if both i and j were required in one subsystem.
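The ultrametric d_ij can be computed from parent links by locating the deepest subsystem common to i and j. A sketch for a hypothetical tree consistent with the sample system; the internal node numbered 11 below, holding end subsystems 4, 5, 6 at level 2, is an assumption of this sketch, since the figure is not reproduced:

```python
from itertools import permutations

# Hypothetical parent links: root 8 at level 0; 1, 9, 10 at level 1;
# 2, 3 under 9; 7 and the assumed node 11 under 10; 4, 5, 6 under 11.
parent = {1: 8, 9: 8, 10: 8, 2: 9, 3: 9, 7: 10, 11: 10,
          4: 11, 5: 11, 6: 11}
LOWEST = 3  # lowest tree level after the extension of Fig. 11

def level(v):
    l = 0
    while v in parent:
        v, l = parent[v], l + 1
    return l

def tree_dist(i, j):
    """d_ij = LOWEST minus the level of the deepest subsystem
    containing both i and j (their lowest common ancestor)."""
    ancestors = {i}
    while i in parent:
        i = parent[i]
        ancestors.add(i)
    while j not in ancestors:
        j = parent[j]
    return LOWEST - level(j)

assert tree_dist(4, 6) == 1 and tree_dist(2, 3) == 2 and tree_dist(1, 5) == 3
# The ultrametric inequality holds for every triple of end subsystems.
leaves = [1, 2, 3, 4, 5, 6, 7]
assert all(tree_dist(i, j) <= max(tree_dist(i, k), tree_dist(k, j))
           for i, j, k in permutations(leaves, 3))
```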

In Fig. 10, {4, 5, 6} seems a more reasonable design task than, say, {1, 3, 7}. The distance matrix for the sample system is given in Table 1.

Table 1.

      1  2  3  4  5  6  7
  1   0  3  3  3  3  3  3
  2      0  2  3  3  3  3
  3         0  3  3  3  3
  4            0  1  1  2
  5               0  1  2
  6                  0  2
  7                     0

Now our end systems were intentionally labeled from left to right to emphasize the collinearity we may assume in the case of trees. The matrix in Table 1 shows a property we could not preserve along a line: the distances from any point to the points on its right form a non-decreasing sequence but, unlike the distances among geometrically distinct points on a line, the sequence is not strictly increasing. Still, we may consider only subsets composed of consecutive subsystems on the lowest level. The system in Fig. 11 is sufficiently small that one could generate the full complement of columns and solve the linear program(s) according to the dictates of the objective function of interest. In Table 2 below is found the entire LP formulation for the problem requiring the Lagrange multiplier.

Table 2. The LP columns, with objective coefficients d_j + λw_j.

Though it would be a tedious task, the problem for fixed λ could be solved by hand. At any rate, one begins with any convenient basis, and the identity matrix is convenient. Since the problem is a minimization, the identity matrix will be an optimal basis for all λ ≥ 0. Hence one varies λ only over negative values, and the optimal basis for one value may be used as an initial basis for the next λ value. In practice, as mentioned previously, optimal solutions persist over ranges of λ value. For our small sample problem, we would simply purge from Table 2 any columns exceeding the subsystem size constraint. In the column-generating mode we would check the weights of columns being synthesized and terminate those about to become overweight.

Clustering methods have found previous application in the study of information systems. The clustering of index terms [5], the creation of files [20], and the classification of a collection of documents [17] are familiar areas in this respect. What we have attempted to illustrate is that there are applications fundamental to the very design of information systems. So far as objectives, constraints, and notions of similarity are concerned, we have offered what seem to be reasonable examples, but, clearly, there is great latitude for additional work.


We must also emphasize what is probably well known to the reader: good methods exist for solving these kinds of problems. An exhaustive review is beyond the intentions of this paper, but a sample of the diversity of techniques is mentioned. Kernighan and Lin[11] discussed the partitioning of graphs with weighted edges so as to minimize the total weight of those edges cut. A heuristic was developed which found solutions possessing a locally optimal character; viz. interchanges of pairs of points in different clusters produced no better solution. In [16] Mulvey and Crowder devised a subgradient method, which was compatible with the minimax objective they chose. Rao and Umesh[19] defined inter-object distance to involve other objects in the neighborhood of the two, selected a rather unusual objective function, and derived an exact optimization method. The method seems best suited for problems of a pattern recognition sort, and no indications of computation time were provided. Lefkovitch[14] approached the problem in SPP terms, generating first a collection of columns meant to include all the most likely cluster candidates. There have been suggested methods using dynamic programming[8], branch and bound[12], and many other heuristics. The user, therefore, has a broad choice of methodologies with which to attack a particular clustering problem. What has been intended here, as well as displaying methods that seem particularly well suited, is the establishment of a worthwhile area for application. There are available a variety of both practical and theoretical research lines that promise interesting results and knowledge relevant to the design of information systems.
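To make the flavor of such locally optimal interchange heuristics concrete, the following is a minimal sketch (not taken from any of the cited papers) of a swap-based local search: it repeatedly interchanges pairs of points lying in different clusters whenever the interchange lowers total intra-cluster distance, and stops at exactly the kind of local optimum described above. Kernighan and Lin's actual procedure is more elaborate, staging whole sequences of tentative exchanges, so this illustrates only the termination criterion, not their algorithm.

```python
def swap_improve(dist, clusters):
    """Local search: interchange points between clusters while any
    pairwise swap lowers the total intra-cluster distance; stop at a
    local optimum with respect to single interchanges."""
    def cost(cl):
        return sum(dist[a][b] for grp in cl
                   for i, a in enumerate(grp) for b in grp[i + 1:])
    improved = True
    while improved:
        improved = False
        for gi in range(len(clusters)):
            for gj in range(gi + 1, len(clusters)):
                for x in list(clusters[gi]):
                    for y in list(clusters[gj]):
                        base = cost(clusters)
                        # tentatively interchange x and y
                        clusters[gi].remove(x); clusters[gj].remove(y)
                        clusters[gi].append(y); clusters[gj].append(x)
                        if cost(clusters) < base:
                            improved = True   # keep the improving swap
                        else:                 # undo the swap
                            clusters[gi].remove(y); clusters[gj].remove(x)
                            clusters[gi].append(x); clusters[gj].append(y)
    return clusters
```

Starting from a deliberately poor partition of four collinear points, the search exchanges the misplaced pair and then halts, since no further interchange improves the objective.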

REFERENCES
[1] M. R. ANDERBERG, Cluster Analysis for Applications. Academic Press, New York (1973).
[2] B. S. DURAN and P. ODELL, Cluster Analysis: A Survey. Springer-Verlag, Berlin (1974).
[3] H. EVERETT, Generalized Lagrange multiplier method for solving problems of optimal allocation of resources. Ops Res. 1963, 11, 399-417.
[4] R. GARFINKEL and G. NEMHAUSER, Integer Programming. Wiley, New York (1972).
[5] C. GOTLIEB and S. KUMAR, Semantic clustering of index terms. J. ACM 1968, 15, 493-513.
[6] S. V. HANSEN et al., ISMS: Computer-aided analysis for design of decision-support systems. Management Science 1979, 25, 1069-1081.
[7] J. A. HARTIGAN, Clustering Algorithms. Wiley, New York (1975).
[8] R. JENSEN, A dynamic programming algorithm for cluster analysis. Ops Res. 1969, 12, 1034-1057.
[9] S. C. JOHNSON, Hierarchical clustering schemes. Psychometrika 1967, 32, 241-254.
[10] R. M. KARP, Reducibility among combinatorial problems. In Complexity of Computer Computations (Edited by MILLER and THATCHER). Plenum, New York (1972).
[11] B. KERNIGHAN and S. LIN, An efficient heuristic procedure for partitioning graphs. Bell System Tech. J. 1970, 49, 291-307.
[12] W. KOONTZ et al., A branch and bound clustering algorithm. IEEE Trans. Comput. 1975, 908-915.
[13] B. LANGEFORS, Theoretical Analysis of Information Systems, 3rd edn, Vol. 2. Studentlitteratur, Lund, Sweden (1970).
[14] L. LEFKOVITCH, Conditional clustering. Biometrics 1980, 36, 43-58.
[15] S. LUSZCZEWSKA-ROMAHNOWA, Classification as a kind of distance function. Natural classifications. Studia Logica 1961, XII.
[16] J. MULVEY and H. CROWDER, Cluster analysis: an application of Lagrangean relaxation. Management Sci. 1979, 25, 329-340.
[17] N. PRICE and S. SCHIMINOVICH, A clustering experiment: first step towards a computer-generated classification scheme. Inform. Storage Retrieval 1968, 4, 271-280.
[18] M. RAO, Cluster analysis and mathematical programming. J. Am. Statist. Ass. 1971, 66, 622-626.
[19] V. RAO and R. UMESH, An optimization clustering algorithm. Proc. 4th Int. Joint Conf. on Pattern Recognition, pp. 296-300. IEEE, Kyoto, Japan (1979).
[20] G. SALTON and A. WONG, Generation and search of clustered files. ACM Trans. Database Systems 1978, 3.
[21] J. SENN, Information Systems in Management. Wadsworth, Belmont, California (1978).
[22] P. SNEATH and R. SOKAL, Numerical Taxonomy. Freeman, San Francisco (1973).


[23] H. SPANG, Distributed computer systems for control. General Electric Report No. 76 CRD 049, Schenectady, N.Y. (1976).
[24] L. E. STANFEL, Experiments with a very efficient heuristic for clustering problems. Inform. Systems 1979, 4, 285-292.
[25] L. E. STANFEL, A Lagrangian treatment of certain nonlinear clustering problems. European J. Ops Res. 1981, 7, 121-132.
[26] L. E. STANFEL and J. Y. CHEN, A counter-example to a clustering heuristic: a letter to the editor. Inform. Systems 1981, 6, 3.
[27] L. E. STANFEL, An algorithm using Lagrangian relaxation and column generation for one-dimensional clustering problems. In Optimization in Statistics (Edited by S. ZANAKIS). Institute of Management Science, Providence, Rhode Island.