Pattern Recognition Letters 12 (1991) 511-517, North-Holland, September 1991
A knowledge-based clustering algorithm

V. Sridhar and M. Narasimha Murty
Department of Computer Science and Automation, Indian Institute of Science, Bangalore-560012, India

Received 29 November 1990; revised 5 April 1991
Abstract. Sridhar, V. and M. Narasimha Murty, A knowledge-based clustering algorithm, Pattern Recognition Letters 12 (1991) 511-517.

We describe a clustering technique that can exploit a large body of knowledge. An algorithm to cluster input objects, using the knowledge available, is presented. This algorithm is order-independent and can be naturally extended to cluster objects in an incremental way. The usefulness of the knowledge-based clustering algorithm is studied in the context of a database.
Keywords. Order-independence, incremental clustering algorithm, knowledge-based clustering, knowledge base.
1. Introduction

Classification is a process of assigning labels to objects. On the other hand, conventionally, clustering is a process of grouping unlabelled objects. In this paper, we propose a new approach, called the deductive clustering approach, that is a combination of both classification and clustering. More specifically, objects are labelled based on their features during the first phase. These labelled objects are then grouped to generate clusters along with their descriptions. The clusters so generated can be called human-oriented because the clusters and their descriptions are generated with the help of a knowledge base that is also shared by humans. Practical applications of clustering involve further processing of the clusters and descriptions generated. This processing may include: (1) answering queries about the data; (2) answering queries about the cluster structures; and (3) acquisition of additional data.
Several clustering algorithms have been proposed [1-3]. These algorithms can be broadly classified into the following two categories. (1) Those in which knowledge is implicitly represented. In this case, knowledge is usually represented in the form of features of the input objects, a similarity measure, and the objective function to be optimized. Examples of such algorithms include the Single-Linkage algorithm, and the K-means algorithm [1]. (2) Those in which knowledge is explicitly represented. In this case, the knowledge that is explicitly represented guides the classification process. The Conjunctive Conceptual Clustering algorithm is an example where conceptual and contextual knowledge guides the classification process [4, 5]. The motivation for employing knowledge while clustering comes from the philosophical observation made by Watanabe in [6]. He points out that unless some extralogical information is provided, an ugly duckling and a swan can be put into the same class. The knowledge employed in clustering
can be varied, domain specific, and contextual. The clustering algorithm can be flexible enough to choose some subset of the available knowledge depending on the context. We call a clustering technique that employs a large amount of knowledge a Deductive Clustering Approach because, here, clustering can be viewed as a logical consequence of the knowledge. In [7], Hunt states that an object may be assigned to any number of classes, but no object may sometimes be assigned to one class and sometimes not be assigned to the same class, thus supporting the deductive clustering approach. We observe that these different classes are not totally unrelated, in the sense that they are chained by certain relationships such as the ancestral relationship. Once the concepts are attached as class labels to the input objects, it is possible to know the unobserved properties of the input objects. The proposed approach has a natural solution for such inductive question-answering [8] in the sense that it can employ the same knowledge which it used for clustering.

In this paper, we propose an algorithm for deductive clustering. This algorithm is based on the following observation about the way we, humans, identify categories. For example, when we visit Antarctica we see mostly penguins, whereas we see mostly birds at a bird sanctuary. One plausible reason for such a categorization may be given as follows. We may have knowledge of different kinds of birds, say parrots, sparrows, penguins, and crows. When we are at Antarctica, we see mostly penguins and hence it may be useful to characterize the class as Penguins. On the other hand, we have representatives for most of the kinds of birds in a bird sanctuary, and hence it may be useful to characterize that class at one level higher, as Birds. The efficacy of the proposed algorithm is investigated in the context of databases. More specifically, we use the resulting cluster structures to answer queries about the clusters and the knowledge.
2. An algorithm for deductive clustering

The algorithm proposed in this section has a sound axiomatic basis [9]. The knowledge employed in clustering is represented in the form of an inheritance network [10]. Each node in the network possesses two kinds of properties: (1) natural properties, and (2) default properties. For example, a node Bird (standing for the category bird) possesses Animal as its natural property and canfly as its default property. Observe that any bird is an animal, whereas not all birds can fly. Given (i) a set of input objects in the form of ordered pairs (O1, P1), …, (On, Pn), where Pi is the set of describing properties of Oi and Oi is the object label, and (ii) the ordered set 𝒩 of all the nodes of the inheritance network, in the form (N1, NP1, DP1), …, (Nm, NPm, DPm), where NPi = NP(Ni) is the set of natural properties of Ni and DPi = DP(Ni) is the set of default properties of Ni, the problem is one of assigning to each object a set of class labels from the set 𝒩 ∪ {NOISY}. For the sake of brevity, we also refer to 𝒪 as the set {O1, …, On} and to 𝒩 as the set {N1, …, Nm}.

The algorithm can be briefly described as follows. Each object is initially associated with the best possible nodes in the network based on NODEPREFERENCECOUNT and NODEDIFFIDENCECOUNT, which are computed for every node in the network. This computation involves instantiating two weights W1 and W2 that correspond to the natural and default properties respectively. Natural properties play a more important role in the classification process than the default properties. This is because the default properties are defeasible. Based on this observation, it is intuitively appealing to have a larger value for W1 than for W2. In the algorithm presented below, we assume a value of 1.0 for W1 and 0.5 for W2. The best nodes obtained for each object correspond to the initial clusters of the object. We assign the class label NOISY to an object if the NODEPREFERENCECOUNT of its best node is less than or equal to its NODEDIFFIDENCECOUNT. This is because the available knowledge, which can never be complete, provides no basis for classifying such objects in any other way. The initial clusters, thus formed, are merged level by level in a hierarchical fashion to form the final clusters. The merging is controlled by SIBLINGCOUNT, which corresponds to the number of children
of the node at the immediate higher level, and a prespecified threshold value τ. In the proposed algorithm, we select a value of 0.5 for τ. The final clusters formed correspond to a covering of the set 𝒪. The motivation for the parameters W1, W2, and τ is as follows. Given a collection of about a hundred penguins, we would intuitively like to label it as a collection of PENGUINS. However, if there are different kinds of birds in the collection, which include penguins, ostriches, canaries, etc., then we would label the collection with a more general label BIRD. This operation of labelling a collection of objects using a more general label (BIRD) is based on a threshold on the number of subcategories (penguins, ostriches, etc.) that represent the collection. The threshold parameter τ captures this property. Let us consider a library L1 of books with different distributions corresponding to various categories. When one is interested in books in a particular area, we would recommend L1 when it has a reasonably large collection of books in the specified area. More specifically, let us consider a collection of Zoology books in a library. Let W be the total number of published books on Zoology for the node characterizing Zoology. However, any library may have only a subcollection of these books. We would be motivated to search the library for a Zoology book provided the number of books in the library forms a reasonable fraction of W. W1 is this reasonable fraction. Similarly, the parameter W2 is derived.
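To make the roles of W1 and W2 concrete before stating the algorithm, the following is a minimal Python sketch of the node-scoring computation used in Step 1 below. The encoding is our own illustrative assumption, not the authors' implementation: properties are plain strings, and a hypothetical negate helper mirrors the complement operator ¬ defined below.

```python
# Illustrative sketch of the scoring in Step 1 of Algorithm DCA;
# property strings and the negation convention are assumptions.

W1, W2 = 1.0, 0.5  # weights for natural and default properties

def negate(prop):
    # Negation of a property, mirroring the complement operator: the
    # universe contains every property together with its negation.
    return prop[4:] if prop.startswith("not ") else "not " + prop

def merit(P, NP, DP):
    # MERIT = NODEPREFERENCECOUNT - NODEDIFFIDENCECOUNT
    preference = W1 * len(P & NP) + W2 * len(P & DP)
    diffidence = (W1 * len(P & {negate(p) for p in NP})
                  + W2 * len(P & {negate(p) for p in DP}))
    return preference - diffidence

# A Bird node: natural property "animal", default property "can fly".
NP_bird, DP_bird = {"animal"}, {"can fly"}
print(merit({"animal", "can fly"}, NP_bird, DP_bird))      # 1.5
print(merit({"animal", "not can fly"}, NP_bird, DP_bird))  # 0.5
```

A penguin-like object (second call) still scores positively against Bird through the natural property, but its violated default lowers the merit; this is why the defeasible default properties carry the smaller weight.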
Algorithm DCA

We make use of three parameters W1, W2, and τ in this algorithm. W1 and W2 correspond to the importance given to the natural and default properties of objects respectively. W1 normally assumes a value of 1, whereas W2 might vary from 0 to 1. For example, in a domain where defaults are not to be used, W2 assumes a value of 0. In the algorithm given below, we have chosen typical values for W1 and W2. The value of the parameter τ can vary from 0 to 1. This parameter can be thought of as a generalization/specialization parameter. Larger values of τ will lead to specialized clusters and their respective descriptions.
We also make use of the operator ¬, defined as follows: ¬A is the complement of the set A with respect to the universe having as its members all elements of A and their negations.
Step 0. W1 = 1.0; W2 = 0.5; τ = 0.5.
Step 1. For every input object (Oi, Pi), do the following in parallel.
Step 1a. For every node (Nj, NPj, DPj) of the inheritance network do the following in parallel.
Step 1ai. Compute NODEPREFERENCECOUNTij of the object (Oi, Pi) with respect to the node (Nj, NPj, DPj) as follows:

   NODEPREFERENCECOUNTij = W1 · |Pi ∩ NPj| + W2 · |Pi ∩ DPj|,
   NODEDIFFIDENCECOUNTij = W1 · |Pi ∩ ¬NPj| + W2 · |Pi ∩ ¬DPj|.

Step 1aii. Compute MERITij = NODEPREFERENCECOUNTij − NODEDIFFIDENCECOUNTij.
Step 1b. MERITi = 0; BESTNODEi = NOISY.
For every node (Nk, NPk, DPk) in the inheritance network do the following:
Step 1bi. If MERITik > MERITi then BESTNODEi = {Nk}; MERITi = MERITik; else if MERITik = MERITi then BESTNODEi = BESTNODEi ∪ {Nk}.
Step 1c. Set up a dlink (descendant link) from each element of BESTNODEi to the object Oi.
/* Remark. Each element of BESTNODEi corresponds to one INITIALCLUSTER of Oi, and the noisy properties of Oi are members of the set Pi − ∪n∈BESTNODEi (NPn ∪ DPn). */
Step 2. INITIALCLUSTERS is the set of BESTNODEs; FINALCLUSTERS = ∅; TEMPCLUSTERS = ∅.
while true do
   [CURRENTCHILD is the first element of INITIALCLUSTERS
    PARENTS = {p | p is an ISA-parent of CURRENTCHILD}
    while true do
       [if PARENTS is empty then
           [INITIALCLUSTERS = INITIALCLUSTERS − {CURRENTCHILD}
            TEMPCLUSTERS = TEMPCLUSTERS ∪ {CURRENTCHILD}
            exit-loop]
        CURRENTPARENT is the first element of PARENTS
        PARENTS = PARENTS − {CURRENTPARENT}
        SIBLINGS = {C | C is a child of CURRENTPARENT}
        if |SIBLINGS ∩ INITIALCLUSTERS| > |SIBLINGS| · τ then
           [TEMPCLUSTERS = TEMPCLUSTERS ∪ {CURRENTPARENT}
            set up dlinks from CURRENTPARENT to INITIALCLUSTERS ∩ SIBLINGS
            INITIALCLUSTERS = INITIALCLUSTERS − SIBLINGS
            exit-loop]]
    if INITIALCLUSTERS is empty then
       [if TEMPCLUSTERS = FINALCLUSTERS then exit-loop
        else [FINALCLUSTERS = TEMPCLUSTERS
              INITIALCLUSTERS = TEMPCLUSTERS
              TEMPCLUSTERS = ∅]]
CLUSTERS = ∅
For every N ∈ FINALCLUSTERS do
   [O = {Oi | N is linked to Oi via dlinks}
    CLUSTERS = CLUSTERS ∪ {(N, O)}]
/* Remark. One of the N's is NOISY. NOISY is an isolated node in the network. */
Step 3. end.

It can be observed that the maximum number of clusters is m + 1. We define below two properties associated with clustering algorithms and show that the proposed algorithm satisfies both these properties.

Definition. A clustering algorithm is order-independent if for every order of the input objects it generates the same covering of the input objects.

Definition. A covering of a set is a grouping of elements into subsets of the set such that each element is contained in at least one subset. Note that a covering is a multiset.

Definition. Partition Π1 is said to be a refinement of partition Π2 iff each block of Π1 is contained in some block of Π2.

Definition. A clustering algorithm is incremental iff Πn = {𝒩1, …, 𝒩k} is a refinement of Πn+1, where Πn is the partition, generated by the algorithm, of a set 𝒪 = {O1, …, On} and Πn+1 is the partition, generated by the same algorithm, of the set 𝒪 ∪ {On+1}. Here 𝒩i stands for the set of objects Oj such that FINALCLUSTER(Oj, Ni).

Observe that the partition Πn is not a partition of objects in the conventional sense. However, it is a partition of the set of nodes in the conventional sense: each 𝒩i is a block for i = 1, …, k, but 𝒩i ∩ 𝒩j need not be empty for i ≠ j.

Theorem 1. The algorithm DCA is order-independent.

Theorem 2. The algorithm DCA is incremental.

The proofs of the above theorems are available in [9].
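The sibling test that drives the merging in Step 2 can be illustrated with a small sketch. The dictionary encoding of the ISA links and the node names below are our own assumptions; dlinks, the NOISY node, and the level-by-level iteration to a fixpoint are omitted, so this shows a single merging pass only.

```python
# Simplified, illustrative sketch of one merging pass of Step 2 of DCA.
# Tree encoding and node names are assumed; bookkeeping (dlinks, NOISY)
# is omitted.

TAU = 0.5  # threshold τ

# Hypothetical ISA links: child -> parent.
ISA = {"penguin": "bird", "ostrich": "bird", "canary": "bird",
       "bird": "animal", "fish": "animal"}

def children(parent):
    return {c for c, p in ISA.items() if p == parent}

def merge_once(initial_clusters):
    # Replace sibling clusters by their common parent whenever more than
    # |SIBLINGS| * TAU of the parent's children occur among the clusters.
    merged, pending = set(), set(initial_clusters)
    while pending:
        child = pending.pop()
        parent = ISA.get(child)
        if parent is not None:
            siblings = children(parent)
            if len(siblings & initial_clusters) > len(siblings) * TAU:
                merged.add(parent)
                pending -= siblings
                continue
        merged.add(child)  # no parent qualified: keep the cluster as is
    return merged

print(merge_once({"penguin", "ostrich", "canary"}))  # {'bird'}
print(merge_once({"penguin"}))                       # {'penguin'}
```

The two calls reproduce the behaviour motivated in the Introduction: a collection covering most kinds of birds is generalized to BIRD, while a collection of penguins alone stays labelled PENGUIN.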
3. An application to query processing
Knowledge-based clustering is concerned with the generation of human-oriented clusters of objects and their descriptions (see, for example, [11]). There are several application areas of knowledge-based clustering which have not been explored so far. Further, work has not been reported on the processing of the descriptions of the clusters to answer queries meaningfully. A typical query processing problem can be explained as follows: given a collection of documents and a query described in a natural language, the aim is to obtain an answer to the query in the natural language. The collection of documents is organized in a suitable way so as to answer queries effectively and efficiently. A flexible query processing system should permit different kinds of queries to be answered using such an organization. Various kinds of queries include:
(1) Queries that initiate search for individual documents.
(2) Queries that look for a class of documents.
(3) Queries that involve comparison of two databases of documents.
(4) Queries that involve identifying subject areas that are adequately represented by the given collection of documents.
(5) Queries that determine deficient subject areas.
(6) Queries that list various subject areas in the order of their importance.
(7) Queries that determine completion of a database.

It is to be noticed that, to a large extent, queries of the kinds listed in (1), (2), and (3) above are data-driven, while the rest are knowledge-driven. Conventional approaches to clustering do not make use of explicit knowledge and hence cannot be used to handle queries of the types (4) through (7), which are knowledge-driven. Even though it is possible to compare two databases using the conventional approaches, work in this direction has not been reported. We employ the deductive clustering approach to organize a collection of documents so as to answer all the kinds of queries listed above.

A query is an ordered pair (keyword, parameter). The various keywords and the related parameters include the following:
(1) Retrieve. The subject descriptors of interest, connected by the logical connective conjunction, form the parameter. The purpose is to retrieve the documents matching the conjunctive parameter.
(2) Recommend. A subject descriptor is the parameter. The purpose is to recommend based on the associated W1, W2, and τ. In other words, based on the area of interest to the user, this query helps in deciding whether it is worth paying a visit to the library.
(3) Interested-in. A subject descriptor is the parameter. The purpose is to retrieve a document based on W1 and W2. If no such documents are found, then the relevant subareas are displayed in order to guide the user to choose a subarea of his interest.
(4) Count. Count the number of documents
matching the conjunctive parameter, that is, a conjunction of subject descriptors.
(5) Complete. Verify whether the subtree rooted at the given subject descriptor is complete or not, based on W1, W2, and τ of all the nodes in the subtree. This is achieved by a comparison with the knowledge tree.
(6) Deficient. Compare the subtree of the knowledge base and the subtree of the database to identify missing CR categories.
(7) Compare. It takes a subject descriptor, library1, and library2 as parameters. The purpose is to compare the subtree of the library1 database rooted at the subject descriptor with the subtree of library2 rooted at the same subject descriptor. It is assumed that the knowledge trees of library1 and library2 are identical.
A small illustrative sketch of how a few of these keywords might be served is given below, after the experimental summary.

The collection of 174 documents [12, 13, 14] is clustered using the deductive clustering algorithm with the threshold values W1 = 1.0, W2 = 1.0, and τ = 0.5. This algorithm makes use of the knowledge in the form of the full CR classification scheme [15]. Several representative queries and the answers that are generated with the help of the data tree and the knowledge tree are presented in the Appendix. For more details regarding the input data, domain knowledge, and queries, see [9].
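As an indication of how such (keyword, parameter) queries can be dispatched over the clustered collection, here is a small hypothetical sketch of the retrieve, count, and recommend keywords. The toy index, the published-document totals, and the recommendation test are our own illustrative assumptions, not the system of [9]; only the document id 88020048 is taken from the Appendix.

```python
# Hypothetical sketch of (keyword, parameter) query dispatch; the index
# and counts are illustrative, not the data of [9].

INDEX = {  # subject descriptor -> set of document ids
    "data communication": {"88020048"},
    "vlsi": {"doc-101", "doc-102", "doc-103"},  # made-up ids
}

# Assumed totals of published documents per descriptor (the W of the
# library example in Section 2), used by the recommend test.
PUBLISHED = {"data communication": 10, "vlsi": 4}
W1 = 0.5  # illustrative "reasonable fraction" threshold

def retrieve(descriptors):
    # Documents matching the conjunction of subject descriptors.
    sets = [INDEX.get(d, set()) for d in descriptors]
    return set.intersection(*sets) if sets else set()

def count(descriptors):
    return len(retrieve(descriptors))

def recommend(descriptor):
    # Worth a visit iff the holdings form a reasonable fraction of the
    # published documents in the area.
    held = len(INDEX.get(descriptor, set()))
    return held >= W1 * PUBLISHED.get(descriptor, float("inf"))

print(retrieve(["data communication"]))  # {'88020048'}
print(recommend("vlsi"))                 # True  (3 of 4 published)
print(recommend("data communication"))   # False (1 of 10 published)
```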
4. Conclusion

In this paper, we have proposed a clustering algorithm that uncovers classes in the set of input objects by exploiting the knowledge available to it. There are two major steps in the proposed algorithm. The first step can be executed in parallel, as it generates the initial clusters of the objects in an order-independent fashion. The second step generates a covering of the set of input objects. An important property of the proposed algorithm is that it can be naturally extended to cluster objects in an incremental way. The main features of such a knowledge-based clustering algorithm are:
(1) It generates human-oriented clusters and their descriptions, as the whole process is guided by the domain knowledge.
(2) The algorithm is order-independent.
(3) The algorithm can be used to cluster objects in an incremental way.
(4) The resulting cluster structure is a substructure of the knowledge structure. As a consequence, several interesting queries can be answered.
(5) The proposed approach can be thought of as a generalization of the semantic model approaches for database organizations.
Acknowledgements
We thank the anonymous referee for his/her comments that imparted more clarity to the paper. We are grateful to Prof. G. Krishna for helpful suggestions. We would like to thank the Defence Research Development Organisation, India, for the project grant DRDO/CSA/GK/13 that supported the research reported in this paper.
Appendix: Sample queries and responses

For more details, see [9].
Input data: review documents from [12-14].
Source of knowledge: full CR classification scheme [15].

(retrieve "data communication")
Retrieved document: 88020048.

(retrieve "optimization & logic design")
No relevant document available.

(recommend "software")
Required area is poorly represented. Better try some other source.

(recommend "vlsi")
Sufficient documents are available.

(interested-in "input/output devices")
The following areas are represented; choose one based on your interest: 1. data terminals and printers, 2. image display.

(interested-in "software")
No relevant documents are available in the area of your interest.

(complete "integrated circuits")
Subareas in the knowledge structure: 24; subareas in the data: 3. integrated circuits (B7) is 13% complete.
(deficient "software")
software (D) is deficient in the area general (D0), deficient in the area programming techniques (D1), deficient in the area software engineering (D2), deficient in the area programming languages (D3), deficient in the area operating systems (D4), deficient in the area miscellaneous (DM). software (K23) is not deficient.

(deficient "vlsi")
vlsi (B718) is not deficient.

For the following query, COMPARE, we need two libraries. Library1 has the documents data001-data100; library2 has the documents data101-data174.

(compare "software", library1, library2)
Comparison of software (D):

Level # in knowledge tree   # of nodes in knowledge tree   # of nodes in library1   # of nodes in library2
1                           1                              0                        0
2                           6                              0                        0
3                           35                             7                        3
4                           127                            30                       10
References

[1] Anderberg, M.R. (1973). Cluster Analysis for Applications. Academic Press, London.
[2] Duda, R.O. and P.E. Hart (1973). Pattern Classification and Scene Analysis. Wiley, New York.
[3] Jain, A.K. and R. Dubes (1988). Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ.
[4] Michalski, R.S. (1980). Knowledge acquisition through conceptual clustering: a theoretical framework and algorithm for partitioning data into conjunctive concepts. Intl. J. Policy Analysis and Information Systems 4(3), 219-243.
[5] Michalski, R.S. (1983). A theory and methodology of inductive learning. In: R.S. Michalski, J.G. Carbonell, and T. Mitchell, Eds., Machine Learning: An AI Approach. Kaufmann, Los Altos, CA.
[6] Watanabe, S. (1969). Knowing and Guessing: A Formal and Quantitative Study. Wiley, New York.
[7] Hunt, E.B. (1962). Concept Learning: An Information Processing Problem. Wiley, New York.
[8] Lee, R.C.T. (1981). Clustering analysis and its applications. In: J.T. Tou, Ed., Advances in Information System Science, Vol. 8. Plenum Press, New York.
[9] Sridhar, V. and M. Narasimha Murty (1990). A deductive clustering approach. Technical Report IISc-CSA-90-10, Indian Institute of Science, Bangalore, India.
[10] Touretzky, D.S. (1986). The Mathematics of Inheritance Systems. Kaufmann, Los Altos, CA.
[11] Michalski, R.S. and J.G. Carbonell (1984). Learning from observation: conceptual clustering. In: R.S. Michalski, J.G. Carbonell, and T. Mitchell, Eds., Machine Learning: An AI Approach. Kaufmann, Los Altos, CA.
[12] ACM Computing Reviews 29(1) (1988) 49-71.
[13] ACM Computing Reviews 29(2) (1988) 75-124.
[14] ACM Computing Reviews 29(3) (1988) 131-159.
[15] ACM Computing Reviews 29(1) (1988) 11-20.