Expert Systems With Applications, Vol. 13, No. 4, pp. 259-264, 1997
© 1998 Elsevier Science Ltd. All rights reserved
Printed in Great Britain
0957-4174/98 $19.00+0.00
Pergamon
PII: S0957-4174(97)00017-X
Knowledge Discovery in Databases Using Lattices

F. J. VENTER†
University of Pretoria, P.O. Box 39640, Morelata Park 0044, South Africa

G. D. OOSTHUIZEN AND J. D. ROOS
Department of Computer Science, University of Pretoria, Pretoria 0002, South Africa
Abstract: The rapid pace at which data gathering, storage and distribution technologies are developing is outpacing our advances in techniques for helping humans to analyse, understand, and digest the vast amounts of resulting data. This has led to the birth of knowledge discovery in databases (KDD) and data mining, a process whose goal is to selectively extract knowledge from data. A range of techniques, including neural networks, rule-based systems, case-based reasoning, machine learning and statistics, can be applied to the problem. We discuss the use of concept lattices to determine dependences in the data mining process. We first define concept lattices, after which we show how they represent knowledge and how they are formed from raw data. Finally, we show how the lattice-based technique addresses different processes in KDD, especially visualization and navigation of discovered knowledge. © 1998 Elsevier Science Ltd. All rights reserved
1. INTRODUCTION

A FUNDAMENTAL APPLICATION of information processing is data analysis. Marketing researchers want to detect trends in consumer databases. Managers make strategic decisions based on analysis of their financial and customer databases. Scientific researchers induce models from samples extracted from their research domains. Even databases that were not initially created for analysis purposes, such as the process values from a process control environment, come under the scrutiny of analysers. However, the rapid advancement of information technology has resulted in the accumulation of vast amounts of data, which makes analysis increasingly hard. This phenomenon is occurring independently within diverse communities of information technology users, but as the sophistication and levels of interoperability between networked information systems increase, its effect will grow. To cope with this trend, data analysis techniques need to become increasingly sophisticated.

In this paper we elaborate on one of a new breed of database analysis techniques, collectively called 'knowledge discovery in databases' (KDD). The aim of this new research field is to devise intelligent systems that assist analysers in their mammoth task of discovering possible dependences between elements of very large databases. The KDD process usually comprises a data preparation and definition phase, after which a (semi-)automatic dependence generation phase scrutinizes all data records, producing all induced dependences. This phase is often referred to as 'data mining'. The presentation of these dependences to the analyser, and possibly further iterations of the whole process, then follow. Algorithms and techniques used in the data mining phase are diverse and mostly have roots in the machine learning field.

We discuss a machine learning technique developed by Oosthuizen and McGregor (1988) and show how this technique is well suited for application in KDD. The technique employs a derivation of the mathematical construct called a lattice to form a special type of lattice (which we call a 'concept lattice') of attribute classes from a set of input records. This concept lattice has special characteristics that facilitate inference on the attributes of the data set. The set of dependences that the lattice contains is presented to the analyser in a graphical format that is easy to navigate and understand. The research has also resulted in some software tools that are used to test the application of lattices in KDD. These tools are also discussed briefly.
† Author for correspondence.

2. A BRIEF OVERVIEW OF KDD

According to Frawley et al. (1992), KDD is the extraction of previously unknown and potentially useful information from data. Given a set of facts (data) D, a language L, and some measure of certainty C, a pattern is a statement S in L that describes relationships among a subset Ds of D with a certainty C, such that S is simpler (in some sense) than the enumeration of all facts in Ds. All patterns that are interesting to the user and are certain enough according to the user are collectively called knowledge. All patterns according to this definition that a program generates from a given data set are called discovered knowledge. This definition of KDD concentrates on the format of the output of the process and not the process itself. Key issues arise from the KDD process and therefore each of the elements of the process should be examined to obtain a more holistic view of KDD.

2.1. The KDD Process

The KDD process consists of the following major phases:

• Data preparation, cleaning and warehousing. This is where issues such as noisy data, inconsistent formats, etc. are addressed. Prepared data should be maintained on a distributed database, allowing location-transparent distribution up to relational level.
• Data-driven exploration. The analyser needs to peruse the content of the data before starting the mining phase, to identify possibly interesting subsets of the data. This will help to reduce the computing costs of later, more complex and processing-intensive phases. It is often the case that the analyser does not even have a clear vision of the knowledge discovery goal. A first peek at the content and some exposition of the structure of the data can help to trigger deeper exploration into interesting areas.
• Requirements analysis. As the person who carries out the data analysis is often not the person who eventually uses the discovered knowledge, the requirements of the knowledge user need to be defined.
• Search for interesting patterns (data mining). During this phase, the clean data set is mined for possible patterns according to criteria set by the user. These criteria include the level of accuracy of found patterns and the user's biases with respect to relevance or 'interestingness' of possible output patterns. This phase is also referred to as the 'dredging' phase.
• Presentation and navigation of patterns. When the KDD process is interactive, intermediate results are presented to the user, mining parameters are refined and the search for patterns reiterated, until the desired findings are reported. A more data-driven, or bottom-up, dredging of raw facts can be used to discover dependences with fewer interactions with the user. In this case only the final results will be displayed for the user to interrogate. The user needs to understand the output knowledge. Therefore the discovered patterns need to be depicted in a high-level language. A graphical paradigm should be used to depict the structure of the discovered knowledge space. The analyser should have the ability to 'browse' this space to dissect the content and implications of the discovered patterns incrementally.
3. THE CONCEPT LATTICE APPROACH

Although our approach to KDD stems from a machine induction technique developed by Oosthuizen, we have exploited the graphical nature of the base construct (the 'concept lattice') used in the technique, so that we can address more of the processes mentioned above than just the 'mining' step of KDD. This means that we have taken a machine learning technique as the basis and extended it to a comprehensive KDD solution. Some software tools have been developed to demonstrate some of the points we make in this paper. We try to use a unified lattice-based approach to realize some of the processes of KDD. This approach is in contrast to one where different parts of the KDD cycle are realized using different techniques or tools.

When explaining our approach, we start with the basic definitions of concept lattices, underpinning the knowledge representational basis of the approach, after which we explain the machine learning technique that constitutes the 'data mining' part of the process. Our description of concept lattices is brief, as there are other references that give more complete expositions on concept lattices (e.g. Oosthuizen & McGregor, 1988; Godin et al., 1991; Oosthuizen, 1991; Carpineto & Romano, 1993). Finally, we show how we propose to extend the approach to address some of the other processes of KDD.
4. DEFINING THE CONCEPT LATTICE

When discussing any intelligent system, classically the first issue to address is how knowledge is represented. As mentioned above, our technique represents knowledge in terms of concept lattices. A lattice is a directed acyclic graph (DAG) in which every two nodes have a unique nearest common descendant (their meet) and a unique nearest common ancestor (their join). The lattices discussed here are of a special kind, called concept lattices (Wille, 1982), which have the following additional properties: (1) apart from the children of the universal node at the top and the parents of the NULL node at the bottom, no other nodes in the graph have exactly one parent or exactly one child; (2) no node has a parent (i.e. no node is directly linked to another node) to which it is also indirectly linked by means of a path that goes via one or more other nodes.
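The meet and join in the definition above can be illustrated with a minimal Python sketch. The seven-node graph, the node names and the helper functions are all hypothetical and chosen only for illustration; they are not taken from the paper or from the GRAND/DATAMAP tools. Edges point downward, from the universal node (here `TOP`) towards `NULL`.

```python
from typing import Dict, Set

# Hypothetical toy lattice: each node maps to its direct children.
CHILDREN: Dict[str, Set[str]] = {
    "TOP": {"a", "b", "c"},
    "a": {"ab", "ac"},
    "b": {"ab"},
    "c": {"ac"},
    "ab": {"NULL"},
    "ac": {"NULL"},
    "NULL": set(),
}


def descendants(node: str) -> Set[str]:
    """Return the node itself plus every node reachable by following child edges."""
    seen = {node}
    stack = [node]
    while stack:
        for child in CHILDREN[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def ancestors(node: str) -> Set[str]:
    """Return the node itself plus every node from which it can be reached."""
    return {n for n in CHILDREN if node in descendants(n)}


def meet(x: str, y: str) -> str:
    """Nearest common descendant: the common descendant that lies above all the others."""
    common = descendants(x) & descendants(y)
    nearest = [c for c in common if common <= descendants(c)]
    if len(nearest) != 1:
        raise ValueError("graph is not a lattice: meet is not unique")
    return nearest[0]


def join(x: str, y: str) -> str:
    """Nearest common ancestor: the common ancestor that lies below all the others."""
    common = ancestors(x) & ancestors(y)
    nearest = [c for c in common if common <= ancestors(c)]
    if len(nearest) != 1:
        raise ValueError("graph is not a lattice: join is not unique")
    return nearest[0]


if __name__ == "__main__":
    print(meet("a", "b"), join("ab", "ac"))
```

In this toy graph the meet of `a` and `b` is `ab`, and the join of `ab` and `ac` is `a`. The quadratic-time helpers are meant only to make the definition concrete, not to suggest how a practical implementation would store the lattice.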
Although Wille (1982) formally introduced the notion of a concept lattice, the particular organization of data described here also corresponds to the so-called cladistic approach to classification used by biologists and linguists for some time (Hoenigswald & Wiener, 1987). In fact, the fundamental idea behind concept lattices dates back to Aristotle, who noted the inverse relation between the number of properties required to define a concept and the number of entities to which the concept applies. This is referred to as the duality of intension and extension (Sowa, 1984). Each node represents a highly repetitive pattern (a set of features or attributes of the input data). The concept lattice provides us with one unified structure that contains many tangled, optimally integrated trees (hierarchies) of nodes. For each data set there is a unique, minimal (in number of nodes) concept lattice. These characteristics make it possible to derive all n-ary relationships between all attributes of the input data set.

Figure 1 illustrates an example of a very small, simple concept lattice. It should be noted that we distinguish between three types of nodes: attribute (or feature) nodes, concept nodes and entity nodes. The attribute nodes are all single-valued symbolic assertions about any data entity, e.g. hair=blonde, eyes=blue, eyes=brown, complexion=dark, etc. Entities are the input data rows that we obtain from the database that the system is parsing. Entities are also referred to as examples, data tuples or data records. Each entity is usually a tuple of attributes, e.g. {eyes=blue, hair=blonde, complexion=fair, classification=positive}, and is derived from the raw data set. The internal nodes, called 'concept nodes', are created when the lattice is constructed. These are the nodes that relate attributes to each other and constitute the 'knowledge' in the lattice. The concept nodes also have strengths (i.e. how many entities they cover), e.g. in Fig. 1, node *1 has a strength of three and node *2 has a strength of two. Strengths also give an indication of the confidence of the relationships portrayed by the concept nodes. The universal node that is connected to all attributes at the top and the NULL node that is connected at the bottom have been omitted for readability reasons. Lattice-based knowledge base normalization has been shown to be a useful technique for inductive learning (Oosthuizen & McGregor, 1988).

FIGURE 1. A small concept lattice. (Layers, top to bottom: attributes, concept nodes, entities.)
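To make the relationship between attributes, entities and strengths concrete, the following minimal Python sketch computes the attribute patterns shared by two or more entities and the number of entities each pattern covers. The four entities are hypothetical (they reuse the attribute names mentioned above but are not the entities of Fig. 1), and the brute-force enumeration over entity groups is a stand-in chosen for brevity, not the lattice construction algorithm of Oosthuizen and McGregor (1988).

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Set

# Hypothetical entities; each is a set of single-valued attribute assertions.
ENTITIES: Dict[str, Set[str]] = {
    "E1": {"hair=blonde", "eyes=blue", "complexion=fair"},
    "E2": {"hair=blonde", "eyes=blue", "complexion=dark"},
    "E3": {"hair=dark", "eyes=brown", "complexion=dark"},
    "E4": {"hair=blonde", "eyes=brown", "complexion=fair"},
}


def extent(attrs: Iterable[str]) -> FrozenSet[str]:
    """Entities that have every attribute in attrs (the extension of the pattern)."""
    wanted = set(attrs)
    return frozenset(e for e, have in ENTITIES.items() if wanted <= have)


def intent(entity_ids: Iterable[str]) -> FrozenSet[str]:
    """Attributes shared by all the given entities (the intension of the group)."""
    rows = [ENTITIES[e] for e in entity_ids]
    if not rows:
        # By convention, an empty group shares every attribute.
        return frozenset().union(*ENTITIES.values())
    return frozenset(set.intersection(*rows))


def shared_patterns() -> Dict[FrozenSet[str], int]:
    """Map each non-empty pattern shared by two or more entities to its strength."""
    patterns: Dict[FrozenSet[str], int] = {}
    for size in range(2, len(ENTITIES) + 1):
        for group in combinations(sorted(ENTITIES), size):
            shared = intent(group)
            if shared:
                patterns[shared] = len(extent(shared))
    return patterns


if __name__ == "__main__":
    ranked = sorted(shared_patterns().items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    for attrs, strength in ranked:
        print(strength, sorted(attrs))
```

On this toy data the pattern {hair=blonde} has strength three, while the more specific pattern {hair=blonde, eyes=blue} has strength two, which illustrates the inverse relation between the number of attributes and the number of entities covered. Patterns consisting of a single attribute coincide with attribute nodes rather than internal concept nodes.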
5. MINING FOR KNOWLEDGE (CONCEPTS): FROM DATA TO CONCEPT LATTICES

The concept lattice is constructed by creating a node for each data point at the bottom of the graph (e.g. nodes E1-E5 at the bottom of Fig. 1 represent the corresponding data records in Table 1) and a node for each attribute value at the top. During this process internal nodes are created between the data points at the bottom and the attributes at the top (they are marked with '*' in Fig. 1). Each data point is then connected to its respective attributes whilst ensuring that the graph remains a lattice. It can be shown that a given set of entities gives rise to a unique lattice. The exact manner in which the lattice is constructed is beyond the scope of this paper. Algorithms have been given by Oosthuizen and McGregor (1988), Godin et al. (1991), Oosthuizen (1991) and Carpineto and Romano (1993).

Each internal node denotes a pattern of attributes that occurred in more than one data point. As all possible combinations of attributes could potentially occur in the data, the number of nodes in a lattice is bounded above by the size of the powerset of the attributes, i.e. 2^n, where n is the number of attributes per entity. The actual number of nodes encountered in real-world data sets is, however, only a fraction of this amount. This is also why the lattice is useful as a data analysis tool: each of the internal nodes in the lattice represents a regularity in the data. Also, to ensure that the graphs are not cluttered by accidental coincidences, statistically insignificant nodes are removed from the lattice.

Let us consider an example. For simplicity we consider a cleaned, integrated database in the form of a single file of variable-length tuples of symbolic features (attribute values). Figure 2 shows a lattice generated from the data in Table 1. It depicts underground sample information regarding rock samples collected for laboratory analysis.

TABLE 1
Input Data to the Concept Lattice in Fig. 2

Entity no.  Size   Colour  Shape      Contains heavy metals
E1          small  brown   regular    yes
E2          large  brown   irregular  no
E3          large  yellow  regular    yes
E4          small  black   regular    no
E5          large  black   regular    yes
E6          large  brown   regular    yes
E7          large  black   irregular  no
E8          small  brown   irregular  no

[Structure column (hard/brittle): per-row values not recoverable]

FIGURE 2. A concept lattice formed from rock samples. (Attribute groups: Size, Colour, Shape, Metals, Structure.)

The lattice in Fig. 2 appears rather irregular, but this is because we are representing an n-dimensional structure in two-dimensional space, and only the patterns of attributes that actually occurred in the data are represented. Each concept in the lattice is described by the attributes (its intension) that are included in its so-called 'upward closure', i.e. all nodes that are transitively ancestral to ('above') the concept node in the lattice. All entities that pertain to this concept can be found in the downward closure of the node. The strength (confidence) of a concept is the number of entities pertaining to it, i.e. the number of entity nodes in its downward closure.

Inference on subsets of attributes is possible as follows. To determine the implied set of attributes (we call it R), given an input set of attributes (we call it the set Q), the meet of the set Q is determined (we call it node M). The upward closure of this node (M) is then taken. If the upward closure contains any attributes apart from those in Q, then these are implied by the ones in Q, i.e. Q ⇒ R. (For a more thorough discussion of this inference on a lattice, see Oosthuizen and McGregor (1988).)

We have also developed an algorithm that traverses the lattice and generates all possible rules that can be induced from each node in the lattice. We define the attributes that a node spans as all the attribute nodes (at the 'top' of the lattice) that are in its upward closure. The algorithm determines for each node, n, its immediate parents as the set P. It further determines, for each element p of P, the difference between the attributes spanned by n and those spanned by p as the set S, which then constitutes the left-hand side of a rule that, according to the inference definition above, would give the remaining attributes of node n. The algorithm further checks that S does not correspond to the attributes of any of the parent nodes in P, which would inhibit the 'selection' of the attributes in S to be the meet at n. This also makes it possible to indicate, when the lattice is displayed as a two-dimensional abstraction of the n-space, which nodes are completely inhibited from having rules and probably only represent spurious covariances.

6. EXPLORING KNOWLEDGE: A MAP OF CONCEPTS

As opposed to some other KDD techniques where killer queries and/or a highly iterative approach is followed, lattices allow a graphical representation of the discovered knowledge space that can be navigated (browsed) by the user. A refinement of the induction algorithm has also been developed that employs a pruning regime to reduce the size of the lattice and facilitate focusing of the KDD process. We have developed a program called GRAND (GRAph iNDuction) that implements the lattice generation algorithm, and DATAMAP, a program that implements the graphical depiction and navigation of the generated lattices. A very thorough discussion on visualization of categorical data using lattices has been given by Oosthuizen and
Venter (1995). We present here a brief description of how a user can navigate the lattice in DATAMAP.

6.1. Visual Display and Navigation of the Lattice

DATAMAP displays the lattice as a two-dimensional abstraction of the n-dimensional structure of the lattice, but unlike the toy example in Fig. 2, real lattices are too large to display on the screen in their entirety. Consequently, only the relevant parts of the lattice are displayed. DATAMAP allows the user to first select an initial set of attributes from the recorded set of attributes in the database. DATAMAP then takes the user to the node that is the meet of the selected attributes. We call this the focus, and it is illustrated by the square node in Fig. 3. The user can then immediately see all the attributes of the focus (all the attributes in the upward closure of the focus). All additional attributes in this list that were not selected initially are the implied attributes as discussed above. The user can then decide to select another start set of attributes, i.e. explore dependences of another set of attributes, or select another node of the lattice. If he/she selects another node in the lattice, that node becomes the new focus.

The nodes in the lattice are labelled in terms of the difference with respect to their nearest parents (for nodes under the focus) or nearest children (for nodes above the focus). Nodes above the focus are labelled by attribute names preceded by a '-'. This means that selecting one of these nodes will imply subtracting one or more attributes from the current set of attributes. Nodes below the focus are labelled by attribute names preceded by a '+'. This means that selecting one of these nodes will imply adding one or more attributes to the current set of attributes. This helps the user to add or subtract features from the goal concept that he/she is searching for. Clicking on a node also moves the node (the new focus) to the centre of the screen. If node *6 in Fig. 2 is the current focus, then the screen appears as shown in Fig. 3. If the user then selects node *4, the screen is updated as shown in Fig. 4.

FIGURE 3. The DATAMAP screen after node *6 was selected.

FIGURE 4. The DATAMAP screen after node *4 was selected.

7. CONCLUSION

KDD is a promising field of research, and successful applications show that industry is already obtaining a substantial return on investment. We believe our technique could prove to be a valuable contribution to the field of KDD. To bring our technique closer to the general KDD goal, we need to relate aspects of our approach to the previously mentioned KDD processes. We therefore conclude by revisiting each process from the lattice-based KDD viewpoint:

• Data preparation, cleaning and warehousing. We employ pruning techniques to eliminate weak concept nodes that could have been caused by noisy data.
• Data-driven exploration. As we expose the underlying dependences of the raw data graphically, the user has a rich exploration environment in the form of the DATAMAP program. This helps the user to be 'led' by the system through the maze of possibly 'interesting' relationships to explore further.
• Requirements analysis. This process is currently beyond the scope of our work.
• Search for interesting patterns (data mining). This process is well addressed, as discussed in the major part of this paper. Lattice-based machine learning forms the basis of the data mining phase.
• Presentation and navigation of patterns. This is probably where our technique makes the biggest impact. The graphical depiction of the learned knowledge, together with the navigational and explorational operations that our system facilitates, not only provides a rich interactive environment during the exploration-and-mining cycle, but also presents knowledge as part of the unified view on the normalized knowledge base.
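The navigation step summarized in the last item (select attributes, move to their meet as the focus, read off the implied attributes, then step to a neighbouring node labelled '+' or '-') can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the DATAMAP implementation: the six records are hypothetical values that merely reuse the attribute abbreviations siz, col, shp and hmet, the function names are invented here, and lattice nodes are represented simply by the attribute sets they span, obtained by closing the entity attribute sets under intersection.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical rock-sample records; the values are illustrative only.
ENTITIES: Dict[str, Set[str]] = {
    "E1": {"siz=small", "col=brown", "shp=reg", "hmet=yes"},
    "E2": {"siz=large", "col=brown", "shp=irreg", "hmet=no"},
    "E3": {"siz=large", "col=yellow", "shp=reg", "hmet=yes"},
    "E4": {"siz=small", "col=black", "shp=reg", "hmet=no"},
    "E5": {"siz=large", "col=black", "shp=reg", "hmet=yes"},
    "E6": {"siz=large", "col=brown", "shp=reg", "hmet=yes"},
}


def focus(selected: Set[str]) -> FrozenSet[str]:
    """Attributes spanned by the meet of the selected attributes (the set Q)."""
    covered = [attrs for attrs in ENTITIES.values() if selected <= attrs]
    if not covered:
        raise ValueError("no entity has all selected attributes (the meet is NULL)")
    return frozenset(set.intersection(*covered))


def implied(selected: Set[str]) -> FrozenSet[str]:
    """Attributes in the focus that were not selected, i.e. R in Q => R."""
    return focus(selected) - selected


def all_nodes() -> Set[FrozenSet[str]]:
    """Every attribute set spanned by some node: entity sets closed under intersection."""
    nodes = {frozenset(attrs) for attrs in ENTITIES.values()}
    changed = True
    while changed:
        changed = False
        for x in list(nodes):
            for y in list(nodes):
                if (x & y) not in nodes:
                    nodes.add(x & y)
                    changed = True
    return nodes


def neighbours(current: FrozenSet[str]) -> List[Tuple[str, FrozenSet[str]]]:
    """Label direct parents with '-' and direct children with '+', relative to the focus."""
    nodes = all_nodes()
    above = [n for n in nodes if n < current]  # nodes spanning fewer attributes
    below = [n for n in nodes if n > current]  # nodes spanning more attributes
    parents = [n for n in above if not any(n < m for m in above)]
    children = [n for n in below if not any(m < n for m in below)]
    labelled = [("-" + ",".join(sorted(current - p)), p) for p in parents]
    labelled += [("+" + ",".join(sorted(c - current)), c) for c in children]
    return labelled


if __name__ == "__main__":
    query = {"siz=large", "shp=reg"}
    node = focus(query)
    print(sorted(node), sorted(implied(query)))
    for label, _ in sorted(neighbours(node)):
        print(label)
```

With this toy data, selecting siz=large and shp=reg leads to a focus that also spans hmet=yes, so hmet=yes is the implied attribute. The closure by pairwise intersection is quadratic per pass and is only suitable for very small examples; it stands in for whatever lattice structure a real tool would maintain.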
REFERENCES

Carpineto, C., & Romano, G. (1993). GALOIS: An order-theoretic approach to conceptual clustering. In Proceedings of the International Machine Learning Conference, Amherst, pp. 33-40.
Frawley, W. J., Piatetsky-Shapiro, G., & Matheus, C. J. (1992). Knowledge discovery in databases: an overview. AI Magazine, Fall, 57-70.
Godin, R., Missaoui, R., & Hassan, A. (1991). Learning algorithms using a Galois lattice structure. In Proceedings of 1991 IEEE International Conference on Tools for AI, San Jose, pp. 22-29.
Hoenigswald, H. M., & Wiener, L. F. (1987). Biological metaphor and cladistic classification: an interdisciplinary perspective. Philadelphia: University of Pennsylvania Press.
Oosthuizen, G. D. (1991). Lattice-based knowledge discovery. In Proceedings of AAAI-91 Knowledge Discovery Workshop, Anaheim, pp. 221-235.
Oosthuizen, G. D., & McGregor, D. R. (1988). Induction through knowledge base normalization. In Proceedings of the European Conference on Artificial Intelligence, Munich, pp. 396-401.
Oosthuizen, G. D., & Venter, F. J. (1995). Using a lattice for visual analysis of categorical data. In Perceptual issues in visualization (pp. 142-155). Berlin: Springer.
Sowa, J. F. (1984). Conceptual structures: information processing in mind and machine. Reading, MA: Addison Wesley.
Wille, R. (1982). Restructuring lattice theory: an approach based on hierarchies of concepts. In I. Rival (Ed.), Ordered sets (pp. 445-470). Dordrecht: Reidel.