Using table lens to interactively build classifiers

Applied Mathematics Letters 14 (2001) 663-666
www.elsevier.nl/locate/aml

JIANCHAO HAN
Department of Computer Science, University of Waterloo
Waterloo, Ontario N2L 3G1, Canada
j2han@math.uwaterloo.ca

(Received and accepted August 2000)

Communicated by N. Cercone

Abstract—Rather than inducing classification rules with sophisticated algorithms, we introduce a fully interactive approach for building classifiers from large multivariate datasets, based on the table lens, a multidimensional visualization technique, together with appropriate interaction capabilities. Constructing classifiers becomes an interaction with a feedback loop, in which domain knowledge and human perception can be profitably included. In our approach, both continuous and categorical attributes are processed uniformly, and continuous attributes are partitioned on the fly. Our performance evaluation with datasets from the UCI repository demonstrates that this interactive approach makes it easy to build understandable classifiers with high prediction accuracy, without requiring a priori knowledge about the datasets. © 2001 Elsevier Science Ltd. All rights reserved.

Keywords—Classification, Rule induction, Visualization, Interaction.

1. INTRODUCTION

Building classifiers from large multivariate datasets requires finding a way to assign a new object to one of a number of predetermined classes based on a set of measurements. Many approaches have been proposed for building classifiers in the statistics, machine learning, and data mining communities [1]. Usually, a classifier consists of, or is equivalent to, a set of classification rules, and each rule includes a condition part and a decision part. The decision part specifies the class label, while the condition part is composed of attribute-value pairs. For categorical attributes, the value in a pair is a value of the attribute; for continuous attributes, the value is an interval of the attribute domain. Continuous attribute discretization plays an important role in building classifiers because the interval boundaries must be decided before classification rules can be built. Most classification approaches are based on sophisticated statistical algorithms [2]. These algorithms, however, are not easily adapted to different datasets, and merely changing the input parameters may produce widely different results. Alternatively, visualization-based approaches for building classifiers have also been developed recently [3]; these stress human-machine interaction and include a role for the user's perception. Geometrical representations of the dataset can be very helpful in this regard because

0893-9659/01/$ - see front matter © 2001 Elsevier Science Ltd. All rights reserved. PII: S0893-9659(01)00025-8
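The rule structure described above (conditions as attribute-value pairs for categorical attributes and attribute-interval pairs for continuous ones, plus a class-label decision) can be made concrete with a short sketch. All names here are hypothetical illustrations, not the paper's implementation:

```python
# Minimal sketch of the rule representation described above: a condition is
# an attribute paired with either a single categorical value or a
# (low, high) interval for a continuous attribute (illustrative only).
def matches(condition, tuple_):
    attr, value = condition
    if isinstance(value, tuple):          # continuous attribute: interval test
        low, high = value
        return low <= tuple_[attr] <= high
    return tuple_[attr] == value          # categorical attribute: equality test

def classify(rules, tuple_, default=None):
    """Return the class label of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(matches(c, tuple_) for c in conditions):
            return label
    return default

rules = [
    ([("petal-length", (1.0, 1.9))], "Iris-setosa"),   # continuous interval
    ([("color", "blue")], "class-2"),                  # categorical value
]
print(classify(rules, {"petal-length": 1.4, "color": "red"}))  # Iris-setosa
```

A tuple is assigned by the first matching rule; tuples matching no rule fall through to a default, which is one simple way to handle the uncovered ranges mentioned later.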


these representations portray the data items in a two- or three-dimensional space, and thus, help the user gain better perception and insight into multidimensional data [4,5]. Domain knowledge and expert perception can also help construct the classifiers.

We present a novel approach for interactively building classifiers by visualizing the datasets. This approach allows us to integrate the domain knowledge and perception of an expert into the classifier construction, and to discretize continuous attributes on demand during the construction. Our experimental results with UCI datasets are reported.

2. VISUALIZING DATA AND CLASSIFICATION RULES

In order to support interactive construction of classifiers, the original dataset and the intermediate result are visualized with the table lens for the user to observe and control the construction process. Basically, the table lens integrates the relational data table or spreadsheet with graphical representations to support browsing of the values for hundreds of tuples and tens of columns

on a typical workstation display [5]. Each tuple is represented as a row in the graph, and each attribute (variable) is represented as a column. The table lens can be used to discover correlations among the observed variables. Figure 1 illustrates the table lens visualizing the IRIS flower dataset, which consists of five attributes shown in the columns; the last column is the class label. The attribute values of tuples are drawn as lines proportional to the width of the corresponding columns. Particular rows and columns can be assigned variable widths according to the number of points in the attribute domains. Figure 1 also shows the range of each attribute above its column.


Figure 1. Table lens visualizing the IRIS dataset.

Sorting columns is a major operation on the table lens. Once a column is sorted, properties of the batch of values can be estimated by graphical perception and some amount of display manipulation. In addition, the correlations among the columns can be observed if other columns are also sorted. In our method, tuples are aggregated in terms of their class labels, and the

number of tuples in each class is displayed on the right side. From Figure 1, one can see that the IRIS dataset contains three classes, and that the attributes petal-length and petal-width are strongly correlated: once petal-width is sorted, the petal-length values are roughly sorted in the same order.

To construct a classification rule, one can interactively specify the intervals of attribute values corresponding to the class labels, which form the condition part of the rule. For example, from Figure 1, one can see that if the petal-length or petal-width values of a tuple are around the minimum, then it belongs to class 1. The user can use a "rubber band" to draw an area in the table lens representing conditions of this kind, with the attribute intervals corresponding to the class label. The obtained rules are also visualized in the table lens, with the condition intervals corresponding to the attribute columns and the decision parts corresponding to the class labels. Figure 2 shows two classification rules currently acquired, and a new rule being constructed. One may argue that another condition, petal-width is around the minimum, should be ORed; this condition, however, is not included because, from Figure 1, one can observe that the petal-length is around the minimum if and only if the petal-width is around the minimum.
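The sort-then-inspect observation behind Figure 1 can be mimicked programmatically: sort the tuples on one attribute and check how nearly a second attribute ends up in the same order. This is a hypothetical sketch with made-up values, not the actual IRIS data or the paper's code:

```python
# Sketch of the correlation check described above: after sorting tuples on
# one attribute, a strongly correlated attribute appears sorted in nearly
# the same order. Values below are illustrative, not the real IRIS data.
def rank_agreement(tuples, key_a, key_b):
    """Fraction of adjacent pairs (after sorting on key_a) that are also
    non-decreasing in key_b; 1.0 means perfectly co-sorted."""
    ordered = sorted(tuples, key=lambda t: t[key_a])
    pairs = list(zip(ordered, ordered[1:]))
    agree = sum(1 for a, b in pairs if a[key_b] <= b[key_b])
    return agree / len(pairs)

data = [
    {"petal-length": 1.4, "petal-width": 0.2},
    {"petal-length": 4.7, "petal-width": 1.4},
    {"petal-length": 5.1, "petal-width": 1.9},
    {"petal-length": 1.3, "petal-width": 0.2},
]
print(rank_agreement(data, "petal-length", "petal-width"))  # 1.0
```

In the table lens the user perceives this agreement visually from the bar profiles rather than computing it, but the quantity illustrates what "sorted in the same order" means.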


Figure 2. Table lens visualizing the classification rules.

On the left side of Figure 2, the rule evaluation is displayed as an area whose height is proportional to the rule coverage and whose width is proportional to the rule accuracy [1]. In Figure 2, the first rule has an accuracy of 100%, while the second is about 98%. Rules with accuracy less than 80% are ignored. In addition, the color of the rules can be used to represent the rule quality [3].

3. INTERACTIVE CONSTRUCTION OF CLASSIFIERS

Building classifiers amounts to interactively and iteratively constructing classification rules. This process can be described as follows.

Step 1: Visualizing the raw data. By looking at the visualization of the raw dataset, one can perceive the correlations between attributes. Tuples within each class can be sorted by the selected column.


Step 2: Constructing a classification rule.
• Using a "rubber band", one can select an attribute and draw interesting intervals for this attribute. The current attribute, with the interval encompassed by the rubber band, is displayed at the bottom left corner.
• If the chosen attribute or the current interval is not appropriate, it can be canceled.
• Repeat this step until all conditions for the current rule are specified.

Step 3: Updating the dataset and evaluating the current rule. Once a classification rule is created, its accuracy, coverage, and quality are calculated. The tuples covered by this rule are marked and will no longer be displayed. The display space saved by the marked tuples is used to visualize the rule: the rule condition is displayed as a set of attribute-interval pairs (see Figure 2).

Step 4: If the current rule evaluation is not satisfactory, one can remove the rule; the marked tuples are then restored.

Step 5: Repeat Steps 2 through 4 until all produced rules are appropriate and all tuples are covered by these rules. The final classifier is composed of these classification rules.
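Steps 2 through 5 form a loop over the remaining (unmarked) tuples. A non-interactive skeleton of that loop, with the user's rubber-band choices replaced by a supplied list of candidate rules (everything here is an illustrative assumption, not the paper's implementation):

```python
# Skeleton of the Step 2-5 loop: accept each candidate rule only if its
# evaluation is satisfactory, mark the tuples it covers, and stop once all
# tuples are covered. User interaction is replaced by a candidate list.
def build_classifier(dataset, candidates, min_accuracy=0.80):
    remaining = list(dataset)
    classifier = []
    for condition, label in candidates:           # Step 2: propose a rule
        matched = [t for t in remaining if condition(t)]
        if not matched:
            continue
        accuracy = sum(t["class"] == label for t in matched) / len(matched)
        if accuracy >= min_accuracy:              # Steps 3-4: keep or discard
            classifier.append((condition, label))
            remaining = [t for t in remaining if not condition(t)]  # mark covered
        if not remaining:                         # Step 5: all tuples covered
            break
    return classifier, remaining

data = [{"x": 1, "class": "a"}, {"x": 2, "class": "a"}, {"x": 9, "class": "b"}]
rules, left = build_classifier(
    data, [(lambda t: t["x"] < 5, "a"), (lambda t: t["x"] >= 5, "b")])
print(len(rules), len(left))   # 2 0
```

In the interactive system, the human supplies the candidate conditions by drawing rubber bands and judges whether an evaluation is satisfactory; the skeleton only shows the bookkeeping of marking and restoring tuples.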

4. EXPERIMENT AND CONCLUSION

We implemented the approach described in this paper with Visual C++ 6.0 on Windows 98, and our experiments with UCI datasets [6] included Diabetes, Iris, Monks, Parity5+5, and others. Compared to other approaches for building classifiers, our method has the following characteristics.

(1) Easy to use: the approach is very easily learned. We trained two students from noncomputer majors for five minutes; they built classifiers with an average accuracy higher than 90%.
(2) Uncertain classifiers: for the same dataset, the generated classifier is uncertain and user-dependent.
(3) Varying accuracy: the accuracy of the generated classifiers also depends on the user. Moreover, the same user may obtain classifiers of different accuracies in different executions of the system.
(4) Understandable classifiers: generally, most classification rules in the final classifier contain only one or two condition attributes.
(5) On-demand discretization: the continuous attributes are not partitioned in advance; only the needed intervals are specified on the fly. The ranges not included in any rules are not discretized.
(6) Uniform processing of categorical and continuous attributes: categorical attributes are processed in the same way as continuous attributes.

The main problem of this approach is that the display window is always limited: it is impossible for the window to accommodate all data items in real applications, so data reduction is necessary. Our next work will focus on how to interactively select features and reduce the number of tuples.

REFERENCES

1. T.M. Mitchell, Machine Learning, McGraw-Hill, (1997).
2. G. Nakhaeizadeh and C.C. Taylor, Machine Learning and Statistics: The Interface, John Wiley & Sons, New York, NY, (1997).
3. J. Han and N. Cercone, RuleViz: A model for visualizing knowledge discovery process, In Proc. of KDD-2000, Boston, MA, (August 2000).
4. S.K. Card, J.D. Mackinlay and B. Shneiderman, Readings in Information Visualization: Using Vision to Think, Morgan Kaufmann, San Francisco, CA, (1999).
5. R. Rao and S.K. Card, The table lens: Merging graphical and symbolic representations in an interactive focus and context visualization for tabular information, In Proc. of ACM Conference on Human Factors in Computing Systems, pp. 318-322, New York, NY, (1994).
6. P.M. Murphy and D.W. Aha, UCI Repository of Machine Learning Databases, URL: http://www.ics.uci.edu/~mlearn/MLRepository.html, (1996).