Applied Mathematics Letters 14 (2001) 663-666
www.elsevier.nl/locate/aml

Using Table Lens to Interactively Build Classifiers

JIANCHAO HAN
Department of Computer Science, University of Waterloo
Waterloo, Ontario, N2L 3G1, Canada
j2han@math.uwaterloo.ca

(Received and accepted August 2000)

Communicated by N. Cercone
Abstract-Rather than induce classification rules by sophisticated algorithms, we introduce a fully interactive approach for building classifiers from large multivariate datasets, based on the table lens, a multidimensional visualization technique, with appropriate interaction capabilities. Constructing classifiers is an interaction with a feedback loop, into which domain knowledge and human perception can be profitably incorporated. In our approach, both continuous and categorical attributes are processed uniformly, and continuous attributes are partitioned on the fly. Our performance evaluation with datasets from the UCI repository demonstrates that this interactive approach makes it easy to build understandable classifiers with high prediction accuracy and no a priori knowledge about the datasets. © 2001 Elsevier Science Ltd. All rights reserved.
Keywords-Classification, Rule induction, Visualization, Interaction.
1. INTRODUCTION

Building classifiers from large multivariate datasets requires finding a way to assign a new object to one of a number of predetermined classes based on a set of measurements. Many approaches have been proposed for building classifiers in the statistics, machine learning, and data mining communities [1]. Usually, a classifier consists of, or is equivalent to, a set of classification rules, and each rule includes a condition part and a decision part. The decision part specifies the class label, while the condition part is composed of attribute-value pairs. For categorical attributes, the value in a pair is a value of the attribute; for continuous attributes, the value is an interval of the attribute domain. Continuous attribute discretization plays an important role in building classifiers because the interval boundaries must be decided to build classification rules.

Most classification approaches are based on sophisticated statistical algorithms [2]. These algorithms, however, are not easily adapted to different datasets, and merely changing the input parameters may produce widely different results. Alternatively, visualization-based approaches for building classifiers have also been developed recently [3]; these stress human-machine interaction and include a role for the user's perception. Geometrical representations of the dataset can be very helpful in this regard because
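The rule structure just described (ANDed attribute-value pairs plus a class label) can be sketched in code. The following Python is illustrative only, with hypothetical class and field names, not the paper's implementation:

```python
# Sketch of the rule structure described above (names are ours, not the paper's).

class Condition:
    """One attribute-value pair: a categorical value for categorical
    attributes, or an interval (low, high) for continuous attributes."""
    def __init__(self, attribute, value=None, interval=None):
        self.attribute = attribute
        self.value = value          # e.g., "sunny"
        self.interval = interval    # e.g., (1.0, 1.9)

    def matches(self, tuple_):
        v = tuple_[self.attribute]
        if self.interval is not None:
            low, high = self.interval
            return low <= v <= high
        return v == self.value

class Rule:
    """Condition part (ANDed attribute-value pairs) plus a decision part."""
    def __init__(self, conditions, class_label):
        self.conditions = conditions
        self.class_label = class_label

    def covers(self, tuple_):
        return all(c.matches(tuple_) for c in self.conditions)

# A classifier is simply a list of such rules; a new object is assigned
# the class label of the first rule that covers it.
rule = Rule([Condition("petal-length", interval=(1.0, 1.9))], class_label=1)
print(rule.covers({"petal-length": 1.4}))  # True
```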
these representations portray the data items in a two- or three-dimensional space, and thus, help the user gain better perception and insight into multidimensional data [4,5]. Domain knowledge and expert perception can also help construct the classification rules.

We present a novel approach for interactively building classifiers by visualizing the datasets and the classifiers. This approach allows us to integrate the domain knowledge and perception of an expert and discretize continuous attributes on demand during the classifier construction. Our experimental results with UCI datasets are reported.

2. VISUALIZING DATA AND CLASSIFICATION RULES

In order to support interactive construction of classifiers, the original dataset and the intermediate results are visualized with the table lens for the user to observe and control the construction process. Basically, the table lens integrates the relational data table or spreadsheet with graphical representations to support browsing of the values for hundreds of tuples and tens of columns
on a typical workstation display [5]. Each tuple is represented as a row in the graph, and each attribute (variable) is represented as a column. The table lens can be used to discover correlations among the observed variables. Figure 1 illustrates the table lens visualizing the IRIS flower dataset, which consists of five attributes shown as columns; the last column is the class label. The attribute values of tuples are drawn as lines whose lengths are proportional to the values, scaled to the width of the corresponding columns. Particular rows and columns can be assigned variable widths according to the number of points in the attribute domains. Figure 1 also shows the range of each attribute above its column.
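The row-and-bar encoding described above can be sketched in a few lines. This is a rough text-mode approximation of the idea (function name, column width, and sample values are ours, not the paper's):

```python
def table_lens(rows, attributes, col_width=10):
    """Render each tuple as a row of bars; bar length is the attribute
    value normalized to the column's [min, max] range."""
    ranges = {}
    for a in attributes:
        vals = [r[a] for r in rows]
        ranges[a] = (min(vals), max(vals))
    lines = []
    for r in rows:
        cells = []
        for a in attributes:
            lo, hi = ranges[a]
            frac = 0.0 if hi == lo else (r[a] - lo) / (hi - lo)
            n = round(frac * col_width)
            cells.append(("#" * n).ljust(col_width))
        lines.append("|".join(cells))
    return "\n".join(lines)

# Toy slice of an IRIS-like dataset, one line per tuple.
rows = [{"petal-length": 1.4, "petal-width": 0.2},
        {"petal-length": 4.7, "petal-width": 1.4},
        {"petal-length": 6.9, "petal-width": 2.5}]
print(table_lens(rows, ["petal-length", "petal-width"]))
```

Sorting the rows before rendering is what makes the correlations visible, exactly as in the real table lens.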
Figure 1. Table lens visualizing the IRIS dataset.
Sorting columns is a major operation on the table lens. Once a column is sorted, properties of the batch of values can be estimated by graphical perception and some amount of display manipulation. In addition, the correlations among the columns can be observed if other columns are also sorted. In our method, tuples are aggregated in terms of their class labels, and the
number of tuples in each class is displayed on the right side. From Figure 1, one can see that the IRIS dataset contains three classes, and once the class column is sorted, the attributes petal-length and petal-width are roughly sorted in the same order; attributes of this kind are strongly correlated to the class label.

To construct a classification rule, one can interactively specify the intervals for the attributes corresponding to the class labels to form the condition part of the rule. For example, from Figure 1, one can see that if the petal-length or petal-width values of a tuple are around the minimum, then it belongs to class 1. The user can use a "rubber band" to draw an area in the table lens to represent this kind of attribute intervals. One may argue that the two conditions should be ORed into one condition; this, however, is not necessary, because from Figure 1 one can observe that the petal-length of a tuple is around the minimum if and only if its petal-width is around the minimum.

The obtained rules are also visualized in the table lens, with their condition and decision parts. Figure 2 shows two classification rules currently acquired, and a new rule being constructed.
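The sort-within-classes and rubber-band steps can be made concrete with a minimal sketch; helper names and the toy data are hypothetical, standing in for the interactive selection:

```python
# Sketch: aggregate tuples by class, sort within each class by a chosen
# attribute, and turn a "rubber band" selection into an interval condition.

def sort_within_classes(rows, attribute):
    """Group tuples by class label, then sort each group by `attribute`."""
    return sorted(rows, key=lambda r: (r["class"], r[attribute]))

def rubber_band(rows, attribute, class_label):
    """Interval spanned by the selected class's values of `attribute`,
    i.e., the condition a user would sweep out with the rubber band."""
    vals = [r[attribute] for r in rows if r["class"] == class_label]
    return (min(vals), max(vals))

rows = [{"petal-length": 1.4, "class": 1},
        {"petal-length": 1.7, "class": 1},
        {"petal-length": 4.7, "class": 2}]
low, high = rubber_band(rows, "petal-length", class_label=1)
print((low, high))  # (1.4, 1.7)
```

A real session would of course let the user tighten or widen the swept interval by hand rather than taking the exact min and max.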
Figure 2. Table lens visualizing the classification rules.
On the left side of Figure 2, the rule evaluation is displayed as an area whose height is proportional to the rule coverage and whose width is proportional to the rule accuracy [1]. In Figure 2, the first rule has an accuracy of 100%, while the second is about 98%. Rules with accuracy less than 80% are ignored. In addition, the color of the rules can be used to represent the rule quality [3].
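Under the usual definitions (coverage: the fraction of all tuples the condition matches; accuracy: the fraction of matched tuples bearing the rule's class), the evaluation and the 80% cutoff can be sketched as follows; names and data are illustrative:

```python
def evaluate_rule(rows, matches, class_label, min_accuracy=0.80):
    """Coverage: matched tuples / all tuples.
    Accuracy: correctly classified matched tuples / matched tuples.
    Rules below `min_accuracy` are ignored, as in the text."""
    covered = [r for r in rows if matches(r)]
    if not covered:
        return None
    coverage = len(covered) / len(rows)
    accuracy = sum(r["class"] == class_label for r in covered) / len(covered)
    return (coverage, accuracy) if accuracy >= min_accuracy else None

rows = [{"petal-length": 1.4, "class": 1},
        {"petal-length": 1.6, "class": 1},
        {"petal-length": 4.7, "class": 2},
        {"petal-length": 5.1, "class": 2}]
cond = lambda r: r["petal-length"] <= 2.0
print(evaluate_rule(rows, cond, class_label=1))  # (0.5, 1.0)
```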
3. INTERACTIVE CONSTRUCTION OF CLASSIFIERS

Building classifiers is to interactively and iteratively construct classification rules. This process can be described as follows.
Step 1: Visualizing the raw data. By looking into the visualization of the raw dataset, one can perceive the correlations between attributes. Tuples within each class can be sorted by the selected column.
Step 2: Constructing a classification rule.
• Using a "rubber band", one can select an attribute and draw interesting intervals for this attribute. The current attribute, with the interval encompassed by the rubber band, is displayed at the bottom left corner. If the chosen attribute or the current interval is not appropriate, it can be canceled.
• Repeat this step until all conditions for the current rule are specified.
Step 3: Updating the dataset and evaluating the current rule. Once a classification rule is created, its accuracy, coverage, and quality are calculated. The tuples covered by this rule are marked and will not be displayed. The display space saved by the marked tuples is used to visualize the rule. The rule condition is displayed as a set of attribute-interval pairs; see Figure 2.
Step 4: If the current rule evaluation is not satisfactory, one can remove the rule. Thus, the marked tuples are restored.

Step 5: Repeat Steps 2 through 4 until all produced rules are appropriate and all tuples are covered by these rules. The final classifier is composed of these classification rules.
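Steps 2 through 5 amount to a loop that accepts one rule at a time and marks the tuples it covers until none remain. A minimal sketch, with a stand-in function replacing the interactive selection:

```python
def build_classifier(rows, propose_rule):
    """Loop of Steps 2-5: `propose_rule` stands in for the user's
    interactive rule construction; it receives the still-uncovered
    tuples and returns (matches_fn, class_label)."""
    remaining = list(rows)
    classifier = []
    while remaining:
        matches, label = propose_rule(remaining)
        covered = [r for r in remaining if matches(r)]
        if not covered:          # a rule covering nothing would loop forever
            break
        classifier.append((matches, label))
        remaining = [r for r in remaining if not matches(r)]  # mark covered tuples
    return classifier

# A stand-in "user" that thresholds petal-length, as in the IRIS example.
def user(remaining):
    if any(r["class"] == 1 for r in remaining):
        return (lambda r: r["petal-length"] <= 2.0, 1)
    return (lambda r: r["petal-length"] > 2.0, 2)

rows = [{"petal-length": 1.4, "class": 1},
        {"petal-length": 4.7, "class": 2}]
rules = build_classifier(rows, user)
print(len(rules))  # 2
```

In the actual system the loop is driven by the user's rubber-band selections and the evaluation display, not by a fixed function.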
4. EXPERIMENT AND CONCLUSION
We implemented the approach described in this paper with Visual C++ 6.0 on Windows 98, and our experiments using the implementation with UCI datasets [6] included Diabetes, Iris, Monks, Parity5+5, etc. Compared to other approaches for building classifiers, our method has
the following characteristics.
(1) Easy to use: this approach is very easily learned. We trained two non-computer-major students for five minutes; they built classifiers with an average accuracy higher than 90%.
(2) Uncertain classifiers: for the same dataset, the generated classifier is uncertain and user-dependent.
(3) Varying accuracy: the accuracy of the generated classifiers also depends on the user. Moreover, the same user may obtain classifiers of different accuracies in different executions of the system.
(4) Understandable classifiers: generally, most classification rules in the final classifier contain only one or two condition attributes.
(5) On-demand discretization: the continuous attributes are not partitioned in advance; only the needed intervals are specified on the fly. The ranges not included in any rules are not discretized.
(6) Uniform processing of categorical and continuous attributes: categorical attributes are processed in the same way as continuous attributes.

The main problem of this approach is that the display window is always limited; it is impossible for the window to accommodate all the data items of real applications. Thus, data reduction is necessary. Our next work will focus on how to interactively select features and reduce the number of tuples.
REFERENCES
1. T.M. Mitchell, Machine Learning, McGraw-Hill, (1997).
2. G. Nakhaeizadeh and C.C. Taylor, Machine Learning and Statistics: The Interface, John Wiley & Sons, New York, NY, (1997).
3. J. Han and N. Cercone, RuleViz: A model for visualizing knowledge discovery process, In Proc. of KDD-2000, Boston, MA, (August 2000).
4. S.K. Card, J.D. Mackinlay and B. Shneiderman, Readings in Information Visualization: Using Vision to Think, Morgan Kaufmann, San Francisco, CA, (1999).
5. R. Rao and S.K. Card, The table lens: Merging graphical and symbolic representations in an interactive focus and context visualization for tabular information, In Proc. of ACM Conference on Human Factors in Computing Systems, pp. 318-322, New York, NY, (1994).
6. P.M. Murphy and D.W. Aha, UCI Repository of Machine Learning Databases, URL: http://www.ics.uci.edu/~mlearn/MLRepository.html, (1996).