Personalized knowledge organization and access for the web

Personalized knowledge organization and access for the web

Personalized Knowledge Organization and Access for the Web Xia Lin Drexel University Lois Mai Chan University of Kentucky Traditional information st...

1MB Sizes 1 Downloads 38 Views

Personalized Knowledge Organization and Access for the Web Xia Lin

Drexel University Lois Mai Chan

University of Kentucky Traditional information storage and retrieval methods used by library professionals over the last century have much to offer in the digital environment, particularly when they are combined with recent technology. A device, called Knowledge Class, was developed as a framework to integrate information organizing methods and advanced Web technology. Knowledge Class facilitates information organization based on hierarchical structures similar to those used in thesauri and classification schemes. Furthermore, it adds values to the list of hierarchical terms through builtin vocabulary controlled and pre-stored search strategies. It is coupled with an interactive graphical interface with both dynamic and static links to search engines and related Web sites. Knowledge Class was designed to be both an information-organizing device and an information access tool. The design process discussed in this article represents a new thinking on how to respond to the challenges of organizing and accessing the wealth of information on the Web. While the resources available on the World Wide W e b (Web) provide information professionals with enormous opportunities for meeting users' information needs, they also pose great challenges to those attempting to organize and facilitate access to this wealth of information. There are efforts on many fronts to design mechanisms to facilitate information retrieval on the Web. Many different devices for organizing and accessing information resources have been developed, but they have not kept pace with the rapid proliferation of information on the Web. In particular, some of the sophisticated information retrieval theory and devices developed over the years by information professionals have not been retained or fully integrated into mechanisms for knowledge discovery and information retrieval on the Web.

Direct all correspondence to: Xia Lin, Assistant Professor, College of Information Science and Technology, Drexel University, Philadelphia, Pennsylvania 19104 .

Library & Information Science Research, Volume 21, Number 2, pages 153-172. Copyright © 1999 by Elsevier Science Inc. All rights of reproduction in any form reserved. ISSN: 1041-6080 153

154

Lin & Chan

From the very beginning of the Web, bookmarking has been a popular device for keeping track of interesting Web sites. Essential features of bookmarking are that it is personal; it focuses on specific topics; and it is convenient to the user. As the W e b continues to grow, these features are overshadowed by the overwhelming volume of information on the Web, and users began to rely more on those search engines and subject directories or schemes provided by the search engine and directory services for their browsing and navigation needs. These schemes have attempted to organize a broad range of W e b resources, have been designed with no particular users in mind, and are applicable to a wide spectrum of resources on the Web. They are extremely useful for identifying "some" information in general areas but are less effective in locating specific information in well-defined, finite domains. They are analogous to world maps, which are extremely useful for identifying large areas. But, for someone who needs to get around in specific areas such as Washington, D.C., or Lexington, Kentucky, more detailed and specialized maps of smaller areas are needed. Likewise, few users need "all" the information that is available in a library, much less in the vast universe of electronic information. Researchers and serious information seekers generally work in well-defined, finite domains. For many users, tools that are designed and customized for specific purposes or subject domains might be much more helpful. It seems that a promising design pattern for such domain-specific search tools is a flexible and adaptable mechanism by which personalized maps of links to promising resources may be created and maintained for individual users, the maps that can be easily adapted to evolving information needs and changes in the digital environment. In this context, particularly for sophisticated searchers, detailed tools tailored to specific domains may prove valuable. At the same time, for the average, or even the casual, W e b user, there is also a need for improved means of accessing, navigating, and keeping track of remote electronic resources. This article describes a new method for customizing knowledge organization and access. We present a number of ideas, examples, and procedures for implementations on how traditional organizational tools may be adapted for personalized information organization and access on the Web.

ASSUMPTIONS The purpose of our project is to create and test a new method for customizing knowledge organization and access to information on the W e b for individual users, to supplement, and complement existing devices. We are exploring the possibility of combining existing methods of information organization with advanced W e b technology to create an easy-to-use framework for individual W e b users. One underlying assumption for this research is that information storage and retrieval methods used by library professionals over the last century have much

Personalized Knowledge Organization 155

to offer in the digital environment. Such methods include, for example, classification, controlled vocabulary indexing and searching, and sophisticated retrieval strategies. A second assumption is that, because of their distributed nature, digital resources may be organized from bottom up rather than from top down, as is the case for the organizing tools of the print environment. Instead of beginning with the whole universe of knowledge, one may start with the more specialized areas of knowledge, building smaller blocks that may eventually lead to a larger, more comprehensive structure. Third, we strongly agree that "the librarian's classification and selection skills must be complemented by the computer scientist's ability to automate the task of indexing and storing information. Only a synthesis of the differing perspectives brought by both professions will allow this new medium to remain viable" (Lynch, 1997, p. 52). Reworking traditional devices of intellectual organization and combining them with modern technology should provide means of better organization and management of World Wide W e b resources as well as personalized access to these resources.

BACKGROUND Classification lies at the foundation of knowledge. "Classification," claims Arthur Maltby (1975, p.16), is not only the general grouping of things for location or identification purposes; it is also their arrangement in some sort of rational order so that the chief relationships of the things in question may be ascertained . . . . Classification is a key to knowledge: because it is clear that if we arrange things in a definite order and we know what that order is, we have a very good way of, or key to, these things. Ingetraut Dahlberg echoes the same view (1989, p. 12): "Classification can be understood to be a synonym for knowledge organization." Classification has been designed for organizing knowledge in a systematic and logical fashion, because knowledge is gained through association of ideas. A random or alphabetical list of terms does not reveal relationships; whereas a classified list or a list of terms organized hierarchically shows how each term relates to broader and narrower concepts. It displays relationships among concepts, thus providing a conceptual map of subjects and topics. It allows browsing of topics in the context of their relationships. It also provides a means for systematic navigation from broad subjects to their subordinate topics (Chan, 1995). In the short history of W e b searching, there is increasing use of classification structures to impose order on the vast store of information. Increasingly, popular search engines and directory services are employing hierarchical struc-

156

Lin & C h a n

tures to organize Web resources. Examples include the multi-level subject categories used by Yahoo!, AltaVista, and InfoSeek. One of the recommendations from the panel discussion during the closing session at the thirty-sixth Allerton Institute on classification is to "organize the classification schemes differently for the end-user than for the classifier and provide more than one scheme for users to browse and navigate before and after retrieval" (Cochrane, 1995). Controlled vocabularies have traditionally relied on control of synonyms and homonyms to improve recall and precision. It is predicated on human indexing, assigning pre-determined preferred terms from a thesaurus to documents. Lancaster and Warner (1993, p. 100) state the advantages of using controlled vocabulary: "The thesaurus is able to prevent the separation of related material under synonymous terms, distinguish among homographs, and assist the searcher in comprehensively searching a particular subject area." In the Web environment, it is no longer an option for all or even most information to be indexed "by hand" using controlled vocabulary. Yet, it would be regrettable to forego the advantages of controlled vocabulary indexing and searching, particularly in the control of synonyms and homonyms. What we attempt to explore is to exercise this control through search strategies; in other words, at the point of retrieval rather than storage, an approach similar to postcoordination. Figure 1 illustrates the different focuses of synonym and homograph controls. The traditional approach is to use controlled vocabulary to index documents; and the free-text searching approach is to let the user come up with all the synonyms and homographs. Our approach is to provide synonyms and homographs at the interface level. To put it differently, we attempt to shift the burden of synonym and homograph control from indexing to searching FIGURE 1 Different Focuses of Synonym and Homonym Control

S y n o n y m and H o m o n y m Control

I Documents [

Interface

I

I

sers

Personalized Knowledge Organization

157

while at the same time releasing the burden from the user. What we propose is to create an additional layer between the end-user and the search engine by formulating specific search strategies and storing them in an interface for future use. The strategies would be developed by information professionals or based on previous searches, and they would take synonyms and homonyms into consideration. Users could then call up the pre-stored strategies for searching, and so in a sense would have a personalized and automatic SDI service. Although the distinction between the preferred terms and non-preferred terms is important in a manual environment, it is possible, in an electronic environment, to include synonymous terms in a search algorithm. Synonym control is thus shifted from indexing to searching. In online searching, if we include all synonyms defined in a thesaurus in the search statement, we can capture the advantages of controlled vocabularies, without requiring documents to contain pre-assigned descriptors. In a controlled vocabulary, homonym control is usually achieved through the use of qualifiers that provide a context for the term in question; and, typically, a qualifier consists of a broader term or terms from higher levels in the hierarchy in which the term in question fits. This fact may also be helpful in Web searching; precision can be improved by including contextual terms in the search strategy. In other words, one of the things we wish to explore is whether it is feasible to use search strategies to retain some of the advantages of controlled vocabulary indexing through search strategies instead of assigning controlled vocabularies to individual documents. In a widely cited paper on the design of online search interface, Marcia Bates (1989), speaking in the context of library online systems, laid out the desirable features of an online search system: • A browsing capability should allow for random movement across large amounts of text; • The searcher might move among categories of a classification scheme or follow up leads of related terms; • The searcher should be able, with a single command, to call for a search mode and screen; • The interface design should make i t easy to highlight or otherwise flag information and references; and • Users should be able to search automated information stores in ways that are comfortable and familiar to them. At the time this discussion took place, the Web did not even exist. Nonetheless, the desirable features Bates envisioned for online bibliographic systems are highly relevant and quite applicable in the context of the Web and, with current technology, perhaps even feasible now. We are designing and testing a Web interface, a combination of controlled vocabulary and flee-text search mechanisms, which would enable instant retrieval of documents on the Web as well as creating and maintaining dynamic and static links to remote resources. The interface should be flexible enough

158

Lin & Chan

that users will be able to tailor the scope and the depth of terms stored and the links created; in other words, the branches in the hierarchy should be contractible and expandable as needed. Such an interface should be useful to information professionals who need to design customized digital libraries focusing on specific subject domains for their clients. A n d it should be helpful to individuals who wish to develop personalized digital libraries.

KNOWLEDGE CLASS We propose a device called "Knowledge Class, ''1 which serves as a conceptual building block for Web access. It consists of an information organizing framework and a graphical interface, with both dynamic and static, or soft and hard, links to the Web. It contains a hierarchically structured list of terms on a specific topic or on a particular discipline. It is designed to accommodate individual research needs or personal interests. When fully developed, it is hoped that Knowledge Class will be a useful tool to help individual searchers retrieve, organize, "store," and manage electronic resources for future use. Functions Knowledge Class is designed to provide the following functions: • To organize concepts and terms on a specific subject or topic into a logical structure showing subject relationships; • To facilitate browsing of subject terms and their relationships; • To store useful search terms and strategies and keep them for future use; • To allow the addition of synonyms for better recall and qualifiers to resolve ambiguities or distinguish among homonyms; • To initiate searches using pre-stored terms and strategies in a chosen search engine; and • To store URLs of specific sites for future use. In other words, Knowledge Class facilitates the organization and use of subject terms and search results of these terms. In online retrieval, a great deal of emphasis has been placed on retrieval results, and rightly so. But, after retrieval, there is also the need for organizing related information, "storing" it for future use and re-use by providing the ability to re-visit the sites and, equally important, to retrace the steps for finding the information in the first place.

1 In this article, we use the term "Knowledge Class" to represent the concept and "knowledge class or classes" to represent instances of the proposed device.

Personalized Knowledge Organization

159

Design Objectives As already noted, we begin with the assumption that reworking traditional devices of intellectual organization and combining them with m o d e m technology should lead to improved tools for knowledge organization and management. Specifically, we are referring to (1) the ability to organize knowledge in a systematic and logical fashion to facilitate searching and browsing, (2) control of synonyms and homographs typical of a controlled vocabulary in order to improve recall and precision, and (3) devising search strategies to optimize retrieval results.

DESIGN OF K N O W L E D G E CLASS Knowledge Class contains two basic components: an organizing framework and an interface for access and retrieval. A knowledge class brings related information together in a classified structure in order to facilitate the interaction between the searcher and the information. When users select a term in a knowledge class, the context of the term is also displayed. They will be able to perceive semantically related terms through the hierarchy, and they will be able to access other related terms through cross-references. Thus, when a user finds a knowledge class of interest, he or she can have immediately access not only to the specific information being sought, but also to other related information resources through either search engines or specific links stored in the knowledge class. In short, Knowledge Class provides a dynamic access for information on the Web. An example screen of a knowledge class is shown in Figure 5 later. Major components of Knowledge Class are described in detail in the following subsections.

Organizing Framework The organizing framework serves as the conceptual building block of the Knowledge Class. Basically, it is a classified mini-thesaurus, consisting of a hierarchically structured collection of terms on a specific topic or a particular discipline, such as investment, the solar system, information retrieval, high-school chemistry and physics, etc. The terms may be those gathered from existing thesauri or natural language terms based on one's own knowledge or garnered from previous searches. The hierarchical structure may be a branch from an existing classification scheme, or built from bottom up by categorizing a collection of terms. The depth of the classification, or the number of levels in the hierarchy, is adjustable according to the needs of the individual user. Likewise, the citation order, i.e., the order or arrangement of the facets in the hierarchical structure, is flexible in each instance of Knowledge Class. The emphasis of this framework is on the structure of information and the semantic relationships among terms, topics, branches of subject areas, etc.

160 Lin & Chan

Data Structure As in the case of constructing a thesaurus, the structure of a knowledge class should be carefully designed and constructed to reflect semantic relationships of subject terms. Furthermore, because knowledge classes are to be used on the Web, the terms and the structure of a knowledge class should also facilitate the use of Web searching engines. In our initial design, knowledge class was simply a mini-thesaurus plus Web search engines. Subject terms were hierarchically constructed. When the hierarchical terms were loaded into the framework, clicking any of the terms would evoke a selected search engine to search for information on the clicked term. This was what we called "what you clicked is what you get." Through several iteration of testing, it soon becomes clear that "what you clicked is-what you get" was not always desirable. To be easily understood, display terms should be simple, concise, and intelligible within the hierarchy. Once taken out of the hierarchy, a term may not always carry the context within which it is used; and, as a result, it may not be a very effective search term. Many terms need qualifiers to clarify their context or the concepts they represent. Other terms are only meaningful when combined with broader terms in the upper levels of the hierarchy. Still others, though useful as "display terms," are not effective "search terms." Considering all these, we re-designed a flexible structure that includes five parts for each entry in the knowledge class (see Figure 2). Following is an example of entries in the knowledge class on Investment: -, Collectible,,, 1 - , Antiques,,, 1 - - , Automobiles,classic cars automobiles,,6 FIGURE 2 Components for Each Entry in a Knowledge Class

Data components

Description

Indention Number of dashes "-"indicating the number of levels above the entry in the hierarchy Display term The term that is displayed in the hierarchy Search term The corresponding term that is used for searching only URL Any URLs that may have been collected for the entry Code of search A number indicating how the search term strate~ should be constructed (see next section)

Default

If null, the display term will be used as the search term If null, no direct links are provided

Personalized Knowledge Organization 161

- , Arts,,, 1 - Coins,,,6 - Gems,,,6 - , Precious metals,,,4 - , Stamps,philately,http://www.philatelists.com/, 11

In this example, the entry "Collectible" is one level higher than other entries in the hierarchy except entry "Automobiles," which is under entry "Antiques." When clicked on, the term "Collectible" will be searched by search strategy 1 (see Figure 4 for the meaning of the codes), i. e., "Collectible investment." In this case, "investment" serves as the scope term for the entry "Collectible." (See the next subsection for a discussion on search strategies). Similarly, clicking on the entry "Antiques" will initiate a search with a query combining the display term "Antiques" with the contextual word "Investment," while clicking on the entry "Automobiles" will launch a hierarchical search for "classic (cars or Automobiles) (as) antiques (and) collectible investment." Likewise, the entry "Stamps" stores a synonym "philately;" its "soft link" is a search by the query "(stamps or philately) and investment;" it also has a hard link to http://www.philatelists.com. The first advantage of this structure is the distinction between display terms and search terms. By default, the display term may also be the search term. However, when a display term is not useful or is inadequate as a search term, the search term may be completely different from the display term, thus giving the designer the flexibility to accurately define a search strategy for a display term. For example, several synonyms can be used together as search terms to achieve a degree of synonym control, like "stamps" and "philately" in the above example. Furthermore, by allowing search terms to be different from display terms, a knowledge class is able to accommodate linguistic variations such as different languages or different dialects of the same language. In a particular knowledge class, the display terms may be in a language different from the search terms, as demonstrated in Figure 3, which shows a bilingual knowledge class in English and Welsh. Either language may be used in the display and the user may select search terms in the other language. The second advantage of the data structure is that it includes both "hard links" and "soft links." The search terms connected to the W e b through search engines may be considered as soft links because they are dynamic and flexible. Specified U R L s are considered hard links because they are specific and fixed to a static target. Through soft links, users may collect particularly useful hard links and store them with the entry for future use. If the hard links break later (which happens often in the W e b environment), soft links can be re-activated to identify new hard links. This is one of the advantages Knowledge Class has over the bookmarking approach.

~t,~u~,.,,ur~.~'v~ :

C: Culture:...

..

+

'. :: +~ :.:: " ~:-R~;::~:::

:

:::. _:. _ : : ~ .

•: " :.: :... :..i:: . : :-.:.

:.~-~d,~c

.: :::~:i-i:i: -:..:~:~"::

: .. :; .: ~ :

"i, ~ " " :.. • .: :~ " - " " " :: . . . . . F. ~ :F :][~m'. -... : . . . . .

, .~dKfenUlRec~4

. P ReJh[iOn

.... ::

::

:~S~,=ru . . . . . : :i: i :.~. ~.:lq,hb,= '

I, Fo~B)~H

"

" " "" .

,~.~ •~ S . g ~

::. ? " " :

... :

:l-/,~v,~teo~vebi ~ ':

:. c,::mwymun~:;":

I

~ i .'..~ . . . . l:i, l l d t h

"

.. . . . . . . . . .

l.:

]:v:'1~bO~.:"

:!~: .:~:

::.:: :..!:;:: :.::::." : : : . ¥ . : T ~

.!~ii:::: : i i : : ::!:.:i ::i":~ .: ~ : ~

3. H o m e A n d A w = y - W a l e s Sparta, Cricket, RuCby, Foothill. Slilina i [URL: www,homeandaway.¢om/1,Vale'-=/Waies$ p0rt s, html] i Wellss, including lravel,tourism,news.sports.we8 her.Royellty i and more cazeoones of Welsh inks, Last inedited 31-Dec-97 • ;)agl size 13K - in Engli=h I Translat. ] i 4. Total N e t w o r k S o l u t l o n l Football C l u b . L N m u ~ 9 f i W-leas Freer= Pm"e i [URL: v~w.t n=.co.ul<.,1ooti~sil,lm~ ct=.html] i Facts & Figures From 199~97 Season - Leelgue of Wales. ! Most goals in season Least goals in season Least goells "conceded Most goels conceded Best goal... Last modified 12-Sepg7 - page size 4K- in En~ish [ Translate }

,...u.~'. I : : ~ .

:

:::::.: :'] ,::::-,:::'~

: :~ii :.~ :. ? : .:i:'.!!i ::: ~:":: ~

..

.

ii:careets.:

."~-*"'-"'--"::.: :.::-"~'v~"ii ~

i E~ortair~.em -: ::: : i i~l

iWestgateStreetCardiff CFI IDD

! i!i

" :.: :?-.~J"Fects&FiguresFrom1996/97Seelson-Leagueol'Welles.

J:::: : : :.::. :.".:.~ i Most goals in seelson Least goals in seelson Least goals i ~ . ' . ~ : ' ~ ~ i conceded Most goals conceded Best goa[., ~.w;~?:.y:~~r~ ~Last edihdt2-Sep-9?'-pagaelze4K-inEnghsh[Translate I :.i t stel(Web) ~ ! : ~ ~6. H o m e A n d A w a y - W a l e = S p o r t = : • " i: ::: - | )Football. Salllno .~. . . .: ~ .~- -$-.--~ .~. . . . .: . . ~ ~ ~ ~ . . . ' ~ . . . ~ <

CdckKRugby ~

;

~

ii~]il '' : " :~' ~ [ ~:"

ii~Inance::: . . : . . : : ~I:19_0~,:-, .::,

L=stmodiled27"Mar'98"pigesizeIK-inEn91ish[Tmnnlste

.:~

~ii]

:

:

~ r ~ e ~ . ~ i i N ~ , ~ 'ABc

" :

:

ii~_.~.'~'~'-'~':~,~i ~i:~',:-:~;~~: i: ~-eorRel ~ •~u~ •~om~t!

i[Um.: w~w.clyp~ae.©o.ul(~m/com~t.~ml HowtoCo,to,:tlheFootloel AssocolionofWelles ByPost

:• : ' : . : J " 4 - T o t l l N e t w o r k a o l u t l o n ' F o o t b a l l C l u b - L e l s t l U e O f ' :,, : -i-::: ::::::~ ~ ~ . i:: : . ::.: :~] ![Uiil.:v~w.lnS.¢o.ul(~ootbldL4ow~icte, htrrd] .

AltelVist8 Disc0ve~y :Browse Cateoorie~ .::Create el.Card . . . . . i:Find a Business : i !~Find el Person :- :i:. i i:.F,ree:E~ai,l -; ii:M e p s & DJtedions ::PhololoR.com "i :Translation

: : : : ~ " , "':. . . . ." . :. . :. :" .

?::: :-::: :] ~Foe=.,, ~.o~,=on o~W ~ < ~ : . P ~

l

:::Travel

~iOurSearc~Network: i 6. H I I t o r v o f the Football Aselocil~lonl~i'Wlles [URI.: w~w.cilypages.co.uk/faw/contact.htm] Sear(;hin Ct, n e s e • i: i S~archiri :: H o w t o Cont~ct~e Foolbelll A s s o c •, ~ o- n of Wales. By Post. ¢..= . . . . . . .Japanese ... Football Association of Weles Offices, Plymouth Chembers, 3 ii-~eercn.- m_Koreen . :.i iWest~ateSb'ee(,Cerdiff.CF1 1DD... ii. . . . ~ .i

:::.::| 2 " U X F ° ° t b l l l t P N I e e ' W s l e e - ..,~ •:I IIRL: ,,w,w• ukfoolblili P ' ~ ' i , com/wili ~ 4

,:I

i: (;;~rei~J. • : . . . . : : ::!Enter~linrnerlt :: Finance :.: :..: i ~:: i..Heialt~l !: Ne*f~ bv.ABC

2. UK Fo(~belll PIIl~el. Wlleel [URL: w*w.ulkS~Ibal~agla,com,'wili~ UK Foott)al Pages - Welles ~ modHied1 6 - M ~ - page size 5K- in English [ Trannlat~ ]

! : : : . ..: " ': :: ~ i3. Hleltorv of the Football Anoclatlon orWadn

I:. ~ :

.:::.:.::.G.:M~s:.I::.~

~:`~`~s:~:~..-...:~;:`~:`~;;°;~:~`~..~`~i::::;~F~:~L:~;~.~.i°~.~

il. SoccerWMes.RhylFootbMIClubWWWPagelll "['I ~[UlIL:mlmlbars.eoLcomnhylc/l :~ !List modilied2O-Jun-~e-pagesize3geb~tes-inEnglieh Translate]

J:: * C ~ e ~ o e d ~ : ~ I. : w])ei_:diroed "

:

~: i:::!:;~;: ::::::: ':: : !~i~

:: ::. ~ i Lestrn°ditledtS"Mm'ge'pagesizeSK'inEnglish[Trandat.]

I:: : * : ] ~ ' i ~

J:.

.:;:!:..: .~ :;:D. P0~

::i::: -:::::! : z:;n~,~:

;": i':cym~i

: •-:..: : : ..: .:

I::.,.,~:,,.-....=:, • "~,..vaauw../uu~is...

.

::;:i :: ii. :: :i ::::; :~:! ~:~-:i~:ii.. :!

: . - : . i:]

:

I :" : . :I , . li::: : |:~- : "

r. . . m l l m X l i l i ~

'.~:i~ :.::.:;:;;;:i.::ii::"::~!:!• :i:i:"::: :i.. ::::::-::~::i!i. ;:::i .Wal~$::.

" i i~ ~

: ii i

i:~il

AltaVistaDiSc0,~rv:.~ ii!Br0~F~eCate~lories . i i~Crea~e~i.Card :- i ~iFindelBusiness ~ ii:RrtdePemon iil~ree'Eirnai ~ ~:Maos&Directlons .:~

i!l ~-~I ~:-i~

"" :. :' -~ .... • : , q

: :

"

~i|

~.~i~ i~i~ ~!]i| ~!~iJ

~Tran~l~ti~n: : ": ~ilil ::-"-T:--:. : ):~ :..... ~ ....... .. . . . . . . . . . . . ;....... ;,~....~.~..

Personalized Knowledge Organization 163

FIGURE 3 The Same Knowledge Class in Two Languages, English and Welsh (When the term "Football" is selected, the search results are similar.)

T h e third a d v a n t a g e of the data structure is the flexibility of allowing the c o n s t r u c t i o n of different search strategies for different concepts. F o r each display term, K n o w l e d g e Class can automatically g e n e r a t e a search strategy containing o n e or m o r e c o n t e x t u a l terms; it also allows the user to construct cust o m i z e d strategies if p r e f e r r e d . Search Strategies O n e of the main features of K n o w l e d g e Class is its dynamics. Because of the W e b ' s instant accessibility, n o user will be satisfied with static and isolated thesauri. T h e subject terms must be linked directly to the W e b resources. W h e n the user selects a display term, he or she usually brings along a certain perspective a b o u t the term, which is o f t e n r e f l e c t e d by o n e or m o r e c o n t e x t u a l terms in the u p p e r levels of the hierarchy. W h e n the c o m p u t e r a t t e m p t s to search for inf o r m a t i o n r e l a t e d to the selected term, this b r o a d c o n t e x t should be t a k e n into consideration. This c o n t e x t can be built into the specific q u e r y construction. H o w e v e r , without o v e r l o a d i n g the construction of a k n o w l e d g e class, q u e r y c o n s t r u c t i o n must be simplified. T h r o u g h several steps of iterative testing, we c o m e up with a q u e r y coding system that divides queries into t h r e e search types, and each type has t h r e e variants (see Figure 4). A l t h o u g h the coding

FIGURE 4 Coding System for Search Strategies Terms used to search Coding

Search Type search term

0

key word search

***

1

***

2

***

3

Phrase search

4

***

***

*** ***

Hierarchical search

'7

*** ***

8

9 10 11 12

Branch Scope term

***

5

6

KClass scope term

***

No search Key word search

No link for this display term: Label only same as 0 except with display term also added to the query same as 1 except with display term also added to the query

164

Lin &Chan

scheme, based on the capabilities of current Web search engines, is simple, it provides the needed functionality to facilitate query construction. The three search types are keyword search, phrase search, and hierarchical search. A keyword search acts on all the words in the query individually; a phrase search takes each term in the query as a phrase during the search; the hierarchical search automatically adds one or more of the upper hierarchical terms to the selected term in the query. Within each search type, three variances can be applied, depending on whether to use a scope term of a knowledge class or a scope term of a branch within the knowledge class, or none of them. These scope terms are similar to qualifiers in a controlled vocabulary, which help define the scope of the terms in their respective categories. In the above example, when the user selects the term "Horses," the user is obviously interested in horses as tangible investments, rather than other aspects of horses, since this is a knowledge class on Investment. The query, "horses (as) collectible or tangible investment," reflects this interest. Similarly, in Figure 3, when the display term "football" is selected, the search query being executed will be "Welsh football." Because the scope term "Welsh" is added, search results are more likely about soccer rather than American football. Incidentally, this also solves some of the problems associated with polysemous terms.

INTERFACE DESIGN In order for the framework to be of use in the Web environment, we developed a mechanism capable of linking the structure and the component terms to Web resources. The mechanism serves as an interactive interface between the user and the terms in the organized framework. It provides a relatively stable environment where the user can view previously stored terms, links, and their explicit relationships systematically. The mechanism is also an interface between the user and the Web resources. The user can initiate searches by selecting prestored terms or search strategies or connect to specific sites previously discovered by clicking on special icons. In the authoring mode, the mechanism also provides functions to enable the addition of controlled vocabulary or free-text terms to the hierarchical framework, the collection of and storage of new links found during the retrieval process, and the expansion or modification of the hierarchical structure to accommodate additional levels and branches. A graphical interface in Java/JavaScript has been developed for Knowledge Class. The interface is designed with the following capabilities: • Display of terms in a hierarchical structure representing concepts on various levels. Two levels are always visible in the display: the top level and the branch that the user selects;

Personalized Knowledge Organization 165





• •



Expandable and contractible branches of the hierarchy to allow viewing of terms in varying levels of depth; Different search strategies automatically attached to each term based on given data. The search strategies can be keyword, phrase, or hierarchical searches, with different variances; Storage of selected static links to remote resources and related sites or pages, i.e., bookmarking of specific sites associated with the specific terms; Dynamic links to multiple search engines and directories, such as A1taVista, InfoSeek, Yahoo!, and Lycos, to allow selection of specific target engines and instant switch of search engines from one to another; and Referral links among terms within a knowledge class and potentially among knowledge classes to assist cross-referencing.

Figure 5 shows a screen dump of the knowledge class, "Investment." In this display, four types of information are shown in the four frames. The top right frame provides all six branches of this knowledge class. The top left frame

FIGURE 5 A Screen Dump of the Knowledge Class Viewer

• ~'r..m ...__ -..: • • C CO o ilnns5 i ... 4, .Oelm " .~ Precious



"I1"

: ~ ~ -- -

~

" ~ ~

"• ~ta.l.~. i/~

.

.

~

: ............ :~"~l,~.~.~'~I~,;:,~,~: ,| :i .nJeL: ulJq~diiklor~ihlll¢~Do~de<Jduf~ : : : :| ~Dufly'e Collectible Car Specialist (Deals on Wheels) The Wodd's Favorite ~ ::. :| i Collector Car Storm Reeliaticelly price restored collecfiblea. Trades ~ :| LMitmodl/le417.jui.4~.plges~e36K.lhEnglish~'Tlardlat e ~Pto~41 .

" | i a~. ~ |

~

i ~

LOIIICDQfl V h l l a a l

i.lln£

~'Cer88re ""

• i ~i ~The Otto Collection has • n "ha'de showroom inventory of :: E~rdadammenl i old,claesic.'~ntage, end antique collectible automobiles, trucks and street i F nan~e : • ~"*'i i rode located in i Heaft~ . . . .

• i|~

.

.

.

.

.

.

.

"

. . . . . . . .

.....

: : :: !

::::::::: :~:.~.~:

~::e:~ i!ili:!

~ ~!!~'! ! i ::!:~! i ~

'

-

166

Lin &Chan

shows details of the selected branch, Types of investment. Triangular icons before the terms indicate if the terms can be expanded or contracted when the icons are clicked. Global icons, as those shown after the term "Foreign exchange" and the term "Mortgages," indicate there are hard links to resources related to the terms. Clicking on an icon will lead directly to a specific Web site. Clicking on the term itself will activate a search. The arrow icons after terms "Mutual funds" and "Stocks" indicate there are cross-references to and from these topics within this knowledge class. Clicking on the arrow icons will lead directly to the sections that these cross-references point to. The largest frame is used to display the retrieval result, in this case, through the selected search engine, AltaVista, after the term "Horses" of the sub-branch "Collectibles" is selected. The actual query used for the retrieval is "horses collectible investment," which is automatically constructed based on the hierarchy. If there is a specifically constructed search query stored (transparent to the user) under this term, the search engine would use the pre-stored search query instead of the automatically constructed query. The small bottom left frame shows the currently selected search engine and all available search engines for this knowledge class. The user can interact with this display in several ways. On the term display level, if the user wants to switch to another branch, he or she can simply select that branch in the top right frame, and the selected branch will then be displayed in the top left frame. The user can also follow the cross-reference links to other branches. If he or she would like to see more details of a particular level, a sub-branch or contract other sub-branches can be expanded when needed. In the retrieval mode, the user can select different search engines (in the bottom left frame) and use the same query to access a variety of resources covered by different search engines. He or she can also select other related terms to explore relevant information. Finally, the user can simply construct a new query with a combination of terms displayed in the term list and terms discovered in the retrieved results. If a specific site is considered particularly interesting, it can be stored as a hard link under the selected term for future use. The Global icon after the term indicates the existence of a hard link.

DISCUSSION There are many challenges to organizing Web resources for personal use. In this project, we focus on three problems in particular: (1) how much manual process and how much automatic process should be involved during the organizing process? (2) Which of the two approaches, the top-down or the bottom-up, should be applied to Web information organizing? (3) How to create a relatively stable structured information space in the dynamic, chaotic, and unstructured Web environment.'? In this section, we will discuss Knowledge Class with respect to these three questions and compare it to some other existing approaches to Web information organization.

Personalized Knowledge Organization

167

Manual vs. A u t o m a t i c Process

From the early stage of the World Wide Web, bookmarking has been a popular device for users to keep track of interesting sites (Pitkow & Recker, 1995). While there are many efforts to make bookmarking easier (e.g., East Gate, 1996; Netscape, 1996), it is still largely a manual process - it is the user who initiates every action of information collecting, organizing, and updating. Many intellectual efforts are involved in creating and maintaining bookmarking pages, either for personal use, or for collecting resources on selected topics. These efforts render bookmarking a useful tool for W e b information organizing. However, its use is also limited because it is labor intensive and virtually static. Consequently, it is difficult to keep bookmark pages up-to-date in the dynamic W e b environment. On the other end of the spectrum are software agents that attempt to sense and act on behalf of the user in collecting, retrieving, and organizing information from the W e b (Maes, 1994). Software agents establish user profiles and activate sophisticated AI techniques in searching and filtering W e b information. They attempt to "learn" users' information needs in order to be "competent" and "trusted" when providing automatic and personalized assistance to users (Maes, 1995). While the software agent approach represents an ideal use of personalized computer assistance in the W e b environment, it faces challenges both technologically and philosophically (Wooldridge & Jennings, 1998). A wellknown example from the library and information science community is the representation of user's information needs. User's information needs are "Anomalous States of Knowledge" that might be difficult to describe even by the person himself or herself (Belkin, Oddy, & Brooks, 1982), let alone to be "learned" by software agents. When the agents cannot represent the user's information needs adequately, it would be difficult for them to find all the information that meets the user's information needs. With Knowledge Class, an attempt is made to strike a balance between manual and automatic processes. We try to shift the burden of information organizing from the user by having the information professional provide the knowledge structure based on information organizing principles. The structure can be modified, based on the individual user's interests or information needs. The process of information access becomes transparent to the user through built-in search strategies. In this regard, our approach is similar to the original idea of Verity's T O P I C system (Tong, Applebaum, Askman, & Cunningham, 1987). The user of a knowledge class often does not need to have specific information such as the U R L of the chosen search engine or expertise in query construction. With the built-in search strategies, all that is required from the user is to click on a term to bring new information to the screen. Of course, sophisticated users will always have the option to construct their own queries while maintaining the ability to view and use the terms in the hierarchical list. To minimize manual effort, Knowledge Class makes use of available search

168

Lin & C h a n

engines wherever possible. Because of the connection to search engines, the resources that a knowledge class "contains" is not limited by what has been collected - if there are new resources related to a term, they can be retrieved by simply clicking on the term. This dynamic connection makes links in Knowledge Class less volatile than those stored in bookmarks. If a hard link is broken, the soft link can be activated to either trace the U R L of the broken link or locate the replacement U R L . This feature ensures that efforts put in by information professionals have longer lasting usefulness. Currently, search strategies in Knowledge Class are not yet as sophisticated as those used in software agents. As the agent technology becomes matured, it can be utilized to improve the query construction in Knowledge Class to make the knowledge classes more "intelligent." Top-down vs. Bottom-up Approach Knowledge Class represents a bottom-up approach to W e b information organizing. To be successful, many knowledge classes need to be developed at the grass-root level on many specific topics or well-defined subject domains. As the number of knowledge class increases, efforts can be made to build knowledge classes on top of other knowledge classes. These knowledge classes then can become building blocks for a "conceptual infrastructure reflecting the way knowledge and information are organized" (Soergel, 1996). This approach is different from other top-down approaches reflected in many existing classification approaches, e.g., CyberStacks (McKiernan, 1995), and CyberDewey (Mundie, 1995). In the classification approach, an established classification scheme (such as the Library of Congress classification system, or Dewey classification system) is adopted or adapted to facilitate browsing and identification of relevant Web resources in a variety of scientific and technological disciplines. Because the classification system provides not only comprehensive subject coverage, but also conceptual relationships of the subjects, such organization contributes a new dimension for users to browse Web resources. Users can begin with a general outline of main categories of the classification, and continuously select sub-class or sub-category until desired resources are reached. The challenge for this kind of top-town approach, however, is its broad scope and its depth. While the classification is comprehensive, linking every branch of the classification to Web resources, mostly manually, is often considered an impractical task; to maintain such linkage in the dynamics of the W e b space is even more difficult. Thus, such an approach is often only partially complete, leaving many "empty" or unfilled branches or classes. On the other hand, because the hierarchy of a comprehensive classification system must be many levels deep, the user often has to travel many branches in order to reach specific concepts being sought. The need to traverse deeply into a hierarchy has often discouraged many users to reach their desired resources. The experience of going deep and finding "empty" branches only creates and aggravates frustration for the user.

Personalized Knowledge Organization 169

Another problem for the top-down classification approach is what is the basic unit on which classification should be applied. In libraries, the book, not documents in the book, is the basic unit for cataloging and classification. On the Web, because there is not a universally accepted concept of " W e b book," classification has been applied to many different units; some are on W e b sites or homepages, some on collections (resources), and others on every single W e b page. What we need to have is some standards of useful basic units. The basic unit should "bind" together individual W e b pages that reflect a theme or the author's composition. It should be relatively stable and in good quality. In this regard, Knowledge Class may be considered as one step toward setting up a "basic unit" of the Web. Until we establish some "basic units," the top-down classification approach would not be practical and useful. Structured vs. Non-structured Information Space Knowledge Class provides a stable information space while maintaining dynamic links to adapt to rapid changes in the digital environment. Users of a knowledge class interact with a familiar structure every time they use the knowledge class. This structure reflects how topics are related to each other, and helps users recognize paths that they previously used to access W e b resources of their interest. If they wish for a re-visit, they can "walk" through the same path to a desired W e b site without memorizing the exact U R L or the queries and search engines that they used before. Users will also be able to customize the familiar space to reflect their information needs. They can store new links associated with existing terms, add new terms to the knowledge class by drag-and-drop, and expand the hierarchy to accommodate additional levels. As their interest and their understanding of the topic grow, they will find that their knowledge classes also "grow" with them. Since the W e b is a large, complex, and chaotic information space, imposing a global structure to the entire W e b is an enormous challenge. Instead, many researchers have attempted to define structures for local semantic units of the information space such as "information islands" Waterworth (1996), site maps (Kahn, 1996), and "trees and forests" (Pirolli, Pitkow, & Rao, 1996). One of the purposes of using these structures is to establish some kind of locality on the W e b to assist browsing and navigation. To establish locality, thematic organization, structural cues, and explicit semantic relations are considered essential (Abrams, 1997). Knowledge Class attempts to accomplish all three. A knowledge class organizes terms and concepts of a specific topic, rather than only the links or Web pages on that topic. This is because concepts and their associations are relatively stable, while the links to and among W e b resources are dynamic and volatile. In a knowledge class, the organized concepts are tightly coupled with W e b resources through search engines. Furthermore, by creating the structure based on information organizing principles, information professionals can help provide a coherent and logical structure for basic information units on the Web. It is reasonable

170

Lin & C h a n

to expect that such a structure will be more useful than the structure of bookmarks that users create ad-hoc. One may also expect that, if a searcher finds a knowledge class on a topic of interest, he or she will develop a trust in the class and eventually recognize such a knowledge class as a "landmark" for the topic on the otherwise unorganized, non-structured Web information space.

CONCLUSION This project attempts to devise a new method of knowledge organization in the distributed environment. It inquires how traditional concepts, methods, and organizing tools may be applied to the digital environment. It attempts to combine the skills of the librarian and advanced technology to help organize the anarchy of W e b information resources. Knowledge Class is designed to be both an information-organizing device and an information access tool. Individually, each knowledge class is an organized entity dynamically linked to Web information resources. Collectively, instances of knowledge class can be assembled through classification hierarchies to form a conceptual infrastructure for organizing and managing W e b resources. The concept of Knowledge Class is built on traditional devices of intellectual organization and access to information, rather than attempting to replace them. Classification principles are adopted and existing classification schemes are used to serve as starting points or foundations to build knowledge classes. Existing thesauri also provide rich sources of display terms and search terms. The formulation of search strategies is based on principles and practice of online searching, representing cumulative wisdom and experiences of information professionals. Finally, the interface is built with recent W e b technology and based on human-computer interaction principles. It is hoped that this research project will have both theoretical and practical values. Examples of knowledge classes are provided at http://lislin.gws.uky.edu/kc. They are based on a working prototype developed in NetScape's javaScript. More development work is needed to fully realize the potentials of Knowledge Class as a dynamic W e b information organizing and access tool. Currently, a Java version of Knowledge Class is under development. We also plan to conduct user studies on how subject specialists will use and interact with knowledge classes we provide. We hope that these "real users" will provide valuable feedback about the concept of "Knowledge Class" we are developing.

REFERENCES Abrams, David. (1997). H u m a n factors of personal W e b information spaces. Unpublished master's thesis, University of Toronto, Computer Science Department, Canada.

Personalized Knowledge Organization

171

Bates, Marcia J. (1989). Design of browsing and berrypicking techniques for the online search interface. Online Review, 13, 407-424. Belkin, Nicholas J., Oddy, Robert N., & Brooks, H. M. (1982). ASK for information retrieval: Part I, background and theory. Journal of Documentation, 38(2), 61-71. Chan, Lois Mai. (1995). Classification, present and future. Cataloging & Classification Quarterly, 21(2), 5-17. Cochrane, Pauline A. (1995). New roles for classification in libraries and information networks. Cataloging & Classification Quarterly, 21(2), 3-4. Dahlberg, Ingetraut. (1989). Concept and definition theory. Classification Theory in the Computer Age: Conversations Across the Disciplines; Proceedings from the Conference (pp. 12-24). Albany, NY. East Gate Software. (1996). "Web Squirrel" [Available: http://www.eastgate.com/squirrel/]. Kahn, Paul. (1996). Mapping Web sites. [available: http://www.dynamicdiagrams,com/seminars/mapping/maptoc.htm]. Lancaster, F. Wilfrid, & Warner, Amy J. (1993). Information retrieval today. Arlington, VA: Information Resources Press. Lynch, Clifford. (1997). Searching the Internet. Scientific American, 276(3), 52-56. Maes, Pattie (1994). Agents that reduce work and information overload. Communications of the ACM, 37(7), 31-40. Maltby, Arthur. (1975). Sayers' manual of classification for librarians. (5th ed.). London: Andr6 Deutsch. McKiernan, Gerry (1995, December). CyberStacks(sm): A 'Library-organized' virtual science and technology reference collection. In D-Lib Magazine. [Available: http://www .dlib.org/dlib/december95/briefings/12cyber.html]. Mundie, David A. (1995). CyberDewey: A catalogue for the World Wide Web. Pittsburgh, PA: Polymath Systems [Available: http://ivory.lm.com/ -mundie/DDHC/DDH.html]. Netscape Communications (1996). "SmartMarks" [Available: http://home. netscape.com/comprod/smartmarks.html]. Pirolli, Peter, Pitkow, James, & Card, Ramana. (1996). Silk from a sow's ear: Extracting usable structures from the Web, in CHI '96, A C M Conference on Human Factors in Computing Systems (pp. 118-125). New York: ACM. Soergel, Dagobert. (1996). Proposal for an open, multifunctional, multilingual system for integrated access to knowledge base about concepts and terminology. Proceedings of the 4th International ISKO conference (pp. 165-173). Frankfurt: Indeks Verlag. Tong, Richard M., Appelbaum, Lee A., Askman, Victor N., & Cunningham, James F. (1987). Conceptual information retrieval using RUBRIC. Proceedings of the 10th A CM SIGIR Conference on Research and Development in Information Retrieval. (pp. 247-253). New York: ACM. Waterworth, John A. (1996). A pattern of islands: exploring public information

172

Lin & Chan

space in a private vehicle. In: P. Brusilovsky, P. Kommers and N. Streitz (eds.): Multimedia, hypermedia and virtual reality: Models, systems, and applications (pp. 266-279). Berlin: Springer-Verlag. Wooldridge, Michael, & Jennings, Nicholas R. (1998). Pitfalls of agent-oriented development. Proceedings of the 2 nd Conference on Autonomous Agents (Agents'98), May, 1998. [Available: http://www.cs.umbc.edu/agents/introduction/paop.pdf].