ICE – Intelligent Clustering Engine: A clustering gadget for Google Desktop


Expert Systems with Applications 39 (2012) 9524–9533

Contents lists available at SciVerse ScienceDirect

Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa

Lando M. di Carlantonio a,⇑, Bruno A. Osiek a, Geraldo B. Xexéo a, Rosa Maria E.M. da Costa b

a Universidade Federal do Rio de Janeiro – UFRJ, COPPE – Programa de Engenharia de Sistemas e Computação, Cidade Universitária, CT, H–319, CEP 21941-972 Rio de Janeiro, RJ, Brazil

b Universidade do Estado do Rio de Janeiro – UERJ, IME – Departamento de Informática e Ciência da Computação, Rua São Francisco Xavier, 524, B-6o, CEP 20550-013 Rio de Janeiro, RJ, Brazil

Article info

Keywords: Information retrieval; Text mining; Document clustering; Genetic Algorithms

Abstract

In light of the increased capacity and lower prices of computer hard drives, a new universe to be explored emerges: the microcosm of personal files. Although search and information retrieval techniques are already widely used on the Internet, their application on personal computers is still incipient. This paper describes a new tool for document clustering on the desktop, whose effectiveness in obtaining groups of similar documents is evidenced by the experimental results. © 2012 Elsevier Ltd. All rights reserved.

1. Introduction

Despite the increasing amount of information available on the Internet, storing files on personal computers is a common habit among Internet users, essentially for three reasons:

• availability is not always permanent: a shortcut in the favorites folder that points to a document that no longer exists is useless;
• although the information is probably available on more than one Internet site, the user avoids having to locate it again;
• obtaining information is not always immediate: the time involved depends on the file size and connection speed.

But this habit creates a new problem for users once the storage space available on their machines becomes abundant: how to find the desired information in a simple, fast and efficient way? Even users who do not have this habit face a similar question when they search through the scores of shortcuts saved in their favorites folder: which one leads to the page where the desired information can be found?

In fact, the information on the Internet does not have a rigorous organization. The impossibility of maintaining order in such a vast and diversified structure is not considered an obstacle to its global acceptance as a useful knowledge repository. However, spending little time and effort to find specific information is a very valuable property (Liu, Wu, & Liu, 2011). Tools that strive for simplicity and agility in information retrieval have been prominent among those offered by the Internet. Google (Google Inc., 2011g) and Gmail (Google Inc., 2011f) are great examples.

The philosophy adopted by Gmail (Google Inc., 2010) poses certain questions whose answers corroborate the relevance of the topic covered in this work:

• Why waste time deciding which files can be discarded, relinquishing files that may be useful in the future, if disk space is no problem?
• Why spend time sorting documents, if we can retrieve them quickly, whenever we need them, through a simple search?
• Why adopt a rigid structure for classifying documents, if they can be perceived as similar by criteria other than those imposed by the single hierarchy of a directory architecture?

⇑ Corresponding author. E-mail addresses: [email protected] (L.M. Carlantonio), [email protected] (B.A. Osiek), [email protected] (G.B. Xexéo), [email protected] (R.M.E.M. Costa).
0957-4174/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2012.02.101

Without efficient techniques for search and information retrieval, a great deal of time is consumed in organizing and obtaining the information needed. On the Internet, the use of such techniques is now widespread (Song, Choi, Park, & Ding, 2011), but on personal computers the tools are quite limited.

The objective of this paper is to present a new tool, based on a system created by Carlantonio and Costa (2009): a clustering gadget for file searches on desktop computers, called Intelligent Clustering Engine (ICE). In comparison to the Carlantonio and Costa system, we highlight the main contributions of the ICE system:

• new approach to a desktop indexer;
• new weight in the sorting;
• new compact interface;
• new visualization;
• comparative tests with other software;
• test results obtained with a public domain database.

L. M. Carlantonio et al. / Expert Systems with Applications 39 (2012) 9524–9533


This paper is organized as follows: Section 2 presents an overview of some interesting tools available for desktops, Section 3 describes the ICE gadget, Section 4 discusses the experimental results, and Section 5 presents the conclusions and future work.

2. Desktop tools

Among the few tools available for desktop search, four stand out: Aduna AutoFocus (Aduna, 2009a) (2.1), Google Desktop (Google Inc., 2011h) (2.2), Carrot2 (Carrot Search, 2011a) (2.3) and Ergo (Invu Services Ltd, 2009) (2.4). Among them, only Ergo is not a free tool. Next we present some information about them.

2.1. Aduna AutoFocus

Aduna AutoFocus is a desktop search application that uses a guided exploration strategy (Aduna, 2009a). There are versions available for Windows, Linux, Mac OS, and other platforms with Java support. Basically, the user selects the sources to be indexed, submits the query and then explores the results, breaking them down through the selection of new terms and features.

After a source is selected, which can be a directory, a network drive, IMAP, HTTP or HTTPS, the program indexes the files, identifying the 10 most significant terms of each document. If a document is among those returned by the search, its terms are offered to the user (in the column with significant terms of the detail table) during the exploration of results, allowing the results to be filtered by these terms. Results can also be explored by selecting facets, among which the keyword suggestions stand out. In general, the program offers up to 50 words considered the most relevant.

The results are visualized through diagrams called Cluster Maps, which are very similar to Venn or Euler diagrams and whose main objective is to show whether and how the groups formed during the exploration overlap (Aduna, 2009b). This software supports various types of files, for example, MS Office, OpenOffice, TXT, HTML, PDF and XML. The search can be refined by choosing particular fields, such as text and title/subject, among others (Aduna, 2011). The program offers several operators that allow complex queries to be created, such as the fuzzy operator ("~") and the proximity operator ("~number"). These two operators also exist in Lucene (The Apache Software Foundation, 2011b), with which Aduna AutoFocus has a certain similarity in terms of operators and query syntax.

2.2. Google Desktop

Google Desktop (Google Inc., 2011h) is a desktop search application that provides a sidebar similar to that of Windows 7 and Windows Vista, where gadgets can be included. There are versions for Windows 7/Vista/XP/2000, Linux, and Mac. Gadgets, in the software industry, are small programs that can be aggregated to a larger system (Wikimedia Foundation, 2011c). In addition to Google Desktop, gadgets are available in Windows 7/Vista, Mac OS X, KDE, Gnome, and iGoogle (Google Inc., 2011l).

Google Desktop indexes many types of text files, besides music and video files (Google Inc., 2011i). We can also add other plugins (Google Inc., 2011j) that are specific to the source code of programming languages. Google Desktop has some specific operators, other than those of Google's site, such as the Web history operator for a specific site ("site:"), the operator to limit the search to a particular folder or directory and its subdirectories ("under:"), and the operator to limit the search to a specific computer ("machine:") (Google Inc., 2011b). Google Desktop also provides a history of all files and Web sites accessed, sorted by date and time, through the item "Timeline".

2.3. Carrot2

Carrot2 (Carrot Search, 2011a) is an open source framework for building clustering engines that group, in thematic categories, the results provided by sites and search programs. In the context of text mining, the clustering technique has this goal, i.e., to automatically cluster texts (or documents) on the same subject and separate texts on different subjects (Manning, Raghavan, & Schütze, 2008; Wikimedia Foundation, 2011a).

As a formal definition of the problem, we have: given a set of n documents, $X = \{X_1, X_2, \dots, X_n\}$, where each $X_i \in \mathbb{R}^p$ is a vector with p dimensions that measures the attributes of the document, the documents must be grouped so that the groups $C = \{C_1, C_2, \dots, C_k\}$ are disjoint, where k is a priori an unknown value and represents the number of groups (adapted from Hruschka & Ebecken, 2003). The following conditions must be met:

(a) $C_1 \cup C_2 \cup \dots \cup C_k = X$;
(b) $C_i \neq \emptyset$, $\forall i$, $1 \le i \le k$;
(c) $C_i \cap C_j = \emptyset$, $\forall i \neq j$, $1 \le i \le k$ and $1 \le j \le k$.

By definition, a document can only belong to one group (Yin, Hu, Yang, Li, & Gu, 2011), but there are also definitions in the literature that allow an object to belong to more than one group (Yi-Ouyang, Yun-Ling, & AnDing-Zhu, 2007). The clustering problem is often considered an optimization problem in which, through measures such as entropy or silhouette, one seeks the point of the search space that maximizes the differences between groups and the similarities within groups (Agustín-Blas et al., 2012; Madylova et al., 2009; Xu & Wunsch, 2005). The ideal number of groups into which the documents should be divided is one of the challenges of the problem. There are some proposed solutions that do not require initial values for its determination (Chang, Zhao, Zheng, & Zhang, 2012; Cura, 2012; Xiao, Yan, Zhang, & Tang, 2010).

The clustering technique allows the user to find document groups of interest instead of individual documents. This reduces result overhead and also favors semantic gains, since the context (the other words included in the document) in which a word appears influences the inclusion of the document in one group or another. This helps distinguish documents that contain the word jaguar (car) from those that contain the word jaguar (animal).

On the Internet we currently find many sites that offer the clustering technique, for instance: the official website of the US government (USA.gov, 2009), whose search was developed by Vivisimo (Vivisimo Inc., 2010); Allplus (WebLib, 2011); Grokker (Groxis, 2009); KartOO (Kartoo, 2009); Yippy (Yippy Inc, 2010); and Carrot2 itself, which also offers a search site (Carrot Search, 2011b).

Carrot2 is implemented in Java and has components for Google (Google Inc., 2011g), MSN (Microsoft Corporation, 2011a), Yahoo! (Yahoo! Inc, 2011b), Google Desktop (Google Inc., 2011h), Solr (The Apache Software Foundation, 2011d) and Lucene (The Apache Software Foundation, 2011b). Carrot2 is not a search engine, nor does it have crawlers or indexers. For these roles, it suggests using Nutch (The Apache Software Foundation, 2011c) for the first, and Lucene or Solr for the second. A relevant aspect involving Lucene is the fact that it suggests, reciprocally, Carrot2 for the function of clustering (The Apache Software Foundation, 2011a).

Carrot2 provides several algorithms for clustering: Lingo, STC (Suffix Tree Clustering), Rough K-means and Fuzzy Ants. The first two are especially designed for clustering search results. Carrot2 offers a library and a set of support applications. The library, besides the clustering functions themselves, provides tokenizers, stemmers and lists of stop words for several languages. The application suite contains the following options (Carrot Search, 2011c):

• Carrot2 Document Clustering Workbench: a desktop application that allows quick experiments and is useful for identifying appropriate values for the various parameters of each available algorithm;
• Carrot2 Document Clustering Server: a Web server that offers the clustering function as a REST (Representational State Transfer) Web service (Wikimedia Foundation, 2011g);
• Carrot2 Web Application: a Web application for end users, equal to the one available online (Carrot Search, 2011b).

In this work, we use the Carrot2 Document Clustering Workbench to perform comparative tests between the ICE gadget and the Carrot2 framework. This application/framework was chosen because it is able to interact with Google Desktop, the platform chosen for developing the ICE gadget. Carrot2 has two types of visualization: a circle-shaped one (Circles Visualization), made with Adobe Flash Player; and another one using the Aduna Cluster Map, in an older version than that used by Aduna AutoFocus. Applications based on the Carrot2 framework can be developed in two ways:

• software development in Java: JAR (Java Archive) (Wikimedia Foundation, 2011f) files are used and calls are made to the Carrot2 API;
• software development in other languages: the Carrot2 Document Clustering Server is installed and configured, and then calls are made to the server using the REST protocol.

2.4. Ergo

Ergo (Invu Services Ltd, 2009) is a software package for search results clustering that can work with search programs or sites, similar to Carrot2. Unlike the three previous tools, Ergo is proprietary software. Until very recently, it was possible to download an evaluation copy of this software. But in September 2010, the software was patented under the name Wagumo, available on a new website (http://www.wagumo.com). The program is written in J# (J Sharp) and requires several additional programs for its operation, such as the .NET Framework and SQL Server Compact, besides the installation of virtual printers. Ergo runs on Windows XP and Vista.

Like Carrot2, several search sources can be used: Google (Google Inc., 2011g), Yahoo! (Yahoo! Inc, 2011b), Flickr (Yahoo! Inc, 2011a), YouTube (Google Inc., 2011q), Wikipedia (Wikimedia Foundation, 2011i), among others. For document clustering on the desktop, we must also install Windows Desktop Search. Perhaps Windows Search, its successor, can also be used.

Ergo has a strong visual appeal, especially in result navigation, in particular when using Flickr (Yahoo! Inc, 2011a) as the data source, when only the photos are displayed. The program uses the Windows Presentation Foundation, the graphical subsystem of the .NET Framework 3.0 (Microsoft Corporation, 2011b; Wikimedia Foundation, 2011j). There are several options to display the groups formed, some with 3D effects. Besides the functions of search and navigation, the program offers an annotation feature, with a menu similar to that of Microsoft Office 2007, from which text, tables, images, etc. can be inserted, and the result can then be exported as an XPS (XML Paper Specification) file (Microsoft Corporation, 2011d). As for clustering, it is important to highlight that the program does not create groups with unique content: the same document can belong to several groups.

3. ICE gadget

The ICE gadget was created following a structure similar to that of SAGH (Genetic Analytical System of Grouping Hypertexts), a system created by Carlantonio and Costa. Fig. 1 (from Carlantonio & Costa, 2009) shows the seven modules of the SAGH system, as well as their input and output files. Among the peculiarities of SAGH, we highlight:

• extension of the concept of stop words to empty stems: any word whose stem matches one of the stems obtained by stemming the list of stop words is dropped;
• super-powered population: a resource that aims to increase the quality of the clusters found, in which the clustering algorithm is run several times in order to obtain a set of "evolved" individuals, which are then used as the initial population for the last run of the algorithm;
• creation of a differentiated p-dimensional space, where each document is entitled to supply its most frequent term, according to the sorting type chosen (tf, term frequency; idf, inverse document frequency; or tf*idf; or, in the case of the ICE gadget, tf*tf or tf*idf) for the composition of this space, discarding the repeated terms.

As for the sorting criterion idf (inverse document frequency), it is calculated by Eq. (1).

$$\mathrm{idf} = \log\left(\frac{n}{df}\right) \tag{1}$$

where n is the number of documents to be grouped and df is the number of documents that contain the term.

The clustering module (based on the technique proposed by Hruschka & Ebecken (2003)) uses Genetic Algorithms, an Artificial Intelligence technique that aims to find exact or approximate solutions to optimization problems through mechanisms inspired by evolutionary biology (Jain, Murty, & Flynn, 1999; Song, Wang, & Li, 2009; Wikimedia Foundation, 2011d), which is why we chose the name ICE – Intelligent Clustering Engine for the gadget. As characteristics of the clustering algorithm, we can highlight:

• partitioning method;
• chromosomes of constant size during execution (see Fig. 2, from Carlantonio & Costa, 2009);
• fitness function based on the silhouette (Wikimedia Foundation, 2011h);
• cosine similarity (Wikimedia Foundation, 2011b);
• stop criterion based on the number of generations;
• use of elitism;
• roulette-wheel selection;
• crossover and mutation operators oriented to groups;
• random initial population;
• does not require any input parameter;
• provides the number of groups and their contents.

Regarding the fitness function, we have the following equations:

Fig. 1. The SAGH system.

Fig. 2. A chromosome: partitioning of the documents + number of distinct groups.

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \tag{2}$$

Which can be rewritten as:

$$s(i) = \begin{cases} 1 - a(i)/b(i), & \text{if } a(i) < b(i) \\ 0, & \text{if } a(i) = b(i) \\ b(i)/a(i) - 1, & \text{if } a(i) > b(i) \end{cases} \tag{3}$$

From the above definition, we have:

$$-1 \le s(i) \le 1 \tag{4}$$

and finally:

$$OF = \sum_{i=1}^{n} \frac{s(i)}{n} \tag{5}$$

where a(i) is the average distance from document i ∈ cluster A to the other documents of A; b(i) is the minimum of d(i, C) over the clusters C ≠ A, where d(i, C) is the average distance from document i ∈ cluster A to the documents of C; and s(i) = 0 if the cluster has only one document. The objective function (OF) is the arithmetic mean of s(i), where n is the total number of documents. The cosine similarity is calculated by Eq. (6).

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} \tag{6}$$

where A and B are the vectors that represent the documents whose similarity we want to evaluate.

The ICE gadget is designed to run on the platform offered by Google Desktop (Google Inc., 2011h). In this first version, the system can sort and group HTML documents.
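To make the fitness function concrete, the following is a minimal Python sketch of Eqs. (2), (5) and (6). It is not the SAGH or ICE implementation. It assumes that the distance between two documents is taken as 1 minus their cosine similarity, and that s(i) = 0 when the partition has a single cluster; the function names and the toy vectors are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Eq. (6): cos(theta) = (A . B) / (||A|| * ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

def distance(a, b):
    # Assumption: the distance between documents is 1 - cosine similarity.
    return 1.0 - cosine_similarity(a, b)

def mean_distance(i, members, vectors):
    """Average distance from document i to the documents listed in members."""
    return sum(distance(vectors[i], vectors[j]) for j in members) / len(members)

def silhouette(i, vectors, labels):
    """Eq. (2) for document i; 0 for a single-document cluster."""
    own = [j for j, lab in enumerate(labels) if lab == labels[i] and j != i]
    others = {lab for lab in labels if lab != labels[i]}
    if not own or not others:
        # "not others" (a single-cluster partition) is an assumption of this sketch
        return 0.0
    a = mean_distance(i, own, vectors)
    b = min(
        mean_distance(i, [j for j, lab in enumerate(labels) if lab == other], vectors)
        for other in others
    )
    largest = max(a, b)
    return 0.0 if largest == 0.0 else (b - a) / largest

def objective_function(vectors, labels):
    """Eq. (5): arithmetic mean of s(i) over all n documents."""
    n = len(vectors)
    return sum(silhouette(i, vectors, labels) for i in range(n)) / n

# Toy data: two pairs of nearly parallel document vectors.
vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
good = objective_function(vectors, [0, 0, 1, 1])  # similar documents together
bad = objective_function(vectors, [0, 1, 0, 1])   # similar documents split apart
```

In this toy case the partition that keeps similar documents together scores close to 1, while the partition that splits them scores below 0, which is the behavior a silhouette-based fitness is expected to reward.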


Among some existing visualization techniques, namely Hyperbolic Tree (Bouthier, 2011; Bou, 2011a), Jung (O'Madadhain, Fisher, Nelson, White, & Boey, 2011), Guess (Adar, 2011), and Network Workbench (NWB Team, 2011), we adopted the hyperbolic tree technique, also called hypertree, through an implementation called Treebolic (Bou, 2011a), to show the groups found by the ICE gadget. This technique offers a graph visualization based on hyperbolic geometry. It considerably reduces the space needed to display a tree, because it highlights the nodes that are in focus, while the others have their size compressed at the borders (Wikimedia Foundation, 2011e). This compact view is crucial in the case of gadgets, because we are dealing with applets with reduced screens.

The following sections give an overview of the main themes related to this work: we describe the structure of the Google gadget (3.1) and the Treebolic suite (3.2), and finally we discuss the details of the ICE gadget (3.3).

3.1. Structure and creation of a gadget

A Google Desktop gadget consists of JavaScript code, XML files, and objects and functions provided by the Google Gadget API (Google Inc., 2011c). The default file extension of the gadget is GG, i.e., Google gadget. This file type is, in fact, a zipped file that contains the following elements:

(a) an XML file called gadget.gmanifest, which contains meta-information about the gadget (name, version, author, APIs used, etc.);
(b) another XML file called main.xml, which defines the main view presented to the user (interface objects, their appearance properties, and the names of the functions to be called when certain events occur);
(c) a JavaScript file called main.js, where the functions mentioned in the previous item are coded;
(d) images for the various states of the interface objects and for the icons of the gadget (formats: BMP, JPG, PNG, and GIF);
(e) and, finally, another XML file called strings.xml, with information that will be displayed in the "about" dialog box of the gadget.

Other gadgets, such as those for Windows 7/Vista, have similar structures (Lee, 2008). In addition to those basic components, we can include other XML files to specify an options view (Avram, 2007) or a details view (Stucki, 2007), as well as JavaScript or VBScript (Visual Basic Scripting Edition) files to define the functionality of these interfaces or to organize the code (Filimon, 2008). A sophisticated and original visual interface can be created, as we can set images for each state of the interface objects and easily define their transparency and rotation effects (Filimon, 2007; Schirmer, 2007; Thangaraj, 2007). One can also use a dynamic-link library in the gadgets, encapsulating native code inside ActiveX automation objects, creating the so-called hybrid gadgets (Olczyk, 2007), whose functionality is limited only by what the operating system offers.

Because of this simple yet powerful structure, the number of gadgets available (Google, Microsoft Windows, Yahoo!) is quite considerable. In the Windows Live Gallery (Microsoft Corporation, 2011c) alone, we found 5502 gadgets in the English language.
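Since a GG file is described in Section 3.1 as a zipped file with a few required entries, its layout can be checked with any zip library. The sketch below is only an illustration in Python and uses no Google Desktop API: the four file names come from the description in Section 3.1, while the helper names, the empty demo files, and the choice to compare base names (so that an entry stored inside a subfolder still counts) are assumptions.

```python
import io
import zipfile

# Required entries, as listed in Section 3.1; everything else here is illustrative.
REQUIRED = ("gadget.gmanifest", "main.xml", "main.js", "strings.xml")

def missing_gadget_files(package):
    """Return the required entries that are absent from a .gg (zip) archive.

    Assumption: entries are matched by base name, wherever they sit in the archive.
    """
    with zipfile.ZipFile(package) as archive:
        names = {name.rsplit("/", 1)[-1] for name in archive.namelist()}
    return [name for name in REQUIRED if name not in names]

def build_demo_package():
    """Build an in-memory archive that deliberately lacks strings.xml."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in ("gadget.gmanifest", "main.xml", "main.js"):
            archive.writestr(name, "")  # empty placeholder content
    buffer.seek(0)
    return buffer

missing = missing_gadget_files(build_demo_package())
```

Here `missing` lists the entries a packaging step would still have to add before the archive matches the structure described above.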
Realizing the potential of these applets, big companies have created gadgets to promote their online content (Amazon.com, Inc, 2011; Infoglobo Comunicação e Participações, 2011).

To create gadgets, Google Desktop provides RAD (Rapid Application Development) software called Gadget Designer (Google Inc., 2001e; Google Inc., 2011o). It creates the basic files needed and allows debugging and viewing the gadget while it runs. After the interfaces are created and the functions coded, the program generates the GG file through the option "build package" in the gadget menu. The quickest way to learn how to build gadgets is through examples, and there are several in the Gadget Designer download package. Other resources are the large number of gadgets offered on the Google Desktop site (Google Inc., 2011k), as well as the tutorials (Google Inc., 2011n), articles (Google Inc., 2011a) and documentation (Google Inc., 2011p), in particular the references for the gadgets API (Google Inc., 2011d) and the query API (Google Inc., 2011m).

3.2. The Treebolic suite

Treebolic is a Java suite that implements hyperbolic trees (Bou, 2001b). It offers several features, as well as very practical navigation. To use it, we incorporate the Java applet into an HTML file and describe the groups using the XML format defined by the project. Two programs in the installation package stand out for understanding the use of Treebolic: the Treebolic Demo, which helps the user become familiar with the functionality offered; and the Treebolic Generator, which shows how the features are stored in XML files, something fundamental for the creation of trees at runtime.

The specification of the tree includes several items: statusbar, toolbar, pop-up menu, nodes, etc. The tree can be split into several separate files, which makes it possible to mount or unmount subtrees during the visualization. In the pop-up menu, we can include several options, among which we highlight the option to search for a node with the specified text, by the criteria "start with", "includes", or "equals". As for the nodes, besides the label and the content, we can also set colors, images and links to sites or local files.

3.3. The ICE gadget

The ICE gadget is a document clustering tool for desktop computers. It interacts with Google Desktop, grouping the results returned by this indexer. Its compact and rich visualization interface provides much information about the clusters formed and their contents. Besides the operators offered by Google Desktop, we can choose between the tf*tf and tf*idf weights, and apply the super-powered population to improve the results. Figs. 3 and 4 show the ICE gadget interface and its visualization in a search example.

Fig. 3. ICE gadget.

Fig. 4. ICE – Visualization.

The concept of visualization is based on the overview-detail idea: files can be loaded easily and are displayed in separate windows, without changing the visualization of the tree, so that the user does not lose the context in which the document is placed. The nodes of the tree can be put into focus by a simple click, or automatically by enabling the option "hovering triggers focus" and then briefly positioning the mouse cursor on the node. Cluster nodes can be mounted and unmounted on demand, enabling a compact display targeted to the user's interests, which makes navigation easier and the identification of relevant files faster.

In its visualization, the ICE gadget offers, on a color scale, a measure of relative distance for the documents, which takes into account two extremes: the centroid of the group and the document farthest from the centroid. This information is also available in the tooltip through the parameter RCU (ratio centroid to ultimate document), which ranges from 0 to 1. The visualization also presents the principal terms of the cluster and the number of documents within it. For documents, we also have the title, the snippet that Google Desktop has provided for the file, its principal terms and its location (link).

The terms used in the search appear in the Google Desktop logo (in this example, cocoa, in the center). They are incorporated into the list of stop words at runtime, because they become irrelevant to the clustering process, since they will be in all documents returned by Google Desktop. It is possible to search for a node (cluster or document) containing certain text in the label or in the content (principal terms, snippet, RCU and link, in the case of documents; and principal terms, in the case of clusters). The statusbar and the toolbar can be repositioned on the interface or even detached from the window, extending the usable area available to the tree in the visualization window. We decided not to show in the visualization the groups that have only one document, nor to include them in the group "others".
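The text states only that RCU ranges from 0 to 1 and relates each document to two extremes, the centroid of its group and the document farthest from it. One plausible reading, sketched below in Python, is the ratio between a document's distance to the centroid and the distance of the farthest document. The exact formula used by ICE is not given here, so the ratio, the use of cosine distance, the mean-vector centroid and all function names are assumptions.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, clamped at 0 to absorb rounding error."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norms == 0.0:
        return 1.0  # assumption: a zero vector is maximally distant
    return max(0.0, 1.0 - dot / norms)

def centroid(vectors):
    """Component-wise mean of the document vectors of one group."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def rcu(vectors):
    """Hypothetical RCU: distance to the centroid divided by the largest such distance."""
    center = centroid(vectors)
    distances = [cosine_distance(vector, center) for vector in vectors]
    farthest = max(distances)
    if farthest == 0.0:
        return [0.0] * len(vectors)  # all documents coincide with the centroid
    return [d / farthest for d in distances]

# Toy group of three documents in a two-term space.
group = [[1.0, 0.0], [1.0, 0.2], [0.5, 1.0]]
ratios = rcu(group)
```

Under this reading, a value near 0 marks a document close to the centroid and a value of 1 marks the document farthest from it, which is the kind of scale a color gradient can display directly.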
The activity diagram of the ICE gadget can be seen in Fig. 5. The ICE gadget is more demanding in terms of processor than in terms of memory. The test machine was a Core 2 Quad Q6600 with 2 GB RAM, which offered a very good run time, although this depends on the number of files returned by the query. The most sensitive stages of the process are:

• creation of vectors of terms: when the number of files is large and/or when they have many words;
• genetic analysis of clusters: when the number of files is large.

A possible improvement of the gadget would be to create the vectors of terms of all HTML files available on the computer during the intervals when the processor is idle, similar to what Google Desktop does with regard to indexing. That would eliminate much of the sensitivity of this step. Of course, it would be necessary to adapt this module so that the dictionary file is created only after the query is submitted, because to use the tf*idf weight this file must take into account just the words (and their occurrences) of the documents retrieved by the query.

The clustering algorithm has more difficulty in separating the documents if they have many terms in common, so the weights that generate a greater distance between the documents are more suitable. During some tests, we found that a non-trivial weight, the tf*tf weight, usually provides interesting results. What happens in practice when using this weight is that the terms that occur only once in the documents have their relevance reduced even more. This weight was chosen as the default due to the good results achieved, its simplicity and its lower computational cost when compared to tf*idf. When the tf*tf weight does not separate the documents, one can try the tf*idf weight, or the super-powered population feature with either weight. One feature that differentiates these weights is that the tf*tf weight tends to separate the documents more, providing a greater number of groups.
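The difference between the two weights can be illustrated with a small Python sketch. It assumes that tf is the raw count of a term in a document, that idf follows Eq. (1) with a natural logarithm (the base is not stated in the text), and that both n and df are computed only over the documents returned by the query; the function name and the toy documents are illustrative and not taken from ICE.

```python
import math
from collections import Counter

def term_weights(documents, scheme="tf*tf"):
    """Return one {term: weight} dictionary per tokenized document.

    scheme "tf*tf":  weight = tf * tf
    scheme "tf*idf": weight = tf * log(n / df), with idf as in Eq. (1)
    """
    n = len(documents)
    # df: number of documents that contain each term
    df = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        tf = Counter(doc)
        if scheme == "tf*tf":
            weights = {term: count * count for term, count in tf.items()}
        elif scheme == "tf*idf":
            weights = {term: count * math.log(n / df[term]) for term, count in tf.items()}
        else:
            raise ValueError("unknown scheme: " + scheme)
        weighted.append(weights)
    return weighted

# Toy documents, already tokenized and stripped of stop words.
docs = [
    ["cocoa", "cocoa", "export", "ship"],
    ["coffee", "coffee", "coffee", "export"],
    ["ship", "port", "ship"],
]
tftf = term_weights(docs, "tf*tf")
tfidf = term_weights(docs, "tf*idf")
```

With tf*tf, a term that occurs once keeps weight 1 while a term that occurs three times rises to 9, which matches the observation that single-occurrence terms lose relative relevance. The scheme also needs no document-frequency dictionary, consistent with its lower computational cost.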

4. Experimental results For the tests, a subset of Reuters data base was used, the Reuters-21578 (Jiang, Pang, Wu, & Kuang, 2012; Lewis, 1997). As the Reuters-21578 data set is very large, with 21,578 documents in 135 categories (topic field), we promote the following cuts:  from the data set, we calculated the average number of characters in the body of the documents (body field) and selected those documents with a number of characters greater than the average (835.5719) in this field, yielding 10,369 documents;  then, we excluded the documents that do not have a topic field filled out, reducing the number to 3263 documents. With this procedure, the number of distinct categories changed from 135 to 73;  next, we eliminated those categories that have less than about 30 documents and those with more than about 100 documents;  from this subset, we divided the documents into three parts, trying to compose a set with the categories that have about 30 documents, another one with about 50, and the third one with about 100 documents. With this, we chose the following categories, limited to three, to form the test data, depending on the number of documents in the categories (in brackets), shown in Table 1. The reasoning involved in choosing the categories that compose the varied data set was based on the fact that cocoa and coffee are exportable agricultural products and, certainly, the ‘‘ship’’ category is closely related to export, possibly constituting a data set more difficult to be clustered, since the categories bear some similarities. After identifying these subsets, we created HTML files, one for each document, containing their body and title fields. We waited for Google Desktop to index the 4 folders that defined the bases and then we started with the testing. Early on, we encountered the following question: which term, or terms, to choose as query? 
We realized that the choice least likely to influence the results would be precisely to choose no term at all and to let the programs evaluate the documents in their entirety, ignoring, of course, the stop words (Cutting, Karger, Pedersen, & Tukey, 1992). So, the searches were made in the format “under:path”.

Fig. 5. ICE – Activity diagram. [Figure not reproduced. Participants: User, ICE, Google Desktop. Activities: Specifies the Query; Submits the Query; Searches Results; Reports that there are no Results; Records the Snippets; Records the File List; Creates Vectors of Terms; Classifies the Vectors; Creates the Dimension; Creates the Matrix; Normalizes the Matrix; Cluster; Generates the Visualization; Visualizes the Results.]

Table 1. Test data created.

Small (4.1)     Medium (4.2)    Large (4.3)      Varied (4.4)
Oilseed (28)    Gold (50)       Gnp (92)         Cocoa (35)
Bop (32)        Coffee (67)     Ship (92)        Coffee (67)
Cocoa (35)      Sugar (68)      Interest (115)   Ship (92)
Total: 95       Total: 185      Total: 299       Total: 194

Because the programs cluster differently, with the ICE gadget creating groups with unique content and Carrot2 producing overlapping results, we adopted the following methodology:

• a cluster is assigned to a category according to the predominance of documents;
• in case of a draw, the cluster is ignored;
• a category may be divided into more than one group, according to the previous items;
• groups that have only one element are not considered;
• the group labeled “other topics” in Carrot2 is also not considered;
• we calculated the percentage of accuracy in the largest cluster of the category and the percentage of accuracy involving all groups of the category, i.e., the percentage of documents of the X category that is in the largest group assigned to it and the percentage of documents of the X category that is in all the groups that were assigned to it.

We used the default values of Carrot2. As for ICE, we used the default weight, tf*tf, with the super-powered population feature enabled.

4.1. Small data set

For the data set containing the small categories, the ICE gadget found five groups, grouping 89 of the 95 existing documents. Carrot2 found 28 groups (not including the group “other topics”). Of the 28 groups, seven had only two documents. The grouped documents cannot be counted easily in Carrot2, because its clusters can overlap, which makes the comparison of results difficult. Ignoring the documents placed in the group “other topics” (13), the number of documents inside the other groups is 145, although there are only 95 documents in the data set. Table 2 shows the results, according to the adopted methodology, where the numbers in brackets represent:

• in column headers: the number of documents in each category;
• others: the greatest number of documents of a given category that were put together, i.e., its largest group (LG).

Table 2. Small data set.

                  Oilseed (28)   Bop (32)     Cocoa (35)
ICE – LG          100% (26)      85% (17)     100% (32)
Carrot2 – LG      80% (8)        100% (10)    100% (17)
ICE – Total       100%           90.32%       100%
Carrot2 – Total   98.90%         88.88%       100%

In comparison, Carrot2 performed better than the ICE gadget in only one of six items, besides the great difference in the number of clusters found: 28 against five for ICE. As for the largest group, the ICE gadget grouped 70% more documents than Carrot2 (“bop” category).

Table 4. Large data set.

                  Gnp (92)       Ship (92)     Interest (115)
ICE – LG          82.61% (76)    98.86% (87)   86.32% (101)
Carrot2 – LG      100% (19)      100% (8)      100% (23)
ICE – Total       82.61%         98.86%        86.32%
Carrot2 – Total   89.58%         90.16%        88.24%
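The evaluation methodology adopted in this section (majority labeling of each group, discarding draws and single-element groups, then scoring the largest group and all groups of a category) can be sketched as below. It is an illustration, not the procedure's actual implementation: the input format and the function name `evaluate` are hypothetical, the “other topics” group is assumed to be left out by the caller, and the percentages are read as the share of a group's documents that belong to its category. That reading reproduces the ICE figure for “ship” in Table 4 (87 of 88 documents is 98.86%), where the group size of 88 is inferred from the reported figures rather than stated in the text.

```python
from collections import Counter, defaultdict


def evaluate(clusters):
    """Score groups given the true category of every document in them.

    clusters: list of lists; each inner list holds the known category of
    each document in one group (hypothetical input format).
    """
    assigned = defaultdict(list)  # category -> [(category docs, group size)]
    for labels in clusters:
        if len(labels) < 2:
            continue  # groups with a single element are not considered
        ranked = Counter(labels).most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue  # draw between categories: the group is ignored
        category, hits = ranked[0]
        assigned[category].append((hits, len(labels)))

    report = {}
    for category, groups in assigned.items():
        # Largest group (LG): the group holding most documents of the category.
        lg_hits, lg_size = max(groups)
        total_hits = sum(hits for hits, _ in groups)
        total_size = sum(size for _, size in groups)
        report[category] = {
            "lg_docs": lg_hits,
            "lg_pct": round(100 * lg_hits / lg_size, 2),
            "total_pct": round(100 * total_hits / total_size, 2),
        }
    return report


# One group of 88 documents, 87 of them labelled "ship"; "other" is a placeholder.
print(evaluate([["ship"] * 87 + ["other"]]))
# {'ship': {'lg_docs': 87, 'lg_pct': 98.86, 'total_pct': 98.86}}
```

When a category is assigned a single group, as in this example, the LG and Total percentages coincide.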

4.2. Medium data set

In the data set containing the medium categories, the ICE gadget created seven groups, clustering 181 of the 185 existing documents. Carrot2 created 40 groups, whose sizes sum to 168 documents (not including the group “other topics”). Of the 40 groups, 17 had only two documents. The group “other topics” had 60 documents. Table 3 summarizes the results of this data set. Comparing the results, the ICE gadget was worse than Carrot2 in only three of six items. In relation to the largest group, it joined 120% more documents than Carrot2 in the “coffee” category, and 464% more in the “sugar” category. Again, the difference in the number of clusters found was significant: seven for ICE against 40 for Carrot2. One issue (not shown) that caught our attention was the fact that Carrot2 grouped only 26 (or fewer, because of repetition) of the 50 documents in the “gold” category.

Table 3. Medium data set.

                  Gold (50)      Coffee (67)   Sugar (68)
ICE – LG          100% (44)      98.46% (64)   98.41% (62)
Carrot2 – LG      100% (7)       100% (29)     100% (11)
ICE – Total       97.96%         98.51%        98.46%
Carrot2 – Total   92.86%         91.67%        100%

4.3. Large data set

The ICE gadget created exactly three groups, clustering 297 of the existing 299 documents, in the test involving the large categories. Carrot2 generated 50 groups (not including the group “other topics”), whose sizes sum to 288 documents. Thirteen of the 50 groups had only two documents. The group “other topics” aggregated 103 documents. Table 4 shows the results for this data set. Analyzing the results, we noticed that the ICE gadget had a better result than Carrot2 in only one of the six items. But the difference in the number of documents in the largest group was significant, with ICE obtaining 300% more documents than Carrot2 for the “gnp” category, 988% more in the “ship” category, and 339% more in the “interest” category. We emphasize the difference in the number of groups found, which was the largest of all: three for ICE against 50 for Carrot2.

4.4. Varied data set

In the test case involving the data set containing categories with varied sizes, ICE produced five groups, using 190 of the 194 existing documents. As for Carrot2, it provided 40 clusters (not including the group “other topics”), summing 192 documents. In the group “other topics”, 61 documents were placed. Seventeen of the 40 groups had only two documents. The results are presented in Table 5. In this last test, Carrot2 performed better than ICE in only two of six items, but again with large differences in the number of documents included in the largest groups. The ICE gadget grouped 121% more documents in the case of the “coffee” category, and 988% more in the case of the “ship” category. Again, we noted that Carrot2 grouped few documents of one category, the “ship” category, where only 56 (or fewer, because of repetition) of the 92 documents were taken into account (not shown). The number of groups found also deserves to be highlighted: five for ICE against 40 for Carrot2, the second largest difference obtained.

Table 5. Varied data set.

                  Cocoa (35)     Coffee (67)   Ship (92)
ICE – LG          100% (30)      96.97% (64)   96.66% (87)
Carrot2 – LG      100% (18)      100% (29)     100% (8)
ICE – Total       100%           97.06%        96.73%
Carrot2 – Total   97.36%         95.35%        96.55%
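The “% more documents” figures quoted for the largest groups follow from the bracketed document counts in Tables 2 to 5. A minimal sketch of that arithmetic is given below; the helper name `percent_more` is hypothetical and the counts are copied from the tables.

```python
def percent_more(ice_docs, carrot2_docs):
    """Relative gain, in percent, of ICE's largest group over Carrot2's."""
    return 100 * (ice_docs - carrot2_docs) / carrot2_docs


# (ICE, Carrot2) document counts of the largest groups, from Tables 2 to 5.
largest_groups = {
    "bop (small)": (17, 10),
    "coffee (medium, varied)": (64, 29),
    "sugar (medium)": (62, 11),
    "gnp (large)": (76, 19),
    "ship (large, varied)": (87, 8),
    "interest (large)": (101, 23),
}

for name, (ice, carrot2) in largest_groups.items():
    print(f"{name}: {percent_more(ice, carrot2):.1f}% more")
```

The computed values are 70.0, 120.7, 463.6, 300.0, 987.5 and 339.1, which correspond to the 70%, 120% (121% in Section 4.4), 464%, 300%, 988% and 339% quoted in the text.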

5. Conclusions

In this work, we proposed, presented and evaluated a clustering gadget for Google Desktop, called ICE – Intelligent Clustering Engine. The main contribution of this work is a new tool that improves the quality of the results offered by Google Desktop by means of clustering. Comparing the results of ICE with those offered by Carrot2 showed that the ICE gadget finds a number of groups much closer to the reality of the bases tested. It does not spread the documents among many small groups, it conveys the relationship between the groups more clearly than Carrot2, and it speeds up the retrieval of the desired information. Our assessment shows that, in the experimental results, the ICE gadget is able to group similar documents. The weight tf*tf, embedded in the gadget, proved to be very useful for obtaining large groups of similar documents. The fact that Carrot2 only considers the text snippets returned by Google Desktop, although fundamental in searches targeted at sites on the Internet, is a disadvantage when it comes to desktop search: as the files are easily accessible, a clustering that takes into account the entire contents of each file tends to provide far more accurate results.


Another disadvantage of Carrot2 is that the technique employed does not generate clusters with unique elements, allowing the same document to belong to more than one cluster. The same happens with Ergo, and this compares poorly with what Aduna AutoFocus proposes to do. Suggestions for future work involve extending the system to the clustering of other types of files, as well as other languages. Another possibility would be to obtain access to the index files of Google Desktop, which would allow any type of indexed document to be clustered with this application (Broder, Glassman, Manasse, & Zweig, 1997).
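Whether a clustering assigns each document to exactly one group can be checked by comparing the summed group sizes with the number of distinct documents, as in the small data set, where the Carrot2 groups held 145 entries for only 95 documents. A minimal sketch, assuming each group is given as a set of document identifiers (a hypothetical representation, as is the function name `overlap_summary`):

```python
def overlap_summary(groups):
    """Compare total group memberships with the number of distinct documents.

    groups: list of sets of document identifiers (assumed representation).
    """
    memberships = sum(len(group) for group in groups)
    distinct = len(set().union(*groups))
    return {
        "memberships": memberships,
        "distinct": distinct,
        # More memberships than documents means some document is repeated.
        "overlapping": memberships > distinct,
    }


print(overlap_summary([{"d1", "d2", "d3"}, {"d3", "d4"}]))
# {'memberships': 5, 'distinct': 4, 'overlapping': True}
```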

References

Adar, E. (2011). GUESS: The graph exploration system. Visited September 2011.
Aduna (2009a). Aduna – AutoFocus. Visited July 2009.
Aduna (2009b). Aduna – Cluster Map Library. Visited July 2009.
Aduna (2011). Search – Aduna open source wiki. Visited September 2011.
Agustín-Blas, L. E., Salcedo-Sanz, S., Jiménez-Fernández, S., Carro-Calvo, L., Del Ser, J., & Portilla-Figueras, J. A. (2012). A new grouping genetic algorithm for clustering problems. Expert Systems with Applications, 39, 9695–9703.
Amazon.com, Inc. (2011). Amazon.com Associates Central – Widgets. Visited September 2011.
Avram, C. (2007). Using the options dialog – Google Desktop APIs – Google Code. Visited September 2011.
Bou, B. (2011a). Treebolic. Visited September 2011.
Bou, B. (2011b). treebolic | Download treebolic software for free at SourceForge.net. Visited September 2011.
Bouthier, C. (2011). Hypertree Java Library. Visited September 2011.
Broder, A. Z., Glassman, S. C., Manasse, M. S., & Zweig, G. (1997). Syntactic clustering of the web. Computer Networks and ISDN Systems, 29, 1157–1166.
Carlantonio, L. M., & Costa, R. M. E. M. (2009). Exploring a genetic algorithm for hypertext documents clustering. In N. Nedjah, L. de Macedo Mourelle, J. Kacprzyk, F. M. G. França, & A. F. de Souza (Eds.), Intelligent text categorization and clustering. Studies in computational intelligence (Vol. 164, pp. 95–117). Berlin/Heidelberg: Springer.
Carrot Search (2011a). Carrot2 – open source search results clustering engine. Visited September 2011.
Carrot Search (2011b). Carrot2 clustering engine. Visited September 2011.
Carrot Search (2011c). Carrot2 user and developer manual for version 3.6.0-dev. Visited September 2011.
Chang, D., Zhao, Y., Zheng, C., & Zhang, X. (2012). A genetic clustering algorithm using a message-based similarity measure. Expert Systems with Applications, 39, 2194–2202.
Cura, T. (2012). A particle swarm optimization approach to clustering. Expert Systems with Applications, 39, 1582–1588.
Cutting, D. R., Karger, D. R., Pedersen, J. O., & Tukey, J. W. (1992). Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on research and development in information retrieval (pp. 318–329). New York, NY, USA: ACM.
Filimon, T. (2007). Desktop gadgets: Rotating objects – Google Desktop APIs – Google Code. Visited September 2011.
Filimon, T. (2008). Using parameters in desktop gadget programming – Google Desktop APIs – Google code. Visited September 2011.
Google Inc. (2010). Ten ways Gmail makes email easy and efficient. And maybe even fun. Visited September 2010.
Google Inc. (2011a). Articles – Google Desktop APIs – Google Code. Visited September 2011.
Google Inc. (2011b). Basics: search operators – desktop for windows help. Visited September 2011.
Google Inc. (2011c). Creating a gadget – Google Desktop APIs – Google code. Visited September 2011.
Google Inc. (2011d). Gadget API reference – Google Desktop APIs – Google code. Visited September 2011.
Google Inc. (2011e). Gadget designer – Google Desktop APIs – Google code. Visited September 2011.

Google Inc. (2011f). Gmail: Email from Google. Visited September 2011.
Google Inc. (2011g). Google. Visited September 2011.
Google Inc. (2011h). Google Desktop. Visited September 2011.
Google Inc. (2011i). Google Desktop – features. Visited September 2011.
Google Inc. (2011j). Google Desktop gadgets. Visited September 2011.
Google Inc. (2011k). Google Desktop gadgets. Visited September 2011.
Google Inc. (2011l). iGoogle. Visited September 2011.
Google Inc. (2011m). Query API developer guide – Google Desktop APIs – Google code. Visited September 2011.
Google Inc. (2011n). Tutorials – Google Desktop APIs – Google Code. Visited September 2011.
Google Inc. (2011o). Using gadget designer – Google Desktop APIs – Google code. Visited September 2011.
Google Inc. (2011p). Welcome – Google Desktop APIs – Google code. Visited September 2011.
Google Inc. (2011q). YouTube – broadcast yourself. Visited September 2011.
Groxis (2009). Grokker – enterprise search management and content integration. Visited July 2009.
Hruschka, E. R., & Ebecken, N. F. F. (2003). A genetic algorithm for cluster analysis. Intelligent Data Analysis, 7, 15–25.
Infoglobo Comunicação e Participações S.A. (2011). O site O Globo:: Widgets. Visited September 2011 (in Portuguese).
Invu Services Ltd. (2009). Ergo download. Visited July 2009.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31, 264–323.
Jiang, S., Pang, G., Wu, M., & Kuang, L. (2012). An improved K-nearest-neighbor algorithm for text categorization. Expert Systems with Applications, 39, 1503–1509.
Kartoo S. A. (2009). KartOO: The first interface mapping metasearch engine. Visited July 2009.
Lee, W.-M. (2008). Professional Windows Vista gadgets programming. Indianapolis, Indiana, USA: Wiley Publishing, Inc.
Lewis, D. D. (1997). Reuters–21578 text categorization test collection distribution 1.0. Visited September 2011.
Liu, Y. C., Wu, C., & Liu, M. (2011). Research of fast SOM clustering for text information. Expert Systems with Applications, 38, 9325–9333.
Madylova, A., & Öğüdücü, Ş. G. (2009). A taxonomy based semantic similarity of documents using the cosine measure. In Proceedings of the 24th international symposium on computer and information sciences (pp. 129–134). Washington, DC, USA: IEEE.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Flat clustering. Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press. chapter 16.
Microsoft Corporation (2011a). MSN.com. Visited September 2011.
Microsoft Corporation (2011b). The official Microsoft WPF and Windows forms site. Visited September 2011.
Microsoft Corporation (2011c). Windows live gallery. Visited September 2011.
Microsoft Corporation (2011d). XML paper specification: Overview. Visited September 2011.
NWB Team (2011). Network workbench | welcome. Visited September 2011.
Olczyk, K. (2007). Going beyond script: Developing hybrid desktop gadgets – Google Desktop APIs – Google code. Visited September 2011.
O’Madadhain, J., Fisher, D., Nelson, T., White, S., & Boey, Y.-B. (2011). JUNG – Java Universal Network/Graph Framework. Visited September 2011.
Schirmer, B. (2007). Let the user choose your gadget’s opacity – Google Desktop APIs – Google code. Visited September 2011.
Song, W., Wang, S. T., & Li, C. H. (2009). Parametric and nonparametric evolutionary computing with a content-based feature selection approach for parallel categorization. Expert Systems with Applications, 36, 11934–11943.
Song, W., Choi, L. C., Park, S. C., & Ding, X. F. (2011). Fuzzy evolutionary optimization modeling and its applications to unsupervised categorization and extractive summarization. Expert Systems with Applications, 38, 9112–9121.
Stucki, Y. (2007). Details views and YouTube videos in desktop gadgets – Google Desktop APIs – Google code. Visited September 2011.
Thangaraj, B. (2007). Animation: Add life to your desktop gadget – Google Desktop APIs – Google code. Visited September 2011.

The Apache Software Foundation (2011a). LuceneFAQ – Lucene-java Wiki. Visited September 2011.
The Apache Software Foundation (2011b). Welcome to Apache Lucene! Visited September 2011.
The Apache Software Foundation (2011c). Welcome to Apache Nutch. Visited September 2011.
The Apache Software Foundation (2011d). Welcome to Solr. Visited September 2011.
USA.gov (2009). USA.gov: The US Government’s official web portal. Visited July 2009.
Vivisimo Inc. (2010). Enterprise search provider – Federated search, social search, clustering | Vivisimo, Inc. Visited September 2010.
WebLib (2011). AllPlus – Universal meta search and discovery engine. Visited September 2011.
Wikimedia Foundation (2011a). Clustering – Wikipédia, a enciclopédia livre. Visited September 2011 (in Portuguese).
Wikimedia Foundation (2011b). Cosine similarity – Wikipedia, the free encyclopedia. Visited September 2011.
Wikimedia Foundation (2011c). Gadget – Wikipédia, a enciclopédia livre. Visited September 2011 (in Portuguese).
Wikimedia Foundation (2011d). Genetic algorithm – Wikipedia, the free encyclopedia. Visited September 2011.
Wikimedia Foundation (2011e). Hyperbolic tree – Wikipedia, the free encyclopedia. Visited September 2011.
Wikimedia Foundation (2011f). Java archive – Wikipédia, a enciclopédia livre. Visited September 2011 (in Portuguese).
Wikimedia Foundation (2011g). REST – Wikipédia, a enciclopédia livre. Visited September 2011 (in Portuguese).
Wikimedia Foundation (2011h). Silhouette (clustering) – Wikipedia, the free encyclopedia. Visited September 2011.
Wikimedia Foundation (2011i). Wikipedia. Visited September 2011.
Wikimedia Foundation (2011j). Windows presentation foundation – Wikipedia, the free encyclopedia. Visited September 2011.
Xiao, J., Yan, Y., Zhang, J., & Tang, Y. (2010). A quantum-inspired genetic algorithm for k-means clustering. Expert Systems with Applications, 37, 4966–4973.
Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16, 645–678.
Yahoo! Inc. (2011a). Welcome to Flickr – Photo sharing. Visited September 2011.
Yahoo! Inc. (2011b). Yahoo! Visited September 2011.
Yin, M., Hu, Y., Yang, F., Li, X., & Gu, W. (2011). A novel hybrid K-harmonic means and gravitational search algorithm approach for clustering. Expert Systems with Applications, 38, 9319–9324.
Yi-Ouyang, Y.-O., Yun-Ling, Y.-L., & AnDing-Zhu, A.-Z. (2007). EHM-based web pages fuzzy clustering algorithm. In Proceedings of the 2007 international conference on multimedia and ubiquitous engineering (pp. 561–566). Washington, DC, USA: IEEE.
Yippy Inc. (2010). Yippy. Visited September 2010.