Knowledge-Based Systems 21 (2008) 466–470
An intelligent information retrieval agent
R. Dhanapal
Department of Information Technology, Vel Multi Tech Sri Rangarajan Sakunthala Engineering College, Avadi, Chennai 600 062, Tamil Nadu, India
Article info
Article history: Received 11 December 2007; Accepted 11 March 2008; Available online 25 March 2008
Keywords: Information retrieval; Phrase indexing; Retrieval agent; Next word index; Query valuation; Inverted index
Abstract
To augment the information retrieval process, a model is proposed that facilitates simple contextual indexing for large-scale standard text corpora. An edge index graph model is presented, which clusters documents based on a root index and an edge index. The proposed system makes intelligent information retrieval possible: the querying process gives users proactive help through a knowledge base. The query is supported with automatic phrase completion and word suggestions. A thesaurus is used to make the search meaningful. This model can be utilized for document retrieval, clustering, and phrase browsing. © 2008 Elsevier B.V. All rights reserved.
0950-7051/$ - see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.knosys.2008.03.010

1. Introduction

There is an explosive increase in the amount of information in the modern world. Large text corpora have to be sifted through to retrieve the information that is useful to the user. The process of information retrieval can be made intelligent by helping users in their search for information. Any intelligent information retrieval system consists of a graphical user interface through which the user interacts with the system, a search and inference engine that performs the retrieval, and a knowledge base that acts as the source of information [1]. In the proposed edge index graph model, the root index and the edge index, which provide the information, form the knowledge base. The search and inference engine is simulated to perform intelligent information retrieval from the knowledge base.

An index is used to locate a document swiftly. An edge index graph is used to retrieve the information in a very short time. Clustering groups the data into clusters of similar items, which simplifies retrieval [2].

2. Background

The common form of indexing a document was forward indexing. A forward index consists of the document identifier and the words occurring in the document. This becomes inconvenient as the number of documents increases. To cope with the increased number of documents, the inverted index has been the standard method for supporting queries
on large text databases. There is no practical alternative to the inverted index that is fast and efficient [3]. An inverted index consists of a vocabulary, containing each distinct word v that occurs in the documents, and a set of inverted lists, one for each v. For ranking, the list must contain, for each document d, the frequency of occurrence of v in d. To allow phrase querying, it must also contain the sequence of ordinal positions at which v occurs in d [4]. Thus the inverted list for the term v is

$$d_1, c_1 : o_1, o_2, \ldots, o_n \;\; \ldots \;\; d_m, c_m : o_1, o_2, \ldots, o_n$$

where $d_i$ is the document identifier, $c_i$ is the count of occurrences of the term in document $d_i$, and each $o_j$ is a position of the term in document $d_i$. Using simple coding techniques, this information can be compressed to around 25% of the size of the indexed data, that is, about one-sixth of the size of an uncompressed representation [5].

For document indexing applications, current trends in computer technology, in which the ratio of processor speed to disk access time is increasing, favor inverted files. Inverted files are superior in speed, space, and functionality [3].

The next word index is a special-purpose structure for phrase querying that is faster than an inverted index. It has no additional in-memory space overheads [7]. It is a search structure designed to accelerate the processing of word pairs. The most common phrase length is two words. In addition, a phrase query is never shorter than two words, and therefore every phrase query can be decomposed into pairs of words. These pairs of words can be combined to give the contextual meaning of a phrase. A next word index consists of a vocabulary of distinct words and, for each word $v_1$, a next word list and a position list. The next word list consists of each word $v_2$ that succeeds $v_1$ anywhere in the database, interleaved with pointers into the position list. For each pair $v_1 v_2$ there is a sequence of locations (document identifier and position within the document) at which the pair occurs; these sequences are concatenated to give the position list. The vocabulary is held in a structure such as a B-tree [6].

3. Root index

Root words are the content words of the document that remain after the document cleanup process is completed. The document cleanup process consists of the following standard methods.

3.1. Stemming
The words are reduced to their stems by removing suffixes, since terms with a common stem usually have similar meanings. Suffix stripping reduces the total number of terms in the root index, and hence the size and complexity of the data in the system, which is always advantageous. Words sharing a stem contribute the same meaning to the content of the document, and hence their variants can be ignored. Stemming can be done using a manually created stem dictionary or the popular Porter stemming algorithm [7], which reduces a word to its root word. The stem dictionary can be used to reduce a word to its root word, as it gives meaningful stems. Stemming for the word Connect is:

Connection (minus) ion = Connect
Connected (minus) ed = Connect
Connecting (minus) ing = Connect
Connects (minus) s = Connect
Connections (minus) ions = Connect

3.2. Case folding

Case folding changes all the words into lower case. For example, Magnet is changed to magnet.

3.3. Stop list

A stop list is a list of words that are not considered root words because they are uninformative or potentially misleading. Usually they are conjunctions, determiners, prepositions, etc. A stop list dictionary is maintained to check for and remove the stop words. This removes words that do not contribute to the content of the document, such as she, for, the, and, and you.

Words that have fewer than three letters are also ignored. After the cleanup, the resulting root words, which form the content words of the document, are indexed as a root index with a root identifier given to each root word. The root identifier allows the root words to be retrieved effectively wherever required.

4. Edge index

The edge index consists of distinct word pairs. Each pair is made up of a first word and a next word. For example, the word pair Magnetic Lines is composed of two words:

$v_1$ = Magnetic
$v_2$ = Lines

Fig. 1. Root words: $v_1$ (magnetic) and $v_2$ (lines).

We call the root word $v_1$ the first word, and $v_2$ the next word of $v_1$. The edge index organizes the first words and next words such that each first word is associated with a set of edge index entries for its next words. Finding the word Magnetic and retrieving the edge index is thus equivalent to answering the phrase query Magnetic Lines. In textual documents, the index terms are usually the root words of the root index that occur in the document. The edge index for root word v in a document d is:

$$\{\, d,\; f_{d,v},\; e_{d,v},\; [o_1, o_2, \ldots, o_{f_{d,v}}] \,\}$$

where d is the identifier of the document containing the root word v, $f_{d,v}$ is the frequency of appearance of v in d, $e_{d,v}$ is the identifier of root word v occurring in document d, and $o_i$ is a location in d at which v is observed. This edge index stores the link information of the edge index graph and makes the information easy to retrieve.

5. Edge index graph

The edge index graph is a directed graph based on graph theory. It consists of a set of nodes that represent the root words in the document collection and a set of edges that represent the next word links between words. It is defined as follows (see Figs. 1 and 2).

The edge index graph is a directed graph G = (V, E) where

V is a set of nodes $\{v_1, v_2, \ldots, v_n\}$, where each node $v_i$ represents a root word in the root index.
E is a set of edges $\{e_1, e_2, \ldots, e_m\}$, such that each edge $e_i$ is an ordered pair of nodes $(v_i, v_j)$. There is an edge from $v_i$ to $v_j$ if and only if the word $v_j$ appears immediately after the word $v_i$ in any document [8].

The definition of the graph implies that the number of nodes in the graph is the number of root words in the document collection. Edges in the graph carry information about the documents they appear in and the next word information. If a phrase appears more than once in a document, the frequency of the next word making up the phrase is increased.

Assume that a sentence of m words appears in a document as the word sequence $\{v_1, v_2, \ldots, v_m\}$. The sentence is represented in the graph by a path from $v_1$ to $v_m$, such that $\{(v_1, v_2), (v_2, v_3), \ldots, (v_{m-1}, v_m)\}$ are edges in the graph. Path information is stored in the edges along the path to uniquely identify each next word. To better illustrate the graph structure, Fig. 3 presents a simple example graph that represents three documents. Each document contains a number of sentences with some overlap between the documents.

Fig. 2. Simple edge index graph: node $v_1$ (magnetic) is linked to node $v_2$ (lines) by edge $e_1$.
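The cleanup steps of Section 3 and the edge index entry of Section 4 can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the paper: the stop list holds only the five sample words quoted in Section 3.3, stemming is naive suffix stripping over the suffixes of the Connect example (a stand-in for a stem dictionary or the Porter algorithm [7], so that lines becomes line), positions are counted over the cleaned word sequence, and the names `clean`, `stem` and `index_document` are hypothetical.

```python
import re
from collections import defaultdict

# Sample stop words quoted in Section 3.3; a real stop list is far longer.
STOP_WORDS = {"she", "for", "the", "and", "you"}
# Suffixes of the Connect example, longest first (naive stripping, not Porter).
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Case folding, punctuation removal, stop list, short-word filter, stemming."""
    words = re.findall(r"[a-z]+", text.lower())
    kept = [w for w in words if w not in STOP_WORDS and len(w) >= 3]
    return [stem(w) for w in kept]


def index_document(doc_id, text, root_index, edge_index):
    """Assign root identifiers and record one (d, f, e, positions) entry per root word."""
    positions = defaultdict(list)
    for pos, word in enumerate(clean(text), start=1):
        root_index.setdefault(word, len(root_index) + 1)  # new root identifier
        positions[word].append(pos)
    for word, offsets in positions.items():
        edge_index[word].append((doc_id, len(offsets), root_index[word], offsets))


root_index = {}                 # root word -> root identifier
edge_index = defaultdict(list)  # root word -> list of (d, f, e, [o_1, ...])
index_document(1, "The Magnetic lines of force and the magnetic field.",
               root_index, edge_index)
print(edge_index["magnetic"])
```

In this sketch the entry for magnetic records document 1, a frequency of 2, root identifier 1 and the two positions at which the word occurs in the cleaned text.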
As seen from Fig. 3, an edge is created between two nodes only if the words represented by the two nodes appear successively in any document. Thus, phrases map into paths in the graph. Random dotted lines represent sentences from document 1, dotted lines represent sentences from document 2, and vertical dashed lines represent sentences from document 3. A phrase is two words occurring together in a document. If a phrase appears more than once in a document, the frequency of the individual words making up the phrase is increased.

The edge index graph is built incrementally by processing one document at a time. When a new document is introduced, it is scanned to form the root index, and the graph is updated with the new sentence information as necessary. New words are added to the graph and connected with other nodes to reflect the sentence structure. The graph building process becomes less memory demanding when a new document introduces no new words, or very few. At this point the graph becomes more stable, and the only operation needed is to update the sentence structure in the graph to accommodate the new sentences. The root index maintains the node information of the graph. The edge index structure represents the edges. A phrase can be traced by moving along the next word links in the edge index.

6. Graph construction

The edge index graph is built incrementally by processing one document at a time. When the first document is introduced, it is scanned sequentially; the root index is updated with the new content words, and the edge index stores the link information of the next word following each content word. When the next document is added to the graph, its root words are compared with the nodes already created. If a node for the word exists, a link to it is created. If not, a new node is created and added to the graph by a link. For example, in Fig. 3, the first document is processed, the nodes magnetic, field, earth and lines are created, and the edge index is created for these nodes. The count of a link is increased when an already existing node is processed. When the second document is read, the new node diamagnetic is created and linked with the node magnetic. Likewise, the node force is created and linked with the node lines. Thus each document is added to the graph incrementally. The point to note is that the construction is incremental: new nodes and new edges are added easily whenever a new document is introduced.
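The incremental construction described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the paper: the sentences are those listed with Fig. 3, already cleaned; their grouping into three documents is inferred from the description in this section; and the class `EdgeIndexGraph` and its method `add_document` are hypothetical names.

```python
from collections import defaultdict

# Sentences of Fig. 3 after cleanup. The grouping into documents is an
# assumption inferred from the walk-through in Section 6.
DOCUMENTS = {
    1: [["magnetic", "field"], ["earth", "magnetic", "field"],
        ["magnetic", "field", "lines"]],
    2: [["diamagnetic", "magnetic", "field"], ["magnetic", "lines", "force"]],
    3: [["magnetic", "poles"], ["like", "poles", "attract"],
        ["earth", "magnetic", "poles"]],
}


class EdgeIndexGraph:
    def __init__(self):
        self.nodes = {}  # root word -> root identifier
        # (first word, next word) -> {document id: frequency of the pair}
        self.edges = defaultdict(lambda: defaultdict(int))

    def add_document(self, doc_id, sentences):
        """Add one document and return the root words it introduced."""
        new_nodes = []
        for sentence in sentences:
            for word in sentence:
                if word not in self.nodes:  # create a node only for unseen words
                    self.nodes[word] = len(self.nodes) + 1
                    new_nodes.append(word)
            for first, nxt in zip(sentence, sentence[1:]):
                self.edges[(first, nxt)][doc_id] += 1  # create or update the link
        return new_nodes


graph = EdgeIndexGraph()
created = {doc_id: graph.add_document(doc_id, sentences)
           for doc_id, sentences in DOCUMENTS.items()}
for doc_id, words in created.items():
    print(f"document {doc_id} created nodes: {words}")
```

Under these assumptions the first document creates the nodes magnetic, field, earth and lines, the second adds only diamagnetic and force, and every repeated word pair merely increments a frequency on an existing edge.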
Unlike the traditional phrase matching techniques usually found in the information retrieval literature, the edge index graph provides complete information about full phrase matching between every pair of documents. While traditional phrase matching methods aim at searching for and retrieving documents that have phrases matching a specific query, the edge index graph also provides information about the degree of overlap between every pair of documents in the set. The edge index graph further provides automatic completion of a phrase, based on the next word that is to follow it, and completion of a word that is only partly entered in the query, by analyzing the edge index.

Fig. 3. Edge index graph for three documents. Nodes: poles, like, attract, earth, magnetic, lines, force, diamagnetic, field.
Document 1: Magnetic field; Earth magnetic field; Magnetic field lines.
Document 2: Diamagnetic magnetic field; Magnetic lines force.
Document 3: Magnetic poles; Like poles attract; Earth magnetic poles.

7. Indexing algorithm

The algorithm indexes a set of documents to create the edge index graph. The documents are cleaned and the content words are retrieved by stemming, case folding, stop word elimination, and punctuation elimination. A root index is formed from the content words. An edge index graph model is created for all the root words in the documents to be indexed. Each word and its next word are retrieved from the content words of the document. The corresponding nodes in the graph are located. If a node does not exist, a new node is formed. An edge is created between the nodes if the edge does not exist. If the edge already exists, its frequency is incremented by one.

The input is the set of documents. The algorithm creates an edge index graph for the set of documents. Two objects, the root index and the edge index, are created and stored in the database for future reference.

8. Challenges

The number of nodes in the graph is exactly the number of unique root words in the data set. As the number of words increases, the size of the network also increases incrementally. The maximum number of edges in a graph is $v^2$, where v is the number of root nodes in the graph. In terms of memory usage, compared to the vector space model, the model will use memory as large as the number of non-empty entries in a term-by-document vector space model matrix. Assume that

n is the number of documents in the data set, and
m is the number of unique terms in the data set.

The memory requirement of the model is approximately:

$$S(n, m) = m^2 n$$

Retrieval of the information is faster because only the nodes named in the query are accessed, and the processing becomes easier. The index is also easily expandable through the addition or deletion of nodes and edges. Thus the model is scalable and dynamic, though the memory requirement increases with the number of unique terms in the documents.
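As a rough check of the size bound, and of the pairwise phrase overlap mentioned before Section 7, the following Python sketch counts nodes and edges for the small example of Fig. 3. The grouping of the sentences into documents is an assumption inferred from Section 6, and the variable names are hypothetical; the $v^2$ bound is read here as the bound for a directed graph in which self-loops are allowed.

```python
from itertools import combinations

# Sentences of Fig. 3 after cleanup; the grouping into documents is assumed.
DOCUMENTS = {
    1: [["magnetic", "field"], ["earth", "magnetic", "field"],
        ["magnetic", "field", "lines"]],
    2: [["diamagnetic", "magnetic", "field"], ["magnetic", "lines", "force"]],
    3: [["magnetic", "poles"], ["like", "poles", "attract"],
        ["earth", "magnetic", "poles"]],
}

# (first word, next word) -> set of documents in which the pair occurs
edge_docs = {}
for doc_id, sentences in DOCUMENTS.items():
    for sentence in sentences:
        for pair in zip(sentence, sentence[1:]):
            edge_docs.setdefault(pair, set()).add(doc_id)

nodes = {word for pair in edge_docs for word in pair}
v = len(nodes)
edge_bound = v * v  # at most v^2 directed edges when self-loops are allowed

# Word pairs shared by each pair of documents (their phrase overlap).
overlap = {
    (a, b): sorted(pair for pair, docs in edge_docs.items()
                   if a in docs and b in docs)
    for a, b in combinations(sorted(DOCUMENTS), 2)
}

print(f"{v} nodes, {len(edge_docs)} edges, bound {edge_bound}")
for docs, pairs in overlap.items():
    print(docs, pairs)
```

For this example the graph is far sparser than the bound suggests, which is consistent with the remark that memory use follows the number of non-empty entries rather than the worst case.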
9. Performance

To evaluate the quality of the proposed edge index graph model, a simulation of the model was developed and its performance was studied. A simulation is a partial implementation of an algorithm, complete enough to allow measurement of performance but easier to undertake than a full-scale experiment. The system was developed in Microsoft Visual Basic 6.0, and a sample database was created from a collection of records from the web. The root index and edge index objects were stored in a Microsoft Access database. A retrieval system was developed to determine the functionality and query efficiency of the index.

9.1. Applicability

The edge index graph structure can be used for intelligent retrieval of information from documents. It can be applied to various schemes of retrieval.

9.1.1. Word suggestions

Given a word, the next words that are likely to succeed it can be suggested based on the edge index. This functionality cannot be provided with a conventional inverted index, other than by the costly mechanism of accessing all documents in which the word occurs. For example, if the word magnetic is given, then the words field, poles or lines can be suggested based on the edge index.

9.1.2. Phrase completion

The edge index graph can be used for phrase browsing. Given a phrase of several words, the edge index graph can be used to identify all next words for the phrase, as follows. The phrase is processed to find all locations of the root words. The nodes of the graph are retrieved as a cluster. The next words and positions of the nodes in the phrase are pruned to give only the next words that follow the phrase.

An edge index graph can also be used for phrase completion. It is used to find whether two given words occur in the same document k words apart, and the edge index can then fill in the gap. For example, a user might be interested in finding information about magnetic force. All of these query terms are common in a hypothetical database of articles on Physics, so the query might find no relevant documents that match the phrase. However, exploration of the word Magnetic with the index reveals that Magnetic force does not exist, but that the phrase Magnetic lines of force is common. The system can alternatively use this phrase to retrieve documents related to the initial query.

9.1.3. Word completion

An edge index graph can also be used for word completion in the context of a phrase. That is, given a phrase such as Magnetic Fi, the index can suggest a completion such as Field. The next words of the word magnetic are analyzed and the most probable one is selected to complete the word.

9.1.4. Contextual retrieval

A thesaurus is used to retrieve the words related to the query word to enable meaningful retrieval. The given word is compared with the thesaurus to get the root word present in the document collection. For example, suppose the word operate is stored in the root index. Related words like perform, proceed, work or act are not present in the collection, so a query on any of them results in the word operate being accessed in the document collection. The context of the word can be expanded to the sentence, the paragraph, or the document as a whole, based on the indexing of the words, and a semantic network of words that are close to each other can be obtained.

9.2. Scalability

The edge index graph is built incrementally by processing one document at a time. As new documents are added, the graph is expanded to add new root words. A new document requires the scrutiny or creation of only those root words that appear in that document, not of every node in the graph. When a new document introduces no new words, or very few, only the edge information of the graph needs to be updated. Granularity of the index is the resolution to which the term locations are recorded within each document. The location of the words in the documents is recorded, and hence the edge index is granular to the resolution of the root words in the document.

9.3. Extensibility

The query can be extended to facilitate intelligent retrieval of data from the edge index graph. Storage in and retrieval from the graph are easier. The length of the phrase to be browsed can be extended by specifying the desired length. The query retrieval time is reduced, as only the node of the root word is accessed and the edge index verified. In an inverted index, the whole list has to be processed, or metadata created, to enable fast retrieval. Thus the documents are located in a very short interval of time using an edge index graph model (see Figs. 4 and 5).

The edge index graph was created for a number of documents and was found to be efficient. As the number of documents increases, the time for indexing reduces, as there is only modification of the edge index to update the frequency and no new node is created.
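The retrieval aids of Section 9.1 can be sketched on the sentences of Fig. 3 with a plain next-word table. The sketch makes some simplifying assumptions that go beyond the paper: suggestions are ranked by pair frequency over the whole collection, gap filling follows next-word links across the collection rather than within a single document, and the function names `suggest_next`, `complete_word` and `complete_phrase` are hypothetical.

```python
from collections import Counter, defaultdict, deque

# Cleaned sentences listed with Fig. 3 (stop words such as "of" already removed).
SENTENCES = [
    ["magnetic", "field"], ["earth", "magnetic", "field"],
    ["magnetic", "field", "lines"], ["diamagnetic", "magnetic", "field"],
    ["magnetic", "lines", "force"], ["magnetic", "poles"],
    ["like", "poles", "attract"], ["earth", "magnetic", "poles"],
]

next_words = defaultdict(Counter)  # first word -> Counter of its next words
for sentence in SENTENCES:
    for first, nxt in zip(sentence, sentence[1:]):
        next_words[first][nxt] += 1


def suggest_next(word):
    """Word suggestion: next words of `word`, most frequent first."""
    return [w for w, _ in next_words.get(word, Counter()).most_common()]


def complete_word(previous, prefix):
    """Word completion: most frequent next word of `previous` starting with `prefix`."""
    for candidate in suggest_next(previous):
        if candidate.startswith(prefix):
            return candidate
    return None


def complete_phrase(first, last, max_gap=2):
    """Phrase completion: shortest chain of next-word links from `first` to
    `last` with at most `max_gap` words in between, or None if there is none."""
    queue = deque([[first]])
    while queue:
        path = queue.popleft()
        if len(path) > 1 and path[-1] == last:
            return path
        if len(path) - 1 <= max_gap:  # one more word still respects the gap limit
            for nxt in suggest_next(path[-1]):
                if nxt not in path:
                    queue.append(path + [nxt])
    return None


print(suggest_next("magnetic"))
print(complete_word("magnetic", "fi"))
print(complete_phrase("magnetic", "force", max_gap=0))
print(complete_phrase("magnetic", "force"))
```

With `max_gap=0` the direct pair magnetic force is not found, while the default search bridges the gap through lines, mirroring the Magnetic lines of force example in Section 9.1.2.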
Fig. 4. Phrase completion (graph fragment with the nodes magnetic, poles, lines, field and force).

Fig. 5. Word completion (graph fragment with the nodes magnetic, poles, lines and field).

Fig. 6. Indexing time (time in seconds, axis from 0 to 25, against the number of documents, axis from 1 to 50).

Fig. 6 gives the time taken to index the documents in the system. There is an increase in time only if new nodes are added. The edge index graph is very dynamic and makes it easy to update the stored data.

10. Summary

The root and edge index were accessed to retrieve documents from the document store. The document retrieval was found to be precise and fast. Word completion, word suggestion, phrase completion and meaningful search were possible. The retrieved documents could be arranged based on the frequency of occurrence of the phrase in the document. There is no false match elimination in this retrieval.

The retrieval of information is made intelligent by using automatic phrase completion, word suggestions and word completions. Contextual retrieval is made possible with the use of the thesaurus as a knowledge base. The documents could also be retrieved based on the context of the query given in the phrase of words. The model can be extended to include monitored phrase retrieval of any length and semantic indexing of data.

11. Future work

This model can be tested for the feasibility of web document clustering and indexing. The feasibility of using the edge index graph as a latent semantic indexing structure, instead of a mathematical structure that is not dynamic, can be studied further. The edge index can be modified to contain weight information, with a term weight based on the importance of the word to the document assigned to the nodes. The documents can be ranked based on other weight factors too. A spell checker could be included to enable automatic correction of words. Document clustering based on the edge index graph can be applied with different similarity measures, and the results compared with the existing clustering models.

References
[1] K. Meena, R. Dhanapal, Artificial Intelligence and Expert Systems, International Books, 2001.
[2] A.K. Jain, M.N. Murty, Data clustering: a review, ACM Computing Surveys 31 (3) (1999).
[3] J. Zobel, A. Moffat, K. Ramamohanarao, Inverted Files Versus Signature Files for Text Indexing, ACM, 2004.
[4] D. Bahle, H.E. Williams, J. Zobel, Efficient phrase querying with an auxiliary index, in: Proceedings of ACM-SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, August 2002, pp. 215–221.
[5] H.E. Williams, J. Zobel, D. Bahle, Fast phrase querying with combined indexes, ACM Journal V (N) (2004) 1–21.
[6] H.E. Williams, J. Zobel, P. Anderson, What's next? Index structures for efficient phrase querying, in: M. Orlowska (Ed.), Proceedings Australasian Database Conference, Springer-Verlag, Auckland, New Zealand, 1999, pp. 141–152.
[7] M.F. Porter, An algorithm for suffix stripping, Program 14 (3) (1980) 130–137.
[8] Narsingh Deo, Graph Theory with Applications to Engineering and Computer Science, Prentice Hall of India Pvt. Ltd., New Delhi, 1997.