Applied Soft Computing 13 (2013) 959–966
Probability based document clustering and image clustering using content-based image retrieval

M. Karthikeyan ∗, P. Aruna

Department of Computer Science and Engineering, Annamalai University, Annamalai Nagar, Chidambaram, Tamil Nadu, India
Article history: Received 28 September 2010; received in revised form 6 August 2012; accepted 18 September 2012; available online 3 October 2012.

Keywords: Document clustering; Word frequency; Content-based image retrieval; Major colour set; Global colour signature; Distribution block signature; Hue saturation value; Region of interest; RGB histogram-based image retrieval.
Abstract

Clustering of related or similar objects has long been regarded as a potentially useful way of helping users navigate an information space such as a document collection. Many clustering algorithms and techniques have been developed and implemented, but as the sizes of document collections have grown these techniques have not scaled to large collections because of their computational overhead. To solve this problem, the proposed system concentrates on an interactive text clustering methodology: probability based, topic oriented and semi-supervised document clustering. Since the web and various documents contain both text and large numbers of images, the proposed system also employs content-based image retrieval (CBIR) for image clustering to complement the document clustering approach. It suggests two kinds of indexing keys, major colour sets (MCS) and distribution block signatures (DBS), to prune away images irrelevant to a given query image. Major colour sets are related to colour information, while distribution block signatures are related to spatial information. After successively applying these filters to a large database, only a small number of high-potential candidates that are somewhat similar to the query image remain. The system then uses the quad modelling method (QM) to set the initial weights of the two-dimensional cells in the query image according to each major colour, and retrieves more similar images through a similarity association function based on these weights. The proposed system evaluates efficiency by implementing and testing the clustering results against the Dbscan and K-means clustering algorithms. Experiments show that the proposed document clustering algorithm performs with an average efficiency of 94.4% for various document categories.

© 2012 Elsevier B.V. All rights reserved.
1. Introduction

With the rapid development of information technology, the number of electronic documents and the digital content of documents exceed the capacity of manual control and management. People are increasingly required to handle wide ranges of information from multiple sources [1]. As a result, document clustering techniques are implemented by enterprises and organizations to manage their information and knowledge more effectively. Document clustering can be defined as the task of learning methods for categorizing electronic documents into automatically annotated classes based on their contents [2]. It is widely applicable in areas such as search engines, web mining, information retrieval and topological analysis. Document clustering is a critical component of research in text mining. Traditional document clustering includes: (a) extracting the feature vector of a document and (b) clustering the documents by parameters such as a similarity threshold and the number of clusters, etc.
∗ Corresponding author. Tel.: +91 9443665646. E-mail address: [email protected] (M. Karthikeyan).
http://dx.doi.org/10.1016/j.asoc.2012.09.013
Traditional document clustering, however, is unsupervised learning; it cannot effectively group documents according to the needs of the user [3]. So, the proposed system concentrates on a probability based, topic oriented and semi-supervised document clustering approach. Since the web and various documents contain a large number of images as well as text, it is also necessary to cluster the images, and content-based image retrieval (CBIR) is greatly needed for this purpose [4]. With the increased emphasis on multimedia applications, the production of large amounts of information has resulted in a large volume of images that need to be properly indexed for future retrieval. The literature reports various techniques for CBIR, and the most commonly utilized features are colour, shape and texture. The proposed system concentrates on image clustering using a content-based image retrieval system to give more meaning to the proposed probability based, topic oriented and semi-supervised document clustering method. The structure of this paper is as follows: Section 2 discusses some related research work regarding document clustering and content-based image retrieval. Section 3 describes how document
clustering is done by the probability based topic oriented and semi-supervised document clustering algorithm. Section 4 describes image clustering using the content-based image retrieval method. The experimental results are given in Section 5. Finally, conclusions and discussion are given in Section 6.
2. Related works

Document clustering is a powerful technique to detect topics and their relations for information browsing, analysis and organization. In recent studies, many new technologies have been introduced. Tseng [5] proposed an algorithm for cluster labelling to create generic titles based on external resources such as WordNet. This method first extracts category-specific terms as cluster descriptors, and these descriptors are then mapped to generic terms based on a hypernym search algorithm. Trappey et al. [1] developed a document classification and search methodology based on neural network technology, extracting key phrases from the document set by means of automatic text processing and determining the significance of key phrases according to their frequency in the text. Hao et al. [6] proposed a novel hierarchical classification method that generalizes support vector machine learning. Aliguliyev [7] developed a method that assigns weights to documents to improve the clustering solution, since document clustering has traditionally been investigated as a means of improving the performance of search engines by pre-clustering the entire corpus. Gong et al. [8] proposed a validity index-based method of adaptive feature selection, incorporating a new text stream clustering algorithm. Saracoglu et al. [9] developed a method for finding similar documents that uses predefined fuzzy clusters to extract feature vectors of related documents; the similarity measure is based on these vectors. In 2008, they proposed a new approach to searching for similar documents with multiple categories using fuzzy clustering, which uses a fuzzy similarity classification method and a multiple-categories vector method. Aliguliyev [7] also proposed a technique for automatic text summarization, showing that the summarization result depends not only on the optimized function but also on the similarity measure. Horng et al. [10] proposed a hierarchical fuzzy clustering decision tree for classification problems with a large number of classes and continuous attributes. Song et al. [11] developed a method that uses a genetic algorithm for text clustering based on ontology, evaluating the validity of various semantic measures. Karray and Kamel [12] proposed a new concept-based mining model that analyzes terms at the sentence, document and corpus levels and can effectively discriminate non-important terms with respect to sentence semantics. Chim and Deng [13] developed an efficient phrase-based document similarity for clustering documents; they used the phrase-based document similarity to compute the pairwise similarities of documents based on a suffix tree. Frolov et al. [14] introduced a neural-network-based algorithm for word clustering. Image classification deals with the problem of identifying an image in a large database, and it is desirable to classify and categorize image content automatically. Liu et al. [15] developed a region-based retrieval system with high-level semantic learning that supports both queries by keyword and queries by region of interest. Liu and Hua [16] proposed a new index structure and query processing technique to improve retrieval effectiveness and efficiency. Gosselin and Cord [17] provided an algorithm within a statistical framework to extend active learning for online content-based image retrieval. Li et al. [18] proposed a framework based on multi-label neighborhood propagation for region-based image retrieval.
Pradhan and Prabhakaran [19] proposed an efficient indexing approach for 3-D human motion capture data, supporting queries involving both sub-body motions and whole-body motions. Aptoula and Lefevre [20] presented two morphology-based approaches, one making use of granulometries computed independently for each sub-quantized colour, and another employing the principle of multi-resolution histograms for describing colour, using morphological levelling and watersheds. Zhang and Ye [21] proposed a new scheme to handle noisy positive examples by incorporating data cleaning methods and a noise-tolerant classifier.

3. System overview of probability based semi-supervised document clustering

Probability based topic oriented and semi-supervised document clustering is defined as follows: given a set S of n documents and a set T of k topics, the proposed system aims to partition the documents into k subsets S1, S2, . . ., Sk, each corresponding to one of the topics, such that (i) the documents assigned to each subset are more similar to each other than to the documents assigned to different subsets, and (ii) the documents of each subset are more similar to their corresponding topic than to the rest of the topics. The functional components and data flow of the proposed probability based topic oriented and semi-supervised document clustering method and of image clustering using content-based image retrieval are depicted in Fig. 1. The proposed method also concentrates on image clustering by adapting the CBIR method. The literature reports various techniques for CBIR, and the most commonly utilized features are colour, shape and texture. The major steps involved in the proposed system are given below:

1. Documents of various categories are collected and stored in the database.
2. All the words and images that appear in the documents are extracted and stored in a separate words database and image database with their corresponding categories.
3. From the words database, distinct words are identified and their probability is calculated.
4. During the clustering process, the documents are classified and clustered based on the higher probability of words.
5. From the image database, similar images are retrieved based on the given query image by using the content-based image retrieval system.
6. During image clustering, the global colour signature and distribution block signature are extracted from the database images.
7. For the query image, the major colour set and distribution block signature are extracted.
8. The MCS and DBS of the query image are compared with the GCS and DBS of the database images, and image clustering is done based on the similarity.
9. Finally, the results are analyzed and compared with the results obtained by the existing algorithms Dbscan and K-means for document clustering, and with the RGB histogram-based image retrieval method for image clustering.

3.1. Document clustering by probability based topic oriented and semi-supervised clustering algorithm

The proposed document clustering method groups the documents according to the user's need. The main steps are: (1) design a multiple-attributes topic structure to represent the user's need; (2) make a topic-semantic annotation for each document, and then compute the topic-semantic similarity between documents; (3) compute the probability of distinct words; (4) group documents based on the maximum probability of distinct words.
M. Karthikeyan, P. Aruna / Applied Soft Computing 13 (2013) 959–966
961
[Fig. 1 flow: a selected category or topic and the text documents feed a word-and-image extraction step. The word path runs through tokenization, structural filtering, finding distinct words, calculating distinct-word frequencies and probabilities, determining the maximum probability of distinct words, calculating the topic count and displaying results. The image path runs through the image database, GCS and DBS extraction for database images, MCS and DBS extraction for query images, similarity computation and display of results.]

Fig. 1. Functional components of the proposed probability based topic oriented and semi-supervised document clustering and image clustering using content-based image retrieval.
The main objective is to reduce the dimensionality of the feature vectors. Dimensionality reduction of feature vectors is a hotspot of research in text mining: the dimensionality of document vectors may reach thousands or even tens of thousands, which results in a huge time cost for document clustering. In the proposed system, however, distinct words of maximum probability are mapped directly to an attribute of the topic, so the dimensionality is reduced effectively. The proposed system consists of three modules:

- Training.
- Probabilistic scanning.
- Testing.

3.1.1. Training

Five different categories of documents are used for training as a sample document corpus: business documents, education documents, politics documents, medical documents and sports documents. It is, however, possible to extend the categories. The aim is to derive higher-level concepts from the words of the different categories of the document corpus in order to populate the knowledge base (database). In the first task, words are extracted
from the training documents and matched with other words or with those already existing in the database. To achieve this goal, a two-stage analysis pipeline is defined:

Tokenization: This is the very first linguistic analysis step. It consists of breaking the free text into a sequence of separate words and punctuation symbols (tokens). Its input is natural language text and its output is the list of extracted tokens.

Structural filtering: This stage uses the output of the tokenization and keeps, discovers or discards words according to contextual information. The actual module is a rule compiler which applies filtering based on rules such as word length less than 3 characters or greater than 20 characters. The proposed system provides an option to omit stop words like "can", "are", "has", "with", "the", "they", "which", "have", etc. Similarly, words above 15 or 20 characters are omitted because such words will not be distinct words for the document clustering process. The filtered words are called distinct words and are stored in the database according to the category selected by the user; the probability of occurrence is then calculated for each distinct word.
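As an illustration only (not the authors' implementation), the tokenization and structural-filtering stages described above could be sketched in Python as follows; the stop-word list is a partial one from Table 1, the 3/20-character length limits are the ones mentioned in the text, and all function names are hypothetical:

import re

# Partial stop-word list taken from Table 1; the full system uses nearly 500 stop words.
STOP_WORDS = {"can", "are", "has", "with", "the", "they", "which", "have"}

def tokenize(text):
    # Break free text into separate word tokens (punctuation is discarded here).
    return re.findall(r"[A-Za-z]+", text.lower())

def structural_filter(tokens, min_len=3, max_len=20):
    # Keep words whose length lies within the limits and which are not stop words.
    return [t for t in tokens
            if min_len <= len(t) <= max_len and t not in STOP_WORDS]

# Example usage: distinct words of one training document.
distinct_words = set(structural_filter(tokenize("The system has five categories of documents")))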
Input  : Set S of N documents and a set T of K topics
Output : Array of distinct words and array of distinct word counts based on K
Method :
 1. read S, N;              // Read the training documents one by one to split all words
 2. read T, K;              // Read the topics
 3. preprocess D to get Wi  // Preprocess to identify distinct words
 4. for each Wi in D
 5.   for each K of T
 6.     Pi = Wi / Wj        // Probability calculation of distinct words
 7.     count = count + 1   // The count is calculated per topic
 8.   end
 9. end
10. return Wi;
11. return count;

Fig. 2. Probability based topic oriented and semi-supervised document clustering algorithm.
For a set S of N documents and a set T of K topics, the probability of a distinct word is calculated as Pi = Wi/Wj, where Wi denotes the number of occurrences of the distinct word in a training document and Wj denotes the total number of distinct words in that training document. Similarly, for the K topics of set T, the probabilities of occurrence of distinct words are calculated for the set S of N documents.

3.1.2. Probabilistic scanning

The probabilistic scan works as the classification of a document. It is done by comparing the probabilities of occurrence of the distinct words in the database; based on these probabilities, a count for each category is calculated. The proposed method calculates a probability for distinct words by measuring similarity between documents and evaluating clustering partitions. In this context, and more generally throughout information retrieval, a commonly used measure of similarity is obtained by representing documents as normalized vectors. Each dimension of the vector corresponds to a distinct word in the union of all words, and a document is then represented as a vector containing the normalized frequency counts of the words in it. Intuitively, this measure tries to capture the degree of word overlap between two documents. Fig. 2 shows the proposed probability based topic oriented and semi-supervised document clustering algorithm.

3.1.3. Testing

The proposed probability based document clustering method is compared with the existing Dbscan and K-means clustering algorithms. K-means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into k clusters; its objective function locally minimizes the sum of squared distances between the data points and their corresponding cluster centers. The key idea of the Dbscan algorithm is that, for each point of a cluster, the neighbourhood of a given radius has to contain at least a minimum number of points, that is, the density in the neighbourhood has to exceed a predefined threshold. This algorithm needs three input parameters: k, the neighbour list size; Eps, the radius that delimits the neighbourhood area of a point (Eps-neighbourhood); and MinPts, the minimum number of points that must exist in the Eps-neighbourhood. The clustering process is based on the classification of the points in the dataset as core points, border points and noise points, and on the use of density relations between points (directly density-reachable, density-reachable, density-connected) to form the clusters.
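Before turning to the efficiency measure, the training and probabilistic-scanning steps of the proposed method (Sections 3.1.1 and 3.1.2) can be sketched as follows. This is a minimal illustration, not the authors' code: it aggregates the Wi/Wj counts per topic rather than per individual training document, and the helper names are hypothetical.

from collections import Counter, defaultdict

def train(corpus):
    # corpus: dict mapping topic -> list of token lists (one list per training document).
    # Returns an estimate of P(word | topic) following the Wi/Wj formula of Section 3.1.1.
    probabilities = defaultdict(dict)
    for topic, documents in corpus.items():
        counts = Counter(w for doc in documents for w in doc)
        total = sum(counts.values())           # Wj: total distinct-word occurrences for the topic
        for word, wi in counts.items():        # Wi: occurrences of this word
            probabilities[topic][word] = wi / total
    return probabilities

def classify(document_tokens, probabilities):
    # Probabilistic scanning: accumulate a score per topic and assign the maximum.
    scores = {topic: sum(p.get(w, 0.0) for w in document_tokens)
              for topic, p in probabilities.items()}
    return max(scores, key=scores.get)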
3.1.4. Efficiency

For comparing the proposed method with the existing algorithms Dbscan and K-means, efficiency is calculated using the formula:

Efficiency = (Number of documents classified correctly) / (Total number of documents stored in the database)

4. System overview of image clustering using content-based image retrieval method

Query technique and content comparison technique are the two types of technique used in CBIR. The common features extracted from images are colour, shape and texture. Retrieving images based on colour similarity is achieved by computing a colour histogram for each image. Shape does not refer to the shape of the image as a whole but to the shape of a particular region that is being sought; shapes are often determined by first applying segmentation or edge detection to an image.

4.1. Image clustering using signature-based colour spatial approach

4.1.1. Major colour sets

The hue saturation value (HSV) colour space is used in the signature-based colour spatial approach. HSV is an intuitive colour space in the sense that each component contributes directly to visual perception. The human visual system is more sensitive to hue than to saturation and value, so hue should be quantized more finely than saturation and value. In the experiments, the HSV space is uniformly quantized into 18 bins for hue, 3 bins for saturation and 3 bins for value, and by adding 4 greys, a total of 166 colours is used to represent the images. The query image is partitioned into 32 × 32 cells of equal size and, in each cell, a quantized joint HSV histogram is computed to extract the most frequent (highest peak) bin as the dominant colour of that cell. In general, many images tend to have a small object area (the region of interest, ROI) and a large background. The proposed system extracts two major colours MCb from the background area and two major colours MCc, not duplicated in MCb, from the ROI.
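The paper does not give the exact quantization boundaries, so the following Python sketch (Pillow/NumPy, hypothetical grey-saturation threshold) only illustrates how a 166-colour quantized HSV palette and the per-cell dominant colours on a 32 × 32 grid could be obtained:

import numpy as np
from PIL import Image

def dominant_colours(image_path, grid=32, hue_bins=18, sat_bins=3, val_bins=3, grey_sat=25):
    # Convert to HSV (Pillow yields H, S, V each in 0..255).
    hsv = np.asarray(Image.open(image_path).convert("HSV"), dtype=np.float32)
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    # Chromatic pixels: joint bin index over 18 x 3 x 3 = 162 colours.
    colour = (np.minimum(h * hue_bins // 256, hue_bins - 1) * sat_bins * val_bins
              + np.minimum(s * sat_bins // 256, sat_bins - 1) * val_bins
              + np.minimum(v * val_bins // 256, val_bins - 1)).astype(int)
    # Achromatic (grey) pixels: 4 extra bins (162..165) chosen by value only.
    grey = np.minimum(v * 4 // 256, 3).astype(int) + hue_bins * sat_bins * val_bins
    colour = np.where(s < grey_sat, grey, colour)          # 166 colours in total
    # Per-cell highest-peak bin on a grid x grid partition of the image.
    rows = np.array_split(np.arange(colour.shape[0]), grid)
    cols = np.array_split(np.arange(colour.shape[1]), grid)
    dominant = np.zeros((grid, grid), dtype=int)
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            cell = colour[np.ix_(r, c)].ravel()
            dominant[i, j] = np.bincount(cell, minlength=166).argmax()
    return dominant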
The four hues of the four major colours are represented in a bit stream called the colour signature, a one-dimensional array of 19 bits. The first 18 bits represent the four hues of the MCS and the final bit represents the grey information of the MCS. Initially, all 19 bits are set to 0; the MCS colour signature is obtained by setting the bits associated with the four hues of the MCS to 1, and it is used on the query side. The GCS colour signature is used on the database side and captures the global colour information of the image. It also contains 19 bits, like the MCS. The GCS signature is constructed from all hues in the 32 × 32 cells: each bit of the GCS is set to 1 when the corresponding hue exists in the cells.

4.1.2. Distribution block signature

A second filtering process is performed to compute further similarity by considering colour spatial information. First, an image is partitioned into 4 × 4 blocks of equal size; each block contains 64 cells, since the image was partitioned into 32 × 32 cells. Then, with respect to the first colour in the MCS, a block is assigned 1 when it contains that colour and 0 when it does not. This produces a 16-bit stream for the first major colour. The same process is repeated for the rest of the MCS, resulting in a total of 4 × 16 bit streams.

4.1.3. First filtering of images

An image is first entered into the database and then the two image features, GCS and DBS, are extracted. During the retrieval process, the query image is presented to the retrieval system, which extracts the MCS (major colour set) and the DBS (distribution block signature) of the query image. In the first filtering, the MCS of the query image and the GCS of the database images are compared using a bitwise logical AND operation, and database images that do not contain all MCS of the query are filtered out. Let Qi and Di denote the signature of colour i for query image Q and database image D, respectively. Then, the two images have colour i at the same particular region only if the corresponding bits in both signatures are set; otherwise the two images are not similar at that region. Let the colour sets of Q and D be Cq and Cd, respectively. Then, the similarity measure SIMbasic between Q and D for a colour i ∈ Cq can be determined as:
SIMbasic(Q, D, i) = BitSet(Qi ∧ Di) / BitSet(Qi), if colour i ∈ Cd; 0, otherwise,

where BitSet denotes the number of bits in a bit stream that are set and ∧ represents the bitwise logical AND operation. Now, if a large part of the cells in Q has the same colour as that in D, the computed similarity will be close to 1. The similarity measure between the two images Q and D is then given by

SIMbasic(Q, D) = Σ SIMbasic(Q, D, i), for all i ∈ Cq.
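As an illustration of this first filter (not the authors' code), the 19-bit signatures can be held in plain Python integers and compared with bitwise operations. The sketch below aggregates the per-colour SIMbasic terms into a single bit-field comparison between the query MCS and a database GCS; signature construction is assumed to have been done elsewhere.

def bitset(x):
    # Number of set bits in a bit stream represented as an integer.
    return bin(x).count("1")

def sim_basic(mcs_bits, gcs_bits):
    # Fraction of the query's MCS bits that are also set in the database image's GCS;
    # hues missing from the database image contribute 0.
    if mcs_bits == 0:
        return 0.0
    return bitset(mcs_bits & gcs_bits) / bitset(mcs_bits)

def first_filter(query_mcs, database_gcs, threshold=1.0):
    # Keep only database images whose GCS contains every major colour of the query.
    return [idx for idx, gcs in enumerate(database_gcs)
            if sim_basic(query_mcs, gcs) >= threshold]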
The MCS of the query and the GCS of the database images are compared using a bitwise logical AND operation; only database images that have all the MCS of the query are qualified for further computation.

4.1.4. Second filtering of images

In the second filtering, the database images that do not have a spatial distribution similar to that of the query image are filtered out by using a bitwise logical XOR operation. Note that the XOR operation is 0 only when the two compared bits are equal (both 1 or both 0). Images with a similar spatial distribution are chosen as potential candidates. For one major colour, summing the 16 bits of the XOR result produces a value ranging from 0 to 15; a result of 0 means the two images have completely the same colour-spatial information, whereas 15 means the two images have completely dissimilar colour-spatial information. Considering all four major colours, the sum ranges from 0 to 63. In the experiments, a threshold of 43 was a good choice for filtering out dissimilar images.

4.1.5. Quad modelling technique

The quad modelling technique is used for the final similarity computation of images. Four 32 × 32 cell arrays are created for all images in the database; a cell has the value 1 if a major colour of the query image is the same as the major colour of the database image. The query image is partitioned recursively and an associated QM weight matrix is formed.

4.2. RGB colour histogram-based image retrieval

4.2.1. RGB colour model

In the RGB model, each colour appears in its primary spectral components of red, green and blue. An image represented in the RGB colour model consists of three component images, one for each primary colour. The number of bits used to represent each pixel in RGB colour space is called the pixel depth. Consider an RGB image in which each of the red, green and blue component images is stored as an 8-bit image; under these conditions each RGB colour pixel is said to have a depth of 24 bits. The term full-colour image is used to denote a 24-bit RGB colour image, and the total number of colours in a 24-bit RGB image is 16,777,216. A colour histogram is a multidimensional histogram of the distribution of colour in an image and is used to compare images in many applications. An image histogram is a graphical representation of the number of pixels in an image as a function of their intensity. Histograms are made up of bins, and each bin represents a certain range of intensity values. The histogram is computed by examining all pixels in the image and assigning each to a bin depending on its intensity; the final value of a bin is the number of pixels assigned to it.

4.2.2. The algorithm

1. Read images from the database and extract RGB-format pixel information from the images.
2. Create 48-bin normalized histograms for each of the RGB components of the images read from the database; thus, each image has three histograms associated with it.
3. Read a query image and extract its RGB-format pixel information.
4. Create histograms for each of the RGB components of the query image.
5. Compute a Euclidean distance by comparing the query image histograms to those of each image in the database.
6. Retrieve the images in the database similar to the query image based on the histogram count for each colour map (intensity) entry.

A minimal sketch of these steps is given after the list.
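The following Python sketch illustrates steps 1–6 above under simple assumptions (48-bin per-channel histograms normalized by pixel count, Euclidean distance over the concatenated bins); the function names are hypothetical and this is not the authors' implementation:

import numpy as np
from PIL import Image

def rgb_histograms(image_path, bins=48):
    # One normalized histogram per R, G and B channel (steps 2 and 4).
    pixels = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.float64)
    hists = [np.histogram(pixels[..., ch], bins=bins, range=(0, 256))[0] for ch in range(3)]
    total = pixels.shape[0] * pixels.shape[1]
    return np.concatenate(hists) / total

def retrieve(query_path, database_paths, top_k=10):
    # Steps 3, 5 and 6: rank database images by Euclidean distance between histograms.
    query = rgb_histograms(query_path)
    distances = [(np.linalg.norm(query - rgb_histograms(p)), p) for p in database_paths]
    return [p for _, p in sorted(distances)[:top_k]]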
The colour histogram is one of the most popular image indexing and retrieval methods. All colour images have three channels, one for each RGB component. Once extracted, the bits representing each component of an RGB pixel are used to create a histogram. Each histogram consists of 48 bins, where each bin defines a small range of pixel values; the value stored in each bin is the number of pixels in the image that fall within that range. These ranges represent different levels of intensity of each RGB component. The values in each bin are normalized by dividing by the total number of pixels in the image.

4.2.3. Efficiency

For comparing the two retrieval methods, the system efficiency is calculated using the following formula:

Efficiency = (Total number of similar images retrieved) / (Total number of similar images in the database)
Table 1
A partial list of stop words.

are, all, be, by, during, from, how, as, after, because, can, each, have, if, at, around, been, could, else, be, in, a, anyone, besides, beyond, every, her, it, about, again, both, between, few, here, last, all, always, but, below, for, him, me, more, was, is, and, any, he, then

Table 2
Training process results of probability based topic oriented and semi-supervised document clustering algorithm.

Sl. no.   Category    Number of documents used for training   Total number of distinct words extracted
1         Business    400                                     630,836
2         Education   400                                     652,416
3         Politics    400                                     548,010
4         Medical     400                                     562,046
5         Sports      400                                     612,486

Table 3
Accuracy of probability based topic oriented and semi-supervised document clustering algorithm.

Category    Number of documents used for testing   Number of documents classified correctly   Percentage
Business    125                                     123                                        98.40%
Education   125                                     122                                        97.60%
Politics    125                                     118                                        94.40%
Medical     125                                     115                                        92.00%
Sports      125                                     112                                        89.60%

Table 4
Accuracy of Dbscan algorithm.

Category    Number of documents used for testing   Number of documents classified correctly   Percentage
Business    125                                     120                                        96.00%
Education   125                                     119                                        95.20%
Politics    125                                     114                                        91.20%
Medical     125                                     110                                        98.00%
Sports      125                                     109                                        87.20%
5. Experimental results

5.1. Probability based topic oriented and semi-supervised document clustering algorithm

In the experiments, a total of 2000 documents of five different categories are used for training. The system was trained according to the category selected by the user. The categories are: (1) business documents, (2) education documents, (3) political documents, (4) medical documents, and (5) sports documents. The proposed system omits nearly 500 stop words; the stop words in the document corpus are removed before distinct-word extraction, and a partial list of stop words is presented in Table 1. Then, the distinct words are extracted and their frequency of occurrence is calculated. Based on the frequency, the probability is calculated. Table 2 shows the training process results of the probability based document clustering algorithm. In total, 3,005,794 distinct words were identified during the experiments. According to the category selected by the user, the probability was calculated, and based on the probability of the distinct words, clustering is done during the testing phase. For testing, 600 documents (125 documents in each category) are used. Table 3 shows the accuracy achieved by the probability based topic oriented and semi-supervised document clustering algorithm for each category of documents. The same documents are used as test cases for the Dbscan and K-means clustering algorithms, respectively. Table 4 shows the accuracy achieved by the Dbscan algorithm and Table 5 shows the accuracy achieved by the K-means algorithm. Compared with the existing algorithms Dbscan and K-means, the proposed probability based topic oriented and semi-supervised document clustering algorithm yields better results. Fig. 3 shows the overall performance comparison of all three clustering algorithms.
Table 5
Accuracy of K-means algorithm.

Category    Number of documents used for testing   Number of documents classified correctly   Percentage
Business    125                                     121                                        96.80%
Education   125                                     116                                        92.80%
Politics    125                                     115                                        92.00%
Medical     125                                     111                                        88.80%
Sports      125                                     110                                        88.00%
Fig. 3. Overall performance comparison of all the three clustering algorithms.
The Dbscan algorithm performs with an average efficiency of 93.52% for all categories, and the K-means algorithm performs with an average efficiency of 91.68% for all categories. A time complexity analysis is given in Table 6, and the number of iterations taken by each algorithm is compared in Table 7. The proposed document clustering algorithm performs with an average efficiency of 94.4% for all categories of documents.
Table 6
Time complexity analysis: time taken (ms) by Dbscan, K-means and the proposed probability based topic oriented and semi-supervised document clustering algorithm.

Sl. no.   Original data points (distinct words)   Dbscan    K-means   Proposed algorithm
1         500                                     40,250    42,150    39,112
2         1000                                    55,133    55,245    53,457
3         2000                                    70,874    69,589    68,875
4         3000                                    82,586    82,982    80,895
5         4000                                    92,562    93,586    91,587
6         5000                                    99,582    99,879    98,457
Table 7
Number of iterations taken by Dbscan, K-means and the proposed probability based topic oriented and semi-supervised document clustering algorithm.

Sl. no.   Original data points (distinct words)   Dbscan   K-means   Proposed algorithm
1         500                                     189      192       40
2         1000                                    545      538       131
3         2000                                    780      775       181
4         3000                                    1024     1028      215
5         4000                                    1280     1290      244
6         5000                                    1553     1560      293
Fig. 4. Partial list of database images retrieved from the training documents.
Fig. 5. Process of query image loading.
The result shows that the proposed probability based topic oriented and semi-supervised document clustering algorithm outperforms the other two existing algorithms.

5.2. Image clustering by content-based image retrieval method

In the experiments, 1000 images stored in the image database during the training phase are considered. Fig. 4 shows a partial list of database images retrieved from the training documents. Fig. 5 shows the process of query image loading. Fig. 6 shows the overall performance comparison of image clustering using CBIR with the signature-based colour spatial approach and CBIR with the RGB colour histogram-based approach.

Fig. 6. Overall performance comparison of the two image retrieval methods.

6. Conclusion

Traditional unsupervised document clustering approaches often fail to obtain a good clustering solution when users want to group documents according to their needs. Focusing on this problem, the proposed method uses topic-oriented, semi-supervised, probability-based document clustering and image clustering using CBIR to fulfil the user requirement. Further, the results obtained from the proposed method were compared with well-known clustering algorithms, Dbscan and K-means. Experiments show that the proposed document clustering algorithm performs with an average efficiency of 94.4% for various document categories. A probabilistic similarity score covering arbitrary functions over words in documents (such as phrases and logical operations) may be implemented in future to improve the probability accuracy. Additionally, in the proposed system, a signature-based colour spatial approach was used for image clustering via CBIR. The MCS, GCS and DBS colour signatures are used for efficient retrieval of similar images from the image database, and the HSV colour model was used to represent an image. The results are compared with the RGB histogram-based image retrieval method. In future, a similarity measure for the QM matrix is to be calculated to capture the amount of overlap between query and database images. Retrieval effectiveness may be defined in terms of precision and recall rate; for more accurate retrieval, a relevance feedback approach and an evaluation of the system using precision and recall measures may be implemented. Experiments show that the new approach is feasible and effective.

References
[1] A.J.C. Trappey, F.-C. Hsu, C.V. Trappey, C.-I. Lin, Development of a patent document classification and search platform using a back-propagation network, Expert Systems with Applications 31 (2006) 755–765.
[2] D. Isa, V.P. Kallimani, L.H. Lee, Using the self organizing map for clustering of text documents, Expert Systems with Applications 36 (2009) 9584–9591.
[3] J. Qiu, C. Tang, Topic oriented semi-supervised document clustering, in: Proceedings of the SIGMOD Workshop on Innovative Database Research, 2007.
[4] H.-W. Yoo, H.-S. Park, D.-S. Jang, Expert system for colour image retrieval, Expert Systems with Applications 28 (2005) 347–357.
[5] Y.-H. Tseng, Generic title labeling for clustered documents, Expert Systems with Applications 37 (2010) 2247–2254.
[6] P.-Y. Hao, J.-H. Chiang, Y.-K. Tu, Hierarchically SVM classification based on support vector clustering method and its application to document categorization, Expert Systems with Applications 33 (2007) 627–635.
[7] R.M. Aliguliyev, Clustering of document collection – a weighting approach, Expert Systems with Applications 36 (2009) 7904–7916.
[8] L. Gong, J. Zeng, S. Zhang, Text stream clustering algorithm based on adaptive feature selection, Expert Systems with Applications 38 (2011) 1393–1399.
[9] R. Saracoglu, K. Tutuncu, N. Allahverdi, A fuzzy clustering approach for finding similar documents using a novel similarity measure, Expert Systems with Applications 33 (2007) 600–605.
[10] S.-C. Horng, F.-Y. Yang, S.-S. Lin, Hierarchical fuzzy clustering decision tree for classifying recipes of ion implanter, Expert Systems with Applications 38 (2011) 933–940.
[11] W. Song, C.H. Li, S.C. Park, Genetic algorithm for text clustering using ontology and evaluating the validity of various semantic similarity measures, Expert Systems with Applications 36 (2009) 9095–9104.
[12] F. Karray, M.S. Kamel, An efficient concept-based mining model for enhancing text clustering, IEEE Transactions on Knowledge and Data Engineering 22 (10) (2010).
[13] H. Chim, X. Deng, Efficient phrase-based document similarity for clustering, IEEE Transactions on Knowledge and Data Engineering 20 (9) (2008).
[14] A.A. Frolov, D. Husek, P.Y. Polyakov, Recurrent-neural-network-based Boolean factor analysis and its application to word clustering, IEEE Transactions on Neural Networks 20 (7) (2009).
[15] Y. Liu, D. Zhang, G. Lu, Region-based image retrieval with high-level semantics using decision tree learning, Pattern Recognition 41 (2008) 2554–2570.
[16] D. Liu, K.A. Hua, Fast query point movement techniques for large CBIR systems, IEEE Transactions on Knowledge and Data Engineering 21 (5) (2009).
[17] P.H. Gosselin, M. Cord, Active learning methods for interactive image retrieval, IEEE Transactions on Image Processing 17 (7) (2008).
[18] F. Li, Q. Dai, W. Xu, G. Er, Multilabel neighborhood propagation for region-based image retrieval, IEEE Transactions on Multimedia 10 (8) (2008).
[19] G.N. Pradhan, B. Prabhakaran, Indexing 3-D human motion repositories for content-based retrieval, IEEE Transactions on Information Technology in Biomedicine 13 (5) (2009).
[20] E. Aptoula, S. Lefevre, Morphological description of colour images for content-based image retrieval, IEEE Transactions on Knowledge and Data Engineering 22 (10) (2010).
[21] J. Zhang, L. Ye, Content based image retrieval using unclean positive examples, IEEE Transactions on Image Processing 18 (10) (2009).