Supporting content-based image retrieval and computer-aided diagnosis systems with association rule-based techniques


Data & Knowledge Engineering 68 (2009) 1370–1382





Marcela X. Ribeiro a, Pedro H. Bugatti a, Caetano Traina Jr. a, Paulo M.A. Marques b, Natalia A. Rosa b, Agma J.M. Traina a

a Department of Computer Science, University of São Paulo at São Carlos, Brazil
b School of Medicine of University of São Paulo at Ribeirão Preto, Brazil


Article history: Available online 7 July 2009

Keywords: Association rules; Content-based image retrieval; Computer-aided diagnosis; Feature selection; Associative classifier; Discretization

Abstract

In this work, we take advantage of association rule mining to support two types of medical systems: Content-based Image Retrieval (CBIR) systems and Computer-Aided Diagnosis (CAD) systems. For content-based retrieval, association rules are employed to reduce the dimensionality of the feature vectors that represent the images and to improve the precision of similarity queries. We refer to the association rule-based method proposed here to improve CBIR systems as Feature selection through Association Rules (FAR). To improve CAD systems, we propose the Image Diagnosis Enhancement through Association rules (IDEA) method, in which association rules are employed to suggest a second opinion or a preliminary diagnosis of a new image to the radiologist. A second opinion obtained automatically can either accelerate the diagnosing process or strengthen a hypothesis, increasing the probability that a prescribed treatment is successful. Two new algorithms are proposed to support the IDEA method: one to pre-process low-level features and one to propose a preliminary diagnosis based on association rules. We performed several experiments to validate the proposed methods. The results indicate that association rules can be successfully applied to improve CBIR and CAD systems, empowering the arsenal of techniques supporting medical image analysis in medical systems.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

In medicine, two types of resources are becoming widely used: Content-based Image Retrieval (CBIR) and Computer-Aided Diagnosis (CAD) systems. The purpose of CAD is to increase the accuracy of diagnosis, as well as to improve the consistency of image interpretation, by using the computer results as a second opinion. Similar to CAD systems, CBIR uses information extracted from images to represent them. However, the main purpose of a CBIR system is to retrieve "cases" or images similar to a given one. Analyzing past similar cases and their reports can improve the radiologist's confidence when elaborating a new image report, besides making the training and the diagnosing processes faster. Moreover, CAD and CBIR systems are very useful in teaching medicine. Currently, image mining is the focus of many researchers in the data mining and information retrieval fields and has achieved prominent results. A major challenge in the image mining field is to effectively relate low-level features (automatically extracted from image pixels) to high-level semantics based on human perception.

Preliminary results of this work were presented at IEEE CBMS'08 in [1,2]. This work has been supported by FAPESP, CNPq and CAPES. Corresponding author e-mail addresses: [email protected] (M.X. Ribeiro), [email protected] (P.H. Bugatti), [email protected] (C. Traina), [email protected] (P.M.A. Marques), [email protected] (N.A. Rosa), [email protected] (A.J.M. Traina).


doi:10.1016/j.datak.2009.07.002



Association rules have been successfully applied to other research areas, e.g. business, and can reveal interesting patterns relating low-level and high-level image data as well. In this work, association rules are employed to support both CAD and CBIR systems. Comparisons between association rules and other mining techniques are performed in the experiments section. The results show that association rules outperform the other mining techniques in most cases, and that their use is simpler and more robust. Our results corroborate the claim of Holte [3] that, if properly set, the simplest mining techniques work well in most cases.

This paper is structured as follows. Section 2 presents the background. Section 3 details the proposed methods FAR and IDEA. Section 4 discusses the experiments and results achieved. Finally, Section 5 summarizes the conclusions and future directions for this work.

2. Background and related work

CBIR techniques use intrinsic visual features of color, shape and/or texture to represent the images; in content-based image retrieval, these features are indexed and compared instead of the images themselves. When working with image datasets, performing exact searches is not useful, since looking for the same data already under analysis has very few applications. Thus, the retrieval of complex data types, such as images, is mainly performed regarding similarity. The most well-known and useful types of similarity queries are the k-nearest neighbor query (e.g. "given the thorax X-ray of John Doe, find in the image database the 3 images most similar to it") and the range query (e.g. "given the thorax X-ray of John Doe, find in the image database the images that differ up to 2 units from it"). Similarity search is performed by comparing the feature vectors using a distance (or dissimilarity) function to quantify how close (or similar) each pair of vectors is.

From the image processing point of view, it is important to gather as many features as possible to represent the images, yielding vectors with hundreds or even thousands of features per image. However, a large number of features actually represents a problem. It leads to the "dimensionality curse" [4], where the indexing structures degrade and the significance of each feature decreases, making the process of storing, indexing and retrieving extremely time consuming. Moreover, in several situations many features are correlated, meaning that they bring redundant information about the images, which can deteriorate the ability of the system to correctly distinguish them. To avoid this problem, feature selection techniques can be employed to reduce the feature vector size. Another problem is the "semantic gap", where the low-level features automatically extracted from images do not satisfactorily represent the semantic interpretation of the images. In fact, several challenges in CBIR systems are still open and researchers are endeavoring to solve them, e.g. "what features best represent a given set of images?" and "what distance function best approximates the human perception of similarity among the images of a given dataset?" Actually, most research activities on CBIR systems are focused on determining the features to represent the images, relegating the distance function to a second level of importance.
However, the efficiency and efficacy of an image retrieval technique are significantly affected by the inherent ability of the distance function to separate data. Considering two feature vectors F = {f_1, ..., f_n} and G = {g_1, ..., g_n}, some representative distance functions from the literature are summarized in Table 1. In this paper we propose to employ association rules to weight features according to their significance, promoting continuous feature selection over the feature vectors employed to represent the images. Continuous feature selection techniques assign continuous weights to each feature, allowing the most important features to have the highest weights when computing the similarity between two images. The continuous feature selection approach significantly improves the precision of content-based queries.

Table 1
Descriptions of some relevant distance functions.

Minkowski family (L_p):
  d_Lp(F, G) = (Σ_{i=1}^{n} |f_i − g_i|^p)^{1/p}
  Usage: the members of the L_p family are widely employed in the literature. The Euclidean distance (L_2) corresponds to the human notion of spatial distance. The L_1 distance (City Block or Manhattan) corresponds to the sum of the differences along the coordinates. The L_∞ (Linf or Chebychev) takes the maximum difference over the coordinates.

Weighted Minkowski:
  d_Lp(F, G) = (Σ_{i=1}^{n} w_i (f_i − g_i)^p)^{1/p}, where w = (w_1, w_2, ..., w_n) is the weighting vector
  Usage: used when the features have different influences on the similarity comparison.

Jeffrey divergence:
  d_J(F, G) = Σ_{i=1}^{n} (f_i log(f_i / m_i) + g_i log(g_i / m_i)), where m_i = (f_i + g_i)/2
  Usage: it is symmetric and presents a better numerical behavior; it is also stable and robust with regard to noise and the size of histogram bins [5].

χ² (chi-square):
  d_χ²(F, G) = Σ_{i=1}^{n} (f_i − m_i)² / m_i, where m_i = (f_i + g_i)/2
  Usage: it emphasizes elevated discrepancies between two feature vectors and measures how improbable the distribution is.

Canberra:
  d_C(F, G) = Σ_{i=1}^{n} |f_i − g_i| / (|f_i| + |g_i|)
  Usage: a comparative Manhattan distance, since the absolute difference of the feature values is divided by their absolute sum.
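For concreteness, the Table 1 distances can be sketched directly over numpy feature vectors. This is a minimal illustration (the function names are ours, not from the paper); the small eps guards, which the formulas do not mention, avoid division by zero and log(0) on sparse histogram bins.

```python
import numpy as np

def minkowski(F, G, p=2):
    """L_p distance (p=1: Manhattan, p=2: Euclidean)."""
    return np.sum(np.abs(F - G) ** p) ** (1.0 / p)

def weighted_minkowski(F, G, w, p=2):
    """Weighted L_p: each coordinate difference is scaled by its weight w_i.
    We use |f_i - g_i| so the sum stays non-negative for odd p."""
    return np.sum(w * np.abs(F - G) ** p) ** (1.0 / p)

def jeffrey(F, G, eps=1e-12):
    """Jeffrey divergence; eps avoids log(0) on empty histogram bins."""
    m = (F + G) / 2.0
    return np.sum(F * np.log((F + eps) / (m + eps)) + G * np.log((G + eps) / (m + eps)))

def chi_square(F, G, eps=1e-12):
    """Chi-square distance, emphasizing elevated discrepancies."""
    m = (F + G) / 2.0
    return np.sum((F - m) ** 2 / (m + eps))

def canberra(F, G, eps=1e-12):
    """Comparative Manhattan: each |f_i - g_i| normalized by |f_i| + |g_i|."""
    return np.sum(np.abs(F - G) / (np.abs(F) + np.abs(G) + eps))
```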



One of the most well-known feature selection algorithms in the literature is Relief [6]. The general principle of Relief is to measure the quality of features according to how well their values distinguish instances of different classes. One limitation of the Relief algorithm is that it works only for datasets with binary classes. This limitation is overcome by Relief-F [7], which also tackles datasets with multi-valued classes. Another well-known feature selection technique is the Decision Tree Method (DTM) proposed by Cardie in [8], which employs the C4.5 algorithm [9] to select features. C4.5 adopts a forward search to generate feature subsets, using the entropy criterion to evaluate them. DTM selects the features that appear in the pruned decision tree as the best subset, i.e., the set of features appearing in any path to a leaf node.

Association rule mining is one of the most important tasks in the data mining field, and it has been extensively studied and applied to market basket analysis. The problem of mining association rules was first stated in [10], as follows. Let I = {i_1, ..., i_n} be a set of literals called items. A set X ⊆ I is called an itemset. Let R be a table with transactions t involving elements that are subsets of I. An association rule is an expression of the form X → Y, where X and Y are itemsets; X is called the body or antecedent of the rule, and Y is called the head or consequent of the rule. Support is the ratio between the number of transactions of R containing the itemset X ∪ Y and the total number of transactions of R. Confidence is the fraction of the transactions containing X that also contain Y. The problem of mining association rules, as first stated, consists of finding the association rules that satisfy the restrictions of minimum support (minsup) and minimum confidence (minconf) specified by the user.

When dealing with image features, which consist of continuous attributes, a type of association rule that considers continuous values is necessary. A recent type of continuous association rule is the statistical association rule, which is generated using statistical measurements. Our proposed method employs the StARMiner algorithm [11] to mine statistical association rules from features of a training dataset. The mined rules are used to weight the features according to their relevance, producing a new and enhanced representation of the images. The StARMiner algorithm associates classes to the features with the highest power to distinguish the images. Let T be a dataset of medical images, x_j an image class, T_xj ⊆ T the subset of images of class x_j, and f_i the i-th feature of the feature vector F. Let μ_fi(Z) and σ_fi(Z) be, respectively, the mean and standard deviation of the values of feature f_i in a subset of images Z. The algorithm uses three thresholds defined by the user: γ_min, the minimum confidence to reject the hypothesis H0: μ_fi(T_xj) = μ_fi(T − T_xj), i.e., to consider the means μ_fi(T_xj) and μ_fi(T − T_xj) statistically different; Δμ_min, the minimum difference allowed between the average of feature f_i in images of class x_j and its average in the remaining dataset; and Δσ_max, the maximum standard deviation of f_i allowed in a given class. StARMiner mines rules of the form x_j → f_i if the hypothesis H0 is rejected and the conditions given in Eqs. (1) and (2) are satisfied.

|μ_fi(T_xj) − μ_fi(T − T_xj)| ≥ Δμ_min    (1)

σ_fi(T_xj) ≤ Δσ_max    (2)

A rule x_j → f_i returned by the algorithm relates a feature f_i to a class x_j in which the values of f_i have a statistically distinct behavior. This property indicates that f_i is an interesting feature to distinguish images of class x_j from the others. The features returned in the rules mined by StARMiner have a particular and uniform behavior in images of a given category. This is important because features presenting a uniform behavior over every image in the dataset, independently of the image category, do not contribute to categorizing the images and should be eliminated. Hence, the StARMiner rules are useful to reveal the relevance of the image features. The rules obtained by StARMiner are employed to perform continuous feature selection, as explained in Section 3.
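A minimal sketch helps make the thresholds concrete. The published algorithm's exact statistical test is not reproduced here; we assume Welch's t-test (scipy.stats.ttest_ind) as a stand-in for rejecting H0, and the function name and default thresholds are illustrative only:

```python
import numpy as np
from scipy import stats

def starminer_rules(X, y, gamma_min=0.95, d_mu_min=0.5, d_sigma_max=1.0):
    """Sketch of the StARMiner criterion: emit a rule (class -> feature) when the
    feature mean inside the class differs statistically from its mean outside the
    class, the mean gap is at least d_mu_min, and the in-class spread is small.
    X: (n_images, n_features) array; y: one class label per image."""
    rules = []
    for cls in np.unique(y):
        inside, outside = X[y == cls], X[y != cls]
        for i in range(X.shape[1]):
            # Welch's t-test for H0: equal means inside and outside the class
            _, p_value = stats.ttest_ind(inside[:, i], outside[:, i], equal_var=False)
            rejects_h0 = (1.0 - p_value) >= gamma_min
            mean_gap = abs(inside[:, i].mean() - outside[:, i].mean())
            if rejects_h0 and mean_gap >= d_mu_min and inside[:, i].std() <= d_sigma_max:
                rules.append((cls, i))  # rule: class -> feature index
    return rules
```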

CAD systems employ image mining, a more complex process than traditional data mining. Similarly to CBIR systems, image mining employs image processing algorithms to extract relevant features from the images, organizing them in feature vectors. The feature vectors are employed to model the images as transactions, which are used in the mining process. Most algorithms used in medical image analysis are based on classification methods. However, relevant research applying association rules in CAD systems has also been successfully developed.

An associative classifier was presented in [12]. It works as follows. In the pre-processing phase, images are cropped and enhanced using histogram equalization. The features mean, variance, skewness and kurtosis are extracted from the images and combined with other descriptors (e.g. breast position and type of tissue) to compose the image feature vectors used in the association rule mining process. The Apriori algorithm [13] is applied to the records. The rules are mined using low confidence values, and the classifier label is restricted to occur only in the head of the rules. The rules that can be generalized are pruned. Given a new example, the classifier counts the number of rules that are satisfied and chooses its class. The major drawback of this method is the low confidence allowed when mining the rules: low confidence can generate many rules that mislead the classifier results and make the classification process extremely slow.

Association rules are used in [14] to classify mammograms as follows. First, shape features are extracted from each image. A record combining the image features and the image classification (benign or malignant) is generated for each image. The features are discretized into equal-sized intervals. Association rules are mined with the restriction of not having a classification item in the body part. A new image is classified according to the number of rules matched and the confidence of the rules. A drawback of this technique is the discretization process, which uses intervals of the same size; this can cause the loss of information that is significant to mine rules with high confidence.

A framework to obtain association rules relating objects to categories of brain tumors was presented in [15]. It uses a method guided by a specialist to detect the Regions of Interest (ROIs). For each ROI, textual descriptions of features of shape



and size (among others) are used to build a table of image transactions. Each transaction in the table has the features of two objects, their relative position in the image and a descriptor indicating whether they are normal or abnormal. Constraints defined by domain experts are used to restrict some itemsets to occur only in the head or body of a rule, reducing the number of generated rules, which tends to get very large. The main drawback of this work is the difficult and subjective task of labeling the findings in the medical images.

Ordonez et al. [16] defined useful constraints and methods of association rule summarization to mine association rules involving medical data. Association rules are filtered using support, confidence and lift, where lift helps selecting rules with high predictive power. Continuous numeric data (e.g. temperature and pressure) are discretized with the help of a specialist, who determines the appropriate intervals. While Ordonez et al. work with high-level image data, our proposed method works with low-level data automatically extracted from the images.

In this paper, we detail the methods FAR and IDEA. FAR is a new method that employs association rules to reduce the dimensionality of the feature vectors that represent the images and to improve the precision of similarity queries. IDEA has the advantage of promoting feature selection and discretization in a single step, reducing the complexity of the subsequent steps of the method. Moreover, the method suggests a set of keywords to compose the diagnosis of a given image, and uses a measure of certainty to rank the keywords according to their probability of occurring in the final diagnosis of the image given by the radiologist. A prototype incorporating the IDEA method was tried out by radiologists, who demonstrated great interest in employing the system in their daily work.

3. Proposed methods

In this section we discuss the FAR and IDEA association rule-based methods. The FAR method is proposed to improve the precision of CBIR systems by promoting continuous feature selection, weighting the feature vectors according to the mined statistical association rules. The IDEA method uses association rules to suggest diagnoses for new images.

3.1. The FAR method

Feature selection through Association Rules (FAR) is a new method that incorporates association rules to promote continuous feature selection in medical image databases. An important question is how to use statistical association rules to weight the image features. Suppose that the images are classified into m high-level classes X = {x_1, x_2, ..., x_m}. For each feature f_i, StARMiner aims at finding rules of the form x_j → f_i; that is, StARMiner relates each feature f_i to each class x_j. If a rule x_j → f_i is found, it means that the feature f_i discriminates well the images of class x_j. Therefore, the most discriminative features f_i are those that generate rules x_j → f_i for every x_j ∈ X, meaning that they discriminate well all image classes. In the same way, the least discriminative features are those that do not generate any rule, meaning that they behave uniformly across all classes. Thus, to weight a feature f_i, the FAR method uses the number of mined rules in which f_i appears. Eq. (3), obtained empirically, shows the weight assigned to each feature f_i:

w_i = 10 · r_i + q    (3)

where r_i is the number of mined rules in which feature f_i appears, and q is a constant that takes the value q = 0 or q = 1. Using q = 0 means that it is desirable to remove the features that do not generate any rule; using q = 1 means that all features are kept and weighted by relevance. Therefore, when q = 0, an implicit process of dimensionality reduction is performed over the feature vector, and when q = 1, all features are weighted according to their relevance. The values w_i obtained are employed as the weights of the Weighted Minkowski distance. Hence, the distance function used in the FAR method is:

d_Lp(F, G) = (Σ_{i=1}^{n} (10 · r_i + q)(f_i − g_i)^p)^{1/p}    (4)
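Given the rules mined by StARMiner, Eqs. (3) and (4) reduce to a few lines. The following is a sketch under our own assumptions about the data layout, with rules stored as (class, feature index) pairs as in the StARMiner sketch above:

```python
import numpy as np
from collections import Counter

def far_weights(rules, n_features, q=0):
    """Eq. (3): w_i = 10 * r_i + q, where r_i counts the mined rules
    (class -> feature i). q=0 drops ruleless features; q=1 keeps them all."""
    counts = Counter(feat for _, feat in rules)
    return np.array([10 * counts.get(i, 0) + q for i in range(n_features)], dtype=float)

def far_distance(F, G, w, p=2):
    """Eq. (4): weighted Minkowski distance using the FAR weights."""
    return np.sum(w * np.abs(F - G) ** p) ** (1.0 / p)
```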

The FAR method is a supervised approach that deals with an inherent drawback of CBIR systems: the high dimensionality of the feature vectors. We amend this restriction by using the association rules to weight the features according to their relevance. The steps of the proposed method are illustrated in Fig. 1. The FAR method is executed in two phases: a training phase and a test phase. The training phase is composed of three steps: (1) feature extraction, (2) association rule mining, and (3) continuous feature selection. The test phase employs the weights found in the training phase to perform similarity searches.

3.2. The IDEA method

IDEA is also a supervised method; it mines association rules relating visual features automatically extracted from images to the reports given by radiologists about the training images. The reports are composed of a set of keywords. Fig. 2 shows the pipeline of IDEA, and Algorithm 1 summarizes its steps.



Fig. 1. Pipeline of the FAR method.


Fig. 2. Pipeline of the IDEA method.

Algorithm 1. The steps of the IDEA method.
Input: training images, a test image. Output: report (set of keywords).
1: Extract features of the training images
2: Execute the Omega algorithm
3: Mine association rules
4: Extract features of the test image
5: Execute ACE
6: Return the suggested report (set of keywords)

In the training phase (see Algorithm 1), features are extracted from the images, and the feature vectors are used to represent them (line 1). The feature vectors and the class of each training image are submitted to Omega, which removes irrelevant features from the feature vector and provides a discretization of the remaining ones (line 2). The class is the most important keyword chosen by the specialist to describe the image. Still in the training phase, each processed feature vector is merged with the diagnosis keywords of the corresponding training image, producing the transaction representation of that image. The transaction representations of all training images are submitted to the Apriori algorithm [13] for association rule mining (line 3), limiting the minimum confidence to high values. In the test phase (lines 4–6), the feature vector of the test image is extracted (line 4) and submitted to the ACE algorithm (line 5), which uses the association rules to suggest keywords to compose the diagnosis of the test image. We discuss each step of the IDEA method below.

3.3. Feature extraction

When dealing with medical images, the earliest phase of a CAD system requires extracting the main image features regarding a specific criterion. Essentially, the most representative features vary according to the image type (e.g. mammogram, brain or lung) and according to the focus of the analysis (e.g. to distinguish nodules or to identify brain white matter). The IDEA method can work with various types of medical images and with different analysis goals; however, for each type of image and goal, an appropriate feature extractor should be employed. The feature vectors evaluated in our approach are detailed in the experiments section.



Fig. 3. Cut points eliminated in Step 2 of the Omega algorithm, using Θ_min = 2.

3.4. The Omega algorithm

Omega is a novel supervised algorithm that performs discretization of continuous values. Omega processes each feature separately and discretizes a range of N sorted values in 4N steps, having linear cost on N. Let f be a feature and f_i be the value of feature f in image i. Omega uses a data structure that links each instance value f_i to the instance class label c_i; we refer to an image instance I_i as the pair (f_i, c_i). Let U_k and U_{k+1} be the limits of an interval T_k. We say that an instance I_i = (f_i, c_i) belongs to an interval T_k = [U_k, U_{k+1}] if and only if U_k < f_i < U_{k+1}.

In Step 1, Omega sorts the continuous values and defines the initial cut points. A cut point is placed before the smallest value and another cut point is placed after the highest value. Every time the value changes and a change in the class label occurs, a new cut point is created. Step 1 thus produces pure bins, whose entropy is equal to zero, minimizing the inconsistencies created by the discretization process. However, the number of bins produced in this first step tends to be very large and very susceptible to noise. A discretization that produces a huge number of intervals (in the worst case, as many as the original continuous values) is not desirable, because it does not add any gain to the learning algorithm. In Steps 2 and 3, Omega eliminates cut points in order to reduce the number of intervals, strongly controlling the inconsistency rate.

In Step 2, Omega restricts the minimum frequency that a bin must present, avoiding a huge number of cut points. Omega removes the right cut points of the intervals that do not satisfy the minimum frequency restriction given by an input parameter Θ_min. Only the last interval is allowed to not satisfy the minimum frequency restriction. The higher the value of Θ_min, the fewer bins result from this step. However, some caution should be taken when adjusting Θ_min, because the higher its value, the more inconsistencies are generated by the discretization process. Thus, it is important to keep this value low, even if only a small reduction in the number of bins is achieved; the next step of the algorithm assures a higher reduction in the number of bins while controlling the number of inconsistencies generated by the discretization. Fig. 3 shows an example of cut points found in Step 1 that are eliminated in Step 2 of Omega, using Θ_min = 2.

In Step 3, Omega fuses consecutive intervals, measuring the inconsistency rate to determine which intervals should be merged. Let M_Tk be the majority class of an interval T_k. Eq. (5) gives the inconsistency rate ζ_Tk of an interval T_k.

ζ_Tk = (|T_k| − |M_Tk|) / |T_k|    (5)

In Eq. (5), |T_k| is the number of instances in the interval T_k, and |M_Tk| is the number of instances of the majority class in the interval T_k. The Omega algorithm fuses consecutive intervals that have the same majority class and whose inconsistency rates are below or equal to an input threshold ζ_max (0 ≤ ζ_max ≤ 0.5). Fig. 4 shows an example of a cut point found in Step 2 (see Fig. 3) that is eliminated in Step 3, using ζ_max = 0.35. The inconsistency rates of the second and third intervals shown in Fig. 4 are, respectively, ζ_T2 = 0/2 = 0 and ζ_T3 = 1/3 = 0.33. Since T_2 and T_3 have the same majority class, i.e. M_T2 = M_T3 = "A", and ζ_T2 ≤ ζ_max and ζ_T3 ≤ ζ_max, the second and third intervals are fused. The cut points remaining after Step 3 are the final cut points returned by the algorithm.

In Step 4, Omega performs the feature selection task. Let T be the set of intervals in which a feature is discretized. For each feature, Omega computes the global inconsistency value ζ_G, according to Eq. (6).

ζ_G = (Σ_{T_k ∈ T} (|T_k| − |M_Tk|)) / (Σ_{T_k ∈ T} |T_k|)    (6)

The feature selection criterion employed by Omega removes from the set of features every feature whose global inconsistency value is greater than an input threshold ζ_Gmax (0 ≤ ζ_Gmax ≤ 0.5). Since the inconsistency of the features is the factor that most contributes to disturbing the learning algorithm, discarding the most inconsistent features can improve accuracy as well as speed up the learning algorithm.

Fig. 4. A cut point eliminated in Step 3 of the Omega algorithm, using ζ_max = 0.35.
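The four steps can be summarized in a simplified sketch for a single feature. This is our reading of the algorithm, not the authors' code: tie handling and the exact placement of interval limits are glossed over.

```python
from collections import Counter

def majority_class(bin_):
    """Majority class label of a bin of (value, label) pairs."""
    return Counter(lbl for _, lbl in bin_).most_common(1)[0][0]

def majority_count(bin_):
    """Number of instances of the majority class in a bin."""
    return Counter(lbl for _, lbl in bin_).most_common(1)[0][1]

def zeta(bin_):
    """Eq. (5): inconsistency rate of an interval."""
    return (len(bin_) - majority_count(bin_)) / len(bin_)

def omega_discretize(values, labels, theta_min=2, zeta_max=0.35):
    """Simplified sketch of Omega Steps 1-3 for one feature."""
    pairs = sorted(zip(values, labels))
    # Step 1: pure bins, cutting wherever value and class label change together
    bins = [[pairs[0]]]
    for prev, cur in zip(pairs, pairs[1:]):
        if cur[0] != prev[0] and cur[1] != prev[1]:
            bins.append([])
        bins[-1].append(cur)
    # Step 2: drop the right cut point of bins smaller than theta_min
    # (only the last interval may stay below the minimum frequency)
    merged = []
    for b in bins:
        if merged and len(merged[-1]) < theta_min:
            merged[-1].extend(b)
        else:
            merged.append(b)
    # Step 3: fuse neighbors sharing the majority class when both rates <= zeta_max
    fused = [merged[0]]
    for b in merged[1:]:
        if (majority_class(fused[-1]) == majority_class(b)
                and zeta(fused[-1]) <= zeta_max and zeta(b) <= zeta_max):
            fused[-1].extend(b)
        else:
            fused.append(b)
    return fused  # final bins; the cut points sit between consecutive bins

def global_inconsistency(bins):
    """Eq. (6): Step 4 drops a feature when this value exceeds zeta_Gmax."""
    return sum(len(b) - majority_count(b) for b in bins) / sum(len(b) for b in bins)
```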



3.5. Association rule mining

The IDEA method employs the Apriori algorithm [13] to mine association rules. The output of the Omega algorithm and the keywords of the reports of the training images are submitted to the Apriori algorithm. A constraint restricting the diagnosis keywords to the head of the rules is added to the mining process; the body of the rules is composed of indexes of the features and their intervals. The minimum confidence is set to a high value (greater than 97%). The mined rules are used as input to the ACE algorithm.
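To make the transaction representation concrete, the sketch below builds IDEA-style transactions and mines the simplest family of head-constrained rules (single-item bodies, matching the shape of the example rule shown in Section 4.2). The real method uses the full Apriori algorithm [13] with arbitrary bodies; the names and item syntax here are our own assumptions.

```python
from collections import defaultdict

def build_transactions(discretized, reports):
    """Each image becomes a transaction: feature items such as '3[0.60-0.80]'
    (feature index plus its Omega interval) merged with the report keywords.
    discretized: per image, a list of (lo, hi) intervals, one per kept feature."""
    return [
        {f"{i}[{lo:.2f}-{hi:.2f}]" for i, (lo, hi) in enumerate(feats)} | set(keywords)
        for feats, keywords in zip(discretized, reports)
    ]

def mine_single_item_rules(transactions, keywords, min_sup=0.005, min_conf=0.97):
    """Toy Apriori-style pass restricted, as in IDEA, to rules whose head is a
    diagnosis keyword; only single-feature-item bodies are tried here."""
    n = len(transactions)
    support = defaultdict(int)   # item -> number of transactions containing it
    joint = defaultdict(int)     # (item, keyword) -> co-occurrence count
    for t in transactions:
        for item in t - keywords:
            support[item] += 1
            for kw in t & keywords:
                joint[(item, kw)] += 1
    rules = []
    for (item, kw), cnt in joint.items():
        sup, conf = cnt / n, cnt / support[item]
        if sup >= min_sup and conf >= min_conf:
            rules.append((item, kw, sup, conf))  # rule: item -> keyword
    return rules
```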

3.6. The ACE algorithm

Before presenting the Associative Classifier Engine (ACE) algorithm, it is necessary to clarify some terms. We say that an image matches a rule if the image features satisfy the whole body of the rule; it partially matches a rule if the image features satisfy only part of the rule's body; and it does not match a rule if the image features satisfy no part of the rule's body. ACE is a special classifier, able to return multiple classes (keywords) when processing a test image. The ACE algorithm stores all itemsets (sets of keywords) belonging to the heads of the rules in a data structure. An itemset h is returned by ACE in the suggested diagnosis if the following conditions are satisfied:

M(h) ≥ 1  and  w = (3M(h) + P(h)) / (3M(h) + P(h) + N(h)) ≥ w_min

where M(h) is the number of matches of the itemset h, P(h) is the number of partial matches, and N(h) is the number of no-matches, all computed automatically. The variable w is the weight of the itemset and indicates the level of certainty that the itemset h will belong to the final image diagnosis given by a specialist: the higher the weight, the stronger the confidence that h belongs to the diagnosis of the image. A threshold of minimum weight w_min (0 ≤ w_min ≤ 1) is employed to limit the weight of an itemset in the suggested diagnosis. If w_min = 0, all itemsets that match at least one rule are returned. Fig. 5 shows an example of ACE working: in this example, M(h) = 1, P(h) = 1 and N(h) = 1 for the itemset h = {benign}; therefore, if 4/5 ≥ w_min, the itemset h = {benign} is returned by the algorithm, otherwise it is discarded.

Fig. 5. Example of ACE working.
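The matching logic above can be sketched in a few lines, assuming rules are stored as (body, head) pairs with the body a set of feature-interval items and the head a frozenset of diagnosis keywords (a layout of our choosing, not the paper's):

```python
from collections import defaultdict

def ace_suggest(image_items, rules, w_min=0.0):
    """Sketch of ACE: for each rule head h, count full matches M(h), partial
    matches P(h) and no-matches N(h) of the rule bodies against the test image
    items, then keep h when M(h) >= 1 and w = (3M+P)/(3M+P+N) >= w_min."""
    counts = defaultdict(lambda: [0, 0, 0])  # head -> [M, P, N]
    for body, head in rules:
        overlap = len(body & image_items)
        if overlap == len(body):
            counts[head][0] += 1   # full match
        elif overlap > 0:
            counts[head][1] += 1   # partial match
        else:
            counts[head][2] += 1   # no match
    suggestions = []
    for head, (m, p, n) in counts.items():
        if m >= 1:
            w = (3 * m + p) / (3 * m + p + n)
            if w >= w_min:
                suggestions.append((head, w))
    # keywords ranked by certainty of belonging to the final diagnosis
    return sorted(suggestions, key=lambda x: -x[1])
```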

4. Experiments

In this section, we present results of experiments performed to highlight the efficacy of both the FAR and IDEA methods.

4.1. Evaluating the FAR method

One important issue related to CBIR systems is how to evaluate their efficacy. A standard approach to evaluate the accuracy of similarity queries is the precision and recall (P&R) graph [17]. Precision and recall are defined in Eqs. (7) and (8), respectively, where TR is the total number of relevant images for a given query, TRS is the number of relevant images actually returned by the query, and TS is the total number of images returned by the query. As a rule of thumb when analyzing P&R graphs, the closer the curve to the top of the graph, the better the retrieval technique.

Precision = TRS / TS    (7)

Recall = TRS / TR    (8)
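Eqs. (7) and (8) can be computed along a k-NN ranking to obtain one P&R curve per query; averaging the curves over all query centers yields the plotted graphs. A minimal sketch (our own helper, with dist any function from Table 1):

```python
import numpy as np

def precision_recall_curve_knn(db_feats, db_labels, query_feat, query_label, dist):
    """One P&R curve for a single query: rank the database by distance to the
    query center and evaluate Eqs. (7) and (8) at every answer-set size k."""
    order = np.argsort([dist(f, query_feat) for f in db_feats])
    relevant_total = int(np.sum(db_labels == query_label))  # TR
    hits = 0
    points = []
    for k, idx in enumerate(order, start=1):
        hits += int(db_labels[idx] == query_label)          # TRS
        points.append((hits / relevant_total, hits / k))    # (recall, precision)
    return points
```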

To build the P&R graphs, we applied sets of k-nearest neighbor (k-NN) queries, using randomly selected images from the dataset as query centers and varying k from one to the size of the dataset. The distance functions used in the experiments were the ones described in Section 2. The feature vectors were acquired from the Haralick descriptors (a texture-based extractor) and indexed using the Metric Access Method (MAM) Slim-tree [18] to accelerate similarity query processing. The Haralick descriptors [19] are based on statistical moments obtained from the co-occurrence matrix, which has been largely used for texture-based image representation. The features obtained from the Haralick descriptors for our experiments were variance, step, entropy, energy, homogeneity, 3rd-order moment, and inverse variance. These descriptors were combined in a single feature vector with 140 elements.
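As an illustration of where such texture features come from, the sketch below computes a normalized co-occurrence matrix and two Haralick-style statistics. It is not the paper's extractor: the actual 140-element vector combines many more statistics over several displacements.

```python
import numpy as np

def cooccurrence_matrix(img, levels=16, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for one displacement (dx, dy).
    Haralick-style texture features are statistics over this matrix."""
    q = (img.astype(float) / img.max() * (levels - 1)).astype(int)  # quantize
    glcm = np.zeros((levels, levels))
    h, w = q.shape
    for y in range(h - dy):
        for x in range(w - dx):
            glcm[q[y, x], q[y + dy, x + dx]] += 1
    return glcm / glcm.sum()

def texture_features(glcm):
    """Two classic Haralick-style descriptors, shown only as examples."""
    i, j = np.indices(glcm.shape)
    energy = np.sum(glcm ** 2)
    homogeneity = np.sum(glcm / (1.0 + np.abs(i - j)))
    return energy, homogeneity
```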




Due to space limitations, we show the results obtained from one representative dataset using only one type of feature vector. In this section, we present the results obtained using the MRI dataset. It consists of 704 images of magnetic resonance (MR) and heart angiogram exams collected at the Clinical Hospital of the University of São Paulo at Ribeirão Preto. The dataset contains 8 categories of images: angiogram, axial pelvis, axial head, axial abdomen, coronal head, coronal abdomen, sagittal head, and sagittal spine. The MRI dataset was divided in two sets: the training set, composed of 176 images (25% of the MRI dataset), and the test set, composed of 528 images (75% of the MRI dataset).

The P&R graphs of Fig. 6 correspond to the experiments performed on the MRI dataset represented by the texture-based extractor. In Fig. 6, graphs (a), (b), (c), (d), (e) and (f) correspond to the results using the L_1, L_2, L_∞, χ², Jeffrey divergence, and Canberra distance functions, respectively, comparing our weighting approach with non-weighting ones. The P&R curves in the graphs of Fig. 6 were built executing similarity queries employing: (1) non-weighted features; (2) the StARMiner feature selection (removing the irrelevant features, but not weighting); (3) the FAR method with q = 0 (weighting features and removing the irrelevant ones); (4) the FAR method with q = 1 (weighting features, but not removing the irrelevant ones); (5) the Relief-F algorithm; and (6) the DTM algorithm.


Fig. 6. P&R graphs using the (a) L_1, (b) L_2, (c) L_∞, (d) χ², (e) Jeffrey divergence, and (f) Canberra distances, obtained over the dataset represented by texture-based features, employing: non-weighted features; the StARMiner feature selection; the FAR method with q = 0 (removing irrelevant attributes); the FAR method with q = 1; and the Relief-F and DTM feature selection algorithms.



In the proposed method, the use of q = 0 leads to dimensionality reduction of the feature vector, removing its redundant features; for this dataset, q = 0 led to a reduction of 20% in the feature vector size. Analyzing the graphs of Fig. 6, we observe that the proposed technique clearly improves the precision of similarity queries: the FAR method always outperformed the precision obtained by the original feature vector, by StARMiner and by the traditional feature selection algorithms Relief-F and DTM. In Fig. 6b, the proposed technique achieves a considerable precision gain of around 20% using q = 0 (promoting a dimensionality reduction of 20% of the feature vector size) at a recall level of 35%, and of 38% when q = 1. These results illustrate the ability of the proposed technique to improve the precision of similarity queries even when it reduces the dimensionality of the feature vectors. Continuing the analysis of the graphs of Fig. 6, we observe that the distance function producing the lowest precision values is L_∞ (Fig. 6c); even for this distance function, applying our technique leads to a considerable gain in precision for both q = 1 and q = 0. The distance function producing the highest precision values is Canberra (Fig. 6f); for the Canberra distance, the FAR method using q = 1 and q = 0 also presents a gain in precision.

Fig. 7 illustrates an example of a k-NN (k = 8) query, where the top left image is the query center. Fig. 7a shows the result using the original features, and Fig. 7b shows the result using our proposed method with q = 0 (performing dimensionality reduction of the feature vector). The images highlighted by a dashed line are false positives: a false positive is a returned image whose class differs from the class of the query center. Clearly, the FAR method provides an improvement in the results. It is important to note that the experiments were also executed using the χ², Jeffrey divergence and Canberra distances (see Fig. 6d–f). The results corroborate the effectiveness of FAR in improving the content-based retrieval of medical images and show that it can be extended to distance functions beyond the traditional Minkowski family and to other types of features (although not shown here), presenting a notable gain in the precision of similarity queries. Concluding, this experiment shows that FAR, in almost all cases, outperformed the precision obtained by traditional feature selection algorithms, such as Relief-F and DTM. Considering the results achieved, we argue that our weighting technique is well-suited to perform continuous feature selection in content-based retrieval of medical images, improving their precision.

4.2. Evaluating the IDEA method

Due to space limitations, we present only one experiment, using the ROI dataset, performed to validate the IDEA method in suggesting diagnoses for medical images. The experiment employed 10% of the images of the dataset for testing and the remaining images for training. The tuning parameters of the Omega algorithm were set to Θ_min = 2, ζ_max = 0.2 and ζ_Gmax = 0.3.
The values of minimum support minsup = 0.005 and minimum confidence minconf = 1.0 were used as Apriori input parameters. The value w_min = 0, which maximizes the ACE accuracy, was employed as the ACE input parameter.

The ROI dataset consists of 446 images of Regions of Interest (ROIs) comprising tumoral tissues, taken from mammograms collected from the Breast Imaging Reporting and Data System of the Department of Radiology of the University of Vienna (www.birads.at). These ROIs have long been employed to train students of radiology. Each image has a diagnosis composed of three main parts:
1. Morphology: mass (circumscribed, indistinct, spiculated); architectural distortion; asymmetric density; calcifications (amorph, pleomorph, linear, benign);
2. BI-RADS (Breast Imaging Reporting and Data System): six levels (0–5);
3. Histology: cyst, fibrosis, fatty tissue, etc. (25 keywords in total).

Fig. 7. An example of a k-NN (k = 8) query execution, where the top left image is the query center: (a) using the original features; (b) using the FAR method with q = 0. The images wrapped by a dashed line are false positives.


Table 2
Features extracted from the ROI dataset and their positions in the feature vector.

Feature                                 Position
Average intensity, contrast             1–2
Smoothness, skewness                    3–4
Uniformity, entropy                     5–6
Invariant moments                       7–13
Histogram mean, standard deviation      14–15

Table 3
Comparison between IDEA and other well-known classifiers in the task of determining the BI-RADS level using the ROI dataset.

Measure        IDEA (%)   C4.5 (%)   Naive Bayes (%)   1NN (%)
Accuracy       96.7       95.5       86.7              71.1
Sensitivity    91.3       85.7       71.4              50.0
Specificity    71.4       100.0      77.8              33.3

In the feature extraction step, the images were segmented, and features of texture, shape and color were extracted from the segmented regions. The segmentation process was performed by eliminating image regions with gray level smaller than 0.14 (in a gray-scale range [0–1]) and applying the well-known Otsu technique [20] to the resulting image. The features shown in Table 2 were extracted from the segmented regions and used to compose the feature vector representation of the images.
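A rough sketch of the described segmentation follows, assuming scikit-image for Otsu's thresholding (the paper does not name an implementation, and the helper name is ours):

```python
import numpy as np
from skimage.filters import threshold_otsu

def segment_roi(img):
    """Sketch of the described segmentation: discard pixels darker than 0.14
    (gray scale normalized to [0, 1]), then apply Otsu's thresholding to what
    remains and return a binary mask of the segmented regions."""
    img = img.astype(float) / img.max()   # normalize to [0, 1]
    candidates = img[img >= 0.14]         # drop the near-black background
    t = threshold_otsu(candidates)        # Otsu on the remaining intensities
    return (img >= 0.14) & (img >= t)
```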

The Omega algorithm was applied to the image feature vectors and removed the 13th feature, meaning that the 13th feature is the least differentiating one for the ROI dataset. The output of Omega was submitted to the Apriori algorithm, and 662 rules were mined. One example of a rule mined in this step of the IDEA method is:

1[167.03–169.1] → Invasive Ductal Carcinoma (IDC)   (s = 0.01, c = 1.0)

This rule means that images having the 1st feature (average intensity) in the closed interval [167.03, 169.1] tend to be images of Invasive Ductal Carcinoma (IDC), with support 0.01 (1% of the training images are images of IDC having the 1st feature value in that interval) and confidence 1.0 (all images having the 1st feature in that interval are images of IDC). The association rules generated and the test images were submitted to the ACE algorithm, which produced suggestions of keywords to compose the diagnosis of each test image. The suggested diagnoses were compared with the real diagnoses of the images, given by specialists and by biopsy results.

Fig. 8. Screenshot of the IDEA System.



To validate the IDEA method in the task of determining the BI-RADS level of the images, we compared it with three other well-known classifiers. It was first compared with C4.5 [9], a classifier that constructs a decision tree in the training phase. Second, IDEA was compared with Naive Bayes [21], a classifier that uses a probabilistic approach based on Bayes' theorem to predict the class labels. Third, it was compared with the 1-nearest neighbor (1NN) classifier, which uses the class label of the nearest neighbor (using the Euclidean distance) to classify a new instance. Table 3 shows the results. Since the BI-RADS categorization has a fuzzy separation and high similarity between consecutive levels, even for a radiologist, we considered a suggestion correct if the BI-RADS level suggested by a method was the same as, or adjacent to, the level annotated by the radiologist in the image report. Note that the IDEA method leads to the highest values of accuracy and sensitivity. Moreover, IDEA suggests a set of keywords related to the other parts of the diagnosis without requiring additional computational effort. Other classifiers could also return a set of keywords as the final result; however, a new classification model would have to be built for each keyword returned, greatly increasing the complexity of the process and the computational effort. For this reason, the IDEA results are more feasibly achieved employing association rules than other mining techniques.

In a batch execution, the test images were submitted to the IDEA system, and the accuracies obtained for the main parts of the diagnosis were: morphology, 91.3%; BI-RADS level, 96.7%. This result indicates that the employed features represent the BI-RADS level of the lesion better than the morphological properties of the images. A prototype, called the IDEA System, was implemented incorporating the IDEA method. The IDEA System was evaluated by two radiologists, obtaining a high degree of acceptance; they reported that the system indicated lesions that they had not seen in a first analysis. Fig. 8 shows a screenshot of the IDEA System when analyzing the image shown on the left. The system shows the weight of each diagnosis keyword between parentheses; the weight indicates the level of certainty that the respective keyword will belong to the final diagnosis given by the radiologist.

5. Conclusions

In this paper we propose to employ association rules to support two types of medical systems: CBIR and CAD systems. To improve CBIR systems, we presented an association rule-based method, called FAR, that performs continuous feature selection based on statistical association rules mined using the image features and the image categories. The experiment performed employing the FAR method shows that the proposed method improves the precision of the query results by up to 38%, always outperforming the precision obtained by the original features, while decreasing the memory and processing costs. We also detailed IDEA, a method based on association rules to assist radiologists in the task of diagnosing medical images. The results obtained using a real dataset show that the proposed method achieves high accuracy (up to 96.7%), reaching the highest values of accuracy and sensitivity when compared with other well-known classifiers (C4.5, Naive Bayes, and 1-Nearest Neighbor).
In addition, the IDEA method suggests a set of keywords without demanding extra computational effort, while the other methods would have to build a new model for each keyword returned, greatly increasing the complexity of the process and the computational effort. For this reason, the IDEA results are more feasibly achieved employing association rules than other mining techniques. Radiologists who evaluated the IDEA system demonstrated good acceptance of it, showing great interest in employing the system to aid them in their daily work. The results indicate that association rules can be successfully employed to improve medical CBIR and CAD systems, enhancing, speeding up and bringing more confidence to the work of radiologists in their day-to-day task of analyzing medical images. Future work includes comparing the time required to analyze the images, and the precision obtained, by radiologists using the proposed methods and without using them. Also, the methods can be added to the PACS of the Clinical Hospital of the University of São Paulo at Ribeirão Preto to be extensively used and evaluated.

References

[1] P.H. Bugatti, M.X. Ribeiro, A.J.M. Traina, C. Traina Jr., Content-based retrieval of medical images by continuous feature selection, in: 21st IEEE International Symposium on Computer-Based Medical Systems, Jyväskylä, Finland, 2008, pp. 272–277.
[2] M.X. Ribeiro, A.J.M. Traina, C. Traina Jr., N.A. Rosa, P.M.A. Marques, How to improve medical image diagnosis through association rules: the IDEA method, in: 21st IEEE International Symposium on Computer-Based Medical Systems, Jyväskylä, Finland, 2008, pp. 266–271.
[3] R.C. Holte, Very simple classification rules perform well on most commonly used datasets, Machine Learning 11 (1993) 63–91.
[4] M. Malcok, Y. Aslandogan, A. Yesildirek, Fractal dimension and similarity search in high-dimensional spatial databases, in: IEEE Intl. Conf. on Information Reuse and Integration, Waikoloa, Hawaii, USA, 2006, pp. 380–384.
[5] Y. Rubner, C. Tomasi, Perceptual Metrics for Image Database Navigation, The Kluwer Intl. Series in Engineering and Computer Science, Kluwer Academic Publishers, 2001.
[6] K. Kira, L.A. Rendell, A practical approach for feature selection, in: Ninth Intl. Conf. on Machine Learning, Aberdeen, Scotland, 1992, pp. 249–256.
[7] I. Kononenko, Estimating attributes: analysis and extension of Relief, in: European Conf. on Machine Learning, Catania, Italy, 1994, pp. 171–182.
[8] C. Cardie, Using decision trees to improve case-based learning, in: 10th Intl. Conf. on Machine Learning, 1993, pp. 25–32.
[9] R. Quinlan, C4.5: Programs for Machine Learning, San Mateo, CA, 1992, pp. 1–302.
[10] R. Agrawal, T. Imielinski, A.N. Swami, Mining association rules between sets of items in large databases, in: Proceedings of the 1993 ACM SIGMOD ICMD, Washington, D.C., 1993, pp. 207–216.
[11] M.X. Ribeiro, A.G.R. Balan, J.C. Felipe, A.J.M. Traina, C. Traina Jr., Mining statistical association rules to select the most relevant medical image features, in: First Intl. Workshop on Mining Complex Data, Houston, 2005, pp. 91–98.
[12] M.-L. Antonie, O.R. Zaïane, A. Coman, Associative classifiers for medical images, in: LNAI 2797, MMCD, Springer-Verlag, 2003, pp. 68–83.
[13] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Intl. Conf. on VLDB, Santiago de Chile, Chile, 1994, pp. 487–499.
[14] X. Wang, M. Smith, R. Rangayyan, Mammographic information analysis through association-rule mining, in: IEEE CCGEI, 2004, pp. 1495–1498.



[15] H. Pan, J. Li, Z. Wei, Mining interesting association rules in medical images, in: Advanced Data Mining and Medical Applications, Springer-Verlag, 2005, pp. 598–609.
[16] C. Ordonez, N. Ezquerra, C.A. Santana, Constraining and summarizing association rules in medical data, Knowledge and Information Systems 9 (3) (2006) 259–283.
[17] R.A. Baeza-Yates, B.A. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, Wokingham, UK, 1999.
[18] C. Traina Jr., A.J.M. Traina, C. Faloutsos, B. Seeger, Fast indexing and visualization of metric datasets using Slim-trees, IEEE Transactions on Knowledge and Data Engineering 14 (2) (2002) 244–260.
[19] R.M. Haralick, K. Shanmugam, I. Dinstein, Textural features for image classification, IEEE Transactions on Systems, Man and Cybernetics 3 (1973) 610–621.
[20] N. Otsu, A threshold selection method from gray-level histograms, IEEE Transactions on Systems, Man and Cybernetics 9 (1979) 62–66.
[21] G.H. John, P. Langley, Estimating continuous distributions in Bayesian classifiers, Morgan Kaufmann, San Mateo, 1995, pp. 338–345.

Marcela X. Ribeiro received the B.Sc. degree in computer engineering and the M.Sc. in computer science from the Federal University of São Carlos, Brazil, in 2002 and 2004, respectively. She received the Ph.D. degree in computer science at the Mathematics and Computer Science Institute of the University of São Paulo at São Carlos, Brazil, in 2008, and she is currently a Post-doctoral researcher at the same institute. Her research interests include multimedia data mining, visual data mining, visualization, computer-aided diagnosis (CAD) and content-based image retrieval (CBIR). She is a member of IEEE Computer Society, ACM and SBC.

Pedro H. Bugatti received the B.Sc. degree in computer science from the Euripides Soares da Rocha University of Marília (UNIVEM), Brazil, in 2005 and the M.Sc. in computer science from the University of São Paulo (USP), Brazil, in 2008. He is currently a Ph.D. student in computer science at the University of São Paulo (USP), Brazil. In 2005, he received the prominent alumnus award from the Brazilian Computer Society (SBC). His research interests include image databases, indexing methods for multidimensional data, analysis of distance functions and similarity metrics.

Caetano Traina Jr. received the B.Sc. degree in electrical engineering, the M.Sc. and Ph.D. degrees in computer science from the University of São Paulo, Brazil, in 1978, 1982 and 1987, respectively. He is currently a full professor with the Computer Science Department of the University of São Paulo at São Carlos, Brazil. His research interests include access methods for complex data, data mining, similarity searching and multimedia databases. He is a member of IEEE, ACM, SIGMOD, SIAM and SBC.

Paulo M. Azevedo-Marques received his B.Sc. and M.Sc. degree in Electrical Engineering in 1986 and 1990, respectively and his Ph.D. in Applied Physics in 1994 at the University of Sao Paulo. He is an Associate Professor with the Medical Physics and Biomedical Informatics at the Internal Medicine Department, University of Sao Paulo (USP), School of Medicine in Ribeirão Preto, Brazil. His research interests include image processing for computer-aided diagnosis (CAD) and content-based image retrieval (CBIR), and Picture Archive and Communication Systems (PACS).


Natalia A. Rosa received the B.Sc. degree in computer science from the Pontifical Catholic University of Campinas (PUC), Brazil, in 1995 and her M.Sc. in computer science from the University of São Paulo (USP), Brazil, in 2002. She received the Ph.D. degree in Medicine from the Medical Physics and Biomedical Informatics group at the Internal Medicine Department, University of São Paulo (USP), Brazil, in 2007. Her research interests include image processing for computer-aided diagnosis (CAD), content-based image retrieval (CBIR), and Picture Archive and Communication Systems (PACS).

Agma J. M. Traina received the B.Sc. the M.Sc. and Ph.D. degrees in computer science from the University of São Paulo, Brazil, in 1983, 1987 and 1991, respectively. She is currently a full Professor with the Computer Science Department of the University of São Paulo at São Carlos, Brazil. Her research interests include image databases, image mining, indexing methods for multidimensional data, information visualization and image processing for medical applications. She is a member of IEEE Computer Society, ACM, SIGKDD, and SBC.