Multi-resolution semantic-based imagery retrieval using hidden Markov models and decision trees

Brandt Tso (corresponding author), Joe L. Tseng
Department of Information Management, Management College, National Defense University, 70, Sec. 2, Central N Rd., Bei-Tou, Taipei 11245, Taiwan

Expert Systems with Applications 37 (2010) 4425–4434

Keywords: Hidden Markov model; Semantic-based; Decision tree; C4.5; Wavelet; LAB

Abstract

This study presents a useful method for semantic-based imagery retrieval. The experiments are made in two parts. In the first part, newly designed one-dimensional hidden Markov models (HMMs) based on 'observation-sequence' and 'observation-density' manipulation approaches are proposed, and their imagery retrieval accuracy is evaluated. For the 'observation-sequence' manipulation method, four neighborhood systems are evaluated in total, while two neighborhood systems are tested in the 'observation-density' manipulation domain. In the second part, a C4.5 decision tree is introduced and trained on the HMM likelihoods so as to discover retrieval rules that further enhance the imagery retrieval accuracy. The test imagery consists of real-scene military vehicles and is hierarchically pre-processed using wavelet and LAB transforms. The imagery is classified into 'Air-Force', 'Warship', 'Submarine', 'Tank', and 'Jeep'. It is found that using HMMs alone can achieve a best accuracy of 68.8%; when decision trees are implemented, the accuracy can be further enhanced up to 78%. The results clearly show the usefulness of the method, which can be used in intelligent systems for recognizing real-scene objects.

1. Introduction

Object recognition is useful in a wide range of applications across various fields, including digital libraries, web searching, education, biomedicine, and the military. Although some success has been achieved in recognizing objects within specific domains (Forsyth & Ponce, 2002; Tamura & Yokoya, 1984), for real-scene images object recognition remains challenging, in considerable part due to the significant variations contained in the images (such as cluttered backgrounds, viewpoint variations, and varying illumination). The common basis for object recognition is to extract signatures (or features) for every image based on its pixel values and to define rules (such as ones based on distance measures) for recognizing objects within images (Tso & Mather, 2009). The signatures can be regarded as a significant compression of the image representation. Specifically, the primary objective of extracting signatures is to bridge the gap between image objects and the pixel representation, that is, to create a better correlation with image objects. Existing techniques for object recognition can be roughly categorized into four types, namely histogram matching (Gevers & Stokman, 2004; Pass & Zabith, 1999), color layout (Natsev, Rastogi, & Shim, 1999), edge detection (Bertozzi et al., 2007), and region-based representation (Chen, Bi, & Wang, 2006; Chen, Wang, & Krovetz, 2005), depending on the core approach used to extract and use the signatures. Histogram matching techniques characterize objects within an image in terms of color distributions. However, the obvious drawback of such a global histogram representation is that information about object location, orientation, texture, and context is discarded. The color layout approach normally partitions an image into blocks and then records the color feature of each block. One may therefore regard color layout as a low-resolution representation of the original image. Although information such as shape can be preserved in color layout techniques, they are sensitive to shifting, scaling, and rotation, because image objects are represented by a set of mutually independent features. Edge detection is also sensitive to cluttered backgrounds, scaling, and rotation, and thus the resulting level of object recognition performance is quite limited. As far as region-based, or so-called semantic-based, representation is concerned, it attempts to overcome the deficiencies of the histogram and color layout approaches by representing images at the object level. The basic idea is to represent the visual appearance of an object as a structured combination of a number of mutually correlated context regions keyed by distinctive features (Tso & Olsen, 2005a). Even though region-based representation may in theory achieve more elegant recognition results, not much attention has been paid to developing the relevant methodology in this field. The major difficulty faced when implementing such techniques is that the object is difficult to


completely extract and separate from the background, especially when dealing with real-scene imagery. In recent years, however, it has been found that one solution to this barrier is contextual information (Tso & Mather, 2009; Tso & Olsen, 2005b). Since, within an object, each sub-region generally holds spatial and spectral relationships with neighboring sub-regions, there is little doubt that image object representation and recognition can be improved if such contextual information is successfully incorporated. To model such contextual relationships, the hidden Markov model (HMM), a robust model based on Markovian theory, can be applied (Baum & Egon, 1967; Baum & Petrie, 1966; Baum, Petrie, Soules, & Weiss, 1970; Rabiner, 1989). The HMM is an important category of machine learning algorithm and has become the core of many intelligent systems. Since its development, the HMM has earned popularity in speech recognition (Cole, Hirschman, Atlas, & Beckman, 1995) and information extraction (Tso & Chang, 2007). Applications of the HMM to image processing have also been growing (Tso & Mather, 2009). When an HMM is implemented for modeling an image, the most straightforward method of incorporating pixels into the HMM is to sweep the image line by line or column by column. It is noted that the original theory of the HMM assumes a one-dimensional linearly dependent structure (Baum & Egon, 1967; Rabiner, 1989), while an image is normally two-dimensional in a contextual sense. Spatial dependencies are thus not well modeled by the linear structure of HMMs, which may restrict the object representation and, later, the recognition accuracy. There are alternative strategies that build 2D HMMs, or so-called Markov meshes, to better fit such 2D spatial dependencies (Li, Najmi, & Gray, 2000). However, as noted by Li et al. (2000), using 2D HMMs requires developing considerably more complicated algorithms for searching for optimal model states. This may require additional assumptions regarding the models and also imposes a much heavier computational load. There is another technique called the pseudo 2D HMM (Eickeler, Müller, & Rigoll, 2000), which extends the one-dimensional HMM by linking one-dimensional left-to-right HMMs to form vertical super-states. These super-states model the sequence of rows in the image, and the linear one-dimensional HMMs (inside the super-states) model each row. Applications of the pseudo 2D HMM to object recognition are highly restricted to images in which the location of the object(s) must be the same (such as passport photos of human faces), and it is thus not applicable to real-scene images in which the locations and directions of objects vary considerably. To make HMMs useful for real-scene object recognition, rather than changing the HMM structure, this study develops methodology in terms of spatial and feature-space manipulation within a linear HMM to pursue object recognition in a more robust way. In the methodology developed, the primary aim is to embed 2D information into a one-dimensional HMM. The ways of converting the 2D information of an image object into a one-dimensional HMM, which is then applied to the object recognition task, are thus the main interest of this study. The techniques developed for modeling images using HMMs include 'observation-sequence' and 'observation-density' adjustments, as detailed in a later section.
When using an HMM to model an image, the HMM requires several training iterations to reach a stable state. Such a process is called model training. Once HMM training is complete, one can feed the well-trained HMM with another, unknown image to calculate its likelihood, use this likelihood to estimate how similar the two images are, and then decide whether the unknown image belongs to the same type as the trained image or not. However, it is well known that the calculation of the HMM likelihood is not only affected by the HMM itself but is also dominated by the image resolution (for instance, a higher-resolution image may contain more information but is also more noise-affected, while a lower-resolution image may not contain very detailed information but may make the object of interest easier to retrieve). In order to make the retrieval performance evaluation more evidential, this study preprocesses each image hierarchically to generate three levels of resolution for the experiments, and an image at each resolution may be modeled by several HMMs with different assigned cluster numbers. The resulting likelihoods for an image are therefore multiple and difficult to manage. A C4.5 decision tree is then introduced to resolve such a situation and help trace out the rules for retrieving the images of interest. This study presents the results of identifying military vehicles from real-scene imagery. The work described here is part of the functionality of the MrSMIR (multi-resolution semantic-based military imagery retrieval) system currently under construction (Tso, 2008), which is a military digital library and intelligent image retrieval system. The experiments performed in this study form part of the core functions constructed in MrSMIR. The rest of the paper is organized as follows. In Section 2, the fundamentals of HMM theory and the C4.5 decision tree are introduced. The methodology for converting 2D spatial information into a one-dimensional HMM is detailed in Section 3, together with the parameter setting and training for the HMM and decision tree models. In Section 4, the retrieval performance is evaluated and the experimental results are discussed. Finally, concluding remarks and suggestions for future research are given in Section 5.

2. Theoretical background

2.1. HMM theory

A hidden Markov model (HMM) is distinguished from a general Markov model in that its states cannot be observed directly (i.e., they are hidden) and can only be estimated through a sequence of observations generated along a time series. Assume the total number of states is N, and let q_t and o_t denote the system state and the observation at time t, respectively. An HMM, λ, can be formally characterized by λ = (A, B, π), where A is a matrix of transition probabilities between states, B is a matrix of observation probability densities relating to the states, and π is a vector of initial state probabilities. Specifically, A, B, and π are each further represented as

$$A = [a_{ij}], \qquad a_{ij} = P(q_{t+1} = j \mid q_t = i), \qquad 1 \le i, j \le N \tag{1}$$

where

$$a_{ij} \ge 0, \qquad \sum_{j=1}^{N} a_{ij} = 1, \qquad \text{for } i = 1, 2, \ldots, N \tag{2}$$

$$B = [b_j(o_t)], \qquad b_j(o_t) = P(o_t \mid q_t = j), \qquad 1 \le j \le N \tag{3}$$

$$\pi = [\pi_i], \qquad \pi_i = P(q_1 = i), \qquad 1 \le i \le N \tag{4}$$

where

$$\sum_{i=1}^{N} \pi_i = 1 \tag{5}$$

The observation probability density b_j(o_t) for state j given observation o_t is generally modeled as a Gaussian distribution (Rabiner, 1989):

$$b_j(o_t) = \frac{1}{\sqrt{(2\pi)^k \,\lvert \Sigma_j \rvert}} \exp\!\left( -\frac{1}{2} (o_t - \mu_j)' \, \Sigma_j^{-1} \, (o_t - \mu_j) \right) \tag{6}$$

where the prime denotes vector transpose and k is the dimension of the observation vector o_t.
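As a quick illustration of Eq. (6), the following minimal numpy sketch evaluates the multivariate Gaussian observation density for one observation vector; the function and variable names are ours, not the paper's.

```python
import numpy as np

def gaussian_density(o_t, mu_j, sigma_j):
    """Evaluate b_j(o_t) of Eq. (6) for one k-dimensional observation.

    o_t     : observation vector, shape (k,)
    mu_j    : mean vector of state j, shape (k,)
    sigma_j : covariance matrix of state j, shape (k, k)
    """
    k = o_t.shape[0]
    diff = o_t - mu_j
    norm = np.sqrt(((2.0 * np.pi) ** k) * np.linalg.det(sigma_j))
    expo = -0.5 * diff @ np.linalg.inv(sigma_j) @ diff
    return np.exp(expo) / norm
```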


Given an HMM λ and an observation sequence O = {o_1, o_2, ..., o_T}, one may estimate the best state sequence Q* = {q_1*, q_2*, ..., q_T*} using a dynamic programming approach so as to maximize P(Q* | O, λ). In order to make Q* meaningful, one has to set up the model parameters A, B, and π properly. The Baum–Welch algorithm (Baum & Egon, 1967) is the most widely adopted methodology for estimating the model parameters. The parameters π_i, a_ij, mean μ_i, and covariance Σ_i are each characterized as

$$\pi_i = \gamma_1(i) \tag{7}$$

$$a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \tag{8}$$

$$\mu_i = \frac{\sum_{t=1}^{T} \gamma_t(i)\, o_t}{\sum_{t=1}^{T} \gamma_t(i)} \tag{9}$$

$$\Sigma_i = \frac{\sum_{t=1}^{T} \gamma_t(i)\, (o_t - \mu_i)'(o_t - \mu_i)}{\sum_{t=1}^{T} \gamma_t(i)} \tag{10}$$

where γ_t(i) denotes the conditional probability of being in state i at time t, given the observations, and ξ_t(i, j) is the conditional probability of a transition from state i at time t to state j at time t + 1, given the observations. Both γ_t(i) and ξ_t(i, j) can be computed using the well-known forward–backward algorithm (Baum & Egon, 1967; Baum & Petrie, 1966). Define the forward probability α_t(i) as the joint probability of observing the first t observations {o_1, o_2, ..., o_t} and being in state i at time t. The α_t(i) can be solved inductively by the following formulae:

$$\alpha_1(i) = \pi_i \, b_i(o_1), \qquad 1 \le i \le N \tag{11}$$

$$\alpha_{t+1}(j) = b_j(o_{t+1}) \sum_{i=1}^{N} \alpha_t(i)\, a_{ij}, \qquad 1 \le t \le T - 1, \quad 1 \le j \le N \tag{12}$$

Let the backward probability β_t(i) be the conditional probability of observing the remaining observation sequence {o_{t+1}, o_{t+2}, ..., o_T} after time t, given that the state at time t is i. As with the forward probability, β_t(i) can be solved inductively as

$$\beta_T(i) = 1, \qquad 1 \le i \le N \tag{13}$$

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \qquad t = T-1, T-2, \ldots, 1, \quad 1 \le i \le N \tag{14}$$

The probabilities γ_t(i) and ξ_t(i, j) are then computed as

$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)} \tag{15}$$

$$\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)} \tag{16}$$
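To make the recursions concrete, the sketch below implements Eqs. (11)–(16) with numpy. It assumes the emission probabilities have already been evaluated into a matrix b[t, j] = b_j(o_t) (e.g., via Eq. (6)), and it omits the per-step scaling that a practical implementation would need to avoid numerical underflow on long sequences; all names are illustrative.

```python
import numpy as np

def forward_backward(pi, A, b):
    """Compute alpha, beta, gamma and xi for one observation sequence.

    pi : (N,)   initial state probabilities
    A  : (N, N) transition matrix, A[i, j] = a_ij
    b  : (T, N) emission probabilities, b[t, j] = b_j(o_t)
    """
    T, N = b.shape
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))

    # Forward recursion, Eqs. (11)-(12)
    alpha[0] = pi * b[0]
    for t in range(T - 1):
        alpha[t + 1] = b[t + 1] * (alpha[t] @ A)

    # Backward recursion, Eqs. (13)-(14)
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (b[t + 1] * beta[t + 1])

    # State posteriors gamma, Eq. (15)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Transition posteriors xi, Eq. (16)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (b[t + 1] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()

    return alpha, beta, gamma, xi
```

The re-estimation step of Eqs. (7)–(10) then follows by accumulating gamma and xi over time (and over the training sequences) to update π, A, and the Gaussian parameters.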

By analyzing the structure and major parameters of the HMM training algorithm described in Eqs. (7)–(16), it is clear that the resulting hidden state sequence is strongly dependent on the distribution of observations b_i(o_t) and on the observation sequence O = {o_1, o_2, ..., o_T}. Specifically, one may trace the effect of the parameters sequentially as follows:

1. b_i(o_t) and O both affect α_t(i) and β_t(i), as shown in Eqs. (11)–(14);
2. α_t(i) and β_t(i) make up γ_t(i) and ξ_t(i, j), as shown in Eqs. (15) and (16);
3. γ_t(i) and ξ_t(i, j) determine π_i, a_ij, and b_i(o_t) according to Eqs. (7)–(10), and eventually generate the hidden state estimate Q* (corresponding to the number of clusters).


As a result, if both the observation density b_i(o_t) and the observation sequence O = {o_1, o_2, ..., o_T} are well managed, the revealed hidden state sequence will be closer to the ideal situation (e.g., it will reveal object patches). Based on the theoretical inference shown above, this study therefore develops 'observation-sequence' adjustment and 'observation-density' adjustment methods so as to pursue the possibility of embedding 2D information into a one-dimensional HMM and, further, to improve the performance of object recognition from real-scene imagery. For incorporating a 2D image into an HMM, the most straightforward manner is to sweep the image line by line to fit the pixels into the HMM. Fig. 1a shows an example of a 4 × 4 image. The sweeping process visits each pixel in a left-to-right, strip-like direction, as shown in Fig. 1b. Such an arrangement uses only one-dimensional spatial dependencies: each pixel gains contextual effectiveness only from its preceding left-hand-side pixel. Given the characteristics of the HMM, such an arrangement is unable to model the image well. To overcome this drawback, three types of 'observation-sequence' adjustment are proposed, namely the Hilbert–Peano sequence (Abend, Jarley, & Kanal, 1965) (Fig. 1c), the 'V'-like sequence (Fig. 1d), and the 'U'-like sequence (Fig. 1e). These methods provide more flexibility than the strip-like approach in involving 2D neighborhood information. The proposed observation-sequence adjustment methods are detailed as follows. The 4 × 4 visiting example shown in Fig. 1c is the basic structure required for explaining the Hilbert–Peano sequence, which adopts a fractal-like (i.e., self-similar) concept to visit the pixels within an image. As for the 'V'-like observation sequence (Fig. 1d), it incorporates pixels into the HMM through a repeated 'V'-shaped scan. For instance, for the sample image shown in Fig. 1a, this method arranges pixels in a 'V'-shaped sequence by setting o_1 = p_{1,1}, o_2 = p_{2,1}, o_3 = p_{1,2}, o_4 = p_{2,2}, ... to form a hidden Markov chain, as illustrated in Fig. 1d. After the pixels within rows 1 and 2 have been arranged, one switches to rows 3 and 4, and the process is repeated until all the rows within the image have been visited. In the case of the 'U'-like sequence, the pixel fitting process follows a 'U'-shaped scan by setting o_1 = p_{1,1}, o_2 = p_{2,1}, o_3 = p_{2,2}, o_4 = p_{1,2}, o_5 = p_{1,3}, ... to form a hidden Markov chain, as shown in Fig. 1e; again, after rows 1 and 2 are completed, the process switches to rows 3 and 4 and repeats until all rows have been visited. In addition to the 'observation-sequence' adjustment, there is another parameter, b_i(o_t), i.e., the density of the observations, that can bring the effect of 2D context to the hidden state estimation. Normally, when constructing an HMM for image classification with the strip-like arrangement shown above, the observation at each step of the HMM is based on a single pixel alone. The estimated state for a pixel of interest is thus highly correlated with the observed pixel value and with the spatial direction along which the HMM models the contextual dependencies. Let the spatial direction used to build the HMM remain strip-like.
If, to each original observation o_t0, one more observation o_t1 is added to expand the observation vector into {o_t0, o_t1}, the estimated hidden state is likely to be affected, depending on how the new density is measured in terms of the observation {o_t0, o_t1}. Thus, two observation-density adjustment methods are proposed, namely the one-side (Fig. 1f) and two-side (Fig. 1g) neighborhood systems, which involve the neighborhood in the vertical direction while fitting the HMM in the horizontal direction. It is worth noting that the pixels taken into account are within the scope of the first- and second-order neighborhood systems commonly seen in Markov random field theory. The methods described above thus arrange the information relating to vertical and diagonal contextual dependencies to build up the HMM. Although the HMM is still one-dimensional, the information contained inside is, in spirit, 2D in effect. A short code sketch of these scan orders and of the one-side stacking is given after the Fig. 1 caption below.


Fig. 1. (a) A 4 by 4 image; (b) strip-like sequencing; (c) Hilbert–Peano sequencing; (d) ‘V’-like sequencing; (e) ‘U’-like sequencing; (f) one-side observation-density adjustments and (g) two-side observation-density adjustments.
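The following sketch, referenced above, generates the 'V'-like and 'U'-like visiting orders of Fig. 1d and e and stacks one vertical neighbor onto each observation for the one-side observation-density adjustment of Fig. 1f (here taken as the pixel directly above, which is one plausible reading of the figure). The Hilbert–Peano order is omitted for brevity, and all function names are ours.

```python
import numpy as np

def v_like_order(rows, cols):
    """'V'-like scan (Fig. 1d): o1 = p[0,0], o2 = p[1,0], o3 = p[0,1], o4 = p[1,1], ..."""
    order = []
    for r in range(0, rows, 2):
        for c in range(cols):
            for i in (r, r + 1):
                if i < rows:
                    order.append((i, c))
    return order

def u_like_order(rows, cols):
    """'U'-like scan (Fig. 1e): o1 = p[0,0], o2 = p[1,0], o3 = p[1,1], o4 = p[0,1], ..."""
    order = []
    for r in range(0, rows, 2):
        for c in range(cols):
            pair = [(r, c), (r + 1, c)] if c % 2 == 0 else [(r + 1, c), (r, c)]
            order.extend((i, j) for i, j in pair if i < rows)
    return order

def stack_one_side_neighbor(img):
    """One-side observation-density adjustment: pair every pixel with the
    pixel directly above it while the HMM is still fitted strip-like."""
    above = np.vstack([img[:1], img[:-1]])      # row above (first row pairs with itself)
    return np.stack([img, above], axis=-1)      # observation vectors {o_t0, o_t1}

# Example: a 4 x 4 grey-level image turned into a 'U'-like observation sequence
img = np.arange(16, dtype=float).reshape(4, 4)
sequence = [img[i, j] for i, j in u_like_order(*img.shape)]
```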

2.2. Decision tree C4.5

The C4.5 decision tree was proposed by Quinlan (1993), who used an approach called gain ratio for tree induction, based on Shannon's entropy (Shannon, 1948) from information theory. Gain ratio is a normalized version of information gain. During the tree

induction process, for a node t, the information gain resulting from splitting node t on each attribute (i.e., each spectral band in the case of remote sensing) is calculated. The attribute with the greatest information gain is then chosen for splitting at that node. In order to compute the information gain, the entropy I_E(t) of node t must first be calculated, which is expressed as

$$I_E(t) = -\sum_{j=1}^{m} f(t, j)\, \log_2 f(t, j) \tag{17}$$


where f(t, j) is the proportion of samples belonging to class j, j ∈ {1, 2, ..., m}, within node t. That is, if node t contains N_t samples, f(t, j) can be calculated by the following expression:

$$f(t, j) = \frac{1}{N_t} \sum_{i=1}^{N_t} C(y_i, j) \tag{18}$$

where

$$C(a, b) = \begin{cases} 1, & \text{if } a = b \\ 0, & \text{otherwise} \end{cases} \tag{19}$$

In other words, I_E(t) indicates the information conveyed by the distribution of samples within node t. Now define an attribute X, X = {x_1, x_2, ..., x_r}. For a value x_i, suppose that there are n_i samples in total, and let f(t_X(x_i), j) be the proportion of samples with value x_i belonging to class j within node t. The entropy I_E(t_X(x_i)) associated with x_i for node t can be computed as

$$I_E(t_{X(x_i)}) = -\sum_{j=1}^{m} f(t_{X(x_i)}, j)\, \log_2 f(t_{X(x_i)}, j) \tag{20}$$

Finally, the information gain associated with a split on attribute X is calculated by

$$\mathrm{Gain}(t, X) = I_E(t) - \frac{n_1}{N_t} I_E(t_{X(x_1)}) - \frac{n_2}{N_t} I_E(t_{X(x_2)}) - \cdots - \frac{n_r}{N_t} I_E(t_{X(x_r)}) \tag{21}$$


Eq. (21) denotes the difference between the information needed to describe node t without and with attribute X, i.e., the gain in information due to attribute X. However, according to Eq. (21), the information gain tends to favour attributes that have many values, each covering only a small number of samples. This drawback makes the tree induction process more easily affected by noise. To compensate for this, an alternative index called the gain ratio, which can be regarded as a normalized version of the information gain, is proposed (Quinlan, 1993). Specifically, the gain ratio is defined as

$$\mathrm{Gain\ ratio}(t, X) = \frac{\mathrm{Gain}(t, X)}{\mathrm{Split\_I}(t, X)} \tag{22}$$

where Split_I(t, X) is expressed as

$$\mathrm{Split\_I}(t, X) = -\sum_{i=1}^{r} f(t, x_i)\, \log_2 f(t, x_i) \tag{23}$$

which represents the information due to the split on attribute X at node t. The gain ratio is then used as the core of C4.5 to formulate the decision tree from the samples fed to it.
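As a small illustration of Eqs. (17)–(23), the sketch below computes the entropy and the gain ratio of a single categorical attribute at a node; the function names are ours. In the experiments described later the attributes are continuous HMM likelihoods, which C4.5 handles by searching for binary thresholds, a step not shown here.

```python
import numpy as np

def entropy(labels):
    """I_E of Eq. (17): entropy of the class distribution in a node."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    f = counts / counts.sum()
    return float(-np.sum(f * np.log2(f)))

def gain_ratio(attr_values, labels):
    """Gain ratio of Eq. (22) for a split on one categorical attribute."""
    attr_values, labels = np.asarray(attr_values), np.asarray(labels)
    n = len(labels)
    gain = entropy(labels)                     # I_E(t), first term of Eq. (21)
    split_info = 0.0
    for x in np.unique(attr_values):
        mask = attr_values == x
        frac = mask.sum() / n
        gain -= frac * entropy(labels[mask])   # remaining terms of Eq. (21)
        split_info -= frac * np.log2(frac)     # Eq. (23)
    return gain / split_info if split_info > 0 else 0.0
```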

Fig. 2. (a) Example of an image subjected to the wavelet transform; (b) a 'tank' image transformed from RGB to LAB and then to wavelet features.

3. Design of experiments

3.1. Image preprocessing

Fig. 2a shows the image preprocessing procedure implemented in this study. All the images stored in the MrSMIR


(multi-resolution semantic-based military imagery retrieval) system are standardized to 256 × 384 pixels in size. These images are first translated from RGB to the LAB space in order to achieve perceptual consistency (Jain, 1989), where L encodes luminance and A and B encode color information. The L-band image is then subjected to a wavelet transform using the Daubechies-2 wavelet, chosen for its good localization property (Daubechies, 1992). The wavelet transform of the L-band image outputs HH, HL, LH, and LL images (each sub-band shrinking to 1/4 of the original size), where the LL band can be regarded as a down-sampled version of the original image, while the HH, LH, and HL bands are highly correlated with image texture. The reason for incorporating texture is that, in real-scene imagery, varied backgrounds may degrade the recognition of objects. It is thus expected that appropriately introducing the wavelet textures (i.e., HH, HL, LH) and the chrominance information (the A and B bands in LAB space) may assist in separating objects from backgrounds.

For both the A and B band images, the purpose of the wavelet transform is simply to obtain the corresponding down-sized LL image. Finally, one obtains six band images (L, A, B, HH, HL, and LH) of 64 × 96 pixels. Fig. 2b shows an example of a 'tank' image transformed from RGB to LAB and then to wavelet features. As a result, the images eventually come in three sizes, 64 × 96, 32 × 48, and 16 × 24, for further processing. A sketch of this preprocessing pipeline is given below.
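The sketch below gives one possible reading of the preprocessing pipeline using scikit-image and PyWavelets (the library choices are ours, not the paper's). Since a single 2D DWT halves each dimension, two Daubechies-2 levels are applied here to go from 256 × 384 to 64 × 96, and the mapping of PyWavelets' detail coefficients onto the LH/HL/HH labels is approximate.

```python
import numpy as np
import pywt
from skimage import color, io, transform

def _ll2(band):
    """Two-level LL approximation: 256 x 384 -> 128 x 192 -> 64 x 96."""
    ll, _ = pywt.dwt2(band, 'db2', mode='periodization')
    ll, _ = pywt.dwt2(ll, 'db2', mode='periodization')
    return ll

def preprocess(path):
    """RGB -> LAB, Daubechies-2 wavelet, six 64 x 96 bands (L, A, B, HH, HL, LH)."""
    rgb = transform.resize(io.imread(path), (256, 384), anti_aliasing=True)
    lab = color.rgb2lab(rgb)
    l_band, a_band, b_band = lab[..., 0], lab[..., 1], lab[..., 2]

    # L channel: keep the approximation and the three texture sub-bands
    ll, _ = pywt.dwt2(l_band, 'db2', mode='periodization')
    ll, (lh, hl, hh) = pywt.dwt2(ll, 'db2', mode='periodization')

    # A and B channels: only the down-sized approximation is kept
    bands = np.stack([ll, _ll2(a_band), _ll2(b_band), hh, hl, lh], axis=-1)
    return bands   # shape (64, 96, 6); 32 x 48 and 16 x 24 follow from further DWT levels
```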

3.2. Model parameter setting and training

The MrSMIR system currently under construction contains around 3000 real-scene military images. The images are categorized into five higher-level classes, namely 'air-force', 'warship', 'submarine', 'tank', and 'jeep'. Fig. 3a displays sample L-band images of each defined category. As described earlier, a total of six methods (as shown in Fig. 1b–g) for fitting images to HMMs are evaluated.

Fig. 3. (a) Samples of the defined categories for object recognition; (b) modeling results with HMM Q* = 2 and Q* = 3; (c) different resolutions may generate different outcomes.

Fig. 4. Three resolutions in total, with different settings of the HMM state number Q*.


Fig. 5. Results of object recognition using (a) the strip-like scan; (b) the Hilbert–Peano scan; (c) the 'V'-like scan; (d) the 'U'-like scan; (e) the one-side observation-density adjustment and (f) the two-side observation-density adjustment.


Fig. 3b shows images modeled by HMMs with the state number Q* chosen as 2 and 3, which produce different modeling results. Fig. 3c demonstrates that an image at low (16 × 24) and higher (64 × 96) resolution can generate different model outcomes. One should therefore recognize that, for real-scene object recognition, the major modeling difficulty is the choice of the model parameter Q* and of the image resolution. In order to resolve these issues, the experiments are designed to incorporate the multi-resolution images with different Q* numbers, as shown in Fig. 4. Therefore, in the MrSMIR system, each image is modeled by six HMMs. The choice of the number of HMM hidden states Q* shown in Fig. 4 requires further explanation. When implementing an HMM for modeling image content, the pixel values correspond to the HMM observations, and the hidden states then correspond to the patches (i.e., clusters) to which the pixels belong.

Therefore, one idea for selecting a suitable number of states is to match it roughly to the number of patches formed within the imagery. Hence, HMMs with 2 to 10 states would generally be appropriate for the real-scene military image set used, depending on how complicated the object and background within an image are. Rabiner (1989), however, also demonstrates that when the number of states is greater than 3, the errors are somewhat insensitive to the number of states. In this study, the number of states Q* is therefore chosen as 3, 4, and 5 for HMMs modeling 64 × 96 images, Q* = 2 and 3 for 32 × 48 images, and Q* = 2 for 16 × 24 images, so that the computational load does not increase too dramatically and the resulting modeled imagery is not made up of so many patches that the identification of semantic object patterns deteriorates. Overall, the experiments are made in two parts. In the first part of the experiments, the primary purpose is to evaluate the performance of the newly designed HMM scanning sequences.
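As a compact summary of Fig. 4, the sketch below enumerates the six (resolution, Q*) configurations and trains one HMM per configuration for a database image; `fit_hmm` stands for any Baum–Welch trainer (e.g., one built on the forward–backward sketch in Section 2.1) and is assumed rather than shown.

```python
# The six per-image HMM configurations of Fig. 4: (resolution, number of states Q*)
HMM_CONFIGS = [
    ((64, 96), 3), ((64, 96), 4), ((64, 96), 5),
    ((32, 48), 2), ((32, 48), 3),
    ((16, 24), 2),
]

def train_image_models(image_bands, fit_hmm):
    """Train the six HMMs for one database image.

    image_bands : dict mapping a resolution (rows, cols) to its observation data
    fit_hmm     : assumed trainer, fit_hmm(observations, n_states) -> trained HMM
    """
    return {(res, q): fit_hmm(image_bands[res], q) for res, q in HMM_CONFIGS}
```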

Table 1
Part of the likelihoods used to train the decision tree.

| No.    | Class     | 16×24 Q2  | 32×48 Q2  | 32×48 Q3    | 64×96 Q3  | 64×96 Q4  | 64×96 Q5  | AVG      |
|--------|-----------|-----------|-----------|-------------|-----------|-----------|-----------|----------|
| T00014 | Tank      | 21.303734 | 20.923993 | 20.26692391 | 19.748621 | 19.603914 | 19.773139 | 19.60391 |
| T00110 | Jeep      | 22.147882 | 21.788623 | 22.11593004 | 21.563163 | 22.090191 | 22.052547 | 21.56316 |
| T00110 | Tank      | 27.734067 | 23.484117 | 24.97635081 | 22.474192 | 22.828156 | 22.124282 | 22.12428 |
| T00061 | Jeep      | 22.851869 | 22.481332 | 22.35339584 | 22.657007 | 22.248221 | 22.537338 | 22.24822 |
| S00153 | Submarine | 22.851869 | 22.481332 | 22.95339584 | 22.657007 | 22.248221 | 22.537338 | 22.24822 |
| T00142 | Tank      | 23.717353 | 22.056237 | 23.09470882 | 22.897531 | 24.247638 | 23.024724 | 22.05624 |
| J00061 | Jeep      | 23.683224 | 23.249728 | 23.98964956 | 25.633683 | 23.262659 | 23.254621 | 23.24973 |
| J00061 | Submarine | 25.034242 | 22.544637 | 26.08915642 | 23.481198 | 23.765545 | 23.282036 | 22.54464 |
| T00196 | Tank      | 24.593261 | 23.681778 | 23.96770606 | 24.6167   | 24.460644 | 23.362411 | 23.36241 |
| A00202 | Tank      | 24.357606 | 22.697213 | 23.45499733 | 23.346546 | 23.264265 | 23.455388 | 22.69721 |
| J00294 | Jeep      | 24.341429 | 23.147333 | 23.63982808 | 22.936688 | 24.03313  | 23.484111 | 22.93669 |
| W00007 | Tank      | 23.759876 | 23.707872 | 23.75106289 | 24.801537 | 23.20309  | 23.838488 | 23.20309 |
| T00083 | Jeep      | 23.4808   | 24.543377 | 25.62137383 | 24.141674 | 24.00302  | 23.895358 | 23.4808  |
| T00083 | Submarine | 23.4808   | 24.543377 | 25.62137383 | 21.141674 | 24.00302  | 23.895358 | 23.4808  |
| S00153 | Tank      | 22.969964 | 23.656107 | 26.27342217 | 23.738042 | 24.725561 | 24.076065 | 22.96996 |
| A00033 | Air force | 24.261788 | 23.970394 | 24.41048611 | 24.212778 | 23.695017 | 24.149112 | 23.69502 |
| T00008 | Air force | 24.939386 | 21.867245 | 21.93338021 | 23.009089 | 23.350386 | 24.166482 | 21.86724 |
| A00202 | Warship   | 24.83013  | 24.134513 | 24.31029241 | 23.975482 | 25.833375 | 24.270833 | 23.97548 |
| T00008 | Warship   | 26.808456 | 25.939226 | 25.53394709 | 25.873822 | 24.634648 | 24.419692 | 24.41969 |
| T00008 | Tank      | 24.507292 | 24.76852  | 26.61058886 | 25.582449 | 24.997307 | 24.56478  | 24.50729 |

Fig. 6. Part of decision tree and rules used for retrieving.


The contributions of the various combinations of input features (i.e., L, A, B, HH, HL, LH) are also analyzed. In this part of the experiments, during the image query process, an image is pre-processed hierarchically and then fed into the system to be matched against the stored images. The test image, now at three resolutions, is input to the corresponding HMMs of each image in the system, and the likelihoods are computed using the Baum–Welch algorithm (Baum & Egon, 1967). The average likelihood for each image in the system is extracted and jointly sorted over all the images, and the top 20 images with the highest average likelihood are retrieved. Finally, the retrieval accuracy is evaluated. In the second part of the experiments, rather than relying solely on the maximal average likelihood to determine the retrieval mechanism, a C4.5 decision tree is introduced to trace the rules from the likelihoods, so as to seek the possibility of better managing the likelihood information and further enhancing the image retrieval accuracy.
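A minimal sketch of this first-part retrieval step is given below: a query image (already preprocessed to the three resolutions) is scored against each database image's six HMMs, the likelihoods are averaged, and the 20 best-scoring images are returned. The scoring function `loglik` is assumed, and all names are illustrative.

```python
import numpy as np

def retrieve_top20(query_bands, database_models, loglik, k=20):
    """Rank database images by the average likelihood of the query under
    each image's six HMMs (cf. Fig. 4) and return the k best matches.

    query_bands     : dict resolution -> observation sequence of the query image
    database_models : dict image_id -> {(resolution, Q*): trained HMM}
    loglik          : assumed function, loglik(hmm, observations) -> float
    """
    scores = {}
    for image_id, models in database_models.items():
        values = [loglik(hmm, query_bands[res]) for (res, _q), hmm in models.items()]
        scores[image_id] = float(np.mean(values))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]
```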


4. Results and discussions

4.1. Image retrieval accuracy based on HMM

In total, six HMM modeling methods, of which four are 'observation-sequence' adjustments and two are 'observation-density' adjustments, are implemented. The analysis uses the spectral L, A, and B bands as a basis and incorporates each possible combination of the HL, LH, and HH bands to evaluate the corresponding robustness in object recognition. For all the HMMs, the original model parameters, namely A, B, and π, are randomly assigned following the methodology of Rabiner (1989). The hidden state (i.e., cluster) sequence Q* is then estimated using dynamic programming (Baum & Egon, 1967; Baum & Petrie, 1966) so as to maximize P(Q* | O, λ). The model parameter estimation in this study always converged within 10 iterations. As introduced earlier, when the HMMs have finished training, a test image is input to each HMM and the likelihood is calculated. The images with the top 20 maximal average likelihoods are retrieved. For each image category, a total of 30 test images are tried, and the accuracy analysis is based on the average retrieval accuracy over these test images. The object recognition results using HMMs based on the strip-like, Hilbert–Peano, 'V'-like, 'U'-like, and one-side and two-side 'observation-density' adjustment schemes are shown in Fig. 5. Note that, within Fig. 5, the black bars indicate the average accuracy generated by each combination of input bands. It is shown that using the LAB spectral bands alone for recognizing objects does not perform well, with the accuracies obtained from all the HMM modeling methods being no more than 50%. However, when wavelet textures are appropriately included, the accuracy increases dramatically; normally more than a 10% increase in accuracy is achieved. These experimental results indicate that spectral information alone is not sufficient for identifying objects in real-scene imagery, and that wavelet texture information, when combined with spectral information, can strongly improve object recognition accuracy. As far as the combinations of input features are concerned, it can be found that increasing the number of input bands does not always contribute positively to the object recognition accuracy; involving all input bands (i.e., L, A, B, HL, LH, and HH) does not perform well in any of the HMM modeling methods. According to Fig. 5, it is also discovered that using the combination of the LAB and HL bands generally produces the highest recognition accuracy. Therefore, considering both computational burden and accuracy, using LAB and HL as inputs can be an ideal choice in this study domain. One interesting discovery is that using the combination of the LAB and HH bands for object recognition gives the worst performance of all the combinations of spectral and texture bands.

One reason for such an outcome could be that the HH band is generally sensitive to noise, which may degrade object recognition quality (Tso & Mather, 2009). Therefore, for recognizing objects in real-scene imagery, the HH band should be used carefully and with more consideration. From the accuracies shown in Fig. 5, one can see that, in comparison with the strip-like method (Fig. 5a), which obtains a highest accuracy of only 54.6% using the LAB–HL bands as inputs, the proposed 'observation-sequence' and 'observation-density' adjustment methods, from both theoretical and practical aspects, all make progress in modeling the 2D image and achieve higher accuracy in identifying objects to a certain extent. Among the proposed methods, the 'U'-like method generally outperforms the other approaches. However, as far as the 'observation-density' adjustment methods (Fig. 5e and f) are concerned, although they show some progress in accuracy, the improvement is less than ideal.

4.2. Image retrieval accuracy when appending C4.5

According to the results shown above, when the decision whether to retrieve an image is based on the maximal average likelihood alone, the retrieval accuracy is clearly not ideal (all less than 70%). Further enhancement is thus required. In order to analyze the likelihood information properly, the C4.5 decision tree is implemented to trace out more complicated decision rules. The experiments demonstrate that using the combination of the input features L, A, B, and HL (Fig. 5d) with the 'U'-like HMM achieves the best results among all the HMMs. This type of HMM then serves as the basis for calculating the likelihoods during the second part of the experiments. For each category, 20 test images are randomly chosen and sequentially input to the system; for each test image, the top 20 images with maximal average likelihoods from the multi-resolution, multi-HMM setup are gathered, resulting in 400 data records in total. These data are then fed into the decision tree to trace out the rules.
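The paper uses C4.5 for this step; as an illustrative stand-in, the sketch below fits scikit-learn's CART-style tree with an entropy criterion on the likelihood records (the seven columns of Table 1 as features, the class label as target) and prints the induced rules. The column names and the 400 × 7 feature matrix are assumptions for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

FEATURES = ['16x24_Q2', '32x48_Q2', '32x48_Q3', '64x96_Q3', '64x96_Q4', '64x96_Q5', 'AVG']

def learn_retrieval_rules(X, y):
    """Fit an entropy-based tree on the HMM likelihoods and print its rules.

    X : array of shape (400, 7), the likelihood records as in Table 1
    y : array of 400 class labels ('Tank', 'Jeep', ...)
    Note: scikit-learn's CART with criterion='entropy' stands in for C4.5 here.
    """
    tree = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=5)
    tree.fit(X, y)
    print(export_text(tree, feature_names=FEATURES))
    return tree
```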

Table 1 shows part of the records used in building the decision trees. Fig. 6 shows part of a decision tree used to identify images from the MrSMIR system. Our experimental results show that the decision trees formulated are generally not complicated, and the rules for recognizing the image of interest can therefore easily be traced. Fig. 7 presents the final image retrieval results using decision trees. It is found that the retrieval accuracy can be enhanced to as high as 78%, and the remaining categories can also achieve more than 70% accuracy.

Fig. 7. Results of image retrieval accuracy using decision trees.


The results clearly demonstrate the usefulness of the proposed approach.

5. Concluding remarks

The potential of combining HMMs and decision trees for modeling and retrieving imagery has been demonstrated in this research. Experiments using observation-manipulation-based HMMs for identifying military vehicles are performed and the accuracies are reported. It is found that the proposed HMMs using 'observation-sequence' adjustment are superior approaches for modeling imagery. The proposed 'U'-like HMM fitting method can achieve up to 68.8% accuracy in identifying objects. When decision trees are introduced, the study further demonstrates the usefulness of the decision tree in managing multiple likelihoods so as to further improve imagery retrieval accuracy. Future work will focus on developing algorithms within the MrSMIR system for automatic linguistic indexing and interactive image retrieval.

Acknowledgements

The computing facility is provided by the Department of Information Management, Management College, NDU, Taiwan. The research is sponsored by the National Science Council (NSC), Taiwan, under Contract No. NSC-96-2221-E-606-028.

References

Abend, K., Jarley, T. J., & Kanal, L. N. (1965). Classification of binary random patterns. IEEE Transactions on Information Theory, 11, 538–544.
Baum, L. E., & Egon, J. A. (1967). An inequality with applications to statistical estimation for probabilistic functions of a Markov process and to a model for ecology. Bulletin of the American Mathematical Society, 73, 360–363.
Baum, L. E., & Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 37, 1554–1563.
Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164–171.

Bertozzi, M., Broggi, A., Caraffi, C., Del Rose, M., Felisa, M., & Vezzoni, G. (2007). Pedestrian detection by means of far-infrared stereo vision. Computer Vision and Image Understanding, 106, 194–204.
Chen, Y., Bi, J., & Wang, J. Z. (2006). MILES: Multiple-instance learning via embedded instance selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12), 1–17.
Chen, Y., Wang, J. Z., & Krovetz, R. (2005). CLUE: Cluster-based image retrieval by unsupervised learning. IEEE Transactions on Image Processing, 14(8), 1187–1201.
Cole, R., Hirschman, L., Atlas, L., & Beckman, M. (1995). The challenge of spoken language systems: Research directions for the nineties. IEEE Transactions on Speech and Audio Processing, 3, 1–21.
Daubechies, I. (1992). Ten lectures on wavelets. Philadelphia: SIAM.
Eickeler, S., Müller, S., & Rigoll, G. (2000). Recognition of JPEG compressed face images based on statistical methods. Image and Vision Computing, 18, 279–287.
Forsyth, D. A., & Ponce, J. (2002). Computer vision: A modern approach. Prentice Hall.
Gevers, T., & Stokman, H. (2004). Robust histogram construction from color invariants for object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(1), 113–117.
Jain, A. K. (1989). Fundamentals of digital image processing. Englewood Cliffs: Prentice Hall.
Li, J., Najmi, A., & Gray, R. M. (2000). Image classification by a two-dimensional hidden Markov model. IEEE Transactions on Signal Processing, 48, 517–533.
Natsev, A., Rastogi, R., & Shim, K. (1999). WALRUS: A similarity retrieval algorithm for image databases. SIGMOD Record, 28(2), 395–406.
Pass, G., & Zabith, R. (1999). Comparing images using joint histograms. Multimedia Systems, 7, 234–240.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo: Morgan Kaufmann.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257–286.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
Tamura, H., & Yokoya, N. (1984). Image database systems: A survey. Pattern Recognition, 17(1), 29–43.
Tso, B. (2008). MrSMIR project. National Science Council, Taiwan, Contract No. NSC 96-2221-E-606-028.
Tso, B., & Chang, P. (2007). Mining free-structured information based on hidden Markov models. Expert Systems with Applications, 32(1), 97–102.
Tso, B., & Mather, P. M. (2009). Classification methods for remotely sensed data (2nd ed.). London, UK: Taylor and Francis.
Tso, B., & Olsen, R. C. (2005a). Combining spectral and spatial information into hidden Markov models for unsupervised image classification. International Journal of Remote Sensing, 26(10), 2113–2133.
Tso, B., & Olsen, R. C. (2005b). A contextual classification scheme based on MRF model with improved parameter estimation and multiscale fuzzy line process. Remote Sensing of Environment, 97, 127–136.