Biological shape characterization for automatic image recognition and diagnosis of protozoan parasites of the genus Eimeria


Pattern Recognition 40 (2007) 1899 – 1910 www.elsevier.com/locate/pr

César A.B. Castañón a,b, Jane S. Fraga a, Sandra Fernandez a, Arthur Gruber a,∗, Luciano da F. Costa b,∗∗

a Instituto de Ciências Biomédicas, Departamento de Parasitologia, Universidade de São Paulo, Av. Prof. Lineu Prestes 1374, São Paulo SP, 05508-000, Brazil
b Instituto de Física de São Carlos, Universidade de São Paulo, Caixa Postal 369, São Carlos SP, 13560-970, Brazil

Received 21 July 2006; received in revised form 21 November 2006; accepted 6 December 2006

Abstract

We describe an approach to automatic feature extraction for shape characterization of seven distinct species of Eimeria, a protozoan parasite of domestic fowl. We used digital images of oocysts, a round-shaped stage presenting inter-specific variability. Three groups of features were used: curvature characterization, size and symmetry, and internal structure quantification. Species discrimination was performed with a Bayesian classifier using a Gaussian distribution. A database comprising 3891 micrographs was constructed and samples of each species were employed for the training process. The classifier presented an overall correct classification of 85.75%. Finally, we implemented a real-time diagnostic tool through a web interface, providing a remote diagnosis front-end.

© 2007 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Shape analysis; Feature extraction; Pattern classification; Image processing; Remote diagnosis; Real-time systems; Eimeria; Avian coccidiosis

1. Introduction

An important goal in image analysis is to classify and recognize objects of interest in digital images. Objects can be characterized in several ways, e.g. by identifying their colors, textures, shapes, movements, and position within images. No universal approach is currently available to resolve pattern recognition problems for different domains of images. In order to construct a model for object characterization and subsequent classification, one needs to perform a prior analysis of the domain of images. Some applications of pattern recognition to biological problems, specifically for diagnostic purposes, have been reported in the literature. Comaniciu et al. [1] developed an image retrieval system to discriminate between malignant lymphomas



Supplementary material is available at http://puma.icb.usp.br/coccimorph/

∗ Corresponding author. Tel.: +55 11 30917274; fax: +55 11 30917417.

∗∗ Also for correspondence.

and chronic lymphocytic leukemia, using descriptors for textural and shape characterization. Similar work, developed by Sabino et al. [2] for leukemia diagnosis, was based on textural identification through gray level co-occurrence matrices. Jalba et al. [3] proposed another interesting approach for automatic diatom identification, based on contour analysis by constructing a morphological curvature scale space for feature extraction. Other works adopted a Gaussian multivariate analysis for the identification of bacterial types [4], recognition of culture cells [5], and classification of chromosome images [6].

A particularly interesting application field for implementing image-based identification algorithms is parasite diagnosis. Parasites have been classically discriminated and identified through non-automated morphological analysis, among other methods. Since many parasitic organisms present developmental stages that have a well-defined and reasonably homogeneous morphology, they are amenable to pattern recognition techniques. Eimeria, a genus comprising pathogenic protozoan parasites, has been used in several image analysis studies [7–9]. A total of seven distinct Eimeria species can infect the domestic fowl, causing an economically relevant disease known as coccidiosis [10]. Because different species vary in

0031-3203/$30.00 © 2007 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2006.12.006


C.A.B. Castañón et al. / Pattern Recognition 40 (2007) 1899 – 1910

Fig. 1. Photomicrographs of oocysts of the seven Eimeria species of domestic fowl. Samples: (a) E. maxima, (b) E. brunetti, (c) E. tenella, (d) E. necatrix, (e) E. praecox, (f) E. acervulina, and (g) E. mitis.

pathogenicity and virulence, their precise discrimination is important for epidemiological studies and disease control measures. Parasite oocysts, a round-shaped developmental stage, are shed in profuse amounts in the feces of infected chicks. Oocysts of distinct species present differences of size (area, diameter), contour (elliptic, ovoid, circular), internal structure, thickness and color of the oocyst wall, among other morphological variations (Fig. 1). However, the correct species discrimination by human visual inspection is severely restricted by the slight morphological differences that exist among the distinct species and the overlap of characteristics. Considering the limitations imposed by morphology-based diagnosis, different molecular approaches have been devised for species discrimination, such as a PCR-based diagnostic assay using the ribosomal ITS1 as a target [11,12]. Our group has also developed molecular diagnostic tools for Eimeria spp., including a multiplex PCR assay for the simultaneous diagnosis of the seven species that infect the domestic fowl [13]. These molecular diagnosis assays are very sensitive and specific, but require highly trained personnel and sophisticated infrastructure. Previous works have reported the differentiation of Eimeria [7–9] and helminths [14] using digital image recognition. Kucera and Reznicky [7] reported the species differentiation of Eimeria spp. of domestic fowl using only two features, length and width of oocysts, which were computed in a semiautomatic fashion. Such a limited number of characters, however, restricted the ability to differentiate all seven species due to the similar morphology and overlap among the distinct species. Sommer [15,16], working with cattle Eimeria, used a more complex approach, where the parametric contour was considered as input to compute the amplitude of the Fourier transform. 
Nevertheless, the classification method (average linkage clustering) does not consider the distribution of elements and is not particularly suitable for real-time systems. Yang et al. [17] developed an automatic system for human helminth egg detection and classification using artificial neural networks (ANNs).

The authors followed the work developed by Sommer [15], where the parametric contour of the object was used to compute the amplitude of the Fourier transform. Cross-validation results showed correct classification rates of 86.1–90.3%, but the small number of samples utilized severely restricted an estimation of the confidence level of the approach. Another work using ANNs for object detection was described by Widmer et al. [18] for the identification of Cryptosporidium parvum. The authors differentiated parasite oocysts from sample debris with success, but no species differentiation was conducted. The small number of features utilized in these previous works can be explained by the difficulty of quantifying morphological features. This limitation, together with the high complexity of the algorithms, makes the development of real-time systems for automatic diagnosis a challenging task. In addition, the set of features to be used is strongly dependent on the characteristics of the image domain. In this regard, our group has reported several techniques for shape characterization. Thus, Bruno et al. [19] used multiscale features for the characterization of cat ganglion neural cells, whereas Coelho et al. [20] proposed another set of features (diameter, eccentricity, fractal dimension, influence histogram, influence area, convex hull area and convex hull diameter) for the same problem. Costa et al. [21] used digital curvature as a feature for morphological characterization and classification of landmark shapes. In this paper we present an approach to extract morphological information by using different computer vision techniques in order to perform an automatic species differentiation of Eimeria spp. oocysts. We report the development of a shape representation approach that considers three types of morphological characteristics: (a) multiscale curvature, (b) geometry, and (c) texture. 
All these features are automatically extracted, constituting a 13D (13-dimensional) feature vector for each oocyst image. While the considered measurements and adopted classification methods used throughout this work are not necessarily


novel, their combined application resulted in an operational framework that was extensively validated for Eimeria classification. To our knowledge, this is the most complete example of a system that implements pattern recognition methods for parasite diagnosis and fully integrates them with a user-friendly web interface. We believe that this work may represent a new paradigm for parasitological diagnosis.

The article is organized as follows. In Section 2, an overall view of the system is presented. In Section 3 we discuss the different techniques and the methodology for shape characterization, while the process and techniques for species classification and image similarity are presented in Section 4. Section 5 reports the results of species differentiation and the development of a real-time automatic species discrimination system. Sections 6 and 7 present a discussion and the conclusions of this work, respectively.

2. An overview of the diagnosis system based on automatic shape characterization

Fig. 1 presents oocyst photomicrographs of the seven Eimeria species of domestic fowl. As can be seen, the different species vary in terms of size, shape, and internal structure. In order to identify the species that corresponds to a specific oocyst image, we initially developed a mathematical model to characterize oocyst morphology, and then applied it for species discrimination. The oocyst analysis and recognition process reported here includes three components: (a) image pre-processing, (b) feature extraction, and (c) pattern recognition.

The image pre-processing stage defines the boundary of the object to be processed. This boundary is determined by the parametric contour of the oocyst, which is a two-dimensional vector (x, y) representing the location of each pixel in the contour. The feature extraction step uses the parametric contour of the oocysts as an input vector. We used a total of 13 features describing curvature, geometrical measures and texture.
These characteristics constitute the feature vector of the oocyst image, which in turn is stored in the feature database and provides the input data for the pattern recognition stage. The last analysis stage comprises the pattern recognition or pattern classification. For this task, the classifier undergoes a training process that uses known observations (a training set) to estimate the necessary statistics. A class of patterns is typically represented as a probability density function (pdf) of features. In this case, a simple model, a single Gaussian distribution, is used to represent each of the patterns. This Gaussian function provides the basis for the multidimensional Bayesian classifier, whose decision regions provide the basis for the species identification.

3. Shape characterization

Images can be mathematically understood as sets of connected points in a two-dimensional space F that can be approximated in a discrete binary image space. Image classification, performed directly on F, is a hard task that may require O(N²) comparisons, assuming that each image has N pixels.


Table 1
Geographic origin of the Eimeria strains and species used in this study and the respective number of image samples

Name                 Origin                  Samples
E. acervulina H      Houghton, England       374
E. acervulina 103    São Paulo, Brazil       114
E. acervulina R7     Santa Catarina, Brazil  148
E. brunetti C        São Paulo, Brazil       418
E. maxima H          Houghton, England       103
E. maxima L          São Paulo, Brazil       91
E. maxima 50         São Paulo, Brazil       127
E. mitis RT          Czech Republic          335
E. mitis 30          São Paulo, Brazil       199
E. mitis 44          São Paulo, Brazil       223
E. necatrix DF       São Paulo, Brazil       259
E. necatrix 103      São Paulo, Brazil       145
E. praecox H         Houghton, England       377
E. praecox 1D1A      São Paulo, Brazil       180
E. praecox D         USA                     190
E. tenella H         Houghton, England       311
E. tenella CR        Czech Republic          137
E. tenella MC        São Paulo, Brazil       160

The representation of an image can be modified by applying a suitable image transformation (IT) that maps F to a new, and typically smaller, feature space F′. This means that most of the classification-related information is “squeezed” into a relatively small number of particularly informative features, leading to a reduction of the necessary feature space dimension. The basic reasoning behind transform-based features is that some chosen “sets of filters” [22] can exploit and remove information redundancies that normally occur in natural images [23]. Shape can be represented either by its contour or by its region [24,25]. Global contour-based descriptors can be computed from the shape boundary. In the present work, three sets of features are used: (a) shape analysis tools based on the multiscale Fourier transform-based approach to curvature estimation, (b) geometrical measurements, and (c) features for texture characterization.

3.1. Biological samples

Parasite samples of each of the seven Eimeria species that infect the domestic fowl were used throughout this work. In addition, whenever available, we used multiple strains of each species, collected from different geographic sources (Table 1). The parasites were propagated in three-week-old chicks and oocysts were isolated following standard protocols [10].

3.2. Image pre-processing

We used oocyst micrographs as the starting point for the automatic analysis. The pictures were obtained with an optical microscope (Nikon Eclipse E800) coupled to a 4-megapixel CCD camera (Nikon Coolpix 4500). The images were captured with a 40× magnification objective and saved as 24-bit JPEG (fine quality option) files. Using these conditions, all pictures presented a spatial resolution of 11.1 pixels/µm. Depending on the


Fig. 2. Different stages of oocyst image pre-processing. An original color image is firstly converted into (a) a gray scale image. After segmentation, the resulting (b) binarized image is used for (c) contour detection.

sample concentration and purity, a single micrograph could provide many oocysts for further processing. Low quality oocyst images were not considered for downstream analysis. Common practical problems included out of focus images, oocysts not adequately well positioned, and atypical oocyst morphologies caused by accidental cracking or squeezing. Another problematic aspect we observed is related to the presence of debris and bacteria in dirty samples, thus complicating object segmentation. The process of oocyst image isolation was carried out manually by using an image processing program (Gimp or Adobe Photoshop). The objects of interest, single oocysts, were cropped out of the picture and used to create new image files that were in turn used as input data to our system. A total of 3891 oocyst images constituted the data set of the present work. Image quality can be substantially heterogeneous due to differences of illumination, contrast, focus, acquisition resolution, thus hampering object detection. To reduce the effect of illumination variations, we equalized the images through the histogram specification method [26], considering as eigenimage a prototype computed previously for each species from the training set. For object segmentation, we applied a thresholding approach [26] with a cut-off value manually determined for each image. As a result, binary images were produced with the respective object being defined by black pixels on a background of white pixels. The steps of converting an original color oocyst image into a parametric contour are depicted in Fig. 2. The binarized images (see Fig. 2b) are submitted to an algorithm that extracts the external contour of the object. This is done by selecting an initial point belonging to the contour of the object. The algorithm involves successive detections of the next contour pixel by using chain-code directions. 
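The thresholding and contour-following steps just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a grayscale array, a manually chosen `threshold`, dark object pixels, and Moore-neighbour tracing over the eight chain-code directions. The names `binarize` and `trace_contour` are hypothetical.

```python
import numpy as np

# 8-connected chain-code directions as (row, col) steps, clockwise from east.
DIRS = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]

def binarize(gray, threshold):
    """Object pixels are the dark ones (below the manual cut-off)."""
    return np.asarray(gray) < threshold

def trace_contour(mask):
    """Return the external contour as an array of (x, y) points."""
    m = np.pad(np.asarray(mask, dtype=bool), 1)   # add a background border
    rows, cols = np.nonzero(m)
    if rows.size == 0:
        return np.empty((0, 2), dtype=int)
    start = (int(rows[0]), int(cols[0]))          # topmost, then leftmost pixel
    contour, current = [], start
    search_from, first_move = 4, None             # west of start is background
    for _ in range(4 * m.size):                   # safety bound on iterations
        for k in range(8):                        # clockwise neighbour scan
            d = (search_from + k) % 8
            nxt = (current[0] + DIRS[d][0], current[1] + DIRS[d][1])
            if m[nxt]:
                break
        else:                                     # isolated single pixel
            contour.append(current)
            break
        if current == start and d == first_move:  # same first move: closed
            break
        if first_move is None:
            first_move = d
        contour.append(current)
        current = nxt
        # Restart the scan at the background neighbour seen just before nxt.
        search_from = (d + 6) % 8 if d % 2 == 0 else (d + 5) % 8
    pts = np.array(contour) - 1                   # undo the padding offset
    return pts[:, ::-1]                           # (row, col) -> (x, y)
```

The two columns of the returned array play the role of the sampled signals x(t) and y(t), with t being the index along the contour.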
The result is a parametric representation, where every point in the contour is identified by coordinates x(t) and y(t) [25].

3.3. Curvature based on multiscale Fourier transformation

The curvature of an object is an important characteristic that can be extracted from the respective contour. The pioneering work of Attneave [27] emphasized the importance that transient events and asymmetries have in human visual perception, thus influencing the subsequent research on shape in computer

vision. Riggs [28], for instance, postulated that curvature detectors would be present at the neuronal level in humans. Due to its biological motivation, curvature analysis has gained attention from the pattern recognition community, and many methods have been proposed to compute it [29]. Our approach takes advantage of the closed parametric contour that is represented by the x(t) and y(t) signals, which are used for curvature estimation using the Fourier derivative property [30]. Let the parametric representation of the contour be

$$c(t) = (x(t), y(t)). \tag{1}$$

The curvature k(t) of c(t) is defined as

$$k(t) = \frac{\dot{x}(t)\,\ddot{y}(t) - \ddot{x}(t)\,\dot{y}(t)}{\left(\dot{x}(t)^2 + \dot{y}(t)^2\right)^{3/2}}, \tag{2}$$

where $\dot{x}$ and $\dot{y}$ are the first derivatives, and $\ddot{x}$ and $\ddot{y}$ the second derivatives, of the signals x(t) and y(t), respectively. These values can be easily computed using the Fourier derivative property [25]. Using an arc length parameterization, and convolving the original contour signal c(t) with derivatives of a Gaussian function with varying standard deviation a, the multiscale curvature follows from Eq. (2), as described by Mokhtarian et al. [29]:

$$k(t, a) = \dot{x}(t, a)\,\ddot{y}(t, a) - \ddot{x}(t, a)\,\dot{y}(t, a). \tag{3}$$
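As an illustration of Eqs. (2) and (3), the sketch below estimates the smoothed curvature of a closed contour with NumPy's FFT. It is a sketch under assumptions rather than the authors' code: the contour is taken as uniformly sampled, the scale `a` is expressed in contour samples, and the normalization of Eq. (2) is kept so that the result does not depend on the contour being parameterized by arc length (for a unit-speed contour the denominator is 1 and the expression reduces to Eq. (3)). The function names are hypothetical.

```python
import numpy as np

def multiscale_curvature(x, y, a):
    """Curvature k(t, a) of a closed contour at Gaussian scale a (in samples)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    f = np.fft.fftfreq(x.size)                   # frequencies in cycles/sample
    gauss = np.exp(-2.0 * (np.pi * a * f) ** 2)  # spectrum of a Gaussian, std a
    jw = 2j * np.pi * f                          # Fourier derivative factor
    X, Y = np.fft.fft(x) * gauss, np.fft.fft(y) * gauss
    xd, yd = np.fft.ifft(jw * X).real, np.fft.ifft(jw * Y).real
    xdd, ydd = np.fft.ifft(jw**2 * X).real, np.fft.ifft(jw**2 * Y).real
    return (xd * ydd - xdd * yd) / (xd**2 + yd**2) ** 1.5

def curvegram(x, y, scales):
    """Stack the curvature signals for several scales (rows: scales)."""
    return np.stack([multiscale_curvature(x, y, a) for a in scales])
```

For a circle of radius R traversed counter-clockwise the estimate is close to 1/R at every point. Gaussian smoothing slightly shrinks the contour, which this sketch does not compensate for.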

The multiscale approach to curvature estimation leads to the so-called curvegram, where the curvature values appear as a scale-space representation. Fig. 3 shows the contour of an oocyst (panel a) and its corresponding curvegram (panel b). Gaussian smoothing is essential for controlling curvature instabilities caused by noise along the contour c(t), which would otherwise produce many peaks of variable height. The smoothing level is determined by the standard deviation a of the Gaussian function. A small a value (Fig. 3b, a = 10) results in a noisy curvature, whereas a higher value yields a smoother curvature (Fig. 3c, a = 50). This effect can be better observed in a 3D curvegram that includes different scale values (Fig. 3d). While the curvature itself can be used as a feature vector, this approach presents some serious drawbacks, including the fact that the curvature signal can be too large (involving thousands


Fig. 3. An oocyst contour (a) and the corresponding curvegrams using gamma values of 10 (b) and 50 (c), or a range of standard deviation values for Gaussian function, displayed in a 3D curvegram (d).

of points, depending on the contour) and highly redundant. Once the curvature has been estimated, the following shape measures [25] can be calculated in order to circumvent these problems: sampled curvature, curvature statistics (mean, median, variance, standard deviation, entropy, higher moments, etc.), maxima, minima, inflection points, and bending energy.
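A few of these measures can be computed directly from a sampled curvature signal, as in the sketch below. The histogram-based entropy with 32 bins and the definition of bending energy as the mean squared curvature are common choices assumed here for illustration; the source does not state the exact definitions used. The function name is hypothetical.

```python
import numpy as np

def curvature_descriptors(k, bins=32):
    """Summary statistics of a sampled curvature signal k."""
    k = np.asarray(k, dtype=float)
    counts, _ = np.histogram(k, bins=bins)
    p = counts[counts > 0] / counts.sum()      # empirical bin probabilities
    return {
        "mean": float(k.mean()),
        "std": float(k.std()),
        "entropy": float(-(p * np.log2(p)).sum()),   # Shannon entropy, bits
        "bending_energy": float(np.mean(k ** 2)),
    }
```

Reducing the curvature signal to a handful of scalars like these is what keeps the feature vector small and avoids the redundancy mentioned above.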

3.4. Geometrical measurements

Some oocyst species present a distinctive characterization based only on shape and size, making it necessary to find additional features to characterize them. For instance, principal component analysis [25] was applied to find the main directional vectors (eigenvectors), which were used to define measurements such as diameters and symmetry. We used bilateral symmetry, which is considered a primary case of the geometric concept of symmetry [31]. Considering a binary image, the shape is reflected with respect to the orientation defined by its major axis to find a bilateral symmetry degree; the same process is applied with respect to the minor axis [25]. Some additional measurements related to symmetry have also been described in the literature [32–36]. In the present work, we considered the diameters (major and minor axes) and symmetry of the oocysts. Simple global descriptors included area (number of pixels in the region), eccentricity (length of major axis/length of minor axis), circularity (perimeter²/area), and bending energy [37].
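The measurements of this section can be sketched from a binary mask as follows. The details are assumptions made for illustration: diameters are taken as the pixel extent along each principal axis (range plus one pixel), and the symmetry degree is taken as the fraction of object pixels that land on the object again after reflection about an axis through the centroid. The source does not give its exact formulas, and the function name is hypothetical.

```python
import numpy as np

def geometric_features(mask):
    """Area, principal-axis diameters, eccentricity and bilateral symmetry."""
    mask = np.asarray(mask, dtype=bool)
    ys, xs = np.nonzero(mask)
    pts = np.column_stack([xs, ys]).astype(float)
    centroid = pts.mean(axis=0)
    d = pts - centroid
    # PCA: eigenvectors of the coordinate covariance (eigenvalues ascending).
    _, vecs = np.linalg.eigh(np.cov(d, rowvar=False))
    minor_axis, major_axis = vecs[:, 0], vecs[:, 1]

    def extent(axis):
        proj = d @ axis
        return float(proj.max() - proj.min() + 1.0)

    def symmetry(axis):
        # Reflect every pixel about the line through the centroid along axis.
        refl = centroid + 2.0 * np.outer(d @ axis, axis) - d
        cols, rows = np.rint(refl).astype(int).T
        ok = (rows >= 0) & (rows < mask.shape[0]) & (cols >= 0) & (cols < mask.shape[1])
        return float(mask[rows[ok], cols[ok]].sum() / len(pts))

    major, minor = extent(major_axis), extent(minor_axis)
    return {
        "area": int(mask.sum()),
        "major": major,
        "minor": minor,
        "eccentricity": major / minor,
        "symmetry_major": symmetry(major_axis),
        "symmetry_minor": symmetry(minor_axis),
    }
```

A perfectly symmetric shape scores 1.0 on both symmetry measures; shapes such as ovoid oocysts would be expected to score lower for reflection about the minor axis.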

3.5. Texture characterization based on co-occurrence matrices

The several methods for texture analysis have been classified by Tuceryan and Jain [38] into four categories: statistical, geometrical, model based and signal processing based. A powerful and frequently used method involves the so-called co-occurrence matrices [39]. This method provides a second-order approach for generating texture features. Although mainly applied to texture discrimination of images, co-occurrence matrices have also been used for region segmentation [40].
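As a concrete illustration of the co-occurrence approach, the sketch below builds a normalized gray level co-occurrence matrix for one displacement and derives four commonly used descriptors from it. The image is assumed to be already quantized to integers in [0, `levels`), the formulas are the usual textbook definitions, and the function names are hypothetical; the quantization and displacement actually used by the authors are not stated in the source.

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=8):
    """Normalized co-occurrence matrix for the displacement (dx, dy)."""
    img = np.asarray(img, dtype=int)
    h, w = img.shape
    r0, r1 = max(0, -dy), min(h, h - dy)      # rows with a valid partner
    c0, c1 = max(0, -dx), min(w, w - dx)      # columns with a valid partner
    a = img[r0:r1, c0:c1]                     # reference pixels
    b = img[r0 + dy:r1 + dy, c0 + dx:c1 + dx] # displaced partner pixels
    counts = np.zeros((levels, levels))
    np.add.at(counts, (a.ravel(), b.ravel()), 1)
    return counts / counts.sum()

def texture_features(p):
    """Second-order statistics of a normalized co-occurrence matrix p."""
    i, j = np.indices(p.shape)
    nz = p[p > 0]
    return {
        "angular_second_moment": float((p ** 2).sum()),
        "contrast": float(((i - j) ** 2 * p).sum()),
        "inverse_difference_moment": float((p / (1.0 + (i - j) ** 2)).sum()),
        "entropy": float(-(nz * np.log2(nz)).sum()),
    }
```

For example, `texture_features(glcm(img, dx=1, dy=0, levels=2))` describes horizontal neighbour pairs in a two-level image.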


The co-occurrence matrices take into account information about the relative positions of the various gray levels within the image. Two parameters are used for computing co-occurrence matrices: (a) the relative distance among the pixels and (b) their relative orientation. They involve the conditional joint probabilities, C_ij, of all pairwise combinations of gray levels given the inter-pixel displacement vector (δx, δy), which represents the separation of the pixel pairs in the x- and y-directions, respectively. Traditionally, the probabilities are stored in a gray level co-occurrence matrix (GLCM) [39,41]. This resultant matrix is a second-order histogram from which some information can be extracted [42]: angular second moment, contrast, inverse difference moment and entropy.

4. Pattern classification

4.1. Bayesian classifier

Classification is always performed with respect to some properties (or features) of the objects. Indeed, the fact that objects share the same property defines an equivalence relation in terms of partitioning of the object space. In this sense, a sensible classification operates in such a way as to group together into the same “class” entities that share some properties, while distinct classes are assigned to entities with distinct properties. We use the term “classifier” for each statistical tool, trained using a specific data set, to discriminate distinct “classes”. A Bayesian classifier [43] utilizes a probabilistic approach for classification. It can be used to compute the probability that an example x belongs to class ω_i. The computer implementation is facilitated by using the multivariate normal density function, which is entirely defined by two parameters: the mean μ_i and the covariance matrix Σ_i. Although the Bayesian decision rule is not a discriminant function, it defines regions that can be expressed in terms of discriminant functions g_i(x).
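The source does not write out the discriminant functions. For Gaussian class-conditional densities, a standard textbook form (assumed here, with ω_i denoting class i, P(ω_i) its prior probability, and the class-independent constant dropped) is the logarithm of the prior-weighted density:

```latex
g_i(x) = -\tfrac{1}{2}\,(x - \mu_i)^{t}\,\Sigma_i^{-1}\,(x - \mu_i)
         \;-\; \tfrac{1}{2}\,\ln\lvert\Sigma_i\rvert
         \;+\; \ln P(\omega_i)
```

The first term is a squared distance from x to the class mean weighted by the class covariance, the second penalizes classes with widely spread features, and the third accounts for how frequent each class is.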
To classify a new element x into one of the classes ω_i, we take the class with the highest value of g_i(x) as the corresponding true class.

4.2. Algorithm for the partition process and classifier

Aiming at obtaining a robust and reliable classifier, we developed an algorithm (Algorithm 1) to select the best combination of features, evaluate the most adequate size of the training set, and evaluate the classification accuracy. For each class, the corresponding data set was randomly divided into two groups, the training and the test sets. Different proportions of these sets were tested using intervals defined by integers (e.g. from 10:90 to 90:10). In addition, for each training:test proportion we generated a user-defined number of randomly selected paired sets, which were evaluated independently to reduce possible sampling biases. Each set was then evaluated with respect to its ability to classify the test samples correctly. The average of the classification scores, obtained for each of these paired sets, was considered as the final score of correct classification for that particular proportion of training:test sets. This approach was applied repeatedly to the different training:test set percentages.

Finally, the classification matrix was calculated as the average of all confusion matrices resulting from each training:test partition.

Algorithm 1. CLASSIFICATION()
Require: DataSet;
Require: Nc ← # of classes;
Require: Nf ← # of features;
Require: %training ← % of training set;
Require: %test ← % of test set;
Require: NrandomPartitions ← # of random sets;
Require: LC ← # of learning cycles;
Ensure: MclassMean[ ][ ]
    set MclassAux[ ][ ] with zeros;
    for i = 1 to NrandomPartitions do
        [TrainingSet, TestSet] = PARTITION(DataSet, %training, %test, Nc);
        Mclass = BAYESIANCLASSIFIER(TrainingSet, TestSet, Nc, Nf, LC);
        MclassAux = MclassAux + Mclass;
    end for
    MclassMean = MclassAux/NrandomPartitions;
    return MclassMean;

The procedure requires a DataSet with a defined number of classes (Nc) and a number of features (Nf). The partition is defined by the %training : %test proportion, and the number of times that the random process of partition will occur is determined by the NrandomPartitions parameter. Additionally, an LC parameter defines the number of learning cycles of the classifier. The resultant matrix is MclassMean.

For a better understanding of the algorithm, the partition process and the classifier are represented as separate implementations. The PARTITION function is responsible for the random process of partition of the DataSet, using the following parameters as input: the data set, the training:test proportion, and the number of classes. The function thus returns the respective training and test sets. The BAYESIANCLASSIFIER function is the core process that implements the classifier. The classifier is trained with the TrainingSet and evaluated with the TestSet. Both tasks also require as input the number of classes, features, and learning cycles. The function then returns a classification confusion matrix Mclass.
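Algorithm 1 can be sketched in Python as shown below. This is an illustrative reading, not the authors' code, and it makes several assumptions: class labels are the integers 0 to Nc−1, each class is modelled by a single Gaussian with priors taken from the training class sizes, the confusion matrix is row-normalized, and the LC (learning cycles) parameter is omitted because a closed-form Gaussian fit needs no iterations.

```python
import numpy as np

def partition(y, train_frac, rng):
    """PARTITION: random per-class split into training and test indices."""
    train, test = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        n_train = int(round(train_frac * idx.size))
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return np.array(train), np.array(test)

def bayesian_classifier(X_tr, y_tr, X_te, y_te, n_classes):
    """BAYESIANCLASSIFIER: returns a row-normalized confusion matrix
    (rows: true class, columns: assigned class)."""
    scores = np.empty((len(X_te), n_classes))
    for c in range(n_classes):
        Xc = X_tr[y_tr == c]
        mu = Xc.mean(axis=0)
        cov = np.atleast_2d(np.cov(Xc, rowvar=False))
        diff = X_te - mu
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
        logdet = np.linalg.slogdet(cov)[1]
        prior = len(Xc) / len(X_tr)        # assumption: priors from class sizes
        scores[:, c] = -0.5 * maha - 0.5 * logdet + np.log(prior)
    predicted = scores.argmax(axis=1)      # class with the highest score
    M = np.zeros((n_classes, n_classes))
    np.add.at(M, (y_te, predicted), 1)
    return M / M.sum(axis=1, keepdims=True)

def classification(X, y, n_classes, train_frac, n_random_partitions, seed=0):
    """CLASSIFICATION: average confusion matrix over random partitions."""
    rng = np.random.default_rng(seed)
    m_aux = np.zeros((n_classes, n_classes))
    for _ in range(n_random_partitions):
        tr, te = partition(y, train_frac, rng)
        m_aux += bayesian_classifier(X[tr], y[tr], X[te], y[te], n_classes)
    return m_aux / n_random_partitions
```

Calling `classification(X, y, n_classes=7, train_frac=0.3, n_random_partitions=100)` on a feature matrix `X` with integer labels `y` would mirror the 30%:70% setting used later in the paper.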
Finally, MclassMean is the resultant confusion matrix, calculated as the average of all Mclass confusion matrices, computed for each of the distinct random partitions.

4.3. Image similarity

Following class assignment of the x vector through a Bayesian classifier, the next step is to determine the level of similarity between the query image and the assigned species. In this sense, the prototype element of the class is the mean μ of the normal density. Considering a training set composed of samples x1, . . . , xn, the prototype of this set is the average of the samples. Thus, we adopted this prototype as the most representative element for each class. The Mahalanobis distance is used as a similarity metric between the element x classified in class ω_i and its


prototype μ_i. This distance is adequate for multivariate normal data, which tend to cluster around the mean vector μ, falling in an ellipsoidally shaped cloud whose principal axes are the eigenvectors of the covariance matrix Σ. Thus, the natural measure of the distance from x to the mean μ is provided by the quantity

$$r^2 = (x - \mu)^{t}\, \Sigma^{-1} (x - \mu). \tag{4}$$
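The squared Mahalanobis distance of Eq. (4) takes only a few lines of NumPy. In this sketch the prototype and covariance are estimated from a class's training samples; the function names are hypothetical.

```python
import numpy as np

def class_prototype(samples):
    """Mean vector (prototype) and covariance matrix of a training set."""
    samples = np.asarray(samples, dtype=float)
    return samples.mean(axis=0), np.cov(samples, rowvar=False)

def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance r^2 between x and the prototype mu."""
    diff = np.asarray(x, dtype=float) - mu
    # Solving the linear system avoids forming the explicit inverse of sigma.
    return float(diff @ np.linalg.solve(sigma, diff))
```

A small r² means the query lies close to the class prototype relative to the spread of that class, so it can be read as a similarity score for the assigned species.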

5. Results

5.1. Feature space and selection

The feature space is defined by features divided into three groups: curvature, geometry, and texture. In the present work, the feature vectors are 13D. Table 2 displays the 13 morphological features utilized for shape characterization. Feature selection is an NP-hard problem [44]. A possible method that guarantees an optimal solution is the exhaustive search, where all combinations of feature subsets are tested. For each combination, we used the separability criteria (Bayesian classifier, described in Section 4.1) and selected the best feature vector combination [42]. Sequential methods are the mainstream approach for performing feature selection, although, unlike exhaustive search, they are not guaranteed to find the optimal subset [45]. Using a sequential forward selection (SFS) [46] test for each number and combination of features, we generated subsets that were subsequently processed by the classification process (Algorithm 1). To determine the overall rate of correct classification, each subset was then divided randomly into training (30%) and test (70%) sets. This process was repeated 100 times for each feature combination. The best combination of features, yielding the highest value of correct classification, was determined and selected for each number of utilized features, varying from two to 13 combined features. Table 3 shows the results obtained for the different numbers of utilized features. Thus, the best combination of two features (4 and 5) yielded a correct classification of 77.25%. The highest correct classification value (85.90%) overall was obtained with a combination of 12 features. Since the correct classification rates observed by using 10–13 features varied within the range of one standard deviation (data not shown), we decided to employ the 13 features in all subsequent analyses.

Table 2
Feature space for morphological characteristics of Eimeria spp. of domestic fowl

Type       ID  Feature name
Curvature  1   Mean of curvature
           2   Standard deviation of curvature
           3   Entropy of curvature
Geometry   4   Major axis
           5   Minor axis
           6   Symmetry through major axis
           7   Symmetry through minor axis
           8   Area
           9   Entropy of oocyst content
Texture    10  Angular second moment
           11  Contrast
           12  Inverse difference moment
           13  Entropy

The 13D space is divided into three types of features: curvature, geometry, and texture.

5.2. Evaluation of the size of the training set

In order to estimate the minimum number of samples required for the training set that is still able to yield an acceptable rate of correct classification, we conducted a series of experiments. Considering that the number of samples for each species was not the same, we randomly extracted 320 elements from each class. In this regard, a total of 2240 oocyst samples of the seven Eimeria species were used. For each species, the corresponding data set was randomly divided into two groups, the training set and the test set, in relative proportions varying from 95%:5% to 5%:95%, respectively, using intervals defined by integers. In addition, for each proportion, the number of random partitions was 100. The average of the diagonal of the resultant confusion matrix, obtained for each of these 100 paired sets, was considered as the final score of correct classification for that particular proportion of training:test sets. This approach was repeated for the different training:test set percentages using Algorithm 1. As can be seen in Fig. 4, there is a clear correlation between the size of the training set and the overall accuracy of the classification. For a data set size of 2240 images, a good compromise between training set size and accuracy was attained with circa 30% of the images. Considering that the data set is constituted by 2240 samples from the seven distinct Eimeria species, we conclude that a minimum acceptable size for the training set would be 96 images for each species, comprising a total of 672 samples (30% of the data set). In fact, using distinct smaller data sets, we confirmed that this absolute number of oocyst images per species was adequate for training purposes (data not shown).

5.3. Analysis of species differentiation

Species differentiation experiments were performed with a data set of 3891 oocyst images, comprising multiple strains of the different Eimeria species that infect the domestic fowl.
The complete list of the strains and species utilized in this work is presented in Table 1. From the overall data set, we used 30% of the images for the training set, and 70% for the test set. A total of 100 paired sets were randomly generated and each one was used as an input for the classification process (see Algorithm 1), which in turn generated a confusion matrix as a result of species discrimination. Therefore, at the end of the recursive process, we generated 100 confusion matrices which were used to compute the average confusion matrix. This latter matrix contained the mean of correct classification for all tested species. Finally, by computing the diagonal average, we obtained the overall percentage of correct classification of the system.
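The evaluation loop described above (random partition, classification, confusion matrix, averaging over runs, diagonal mean) can be sketched as follows. This is only a minimal illustration under stated assumptions: Algorithm 1 itself is not reproduced here, the function names are hypothetical, the data are synthetic, and a nearest-class-mean rule stands in for the Bayesian classifier used in the paper.

```python
import random
from statistics import mean


def split_per_class(samples_by_class, train_fraction, rng):
    """Randomly split every class into a training list and a test list."""
    train, test = {}, {}
    for label, samples in samples_by_class.items():
        shuffled = samples[:]
        rng.shuffle(shuffled)
        n_train = round(train_fraction * len(shuffled))
        train[label] = shuffled[:n_train]
        test[label] = shuffled[n_train:]
    return train, test


def nearest_mean_classifier(train):
    """Stand-in classifier (an assumption, NOT the paper's Bayesian classifier)."""
    centroids = {
        label: [mean(column) for column in zip(*vectors)]
        for label, vectors in train.items()
    }

    def predict(x):
        # Assign x to the class whose centroid is closest (squared distance).
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return predict


def confusion_matrix(predict, test, labels):
    """Row-normalised confusion matrix in percent (rows: true class)."""
    matrix = []
    for true_label in labels:
        counts = {label: 0 for label in labels}
        for x in test[true_label]:
            counts[predict(x)] += 1
        total = len(test[true_label])
        matrix.append([100.0 * counts[label] / total for label in labels])
    return matrix


def average_accuracy(samples_by_class, train_fraction=0.30, runs=100, seed=0):
    """Average the confusion matrix over random partitions; return it and its diagonal mean."""
    rng = random.Random(seed)
    labels = sorted(samples_by_class)
    n = len(labels)
    total = [[0.0] * n for _ in range(n)]
    for _ in range(runs):
        train, test = split_per_class(samples_by_class, train_fraction, rng)
        cm = confusion_matrix(nearest_mean_classifier(train), test, labels)
        for i in range(n):
            for j in range(n):
                total[i][j] += cm[i][j]
    avg = [[value / runs for value in row] for row in total]
    overall = mean(avg[i][i] for i in range(n))
    return avg, overall


# Synthetic two-class, two-feature data, for illustration only.
demo_rng = random.Random(1)
demo = {
    "A": [[demo_rng.gauss(0.0, 1.0), demo_rng.gauss(0.0, 1.0)] for _ in range(40)],
    "B": [[demo_rng.gauss(4.0, 1.0), demo_rng.gauss(4.0, 1.0)] for _ in range(40)],
}
avg_matrix, overall = average_accuracy(demo, train_fraction=0.30, runs=20)
```

Calling `average_accuracy` for a range of `train_fraction` values would correspond to the training-set-size experiment, where 30% of 320 images gives 96 training images per species. As a check of the diagonal average on the reported values, the diagonal of Table 4 gives (83.83 + 99.21 + 95.04 + 92.51 + 74.23 + 80.51 + 74.90) / 7 ≈ 85.75%, which matches the overall rate stated in the text.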


Table 3
Feature selection using the SFS test

# of features    Rate (%)
2                77.25
3                79.90
4                81.02
5                82.45
6                83.89
7                85.04
8                85.64
9                85.63
10               85.75
11               85.73
12               85.90
13               85.75

The best combination of features and the resulting correct species classification values are presented. (In the original table, × marks under features 1 to 13, grouped as curvature, geometry, and texture, indicate the features selected in each row; the positions of these marks are not reproduced here.)
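To make the selection procedure behind Table 3 concrete, the following sketch shows a generic greedy forward selection, assuming that SFS denotes sequential forward selection. The scoring function is a hypothetical placeholder; in the setting of this paper it would be the correct classification rate obtained with a given feature subset, which is not reimplemented here.

```python
from typing import Callable, FrozenSet, List, Tuple


def sequential_forward_selection(
    n_features: int,
    score: Callable[[FrozenSet[int]], float],
) -> List[Tuple[FrozenSet[int], float]]:
    """Greedy SFS: start from the empty set and repeatedly add the
    single feature that yields the best score for the enlarged subset."""
    selected: FrozenSet[int] = frozenset()
    history = []
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        best = max(candidates, key=lambda f: score(selected | {f}))
        selected = selected | {best}
        history.append((selected, score(selected)))
    return history


# Hypothetical additive score, used only to exercise the function.
weights = [0.5, 3.0, 1.0, 2.0]


def toy_score(subset):
    return sum(weights[f] for f in subset)


steps = sequential_forward_selection(4, toy_score)
order = [sorted(subset) for subset, _ in steps]
# order == [[1], [1, 3], [1, 2, 3], [0, 1, 2, 3]]
```

The returned history plays the role of the rows of Table 3: one subset and one score per subset size. With a real classification rate as the score, the values need not increase monotonically; in Table 3 the highest rate (85.90%) is reported for 12 features, slightly above the 85.75% obtained with all 13.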

Fig. 4. Effect of the size of the training set on the classification accuracy. A total of 2240 images were used for the evaluation. The size of the training set is represented as percentages relative to the whole data set. The absolute number of images is also presented (in parentheses).

The overall percentage of correct species assignment observed was 85.75%. Table 4 presents the final confusion matrix, where we can clearly see that the best classification was obtained for E. maxima (99.21%). Conversely, E. praecox and E. necatrix presented the worst results, with correct discrimination rates of 74.23% and 74.90%, respectively. These results were due to cross-classification with other Eimeria species. Thus, E. necatrix was incorrectly classified as E. acervulina (6.10%) and E. tenella (9.94%). Similarly, some other Eimeria species were also incorrectly classified as E. necatrix (E. acervulina in 12.53%, E. praecox in 10.94% and E. tenella in 12.22%). These results show that E. necatrix and E. praecox are the most difficult species to differentiate, owing to their morphological similarity to each other and to other species. This is in agreement with what is classically reported by personnel involved with visual inspection and classification of Eimeria field samples.

Table 4
Confusion matrix of species differentiation of Eimeria spp. of domestic fowl

Species         Oocyst    Ascribed species (%)
                number    E. ace   E. max   E. bru   E. mit   E. pra   E. ten   E. nec
E. acervulina   636       83.83    0.01     0.00     1.26     0.29     2.07     12.53
E. maxima       321       0.00     99.21    0.79     0.00     0.00     0.00     0.00
E. brunetti     418       0.00     0.31     95.04    0.00     0.91     3.19     0.56
E. mitis        757       0.99     0.00     0.00     92.51    2.52     0.24     3.75
E. praecox      747       0.19     0.00     2.97     6.08     74.23    5.59     10.94
E. tenella      608       0.65     0.00     1.98     0.41     4.24     80.51    12.22
E. necatrix     404       6.10     0.00     0.53     3.98     4.55     9.94     74.90

5.4. A real-time diagnosis system

As a proof-of-principle that our approach could be applied for the automatic morphological discrimination of Eimeria species, we developed COCCIMORPH, a real-time system accessible through a web interface (available at http://puma.icb.usp.br/coccimorph). COCCIMORPH allows the user to upload an image, detect the contour interactively and obtain a real-time classification. Fig. 5 shows the framework of this system, which is divided into the following three levels:

• Database: This level stores the feature vectors that compose the data set. Micrographs and isolated images are also stored and can be visualized through a web interface.
• Application: This is the developmental level of the system, which is divided into three modules: import subsystem, analysis subsystem and application and web server.
• Client: This level is oriented to interact with the end-user, allowing for the visualization and uploading of images for diagnostic purposes.

Fig. 5. Framework of the real-time system for automatic diagnosis of Eimeria species.

The analysis subsystem represents the kernel of the system and is responsible for the image pre-processing, feature extraction and pattern classification. This module was entirely developed in C++, resulting in a rapid response of the system during the image processing step, thus permitting real-time processing through the web.

Considering that different users have distinct setups of microscopes and digital cameras, the magnification and resolution of the captured images can vary significantly from those used in this work. In order to normalize the image scale, the user must first determine the number of pixels/µm of the captured image. This can be done simply by using a calibrated microscope scale, such as those imprinted on specialized measuring slides. Alternatively, hemocytometer counting chambers, commonly used in many laboratories, can also be employed. Once a picture of the scale is obtained, the custom spatial resolution, expressed as the number of pixels/µm, can be easily determined using any image processing program (e.g. Gimp, Adobe Photoshop, etc.). Provided that the user obtains all subsequent images under the same conditions, this step needs to be performed only once. COCCIMORPH’s interface presents a “pixel/micrometer” text box where the user can enter custom values of resolution. The system will then automatically normalize the resolution relative to the images of the database.

5.5. The Eimeria image database

A particularly helpful support for this work has been the ample availability of biological samples. Thus, we constructed a comprehensive database of oocyst micrographs, including parasite strains isolated from different regions of the world. This repository was made publicly available as the “Eimeria Image Database” through a link on the COCCIMORPH site.
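The scale calibration described for COCCIMORPH in Section 5.4 amounts to simple arithmetic, sketched below. This is not COCCIMORPH's code: the function names and the reference resolution of 5.0 pixels/µm are hypothetical, chosen only to illustrate how a user-measured resolution could be related to the resolution of the database images.

```python
def pixels_per_micrometer(scale_length_px: float, scale_length_um: float) -> float:
    """Spatial resolution from a photographed calibration scale:
    the number of pixels spanned by a known length in micrometers."""
    if scale_length_px <= 0 or scale_length_um <= 0:
        raise ValueError("lengths must be positive")
    return scale_length_px / scale_length_um


def rescale_factor(user_px_per_um: float, reference_px_per_um: float) -> float:
    """Factor by which a user image would be resized so that its
    resolution matches the reference (database) resolution."""
    return reference_px_per_um / user_px_per_um


def length_in_micrometers(length_px: float, px_per_um: float) -> float:
    """Convert a measurement in pixels to micrometers."""
    return length_px / px_per_um


# A 100 µm segment of the calibration slide spans 250 pixels in the user's image.
user_resolution = pixels_per_micrometer(250.0, 100.0)     # 2.5 pixels/µm

# Hypothetical database resolution, assumed for illustration only.
REFERENCE_PX_PER_UM = 5.0
factor = rescale_factor(user_resolution, REFERENCE_PX_PER_UM)   # 2.0

# An oocyst major axis measured as 50 pixels in the user's image.
major_axis_um = length_in_micrometers(50.0, user_resolution)    # 20.0 µm
```

Since the resolution depends only on the microscope and camera setup, the value computed once from the calibration picture can be reused for all images acquired under the same conditions.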

6. Discussion

In this paper we report the development of an effective pattern recognition approach for shape characterization and automatic discrimination of different species of the protozoan parasite Eimeria spp. We propose the use of a set of features comprising three categories: (a) curvature, (b) geometry, and (c) texture. These features are extracted automatically and used to compose a 13D feature vector. The system was developed and standardized using microscopic images taken from pure samples of each of the parasite species. A large number of images, comprising in total 3891 oocyst micrographs, was used to reduce the effect of shape heterogeneity. In addition, whenever available, we used several samples of each species, collected from different geographic sources, in order to dilute possible intra-specific variations and maximize inter-specific discrimination. Other sources of data variability were also assessed, including differences in microscope illumination and contrast, as well as the volume of the parasite suspension between the slide and coverslip. Finally, we used a relatively high number of features, which were submitted to a feature selection process to evaluate how many and which of them would compose the most discriminative set.

The approach described here is simple and permits a reliable identification of the parasite species. Features are not limited to the simplest and most traditional geometric measures, as we also computed curvature to represent the form, and texture for internal structure characterization. Considering that this diagnostic system is based on morphology, the correct species assignment rate obtained (85.75%) can be considered a very good result, especially if compared to a subjective human diagnosis. Furthermore, despite the complexity of the algorithms for feature extraction, the current implementation is computationally efficient, permitting a rapid and real-time interaction of the end-user through a web interface. Finally, because the system uses generic algorithms, it can be easily extended to discriminate other organisms. For this task, the user just needs to provide a new image data set and use it to train the system to discriminate the different classes. In fact, a preliminary study, including 11 Eimeria species that infect the domestic rabbit, showed a similar discriminative performance (data not shown).

Previous studies using digital image processing applied to Eimeria [7–9] have been reported in the literature. These systems, however, were restricted to a semiautomatic oocyst diameter measurement and still required a strong human interaction during processing. In addition, most studies employed a small number of morphological characters.
Thus, some works used as features the oocyst diameters [7,9], whereas others used the Fourier transform of the contour [15] or computed statistics from it [17]. Another general limitation was related to the classification method, where the multidimensional data distribution has not been considered. Sommer [15] used Euclidean distance as a metric for clustering. This metric assumes that the data are homogeneously distributed, which is not necessarily the case, especially when multidimensional data are used. Yang et al. [17], working with human helminth eggs, used four morphometric features and two stages of ANNs. These ANNs were used for the discrimination of eggs from artifacts, and for species discrimination, respectively. However, the estimation of the average correct classification ratio was based on a very small image data set, and the possible influence of intra-specific variability was not assessed by the authors.

We also preliminarily considered alternative classification methodologies, such as SVM [47,48]. More specifically, we compared the performance of the Bayesian classifier and SVM considering situations involving seven Eimeria categories and 13 features. Because the obtained results did not indicate superior performance of the SVM methodology (actually, slightly better results were achieved for the Bayesian classifier), we decided to adopt the Bayesian methodology. An additional reason motivating this choice is that the Bayesian classifier is considerably simpler for on-line and interactive implementations of the system.

Several possible applications of our system can be foreseen in the near future. Initially, the large image data set of Eimeria oocysts was made publicly available as the Eimeria Image Database. The database now also includes circa 2500 images of 11 Eimeria species that infect the domestic rabbit. Since new parasite images can be added to this database in the future, it may represent an invaluable resource for classical parasitologists and also for teaching purposes. From the computational standpoint, it represents a novel repository of parasite image data, useful for experimental protocols involving pattern recognition methods. As such, new algorithms could be tested using this data set as a gold standard of validated biological samples.

In addition to the image database, the precise morphometric data of the different Eimeria species provides a unique opportunity to revisit the classic size estimations [10]. As such, we intend to provide new parasite identification charts where morphometric data will be presented in the light of current microscope optics and digital image technology. This kind of data will certainly be of high value to the Eimeria scientific community, as well as to researchers in pattern recognition, who may use this repository to test new measurement and classification methodologies.

An envisaged application of the shape characterization methodology described here is the implementation of a real-time diagnostic tool through a web interface. In this direction, we have created an experimental front-end for public access. Since diagnosis is performed in real time, there is almost no delay between the sample query and the final diagnostic result. We foresee that such a system would allow for a reliable diagnosis with no need of biological sample transportation between the farms and the reference laboratory.
This represents a particularly important achievement, since live sample traffic may represent a sanitary risk due to the potential for disease dissemination. Also, compared to other diagnostic approaches, our system does not require personnel trained in parasite identification or molecular biology techniques. The incorporation of other parasites into the system may further increase the scope of applicability of this electronic diagnostic tool. Coccidian protozoa and helminth eggs, by presenting a morphology similar to Eimeria oocysts, are the obvious candidates to be included in the near future. With the current decreasing prices of high-resolution (above four megapixels) digital cameras, our system is relatively cheap. In fact, any reasonable microscope with a digital photo documentation system (a camera and an adapter tube) would represent the minimum apparatus for such methodology.

Another aspect where shape characterization may have an interesting impact is phylogenetic analysis. Classic phylogenetics used to rely on morphometric data, but since DNA sequencing became a mainstream and relatively cheap technique, most current inferences are now based on molecular data. Because our morphological features have a quantitative representation, they can be discretized and converted into data matrices amenable to phylogenetic methods. Phylogenetic inference
of the genus Eimeria has been reported using the ribosomal 18S sequence [49]. Our group has recently characterized the complete mitochondrial genome of the seven chicken Eimeria species (Romano et al., manuscript in preparation) and used this data set to reconstruct the phylogeny of this group. Preliminary results show a good agreement between inferences based on these molecular markers and the morphological features described in this work. Thus, morphometric data applied to phylogenetic inference may provide an interesting counterpart to molecular-based phylogenies, with potentially exciting evolutionary implications.

7. Conclusions

In this paper, an effective shape characterization approach for automatic species differentiation in Eimeria spp. is proposed. The extracted features identify different morphological properties of the oocysts, related to the characterization of form, geometry and internal structure. This shape representation was applied for the differentiation of the seven Eimeria species of domestic fowl, and the results revealed a good reliability of the feature set. Finally, a real-time diagnosis system was implemented and made available for the scientific community. We believe that our system demonstrates the feasibility of using computer-assisted systems to provide an interesting alternative for the rapid diagnosis of parasites.

Acknowledgments

Luciano da F. Costa (308231/03-1) and Arthur Gruber (306793/2004-0) are grateful to CNPq for financial support. César A.B. Castañón received a fellowship from CAPES and the work presented herein formed part of his Ph.D. Thesis. Jane S. Fraga and Sandra Fernandez received fellowships from CNPq and FAPESP, respectively.

References

[1] D. Comaniciu, P. Meer, D. Foran, Image-guided decision support system for pathology, Mach. Vision Appl. 11 (4) (1999) 213–224.
[2] D. Sabino, L. Costa, E. Rizzatti, M. Zago, A texture approach to leukocyte recognition, Real-Time Imaging 10 (4) (2004) 205–216.
[3] A. Jalba, M. Wilkinson, J. Roerdink, Shape representation and recognition through morphological curvature scale spaces, IEEE Trans. Image Process. 15 (2) (2006) 331–341.
[4] S. Trattner, H. Greenspan, G. Tepper, S. Abboud, Automatic identification of bacterial types using statistical imaging methods, IEEE Trans. Med. Imaging 23 (7) (2004) 807–820.
[5] X. Long, W. Cleveland, Y. Yao, Effective automatic recognition of cultured cells in bright field images using Fisher’s linear discriminant preprocessing, Image Vision Comput. 23 (13) (2005) 1203–1213.
[6] M. Sampat, A. Bovik, J. Aggarwal, K. Castleman, Supervised parametric and non-parametric classification of chromosome images, Pattern Recognition 38 (8) (2005) 1209–1223.
[7] J. Kucera, M. Reznicky, Differentiation of species of Eimeria from the fowl using a computerized image-analysis system, Folia Parasitol. (Praha) 38 (2) (1991) 107–113.
[8] A. Daugschies, S. Imarom, W. Bollwahn, Differentiation of porcine Eimeria spp. by morphologic algorithms, Vet. Parasitol. 81 (3) (1999) 201–210.
[9] A. Plitt, S. Imarom, A. Joachim, A. Daugschies, Interactive classification of porcine Eimeria spp. by computer-assisted image analysis, Vet. Parasitol. 86 (2) (1999) 105–112.
[10] P.L. Long, B.J. Millard, L.P. Joyner, C.C. Norton, A guide to laboratory techniques used in the study and diagnosis of avian coccidiosis, Folia Vet. Lat. 6 (3) (1976) 201–217.
[11] B.E. Schnitzler, P.L. Thebo, J.G. Mattsson, F.M. Tomley, M.W. Shirley, Development of a diagnostic PCR assay for the detection and discrimination of four pathogenic Eimeria species of the chicken, Avian Pathol. 27 (5) (1998) 490–497.
[12] B.E. Schnitzler, P.L. Thebo, F.M. Tomley, A. Uggla, M.W. Shirley, PCR identification of chicken Eimeria: a simplified read-out, Avian Pathol. 28 (1) (1999) 89–93.
[13] S. Fernandez, A.H. Pagotto, M.M. Furtado, A.M. Katsuyama, A.M. Madeira, A. Gruber, A multiplex PCR assay for the simultaneous detection and discrimination of the seven Eimeria species that infect domestic fowl, Parasitology 127 (4) (2003) 317–325.
[14] A. Joachim, N. Dulmer, A. Daugschies, Differentiation of two Oesophagostomum spp. from pigs, O. dentatum and O. quadrispinulatum, by computer-assisted image analysis of fourth-stage larvae, Parasitol. Int. 48 (1) (1999) 63–71.
[15] C. Sommer, Quantitative characterization, classification and reconstruction of oocyst shapes of Eimeria species from cattle, Parasitology 116 (1) (1998) 21–28.
[16] C. Sommer, Quantitative characterization of texture used for identification of eggs of bovine parasitic nematodes, J. Helminthol. 72 (2) (1998) 179–182.
[17] Y. Yang, D. Park, H. Kim, M. Choi, J. Chai, Automatic identification of human helminth eggs on microscopic fecal specimens using digital image processing and an artificial neural network, IEEE Trans. Biomed. Eng. 48 (6) (2001) 718–730.
[18] K.W. Widmer, K.H. Oshima, S.D. Pillai, Identification of Cryptosporidium parvum oocysts by an artificial neural network approach, Appl. Environ. Microbiol. 68 (3) (2002) 1115–1121.
[19] O. Bruno, R. Cesar Jr., L. Consularo, L. Costa, Automatic feature selection for biological shape classification in SYNERGOS, in: Proceedings of the SIBGRAPI’98, International Symposium on Computer Graphics, Image Processing, and Vision, 1998, pp. 363–370.
[20] R. Coelho, V.D. Gesù, G.L. Bosco, J. Tanaka, C. Valenti, Shape-based features for cat ganglion retinal cells classification, Real-Time Imaging 8 (3) (2002) 213–226.
[21] L. Costa, S. dos Reis, R. Arantes, A. Alves, G. Mutinari, Biological shape analysis by digital curvature, Pattern Recognition 37 (3) (2004) 515–524.
[22] D. Regan, Human Perception of Objects, York University, New York, 2000.
[23] B. Olshausen, D. Field, Vision and the coding of natural images, Am. Sci. 88 (3) (2000) 238–245.
[24] D. Zhang, G. Lu, Review of shape representation and description techniques, Pattern Recognition 37 (1) (2004) 1–19.
[25] L. Costa, R. Cesar Jr., Shape Analysis and Classification: Theory and Practice, CRC Press, Boca Raton, FL, 2000.
[26] R. Gonzalez, R. Woods, Digital Image Processing, Addison-Wesley, Reading, MA, 1993.
[27] F. Attneave, Some informational aspects of visual perception, Psychol. Rev. 61 (3) (1954) 183–193.
[28] L. Riggs, Curvature as a feature of pattern vision, Science 181 (4104) (1973) 1070–1072.
[29] F. Mokhtarian, A. Mackworth, A theory of multiscale, curvature-based shape representation for planar curves, IEEE Trans. Pattern Anal. Mach. Intell. 14 (8) (1992) 789–805.
[30] R. Cesar Jr., L. Costa, Towards effective planar shape representation with multiscale digital curvature analysis based on signal processing techniques, Pattern Recognition 29 (9) (1996) 1559–1569.
[31] H. Weyl, Symmetry, Princeton University Press, New Jersey, 1980.
[32] H. Zabrodsky, S. Peleg, D. Avnir, A measure of symmetry based on shape similarity, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR92), 1992, pp. 703–706.
[33] M. Brady, H. Asada, Smoothed local symmetries and their implementation, Technical Report, Cambridge, MA, USA, 1984.
[34] J. Sato, R. Cipolla, Affine integral invariants for extracting symmetry axes, Image Vision Comput. 15 (8) (1997) 627–635.
[35] Y. Bonneh, D. Reisfeld, Y. Yeshurun, Quantification of local symmetry: application to texture discrimination, Spat. Vision 8 (4) (1994) 515–530.
[36] B. Zavidovique, V.D. Gesù, Kernel based symmetry measure, in: ICIAP, 2005, pp. 261–268.
[37] I. Young, J. Walker, J. Bowie, An analysis technique for biological shape I, Inform. Control 25 (4) (1974) 357–370.
[38] M. Tuceryan, A. Jain, Texture analysis, in: C.H. Chen, L.F. Pau, P.S.P. Wang (Eds.), The Handbook of Pattern Recognition and Computer Vision, second ed., World Scientific Publishing Co., Singapore, 1998, pp. 207–247.
[39] R. Haralick, K. Shanmugam, I. Dinstein, Textural features for image classification, IEEE Trans. Systems Man Cybern. SMC-3 (6) (1973) 610–621.
[40] R. Jobanputra, D. Clausi, Preserving boundaries for image texture segmentation using grey level co-occurring probabilities, Pattern Recognition 39 (2) (2006) 234–245.
[41] R. Conners, Towards a set of statistical features which measure visually perceivable qualities of texture, in: Proceedings of Pattern Recognition Image Processing Conference, 1979, pp. 382–390.
[42] S. Theodoridis, K. Koutroumbas, Pattern Recognition, Academic Press, San Diego, 1998.
[43] R. Duda, P. Hart, D. Stork, Pattern Classification, Wiley, New York, 2001.
[44] P.M. Narendra, K. Fukunaga, A branch and bound algorithm for feature subset selection, IEEE Trans. Comput. 26 (9) (1977) 917–922.
[45] A. Jain, R. Duin, J. Mao, Statistical pattern recognition: a review, IEEE Trans. Pattern Anal. Mach. Intell. 22 (1) (2000) 4–37.
[46] A. Jain, D. Zongker, Feature selection: evaluation, application, and small sample performance, IEEE Trans. Pattern Anal. Mach. Intell. 19 (2) (1997) 153–158.
[47] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, 2000.
[48] K. Crammer, Y. Singer, On the algorithmic implementation of multiclass kernel-based vector machines, J. Mach. Learn. Res. 2 (5) (2001) 265–292.
[49] J.R. Barta, D.S. Martin, P.A. Liberator, M. Dashkevicz, J.W. Anderson, S.D. Feighner, A. Elbrecht, A. Perkins-Barrow, M.C. Jenkins, H.D. Danforth, M.D. Ruff, H. Profous-Juchelka, Phylogenetic relationships among eight Eimeria species infecting domestic fowl inferred using complete small subunit ribosomal DNA sequences, Parasitology 83 (2) (1997) 262–271.

About the Author: CÉSAR ARMANDO BELTRÁN CASTAÑÓN holds a B.Sc. degree from Universidad Católica de Santa María, Peru, in Systems Engineering, and a M.Sc. in Computer Science from the University of São Paulo (USP), Brazil. He is currently a Ph.D. student in Bioinformatics at the USP. His research interests include 2D image shape analysis, pattern recognition, feature extraction and selection, content-based image retrieval, and computational biology.

About the Author: JANE SILVEIRA FRAGA holds a Veterinary Medicine degree from the University of São Paulo (USP). She is currently finishing her Ph.D. Thesis on the characterization of dsRNA viruses infecting Eimeria spp. of domestic fowl.

About the Author: SANDRA FERNANDEZ holds a Veterinary Medicine degree and a Ph.D. in Parasitology (USP). She is currently heading a research and development group at Laboratório Biovet S/A, a private company that is a major vaccine producer in Brazil.

About the Author: ARTHUR GRUBER holds a Veterinary Medicine degree, and a Ph.D. in Biochemistry (USP). He is currently an Associate Professor at the Department of Parasitology of the USP. His main research interests include molecular biology and genomics of coccidian parasites, and the development of bioinformatics applications for sequence analysis.

About the Author: LUCIANO DA FONTOURA COSTA holds a B.Sc. in Electronic Engineering and Computer Science, a M.Sc. in Applied Physics, and a Ph.D. in Electronic Engineering (King’s College, University of London). He is currently a Full Professor at the USP. His main interests include natural and artificial vision, shape analysis, pattern recognition, computational neuroscience, computational biology, and bioinformatics.