Intelligent classification of web pages using contextual and visual features


Applied Soft Computing 11 (2011) 1638–1647

Contents lists available at ScienceDirect

Applied Soft Computing journal homepage: www.elsevier.com/locate/asoc

Ali Ahmadi a,∗, Mehran Fotouhi b, Mahmoud Khaleghi c

a Electrical & Computer College, Khajeh Nasir Toosi University of Technology, Shariati St., Seyedkhandan, Tehran, Iran
b Computer Department, Sharif University of Technology, Tehran, Iran
c Iranian Telecommunication Research Center, Tehran, Iran

Article info

Article history:
Received 25 May 2009
Received in revised form 26 January 2010
Accepted 1 May 2010
Available online 12 May 2010

Keywords: Web-pages classification; Content based filtering; Porn image detection; Skin color detection; Adult image detection

Abstract

In this paper we address the classification of Web content and, in particular, its application to the detection of pornographic Web pages. Filtering of undesirable Web content is mainly achieved either by blocking a specific Web address, found by searching a reference list of black URLs, or by a plain contextual analysis that searches the page text for special keywords. The main problems with current filtering methods are the need to update the URL list constantly and the high rate of over-blocking ordinary pages. We propose an intelligent approach based on textual, profile, and visual features in a hierarchical classifier structure. Textual features contain information about keywords, black words, etc., and profile features contain structural information such as the number of links, meta-tags, pictures, etc. As visual features we employ a set of global and local indicative features, including topological and shape-based characteristics, extracted from the skin region. The algorithm was trained on a dataset of 1295 Web pages, comprising 700 porn pages (containing text, images, or both) in English and Persian and 595 non-porn pages on topics such as medicine, health, and sports. On a test dataset of 290 Web pages, an accuracy rate of 95% was obtained. © 2010 Elsevier B.V. All rights reserved.

∗ Corresponding author. Tel.: +98 21 22361217; fax: +98 21 22361217. E-mail address: [email protected] (A. Ahmadi). 1568-4946/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.asoc.2010.05.003

1. Introduction

With the ever-growing Web, Websites with objectionable content such as pornography, violence, and racism have increased rapidly in recent years. Among offensive content, pornography is the most harmful, affecting children's safety and causing many destructive side effects. According to a recent survey, one in four kids reported at least one unwanted exposure to sexually explicit pictures, and one in five reported receiving a sexual solicitation [23]. Various efforts have recently been made to block pornographic Websites, among which content-based filtering is the most effective [5,6]. Many software packages have also been developed [2]; they mainly employ two kinds of approaches for classifying Web pages: static filtering and dynamic filtering.

Static filtering blocks a specific Web address by searching for it in a reference list of black URLs. Although this method is fast, its shortcoming is that the URL list must be updated constantly, which is a very hard task on the rapidly changing Web. Another problem is the high rate of over-blocking ordinary pages, such as pages on medical, sports, or arts topics, or blocking a whole Website because of a single immoral page on it.

In dynamic filtering, classification is based on content analysis. First, the content of the page is analyzed using intelligent algorithms such as learning models and data mining methods, and then the page is classified based on its content features. Classification accuracy is higher with this approach, but the classification process carries a computational overhead that causes problems in online applications.

A key point in content-based filtering is that images are an essential part of Web pages, particularly adult pages. A study of more than four million Web pages reveals that 70% of them contain images and that there are, on average, 18.8 images per Web page [13]. A statistical analysis of 1232 pornographic and 6967 non-pornographic Web pages shows that 72% of pornographic pages have more than 5 images and 60% of them have more than 10 images [14]. In addition, 40% of pornographic Web pages have more than 5 links to image and video files. Therefore, any effective Website classification system should take the visual content into account and provide a method for detecting the characteristics of the images within the page.

Through a survey of the main works in the literature, we found that a heuristic combination of content features together with a hybrid classifier structure can significantly enhance filtering performance. In this paper, we propose an intelligent approach based on three sorts of features (textual, profile, and visual) in a hierarchical classifier structure, and test its performance on a representative dataset of Web pages. As visual features we employ a set of global and local indicative features, including topological and shape-based characteristics, extracted from the skin region. Artificial neural networks are used as learning models for both skin detection and final image classification.

2. Related works

According to a survey of works on content-based filtering, there are two main approaches to Web classification: classification based on textual content features, and classification based on both visual and textual features. The first group uses textual analysis, mainly by searching the text for a list of indicative keywords. In the second group, textual content-based analysis is used together with visual features to obtain a more robust classification. Visual features are extracted from the images in the Web pages using image processing techniques. For instance, skin-area detection [3,4,8], detection of ROIs¹ in the human body [9,11], and the image wavelet transform [12] are some of the approaches already proposed. Here, we point out some of the main works in the field.

Hammami et al. [3] proposed an approach based on extracting both contextual and visual features. In their work, 20 textual and profile features are extracted along with some visual features obtained from the skin area in the images. It is asserted that the accuracy of the system improves significantly when a hierarchical combination of contextual and visual features is used. However, the method for combining the features and the classifiers is not clearly described. Using structural features of the input page as well as various kinds of textual features is one advantage of their work.
The drawback of their work is that no effectiveness or correlation analysis is applied to the extracted features. Moreover, classification based on visual features is not highly accurate, because only the proportion of skin pixels to total pixels is used as the main feature.

In [27], Hu et al. first use a C4.5 tree to categorize input pages into three classes: continuous texts, distinct texts, and image pages. A CNN² is employed to find the semantic relations within continuous text, and a naïve Bayesian algorithm is used to recognize distinct texts. The classification results based on textual and visual features are finally combined by a Bayesian algorithm. The system was tested on 1000 pages on different subjects, and an average classification rate of 91.6% was obtained. The main advantage of their work is the use of semantic analysis for classifying textual content. However, the sequential classification procedure seems to accumulate errors through the classification steps. Moreover, the multi-dimensional histogram method they use for extracting skin features is a global operator, which seems unsuitable because of its high time complexity and low accuracy in detecting quasi-skin areas.

Chen et al. [9] proposed a statistical approach that combines textual and visual content. Their work consists of three steps: (1) classification based on discrete text (keywords), (2) classification based on continuous text (sentences), and (3) classification through images. Besides methods such as skin color detection, they also use other ROI-based features for image classification. Finally, a fusion classifier combining k-NN³ for text classification and the Yang method for image classification is introduced, and a 91.8% classification rate is obtained over 1500 sample pages. Using only URLs and keywords instead of a content-based analysis, the small size of the test dataset, and the relatively low accuracy rate are some shortcomings of their work.

¹ Region of interest.
² Cellular neural network.
³ k nearest neighbors.

Lee et al. [10,24] used an artificial neural network to classify Web pages based on textual content. The textual features comprise the page title, the visible part of the text, the metadata for page description and keywords, and the tooltips of the images. A pre-processing stage converts the features into an input vector for the neural network classifier. The system was trained on a dataset of 3777 non-porn pages and 1009 porn pages (labeled manually), and tested on a database of 535 porn sites and 523 non-porn sites. The reported accuracy is 95%. The advantage of the method is its ability to recognize bilingual pages (Chinese and English); its shortcoming is its inability to classify gallery pages or pages with many images.

As for visual features, the first step in almost all existing approaches is skin detection. In [15], the authors show that there is a strong correlation between the percentage of skin and the likelihood of pornographic content in an image. Pixel-based and region-based algorithms are the two main approaches to skin detection. In pixel-based methods, the color of a pixel is the feature, while in region-based methods classification is based on the spatial information of pixels. However, because of factors such as individual characteristics (e.g., race, age, body part) or the effect of illumination on skin appearance, the detection result may be unreliable. To overcome these problems, a number of approaches based on different color spaces have been proposed. In [15], a histogram method with Gaussian mixture models is proposed for skin detection. Forsyth and Fleck [16] used texture information in a log-opponent color space to segment skin regions. Zheng et al. [17] proposed a statistical skin detection method based on a maximum entropy model. In [18], region-based skin detection is proposed: color and texture features are extracted from arbitrarily shaped segmented regions and classified by Gaussian mixture models. Skin detection based on cellular learning automata is proposed in [19]: a skin probability map is extracted from texture information and then fed to the CLA⁴ to decide on skin-like regions. Explicit rules in the YCbCr color space are used in [20] for skin detection. In [21], a skin color distribution model based on the RGB, normalized RGB, and HSV color spaces is constructed using correlation and linear regression between the components.

Girgis et al. [8] presented a system for extracting images from Websites and detecting skin areas. This system, called BHO (browser helper object), was an IE⁵-accessible object that ran in the background of IE and could extract all images and URL links in a page. Two skin detection techniques based on the YUV and RGB color spaces were introduced, and it was concluded that YUV is the best space for the skin detection algorithm. However, the proposed skin color detection method does not seem to be robust.

Bosson et al. [7] proposed a method based on visual content features. In the first stage they use a skin filter to localize skin pixels and generate a skin blob. Then topological features, such as the area and the lengths of the major and minor axes of an ellipse fitted to the blob, are used to classify the page. The MLP⁶ gave the lowest misclassification rate compared with the four other methods applied. To evaluate the algorithm, a dataset of 10,005 images, hand-classified into five categories, is used, and a classification rate of 87.2% is reported. The advantage of their work is the optimal classifier design, but the features used for skin detection are not described clearly, and no robust method is introduced for evaluating the visual features.

⁴ Cellular learning automata.
⁵ Internet Explorer.
⁶ Multi-layer perceptron.

Different approaches have been applied to image classification based on skin detection results. A straightforward nudity detection algorithm is proposed in [21], based on simple rules defined on the percentage of skin area in the image. A number of researchers have introduced additional heuristic features. In [16], the authors use a body geometric filter to find human structures such as limbs in the skin regions. In [17], elliptical features are extracted from the image and an MLP is used to learn them. The authors of [18] use eigen-regions as features and learn them with an MLP. In [20], a detection method based on image retrieval is proposed: color, texture, and a set of shape features are used to retrieve the 100 most similar images from an image database containing both adult and non-adult pictures. Wang et al. [12] developed a WIPE⁷-based image retrieval technique that uses a feature vector including Daubechies wavelets, normalized central moments, and a color histogram. Jones and Rehg [15] used five simple features from the output of the skin detector and trained an MLP classifier on them to determine whether a human being is present in the image. Liu et al. [26] applied an image retrieval method to pornography detection: first the human figure is detected in the picture, and then a skin color analysis determines whether the image is porn or not. Arentz and Olstad [25] extracted a set of visual features from the connected skin area, containing information about the color, texture, form, center of gravity, and area of the skin. Their approach concentrates mainly on skin area detection.
A validation dataset consisting of 20 Web pages with around 2000 images is used, and an 89% classification rate is obtained. The advantage of the method is the use of a genetic algorithm to optimize feature selection, which leads to a significant reduction in the misclassification rate. Its weakness is its inability to recognize and classify pages without images.

Each of the above methods has its strengths and weaknesses, such as system overloading, high computational cost, over-blocking, lack of support for all kinds of Web pages, and low accuracy, which will be discussed later.

⁷ Wavelet image pornography elimination.

3. Outline of the proposed algorithm

Fig. 1 illustrates the outline of the proposed algorithm. To overcome the shortcomings of previous works, we use three sets of classifiers, which we call weak classifiers, based on three types of content features. We then combine them in a hierarchical structure to obtain the final robust classifier for pornography detection. As illustrated in the block diagram, the training system is composed of three principal modules: feature extraction, feature vector generation, and hierarchical classification. First, a representative database of Web pages was collected using the WebCrawler software, and the pages were manually classified into porn and non-porn categories. In the feature extraction and feature vector generation steps, three types of analysis are applied to extract the required features: (i) textual content features, as shown in Table 1, (ii) profile content features, as shown in Table 2, and (iii) visual content features, including skin color space and specific object analysis. These features are then used in a learning process to generate the optimized parameters for the classifier as well as the optimized feature vector. Later, in the testing phase, we use this trained classifier to classify input pages into porn and non-porn classes.

Fig. 1. Flowchart of the proposed system for detection of immoral Web pages (training phase).

Table 1
Textual features for a Web page.

Feature     Description
wrd         Number of words in the page
xwrd        Number of black words in the page
pcxwrd      Ratio of black words to total words
nxkywrd     Number of keywords with a black word within them
pcxkywrd    Ratio of black keywords to all keywords
nxdscript   Number of descriptions with a black word within them
ntitwrd     Number of total words in the page title
nxtitwrd    Number of black words in the page title

Table 2
Profile features for a Web page.

Feature     Description
npix        Number of images in the page
nxpix       Number of images with a black word in their names
nlink       Number of links in the page
nxlink      Number of links with a black word in them
nxxlink     Number of page links found in the black URL list
pcxlink     Ratio of black links to total links
nmtkywrd    Number of meta tags with keywords
nvideo      Number of videos in the page
nframe      Number of frames in the page
ncolor      Number of colors used in the page
ndscript    Number of meta tags with description
nwarn       Number of warning tags
nxwarn      Number of warnings with a black word in them
ntooltip    Number of tooltips
nxtooltip   Number of tooltips with a black word in them

We specify three distinct categories for classifying Web pages with regard to immoral content:

• Category 0 containing all permitted pages (i.e., ordinary pages without any pornographic or immoral material).
• Category 1 including immoral but non-porn pages (without dirty words or images harmful to children), or non-porn pages that look like porn ones, such as medical pages.
• Category 2 including porn pages (with dirty words and pornographic images).

The novelty of the proposed method can be summarized as follows:

(1) The input pages are classified into three distinct classes according to their level of immorality, rather than into a single negative or positive class.
(2) Three different lists of characteristic words, corresponding to the three output classes, are used to extract the word-based features in the main text, the title, tooltips, etc.
(3) A set of local and global features extracted with a new approach is employed to recognize porn pictures. Combined with the skin color feature, which is itself generated with a new method, they form a very effective set of visual features.
(4) A hierarchical structure is used to integrate the classifiers.

4. Feature extraction

Here, we describe the textual, profile, and visual features and how we extract them.

4.1. Textual and profile features

Textual features include all the items listed in Table 1, and the profile features are listed in Table 2. The features are extracted and processed by the WebCrawler software, which was designed and implemented by our team members. The WebCrawler can process a single Web page or a group of pages in online or offline mode and extract their textual and profile features as well as the images within the pages. The URL links within a page can be followed to an unlimited depth. Different settings can also be configured, and a black-word list and a black-URL list can be supplied to the system.

To classify Web pages we need features that are representative of the pages' characteristics. At first, we used density features, calculated by counting the number of keywords used in each of the main attributes of the page, that is, the context, links, title, image names, and tooltips.
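As a rough illustration of such density features, the following Python sketch counts, for each page attribute, how many words belong to each category's word list and divides by the attribute's word count. The attribute keys, the whitespace tokenizer, and the choice of the per-attribute word count as the denominator are assumptions made for illustration; this is not the WebCrawler implementation.

```python
from typing import Dict, Set

# Hypothetical keys for the five page attributes named in the text.
ATTRIBUTES = ["context", "links", "title", "image_names", "tooltips"]


def density_features(page: Dict[str, str],
                     category_words: Dict[int, Set[str]]) -> Dict[str, float]:
    """Return one density per (category, attribute) pair."""
    features = {}
    for attr in ATTRIBUTES:
        # Naive lowercase whitespace tokenizer (an assumption).
        tokens = page.get(attr, "").lower().split()
        total = len(tokens)
        for cat, words in sorted(category_words.items()):
            hits = sum(1 for tok in tokens if tok in words)
            # Assumption: normalize by the word count of this attribute.
            features[f"cat{cat}_{attr}"] = hits / total if total else 0.0
    return features
```

With three category word lists and five attributes, the function returns 15 values, one per category and attribute pair.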
More precisely, for the context and each of the other page attributes, the number of characteristic words is counted against the keyword list of each category, and the density value of the feature is obtained by dividing that count by the total number of words. These density values are computed for each of the three categories and each of the five attributes mentioned above, giving 15 textual features. Classification based on these features showed good accuracy on many sorts of Web pages. However, for pages concerned with a specific subject and using specific words that belong to a specific category, the system was easily confused. For example, a page discussing children's sexual rights has a high density of immoral words even though it is not an immoral page. The point in such pages is that the number of very specific words is high while the variety of words is low. Therefore, a new feature was generated based on the variety of characteristic words used in the page: the number of distinct words indicative of each category is divided by the total number of distinct words in the page. We call this feature the frequency of category words in the page. With this new feature, classification accuracy increased significantly.

4.2. Visual features

This step extracts two sorts of features: global and local. Our feature extraction algorithm has some similarity with those introduced in [17,22,7], but it differs considerably in practical implementation. In most adult image detection systems, the percentage of skin is used directly as one of the main features. For example, in [7] the area of the face is considered as a feature. This may cause errors on images that contain other objects with skin-like color, or skin objects at different distances from the camera. It also reduces the effect of other features in the detection process. Experimental results in [21] show that very few adult images have a skin percentage below 15%. Based on our experiments, if the largest skin region contains more than 20% of the skin pixels (as a threshold value), this region can be considered the main region for extracting local features. We fit one ellipse, the global ellipse, to all skin regions. A second ellipse, which we call the local ellipse, is fitted to the largest skin region for extracting local features. The parameters of the ellipses are computed using central moments.

4.2.1. Skin detection method

Many methods have been suggested for skin detection, but color-based approaches are widely used because of their high speed and good precision. In this paper, color-based features are used to discriminate skin from non-skin areas. Since color-based features are easy to obtain and robust to orientation and scaling, they make the skin detection system faster and more precise. Their drawback is sensitivity to ambient conditions such as illumination and camera type. The RGB color space is more sensitive to illumination. One way to overcome this problem is to transform RGB to the YCbCr color space, discard the luminance axis Y, and use only the chrominance axes Cb and Cr. In this paper, both color spaces have been investigated.
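The RGB to chrominance step can be sketched as follows. This minimal Python example assumes 8-bit RGB input and the full-range ITU-R BT.601 (JPEG-style) coefficients; the paper does not state which YCbCr variant was used, and the function name is hypothetical.

```python
from typing import Tuple


def rgb_to_cbcr(R: int, G: int, B: int) -> Tuple[float, float]:
    """Map an 8-bit RGB pixel to (Cb, Cr), discarding the luminance Y."""
    # Full-range BT.601 chrominance equations (assumed variant).
    cb = 128.0 - 0.168736 * R - 0.331264 * G + 0.5 * B
    cr = 128.0 + 0.5 * R - 0.418688 * G - 0.081312 * B
    return cb, cr
```

Because the coefficients of each chrominance equation sum to zero, adding the same offset to R, G, and B changes Y but leaves Cb and Cr unchanged, which is one sense in which dropping Y reduces sensitivity to illumination in this sketch.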
The first group of features includes the R, G, and B values of a pixel and of its four neighboring pixels (15 features); the second group includes r, g (normalized R and G), Cb, and Cr of a pixel and of its four neighbors (20 features). We selected the second group because of its better classification performance, which is reported in Section 5.2.

4.2.2. Global features

These features are extracted from the global ellipse. We normalize the ellipse center, major axis length, and minor axis length relative to the image size. The ratio of the minor axis to the major axis is also computed.

4.2.3. Local features

Local features are computed on the largest skin region. We obtain three categories of features for this region. The features of the first category describe the compactness of the largest skin region and its position relative to the other skin regions. These features are:

• The ratio of the minor axis to the major axis of the local ellipse.
• The normalized center of the local ellipse with respect to the image size.
• The eccentricity of the local ellipse.
• The difference between the angles that the major axes of the local ellipse and the global ellipse make with the horizontal axis.
• The ratio of the largest skin region to the area of the local ellipse.
• The ratio of the skin region to the area of the bounding box.
• The ratio of the width of the bounding box to its height.
• The ratio of the largest skin region to all skin regions.

Figs. 2 and 3 show a local bounding box and ellipse on the largest skin region for two different images. It seems clear that these features differ significantly between porn and non-porn images. They are rotation and scale invariant and can be learned by a machine learning algorithm.
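Several of the listed quantities follow directly from the second-order central moments of a binary region. The sketch below uses one common formulation (the ellipse that has the same second moments as the region), which is not necessarily the exact procedure used in the paper; the function name and the representation of a region as a list of (x, y) pixel coordinates are assumptions.

```python
import math
from typing import Dict, List, Tuple


def region_shape_features(pixels: List[Tuple[int, int]]) -> Dict[str, float]:
    """Ellipse and bounding-box features of a region given as (x, y) pixels."""
    n = len(pixels)
    cx = sum(x for x, _ in pixels) / n
    cy = sum(y for _, y in pixels) / n
    # Second-order central moments, normalized by the region area.
    mu20 = sum((x - cx) ** 2 for x, _ in pixels) / n
    mu02 = sum((y - cy) ** 2 for _, y in pixels) / n
    mu11 = sum((x - cx) * (y - cy) for x, y in pixels) / n
    # Eigenvalues of the 2x2 covariance matrix set the axis lengths.
    spread = math.sqrt(4.0 * mu11 ** 2 + (mu20 - mu02) ** 2)
    lam1 = (mu20 + mu02 + spread) / 2.0
    lam2 = (mu20 + mu02 - spread) / 2.0
    major = 4.0 * math.sqrt(lam1)            # full major-axis length
    minor = 4.0 * math.sqrt(max(lam2, 0.0))  # full minor-axis length
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    width = max(xs) - min(xs) + 1
    height = max(ys) - min(ys) + 1
    return {
        "center_x": cx,
        "center_y": cy,
        "axis_ratio": minor / major if major else 1.0,
        "eccentricity": math.sqrt(1.0 - lam2 / lam1) if lam1 else 0.0,
        # Angle of the major axis from the horizontal axis, in radians.
        "angle": 0.5 * math.atan2(2.0 * mu11, mu20 - mu02),
        "ellipse_fill": n / (math.pi * major * minor / 4.0) if minor else 0.0,
        "bbox_fill": n / (width * height),
        "bbox_aspect": width / height,
    }
```

Calling the function once on all skin pixels and once on the largest skin region gives the global and local ellipses; the angle-difference feature is then the difference of the two `angle` values, and the last feature in the list is the ratio of the two pixel counts.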

Fig. 2. A non-porn image with its local fit bounding box and ellipse: (a) original image; (b) local fit bounding box; (c) local fit ellipse.

The second category of features is based on the shape of the largest skin region. Shape-based object detection and recognition is common in image processing. The purpose of a shape descriptor is to characterize the object shape distinctively. A good shape descriptor should be insensitive to noise, decrease the within-class variance, and maximize the between-class variance. Several approaches have been proposed for describing the shape of an object; for instance, chain codes, curvature, and moments are used as good shape descriptors. Here, we use the seven moment invariants defined by Hu [22] to describe shape features. These moments are invariant to translation, scale, and rotation.

The third category consists of textural features computed on the largest skin region. Moments describe the outside of the shape; we also need extra features to describe its inside characteristics. For this, we use a normalized edge direction histogram. The edge direction histogram is provided in MPEG-7 to represent the local edge distribution of an image. It is computed in 6 directions (i.e., 0°, 45°, 90°, 135°, 225°, and 315°). To make this feature scale invariant, the edge directions on the boundary of the largest skin region are omitted. To make it invariant to rotation, we rotate the largest skin region clockwise so that the local ellipse has the same orientation as the horizontal axis. Finally, a feature vector containing 29 global and local features is constructed for the observed image. This feature vector is fed to a classifier in the next stage.

4.2.4. Feature analysis

For an image that contains only one nude object (or porn scene), the global features are equal to the local ones. Some local features, such as the ratio of the skin region to the area of the bounding box or the ratio of the width of the bounding box to its height, are used to detect non-porn images that contain a large skin region. For an adult image that has only one nude object (or porn scene), the values of these features vary within a limited range. The global and local features exploited here are similar to the features in [18,21,22] and may be good enough for detecting porn images, but our simulation results show that these features alone are not robust for detecting non-porn images. We therefore considered extra features, namely shape and texture, to discriminate non-porn from porn images. We used Hu moments as shape descriptors and the normalized edge direction histogram as a textural descriptor. The latter feature is extracted to describe the inside characteristics of the shape. For porn images, its value varies within a limited range. If the feature takes a very low value (that is, the skin region is smooth) or a very high value (that is, the skin region is considerably coarse), the image is probably not a porn one.

5. Classifier design

5.1. Classification of textual and profile features with ID3

Decision trees are tree-structured classifiers in which the nodes are tests on input patterns and the leaves are classes. Fig. 4 illustrates this structure. According to the different values that the tested attribute(s) can take at each node, the node is split into branches that connect to lower nodes.

Fig. 4. An example for decision tree.
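To make this structure concrete, a decision tree can be represented as nested test nodes and walked from the root down to a leaf. The Python sketch below is purely illustrative: the tree shape and the thresholds are hypothetical and are not the tree learned in the paper; only the feature names pcxwrd and nxlink are borrowed from Tables 1 and 2. Binary threshold tests are used for brevity.

```python
from typing import Any, Dict, Union

# A node is either a class label (leaf) or a test on one attribute.
Node = Union[int, Dict[str, Any]]

# Hypothetical tree: shape and thresholds are made up for illustration.
TREE: Node = {
    "attr": "pcxwrd", "threshold": 0.05,
    "low": 0,                        # few black words -> Category 0
    "high": {
        "attr": "nxlink", "threshold": 3,
        "low": 1,                    # Category 1
        "high": 2,                   # Category 2
    },
}


def classify(node: Node, page: Dict[str, float]) -> int:
    """Follow one root-to-leaf path and return the class stored at the leaf."""
    while isinstance(node, dict):
        branch = "high" if page[node["attr"]] > node["threshold"] else "low"
        node = node[branch]
    return node
```

For example, `classify(TREE, {"pcxwrd": 0.2, "nxlink": 5})` takes the "high" branch at both tests and ends at the leaf labeled 2.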
Each input pattern follows exactly one path from the first node of the tree (i.e., the root) to a leaf, which determines the class of that pattern. Minimum misclassification and simplicity are the two major criteria in decision tree design. Different decision tree algorithms have been proposed, such as ID3, C4.5, CART, and CHAID [1]. In the ID3⁸ algorithm, the attribute for each node is selected for maximum information gain or, equivalently, maximum decrease in entropy. Tree growing stops when all samples are in the same class or when the highest information gain is not greater than zero. In contrast to the original version of ID3, in this paper we also use a chi-square test to stop the growth of the tree, which is a kind of pre-pruning. Under this criterion, a node is not split if the p-value is greater than a confidence level. To construct the tree, 1164 pages from the three categories were randomly selected as the training dataset (276 pages from Cat0, 382 pages from Cat1, and 506 pages from Cat2). The confidence level for the tree was set to 0.1. The experimental classification results are reported in Section 6.2.

⁸ Induction of decision tree.

Fig. 3. An adult image with its local fit bounding box and ellipse: (a) original image; (b) local fit bounding box; (c) local fit ellipse.

5.2. MLP classifier for skin area

The multi-layer perceptron, a powerful model in pattern recognition, is applied in our skin detection process. This classifier can produce complex boundaries between classes. To decide whether a pixel in an image is skin or not, the color features of the pixel and of its four neighboring pixels are used. The MLP has one hidden layer and one neuron in its output layer. The sizes of the hidden layer and of the input (number of features) are flexible and are to be optimized. The sigmoid is used as the nonlinear activation function in all neurons, and Levenberg-Marquardt is used as the learning method. Cross-validation sampling is applied to produce the training and test datasets. To determine the size of the hidden layer, the number of neurons in this layer was varied from 5 to 50, and for each number of neurons the network was trained 7 times with different biases and weights. The mean square errors (MSE) of the network over the 7 runs, with RGB and rgCbCr features, for each number of hidden neurons are illustrated in Fig. 5. According to Fig. 5, a hidden layer with 30 neurons is the best choice for RGB-based features, and 45 neurons is the best choice for rgCbCr-based features. The training procedure and experimental tests are reported in Section 6.3.

Fig. 5. Mean square error of the network for both cases of RGB and rgCbCr features.
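The hidden-layer size selection described above amounts to a small grid search, which can be sketched as follows. The sketch assumes a user-supplied function `train_once(hidden, seed)` (hypothetical) that trains one network and returns its MSE. The step of 5 between candidate sizes and the averaging of the MSE over runs are also assumptions, since the text gives only the range 5 to 50 and the number of runs.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, Tuple


def select_hidden_size(train_once: Callable[[int, int], float],
                       sizes: Iterable[int] = range(5, 51, 5),
                       runs: int = 7) -> Tuple[int, Dict[int, float]]:
    """Return the hidden-layer size with the lowest mean MSE over `runs` runs."""
    mean_mse = {}
    for hidden in sizes:
        # Each run uses a different seed, i.e. a different initialization.
        mean_mse[hidden] = mean(train_once(hidden, seed) for seed in range(runs))
    best = min(mean_mse, key=mean_mse.get)
    return best, mean_mse
```

The returned dictionary holds one mean MSE per candidate size, which is the kind of curve that a plot such as Fig. 5 would show for each feature set.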
To create the training dataset, 165 images of skin and non-skin regions, covering various ambient conditions, races, and camera situations, were gathered from the Internet. From these images, 20,877 random pixels were selected, including 11,572 skin pixels and 9305 non-skin pixels. Of these pixels, 70% were randomly selected as the training set, 20% as the test set, and 10% as the validation set; the latter is used for network assessment. The classification results obtained on the validation dataset for both the RGB and rgCbCr cases are reported in Table 3. The results show that using rgCbCr features causes less error in the classification of pixels, but a more complicated network is needed.

Table 3
Classification results for validation data using RGB and rgCbCr based features.

Color-based features   Number of neurons in hidden layer   True positive   False positive
RGB                    30                                  86%             23.40%
rgCbCr                 45                                  88.80%          21.10%

5.3. Classification of visual features

The task in this step is to find the decision rule on the feature vector that optimally separates adult images from non-adult ones. Evidence from [7] shows that the MLP classifier offers a statistically considerable performance advantage over several other approaches, such as the generalized model, the k-nearest neighbor classifier, and the support vector machine. The MLP used here has two hidden layers with 15 and 7 neurons, respectively. To train the classifier, 165 images were used as the training dataset, containing both porn images with label +1 and non-porn images with label 0. Feature vectors for all images are extracted and fed to the MLP. The network output is a number between 0 and 1: the closer the output is to one, the more likely the input image is an adult image. We used a threshold value to obtain the binary decision. Fig. 6 shows the classification results for some typical image samples. The misclassification of Images 1, 2, and 3 is due to the similarity of the objects, or of the objects' pose, to specific objects in porn images. The other images are classified correctly. Fig. 7 displays the ROC curve (footnote 9) of the system for the different threshold values used in the MLP classifier.

5.4. Classifiers combination

The final classification procedure in the system is implemented as follows. First, the language of the input page is identified from the encoding of the words within the page. Next, the input page is classified based on textual and profile features using the ID3 classifier. The classification result is given as three probability values (p0, p1, and p2) for assigning the input page to one of the three categories (0, 1, and 2) mentioned in Section 3. If we have

$$\left| (p_1 + p_2) - p_0 \right| > Th \tag{1}$$

where p0 is the probability of assigning the input page to category 0 (i.e. permitted pages), p1 and p2 are the probabilities of assigning it to categories 1 and 2, respectively (both immoral pages), and Th is a threshold value, then the classification between the moral and immoral categories is reliable; that is, there is a sufficient confidence margin between the winner and the nearest loser. If not, we go to the next stage of classification, which is the classifier based on visual features. The classification result of this stage is used as an auxiliary feature for further discrimination between category 0 and the other categories in general, and between categories 1 and 2 as subcategories of immoral pages. For this, five images of the page are randomly selected and classified using global and local visual features. If more than two images are classified as porn images, the page is considered an immoral page (either category 1 or 2, according to the larger of p1 and p2). Otherwise, the classification is based on the preceding results of the textual and profile classification (i.e. the maximum of p0, p1, and p2). The whole procedure is illustrated in the flowchart of Fig. 8.

Footnote 9: Receiver operating characteristic.

Fig. 6. Examples of image classification in the proposed system. The right-side column shows the classifier's output, which is positive when larger than 0.5 and negative otherwise. The images labeled 1, 2, and 3 are FP cases; 4, 5, and 6 are TN cases; and 7 and 8 are TP cases.

Fig. 7. ROC curve of the proposed system in the detection of porn images.

6. Experimental results

6.1. Dataset description

In order to evaluate the performance of the proposed system, we used a manually collected dataset containing 5000 Web pages gathered with the Mozilla browser using the Google search engine. From this dataset, 1295 pages, including 1072 English and 223 Persian pages, were randomly selected to form a dataset that we call Statistically Distributed Data for Pornography Detection (SDDPD). This dataset consists of 700 immoral pages ranging from erotic to pornographic, which were sampled from 300 Websites in a keyword searching procedure. The porn pages include plain-text pages, gallery-like pages, and pages consisting of both text and images. The remaining 595 pages are non-pornographic, on the subjects of health care, training, scientific news, sports, art, and entertainment. Some confusing pages, such as anti-AIDS pages, medical material, and sex-education topics, as well as some other normal pages containing partially nude body pictures or suspicious words, were also included in order to evaluate the false positive rate of the system. Prior to classification, all pages were labeled manually as belonging to one of the three categories mentioned earlier. Table 4 shows the distribution of the data by page language, page type, immoral or normal status, and category.

6.2. Experiments with textual and profile features

As described in Section 5.1, the ID3 classifier used for textual and profile features was trained with 1164 pages selected from the SDDPD dataset. Next, a dataset of 290 pages was selected for testing the system performance. This dataset contains different types of English and Persian pages, belonging to the different categories (0–2), with or without images (125 pages from category 0, 43 pages from Cat1, and 122 pages from Cat2). The confusion matrices for the classification results are shown in Tables 5 and 6. As can be seen from Table 6, the performance of ID3 in classifying category 1 data is weak (63%), but in the other cases it is acceptable.

The highest accuracy rate is obtained for Cat2 data. The misclassification of data from Cat0 to Cat1 and vice versa is high (22.4% and 25%), which is due to the similarity between the data of the two categories, especially for borderline data such as sports or medical pages. This means that the extracted features cannot discriminate strongly between these two classes. In order to extract the most effective features, we applied a correlation analysis to the features. However, the results show a higher accuracy for the ID3 classifier compared with the Bayesian method used in our previous work [28].

Fig. 8. Flowchart of the hierarchical classifications in the final classifier.

Table 4
The distribution of Web pages in the experimental dataset.

Page type   Textual   Text with image   Total
Immoral     250       450               700
Normal      225       370               595
Total       475       820               1295

Page type   English   Persian   Total
Immoral     600       100       700
Normal      472       123       595
Total       1072      223       1295

Category    English   Persian   Total
Cat0        453       112       565
Cat1        163       19        182
Cat2        456       92        548
Total       1072      223       1295

Table 5
Confusion matrix for classification of train data with ID3 (rows: train data; columns: assigned category; page counts in parentheses).

Train data   Cat0           Cat1           Cat2
Cat0         84.80% (234)   12.30% (34)    2.90% (8)
Cat1         16.20% (62)    77.50% (296)   6.20% (24)
Cat2         1% (5)         8.50% (43)     90.5% (458)

Table 6
Confusion matrix for classification of test data based on textual and profile features (rows: test pages; columns: assigned category; page counts in parentheses).

Test pages   Cat0          Cat1          Cat2
Cat0         74.60% (50)   22.40% (15)   3% (2)
Cat1         25% (23)      63% (58)      12% (11)
Cat2         3% (4)        12.20% (16)   84.70% (111)

6.3. Experiments with skin color features

In the next stage, the test dataset was classified based only on skin color features. Pages containing at least two images with a skin area of 50% or more are classified as immoral. The results are reported in Table 7. As can be seen from the table, there is a relatively high false positive rate for Cat0 data (permitted pages), which is due to skin color or quasi-skin color in non-porn images that causes these pages to be classified as porn pages. For Cat2 data (immoral pages), a high proportion of pages (25%) are not classified correctly because they are purely textual pages without any pictures. The highest misclassification rate, however, appears for Cat1 data (48%), that is, pages that are neither strictly permitted nor strictly immoral and cannot be recognized based on skin color alone. The average classification rate using only the skin color feature on the 290 test pages, considering the TP and TN columns, is calculated as 77%.
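The page-level rule used in this experiment (at least two images with a skin area of 50% or more) can be written down in a few lines. The sketch below is illustrative only: the names `skin_ratio` and `page_is_immoral` are hypothetical, and it assumes that a boolean skin mask is already available for each image (for example from a pixel classifier such as the one in Section 5.2) and that "skin area" means the fraction of image pixels marked as skin.

```python
import numpy as np

def skin_ratio(mask):
    """Fraction of pixels marked as skin in a boolean mask (0.0 for an empty mask)."""
    mask = np.asarray(mask, dtype=bool)
    return float(mask.mean()) if mask.size else 0.0

def page_is_immoral(masks, area_threshold=0.5, min_images=2):
    """Page-level rule: enough images whose skin area reaches the threshold."""
    hits = sum(1 for m in masks if skin_ratio(m) >= area_threshold)
    return hits >= min_images
```

Under this rule a purely textual page has no masks at all and is always reported as permitted, which is in line with the missed Cat2 pages discussed above.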

Table 7
Confusion matrix for classification of test data based on skin color feature (columns: classified to; page counts in parentheses).

Test pages   Immoral true (TP)   Immoral false (FP)   Permitted true (TN)   Permitted false (FN)
Cat0         0                   12.70% (16)          87.30% (109)          0
Cat1         24% (10)            18.50% (8)           28% (12)              29.50% (13)
Cat2         75% (92)            0                    0                     25% (30)

6.4. Experiments with combination of features

Finally, in order to reduce the misclassification rate, we used visual features (explained in Section 4.1) together with textual and profile features. The test dataset of 290 pages was used again to evaluate the performance of the combined system. The classification results and error rates for each category are reported in Table 8. Here, the threshold Th in relation (1) is set to 0.2, the minimum image size is 50 × 50 pixels, and the threshold for skin area is 50%. As can be seen, the results improve significantly, with an increase in the TP rate and a reduction in the FN rate. It is worth noting that Tables 6 and 8 show the results for classification into three distinct classes (Cat0, Cat1, Cat2); if we consider classification into only two classes, immoral and permitted, the average classification rate is 95%. These results are shown in Tables 9 and 10 for textual-profile features and for the combination of all features, respectively.

Table 8
Confusion matrix for classification of test data based on combination of textual, profile, and visual features (rows: test pages; columns: assigned category; page counts in parentheses).

Test pages   Cat0         Cat1         Cat2
Cat0         90% (112)    8.50% (11)   1.50% (2)
Cat1         16.50% (7)   71% (31)     12.50% (5)
Cat2         1% (1)       3% (4)       96% (117)

6.5. Time complexity analysis

For the classification of an input Web page, the most time-consuming steps are, first, extracting all words or tokens in the text and searching for them in the black keyword lists and, second, detecting the skin area in all images within the input page. For the first step, the time complexity is of the order of N × W, where N is the number of words in the page and W is the total number of black words for all three categories Cat0, Cat1, and Cat2. For the second step, considering the three-layer MLP used for classification, the time cost for each image is of the order of M × k, where M is the number of pixels in each image and k = 20 × 100, where 20 is the number of components in the skin feature vector and 100 is the approximate number of instructions on one input line in each feed-forward pass of the MLP. Therefore, the total time complexity of the system for classifying each input page is N × W + M × K, where K = 5 × k (as we process at most 5 images from each page). Using a personal computer equipped with a 2 GHz Pentium processor, the average time for classifying each input page was 0.8 s.

7. Comparison with other works

It is very hard to make a practical comparison between the proposed system and other works in the literature. This is mainly due to the lack of information about other works and the lack of a standard benchmark for evaluating different systems. The experimental results reported by researchers are typically based on a limited amount of data or on specific databases that are usually not publicly available. There is no benchmark database used by all the works in the field, and there is no way to verify the correctness of the reported results. Another problem is that there is no major work in the field of Persian Web page classification, which makes it difficult to do a valid comparison.
Nevertheless, in order to give an illustrative comparison, we simulated the performance of one typical and recent work in the field and compared the classification results on the same validation dataset. We chose the work of Hammami et al. [3]. The

Table 9
Classification results for test data based on textual and profile features (columns: classified to).

Test pages   Immoral true (TP)   Immoral false (FP)   Permitted true (TN)   Permitted false (FN)
Immoral      89%                 0                    0                     11%
Permitted    0                   22%                  78%                   0

Table 10
Classification results for test data based on textual, profile, and visual features (columns: classified to).

Test pages   Immoral true (TP)   Immoral false (FP)   Permitted true (TN)   Permitted false (FN)
Immoral      95%                 0                    0                     5%
Permitted    0                   11%                  89%                   0
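As a rough cross-check of Table 10, the three-class page counts reported in Table 8 can be collapsed into two classes. The sketch below assumes that "immoral" means Cat1 and Cat2 taken together, as stated for relation (1) in Section 5.4; the function name `collapse_to_two_classes` is hypothetical and the code is not part of the original system.

```python
import numpy as np

# Page counts from Table 8: rows = true category, columns = assigned category.
table8 = np.array([
    [112, 11,   2],   # Cat0 (permitted)
    [  7, 31,   5],   # Cat1 (immoral)
    [  1,  4, 117],   # Cat2 (immoral)
])

def collapse_to_two_classes(cm):
    """Merge Cat1 and Cat2 into one 'immoral' class; returns [[TN, FP], [FN, TP]] counts."""
    tn = cm[0, 0]           # permitted pages kept as permitted
    fp = cm[0, 1:].sum()    # permitted pages flagged as immoral
    fn = cm[1:, 0].sum()    # immoral pages let through as permitted
    tp = cm[1:, 1:].sum()   # immoral pages assigned to Cat1 or Cat2
    return np.array([[tn, fp], [fn, tp]])

two = collapse_to_two_classes(table8)
tp_rate = two[1, 1] / two[1].sum()
tn_rate = two[0, 0] / two[0].sum()
print(f"TP rate: {tp_rate:.1%}, TN rate: {tn_rate:.1%}")  # TP rate: 95.2%, TN rate: 89.6%
```

Under this grouping, 157 of the 165 immoral test pages and 112 of the 125 permitted test pages are classified on the correct side, which agrees after rounding with the 95% and 89% entries of Table 10.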

Table 11
Classification results for test data obtained from the simulated system based on Hammami work (rows: test pages; columns: assigned category; page counts in parentheses).

Test pages   Cat0         Cat1        Cat2
Cat0         82.5% (71)   10.5% (9)   7% (5)
Cat1         21% (9)      70% (21)    9% (3)
Cat2         4.5% (4)     5.5% (13)   80% (65)

description of the system is given in Section 2 of this paper. According to the information given in their paper, 14 features were selected as textual-structural features, and one skin color feature (i.e. the percentage of skin area in the whole image) was selected as the visual feature. However, the skin color detection method they applied was not clearly described. A Bayesian classifier was used for the classification of textual-structural features. The classification results of the simulated system for 200 pages of our prepared test dataset are shown in Table 11. As can be seen, the classification results obtained with our proposed system (Table 8) are better overall. In the classification of permitted pages (Cat0 data and part of Cat1), the accuracy of both systems is roughly the same, but in the classification of porn pages (Cat2 data and part of Cat1), which mainly contain images, the accuracy of our system is significantly higher.

8. Conclusion

In this paper, we have presented a Web page filtering system based on a combination of textual, profile, and visual features. The model employs a hierarchical set of classifiers: an ID3 classifier for textual and profile features, and neural network models for skin color and visual features. A classification rate of 95% was obtained when the system was applied to a dataset of 1295 Web pages. Compared with other works in the field, the proposed model has the advantage of using more effective contextual features and combining them with two kinds of robust visual features. There are still many issues to be considered in future work. Examples are the ability of the classifiers to learn incrementally, how to prepare a representative dataset of Web pages covering all porn-page characteristics, and how to reduce the FP rate in the classification of pages with various subjects and images from different categories.
Acknowledgements

This research was supported by the Iranian Telecommunication Research Center (ITRC). The authors would like to thank ITRC for their kind help and support.

References

[1] L. Rokach, O. Maimon, Data mining with decision trees: theory and applications, World Scientific (2008) 71–72.
[2] http://www.consumersearch.com/parental-control-softwar.
[3] M. Hammami, et al., WebGuard: a web filtering engine combining textual, structural, and visual content-based analysis, IEEE Transactions on Knowledge and Data Engineering 18 (February (2)) (2006).
[4] M. Hammami, et al., Adult content web filtering and face detection using data-mining based skin-color model, in: Proceedings of the International Conference on Multimedia and Expo (ICME'04), 2004.
[5] Filtering Concepts and Applications, http://www.knowclub.com/paper/?p=322 (in Persian).
[6] Family Online Safety Institute, FOSI, http://www.fosi.org.
[7] A. Bosson, et al., Non-retrieval: blocking pornographic images, in: Proceedings of the International Conference on the Challenge of Image and Video Retrieval, vol. 2383, Lecture Notes in Computer Science, Springer-Verlag, 2002.
[8] Girgis, et al., An approach to image extraction and accurate skin detection from web pages, International Journal of Computer Science and Engineering 1 (2) (2007) (Spring).
[9] Z. Chen, et al., A novel web page filtering system by combining texts and images, in: Proceedings of the IEEE Conference on Web Intelligence, 2006.
[10] Lee, et al., Neural networks for web content filtering, in: IEEE Intelligent Systems, 2002, ISBN: 1094-7167.
[11] Shen, et al., The filtering of internet images based on detecting erotogenic-part, in: Proceedings of the Third IEEE Conference on Natural Computation, 2007.
[12] Wang, et al., Classifying objectionable websites based on image content, in: Proceedings of the Fifth International Workshop on Interactive Distributed Multimedia Systems and Telecommunication Services, 1998.
[13] B. Stayrynkevitch, et al., Poesia software architecture definition document, in: Technical Report, Poesia Consortium, December, 2002.
[14] W. Ho, P. Watters, Statistical and structural approaches to filtering internet pornography, in: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 2004, pp. 4792–4798.
[15] M. Jones, J. Rehg, Statistical colour models with application to skin colour detection, Technical Report CRL, vol. 11, Compaq Cambridge Research Lab, 1998.
[16] D.A. Forsyth, M. Fleck, Automatic detection of human nudes, International Journal of Computer Vision 32 (1) (1999) 63–77.
[17] H. Zheng, M. Daoudi, B. Jedynak, Blocking adult images based on statistical skin detection, Electronic Letters on Computer Vision and Image Analysis 4 (2) (2004) 1–14.
[18] Y. Xu, B. Li, X. Xue, H. Lu, Region-based pornographic image detection, in: Proceedings of the IEEE Seventh Workshop on Multimedia Signal Processing (MMSP'05), Shanghai, China, 2005.
[19] A.A. Abin, M. Fotouhi, S. Kasaei, Skin segmentation based on cellular learning automata, in: Proceedings of the Advances in Mobile Computing and Multimedia (MoMM), Linz, Austria, November, 2008, pp. 254–259.
[20] J.L. Shih, C.H. Lee, C.S. Yang, An Adult Image Identification System Employing Image Retrieval Technique, 2007.
[21] R. Ap-apid, An algorithm for nudity detection, in: Proceedings of the British Machine Vision Conference, vol. 2, 2001, pp. 491–500.
[22] M.K. Hu, Visual pattern recognition by moment invariants, in: Proceedings of the IRE Transactions on Information Theory, IT-8, 1962, pp. 179–187.
[23] Internet Pornography: Are Children at Risk?, http://www.nap.edu/catalog/10261.html?onpi-Webextra-050202.
[24] P.Y. Lee, S.C. Hui, A.C.M. Fong, An intelligent categorization engine for bilingual web content filtering, IEEE Transactions on Multimedia 7 (6) (2005) 1183–1190.
[25] W.A. Arentz, B. Olstad, Classifying offensive sites based on image content, Computer Vision and Image Understanding 94 (1–3) (2004) 295–310.
[26] B. Liu, J. Su, Z. Lu, Z. Li, Pornographic images detection based on CBIR and skin analysis, in: Proceedings of the Forth International Conference on Semantics, Knowledge and Grid, 2008, pp. 487–488.
[27] W. Hu, et al., Recognition of pornographic web pages by classifying texts and images, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (June (6)) (2007) 1019–1034.
[28] A. Ahmadi, M. Zamanian, M.M. Takami, Adult web page filtering using textual, structural, and visual features, Journal of Iran Telecommunication Research Center (IJICT) (February) (2010) (in Persian).

Ali Ahmadi received his B.Sc. in Electrical Engineering from Amirkabir University, Tehran, Iran in 1991 and his M.Sc. and Ph.D. in Artificial Intelligence and Soft Computing from Osaka Prefecture University, Japan in 2001 and 2004, respectively. He worked as a researcher in the Research Center for Nanodevices and Systems at Hiroshima University, Japan during 2004–2007. He has been with Khajeh-Nasir University of Technology, Tehran, Iran as an assistant professor since 2007. His research interests include intelligent models, machine vision, hardware-software co-design, and learning algorithms.

Mehran Fotouhi was born in Isfahan, Iran, in 1981. He received the B.S. degree in Computer Engineering from Isfahan University of Technology, Iran in 2005, and the M.S. degree in Computer Architecture from Sharif University of Technology, Tehran, in 2009. Currently he works on content-based Web page classification at the Iran Telecommunication Research Center (ITRC).