Intelligent classification of web pages using contextual and visual features

Applied Soft Computing 11 (2011) 1638–1647 Contents lists available at ScienceDirect Applied Soft Computing journal homepage: www.elsevier.com/locat...


Applied Soft Computing 11 (2011) 1638–1647

Contents lists available at ScienceDirect

Applied Soft Computing journal homepage: www.elsevier.com/locate/asoc

Intelligent classification of web pages using contextual and visual features

Ali Ahmadi a,∗, Mehran Fotouhi b, Mahmoud Khaleghi c

a Electrical & Computer College, Khajeh Nasir Toosi University of Technology, Shariati St., Seyedkhandan, Tehran, Iran
b Computer Department, Sharif University of Technology, Tehran, Iran
c Iranian Telecommunication Research Center, Tehran, Iran

Article info

Article history: Received 25 May 2009; received in revised form 26 January 2010; accepted 1 May 2010; available online 12 May 2010

Keywords: Web-page classification; Content-based filtering; Porn image detection; Skin color detection; Adult image detection

Abstract

In this paper we address the classification of Web content, and in particular its application to the detection of pornographic Web pages. Filtering of undesirable Web content is mainly achieved either by blocking a specific Web address after looking it up in a reference list of blacklisted URLs, or by a plain contextual analysis that searches the page text for special keywords. The main problems with current filtering methods are the need to update the URL list constantly and the high rate of over-blocking of ordinary pages. We propose an intelligent approach based on textual, profile, and visual features combined in a hierarchical classifier. Textual features contain information about keywords, black words, etc., and profile features contain structural information such as the number of links, meta tags, pictures, etc. For the visual features we employ a set of global and local indicative features, including topological and shape-based characteristics, extracted from the skin region. The algorithm was applied to a training set of 1295 Web pages, comprising 700 porn pages (with text, images, or both) in English and Persian and 595 non-porn pages on topics such as medicine, health, and sports. On a test set of 290 Web pages, an accuracy of 95% was obtained.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

With the ever-growing Web, the number of Websites with objectionable content such as pornography, violence, and racism has increased rapidly in recent years. Among offensive content, pornography is the most harmful, affecting children's safety and causing many destructive side effects. According to a recent survey, one in four kids reported at least one unwanted exposure to sexually explicit pictures, and one in five reported receiving a sexual solicitation [23]. Considerable research effort has recently gone into blocking pornographic Websites, among which content-based filtering is the most effective [5,6]. Many software packages have also been developed [2]; they mainly employ two kinds of approaches for classifying Web pages: static filtering and dynamic filtering.

Static filtering blocks a specific Web address by looking it up in a reference list of blacklisted URLs. Although this method is fast, its shortcoming is the need to update the URL list constantly, which is a very hard task on the rapidly changing Web. Another problem is the high rate of over-blocking of ordinary pages, such as pages with medical, sports,

∗ Corresponding author. Tel.: +98 21 22361217; fax: +98 21 22361217. E-mail address: [email protected] (A. Ahmadi). 1568-4946/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.asoc.2010.05.003

or arts topics, or the blocking of an entire Website because of a single immoral page on it.

In dynamic filtering, classification is based on content analysis. The content of a page is first analyzed with intelligent algorithms such as learning models and data mining methods, and the page is then classified according to its content features. Classification accuracy is higher with this approach, but the classification process is computationally heavier, which causes problems in online applications.

A key point in content-based filtering is that images are an essential part of Web pages, particularly adult pages. A study of more than four million Web pages reveals that 70% of them contain images and that there are, on average, 18.8 images per Web page [13]. A statistical analysis of 1232 pornographic and 6967 non-pornographic Web pages shows that 72% of the pornographic pages have more than 5 images and 60% of them have more than 10 images [14]. In addition, 40% of pornographic Web pages have more than 5 links to image and video files. Therefore, any effective Web site classification system should take the visual content into account and provide a method for detecting the characteristics of the images within the page.

From a survey of the main works in the literature, we concluded that a heuristic combination of content features together with a hybrid classifier structure can significantly enhance filtering performance. In this paper, we propose an intelligent approach


based on three sorts of features, namely textual, profile, and visual features, in a hierarchical classifier structure, and we test its performance on a representative dataset of Web pages. For the visual features we employ a set of global and local indicative features, including topological and shape-based characteristics, extracted from the skin region. Artificial neural networks are used as learning models for both skin detection and final image classification.

2. Related works

According to a survey of works on content-based filtering, there are two main approaches to Web classification: classification based on textual content features, and classification based on both visual and textual features. The first group uses textual analysis, mainly by searching the text for a list of indicative keywords. In the second group, textual content-based analysis is combined with visual features to obtain a more robust classification. Visual features are extracted from the images in the Web pages using image processing techniques. For instance, skin-area detection [3,4,8], detection of ROIs¹ in the human body [9,11], and the image wavelet transform [12] are some of the approaches already proposed. Here, we point out some of the main works in the field.

Hammami et al. [3] proposed an approach based on the extraction of both contextual and visual features. In their work, 20 textual and profile features are extracted along with some visual features obtained from the skin area in the images. It is asserted that the accuracy of the system improves significantly when a hierarchical combination of contextual and visual features is used. However, the method for combining the features and the classifiers is not clearly described. One advantage of their work is the use of structural features of the input page alongside various kinds of textual features.
The drawback of their work is that no effectiveness or correlation analysis is applied to the extracted features. Moreover, the classification based on visual features is not highly accurate, because the proportion of skin pixels to total pixels is the only main feature.

In [27], Hu et al. first use a C4.5 tree to categorize input pages into three classes: continuous texts, distinct texts, and image pages. A CNN² is employed for finding the semantic relations within continuous text, and a naïve Bayesian algorithm is used for recognizing distinct texts. The classification results based on textual and visual features are finally combined by a Bayesian algorithm. The system is tested with 1000 pages on different subjects, and an average classification rate of 91.6% is obtained. The main advantage of their work is the use of semantic analysis for classifying textual content. However, the sequential classification procedure seems to accumulate errors across the classification steps. Moreover, the multi-dimensional histogram method they use for extracting skin features is a global operator, which does not seem to be a suitable solution because of its high time complexity and low accuracy in detecting quasi-skin areas.

Chen et al. [9] proposed a statistical approach that combines textual and visual content. Their work consists of three steps: (1) classification based on discrete text (keywords), (2) classification based on continuous text (sentences), and (3) classification through images. Besides methods such as skin color detection, they also use some other ROI-based features for image classification. Finally, a fusion classifier that combines k-NN³ for the classification of texts with the Yang method for the classification of images is introduced, and a 91.8% classification rate is obtained over 1500 sample pages. Using only URLs and keywords instead of a content-based analysis, the small size of the test dataset, and the relatively low accuracy rate are some shortcomings of their work.

¹ Region of interest. ² Cellular neural network. ³ k nearest neighbors.

Lee et al. [10,24] used an artificial neural network to classify Web pages based on textual content. The textual features comprise the page title, the visible part of the text, the metadata for page description and keywords, and the tooltips of the images. A pre-processing stage converts the features into an input vector for the neural network classifier. The system was trained on 3777 non-porn pages and 1009 porn pages (labeled manually), and its performance was tested on a database of 535 porn sites and 523 non-porn sites. The reported accuracy is 95%. The advantage of the method is its ability to recognize bilingual pages (Chinese and English); its shortcoming is its inability to classify gallery pages or pages with many images.

As for visual features, the first step in almost all existing approaches is skin detection. In [15], the authors showed that there is a strong correlation between the percentage of skin and the likelihood of pornographic content in an image. Pixel-based and region-based algorithms are the two main approaches to skin detection. In pixel-based methods, the color of a pixel is used as a feature, while in region-based methods classification relies on the spatial information of pixels. However, because of factors such as individual characteristics (e.g., race, age, body part) and variations in illumination, which affect skin appearance, the detection result may be unreliable. To overcome these problems, a number of approaches based on different color spaces have been proposed.
In [15], a histogram method with Gaussian mixture models is proposed for skin detection. Forsyth and Fleck [16] used texture information in a log-opponent color space to segment skin regions. Zheng et al. [17] proposed a statistical skin detection method based on a maximum entropy model. In [18], region-based skin detection is proposed: color and texture features are extracted from arbitrarily shaped segmented regions and classified by Gaussian mixture models. Skin detection based on cellular learning automata is proposed in [19]; a skin probability map is extracted from texture information and then fed to the CLA⁴ to decide on skin-like regions. Explicit rules in the YCbCr color space are used in [20] for skin detection. In [21], a skin color distribution model based on the RGB, normalized RGB, and HSV color spaces is constructed using correlation and linear regression between the components.

Girgis et al. [8] presented a system for extracting images from Websites and detecting skin areas. This system, called BHO (browser helper object), was an IE⁵-accessible object that ran in the background of IE and could extract all images and URL links in a page. Two skin detection techniques based on the YUV and RGB color spaces were introduced, and it is reported that YUV is the optimum space for the skin detection algorithm. However, the proposed skin color detection method does not seem to be robust.

Bosson et al. [7] proposed a method based on visual content features. In the first stage they use a skin filter to localize skin pixels and generate a skin blob. Then topological features, such as the area and the lengths of the major and minor axes of an ellipse fitted to the blob, are used to classify the page. The MLP⁶ gave the lowest misclassification rate of the classification methods applied, ahead of the four others. For evaluating the algorithm, a data set of 10,005 images, hand-classified into five categories, is used

⁴ Cellular learning automata. ⁵ Internet Explorer. ⁶ Multi-layer perceptron.



and a classification rate of 87.2% is reported. The advantage of their work lies in the optimal classifier design, but the features used for skin detection are not described clearly, and no robust method is introduced for evaluating the visual features.

Different approaches have been applied to image classification based on skin detection results. A straightforward nudity detection algorithm is proposed in [21], based on simple rules defined on the percentage of skin area in the image. A number of researchers have introduced additional heuristic features. In [16], the authors used a body geometric filter to find human structures such as limbs in the skin regions. In [17], elliptical features are extracted from the image and an MLP is used to learn them. The authors of [18] used eigen-regions as features and learned them with an MLP. In [20], a detection method based on image retrieval is proposed: color, texture, and a set of shape features are used to retrieve the 100 most similar images from an image database containing both adult and non-adult pictures. Wang et al. [12] developed a WIPE⁷-based image retrieval technique that uses a feature vector including Daubechies wavelets, normalized central moments, and a color histogram. Jones and Rehg [15] took five simple features from the output of the skin detector and trained an MLP classifier on them to determine whether a human being is present in the image. Liu et al. [26] applied an image retrieval method to pornography detection: first a human being is detected in the picture, and then a skin color analysis determines whether the picture is porn or not. Arentz and Olstad [25] extracted a set of visual features from the connected skin area, containing information about the color, texture, form, center of gravity, and area of the skin. Their approach concentrates mainly on skin area detection.
A validation dataset consisting of 20 Web pages with around 2000 images is used, and an 89% classification rate is obtained. The advantage of the method is the use of a genetic algorithm to optimize feature selection, which leads to a significant reduction in the misclassification rate. Its weakness is its inability to recognize and classify pages without images.

Each of the above methods has its strengths and weaknesses, such as system overloading, high computational cost, over-blocking, lack of support for all kinds of Web pages, and low accuracy, which will be discussed later.

3. Outline of the proposed algorithm

Fig. 1 illustrates the outline of the proposed algorithm. To overcome the shortcomings of previous works, we use three sets of classifiers, which we call weak classifiers, based on three types of content features. We then combine them in a hierarchical structure to obtain the final robust classifier for pornography detection. As illustrated in the block diagram, the training system is composed of three principal modules: feature extraction, feature vector generation, and hierarchical classification. First, a representative database of Web pages was collected using the WebCrawler software, and the pages were manually classified into porn and non-porn categories. In the feature extraction and feature vector generation steps, three types of analysis are applied to extract the required features: (i) textual content features, shown in Table 1, (ii) profile content features, shown in Table 2, and (iii) visual content features, including skin color space and specific object analysis. These features are then used in a learning process to generate the optimized classifier parameters as well as the optimized feature vector. Later, in the testing phase, we use

⁷ Wavelet image pornography elimination.

Fig. 1. Flowchart of the proposed system for detection of immoral Web pages (training phase).

Table 1
Textual features for a Web page.

Feature      Description
wrd          Number of words in the page
xwrd         Number of black words in the page
pcxwrd       Ratio of black words to total words
nxkywrd      Number of keywords with a black word within them
pcxkywrd     Ratio of black keywords to all keywords
nxdscript    Number of descriptions with a black word within them
ntitwrd      Number of total words in the page title
nxtitwrd     Number of black words in the page title

this trained classifier to classify input pages into porn and non-porn classes. We specify three distinct categories for the classification of Web pages with regard to immoral content:

• Category 0, containing all permitted pages (i.e. ordinary permitted pages without any pornographic or immoral content).
• Category 1, including immoral but non-porn pages (without dirty words or images harmful to children, or non-porn pages that look like porn ones, such as medical pages).

Table 2
Profile features for a Web page.

Feature      Description
npix         Number of images in the page
nxpix        Number of images with a black word in their names
nlink        Number of links in the page
nxlink       Number of links with a black word in them
nxxlink      Number of page links that exist in the black URL list
pcxlink      Ratio of black links to total links
nmtkywrd     Number of meta tags with keywords
nvideo       Number of videos in the page
nframe       Number of frames in the page
ncolor       Number of colors used in the page
ndscript     Number of meta tags with description
nwarn        Number of warning tags
nxwarn       Number of warnings with a black word in them
ntooltip     Number of tooltips
nxtooltip    Number of tooltips with a black word in them


• Category 2, including porn pages (with dirty words and pornographic images).

The novelty of the proposed method can be summarized as follows:

(1) The input pages are classified into three distinct classes according to their level of immorality, rather than into a single negative or positive class.
(2) Three different lists of characteristic words, corresponding to the three output classes, are used for extracting the word-based features from the main text, the title, tooltips, etc.
(3) A set of local and global features, extracted with a new approach, is employed for recognizing porn pictures. Combined with the skin color feature, which is itself generated with a new method, they form a very effective set of visual features.
(4) A hierarchical structure is used for integrating the classifiers.

4. Feature extraction

Here, we describe the textual, profile, and visual features and how we extract them.

4.1. Textual and profile features

Textual features include all the items listed in Table 1, and the profile features are listed in Table 2. The features are extracted and processed by the WebCrawler software, which was designed and implemented by our team. WebCrawler can process a single Web page or a group of pages, online or offline, and extract their textual and profile features as well as the images within the pages. URL links within a page can be followed to unlimited depth. Various settings can also be configured, and a black-word list and a black-URL list can be supplied to the system.

To classify Web pages we need a set of features that is representative of page characteristics. At first, we used density features, calculated by counting the number of keywords used in each of the main parts of the page, that is, the context, links, title, image names, and tooltips.
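As an illustration only, the density features just introduced could be computed along the following lines. This is a minimal sketch, assuming that a density is the number of category words in a page part divided by the number of words in that part; the function names, the dictionary layout, and the toy word lists are hypothetical and are not taken from the WebCrawler software.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())

def density(text, category_words):
    """Share of the tokens in `text` that belong to `category_words`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in category_words)
    return hits / len(tokens)

def density_features(page_parts, word_lists):
    """Return one density value per (page part, category) pair.

    page_parts: dict mapping a part name (e.g. "title") to its raw text.
    word_lists: dict mapping a category id to a set of characteristic words.
    Both layouts are assumptions made for this sketch.
    """
    return {
        (part, category): density(text, words)
        for part, text in page_parts.items()
        for category, words in word_lists.items()
    }

# Toy input with placeholder words (not the authors' word lists).
parts = {"context": "alpha beta gamma alpha", "title": "beta news"}
lists = {0: {"news"}, 2: {"alpha"}}
features = density_features(parts, lists)
```

With five page parts and three category word lists, the same call would return fifteen values, one per pair.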
In other words, the number of characteristic words used in the context or in the other page attributes is determined according to the keyword list of each category, and the density value for each feature is obtained by dividing this count by the total number of words. These density values are computed for each of the three categories and each of the five attributes mentioned above, giving 15 textual features. Classification based on these features showed good accuracy over many sorts of Web pages. However, for pages that deal with a specific subject and use specific words belonging to one category, the results were misleading. For example, a page about children's sexual rights has a high density of immoral words although it is not an immoral page. The point is that in such pages the number of very specific words is high while the variety of words is low. Therefore, a new feature was generated based on the variety of characteristic words used in the page: the number of distinct words indicative of each category is divided by the total number of distinct words in the page. We call this feature the frequency of category words in the page. Using this new feature increased the classification accuracy significantly.

4.2. Visual features

This step extracts two sorts of features: global and local. Our feature extraction algorithm has some similarity with


what is introduced in [17,22,7], but it differs considerably in practical implementation. In most adult image detection systems, the percentage of skin is used directly as one of the main features. For example, in [7] the area of the face is considered as a feature. This may cause errors for images that contain other objects with skin-like color, or skin objects at different distances from the camera, and it also reduces the effect of the other features in the detection process. Experimental results in [21] show that very few adult images have a skin percentage below 15%. Based on our experiments, if the largest skin region contains more than 20% of the skin pixels (used as a threshold value), this region can be considered the main region for extracting local features. We fit one ellipse, the global ellipse, to all skin regions. Another ellipse, which we call the local ellipse, is fitted to the largest skin region for extracting local features. The parameters of the ellipses are computed using central moments.

4.2.1. Skin detection method

Many methods have been suggested for skin detection, but color-based approaches are widely used because of their high speed and good precision. In this paper, color-based features are used to discriminate skin from non-skin areas. Since color-based features are easy to obtain and robust to orientation and scaling, they make the skin detection system faster and more precise. Their drawback is their sensitivity to ambient conditions such as illumination and type of camera. The RGB color space is particularly sensitive to illumination. One way to overcome this problem is to transform RGB to the YCbCr color space, discard the luminance axis Y, and use only the chrominance axes Cb and Cr. In this paper, both color spaces have been investigated.
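The color transformation described above can be sketched as follows. The paper does not state which YCbCr variant is used, so this example assumes the full-range BT.601 (JFIF) coefficients; the function name `chroma_features` is hypothetical. Besides Cb and Cr, the sketch also returns the normalized chromaticities r and g.

```python
import numpy as np

def chroma_features(rgb):
    """Map an (..., 3) RGB array with values in 0..255 to (..., 4) [r, g, Cb, Cr].

    Assumption: full-range BT.601 (JFIF) YCbCr; the luminance Y is not computed
    because only the chrominance axes are kept.
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    red, green, blue = rgb[..., 0], rgb[..., 1], rgb[..., 2]

    # Normalized chromaticities; black pixels (sum 0) map to r = g = 0.
    total = red + green + blue
    safe_total = np.where(total > 0, total, 1.0)
    r = red / safe_total
    g = green / safe_total

    # Chrominance axes of YCbCr, centered at 128.
    cb = 128.0 - 0.168736 * red - 0.331264 * green + 0.5 * blue
    cr = 128.0 + 0.5 * red - 0.418688 * green - 0.081312 * blue
    return np.stack([r, g, cb, cr], axis=-1)

# A mid-gray and a black pixel both carry neutral chrominance.
pixels = chroma_features([[128, 128, 128], [0, 0, 0]])
```

A gray pixel has no color cast, so both chrominance values sit at the neutral point 128 regardless of brightness, which is the illumination robustness the transformation is meant to provide.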
The first group of features consists of the R, G, and B values of a pixel and its four neighboring pixels (15 features), and the second group consists of r, g (normalized R and G), Cb, and Cr of a pixel and its four neighbors (20 features). We selected the second group because of its better classification performance, which is reported in Section 5.2.

4.2.2. Global features

These features are extracted from the global ellipse. We normalize the ellipse center, major axis length, and minor axis length relative to the image size. The ratio of the minor axis to the major axis is also computed.

4.2.3. Local features

Local features are computed on the largest skin region. We obtain three categories of features for this region. Features of the first category describe the compactness of the largest skin region and its position with respect to the other skin regions. These features are:

• The ratio of the minor axis to the major axis of the local ellipse.
• The normalized center of the local ellipse with respect to the image size.
• The eccentricity of the local ellipse.
• The difference between the angles that the major axes of the local ellipse and the global ellipse make with the horizontal axis.
• The ratio of the largest skin region to the area of the local ellipse.
• The ratio of the skin region to the area of the bounding box.
• The ratio of the width of the bounding box to its height.
• The ratio of the largest skin region to all skin regions.
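Most of the ellipse and bounding-box quantities listed above can be derived from the second-order central moments of a binary region mask. The following is a minimal sketch of one standard way to do so, not the authors' implementation: the function name and the returned keys are hypothetical, and the axis length 4·sqrt(variance) is an assumption that holds for a uniformly filled ellipse.

```python
import numpy as np

def region_shape_features(mask):
    """Ellipse and bounding-box descriptors of one binary region mask (2-D)."""
    mask = np.asarray(mask, dtype=bool)
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        raise ValueError("mask contains no region pixels")
    area = float(xs.size)

    # Centroid and second-order central moments, normalized by the area.
    cx, cy = xs.mean(), ys.mean()
    dx, dy = xs - cx, ys - cy
    mu20, mu02, mu11 = np.mean(dx * dx), np.mean(dy * dy), np.mean(dx * dy)

    # Eigenvalues of the 2x2 covariance matrix: variances along the axes.
    spread = np.sqrt((mu20 - mu02) ** 2 + 4.0 * mu11 ** 2)
    lam_major = (mu20 + mu02 + spread) / 2.0
    lam_minor = max((mu20 + mu02 - spread) / 2.0, 0.0)

    # Assumption: full axis length = 4 * sqrt(variance) (filled ellipse).
    major, minor = 4.0 * np.sqrt(lam_major), 4.0 * np.sqrt(lam_minor)
    angle = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)  # radians, image axes

    width = xs.max() - xs.min() + 1
    height = ys.max() - ys.min() + 1
    ellipse_area = np.pi * major * minor / 4.0
    return {
        "center": (cx / mask.shape[1], cy / mask.shape[0]),
        "axis_ratio": minor / major if major > 0 else 1.0,
        "eccentricity": np.sqrt(1.0 - lam_minor / lam_major) if lam_major > 0 else 0.0,
        "angle": angle,
        "region_to_ellipse": area / ellipse_area if ellipse_area > 0 else 0.0,
        "region_to_bbox": area / (width * height),
        "bbox_aspect": width / height,
    }

# Toy region: a 10 x 4 axis-aligned rectangle inside a 12 x 8 image.
mask = np.zeros((8, 12), dtype=bool)
mask[2:6, 1:11] = True
features = region_shape_features(mask)
```

Calling the function once on the mask of all skin pixels and once on the mask of the largest region would give the global and local ellipses; the angle-difference feature is then the difference of the two `angle` values.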

Figs. 2 and 3 show a local bounding box and ellipse on the largest skin region for two different images. It seems clear that these features differ significantly between porn and non-porn images. They are rotation and scale invariant and can be learned by a machine learning algorithm.



Fig. 2. A non-porn image with its local fit bounding box and ellipse: (a) original image; (b) local fit bounding box; (c) local fit ellipse.

Fig. 4. An example of a decision tree.

The second category of features is based on the shape of the largest skin region. Shape-based object detection and recognition is common in image processing. The purpose of shape descriptors is to characterize the object shape distinctively. A good shape descriptor should be insensitive to noise, decrease the within-class variance, and maximize the between-class variance. Several approaches have been proposed for describing the shape of an object; for instance, chain codes, curvature, and moments are used as good shape descriptors. Here, we use the seven normal moment invariants defined by Hu [22] to describe shape features. These moments are invariant to translation, scale, and rotation.

The third category consists of textural features computed on the largest skin region. Moments describe the outside of the shape, so we also need extra features to describe its inside characteristics. For this, we use the normalized edge direction histogram, which is provided in MPEG-7 to represent the local edge distribution of an image. The edge direction histogram is computed in six directions (0°, 45°, 90°, 135°, 225°, and 315°). To make this feature scale invariant, the edge direction of the boundary of the largest skin region is omitted. In

order to make it invariant to rotation, we rotate the largest skin region clockwise so that the local ellipse has the same orientation as the horizontal axis. Finally, a feature vector of 29 global and local features is constructed for the observed image. This feature vector is fed to a classifier in the next stage.

4.2.4. Feature analysis

For an image that contains only one nude object (or porn scene), the global features are equal to the local ones. Some local features, such as the ratio of the skin region to the area of the bounding box or the ratio of the width to the height of the bounding box, are used for detecting non-porn images that contain a large skin region. For an adult image that has only one nude object (or porn scene), the values of these features vary within a limited range. The global and local features used here are similar to the features in [18,21,22] and may be good enough for detecting porn images, but our simulation results show that these features alone are not robust for detecting non-porn images. We therefore considered extra features, namely shape and texture, to discriminate non-porn from porn images. We used Hu moments as shape descriptors and the normalized edge direction histogram as a textural descriptor. The latter feature is extracted to describe the inside characteristics of the shape. For porn images, its value varies within a limited range. If the feature takes a very low value (a smooth skin region) or a very high value (a considerably coarse skin region), the image is probably not a porn one.

5. Classifier design

5.1. Classification of textual and profile features with ID3

Decision trees are a tree-structured classification method in which the nodes are tests on input patterns and the leaves are classes. Fig. 4 illustrates this structure. According to the different values that the attribute(s) can take at each node, the nodes are split into branches and connected to lower nodes.
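To make the choice of a node test concrete: ID3, the algorithm named in the title of this subsection, scores each candidate attribute by its information gain, that is, the drop in class entropy obtained by splitting on that attribute. The sketch below is a generic textbook formulation with hypothetical function names; it assumes discrete attribute values, so numeric page features would first have to be binned.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on `values`.

    values: one discrete attribute value per sample.
    labels: the class label of each sample, in the same order.
    """
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data: the first attribute separates the classes, the second does not.
classes = [0, 0, 1, 1]
gain_useful = information_gain(["a", "a", "b", "b"], classes)
gain_useless = information_gain(["a", "b", "a", "b"], classes)
```

An attribute that separates the classes perfectly recovers the full entropy of the labels, while an uninformative attribute gains nothing.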
Each input pattern follows exactly one path from the first node of the tree (i.e. the root) to a leaf, which determines the class of that pattern. Minimum misclassification and simplicity are the two major criteria in decision tree design. Different algorithms have been proposed for decision trees, such as ID3, C4.5, CART, and CHAID [1]. In the ID3⁸ algorithm, the attribute for each node is selected by maximum information gain or, equivalently, maximum decrease in entropy. Tree growing stops when all samples are in the same class or when the highest information gain is not greater than zero. Compared with the original version of ID3, in this paper

Fig. 3. An adult image with its local ﬁt bounding box and ellipse: (a) original image; (b) local ﬁt bounding box; (c) local ﬁt ellipse.

⁸ Induction of decision tree.



Fig. 5. Mean square error of the network for both cases of RGB and rgCbCr features.

we also use a Chi-square test to stop the growth of the tree, which is a kind of pre-pruning. Under this criterion, a node is not split if the p-value is greater than a confidence level. To construct the tree, 1164 pages from the three categories were randomly selected as the training dataset (276 pages from Cat0, 382 pages from Cat1, and 506 pages from Cat2). The confidence level for the tree was set to 0.1. The experimental classification results are reported in Section 6.2.

5.2. MLP classifier for skin area

The multi-layer perceptron, a powerful model in pattern recognition, is applied in our skin detection process. This classifier can produce complex boundaries between classes. To decide whether a pixel in an image is skin or not, the color features of the pixel and its four neighboring pixels are used. The MLP has one hidden layer and one neuron in its output layer. The sizes of the hidden layer and of the input (the number of features) are flexible and are to be optimized. The sigmoid is used as the nonlinear activation function in all neurons, and Levenberg–Marquardt is used as the learning method. Cross-validation sampling is applied to produce the training and test datasets. To determine the size of the hidden layer, the number of neurons in this layer was varied from 5 to 50, and for each number of neurons the network was trained 7 times with different initial biases and weights. The mean square errors (MSE) of the network over the 7 runs with RGB and rgCbCr features, for each number of hidden neurons, are illustrated in Fig. 5. According to Fig. 5, a hidden layer with 30 neurons is the best choice for RGB-based features, and 45 neurons is the best choice for rgCbCr-based features. The training procedure and experimental tests are reported in Section 6.3.
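The network shape and the hidden-size selection described above can be sketched in NumPy as follows. This is only an illustration: training with Levenberg–Marquardt is omitted, the weights are random placeholders, the function names are hypothetical, and comparing hidden sizes by the mean MSE over the repeated runs is an assumption about how the runs were aggregated.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation used in every neuron."""
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, w1, b1, w2, b2):
    """Forward pass of an MLP with one hidden layer and one output neuron."""
    hidden = sigmoid(x @ w1 + b1)
    return sigmoid(hidden @ w2 + b2)

def mse(pred, target):
    """Mean square error between predictions and targets."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))

def best_hidden_size(mse_runs):
    """Pick the hidden size with the lowest mean MSE over its runs.

    mse_runs: dict mapping a hidden-layer size to the list of MSE values
    obtained from repeated trainings (assumed aggregation: the mean).
    """
    return min(mse_runs, key=lambda size: np.mean(mse_runs[size]))

# Shapes taken from the text: 20 rgCbCr inputs and 45 hidden neurons.
rng = np.random.default_rng(0)
n_features, n_hidden = 20, 45
w1, b1 = rng.normal(size=(n_features, n_hidden)), np.zeros(n_hidden)
w2, b2 = rng.normal(size=(n_hidden, 1)), np.zeros(1)
scores = mlp_forward(rng.random((3, n_features)), w1, b1, w2, b2)

# Made-up MSE values, only to show the selection step.
chosen = best_hidden_size({30: [0.2, 0.3], 45: [0.1, 0.2]})
```

Each row of `scores` is a value between 0 and 1 that can be read as the skin likelihood of one pixel once the weights have actually been trained.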
To create the training dataset, 165 images of skin and non-skin regions, covering various ambient conditions, races, and camera settings, were gathered from the Internet. From these images, 20,877 random pixels were selected, including 11,572 skin pixels and 9305 non-skin pixels. Of these pixels, 70% were randomly selected as the training set, 20% as the test set, and 10% as the validation set, the last of which is used for network assessment. The classification results obtained on the validation dataset for both the RGB and rgCbCr cases are reported in Table 3. The results show that using rgCbCr features causes less error in the classification of pixels, but a more complicated network is needed.

Table 3
Classification results for validation data using RGB- and rgCbCr-based features.

Color-based features   Number of neurons in hidden layer   True positive   False positive
RGB                    30                                  86%             23.40%
rgCbCr                 45                                  88.80%          21.10%

5.3. Classification of visual features

The task in this step is to find a decision rule on the feature vector that optimally separates adult images from non-adult ones. Evidence from [7] shows that an MLP classifier offers a statistically significant performance improvement over several other approaches, such as the generalized model, the k-nearest neighbor classifier, and the support vector machine. The MLP used here was configured with two hidden layers of 15 and 7 neurons. For training the classifier, 165 images were used as the training dataset, containing both porn images with label +1 and non-porn images with label 0. Feature vectors are extracted for all images and fed to the MLP. The network output is a number between 0 and 1; the closer this number is to 1, the more likely the input image is an adult image. A threshold value is used to obtain the binary decision. Fig. 6 shows the classification results on some typical image samples. The misclassification of Images 1, 2, and 3 is due to the similarity of the objects, or of their poses, to the specific objects in porn images. The other images are classified correctly. Fig. 7 displays the ROC⁹ of the system performance for different threshold values used in the MLP classifier.

5.4. Classifiers combination

The final classification procedure in the system is implemented as follows. First, the language of the input page is identified through the encoding of the words within the page. Next, the input page is classified based on textual and profile features using the ID3 classifier. The classification result comes as three probability values (p0, p1, and p2) for assigning the input page to one of the three categories (0, 1, and 2) mentioned in Section 3. If we have

(p1 + p2) − p0 > Th    (1)

where p0 is the probability of assigning the input page to category 0 (i.e., permitted pages), p1 and p2 are the probabilities of assigning it to categories 1 and 2, respectively (both immoral pages), and Th is a threshold value, then the classification between the moral and immoral categories is reliable; that is, there is a sufficient confidence margin between the winner and the nearest loser. If not, we go to the next stage of classification, which is the classifier based on visual features. The classification result of this stage is used as an auxiliary feature for further discrimination between category 0 and the other categories in general, and between categories 1 and 2 as subcategories of immoral pages. For this, five images of the page are randomly selected and classified using global and local visual features. If the number of images classified as porn is more than two, the page is considered an immoral page (either category 1 or 2, according to the maximum of p1 and p2). Otherwise, the classification is performed based on the preceding results of the textual and profile classification (i.e., the maximum of p0, p1, and p2). The whole procedure is illustrated in the flowchart of Fig. 8.

9 Receiver operating characteristic.

Fig. 6. Examples of image classification in the proposed system. The right-side column shows the classifier's output, which is positive when larger than 0.5 and negative otherwise. The images labeled 1, 2, and 3 are FP cases; 4, 5, and 6 are TN cases; and 7 and 8 are TP cases.

Fig. 7. ROC curve of the proposed system in the detection of porn images.

6. Experimental results

6.1. Dataset description

To evaluate the performance of the proposed system, we used a manually collected dataset containing 5000 Web pages gathered with the Mozilla browser using the Google search engine. From this dataset, 1295 pages, including 1072 English and 223 Persian pages, were randomly selected to form a dataset that we call Statistically Distributed Data for Pornography Detection (SDDPD). This dataset consists of 700 immoral pages ranging from erotic to pornographic, which were sampled from 300 Websites in a keyword searching procedure. The porn pages include plain-text pages, gallery-like pages, and pages consisting of both text and images. The remaining 595 pages are non-pornographic, on subjects such as health care, training, scientific news, sports, art, and entertainment. In addition, some confusing pages, such as anti-AIDS pages, medical material, and sex-education topics, and some other normal pages containing partially nude body pictures or suspicious words, were included in order to evaluate the false positive rate of the system. Prior to classification, all pages were labeled manually as belonging to one of the three categories mentioned earlier. Table 4 shows the distribution of the data by page language, page type, immoral or normal status, and category.

6.2. Experiments with textual and profile features

As described in Section 5.1, the ID3 classifier used for textual and profile features was trained with 1164 pages selected from the SDDPD dataset. Next, a dataset of 290 pages was selected for testing the system performance. This dataset contains different types of English and Persian pages, belonging to different categories (0–2), and coming with or without images (the number of pages from category 0 is 125, from Cat1 43, and from Cat2 122). The confusion matrices for the classification results are shown in Tables 5 and 6. As can be seen from Table 6, the performance of ID3 in the classification of category 1 data is weak (63%), but in the other cases it is acceptable. The highest accuracy rate is for the classification of Cat2 data. The misclassification of data from Cat0 to Cat1 and vice versa is high (22.4% and 25%), which is due to the similarity between the data of the two categories, especially for borderline data such as sports or medical pages. This means that the extracted features cannot strongly discriminate between these two classes. In order to extract the most effective features, we applied a correlation analysis to the features. Nevertheless, the results show higher accuracy for the ID3 classifier compared with the Bayesian method used in our previous work [28].

Table 4
The distribution of Web pages in the experimental dataset.

Page type   Textual   Text with image   Total
Immoral     250       450               700
Normal      225       370               595
Total       475       820               1295

Page type   English   Persian   Total
Immoral     600       100       700
Normal      472       123       595
Total       1072      223       1295

Category    English   Persian   Total
Cat0        453       112       565
Cat1        163       19        182
Cat2        456       92        548
Total       1072      223       1295

Table 5
Confusion matrix for classification of train data with ID3.

Train data   Cat0           Cat1           Cat2
Cat0         84.80% (234)   12.30% (34)    2.90% (8)
Cat1         16.20% (62)    77.50% (296)   6.20% (24)
Cat2         1% (5)         8.50% (43)     90.5% (458)

Table 6
Confusion matrix for classification of test data based on textual and profile features.

Test pages   Cat0          Cat1          Cat2
Cat0         74.60% (50)   22.40% (15)   3% (2)
Cat1         25% (23)      63% (58)      12% (11)
Cat2         3% (4)        12.20% (16)   84.70% (111)

6.3. Experiments with skin color features

In the next stage, the classification of the test dataset was performed based only on skin color features. Pages containing at least two images with a skin area of 50% or more are classified as immoral. The results are reported in Table 7. As can be seen from the table, there is a relatively high false rate for the Cat0 data (permitted pages), which is due to skin color or quasi-skin color in non-porn images that causes these pages to be classified as porn pages. For the Cat2 data (immoral pages), because of purely textual pages without any pictures, a high proportion of pages (25%) are not classified correctly. However, the highest misclassification rate appears in the Cat1 data (48%), that is, the pages that are not strictly permitted or immoral and cannot be recognized based on skin color alone. The average classification rate using only the skin color feature on the 290 test pages, considering the TP and TN columns, is then calculated as 77%.

Fig. 8. Flowchart of the hierarchical classifications in the final classifier.
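The hierarchical decision procedure of Section 5.4 (flowchart of Fig. 8) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' code. It reads relation (1) literally, so a page is settled at the first stage only when the immoral probability mass p1 + p2 exceeds p0 by more than Th, and it is then assigned to category 1 or 2 by the larger of p1 and p2 (ties go to category 1, an arbitrary choice). The function name `classify_page` and the callable `is_porn_image`, which stands in for the visual-feature MLP, are hypothetical; the default Th = 0.2 is the value reported in Section 6.4.

```python
import random


def classify_page(p0, p1, p2, images, is_porn_image, th=0.2, rng=random):
    """Return the category (0, 1 or 2) assigned to one page.

    p0, p1, p2    -- category probabilities from the textual/profile stage
    images        -- list of images found on the page (any objects)
    is_porn_image -- hypothetical callable wrapping the visual classifier
    th            -- threshold Th of relation (1)
    """
    # Assumption: ties between p1 and p2 are resolved in favour of category 1.
    immoral = 1 if p1 >= p2 else 2

    # Stage 1: relation (1). A large margin means the textual result is trusted.
    if (p1 + p2) - p0 > th:
        return immoral

    # Stage 2: classify up to five randomly selected images of the page.
    sample = rng.sample(images, min(5, len(images)))
    porn_votes = sum(1 for image in sample if is_porn_image(image))
    if porn_votes > 2:
        return immoral

    # Fallback: most probable category from the textual/profile stage.
    probabilities = [p0, p1, p2]
    return probabilities.index(max(probabilities))


always_porn = lambda image: True
print(classify_page(0.1, 0.2, 0.7, [], always_porn))               # 2
print(classify_page(0.5, 0.4, 0.1, ["a", "b", "c"], always_porn))  # 1
```

In the first call the margin 0.8 exceeds Th, so the images are never inspected; in the second the margin is 0, and three porn votes out of three sampled images decide the outcome.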

Table 7
Confusion matrix for classification of test data based on the skin color feature.

             Classified to
Test pages   Immoral true (TP)   Immoral false (FP)   Permitted true (TN)   Permitted false (FN)
Cat0         0                   12.70% (16)          87.30% (109)          0
Cat1         24% (10)            18.50% (8)           28% (12)              29.50% (13)
Cat2         75% (92)            0                    0                     25% (30)

6.4. Experiments with combination of features

Finally, in order to reduce the misclassification rate, we used the visual features (explained in Section 4.1) together with the textual and profile features. The test dataset of 290 pages was used again to evaluate the performance of the combined system. The classification results and the error rate in each category are reported in Table 8. Here, the threshold Th in relation (1) is set to 0.2, the minimum image size is 50 × 50 pixels, and the threshold for the skin area is 50%. As can be seen, the results improve significantly. This improvement appears as an increase in the TP rate and a reduction in the FN rate. It is worth noting that Tables 6 and 8 show the results for classification into three distinct classes (Cat0, Cat1, Cat2); if we consider classification into only the two classes of immoral and permitted, the average classification rate is 95%. These results are shown in Tables 9 and 10 for the textual-profile features and for the combination of all features, respectively.

Table 8
Confusion matrix for classification of test data based on combination of textual, profile, and visual features.

Test pages   Cat0         Cat1         Cat2
Cat0         90% (112)    8.50% (11)   1.50% (2)
Cat1         16.50% (7)   71% (31)     12.50% (5)
Cat2         1% (1)       3% (4)       96% (117)

6.5. Time complexity analysis

For the classification of an input Web page, the most time-consuming steps are, first, extracting all words or tokens in the text and searching for them in the black keyword lists and, second, detecting the skin area in all images within the input page. For the first step, the time complexity is of the order of N × W, where N is the number of words in the page and W is the total number of black words for all three categories Cat0, Cat1, and Cat2. For the second step, considering the three-layer MLP used for classification, the time cost for each image is of the order of M × k, where M is the number of pixels in the image and k = 20 × 100, where 20 is the number of components in the skin feature vector and 100 is the approximate number of instructions on one input line in each feed-forward pass of the MLP. Therefore, the total time complexity of the system for the classification of each input page is N × W + M × K, where K = 5 × k (since at most 5 images from each page are processed). On a personal computer with a 2 GHz Pentium processor, the average time for the classification of each input page was 0.8 s.

7. Comparison with other works

It is very hard to make a practical comparison between the proposed system and other works in the literature. This is mainly due to the lack of information about other works and the lack of a standard benchmark for evaluating different systems. The experimental results reported by researchers are typically based on a limited amount of data or on specific databases that are usually not publicly available. There is no benchmark database used by all works in the field, and there is no way to verify the correctness of the reported results. Another problem is that there is no major work in the field of Persian Web page classification, which makes a valid comparison difficult.
Table 9
Classification results for test data based on textual and profile features.

             Classified to
Test pages   Immoral true (TP)   Immoral false (FP)   Permitted true (TN)   Permitted false (FN)
Immoral      89%                 0                    0                     11%
Permitted    0                   22%                  78%                   0

Table 10
Classification results for test data based on textual, profile, and visual features.

             Classified to
Test pages   Immoral true (TP)   Immoral false (FP)   Permitted true (TN)   Permitted false (FN)
Immoral      95%                 0                    0                     5%
Permitted    0                   11%                  89%                   0

Nevertheless, in order to give an illustrative comparison, we simulated one typical and relatively recent work in the field and compared the classification results on the same validation dataset. We chose the work of Hammami et al. [3]. The description of that system is given in Section 2 of this paper. According to the information given in their paper, 14 features were selected as textual-structural features, and one skin color feature (the ratio of the skin area to the whole image) was selected as the visual feature. However, the skin color detection method they applied was not clearly described. A Bayesian classifier was used for the classification of the textual-structural features. The classification results of the simulated system for the 200 pages of our prepared test dataset are shown in Table 11. As can be seen, the classification results obtained with our proposed system (Table 8) are better overall. In the classification of permitted pages (Cat0 data and part of Cat1), the accuracy of the two systems is approximately the same, but in the classification of porn pages (Cat2 data and part of Cat1), which mainly come with images, the accuracy of our system is significantly higher.

Table 11
Classification results for test data obtained from the simulated system based on the work of Hammami et al.

Test pages   Cat0         Cat1         Cat2
Cat0         82.5% (71)   10.5% (9)    7% (5)
Cat1         21% (9)      70% (21)     9% (3)
Cat2         4.5% (4)     5.5% (13)    80% (65)

8. Conclusion

In this paper, we have presented a Web page filtering system based on a combination of textual, profile, and visual features. The model employs a hierarchical set of classifiers: an ID3 classifier for textual and profile features, and neural network models for skin color and visual features. A classification rate of 95% was obtained when the system was applied to a dataset of 1295 Web pages. Compared with other works in the field, the proposed model is advantageous in that it uses more effective contextual features and combines them with two sorts of robust visual features. There are still many issues to be considered in future work, including the ability of incremental learning for the classifiers, how to prepare a representative dataset of Web pages covering all porn-page characteristics, and how to reduce the FP rate in the classification of pages with various types of subjects and images from different categories.
Acknowledgements

This research was supported by the Iranian Telecommunication Research Center (ITRC). The authors would like to thank ITRC for its kind help and support.

References

[1] L. Rokach, O. Maimon, Data mining with decision trees: theory and applications, World Scientific (2008) 71–72.
[2] http://www.consumersearch.com/parental-control-softwar.
[3] M. Hammami, et al., WebGuard: a web filtering engine combining textual, structural, and visual content-based analysis, IEEE Transactions on Knowledge and Data Engineering 18 (February (2)) (2006).
[4] M. Hammami, et al., Adult content web filtering and face detection using data-mining based skin-color model, in: Proceedings of the International Conference on Multimedia and Expo (ICME’04), 2004.
[5] Filtering Concepts and Applications, http://www.knowclub.com/paper/?p=322 (in Persian).
[6] Family Online Safety Institute, FOSI, http://www.fosi.org.
[7] A. Bosson, et al., Non-retrieval: blocking pornographic images, in: Proceedings of the International Conference on the Challenge of Image and Video Retrieval, vol. 2383, Lecture Notes in Computer Science, Springer-Verlag, 2002.
[8] Girgis, et al., An approach to image extraction and accurate skin detection from web pages, International Journal of Computer Science and Engineering 1 (2) (2007) (Spring).
[9] Z. Chen, et al., A novel web page filtering system by combining texts and images, in: Proceedings of the IEEE Conference on Web Intelligence, 2006.
[10] Lee, et al., Neural networks for web content filtering, in: IEEE Intelligent Systems, 2002, ISBN: 1094-7167.
[11] Shen, et al., The filtering of internet images based on detecting erotogenic-part, in: Proceedings of the Third IEEE Conference on Natural Computation, 2007.
[12] Wang, et al., Classifying objectionable websites based on image content, in: Proceedings of the Fifth International Workshop on Interactive Distributed Multimedia Systems and Telecommunication Services, 1998.
[13] B. Stayrynkevitch, et al., Poesia software architecture definition document, in: Technical Report, Poesia Consortium, December, 2002.
[14] W. Ho, P. Watters, Statistical and structural approaches to filtering internet pornography, in: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 2004, pp. 4792–4798.
[15] M. Jones, J. Rehg, Statistical colour models with application to skin colour detection, Technical Report CRL, vol. 11, Compaq Cambridge Research Lab, 1998.
[16] D.A. Forsyth, M. Fleck, Automatic detection of human nudes, International Journal of Computer Vision 32 (1) (1999) 63–77.
[17] H. Zheng, M. Daoudi, B. Jedynak, Blocking adult images based on statistical skin detection, Electronic Letters on Computer Vision and Image Analysis 4 (2) (2004) 1–14.
[18] Y. Xu, B. Li, X. Xue, H. Lu, Region-based pornographic image detection, in: Proceedings of the IEEE Seventh Workshop on Multimedia Signal Processing (MMSP’05), Shanghai, China, 2005.
[19] A.A. Abin, M. Fotouhi, S. Kasaei, Skin segmentation based on cellular learning automata, in: Proceedings of the Advances in Mobile Computing and Multimedia (MoMM), Linz, Austria, November, 2008, pp. 254–259.
[20] J.L. Shih, C.H. Lee, C.S. Yang, An Adult Image Identification System Employing Image Retrieval Technique, 2007.
[21] R. Ap-apid, An algorithm for nudity detection, in: Proceedings of the British Machine Vision Conference, vol. 2, 2001, pp. 491–500.
[22] M.K. Hu, Visual pattern recognition by moment invariants, in: Proceedings of the IRE Transactions on Information Theory, IT-8, 1962, pp. 179–187.
[23] Internet Pornography: Are Children at Risk?, http://www.nap.edu/catalog/10261.html?onpi-Webextra-050202.
[24] P.Y. Lee, S.C. Hui, A.C.M. Fong, An intelligent categorization engine for bilingual web content filtering, IEEE Transactions on Multimedia 7 (6) (2005) 1183–1190.
[25] W.A. Arentz, B. Olstad, Classifying offensive sites based on image content, Computer Vision and Image Understanding 94 (1–3) (2004) 295–310.
[26] B. Liu, J. Su, Z. Lu, Z. Li, Pornographic images detection based on CBIR and skin analysis, in: Proceedings of the Fourth International Conference on Semantics, Knowledge and Grid, 2008, pp. 487–488.
[27] W. Hu, et al., Recognition of pornographic web pages by classifying texts and images, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (June (6)) (2007) 1019–1034.
[28] A. Ahmadi, M. Zamanian, M.M. Takami, Adult web page filtering using textual, structural, and visual features, Journal of Iran Telecommunication Research Center (IJICT) (February) (2010) (in Persian).

Ali Ahmadi received his B.Sc. in Electrical Engineering from Amirkabir University, Tehran, Iran in 1991 and his M.Sc. and Ph.D. in Artificial Intelligence and Soft Computing from Osaka Prefecture University, Japan in 2001 and 2004, respectively. He worked as a researcher in the Research Center for Nanodevices and Systems at Hiroshima University, Japan during 2004–2007. He has been with Khajeh-Nasir University of Technology, Tehran, Iran as an assistant professor since 2007. His research interests include intelligent models, machine vision, hardware-software co-design, and learning algorithms.

Mehran Fotouhi was born in Isfahan, Iran, in 1981. He received the B.S. degree in Computer Engineering from Isfahan University of Technology, Iran in 2005, and the M.S. degree in Computer Architecture from Sharif University of Technology, Tehran, in 2009. He currently works on content-based Web page classification at the Iran Telecommunication Research Center (ITRC).
