Application of a staged learning-based resource allocation network to automatic text categorization


Wei Song a,b,*, Peng Chen a,b, Soon Cheol Park c

a School of Internet of Things Engineering, Jiangnan University, Wuxi 214122, China
b Engineering Research Center of Internet of Things Applied Technology, Ministry of Education, China
c Department of Electronics and Information Engineering, Chonbuk National University, Jeonju, Jeonbuk 561756, Republic of Korea
* Corresponding author at: School of IOT Engineering, Jiangnan University, Lihu Avenue, Wuxi, Jiangsu Province 214122, China. E-mail address: [email protected] (W. Song).

Article history: Received 11 December 2013; Received in revised form 4 April 2014; Accepted 10 July 2014; Available online 18 July 2014. Communicated by Y. Chang.

Abstract

In this paper, we propose a novel learning classifier which utilizes a staged learning-based resource allocation network (SLRAN) for text categorization. In the light of its learning progress, SLRAN is divided into a preliminary learning phase and a refined learning phase. In the former phase, to reduce the sensitivity to the input data, an agglomerate hierarchical k-means method is utilized to create the initial structure of the hidden layer. Subsequently, a novelty criterion is put forward to dynamically regulate the hidden layer centers. In the latter phase, a least square method is used to enhance the convergence rate of the network and further improve its ability for classification. Such a staged learning-based approach builds a compact structure which decreases the computational complexity of the network and boosts its learning capability. In order to apply SLRAN to text categorization, we utilize a semantic similarity approach which reduces the input scale of the neural network and reveals the latent semantics between text features. The benchmark Reuter and 20-newsgroup datasets are tested in our experiments, and the extensive experimental results reveal that the dynamic learning process of SLRAN improves its classification performance in comparison with conventional classifiers, e.g. RAN, BP, RBF neural networks and SVM.

Keywords: Resource allocation network; Neural network; Staged learning algorithm; Text categorization; Novelty criteria

1. Introduction

With the rapid development of Internet technology, the quantity of online documents and information is growing exponentially. The demand for rapidly and accurately finding useful information in such large collections has become a challenge for modern information retrieval (IR) technologies. Text categorization (TC) is a crucial and well-proven instrument for organizing large volumes of textual information. As a key technique in the IR field, TC has been extensively researched in recent decades. Meanwhile, TC has become a hot spot and has given rise to a series of related applications, including web classification, query recommendation, spam filtering and topic spotting.

In recent years, an increasing number of approaches based on intelligent agents and machine learning, e.g. support vector machines (SVM) [1], decision trees [2,3], K-nearest neighbor (KNN) [4,5], Bayes models [6–8] and neural networks [9,10], have been applied to text categorization. Although such methods have been extensively researched, the present automated text classifiers still have shortcomings and their effectiveness needs improvement. Thus, text categorization remains a major research field. Since the artificial neural network is still one of the most powerful tools in the field of pattern recognition [11], we employ it as a classifier. As a basic supervised network, the back propagation (BP) neural network suffers from a slow training rate and a high tendency to become trapped in local minima. On the contrary, the relatively simple mechanism of the radial basis function (RBF) neural network [12–14] avoids slow learning and displays a robust global approximation property. It is known that the key to building a successful RBF neural network is to ensure a proper number of units in its hidden layer [15]. More specifically, a lack of hidden layer nodes has a negative influence on the network's decision-making ability, whereas redundant hidden layer nodes bring about a high computational cost [16–18]. That is to say, too small a network architecture may cause under-fitting, while too large an architecture may lead to over-fitting of the data [19,20]. Although more and more learning methods have been studied to regulate the hidden nodes so as to obtain a suitable structure for the RBF network, the most remarkable approach is the resource allocation network (RAN) learning method put forward by Platt [21].


Platt made a significant contribution through the development of an algorithm which regulates the hidden nodes according to so-called novelty criteria. In other words, RAN can dynamically manipulate the number of hidden layer units by judging the novelty criteria. However, the novelty criteria are sensitive to the initialized data, which easily increases the training time of the network and reduces its practical effectiveness [22]. Meanwhile, the least mean-square (LMS) algorithm that RAN applies to update its learning parameters usually causes the network to suffer from a low convergence rate [23,24].

To tackle these problems, in this paper we propose a staged learning-based resource allocation network (SLRAN) which divides its learning process into a preliminary learning phase and a refined learning phase. In the former phase, to reduce the sensitivity to the initialized data, an agglomerate hierarchical k-means method is utilized to construct the structure of the hidden layer. Subsequently, a novelty criterion is put forward to dynamically add or prune hidden layer centers, and a compact structure is created. That is, the former phase reduces the complexity of the network and builds the initial structure of RAN. In the latter phase, a least square method is used to enhance the convergence rate of the network and further refine its learning ability. Therefore, SLRAN builds a compact structure which decreases the computational complexity of the network and boosts its learning ability for classification.

The rest of this paper is organized as follows. Section 2 introduces the basic concepts of RAN. Section 3 proposes the SLRAN algorithm as an efficient text classifier and describes its details. The steps to generate the latent semantic features of text documents, which help enhance the text categorization performance, are depicted in Section 4. Experimental results and analysis are presented in Section 5. Conclusions are discussed in Section 6.

2. Resource allocation network (RAN)

RAN is a promising sequential learning algorithm based on the RBF neural network. The architecture of RAN includes three layers, i.e. an input layer, an output layer and a single hidden layer. The topology of RAN is shown in Fig. 1. During the training process of RAN, a sample of an n-dimensional input vector is given to the input layer, and based on the assigned input pattern, RAN computes an m-dimensional output vector in the output layer. That is to say, the aim of the RAN network is to learn a mapping from the n-dimensional input space to the m-dimensional output space. Eventually, the network calculates output vectors that match the desired output vectors. In the structure of RAN, the input layer, the hidden layer and the output layer are x = (x_1, x_2, ..., x_n), c = (c_1, c_2, ..., c_h) and y = (y_1, y_2, ..., y_m) respectively, and b = (b_1, b_2, ..., b_m) is the offset of the output layer, where n, h and m are the respective numbers of units in these three layers. The units of the hidden layer use a Gaussian function as the activation function, which implements a locally tuned unit.

Fig. 1. The three-layer topology of the RAN neural network.

The Gaussian function of the hidden layer is defined as

$$\Phi_i(x) = \exp\!\left(-\frac{\lVert x - c_i \rVert^{2}}{\sigma_i^{2}}\right) \tag{1}$$

where c_i and σ_i are the center of the ith hidden node and the width of this center, respectively. The outputs of the hidden layer nodes are linearly weighted to form the output layer, and the function of the output layer is given by

$$f_j(x) = \sum_{i=1}^{h} w_{ij}\,\Phi_i(x) + b_j, \qquad j = 1, 2, \ldots, m \tag{2}$$

where m and h are the respective numbers of nodes in the output layer and the hidden layer, x is an input sample, and w_ij is the connection weight between the ith hidden layer node and the jth output layer node. At the beginning of the training stage there is no hidden neuron in the network. RAN initializes the parameters of the neural network in terms of the first input training sample. Subsequently, if a training sample satisfies the novelty criteria, it is added into the network as a new hidden layer center; otherwise LMS is used to update the parameters of the current network, including the hidden layer centers and the connection weights between the hidden layer and the output layer. However, the novelty criteria of RAN are sensitive to the initialized data, which increases the training time. Meanwhile, LMS usually makes the network suffer from a low convergence rate.
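To make Eqs. (1) and (2) concrete, the forward pass of such a network can be sketched in a few lines. This is an illustrative NumPy snippet written for this presentation (the paper's own experiments were implemented in Java); the function name and array shapes are our own assumptions.

```python
import numpy as np

def rbf_forward(x, centers, widths, weights, bias):
    """Forward pass of an RBF/RAN network following Eqs. (1)-(2).

    x: (n,) input vector; centers: (h, n); widths: (h,) sigma_i;
    weights: (h, m) w_ij; bias: (m,) b_j.  Returns the (m,) outputs f_j(x).
    """
    sq_dist = np.sum((centers - x) ** 2, axis=1)   # ||x - c_i||^2 for every hidden unit
    phi = np.exp(-sq_dist / widths ** 2)           # Eq. (1): Gaussian activations
    return phi @ weights + bias                    # Eq. (2): weighted sum plus offset

# toy usage: 3 hidden units, a 4-dimensional input and 2 output categories
rng = np.random.default_rng(0)
print(rbf_forward(rng.normal(size=4), rng.normal(size=(3, 4)),
                  np.ones(3), rng.normal(size=(3, 2)), np.zeros(2)))
```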

3. Staged learning-based resource allocation network (SLRAN)

To handle the above-mentioned problems of RAN, in this paper we propose a staged learning-based resource allocation network (SLRAN) which divides the learning process into two phases. In order to reduce its sensitivity to the initialized data, in the preliminary learning phase of SLRAN an agglomerate hierarchical k-means method is utilized to construct the structure of the hidden layer. Subsequently, a novelty criterion is put forward to dynamically add or prune hidden layer centers. That is, this phase reduces the complexity of the network and builds a compact structure of RAN.

3.1. Determination of the initial hidden layer centers

We apply an agglomerate hierarchical k-means algorithm to initialize the structure of the hidden layer, i.e. the centers of the hidden layer and the widths of the clusters for each center. For a given initial document dataset D, after the clustering process the generated cluster centers are defined as C = (c_1, c_2, c_3, ..., c_k), where k is the number of clusters. The k-means algorithm helps obtain the hidden layer centers c_i and the cluster widths σ_i. The algorithm proceeds as follows:

Step 1: Perform random sampling m times, which ensures that the data are not distorted after sampling and still reflect their natural distribution. The primitive dataset is thus divided into m parts, and the size of each part is n/m, where n is the total number of texts. That is, we get m subsets that can be expressed as S = (s_1, s_2, ..., s_m). Clustering analysis is executed for every subset s_i using the k-means algorithm. In this way, we obtain a group of k′ (k′ > k) clustering centers for each subset, where k is the predefined number of categories. We take this step to guarantee, as far as possible, that the generated centers represent all clusters and to avoid initializing objects around isolated points. In our method k′ is empirically defined as 2 × k. Thus, an appropriate k′ can ensure a distribution of centers in each sample that is as uniform as possible and leads to a good sampling performance. In comparison with the effect of k′, m is a secondary coefficient; we set it several times empirically and select the proper m according to the sampling results. Note that we still have m subsets; that is, we obtain m × k′ cluster centers after this step.

Step 2: As a bottom-up hierarchical clustering strategy, the average-linkage agglomerative algorithm is used to cluster the newly generated m × k′ clustering centers. In this algorithm, the two most similar clusters are merged to build a new cluster until only k clusters are left.

Step 3: The k cluster centers acquired from the previous step are selected as the hidden layer centers of the SLRAN learning algorithm. At the same time, we calculate the center width of the ith neuron by

$$\sigma_i = \frac{1}{N_i}\sum_{x \in c_i}(x - c_i)^{T}(x - c_i) \tag{3}$$

where N_i is the number of samples contained in the cluster of center c_i. In other words, the center width σ_i stands for the mean squared distance between the center c_i and the training samples assigned to this cluster.
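The initialization procedure of Section 3.1 can be summarized by the sketch below, which leans on scikit-learn for the two clustering stages. Taking the mean of each merged group as its final center, the default subset split, and the helper name init_hidden_layer are our own reading of Steps 1–3 rather than code from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def init_hidden_layer(X, k, m=5, seed=0):
    """Return k initial hidden centers and their widths (Eq. (3)).
    X: (n_docs, dim) document vectors; k: number of categories; m: number of subsets."""
    rng = np.random.default_rng(seed)
    k_prime = 2 * k                                   # k' is set empirically to 2*k

    # Step 1: run k-means with k' clusters on each of the m random subsets
    candidates = []
    for part in np.array_split(rng.permutation(len(X)), m):
        km = KMeans(n_clusters=k_prime, n_init=10, random_state=seed).fit(X[part])
        candidates.append(km.cluster_centers_)
    candidates = np.vstack(candidates)                # m * k' candidate centers

    # Step 2: average-linkage agglomerative clustering merges them down to k groups
    labels = AgglomerativeClustering(n_clusters=k, linkage="average").fit_predict(candidates)
    centers = np.vstack([candidates[labels == j].mean(axis=0) for j in range(k)])

    # Step 3: Eq. (3) - width = mean squared distance of the samples assigned to each center
    assign = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    widths = np.array([((X[assign == j] - centers[j]) ** 2).sum(axis=1).mean()
                       if np.any(assign == j) else 1.0
                       for j in range(k)])
    return centers, widths
```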

3.2. Novelty criteria

In the preliminary learning phase, a novelty criterion is put forward to dynamically regulate the hidden layer centers. The RAN learning algorithm first augments the number of hidden layer units through its novelty criteria. We present criteria that consider the features of both the input space and the output space, given by

$$e_i = \lVert y_i - f(x_i) \rVert > \varepsilon \tag{4}$$

$$d_i = \min_{1 \le j \le h} \lVert x_i - c_j \rVert > \delta_i \tag{5}$$

where ε is a predefined error precision set empirically; it is set to 0.075 in our system for text categorization. h is the number of hidden units in the current network when the ith example x_i is input, y_i is the desired output, and f(x_i) is the corresponding real output. d_i is the Euclidean distance between the input vector x_i and the nearest hidden layer center. δ_i = max{exp(iγ)·δ_max, δ_min}, where δ_max and δ_min represent the maximum and minimum Euclidean distances of the entire input space, respectively, and γ is a decay coefficient whose value is less than zero. In the light of their characteristics, (4) and (5) are named the error criterion and the distance criterion, respectively. Different from other criteria that use only one training loop, we use multiple cycles to ensure that the network obtains sufficient hidden layer units. If an input pattern x_k and the corresponding real output are far from the nearest center c_i and the desired output, respectively, such a pattern x_k is regarded as a new hidden layer center in order to reduce the output error. That is, x_k is added as a new hidden layer center if it satisfies both (4) and (5). Subsequently, the hidden layer center c_new, the hidden layer center width σ_new and the connection weights w_newj of this newly added hidden unit are set as follows:

$$c_{new} = x_k \tag{6}$$

$$\sigma_{new} = \alpha \lVert x_k - c_{nearest} \rVert \tag{7}$$

$$w_{newj} = y_j - f_j(x_k), \qquad j = 1, 2, \ldots, m \tag{8}$$

where c_nearest is the nearest hidden layer center to x_k, α is a correlation coefficient between 0 and 1, and m is the number of output layer neurons. Otherwise, if x_k does not satisfy both (4) and (5), it is assigned to the existing cluster C_i whose center c_i is closest to x_k, and the hidden layer is updated by

$$c_{ij} = c_{ij} + \Delta c_{ij} \tag{9}$$

$$\sigma_i = \sigma_i + \Delta\sigma_i \tag{10}$$

$$\Delta c_{ij} = 2\beta_i \eta\, \frac{x_{kj} - c_{ij}}{\sigma_i^{2}}\,\Phi_i(x_k)\sum_{s=1}^{m} w_{is}\left(f_s(x_k) - y_s\right) \tag{11}$$

$$\Delta\sigma_i = 2\beta_i \eta\, \frac{\lVert x_k - c_i \rVert^{2}}{\sigma_i^{3}}\,\Phi_i(x_k)\sum_{s=1}^{m} w_{is}\left(f_s(x_k) - y_s\right) \tag{12}$$

where Φ_i(x_k) is the Gaussian activation function of the ith hidden layer neuron given by (1), c_ij is the jth dimension of the hidden layer center c_i, x_kj is the jth component of the input pattern x_k, w_is is the connection weight from the ith hidden layer neuron to the sth output layer neuron, and η is the learning rate. We use β_i to represent the inverse similarity between c_i and x_k, and β_i is given by

$$\beta_i = \frac{\lVert x_k - c_i \rVert}{\lVert c_{farthest} - c_i \rVert} \tag{13}$$

where c_farthest is the center farthest from the input pattern x_k. Meanwhile, a gradient descent method is employed to adjust the offset b_j and the weight w_ij from the hidden layer neurons to the output layer neurons, given by

$$b_j = b_j + \eta\left(y_j - f_j(x_k)\right) \tag{14}$$

$$w_{ij} = w_{ij} + \eta\left(y_j - f_j(x_k)\right)\Phi_i(x_k) \tag{15}$$
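A hedged sketch of how Eqs. (4)–(15) act on a single training pair is given below. It simplifies the decaying distance threshold δ_i to a fixed value and uses plain NumPy; the dictionary layout of the network is our own, while the defaults ε = 0.075, α = 0.8 and η = 0.001 follow the values stated in the paper and Table 1.

```python
import numpy as np

def grow_step(x, y, net, eps=0.075, delta=0.5, alpha=0.8, eta=0.001):
    """One preliminary-phase update for the pair (x, y).
    net holds 'C' (h, n) centers, 'S' (h,) widths, 'W' (h, m) weights, 'b' (m,) offsets."""
    C, S, W, b = net['C'], net['S'], net['W'], net['b']
    phi = np.exp(-np.sum((C - x) ** 2, axis=1) / S ** 2)     # Eq. (1)
    f = phi @ W + b                                          # Eq. (2)
    err = np.linalg.norm(y - f)                              # Eq. (4)
    dists = np.linalg.norm(C - x, axis=1)
    i = int(np.argmin(dists))                                # nearest center

    if err > eps and dists[i] > delta:                       # novelty criteria (4) and (5)
        net['C'] = np.vstack([C, x])                         # Eq. (6)
        net['S'] = np.append(S, alpha * dists[i])            # Eq. (7)
        net['W'] = np.vstack([W, y - f])                     # Eq. (8)
    else:
        far = int(np.argmax(dists))                          # center farthest from x
        beta = dists[i] / (np.linalg.norm(C[far] - C[i]) + 1e-12)       # Eq. (13)
        g = phi[i] * (W[i] @ (f - y))                        # common factor of (11)-(12)
        C[i] += 2 * beta * eta * (x - C[i]) / S[i] ** 2 * g  # Eqs. (9), (11)
        S[i] += 2 * beta * eta * dists[i] ** 2 / S[i] ** 3 * g   # Eqs. (10), (12)
        net['b'] = b + eta * (y - f)                         # Eq. (14)
        net['W'] = W + eta * np.outer(phi, y - f)            # Eq. (15)
    return net
```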

After the growth of hidden layer neurons, we step into the pruning stage. If a hidden layer neuron was active previously but its significance gradually decreases after new samples are input, such a neuron with little effect should be pruned. That is, if for all input samples the outputs and the connection weights of a hidden layer neuron are relatively small, we remove this hidden layer neuron. Moreover, redundant hidden layer neurons may result in over-fitting. From the above analysis, a pruning strategy is given as follows. For each training pair {(x(i), y(i)), i = 1, 2, ..., n}, we first calculate the hidden outputs by (1) and then normalize the relative output of the kth hidden neuron by

$$R_k = \frac{\Phi_k(x_i)}{\Phi_{max}(x_i)} < \xi \tag{16}$$

where Φ_k(x_i) is the output of the kth hidden layer neuron for the ith input pattern, Φ_max(x_i) is the maximum hidden output for the ith input pattern, and ξ is a predefined threshold. Subsequently, we normalize the relative connection weight between the hidden layer neurons and the output layer neurons by

$$\mu_k = \frac{w_{k,j}}{w_{k,max}} < \theta \tag{17}$$

where w_k,j is the connection weight between the kth hidden layer neuron and the jth output layer neuron, w_k,max is the maximum connection weight of the kth hidden layer neuron, and θ is a predefined threshold. Thus, for the kth hidden neuron, if more than P input samples meet both (16) and (17), this neuron is regarded as less important than the other effective hidden neurons and is removed. By utilizing the pruning strategy we can effectively delete these undesirable neurons, and the complexity of the network structure is reduced. The training structure of the preliminary learning phase of SLRAN is shown in Fig. 2. Generally speaking, the preliminary learning phase dynamically adjusts the hidden neurons of SLRAN to maintain a compact training structure, which decreases the computational complexity of the network and boosts its learning capability; in the subsequent phase, SLRAN uses the least square method to refine the weights of the network.
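The pruning rule of Eqs. (16) and (17) can be sketched as follows. The paper does not spell out how the weight criterion is aggregated over the output units or the exact value of P, so the mean normalized weight and the default P = N/2 below are our own assumptions.

```python
import numpy as np

def prune_hidden_units(C, S, W, X, xi=0.085, theta=0.085, P=None):
    """Drop hidden units whose normalized output (Eq. (16)) is below xi for more than
    P samples and whose normalized weights (Eq. (17)) are below theta."""
    if P is None:
        P = len(X) // 2                                  # assumption: half of the samples
    # hidden outputs for every sample, Eq. (1): shape (N, h)
    phi = np.exp(-((X[:, None, :] - C[None, :, :]) ** 2).sum(-1) / S ** 2)
    R = phi / (phi.max(axis=1, keepdims=True) + 1e-12)   # Eq. (16) ratio per sample
    mu = np.abs(W) / (np.abs(W).max(axis=1, keepdims=True) + 1e-12)   # Eq. (17) ratios
    weight_low = mu.mean(axis=1) < theta                 # aggregation over outputs is our choice
    low_counts = (R < xi).sum(axis=0)                    # samples with a weak response per unit
    keep = ~((low_counts > P) & weight_low)
    return C[keep], S[keep], W[keep]
```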


Fig. 2. The training structure for the preliminary learning phase of SLRAN.

3.3. Refine weights


In the refined learning phase, SLRAN uses the least square method to update the weights of the network. Assume that N is the number of training samples and H is the eventual number of hidden layer centers. The output matrix P of the hidden layer is given by

$$P = [P_1, P_2, P_3, \ldots, P_i, \ldots, P_H] \tag{18}$$

$$P_i = [P_{i1}, P_{i2}, P_{i3}, \ldots, P_{is}, \ldots, P_{iN}]^{T} \tag{19}$$

$$P_{is} = \Phi_i(x_s) = \exp\!\left(-\frac{\lVert x_s - c_i \rVert^{2}}{\sigma_i^{2}}\right) \tag{20}$$

where Φ_i(x_s) is the Gaussian activation function of the hidden layer. Subsequently, the connection weights from the hidden layer to the output layer are calculated by

$$W = (P^{T} P)^{-1} P^{T} Y \tag{21}$$

where Y is the desired output of the whole network. In view of the above discussion, the flow chart of SLRAN is shown in Fig. 3.

Step 1: Cluster the samples by the agglomerate hierarchical k-means algorithm. We obtain the initial hidden layer centers and the widths of the centers in this step.

Step 2: Calculate the output of the hidden layer neurons with respect to the whole input training dataset using (1).

Step 3: Increase the number of hidden layer neurons by judging both criteria (4) and (5). If an input pattern satisfies these two novelty conditions, it is regarded as a new hidden neuron, and the newly generated hidden layer center, center width and connection weights are calculated by (6)–(8), respectively. Otherwise, the corresponding center and center width are adjusted by the gradient descent method using (9)–(12), and the offset and weights are updated using (14) and (15), respectively. If no hidden layer neuron has been added to the network for n = 5 consecutive epochs, the algorithm jumps to Step 4; otherwise it goes back to Step 2.

Fig. 3. The flow chart of SLRAN.

Step 4: Calculate and normalize the output of the hidden layer and its weights. SLRAN prunes a hidden layer center if it satisfies both (16) and (17).

Step 5: After the preliminary learning phase, SLRAN has achieved a compact network structure with low network complexity. In the refined learning phase, SLRAN utilizes the least square method to determine the connection weights of the network. That is, this step helps SLRAN refine its learning ability and improve its classification performance.


In summary, the first four steps correspond to the preliminary learning phase, and Step 5 corresponds to the refined learning phase. Specifically, the training examples are first input and stored in memory. In the preliminary learning phase, the training examples are used repeatedly over several epochs to update the hidden layer. In the refined learning phase, the least square method is applied once to improve the learning ability of the network; accordingly, the training examples are used only once in this stage to refine the connection weights of the network.
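Once the hidden layer is frozen, the refined learning phase of Step 5 reduces to the closed-form solve of Eq. (21). A minimal sketch is given below; appending a column of ones so that the output offsets are refined together with the weights is our own convenience, not something the paper specifies.

```python
import numpy as np

def refine_weights(C, S, X, Y):
    """Refined learning phase, Eqs. (18)-(21).
    C: (H, n) centers, S: (H,) widths, X: (N, n) inputs, Y: (N, m) desired outputs."""
    # Eqs. (18)-(20): hidden-layer output matrix, here arranged as (N, H)
    P = np.exp(-((X[:, None, :] - C[None, :, :]) ** 2).sum(-1) / S ** 2)
    P1 = np.hstack([P, np.ones((len(X), 1))])      # extra column absorbs the offsets b
    # Eq. (21): W = (P^T P)^(-1) P^T Y, computed via a numerically safer least-squares solve
    Wb, *_ = np.linalg.lstsq(P1, Y, rcond=None)
    return Wb[:-1], Wb[-1]                         # (H, m) weights and (m,) offsets
```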

4. Semantic feature selection

4.1. Vector space model (VSM)

VSM is a commonly used method that represents documents through the weights of the words they comprise. The model is based on the idea that the meaning of a document can be conveyed by its words, and the weight of each feature, which represents the contribution of every word, is evaluated by a statistical rule [25–27]. It is implemented by creating a term-document matrix that represents the whole dataset. In order to create the set of initial feature vectors used for representing the training documents, each document is transformed into a feature vector. Suppose a document D_i is comprised of n terms; the feature vector for D_i can be represented by

$$D_i = \{w_{i1}, w_{i2}, \ldots, w_{ik}, \ldots, w_{in}\} \tag{22}$$

where w_ik is the term weight of the kth indexing term in the ith text document D_i. Although VSM can reflect the direct relations in the original dataset, it still has some weaknesses. Because every unique word represents one dimension in the feature space, a long vector is needed to represent these high dimensions. Moreover, because the same concept can be represented by numerous different words and a word may have different meanings, the ambiguity of words prevents their semantic similarities from being identified. In this paper, we use a latent semantic feature space model to deal with this problem.

4.2. Latent semantic feature space (LSFS)

In this study, the latent semantic feature space is used to calculate the semantic similarity between words. Singular value decomposition (SVD) is a well-developed mathematical method for reducing the dimensionality of large datasets and extracting the dominant features of the data [28–30], so we apply the SVD technique to implement the latent semantic feature space in our system. The training dataset is first expressed as a document-term matrix D(n × m), where n is the number of documents and m is the number of terms. The transpose of matrix D is then the term-document matrix A(m × n):

$$A = D^{T} \tag{23}$$

Once the matrix A is created, SVD is employed to decompose it so as to construct a semantic vector space which can represent conceptual term-document associations. The singular value decomposition of A is given by

$$A = U\Sigma V^{T} \tag{24}$$

where U and V are the matrices of term vectors and document vectors respectively, and Σ = diag(σ_1, σ_2, ..., σ_n) is the diagonal matrix of singular values. For the sake of reducing the dimensionality, we select the k largest singular values to take the place of the original term-document matrix A(m × n). The approximation of matrix A by the rank-k matrix A_k is then given by

$$A_k = U_k \Sigma_k V_k^{T} \tag{25}$$

where U_k is made up of the first k columns of U, Σ_k = diag(σ_1, σ_2, ..., σ_k) is the diagonal matrix formed by the first k singular values, and V_k^T is comprised of the first k rows of V^T. The matrix A_k captures most of the important underlying structure in the association of terms and documents, while ignoring noise due to word choice. Subsequently, we use the reduced matrix U_k and the original document vector d to obtain the target vector of each training and testing sample, which is given by

$$\hat{d} = d\,U_k \tag{26}$$

where d is an original (1 × m) feature vector of a document, and U_k is the (m × k) matrix generated by (25). Once all samples are processed in this way, the newly generated document vectors are used as input for the SLRAN classifier, which determines the appropriate categories the test samples should be assigned to.
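The LSFS construction of Eqs. (23)–(26) amounts to a truncated SVD followed by a projection. The sketch below uses NumPy rather than the OpenCV routine used in the experiments; the function name and the returned pair are our own.

```python
import numpy as np

def lsfs_project(D, k):
    """Project VSM document vectors into a k-dimensional latent semantic space.
    D: (n_docs, n_terms) document-term matrix.  Returns (projected docs, U_k)."""
    A = D.T                                              # Eq. (23): term-document matrix
    U, s, Vt = np.linalg.svd(A, full_matrices=False)     # Eq. (24): A = U Sigma V^T
    U_k = U[:, :k]                                       # Eq. (25): keep the k largest singular triplets
    return D @ U_k, U_k                                  # Eq. (26): d_hat = d U_k

# a new test document d_test (1 x n_terms) is folded in the same way: d_test @ U_k
```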

5. Experimental results and analysis

5.1. Datasets

In order to measure the effectiveness of our approach, we run our experiments on two standard text corpora, i.e. the Reuter-21578 corpus and the 20-newsgroup corpus. From the former dataset we choose 1500 documents covering ten categories, i.e. acq, coffee, crude, earn, grain, money-fix, trade, interest, ship and sugar. From the latter dataset 1200 documents are selected from ten categories, i.e. Alt.atheism, Comp.windows.x, Sci.crypt, Rec.motorcycles, Rec.sporthockey, Misc.forsale, Talk.politics.guns, Talk.politics.mideast, Sci.space and Sci.med. For each of these two datasets, we divide the documents into three parts: two parts are used for training while the remaining part is used for testing. Subsequently, we preprocess the documents by removing stop words, stemming and calculating the weight of each feature. After preprocessing, we obtain 7856 indexing terms in total from the first dataset and 13,642 indexing terms from the second dataset, so that documents are respectively expressed as

$$d_j = \left(w_{j,1}, w_{j,2}, \ldots, w_{j,i}, \ldots, w_{j,7856}\right) \tag{27}$$

$$d'_j = \left(w_{j,1}, w_{j,2}, \ldots, w_{j,i}, \ldots, w_{j,13642}\right) \tag{28}$$

where w_j,i is the word weight of the ith feature word in the jth document. Here, we use the Okapi rule [31] to calculate the weight of each feature. The Okapi rule is given by

$$w_{ij} = \frac{tf_{ij} \times idf_j}{tf_{ij} + 0.5 + 1.5 \times dl/avgdl} \tag{29}$$

$$idf_j = \log\frac{N}{n} \tag{30}$$

where N is the total number of documents in each dataset and n is the number of documents in which the ith feature word is contained. tf_ij represents the occurrence frequency of the ith feature term in document j, dl is the length of the jth document sample, and avgdl is the average length over all documents. Instead of the ordinary tf × idf method, we adopt the Okapi rule in this study because it empirically normalizes the length of document samples. In the research field of neural networks, it is known that the main challenge in their application to text categorization is the high dimensionality of the input feature space. That is, the input size of the neural network depends on the number of selected features. Considering that an overloaded input may overwhelm the neural network, while a lack of features may cause a poor representation, we sort the features by the sum of their weights and choose the first 1000 terms for the Reuter dataset and the first 1200 features for the 20-newsgroup dataset.


Thus, each document in these two datasets is respectively represented by

$$d_j = \left(w_{j,1}, w_{j,2}, \ldots, w_{j,i}, \ldots, w_{j,1000}\right) \tag{31}$$

$$d'_j = \left(w_{j,1}, w_{j,2}, \ldots, w_{j,i}, \ldots, w_{j,1200}\right) \tag{32}$$

So far, we have obtained the corresponding document vectors for these two datasets. Subsequently, the latent semantic feature space approach is employed to further represent each document and reduce the number of dimensions via (26).
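As an illustration of the Okapi weighting of Eqs. (29) and (30), the following sketch computes the weights for one tokenized document against a toy collection. The data structures are ours, and the snippet ignores stop-word removal and stemming.

```python
import math
from collections import Counter

def okapi_weights(doc_tokens, collection):
    """Okapi term weights for one document, Eqs. (29)-(30).
    doc_tokens: list of terms; collection: list of token lists (the whole dataset)."""
    N = len(collection)                                   # total number of documents
    avgdl = sum(len(d) for d in collection) / N           # average document length
    dl = len(doc_tokens)
    df = Counter()                                        # n: documents containing each term
    for d in collection:
        df.update(set(d))
    tf = Counter(doc_tokens)
    return {t: f * math.log(N / df[t]) / (f + 0.5 + 1.5 * dl / avgdl)   # Eqs. (29)-(30)
            for t, f in tf.items()}

docs = [["grain", "trade", "grain"], ["coffee", "trade"], ["ship", "sugar"]]
print(okapi_weights(docs[0], docs))
```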

Table 1. The network size and the associated parameters of SLRAN.

Dataset     #Input nodes (LSFS)  #Input nodes (VSM)  #Initial hidden nodes  #Output nodes  α in (7)  η in (11)  ξ in (16)  θ in (17)
Dataset 1   60–600               1000                10                     10             0.80      0.001      0.085      0.085
Dataset 2   60–500               1200                10                     10             0.75      0.001      0.085      0.085

5.2. Evaluation measures

In the literature of information retrieval (IR), precision P, recall R and F-measure F [32] are the three main measures used to evaluate the performance of IR systems. In this study, we adopt these three measures to assess our categorization algorithm:

$$P_i = \frac{m_i}{M_i} \tag{33}$$

$$R_i = \frac{m_i}{N_i} \tag{34}$$

$$F_i = \frac{2 \times P_i \times R_i}{P_i + R_i} \tag{35}$$

where M_i is the total number of documents assigned to the ith category by the algorithm, and N_i is the number of documents in the ith category, which is predefined for the purpose of evaluation. m_i is the number of documents in the intersection of M_i and N_i; in other words, it is the number of documents that have been correctly assigned to the ith category. The macro-average F-measure F_macro is then given by

$$F_{macro} = \frac{1}{k}\sum_{i=1}^{k} F_i \tag{36}$$

where k is the number of categories. Besides, the micro-average F-measure [33] F_micro is utilized, and it is given by

$$F_{micro} = \frac{\sum_{i=1}^{k} m_i}{\sum_{i=1}^{k} M_i} \tag{37}$$

where Σ_{i=1}^{k} m_i counts the number of documents correctly assigned to their corresponding categories, and Σ_{i=1}^{k} M_i is the total number of documents in the dataset. In addition, we calculate the mean absolute error (MAE), which reflects the training capability of our SLRAN algorithm and is defined as

$$MAE = \sqrt{\frac{1}{2n}\sum_{i=1}^{n}\sum_{j=1}^{m}\left(y_j(x_i) - f_j(x_i)\right)^{2}} \tag{38}$$

where n is the number of training samples, m is the number of neurons in the output layer, y_j(x_i) is the desired output, and f_j(x_i) is the actual output of the network.

5.3. Experimental results

The experiments are conducted on a Pentium PC. We use the Java language to implement the algorithm under the MyEclipse 8.5 compiler. In order to fully reveal the performance of SLRAN, for the first dataset we choose the following dimensions k of LSFS in formula (25): 60, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550 and 600. For the second dataset, we vary k from 60, 100, 150, 200, 250, 300, 350, 400, 450 to 500. The network size and the associated parameters of SLRAN are shown in Table 1. In addition, for the purpose of illustrating the superiority of SLRAN, we compare it with RAN [21], BP [10], RBF [34] and SVM [1].
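The evaluation measures of Eqs. (33)–(38) are straightforward to compute; a hedged sketch is given below. Note that the quantity the paper calls MAE in Eq. (38) is a squared-error measure under a square root, and the sketch reproduces it as written.

```python
import numpy as np

def f_measures(pred, true, k):
    """Per-category precision/recall/F (Eqs. (33)-(35)) and the macro/micro averages
    (Eqs. (36)-(37)).  pred, true: NumPy arrays of category labels in [0, k)."""
    F, correct = np.zeros(k), 0
    for i in range(k):
        M_i = np.sum(pred == i)                      # documents the algorithm put in category i
        N_i = np.sum(true == i)                      # documents truly in category i
        m_i = np.sum((pred == i) & (true == i))      # correctly assigned documents
        correct += m_i
        P_i = m_i / M_i if M_i else 0.0              # Eq. (33)
        R_i = m_i / N_i if N_i else 0.0              # Eq. (34)
        F[i] = 2 * P_i * R_i / (P_i + R_i) if P_i + R_i else 0.0   # Eq. (35)
    return F.mean(), correct / len(pred)             # Eqs. (36) and (37)

def mae(Y, F_out):
    """Eq. (38) for desired outputs Y and network outputs F_out, both of shape (n, m)."""
    return np.sqrt(np.sum((Y - F_out) ** 2) / (2 * len(Y)))
```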

Fig. 4. The F_macro of the five categorization algorithms respectively using LSFS and VSM for the first dataset.

In the first experiment, shown in Figs. 4 and 5, we provide the F_macro and the F_micro of these five categorization algorithms respectively using VSM and LSFS on the first dataset. We can see from Figs. 4 and 5 that the F_macro and the F_micro of these five algorithms using the LSFS model increase gradually and subsequently reach their maximum points, which are much higher than those of the five algorithms using VSM. That is to say, no matter which categorization algorithm we use, if a proper number of semantic features is selected they perform better than with VSM. Beyond the LSFS dimension at which these five algorithms achieve their maximum points, the F_macro and the F_micro begin to decrease with increasing dimension. That is because redundant semantic dimensions confuse the identification of documents and affect the performance of the categorization algorithms. In other words, a proper number of LSFS semantic features enhances their categorization performance. Moreover, the F_macro and the F_micro of SLRAN using LSFS achieve the best maximum point in comparison with the other four categorization algorithms. At the respective maximum points we compare the performances of these five algorithms using LSFS with those using VSM in Table 2, where their computational time (C-time) is also reported. We can see from Figs. 4 and 5 and Table 2 that SLRAN achieves the best F_macro and F_micro when the dimension of LSFS reaches 400 for the first dataset. In Table 2, the maximum F_macro of SLRAN, SVM, RAN, BP and RBF are 0.9648, 0.9437, 0.9311, 0.9133 and 0.9093 respectively, and the maximum F_micro of SLRAN, SVM, RAN, BP and RBF are 0.9640, 0.9440, 0.9320, 0.9140 and 0.9100 respectively.


Fig. 5. The F_micro of the five categorization algorithms respectively using LSFS and VSM for the first dataset.

Fig. 6. The F_macro of the five categorization algorithms respectively using LSFS and VSM for the second dataset.

Table 2. The comparisons of dimensions, F_macro, F_micro and C-time for the first dataset (Reuter-21578 corpus).

Algorithms  Dimensions   F_macro  F_micro  C-time
SLRAN       1000 (VSM)   0.9442   0.9440   307.1 s
SVM         1000 (VSM)   0.9131   0.9140   105.8 s
RAN         1000 (VSM)   0.9099   0.9100   328.6 s
BP          1000 (VSM)   0.8911   0.8920   360.3 s
RBF         1000 (VSM)   0.8891   0.8900   265.4 s
SLRAN       400 (LSFS)   0.9648   0.9640   52.6 s
SVM         350 (LSFS)   0.9437   0.9440   40.9 s
RAN         300 (LSFS)   0.9311   0.9320   56.4 s
BP          300 (LSFS)   0.9133   0.9140   89.2 s
RBF         300 (LSFS)   0.9093   0.9100   53.4 s

Moreover, the corresponding dimension of semantic features using LSFS (400) is much smaller than that using VSM (1000), which greatly decreases the time consumption of categorization. Meanwhile, the SLRAN algorithm itself dynamically adds or prunes hidden layer centers and builds a compact structure, which reduces the complexity of the network. Thus, the C-time of SLRAN to categorize the first dataset is 52.6 s, which is shorter than that of RAN, BP and RBF with LSFS, and only a little longer than that of SVM with LSFS.

The subsequent experiment demonstrates the effectiveness of SLRAN using LSFS on the second dataset. Figs. 6 and 7 respectively show the corresponding F_macro and F_micro of these five categorization algorithms. We can see from Figs. 6 and 7 that the F_macro and the F_micro of the five categorization algorithms using VSM are higher than those using LSFS at first. That is because, at the smallest LSFS dimensions, the insufficient number of semantic features cannot adequately express the content of a document and lowers the performance of the categorization algorithms. Then, as the dimensionality rises, the documents can be well described by the LSFS model. Thus, the algorithms subsequently perform better than with VSM and achieve their respective maximum points. That is to say, we can obtain a better result if we utilize LSFS instead of VSM, and the variation trend is similar to Figs. 4 and 5.

Fig. 7. The F_micro of the five categorization algorithms respectively using LSFS and VSM for the second dataset.

Note that the performance of SLRAN with LSFS still achieves the best F_macro and F_micro in comparison with those of SVM, RAN, BP and RBF in either the LSFS or the VSM space. In Table 3, we compare the dimensions, F_macro, F_micro and C-time of these five categorization algorithms for the second dataset. From Table 3 we can see that the respective best F_macro of SLRAN, SVM, RAN, BP and RBF are 0.9500, 0.9000, 0.8669, 0.8529 and 0.8239, and the best F_micro of SLRAN, SVM, RAN, BP and RBF are 0.9500, 0.9050, 0.8675, 0.8525 and 0.8250 respectively. In addition, the respective running times of the five algorithms using LSFS are 48.5, 30.2, 44.4, 78.6 and 42.4 s, which are much shorter than those using VSM. The running time of SLRAN with LSFS is just a little longer than that of SVM, RBF and RAN with LSFS, and it is still much shorter than that of BP with LSFS.


Table 3. The comparisons of dimensions, F_macro, F_micro and C-time for the second dataset (20-newsgroup corpus).

Algorithms  Dimensions   F_macro  F_micro  C-time
SLRAN       1200 (VSM)   0.9218   0.9225   310.2 s
SVM         1200 (VSM)   0.8492   0.8500   97.6 s
RAN         1200 (VSM)   0.8403   0.8450   299.6 s
BP          1200 (VSM)   0.8088   0.8125   354.3 s
RBF         1200 (VSM)   0.7965   0.8025   232.4 s
SLRAN       300 (LSFS)   0.9500   0.9500   48.5 s
SVM         250 (LSFS)   0.9000   0.9050   30.2 s
RAN         200 (LSFS)   0.8669   0.8675   44.4 s
BP          200 (LSFS)   0.8529   0.8525   78.6 s
RBF         200 (LSFS)   0.8239   0.8250   42.4 s

Fig. 8. The mean absolute error of SLRAN versus the number of epochs for the first dataset.

That is, although the dimension of 300 in the LSFS space expands the input space of the network and relatively increases the time consumption on the one hand, on the other hand the compact structure of SLRAN reduces the complexity of the network and decreases its categorization time.

Furthermore, we calculate the mean absolute error of SLRAN to reflect its training ability in the following experiments. In Figs. 8 and 9, we provide the MAE of the SLRAN algorithm versus the number of epochs for the first and the second dataset, respectively. Note that each epoch here means that all training samples complete one pass of the training process. We can see from Figs. 8 and 9 that the MAE first becomes high, then with increasing epochs it decreases gradually and falls abruptly at the last epoch. The initial rise is mainly due to the local attribute of the Gaussian function: when a training sample is added into the hidden layer as a new hidden center, the Gaussian function can only ensure that this sample has no influence on the existing network, while other input samples may cause errors for this kind of newly added center. Moreover, the rapid increase of hidden layer centers leads to the rise of MAE seen at the beginning epochs of Figs. 8 and 9. In the following epochs the MAE decreases gradually because of the relatively smaller variation of the hidden layer centers and the positive effect of the gradient descent method employed. Last but not least, we can see that the MAE decreases abruptly at the last epoch.


That is because the least square method refines the parameters of the network well at the end of the algorithm. Thus, the learning ability and the classification accuracy are further improved. What is more, we show the variation of the number of hidden layer nodes during the training stage for the two datasets in Table 4. In Table 4, in view of the best performance of SLRAN discussed above, we count the number of hidden layer nodes in each training epoch and the corresponding MAE. Note that the last epoch corresponds to the refined learning phase, and the other epochs belong to the preliminary learning phase. During the preliminary learning phase, we can see that the number of hidden layer nodes increases rapidly at the second epoch and the MAE becomes relatively higher due to the local attribute of the Gaussian function. Subsequently, as hidden nodes are added the MAE decreases gradually. In other words, this phase dynamically adjusts the hidden neurons to create a sound foundation for the network. In the refined learning phase, both the number of hidden layer nodes and the MAE drop rapidly. This phase refines the parameters of the network and achieves the best performance, which provides an optimal (or near optimal) solution to text categorization. Finally, we evaluate SLRAN by recording the computational time (C-time), which includes the training time and the SVD decomposition time; the relevant mean absolute error (MAE) is also recorded in Table 5.

Fig. 9. The mean absolute error of SLRAN versus the number of epochs for the second dataset.

Table 4. The comparisons of MAE in view of the variations of the number of hidden nodes.

        Reuter-21578 corpus                    20-newsgroup corpus
Epochs  MAE     #Hidden nodes  Variations      MAE     #Hidden nodes  Variations
1       0.1398  10             –               0.2170  10             –
2       0.1503  209            +199            0.2260  182            +172
3       0.1460  256            +47             0.2192  227            +35
4       0.1420  274            +18             0.2140  242            +15
5       0.1380  287            +7              0.2093  251            +9
6       0.1344  292            +5              0.2050  257            +6
7       0.1314  292            0               0.2016  257            0
8       0.1292  292            0               0.1987  257            0
9       0.1270  292            0               0.1970  257            0
10      0.1256  292            0               0.1950  257            0
11      0.0510  56             −236            0.0800  51             −206


Table 5. The comparisons of MAE and C-time of SLRAN with different dimensions.

Reuter-21578 corpus                    20-newsgroup corpus
Dimensions  MAE     C-time             Dimensions  MAE     C-time
LSFS 250    0.0596  38.2 s             LSFS 150    0.0920  37.7 s
LSFS 300    0.0566  42.7 s             LSFS 200    0.0895  41.2 s
LSFS 350    0.0532  47.1 s             LSFS 250    0.0871  44.6 s
LSFS 400    0.0510  52.6 s             LSFS 300    0.0800  48.5 s
LSFS 450    0.0541  58.8 s             LSFS 350    0.0827  53.4 s
LSFS 500    0.0553  65.3 s             LSFS 400    0.0860  58.9 s
LSFS 550    0.0579  72.1 s             LSFS 450    0.0880  64.2 s

In our experiments, SVD is performed using OpenCV (Open Source Computer Vision), a library of programming functions mainly aimed at real-time computer vision. From Table 5, the C-time of SLRAN increases as the dimensionality rises. The testing time of SLRAN is very short: for all dimensions listed, it takes only 0.010–0.030 s. Note that, for the Reuter-21578 and 20-newsgroup corpora, SLRAN achieves the best MAE of 0.0510 and 0.0800, respectively, with C-times of 52.6 s and 48.5 s, which are close to the C-times of SLRAN with the lower dimensions.

6. Conclusion and discussion

In this paper, we propose a learning classifier which utilizes a staged learning-based resource allocation network (SLRAN) for text categorization. In terms of its learning progress, we divide SLRAN into a preliminary learning phase and a refined learning phase. In the former phase, an agglomerate hierarchical k-means method is utilized to create the initial structure of the hidden layer. Such a method reduces the sensitivity to the input data and effectively prevents the network from falling into a local optimum. Subsequently, a novelty criterion is put forward to dynamically add and prune hidden layer centers. That is to say, this phase establishes the preliminary structure of SLRAN and reduces the complexity of the network. In the latter phase, the least square method is used to refine the learning ability of the network and improve the categorization accuracy. In summary, SLRAN builds a compact structure in the former phase, which decreases the computational complexity of the network, and the latter phase boosts its learning capability. Moreover, the semantic similarity approach is used to organize documents, which greatly decreases the input scale of the network and reveals the latent semantics between text features. In order to demonstrate the superiority of our categorization algorithm, the Reuter and the 20-newsgroup datasets are tested in our experiments, and the extensive experimental results reveal that the dynamic learning process of SLRAN improves its classification performance in comparison with conventional classifiers.

Acknowledgments

The authors thank the editors and reviewers for providing very helpful comments and suggestions. Their insight and comments led to a better presentation of the ideas expressed in this paper. This work was sponsored by the National Natural Science Foundation of China (61103129), the fourth stage of the Brain Korea 21 Project, the Natural Science Foundation of Jiangsu Province (SBK201122266), the SRF for ROCS, SEM, and the Specialized Research Fund for the Doctoral Program of Higher Education (20100093120004).


References

[1] H.T. Lin, J.C. Lin, R.C. Weng, A note on Platt's probabilistic outputs for support vector machines, Mach. Learn. 68 (10) (2007) 267–276.
[2] M.C. Wu, S.Y. Lin, C.H. Lin, An effective application of decision tree to stock trading, Expert Syst. Appl. 31 (2) (2006) 270–274.
[3] A. Kumar, M. Hanmandlu, H.M. Gupta, Fuzzy binary decision tree for biometric based personal authentication, Neurocomputing 99 (1) (2013) 87–97.
[4] E.K. Plakua, L. Avraki, Distributed computation of the KNN graph for large high-dimensional point sets, J. Parallel Distrib. Comput. 67 (3) (2007) 346–359.
[5] Y. Gao, F. Gao, Edited AdaBoost by weighted kNN, Neurocomputing 73 (16–18) (2010) 3079–3088.
[6] J.N. Chen, H.K. Huang, F.Z. Tian, Method of feature selection for text categorization with bayesian classifiers, Comput. Eng. Appl. 44 (13) (2008) 24–27.
[7] H.C. Lin, C.T. Su, A selective Bayes classifier with meta-heuristics for incomplete data, Neurocomputing 106 (15) (2013) 95–102.
[8] J. Wang, P. Neskovic, L.N. Cooper, Bayes classification based on minimum bounding spheres, Neurocomputing 70 (4–6) (2007) 801–808.
[9] S.F. Guo, S.H. Liu, G.S. Wu, Feature selection for neural network-based Chinese text categorization, Appl. Res. Comput. 23 (7) (2006) 161–164.
[10] C.H. Li, S.C. Park, An efficient document classification model using an improved back propagation neural network and singular value decomposition, Expert Syst. Appl. 36 (2) (2009) 3208–3215.
[11] H.H. Song, S.W. Lee, A self-organizing neural tree for large-set pattern classification, IEEE Trans. Neural Netw. 9 (5) (1998) 369–380.
[12] T. Poggio, F. Girosi, Networks for approximation and learning, Proceedings of the IEEE 78 (9) (1990) 1481–1497.
[13] V. Fathi, G.A. Montazer, An improvement in RBF learning algorithm based on PSO for real time applications, Neurocomputing 111 (2) (2013) 169–176.
[14] D. Du, X. Li, M. Fei, G.W. Irwin, A novel locally regularized automatic construction method for RBF neural models, Neurocomputing 98 (3) (2012) 4–11.
[15] R. Parekh, J. Yang, Constructive neural-network learning algorithms for pattern classification, IEEE Trans. Neural Netw. 11 (2) (2000) 436–451.
[16] H.G. Han, Q.L. Chen, J.F. Qiao, An efficient self-organizing RBF neural network for water quality prediction, Neural Netw. 24 (7) (2011) 717–725.
[17] G.B. Huang, P. Saratchandran, N. Sundararajan, An efficient sequential learning algorithm for growing and pruning RBF (GAP–RBF) networks, IEEE Trans. Syst., Man, Cybern.—Part B: Cybern. 34 (6) (2004) 2284–2292.
[18] G.B. Huang, P. Saratchandran, N. Sundararajan, A generalized growing and pruning RBF (GGAP–RBF) neural network for function approximation, IEEE Trans. Neural Netw. 16 (1) (2005) 57–67.
[19] M.M. Islam, M.A. Sattar, M.F. Amin, X. Yao, K. Murase, A new adaptive merging and growing algorithm for designing artificial neural networks, IEEE Trans. Syst., Man, Cybern.—Part B: Cybern. 39 (3) (2009) 705–722.
[20] Z. Miljković, D. Aleksendrić, Artificial neural networks—solved examples with theoretical background, University of Belgrade, Faculty of Mechanical Engineering, Belgrade, 2009.
[21] J. Platt, A resource allocating network for function interpolation, Neural Comput. 3 (2) (1991) 213–225.
[22] W. Manolis, T. Nicolas, K. Stefanos, Intelligent initialization of resource allocating RBF networks, Neural Netw. 18 (2) (2005) 117–122.
[23] L. Xu, Least mean square error reconstruction principle for self-organizing neural-nets, Neural Netw. 6 (5) (1993) 627–648.
[24] H.S. Yazdi, M. Pakdaman, H. Modaghegh, Unsupervised kernel least mean square algorithm for solving ordinary differential equations, Neurocomputing 74 (12) (2011) 2062–2071.
[25] W. Song, L.C. Choi, S.C. Park, Fuzzy evolutionary optimization modeling and its applications to unsupervised categorization and extractive summarization, Expert Syst. Appl. 38 (8) (2011) 9112–9121.
[26] W. Song, S.C. Park, Latent semantic analysis for vector space expansion and fuzzy logic-based genetic clustering, Knowl. Inf. Syst. 22 (3) (2010) 347–369.
[27] W. Song, S.T. Wang, C.H. Li, Parametric and nonparametric evolutionary computing with a content-based feature selection approach for parallel categorization, Expert Syst. Appl. 36 (9) (2009) 11934–11943.
[28] B. Yu, Z.B. Xu, C.H. Li, Latent semantic analysis for text categorization using neural network, Knowl. Based Syst. 21 (8) (2008) 900–904.
[29] F.F. Riverola, E.L. Iglesias, F. Díaz, J.R. Méndez, J.M. Corchado, Spam hunting: an instance-based reasoning system for spam labeling and filtering, Decis. Support Syst. 43 (3) (2007) 722–736.
[30] C.H. Li, W. Song, S.C. Park, An automatically constructed thesaurus for neural network based document categorization, Expert Syst. Appl. 36 (8) (2009) 10969–10975.
[31] S.E. Robertson, S. Walker, S. Jones, M.M. Beaulieu, M. Gatford, Okapi at TREC-3, in: Proceedings of the Third Text Retrieval Conference, Gaithersburg, USA, 1994, pp. 109–123.
[32] W. Song, C.H. Li, S.C. Park, Genetic algorithm for text clustering using ontology and evaluating the validity of various semantic similarity measures, Expert Syst. Appl. 36 (5) (2009) 9095–9104.
[33] Y. Yang, An evaluation of statistical approaches to text categorization, Inf. Retr. 1–2 (1999) 69–90.
[34] Y. Feng, Z.F. Wu, J. Zhong, An enhanced swarm intelligence clustering-based RBFNN classifier and its application in deep web sources classification, Front. Comput. Sci. China 4 (4) (2010) 560–570.


Wei Song received his MS degree in Information and Communication Engineering from Chonbuk National University, Jeonbuk, Korea, in 2006, and his PhD in Computer Science from Chonbuk National University in 2009. Upon graduation, he joined Jiangnan University in the School of Internet of Things (IoT) Engineering. His research interests include pattern recognition, information retrieval, evolutionary computing, neural networks, artificial intelligence, data mining and knowledge discovery.

Peng Chen received his B.Eng. degree in 2012 from the Hubei University of Technology, Wuhan, China, and is currently an MS candidate at Jiangnan University in the School of Internet of Things (IoT) Engineering. His research interests include information retrieval, neural networks, data mining and knowledge discovery.

Soon Cheol Park received his BS degree in Applied Physics from the Inha University, Incheon, Korea, in 1979. He received his PhD in Computer Science from Louisiana State University, Baton Rouge, Louisiana, USA, in 1991. He was a senior researcher in the Division of Computer Research at the Electronics & Telecommunications Research Institute, Korea, from 1991 to 1993. He is currently a Professor at the Department of the Electronics and Information Engineering at Chonbuk National University, Jeonbuk, Korea. His research interests include pattern recognition, information retrieval, artificial intelligence, data mining and knowledge discovery.