Applied Soft Computing 34 (2015) 251–259
Incorporating self-organizing map with text mining techniques for text hierarchy generation

Hsin-Chang Yang a,∗, Chung-Hong Lee b, Han-Wei Hsiao a

a Department of Information Management, National University of Kaohsiung, Kaohsiung, Taiwan
b Department of Electrical Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan
Article info

Article history: Received 28 October 2011; Received in revised form 21 April 2015; Accepted 5 May 2015; Available online 21 May 2015

Keywords: Text mining; Self-organizing map; Topic identification; Hierarchy generation
Abstract

Self-organizing maps (SOM) have been applied to numerous data clustering and visualization tasks and have received much attention for their success. One major shortcoming of the classical SOM learning algorithm is the need for a predefined map topology. Furthermore, hierarchical relationships among data are difficult to discover. Several approaches have been devised to overcome these deficiencies. In this work, we propose a novel SOM learning algorithm which incorporates several text mining techniques to expand the map both laterally and hierarchically. When training on a set of text documents, the proposed algorithm first clusters them using the classical SOM algorithm. We then identify the topics of each cluster. These topics are then used to evaluate the criteria for expanding the map. The major characteristic of the proposed approach is that it combines the learning process with a text mining process, which makes it suitable for automatic organization of text documents. We applied the algorithm to the Reuters-21578 dataset in text clustering and categorization tasks. Our method outperforms two comparison models in hierarchy quality according to users' evaluation. It also achieves better F1-scores than two other models in the text categorization task.
1. Introduction

In past decades, many artificial neural network (ANN) models have been proposed to emulate the activities, functions, and mechanisms of the brain, with a focus on the development of learning strategies. Among these models, the self-organizing map (SOM) [1] has emerged as one of the most widely used. According to the bibliography1 compiled by the original SOM development team [2–4], more than ten thousand research papers on the SOM have been published to date, covering applications in pattern recognition, text clustering, data visualization, data mining, and many more. Such popularity mainly comes from the fact that the SOM can effectively form clusters of high-dimensional data [5]. Besides, it is easy to implement. The major strength of the SOM model is that it can reveal the similarity among high-dimensional data and map them onto a low-dimensional (often two-dimensional) map in a topology-preserving manner; that is, data that are close together in the high-dimensional space will also be close in the mapped low-dimensional space. Such capabilities make the SOM
∗ Corresponding author. Tel.: +886 7 5919512.
E-mail addresses: [email protected] (H.-C. Yang), [email protected] (C.-H. Lee), [email protected] (H.-W. Hsiao).
1 Accessible from http://www.cis.hut.fi/research/refs/.
widely applied in data visualization, clustering, and other related tasks. Although the SOM is easy to use and effective in data clustering, its learning algorithm has three main disadvantages. First, training is often long since the input data are typically high-dimensional and the SOM requires repeated training over the input data [6]. Second, the size and topology of the map are fixed; that is, the number and arrangement of neurons in the map must be determined a priori. However, there is no gold standard for determining this size and topology, and only rules of thumb have been suggested for selecting an appropriate number of neurons. Third, the original SOM reveals only lateral correlations among data, not hierarchical ones; it can tell that a data point is close to another, but not whether a subset relation holds between them. To remedy these deficiencies, several schemes have been proposed [7,8]. To overcome the fixed-structure problem, a common strategy is to make the map expandable during training, since it is difficult to know the optimal map size a priori. Generally the map contains a small number of neurons initially and expands in later training stages according to the distribution of the data. Such expanding maps may also reduce training time since too many neurons will not be allocated for sparse data. On the other hand, there are two major approaches in existing work to overcome the lack of hierarchical relationships. The first is to apply agglomerative or divisive hierarchical clustering on the two-dimensional
map to generate hierarchies. The second is to retrain each cluster in the map using a lower-level map and thereby obtain the hierarchical structure. These approaches to map expansion or hierarchy generation all rely on the characteristics of the data to decide when and how to expand the map. Since the SOM clusters similar data together, many similar data points will be clustered into a few neurons. Thus we can decide whether to expand the map by examining the distribution of data on the map. We may call such approaches data-centric or data-oriented approaches. Although data-oriented approaches may effectively generate an adequate number of clusters or even hierarchies, they provide no insight into the meaning of the data, let alone guidance for the training process. In recent years, the SOM has been widely used for text document clustering and categorization [9–16]. Text categorization concerns classifying documents into categories according to their contents, characteristics, and properties. When documents are properly categorized, documents in a cluster should share a common theme. However, the classical SOM only groups documents into clusters without specifying their themes or topics. Therefore, another process must be applied to discover the topics of each cluster [17,18]. Furthermore, text categorization tasks usually rely on categorization taxonomies, which are often hierarchical. Obviously the classical SOM is not able to fulfill the requirements of such tasks. Techniques for building hierarchies during [18,19] or after [20] SOM training were thus developed to amend this insufficiency. Even in methodologies that develop hierarchies during training, the identification of category topics is performed after the training. An interesting question is whether it is possible to develop a SOM learning algorithm for text categorization based on the lessons learned from applying the SOM to text categorization. This question naturally leads to the idea of incorporating topic identification and hierarchy generation into SOM learning, which is the main focus of this work. In this work, we propose a novel self-organizing map algorithm which incorporates text mining techniques. The underlying idea behind the proposed approach is to guide the learning process by cluster topics instead of document centroids. A topic detection process is performed continuously throughout the entire learning process. We then use these topics to decide whether the map needs to be expanded laterally or hierarchically. The core difference between our method and traditional data-centric SOMs lies in the intervention of topics in the learning process. Therefore we name the proposed SOM the topic-oriented SOM, or TOPSOM. Using higher-level knowledge such as identified topics during training should provide deeper insight into the underlying documents and better guidance for the training. Besides, our method can naturally generate a categorization taxonomy with identified themes and a hierarchical structure. Thus it is also well suited to tasks involving text documents such as text categorization, text clustering, and text hierarchy generation. The remainder of this article is organized as follows. Section 2 describes works related to our research. In Section 3 we introduce the proposed topic-oriented self-organizing map algorithm. Section 4 gives the experimental results. Finally, we give conclusions and discussion in the last section.
2. Related work

We review related work on the two main themes of this study, namely adaptable SOMs and applications of the SOM to text mining.

2.1. Adaptable SOM

Although the SOM performs well in clustering high-dimensional data, it suffers from the fact that the map topology must be
defined prior to training and from the lack of ability to reveal hierarchical relations. To overcome these deficiencies, many models have been proposed, which are collectively called adaptable SOMs. Here we further divide adaptable SOMs into two categories. The first category, laterally expandable SOMs, contains models that can expand the map horizontally, i.e. add neurons in a two-dimensional manner. The other category, hierarchically expandable SOMs, contains models that are able to develop hierarchical maps. Note that some hierarchical models may require a predefined map topology as well and are not included here. For laterally expandable SOMs, one of the earliest works is the incremental grid growing proposed by Blackmore and Miikkulainen [21]. In this network neurons are added to the perimeter of a grid that initially consists of a 2 × 2 map. The boundary neuron with the greatest accumulated error is said to represent the area of the input space most inadequately represented in the map, and hence new neurons will grow around this neuron. Other early works are Fritzke's growing cell structures (GCS) [22] and his later growing neural gas [23], where the latter combines the former with the topology generation of competitive Hebbian learning [24]. Fritzke's models insert neurons according to accumulated errors; that is, the neuron that learns worst is split by inserting a new neuron between it and its farthest neighbor. Fritzke also developed the growing grid model [25] with similar growing strategies but a different neuron formation. The accumulated error criterion is also adopted by the growing SOM model [26], in which the map grows neurons next to boundary neurons whose errors exceed a growth factor. All the models mentioned above rely on the learning error in evaluating the growth criterion. Several models that furnish SOMs with a hierarchical structure have been proposed, e.g. the hierarchical feature map [27] and the tree-structured SOM [28,29]. These models, however, lack the ability to construct hierarchies dynamically. Here we focus on SOM models that can expand hierarchies on the fly. A hierarchical version of the growing cell structure, HiGS, was proposed by Burzevski and Mohan [30]. Each node in a HiGS may contain a sub-GCS network. A sub-network is added when a node is being added to a leaf node according to some error measure. The Competitive Evolutionary Neural Tree (CENT) [31], which is based on Dynamic Neural Tree Networks (DNTN) [32], expands a child map according to node activity, i.e. the number of times the node classifies any input vector, rather than error. The S-TREE model [33] adopts a divisive approach that divides the data into regions to construct a tree. A leaf node is split if it is the winning node of the input vector and its cost is greater than a threshold. Hodge and Austin [34] extended Fritzke's growing cell structure [22]; their work is similar to HiGS with the additional ability of topology configuration. The growing hierarchical self-organizing map (GHSOM) [35,13,36] expands the SOM if a current map cannot learn the data well. When a cluster contains too much incoherent data, it can also expand another layer to further divide the data into finer clusters. The depth of the hierarchy and the size of each SOM are determined dynamically.

2.2. Text organization based on SOM

Here we use the term text organization to refer to the efforts involved in organizing a corpus of texts into some predefined structures.
Typical text organization processes include text clustering, text categorization, and text structure generation. Text organization research has been active for several decades and many methods have been proposed for this problem. Here we mention some works that make use of the SOM. The SOM has been widely used in text clustering; a famous example is the WEBSOM project [10]. Liu et al. [37] proposed ConSOM, which uses concepts along with traditional features to guide the learning process. Their method demonstrated better results than the traditional SOM due to its semantic sensitivity.
Rauber et al. [35] used the GHSOM to organize multilingual text documents. Yang and Lee [18,38] developed a top-down approach based on the clustering result of a SOM to identify themes and construct a hierarchy from a set of text documents.

3. Topic-oriented SOM algorithm

In this work we develop a new training method for the SOM. The core of the proposed approach is the identification of topics, which are used both to guide the training process and to change the structure of the map. Fig. 1 depicts the flowchart of our method. The steps are described in detail as follows.

3.1. Preprocessing

We need to transform the text documents into vectors before training since the input to a SOM must be in vector form. In this step, common procedures for processing text documents, such as word segmentation, stopword elimination, and stemming, are first applied to obtain a set of keywords that describe the contents of a document. All keywords are collected into the vocabulary of the corpus. A document is then transformed into a vector according to the keywords it contains. Let document d_j = {k_ij | 1 ≤ i ≤ n_j}, 1 ≤ j ≤ N, where N is the number of documents, n_j is the number of distinct keywords in d_j, and k_ij represents the ith keyword in d_j. The vocabulary, denoted by V, is the union of all d_j, i.e. V = ∪_j d_j = {v_i | 1 ≤ i ≤ |V|}, where v_i is the ith keyword in V. A document is encoded into a binary vector of length |V| according to the keywords that occur in it: when keyword v_i occurs in the document, the ith element of the vector has value 1; otherwise it has value 0. With this scheme,
a document d_j is encoded into a binary vector d_j = {δ_ij | 1 ≤ i ≤ |V|} as follows:

$$\delta_{ij} = \begin{cases} 1 & \text{if } v_i \in d_j, \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$
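To make the encoding concrete, here is a minimal Python sketch of the vocabulary construction and the binary encoding of Eq. (1); the toy keyword sets stand in for the output of the segmentation, stopword-elimination, and stemming steps described above.

```python
from typing import List, Set

def build_vocabulary(docs: List[Set[str]]) -> List[str]:
    """Vocabulary V as the union of all documents' keyword sets."""
    return sorted(set().union(*docs))

def encode_binary(doc: Set[str], vocab: List[str]) -> List[int]:
    """Binary vector of length |V|: 1 if the keyword occurs in the document (Eq. (1))."""
    return [1 if v in doc else 0 for v in vocab]

# Toy keyword sets standing in for preprocessed documents.
docs = [{"stock", "market", "price"}, {"stock", "dividend"}, {"game", "score", "player"}]
vocab = build_vocabulary(docs)
vectors = [encode_binary(d, vocab) for d in docs]
print(vocab)     # keyword list defining the vector dimensions
print(vectors)   # one 0/1 vector of length |V| per document
```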
We did not apply a classical term weighting scheme such as tf·idf here since we are concerned with the co-occurrence of keywords rather than their significance during the text mining process.

3.2. SOM training

The document vectors are trained by the classical SOM algorithm [39] on an initial map which is rather small, typically a 2 × 2 map, named the initial SOM and denoted by M_0. The initial SOM is trained by the following algorithm:

Step 1 Initialize the epoch count t to 0. Randomly initialize the synaptic weight vectors w_k^t, 1 ≤ k ≤ M, of all neurons in the map, where M is the number of neurons in the map.
Step 2 Randomly select a training vector x_i. The selected vector should not have been chosen previously within the same epoch.
Step 3 Find the winning neuron j whose synaptic weight vector is the closest to x_i, i.e.

$$j = \arg\min_k \|\mathbf{w}_k^t - \mathbf{x}_i\|. \qquad (2)$$
Step 4 For every neuron l in the map, update its synaptic weight vector by

$$\mathbf{w}_l^{t+1} = \mathbf{w}_l^t + \alpha(t)\,\Lambda(j, t)\,(\mathbf{x}_i - \mathbf{w}_l^t), \qquad (3)$$

where α(t) is the learning rate and Λ(j, t) is the neighborhood function at epoch t. The neighborhood function Λ(j, t) ensures that only those neurons close to j are updated.
Step 5 Repeat steps 2–4 until all training vectors have been selected.
Step 6 Increase t by 1. If t reaches a predetermined maximum epoch count T, halt the training process; otherwise decrease the magnitudes of α(t) and Λ(j, t) and go to step 2.

When the training process on M_0 is accomplished, we perform a labeling process on M_0 to establish the association between each document and one of the neurons. The labeling process is as follows. The feature vector of document d_j, denoted d_j, is compared to the synaptic weight vectors of every neuron in M_0. We label d_j to the lth neuron of M_0 if that neuron's synaptic weight vector is the closest to d_j, i.e. l = argmin_m ||d_j − w_m||, where w_m is the synaptic weight vector of neuron m, w_m = {w_im | 1 ≤ i ≤ |V|}. After the labeling process, each document is labeled to some neuron or, from a different point of view, each neuron is labeled with a set of documents. Formally, the labeling process can be described as a function L : D → M, where D is the set of training documents and M is the set of neurons in the map. We record the labeling result and obtain the document cluster map (DCM). In the DCM, each neuron is labeled with a list of documents which are considered similar and belong to the same cluster.
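The following NumPy sketch follows the six training steps and the labeling process above. The exponential decay of the learning rate and the Gaussian neighborhood on a rectangular grid are illustrative choices; the paper does not prescribe particular schedules, so treat them as assumptions.

```python
import numpy as np

def train_som(X, rows=2, cols=2, T=200, alpha0=0.4, seed=0):
    """Classical SOM training (Steps 1-6); X is the (n_docs, |V|) binary matrix."""
    rng = np.random.default_rng(seed)
    M = rows * cols
    W = rng.random((M, X.shape[1]))                       # Step 1: random weight vectors
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for t in range(T):
        alpha = alpha0 * np.exp(-t / T)                   # decreasing learning rate
        sigma = 0.5 * max(rows, cols) * np.exp(-t / T) + 1e-3   # shrinking neighborhood
        for i in rng.permutation(len(X)):                 # Step 2: each vector once per epoch
            x = X[i]
            j = int(np.argmin(np.linalg.norm(W - x, axis=1)))   # Step 3: winner, Eq. (2)
            d2 = np.sum((grid - grid[j]) ** 2, axis=1)
            h = np.exp(-d2 / (2.0 * sigma ** 2))          # neighborhood function
            W += alpha * h[:, None] * (x - W)             # Step 4: update, Eq. (3)
    return W

def label_documents(X, W):
    """Labeling process L: D -> M; index of the closest neuron for each document (DCM)."""
    return np.argmin(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1)
```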
Fig. 1. Flowchart of the proposed method.

3.3. Topic identification

A topic identification process is applied to the trained map to identify the topics of neurons. We identify topics by examining the weight vectors of the neurons. The underlying idea is that the SOM clusters similar documents together; such documents should have similar topics and share many keywords. Therefore, a neuron's synaptic weight vector should have some elements with larger values, which reflect the importance of the keywords shared by
those documents labeled to this neuron. Finding the keywords that have higher weights should, in effect, identify the topics of this neuron (or cluster). Based on this observation, we design the following topic identification schemes. The topics of the cluster on neuron l are the set of keywords whose corresponding elements in w_l have values higher than some threshold. That is,

$$C_l = \{v_i \mid w_{il} > \theta_{NNWT}\}, \qquad (4)$$

where v_i is the ith keyword in V, C_l is the set of topics for neuron l, w_il is the ith element of w_l, and θ_NNWT is a threshold. Eq. (4) considers only the weight vector of a single neuron and may be noisy due to imperfect learning. By virtue of the SOM, neighboring neurons should represent similar clusters; that is, a topic of a neuron is likely also a topic of its neighboring neurons. Therefore we can obtain more reliable topics by averaging the weights over neighboring neurons as follows:

$$C_l = \left\{ v_i \;\middle|\; \frac{1}{|N_c(l)|} \sum_{m \in N_c(l)} w_{im} > \theta_{ANWT} \right\}, \qquad (5)$$
where N_c(l) is the set of neighboring neurons of neuron l. Using these schemes, we can obtain the set of topics for each neuron. Note that we do not identify topics for unlabeled neurons, i.e. neurons that were not labeled by any document. Eqs. (4) and (5) are called the naïve and aggregated neuron weight thresholding schemes, abbreviated as the NNWT and ANWT schemes, respectively. Another scheme, namely the average document weight thresholding (ADWT) scheme, identifies topics using the documents labeled to a neuron rather than the weight vector of the neuron itself:

$$C_l = \left\{ v_i \;\middle|\; \frac{1}{|A_l|} \sum_{j \in I(A_l)} \delta_{ij} > \theta_{ADWT} \right\}, \qquad (6)$$
where A_l denotes the set of documents labeled to neuron l and I is the index set function. Note that the thresholds θ_NNWT, θ_ANWT, and θ_ADWT in Eqs. (4)–(6), respectively, are not necessarily the same. Determining these threshold values is a difficult task. An easy approach is to set the thresholds to some predetermined values, e.g. 0.8. However, there is no general guideline for choosing the values for different training data under various training parameters. It is common to determine them experimentally, but such a process is often time-consuming and lacks robustness. Moreover, with fixed thresholds some clusters may contain many topics and others may contain none. To remedy these deficiencies, we adopted the following approach in the NNWT scheme to allow the top-ranking terms to be included in the topic set C_l:

$$\theta_{NNWT} = \gamma\,\bigl(\max_i w_{il} - \min_i w_{il}\bigr) + \min_i w_{il}, \qquad (7)$$

where γ determines the threshold within the weight range. For example, if the maximum and minimum weight values for neuron l are 0.8 and 0.3, respectively, and γ = 0.7, the keywords corresponding to elements with weights greater than (0.8 − 0.3) × 0.7 + 0.3 = 0.65 will be included in C_l. The same approach is also applied to the ANWT and ADWT schemes, with modifications using the appropriate weights. This approach allows different neurons to contain different numbers of topics. Besides, it relieves the need to choose a separate threshold for each neuron.
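A minimal sketch of the NNWT scheme with the adaptive threshold of Eq. (7) follows; the ANWT and ADWT variants only change which weights are averaged before thresholding. The value gamma = 0.7 mirrors the example above, and the vocab list and labeled_neurons array are assumed inputs from the preprocessing and labeling steps.

```python
import numpy as np

def nnwt_topics(W, vocab, labeled_neurons, gamma=0.7):
    """Topic set C_l per labeled neuron, using Eq. (4) with the adaptive threshold of Eq. (7)."""
    topics = {}
    for l in set(labeled_neurons):                       # unlabeled neurons get no topics
        w = W[l]
        theta = gamma * (w.max() - w.min()) + w.min()    # adaptive threshold, Eq. (7)
        topics[int(l)] = [vocab[i] for i in np.flatnonzero(w > theta)]
    return topics
```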
3.4. Lateral expansion

Lateral expansion is performed after the initial map has been trained and its topics have been identified. We need to decide whether lateral expansion is necessary before actually performing it. In this work, the need for lateral expansion of a neuron relies on the incompatibility of the topics on it. When a neuron is labeled with documents of different themes, it will have diverse, dissimilar, or even contrary topics; we call such topics incompatible. When a neuron contains incompatible topics, it is necessary to expand this neuron, i.e. to provide more neurons, so that these documents can form more precise clusters. To measure the incompatibility between topics, one could use a predefined ontology such as WordNet [40]: when the topics on a neuron are dissimilar enough, the neuron would be expanded, and the similarity could be measured with common WordNet-based measures such as those compiled in [41]. In this work, however, we define topic incompatibility through the map itself without referencing outside resources. Our approach is based on the distribution of the weights for a topic keyword across the neurons of the map. The distribution is defined as the histogram of the weights corresponding to the topic keyword in each neuron. Let H_i denote the histogram of keyword v_i, i.e. H_i = {w_ij | 1 ≤ j ≤ M}, where w_ij is the ith element of the weight vector of neuron j and M is the number of neurons. The topic incompatibility is measured as the average pairwise difference among topic histograms:
$$I_l = \frac{1}{|B_l|^2} \sum_{p, q \in I(B_l)} \|H_p - H_q\|, \qquad (8)$$
where I_l and B_l denote the topic incompatibility and the set of topic keywords on neuron l, respectively. Note that the histograms should be normalized to obtain a correct result. After each SOM training we expand only the one neuron with the largest topic incompatibility that exceeds some threshold θ_TI. When a neuron needs lateral expansion, we can expand it in two ways. The first is the omnidirectional approach, which inserts two rows and two columns of neurons around this neuron, as shown in Fig. 2. The synaptic weight vector of an added neuron is set to the average of its two vertical, horizontal, or diagonal neighboring neurons in the original map, according to its location. When a newly added neuron lies on the boundary of the map, the weight vector of its closest neuron in the original map is copied. The second approach, adopted from GHSOM [13], inserts a row or a column of neurons between the neuron and its worst neighbor, as shown in Fig. 3. The worst neighbor of the expanded neuron is the one, among all neurons surrounding this neuron, with the largest distance between their weight vectors. Note that Fig. 3 only shows the case of adding a column of neurons. When the worst neighbor lies in the same column as the neuron, a row of neurons is added between the expanded neuron and its worst neighbor. The weights of the added column or row of neurons are set to the average of their closest neighbors in the original map. We retrain the map using the SOM algorithm after lateral expansion. The same topic identification and lateral expansion process described above is then applied to the newly trained map to see whether further expansion is needed. This procedure repeats, as depicted in Fig. 1, until the topic incompatibility of every neuron is below the threshold, which completes the lateral expansion process.
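As a sketch under the definitions above, the topic incompatibility of Eq. (8) and the selection of the single neuron to expand could be computed as follows; the histogram normalization and the threshold value 0.7 for θ_TI follow the setting reported in Section 4.1, and vocab_index (keyword to element index) is an assumed helper.

```python
import numpy as np

def topic_incompatibility(W, topic_columns):
    """Average pairwise difference among normalized topic histograms (Eq. (8))."""
    if len(topic_columns) < 2:
        return 0.0
    H = W[:, topic_columns].T                        # one histogram per topic keyword
    H = H / (H.sum(axis=1, keepdims=True) + 1e-12)   # normalize each histogram
    total = sum(np.linalg.norm(H[p] - H[q]) for p in range(len(H)) for q in range(len(H)))
    return total / (len(H) ** 2)

def neuron_to_expand(W, topics_by_neuron, vocab_index, theta_ti=0.7):
    """Return the neuron with the largest incompatibility exceeding theta_TI, else None."""
    scores = {l: topic_incompatibility(W, [vocab_index[v] for v in kws])
              for l, kws in topics_by_neuron.items()}
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > theta_ti else None
```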
Fig. 2. The omnidirectional lateral expansion approach. (a) The size of the original map is m × n. The black disk depicts the neuron to be expanded. (b) After expansion, the size of the map becomes (m + 2) × (n + 2). The grey disks depict the added neurons.
Fig. 3. The worst-neighbor lateral expansion approach. (a) The size of the original map is m × n. The black disk depicts the neuron to be expanded. The slashed disk is its worst neighbor. (b) After expansion, the size of the map becomes m × (n + 1) since the worst neighbor is in the same row as the neuron. The grey disks depict the added neurons. In this example, the weight of an added neuron is the average of the weights of the two neurons on its left and right.
3.5. Hierarchical expansion

The ability to expand hierarchically is crucial for revealing the hierarchical relationships among documents. Therefore, our model not only dynamically expands the map horizontally, but also vertically. The need for hierarchical expansion is decided after lateral expansion is completed. In this work, a cluster needs to be hierarchically expanded if the topics of this cluster need further investigation. One simple criterion is the number of documents labeled on the neuron. When a cluster contains too many documents, even if they have similar topics, it should be expanded since there is a good chance that sub-topics will develop under such topics. Therefore, the first criterion for hierarchical expansion is cluster size thresholding (CST), formulated as follows: neuron l needs hierarchical expansion if

$$|A_l| > \theta_{CST}. \qquad (9)$$
The threshold θ_CST could be a predetermined fixed value or some multiple of the average cluster size, defined as follows:

$$\theta_{CST} = \beta\,\frac{1}{M} \sum_{1 \le l \le M} |A_l|, \qquad (10)$$
where β is a predetermined constant. Besides the cluster size criterion, a neuron may also need to be expanded hierarchically if it contains too many topics. In Section 3.4 we use topic incompatibility to evaluate the necessity of lateral expansion. After lateral expansion, the documents labeled to the same neuron should have compatible topics, meaning that they have similar topic distributions. These topics, however, may still be rather distinct and need further separation. For example, a cluster may contain topics such as 'baseball', 'player', 'basketball', 'MLB', 'NBA', and 'statistics'. The documents in this cluster may thus very likely be categorized into sub-categories such as 'baseball' and 'basketball'. When a cluster has a large number of compatible topics, the documents in this cluster may be multifaceted. Therefore, we design the second criterion for hierarchical expansion according to the number of topics in neuron l, namely topic size thresholding (TST): neuron l needs hierarchical expansion if

$$|B_l| > \theta_{TST}. \qquad (11)$$
We may use the same schemes for determining θ_TST as for θ_CST, i.e. the threshold θ_TST could be a predetermined fixed value or some multiple of the average topic size, defined as follows:

$$\theta_{TST} = \beta\,\frac{1}{M} \sum_{1 \le l \le M} |B_l|, \qquad (12)$$
where β is a predetermined constant. Note that the CST and TST criteria can be used conjunctively or multiplicatively in evaluating the necessity of hierarchical expansion. The conjunctive form is: neuron l needs hierarchical expansion if

$$|A_l| > \theta_{CST} \;\wedge\; |B_l| > \theta_{TST}, \qquad (13)$$

and the multiplicative form is: neuron l needs hierarchical expansion if

$$|A_l|\,|B_l| > \theta_{CST}\,\theta_{TST}. \qquad (14)$$
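As a sketch, the two thresholds of Eqs. (10) and (12) and the conjunctive criterion of Eq. (13) can be checked as below; the value β = 1.2 mirrors the setting reported in Section 4.1. Each neuron flagged by this check would then receive its own small child SOM trained only on its labeled documents, as described in Section 3.5.

```python
def neurons_to_expand_hierarchically(cluster_sizes, topic_counts, beta=1.2):
    """Neurons satisfying the conjunctive criterion of Eq. (13).

    cluster_sizes: {neuron: |A_l|}; topic_counts: {neuron: |B_l|}.
    The thresholds are beta times the average cluster/topic size (Eqs. (10) and (12)).
    """
    if not cluster_sizes:
        return []
    M = len(cluster_sizes)
    theta_cst = beta * sum(cluster_sizes.values()) / M
    theta_tst = beta * sum(topic_counts.values()) / M
    return [l for l in cluster_sizes
            if cluster_sizes[l] > theta_cst and topic_counts.get(l, 0) > theta_tst]
```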
Table 1
Training parameters used in our experiments.

Parameter                                        C1       C2
Size of initial SOM                              2 × 2    2 × 2
Number of synapses in a neuron                   563      1976
Initial training gain α(t) for SOM training      0.4      0.4
Maximal epoch count T for SOM training           200      500
Table 2
Results of the training process.

                                      Corpus   NNWT   ANWT   ADWT   Yang04   LabelSOM
Average number of topics per neuron   C1       2.37   2.78   3.67   6.07     3.23
                                      C2       4.32   5.68   6.57   9.34     6.54
Average number of neurons per map     C1       5.75   6.24   5.47   N/A      6.45
                                      C2       8.81   9.27   9.31   N/A      11.46
Average number of maps per training   C1       4.32   3.67   5.27   N/A      6.31
                                      C2       7.37   8.26   9.31   N/A      12.29
Average depth of hierarchies          C1       2.3    3.21   3.45   6.25     5.76
                                      C2       3.25   3.94   4.37   8.58     7.58
Table 3
The evaluation result of generated hierarchies. We adopted five-point Likert scales to evaluate each criterion, where score 1 stands for 'strongly disagree' and 5 for 'strongly agree'.

Criterion                                                  Corpus   NNWT   ANWT   ADWT   Yang04   LabelSOM
The topics are descriptive for the clusters.               C1       4.1    4.5    4.4    3.8      4.2
                                                           C2       3.9    4.1    4.1    3.1      4.0
The number of topics is appropriate.                       C1       3.9    4.0    4.1    3.2      3.8
                                                           C2       3.5    3.8    3.9    3.2      3.8
The depth of hierarchy is adequate.                        C1       4.1    4.3    4.3    3.7      4.0
                                                           C2       3.7    4.2    4.1    2.8      3.9
The hierarchy reveals the parent–child relationships       C1       4.2    4.4    4.3    3.5      4.1
between clusters.                                          C2       4.0    4.3    4.3    3.3      4.1
The sizes of clusters are adequate.                        C1       3.7    4.2    4.3    3.1      4.2
                                                           C2       3.8    4.2    4.3    3.4      4.3

Bold values indicate results that outperform the LabelSOM approach (last column), which is considered the state of the art.
When a neuron l in a level-n map needs hierarchical expansion, we create an initial SOM as described in Section 3.2. This SOM, which is a level-(n + 1) map, is then trained by the documents labeled to this neuron, i.e. d_j ∈ A_l. We should increase the topic incompatibility threshold θ_TI for lateral expansion, since the documents used for training higher-level maps are already more compatible in topics. In the meantime, the number of training epochs as well as the initial training gains may also need to be updated, since the number of training samples is reduced for higher-level maps. The whole training process stops when no neuron in the highest-level maps meets the hierarchical expansion criteria.

4. Experimental results

Since our method focuses on the organization of text documents, traditional experiments on clustering using the SOM may not fit. Therefore, we performed experiments to see whether our method works for text hierarchy generation, within the general framework of text categorization. The experiments are two-fold. First, we performed experiments on smaller datasets to reveal the ability of hierarchy generation and topic identification. Second, we performed experiments on a larger dataset to check the feasibility of the generated hierarchy for text categorization. The details are described in the following subsections.

4.1. Experiments on text hierarchy generation

We performed the experiments on two corpora which were used in our previous research [12,18] for comparison purposes. The first corpus (C1) contains 100 Chinese news articles posted on August 1, 2, and 3, 1996 by the Central News Agency.2 The second corpus (C2) contains 3268 documents posted from October 1 to October 9, 1996. A word extraction process is applied to the
2 http://www.cna.com.tw/.
corpora to extract Chinese words. There are 1475 and 10937 words extracted from C1 and C2, respectively. To reduce the dimensionality of the feature vectors we discarded words which occur only once in a document. We also discarded words appearing in a manually constructed stoplist. This reduces the number of words to 563 and 1976 for C1 and C2, respectively. We trained the corpora with the proposed algorithm. The sizes of the initial SOMs for C1 and C2 are both 2 × 2. Table 1 shows the parameters used in the experiments; they are the values that produced the best results for the later comparisons. After SOM training, we performed topic identification using the three topic identification schemes described in Section 3.3. We then found the neurons which need lateral expansion according to the topic incompatibility defined in Section 3.4. The thresholds for topic identification were set to 0.8, 0.85, and 0.8 for θ_NNWT, θ_ANWT, and θ_ADWT, respectively. Besides, the threshold for topic incompatibility, θ_TI, was set to 0.7. Both the omnidirectional and worst-neighbor schemes were adopted to expand a neuron; however, the worst-neighbor scheme produced better results, so we only report its results. Hierarchical expansion was then performed according to the thresholds θ_CST and θ_TST. We use Eq. (10) to determine θ_CST with β = 1.2, i.e. a neuron is expanded when its cluster contains 20% more documents than average. The same scheme was adopted to determine θ_TST, where β was also set to 1.2. We then used these two thresholds conjunctively, as shown in Eq. (13). A neuron was expanded into a lower-level SOM with its size initialized as in Table 1 and the maximal epoch count reduced by a factor of 0.9 for every additional level. We retrained the two corpora 100 times and obtained statistical information about the proposed algorithm. Table 2 shows the statistics of the training process. Note that the values are averaged over the 100 training runs. We also conducted the same experiments using our previous approach and GHSOM, which are denoted as Yang04 and LabelSOM, respectively, in Table 2. The quality of the generated hierarchies was examined by 10 human inspectors. A questionnaire containing 5 questions regarding the quality of the hierarchies was evaluated by the
Fig. 4. A hierarchy built by our method. We randomly selected 100 articles from the dataset to build the hierarchy.
inspectors. The criteria and average scores are shown in Table 3. Two variants of our approach outperform GHSOM according to the evaluation results.

4.2. Experiments on text categorization

To evaluate the effectiveness of the generated hierarchies for text categorization, we performed experiments using the popular Reuters-21578 dataset.3 The dataset contains 135 flat categories which are not organized into a hierarchy. In order to use the collection as a benchmark for hierarchical text categorization and to demonstrate its superiority over flat categorization, several authors have organized the categories into hierarchies. We followed the experimental setup in [42], which used 3 hierarchies proposed by D'Alessio et al. [43] (Experiments E3, E4, E5). We adopted the same approach as [42], which used the mode-Apté split [44] that removes all unlabelled documents, resulting in a total of 7760 training and 3009 test documents. We then used the document preprocessing steps described in Section 3.1 to process both training and test documents into vectors. The proposed training algorithm was applied to the training
3 http://www.daviddlewis.com/resources/testcollections/reuters21578/.
Table 4
The training parameters for the text categorization experiment.

Parameter                        Value/approach
θ_NNWT                           0.8
θ_ANWT                           0.8
θ_ADWT                           0.85
θ_TI                             0.75
Lateral expansion scheme         Worst neighbor
β for CST                        1.2
β for TST                        1.2
How CST and TST are used         Conjunctively
vectors. The size of the initial SOM is 2 × 2. The training parameters were determined in similar ways as described in Section 4.1 and are listed in Table 4. A neuron was expanded into a lower-level SOM with its size initialized as in Table 1 and the maximal epoch count reduced by a factor of 0.9 for every additional level. Fig. 4 depicts one hierarchy generated by our method. Note that we used a smaller dataset for demonstration purposes. In Fig. 4, the layer-1 map consists of 20 neurons in a 5 × 4 formation, enumerated from 0 to 19. For each neuron we show its identified topics as well as the documents labeled on it. If a neuron contains a 'down' link, there is a lower-level map under this
Table 5
Performance comparison for text categorization experiments.

Hierarchy   Our method   Tikk et al. [42]   D'Alessio et al. [43]
E3          91.47        91.19              79.8
E4          92.68        93.61              82.8
E5          92.45        92.39              82.5
neuron. In this example, there is only one second-layer map, under neuron 6 of the first-layer map. It is clear in this example that the identified topics are consistent for nearby neurons. Since we intentionally arranged similar documents in consecutive order, we may also observe that, in some clusters, similar documents, which have filenames with consecutive or close numbers, are clustered together in the same neuron or in nearby neurons. Upon further investigation of the documents labeled on a neuron, we were convinced that these documents also reflect the topics associated with this neuron. We trained the data 100 times and applied the test data to the trained hierarchy. For evaluation, we adopted the F1 measure used in [42]. Table 5 shows the results of the experiments. It is clear that our method produces results comparable to related work.

5. Conclusions

The self-organizing map model has been widely and successfully used in data clustering and visualization, as well as in a wide range of other applications. Traditional SOM models cluster data according to their similarity, and the structure of the map is fixed. Various models have been proposed to overcome these insufficiencies. In this work, we developed a novel learning algorithm which is suitable for text organization. When a set of text documents is trained, we identify, after SOM training, the topics of each neuron, which represents a document cluster. We then use these topics to decide whether a neuron needs to be laterally or hierarchically expanded. Since our method incorporates various text mining approaches into training, it is well suited to text documents. The experimental results suggest that our method is adequate for developing structures which can be used for text categorization and organization. The major contribution of this work is the incorporation of a topic identification process into the expansion of maps during SOM learning. Such incorporation is novel and well suited to text document organization. Various approaches to topic identification and their use in expanding maps were suggested and tested. There are still open problems, such as the choice of parameters. In the next phase of our research, we will try to develop a scheme to control the learning process to obtain near-optimal results. We will also try to confine the learning process within some external constraints.

References

[1] T. Kohonen, Self-Organizing Maps, Springer-Verlag, Berlin, 1997. [2] S. Kaski, J. Kangas, T. Kohonen, Bibliography of self-organizing map (SOM) papers: 1981–1997, Neural Comput. Surv. 1 (1998) 102–350. [3] M. Oja, S. Kaski, T. Kohonen, Bibliography of self-organizing map (SOM) papers: 1998–2001 Addendum, Neural Comput. Surv. 3 (2003) 1–156. [4] M. Pöllä, T. Honkela, T. Kohonen, Bibliography of Self-Organizing Map (SOM) Papers: 2002–2005 Addendum, Tech. Rep. TKK-ICS-R24, Information and Computer Science, Helsinki University of Technology, Espoo, Finland, 2009. [5] J. Vesanto, E. Alhoniemi, Clustering of the self-organizing map, IEEE Trans. Neural Netw. 11 (3) (2000) 586–600. [6] D. Roussinov, H. Chen, A scalable self-organizing map algorithm for textual classification: a neural network approach to thesaurus generation, Commun. Cognit. 15 (1–2) (1998) 81–112. [7] C. Astudillo, B. Oommen, Topology-oriented self-organizing maps: a survey, Pattern Anal. Appl. 17 (2) (2014) 223–248, http://dx.doi.org/10.1007/s10044-014-0367-9.
[8] X. Qiang, G. Cheng, Z. Li, A survey of some classic self-organizing maps with incremental learning, in: Proc. 2010 International Conference on Signal Processing Systems (ICSPS), vol. 1, 2010, pp. 804–809. [9] T. Honkela, S. Kaski, K. Lagus, T. Kohonen, Newsgroup Exploration with WEBSOM Method and Browsing Interface, Tech. Rep. A32, Laboratory of Computer and Information Science, Helsinki University of Technology, Espoo, Finland, 1996. [10] S. Kaski, T. Honkela, K. Lagus, T. Kohonen, WEBSOM-self-organizing maps of document collections, Neurocomputing 21 (1998) 101–117. [11] D. Merkl, Text classification with self-organizing maps: some lessons learned, Neurocomputing 21 (1-3) (1998) 61–77. [12] C.H. Lee, H.C. Yang, A web text mining approach based on self-organizing map, in: Proceedings of the ACM CIKM’99 2nd Workshop on Web Information and Data Management, Kansas City, MI, 1999, pp. 59–62. [13] A. Rauber, D. Merkl, M. Dittenbach, The growing hierarchical self-organizing map: exploratory analysis of high-dimensional data, IEEE Trans. Neural Netw. 13 (6) (2002) 1331–1341. [14] D. Isa, V.P. Kallimani, L.H. Lee, Using the self organizing map for clustering of text documents, Expert Syst. Appl. 36 (5) (2009) 9584–9591. [15] Y.-c. Liu, C. Wu, M. Liu, Research of fast SOM clustering for text information, Expert Syst. Appl. 38 (8) (2011) 9325–9333. [16] M.-S. Paukkeri, A.P. García-Plaza, V. Fresno, R.M. Unanue, T. Honkela, Learning a taxonomy from a set of text documents, Appl. Soft Comput. 12 (3) (2012) 1138–1148, http://dx.doi.org/10.1016/j.asoc.2011.11.009, ISSN:15684946, http://www.sciencedirect.com/science/article/pii/S1568494611004340 [17] A. Rauber, LabelSOM: on the labeling of self-organizing maps, in: Proc. International Joint Conference on Neural Networks, vol. 5, Washington, DC, 1999, pp. 3527–3532. [18] H.C. Yang, C.H. Lee, A text mining approach on automatic generation of web directories and hierarchies, Expert Syst. Appl. 27 (4) (2004) 645–663. [19] M. Dittenbach, D. Merkl, A. Rauber, Using growing hierarchical self-organizing maps for document classification, in: Proceedings of the 8th European Symposium on Artificial Neural Networks (ESANN’2000), Bruges, Belgium, 2000, pp. 7–12. [20] R. Chau, C. Yeh, K. Smith, A neural network model for hierarchical multilingual text categorization, in: J. Wang, X.-F. Liao, Z. Yi (Eds.), Advances in Neural Networks – ISNN 2005, Vol. 3497 of Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2005, pp. 238–245. [21] J. Blackmore, R. Miikkulainen, Incremental grid growing: encoding highdimensional structure into a two-dimensional feature map, in: Proceedings of the IEEE International Conference on Neural Networks, vol. 1, 1993, pp. 450–455. [22] B. Fritzke, Growing cell structures: a self-organizing network for unsupervised and supervised learning, Neural Netw. 7 (9) (1994) 1441–1460. [23] B. Fritzke, A growing neural gas network learns topologies, in: G. Tesauro, D.S. Touretzky, T.K. Leen (Eds.), in: Advances in Neural Information Processing Systems, vol.7, MIT Press, Cambridge MA, 1995, pp. 625–632. [24] T.M. Martinetz, K.J. Schulten, A. “neural-gas” network learns topologies, in: T. Kohonen, K. Mäkisara, O. Simula, J. Kangas (Eds.), Artificial Neural Networks, North-Holland, Amsterdam, Netherlands, 1991, pp. 397–402. [25] B. Fritzke, Growing grid – a self-organizing network with constant neighborhood range and adaptation strength, Neural Process. Lett. 2 (5) (1995) 9–13. [26] D. Alahakoon, S.K. Halgamuge, B. 
Srinivasan, Dynamic self-organizing maps with controlled growth for knowledge discovery, IEEE Trans. Neural Netw. 11 (3) (2000) 601–614. [27] R. Miikkulainen, Script recognition with hierarchical feature maps, Connect. Sci. 2 (1–2) (1990) 83–101. [28] P. Koikkalainen, E. Oja, Self-organizing hierarchical feature maps, in: Proceedings of 1990 International Joint Conference on Neural Networks, vol. II, IEEE, INNS, San Diego, CA, 1990, pp. 279–284. [29] P. Koikkalainen, Tree structured self-organizing maps, in: E. Oja, S. Kaski (Eds.), Kohonen Maps, Elsevier, The Netherlands, 1999, pp. 121–130. [30] V. Burzevski, C.K. Mohan, Hierarchical growing cell structures, in: Proceedings of the IEEE International Conference on Neural Networks, vol. 3, 1996, pp. 1658–1663. [31] R.G. Adams, K. Butchart, N. Davey, Hierarchical classification with a competitive evolutionary neural tree, Neural Netw. 12 (1999) 541–551. [32] T. Li, Y. Tang, S. Suen, L. Fang, A structurally adaptive neural tree for recognition of a large character set, in: Proceedings of the 11th IAPR International Joint Conference on Pattern Recognition, vol. II, 1992, pp. 187–190. [33] M.M. Campos, G.A. Carpenter, S-TREE: self-organizing trees for data clustering and online vector quantization, Neural Netw. 14 (2001) 505–525, http://dx.doi. org/10.1016/S0893-6080(01)00020-X [34] V.J. Hodge, J. Austin, Hierarchical growing cell structures: TreeGCS, IEEE Trans. Knowl. Data Eng. 13 (2) (2001) 207–218, http://dx.doi.org/10.1109/69.917561 [35] A. Rauber, M. Dittenbach, D. Merkl, Towards automatic content-based organization of multilingual digital libraries: an English, French and German view of the Russian information agency Nowosti news, in: Proceedings of the Third All-Russian Scientific Conference on Digital Libraries: Advanced Methods And Technologies, Digital Collections, 2001, pp. 11–13. [36] M. Dittenbach, A. Rauber, D. Merkl, Recent Advances with the Growing Hierarchical Self-Organizing Map, in: N. Allinson, Y. Ahujun, L. Allinson, J. Slack (Eds.), Advances in Self-Organizing Maps, Springer, Lincoln, England, 2001, pp. 140–145. [37] Y. Liu, X. Wang, C. Wu, ConSOM: a conceptional self-organizing map model for text clustering, Neurocomputing 71 (4–6) (2008) 857–862.
[38] H.C. Yang, C.H. Lee, Automatic category theme identification and hierarchy generation for Chinese text categorization, J. Intell. Inf. Syst. 25 (1) (2005) 47–67. [39] T. Kohonen, Self-organizing formation of topologically correct feature maps, Biol. Cybern. 43 (1) (1982) 59–69. [40] G.A. Miller, WordNet: a lexical database for English, Commun. ACM 38 (11) (1995) 39–41. [41] T. Pedersen, S. Patwardhan, J. Michelizzi, WordNet::Similarity – measuring the relatedness of concepts, in: S. Dumais, D. Marcu, S. Roukos (Eds.), HLT-NAACL 2004: Demonstration Papers, Association for Computational Linguistics, Boston, Massachusetts, USA, 2004, pp. 38–41.
[42] D. Tikk, G. Biró, J.D. Yang, A hierarchical text categorization approach and its application to FRT expansion, Aust. J. Intell. Inf. Process. Syst. 8 (3) (2004) 123–131. [43] S. D'Alessio, K. Murray, R. Schiaffino, A. Kershenbaum, The effect of using hierarchical classifiers in text categorization, in: Proceedings of RIAO 2000, 6th International Conference "Recherche d'Information Assistée par Ordinateur", 2000, pp. 302–313, http://www.iona.edu/cs/FacultyPublications/riao2000New.pdf [44] C. Apté, F. Damerau, S.M. Weiss, Automated learning of decision rules for text categorization, ACM Trans. Inf. Syst. 12 (1994) 233–251.