BioSystems 88 (2007) 334–342
Keyword extraction, ranking, and organization for the neuroinformatics platform
S. Usui a,∗, P. Palmes b,a, K. Nagata a, T. Taniguchi a, N. Ueda c
a RIKEN Brain Science Institute, 2-1 Hirosawa, Wako City, Saitama 351-0198, Japan
b Ateneo de Manila University, School of Science and Engineering, Department of Information Systems and Computer Science, Loyola Heights, Quezon City 1108, Philippines
c NTT Communication Science Laboratories, 2-4 Hikaridai, Seika-cho, Soraku-gun Kyoto, Japan
Received 24 March 2006; accepted 3 August 2006
Abstract
Brain-related research encompasses many fields of study and usually involves worldwide collaboration. Recognizing the value of such international collaboration for using resources efficiently and improving the quality of brain research, the International Neuroinformatics Coordinating Facility (INCF) has started to coordinate the establishment of neuroinformatics (NI) centers and portal sites among the participating countries. These NI centers and portal sites will serve as conduits for the interchange of information and brain-related resources among countries. In Japan, several NI platforms are being developed under the support of the NI Japan Center (NIJC); one of them, Visiome, is already operating and publicly accessible at “http://www.platform.visiome.org”. Each platform requires its own set of keywords representing the important terms of its field of study. One important function of this predefined keyword list is to help contributors classify the contents of their contributions and group related resources. It is therefore vital that the list be chosen to cover the necessary areas. Currently, identifying appropriate keywords relies on the availability of human experts, which does not scale well given how rapidly the different areas evolve. This problem prompted us to develop a tool that automatically filters the terms most likely to be preferred by human experts. We tested the effectiveness of the proposed approach using the abstracts of the journals Vision Research (VR) and Investigative Ophthalmology and Visual Science (IOVS) as source files.
© 2006 Elsevier Ireland Ltd. All rights reserved.
Keywords: Neuroinformatics; Relevance ranking; Weighting; Indexing; Automatic extraction; Co-occurrence; Clustering
∗ Corresponding author. Tel.: +81 48 462 1111x7601; fax: +81 48 467 7498.
E-mail addresses: [email protected] (S. Usui), [email protected] (P. Palmes), [email protected] (K. Nagata), [email protected] (T. Taniguchi), [email protected] (N. Ueda).
1. Introduction
Understanding the brain as a system requires worldwide collaboration among scientists specializing in different areas of brain research. With the advancement and widespread adoption of information technology in the scientific community, scientists working together can now attain a much richer understanding of a given phenomenon. These rich interactions hasten discovery, but they also produce new information at a rate that makes a system as complex as the brain overwhelming for any individual to understand in its entirety. Consequently, further understanding and development in a particular field become difficult to achieve because of information overload. These issues, which confront many areas of research and are compounded in brain research, prompted the
0303-2647/$ – see front matter © 2006 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.biosystems.2006.08.015
S. Usui et al. / BioSystems 88 (2007) 334–342
development of a field called neuroinformatics (NI). Its main goal is to help brain scientists handle the analysis, modeling, simulation, and management of information resources before, during, and after the conduct of research. Scientists working together from different places need a common, remotely accessible environment that provides tools for organizing data and storing research findings. The environment should also let them integrate their results smoothly with those of their collaborators. Neuroinformatics platforms such as “Visiome” (http://platform.visiome.org) aim to address these issues by providing portal sites for different fields of brain research (Usui, 2003a,b). These portal sites allow collaborators to share research resources, which include not only published papers but also the papers' supporting files, such as source code of algorithms and mathematical/statistical models, experimental data, movies, slides/images, and presentations.
One vital component of the neuroinformatics platform is the index tree, which is used to organize the materials submitted by contributors. Since each contributor has to choose appropriate terms from the index tree when submitting work, the elements of the index tree must be chosen carefully so that submissions can be organized and characterized coherently. The index terms should cover almost all areas that human experts deem highly relevant, and they should be organized in a structure in which the resources they point to can be located easily. As the different fields of study evolve, the structure and composition of the index tree will evolve as well. The current manual scheme does not scale well under such change. Automating index keyword extraction is therefore necessary to support both the evolution of platforms in operation and the establishment of new platforms.
2. Extracting keywords
This section describes the data sets, the data processing techniques, and the rationale behind the proposed weighting measures.
2.1. Data sets
In this study of automatic extraction of technical keywords, we use as test cases the research abstracts published from 1992 to 2004 in the journals Vision Research (VR) and Investigative Ophthalmology and Visual Science (IOVS). Although using the full paper contents could have provided us better data quality
Fig. 1. VR and IOVS abstracts basic statistics. The IOVS database has a relatively smaller number of keywords and significantly wider search space compared to the VR database. Moreover, most of IOVS keywords are non-unigrams. These properties make the task of extracting IOVS keywords harder than the VR keywords.
and accuracy, we preferred the practicality of analyzing research abstracts because they are freely available in the majority of cases. To assess the effectiveness of the different weighting schemes, it is important to have a sound basis of expert knowledge for determining the most relevant terms in the collection of abstracts being studied. For evaluation purposes, this paper takes the two sets of keywords defined by the VR and IOVS editorial boards/publishers, respectively, as the correct sets of keywords. Fig. 1 summarizes the basic statistics of the databases derived from both journals.
Although VR and IOVS are related through their focus on vision science, they differ in scope and perspective. One prominent difference is in their keyword lists. While the majority of VR keywords are unigrams (single-term keywords), IOVS keywords are mostly bigrams, with smaller fractions of unigrams and trigrams. Also, the number of IOVS Ngrams (single- or multiple-term keywords), 219,244, is almost twice that of VR, 112,520. However, the number of IOVS keywords (476) is just under half the number of VR keywords (1055). In this sense, extracting the main IOVS keywords is more difficult: the data size is larger, but the number of keywords is smaller. The differences in statistical properties between VR and IOVS allow us to determine which approach is the most consistent, stable, and robust in keyword extraction.
2.2. Data processing
All approaches included in the study use the vector-space, or bag-of-words, representation between
Fig. 2. Data processing flowchart. The pre-processing scheme employs similar techniques such as stemming and stopword removal which are popularly used in the text-mining community to extract terms for vector-space representation.
terms and documents (Salton and McGill, 1983). It is the most common and generally accepted representation in the text-mining community. Each unique term in the collection of research abstracts (Fig. 2) is assigned a unique term-id after stopword removal and stemming (Porter, 1980), and each unique abstract is assigned a unique document-id. The combination of term-ids and document-ids facilitates the construction of the term-document matrix, in which each cell (i, j) holds the frequency of occurrence of term i in document j, as shown in Fig. 3. The rows of this matrix are the vector-space representations of terms embedded in the document space, while its columns are the vector-space representations of documents embedded in the term space. The statistical and machine learning approaches in this study use the base information encoded in this matrix to derive other tables.
2.3. Term ranking
As shown in Fig. 2, automatic keyword extraction is based on ranking terms with term weighting measures. The choice of weighting measure strongly influences the relevance ranking of terms, that is, their interestingness. The typical measure of interestingness is based on a term's specificity and generality of occurrence. The simplest measure is the term frequency TF(i), given by
$$\mathrm{TF}(i) = \sum_{j=1}^{N} \mathrm{TF}(i, j). \qquad (1)$$
Here, TF(i, j) is the frequency of term i in document j, and N is the total number of documents. As shown in Fig. 3, TF(i, j) is the (i, j)th element of the term-document matrix in the bag-of-words representation. The problem with this measure is that uninteresting general terms receive high weights. To alleviate this problem, Salton (1991) proposed the TF–IDF (term frequency and inverse document frequency) weighting. The inverse document frequency, IDF(i), is defined as
$$\mathrm{IDF}(i) = \log \frac{N}{\mathrm{DF}(i)}, \qquad (2)$$
where DF(i), the document frequency of term i, is the number of documents containing term i. Finally, TF–IDF is defined by
$$\text{TF–IDF}(i) = \mathrm{TF}(i) \times \mathrm{IDF}(i). \qquad (3)$$
Fig. 3. Vector-space or bag-of-words representation. Each cell (i, j) corresponds to the number of occurrences of term i in document j. The rows represent the collections of terms embedded in the document space while the columns represent collections of documents embedded in the term space.
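As an illustration of Eqs. (1)–(3), the following is a minimal sketch that builds a term-document count matrix from already tokenized documents and computes TF–IDF for each term. The function names `term_document_matrix` and `tf_idf_scores`, the toy corpus, and the use of the natural logarithm are assumptions made for this example; the paper does not state the base of the logarithm or give an implementation.

```python
import math
from collections import Counter

def term_document_matrix(docs):
    """Build the count matrix from tokenized documents.

    docs: list of token lists (assumed already stemmed, stopwords removed).
    Returns (terms, matrix) with matrix[i][j] = TF(i, j).
    """
    terms = sorted({t for doc in docs for t in doc})
    index = {t: i for i, t in enumerate(terms)}
    matrix = [[0] * len(docs) for _ in terms]
    for j, doc in enumerate(docs):
        for term, count in Counter(doc).items():
            matrix[index[term]][j] = count
    return terms, matrix

def tf_idf_scores(matrix):
    """Return TF-IDF(i) for every row (term) of the matrix."""
    n_docs = len(matrix[0])
    scores = []
    for row in matrix:
        tf = sum(row)                       # Eq. (1): total frequency
        df = sum(1 for c in row if c > 0)   # DF(i): documents containing term i
        idf = math.log(n_docs / df)         # Eq. (2), natural log assumed
        scores.append(tf * idf)             # Eq. (3)
    return scores

# Toy corpus (hypothetical): three tiny "abstracts".
docs = [
    ["retina", "cell", "retina"],
    ["cell", "cortex"],
    ["cell", "motion", "cortex"],
]
terms, matrix = term_document_matrix(docs)
scores = dict(zip(terms, tf_idf_scores(matrix)))
```

In this toy corpus, "cell" occurs in every document, so its IDF and therefore its TF–IDF weight are zero, while "retina", which is frequent but confined to one document, receives the largest weight.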
The additional IDF measure penalizes general terms that appear frequently in many documents and favors
Fig. 4. Rank assignment for unigrams and Ngrams. Two-way ranking is carried out by ranking first the unigrams vertically followed by ranking the Ngrams in each row horizontally.
specific terms that appear frequently in a relatively small number of documents.
However, in our exploration we found that technical keywords often have the following properties:
(P1) Keywords often appear in documents that deal with similar topics.
(P2) Some keywords co-occur with other keywords in a document.
These properties are not directly considered in TF–IDF. Hence, we propose new measures.
To incorporate (P1) into the measure, we consider the inverse topic frequency, ITF, instead of IDF. More specifically, ITF(i) is given by
$$\mathrm{ITF}(i) = \log \frac{K}{\mathrm{TPF}(i)}, \qquad (4)$$
where TPF(i), the topic frequency of term i, is the number of topics to which documents containing term i belong, and K is the total number of topics. This measure penalizes general terms that appear in many topics in the same way that IDF penalizes general terms that appear in many documents.
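Eq. (4) can be sketched as follows, assuming that every document has already been assigned a topic label by some clustering step. The function name `itf_scores`, the toy matrix, and the label list are hypothetical, and the natural logarithm is again an assumption.

```python
import math

def itf_scores(matrix, labels, n_topics):
    """Inverse topic frequency for each term.

    matrix[i][j]: count of term i in document j.
    labels[j]:    topic (cluster) id of document j, assumed given.
    Every term is assumed to occur in at least one document.
    """
    scores = []
    for row in matrix:
        topics = {labels[j] for j, c in enumerate(row) if c > 0}
        tpf = len(topics)                         # TPF(i)
        scores.append(math.log(n_topics / tpf))   # Eq. (4)
    return scores

# Four documents grouped into three topics (toy data).
labels = [0, 0, 1, 2]
matrix = [
    [1, 1, 1, 1],   # term 0: present in every topic
    [2, 1, 0, 0],   # term 1: two documents, both in topic 0
    [0, 0, 0, 3],   # term 2: a single document in topic 2
]
itf = itf_scores(matrix, labels, n_topics=3)
```

Terms 1 and 2 receive the same ITF because each is confined to one topic, whereas IDF would rank term 1 lower for appearing in two documents. This is the behavior that property (P1) asks for.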
Fig. 5. Ngram evaluation method. Two-way ranking induces the accumulation of interesting terms towards the leftmost and uppermost part of the rectangle.
To compute TPF, we first have to obtain document clusters, each of which corresponds to a latent topic (research field). For this purpose, we apply the spherical K-means (SKM) algorithm (see Duda et al., 2001; Dhillon et al., 2001, 2003, for details) to the document vectors shown in Fig. 3. Clearly, ITF depends on the value of K. However, as we describe later, the final ranking performance is not very sensitive to the choice of K.
Next, to incorporate (P2) into the measure, we propose the term-document co-occurrence frequency, TDCF(i), which is defined by
$$\mathrm{TDCF}(i) = \log \frac{1}{\min_{j \neq i} \mathrm{Fisher}(i, j)}, \qquad (5)$$
where Fisher(i, j) denotes the probability value (p-value) of Fisher's exact test for the co-occurrence relationship between terms i and j. We omit the details, but intuitively, when the number of co-occurrences of terms i and j within documents is large, the value of Fisher(i, j) becomes small. Thus, the inverse of $\min_{j \neq i} \mathrm{Fisher}(i, j)$ gives the maximum degree of co-occurrence of term i with any other term. We experimentally confirmed that incorporating the log-scale into its measure
Fig. 6. Precision/recall performance. Both plots indicate that TF–ITF–TDCF has better performances in VR (a) and IOVS (b) compared to the conventional approaches.
Fig. 7. Confidence interval plots. Using 10 trials, it is apparent that the TF–ITF–TDCF has superior average recall performance in both VR (a) and IOVS (b).
Fig. 8. Recall performances of TF–ITF–TDCF for different K. For K in the range of 80–200, its recall performances do not significantly vary in both VR (a) and IOVS (b).
was appropriate, as in IDF and ITF. Finally, our new ranking measure for term i is given by
$$\text{TF–ITF–TDCF}(i) = \mathrm{TF}(i) \times \mathrm{ITF}(i) \times \mathrm{TDCF}(i). \qquad (6)$$
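The paper omits the details of Fisher(i, j), so the following is only a sketch of Eqs. (5) and (6) under stated assumptions: co-occurrence is counted at the document level in a 2 × 2 contingency table, and the p-value is the one-sided (upper-tail) Fisher exact probability. The function names, the toy matrix, and the topic labels are hypothetical, and topic labels are taken as given.

```python
import math

def fisher_greater(a, b, c, d):
    """One-sided Fisher exact p-value P(X >= a) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    tail = sum(math.comb(row1, k) * math.comb(n - row1, col1 - k)
               for k in range(a, min(row1, col1) + 1))
    return tail / math.comb(n, col1)

def tdcf_scores(matrix):
    """Eq. (5): log of the inverse of the smallest p-value over all other terms."""
    n_docs = len(matrix[0])
    present = [{j for j, c in enumerate(row) if c > 0} for row in matrix]
    scores = []
    for i, docs_i in enumerate(present):
        p_min = 1.0
        for j, docs_j in enumerate(present):
            if j == i:
                continue
            a = len(docs_i & docs_j)      # documents containing both terms
            b = len(docs_i) - a           # term i only
            c = len(docs_j) - a           # term j only
            d = n_docs - a - b - c        # neither term
            p_min = min(p_min, fisher_greater(a, b, c, d))
        scores.append(math.log(1.0 / p_min))
    return scores

def tf_itf_tdcf(matrix, labels, n_topics):
    """Eq. (6): product of TF, ITF, and TDCF for each term."""
    result = []
    for row, tdcf in zip(matrix, tdcf_scores(matrix)):
        tf = sum(row)
        tpf = len({labels[j] for j, c in enumerate(row) if c > 0})
        result.append(tf * math.log(n_topics / tpf) * tdcf)
    return result

# Six documents in three topics (toy data).
labels = [0, 0, 0, 1, 1, 2]
matrix = [
    [1, 1, 1, 0, 0, 0],   # term 0: always co-occurs with term 1
    [2, 1, 1, 0, 0, 0],   # term 1
    [1, 0, 0, 1, 0, 1],   # term 2: scattered across topics
    [1, 1, 1, 1, 1, 1],   # term 3: present in every document
]
tdcf = tdcf_scores(matrix)
scores = tf_itf_tdcf(matrix, labels, n_topics=3)
```

Terms 0 and 1 always appear together, so each gets a TDCF of log 20, while term 3, which occurs everywhere, carries no co-occurrence information and scores zero. The pairwise loop is quadratic in the number of terms, so a corpus of realistic size would need pruning or a sparse representation.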
2.4. Performance evaluation
For general applicability, the extraction process has to take into account the presence of multiple-term keywords. Fig. 4 outlines how both unigrams and Ngrams are ranked. First, vertical ranking is used to rank the unigrams, followed by horizontal ranking of the Ngrams. Note that the Ngrams in each row contain the corresponding unigram on their leftmost side, which serves as their anchor or root word. Because ranking covers both the vertical and horizontal directions, the evaluation criterion uses the concept of an ideal rectangle. The evaluation is carried out by counting the number of keywords inside this rectangle (Fig. 5), anchored at the upper-left part of the final table. In the ideal case, all keywords are located inside this rectangle.
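The two-way ranking and rectangle-based evaluation described above can be sketched as follows. The table layout (each row starts with its unigram, followed by its Ngrams), the function names, and the precision/recall definitions over the rectangle are assumptions for illustration; the paper does not spell them out.

```python
def two_way_rank(unigram_scores, ngram_scores):
    """Rank unigrams vertically, then the Ngrams of each unigram horizontally.

    unigram_scores: {unigram: weight}
    ngram_scores:   {unigram: {ngram: weight}}, Ngrams grouped by root word
    Returns a list of rows, each row = [unigram, ngram_1, ngram_2, ...].
    """
    table = []
    for uni in sorted(unigram_scores, key=unigram_scores.get, reverse=True):
        ngrams = ngram_scores.get(uni, {})
        table.append([uni] + sorted(ngrams, key=ngrams.get, reverse=True))
    return table

def rectangle_eval(table, keywords, height, width):
    """Count keywords inside the top-left height x width rectangle."""
    inside = {term for row in table[:height] for term in row[:width]}
    hits = len(inside & set(keywords))
    precision = hits / len(inside) if inside else 0.0
    recall = hits / len(keywords) if keywords else 0.0
    return precision, recall

# Toy weights and reference keywords (hypothetical).
unigram_scores = {"retina": 3.0, "cell": 1.0, "motion": 2.0}
ngram_scores = {
    "retina": {"retina ganglion": 0.5, "retina pigment epithelium": 0.9},
    "motion": {"motion perception": 0.7},
}
keywords = {"retina pigment epithelium", "motion", "cell", "contrast"}

table = two_way_rank(unigram_scores, ngram_scores)
precision, recall = rectangle_eval(table, keywords, height=2, width=2)
```

Here the 2 × 2 rectangle contains four terms, two of which are reference keywords, giving a precision and a recall of 0.5 each. Enlarging the rectangle trades precision for recall, which is how a curve over topN could be traced.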
Fig. 9. VR (a) and IOVS (b) colormaps. The hot pixels along the left side represent terms that are part of the keyword vocabulary list. It is apparent that TF–ITF–TDCF ranking produces a greater number of hot pixels at the upper-leftmost part of the VR and IOVS colormap tables than the TF.
We initially tested the effect of combining different weighting schemes in the two-way ranking scheme described in Fig. 4. The test indicates that ranking performance is dominated by the weighting scheme used for the vertical ranking. Hence, the final implementation employs a single weighting scheme for both horizontal and vertical ranking.
Fig. 6(a and b) shows the precision and recall of the three weighting schemes in returning the topmost terms (topN) of VR and IOVS, respectively. Among the three schemes, TF–ITF–TDCF has the best performance in both precision and recall.
Since the proposed scheme relies on spherical K-means clustering (Dhillon et al., 2003), it is important to check the significance and consistency of its performance. Hence, we conducted significance testing using 10 trials for each database and analyzed the results with pairwise t-tests. Our tests indicate that the performance advantages of TF–ITF–TDCF in VR and IOVS are significant at the 0.05 level. Fig. 7(a and b) shows the confidence interval plots for VR and IOVS. The plots indicate the superior performance of the proposed approach over the two conventional approaches.
As mentioned in the previous section regarding the choice of K for TPF, we tested how sensitive the performance of TF–ITF–TDCF is to the value of K. Fig. 8 indicates that its ranking performance in both VR and IOVS is not very sensitive to values of K between 80 and 200 clusters. Given these results, we decided to use K = 100 for computing ITF in both VR and IOVS.
2.5. Visualization of results
To help visualize the effectiveness of the different ranking schemes involving Ngrams, Eq. (7) describes a colormap assignment based on the relative rank of each Ngram term with respect to the term with minimum rank. Fig. 9(a) shows the corresponding color weight lookup table. The higher the rank of a term, the lighter its color; the lower its relative rank, the darker its color.
$$\text{color-index}(i, j) = \frac{\log_{10}[\mathrm{rank}(i, j) - \mathrm{minRank}]}{\log_{10}[\mathrm{maxRank} - \mathrm{minRank}]} \qquad (7)$$
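A small sketch of Eq. (7) follows. The function name `color_index` is hypothetical. Eq. (7) is undefined for the term whose rank equals the minimum rank, because the logarithm of zero does not exist, so mapping that case to 0 is an assumption of this example rather than something the paper specifies.

```python
import math

def color_index(rank, min_rank, max_rank):
    """Normalized log-scale index of a rank, following Eq. (7)."""
    if max_rank - min_rank <= 1:
        raise ValueError("max_rank must exceed min_rank by more than 1")
    if rank <= min_rank:
        return 0.0   # assumption: Eq. (7) is undefined here (log10 of 0)
    return math.log10(rank - min_rank) / math.log10(max_rank - min_rank)

# Ranks running from 1 to 101 (toy values).
low = color_index(1, 1, 101)
mid = color_index(11, 1, 101)
high = color_index(101, 1, 101)
```

The logarithm compresses large rank differences: a term only 10 ranks above the minimum is already halfway along the index range, while the term at the maximum rank maps to 1.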
Fig. 9(a and b) shows the colormaps of the worst and best ranking schemes (TF versus TF–ITF–TDCF) for VR and IOVS, respectively. We show only these two schemes to highlight the differences between the two extremes. The TF–ITF–TDCF colormaps for VR and IOVS show a higher concentration of keywords than the TF colormaps within a pixel window of the same size (350 × 250). As previously demonstrated by the Wilcoxon test, all of the schemes have a positive influence on filtering keywords, as indicated by the relatively high density of hot pixels (terms in the keyword vocabulary list) lying near the leftmost column of their colormap tables. However, the greater density
Fig. 10. Stability of VR and IOVS rank assignments of terms. TF–ITF–TDCF rank assignments of terms were applied to 400 randomly selected documents using 256 trials. The purpose is to determine which terms have stable rank assignments (constant rank assignments in 256 different trials), depicted by darker colors. Results show that the stable terms are unigrams in VR and bigrams in IOVS. These results strongly suggest that most of the interesting terms in VR are unigrams while those in IOVS are bigrams, which agrees with the true nature of the keywords in VR and IOVS.
of hot pixels in the TF–ITF–TDCF colormaps compared with the TF colormaps suggests that TF–ITF–TDCF is much better at discriminating keywords from non-keywords.
Since the proposed algorithm uses spherical K-means, which relies on random initialization of centroids, it is important to check the algorithm's consistency and stability in generating rankings. The stability test (Fig. 10) applies TF–ITF–TDCF to 400 randomly selected documents in each of 256 independent trials. The experiments record the number of times each term appears at a particular rank assigned by TF–ITF–TDCF; terms with a high occurrence rate have a stable rank order and are most likely keywords. It is interesting to observe that the terms with a high occurrence rate are unigrams in VR and bigrams in IOVS. As shown in Fig. 1, VR keywords are mostly unigrams while IOVS keywords are mostly bigrams. The experiments were able to detect this difference between VR and IOVS even though the raw data do not explicitly provide this information.
3. Conclusion
Extracting relevant terms that closely match experts' preferences is a great challenge because the data are high-dimensional, sparse, and noisy. In spite of these difficulties, we have demonstrated encouraging results that should help lessen the burden of identifying and coherently organizing highly relevant terms. One main target of this work is the development of a tool that automates the entire process of term extraction and keyword/topic identification. This tool (Fig. 11) will allow experts to immediately recognize the most relevant terms and easily add their preferred terms to the list of suggested terms. The main engine for weighting will be TF–ITF–TDCF. However, to allow greater flexibility, the tool will also support other weighting schemes, which users can choose based on their assessment of the accuracy of each scheme's results.
In the future, we would like to embed this technology in neuroinformatics portal sites to help users organize their local databases by indexing relevant terms and incorporating their preferences into the automated output. It will help ordinary users organize information of particular interest to them. Information overload is one side effect of the advancement of information technology, and we also need information technology to combat it. Our research is one attempt to help scientists manage their information resources, and we expect that tools similar to the one we developed will become increasingly important as the information era matures.
Fig. 11. Keyword extractor. The main window is composed of three panes. The main pane (leftmost) lists the results of terms and they are ranked according to a preferred weighting measure. The middle pane contains the list of Ngram keywords corresponding to the selected unigram keyword of the main pane. Finally, the rightmost pane lists the terms selected to become part of the index keywords in a particular database.
References
Dhillon, I., Fan, J., Guan, Y., 2001. Efficient clustering of very large document collections. In: Data Mining for Scientific and Engineering Applications. Kluwer Academic Publishers, pp. 357–381.
Dhillon, I., Mallela, S., Kumar, R., 2003. A divisive information-theoretic feature clustering algorithm for text classification. J. Mach. Learn. Res. 3, 1265–1287.
Duda, R., Hart, P., Stork, D., 2001. Pattern Classification. John Wiley & Sons, U.S.A.
Porter, M., 1980. An algorithm for suffix stripping. Program 14 (July), 130–137.
Salton, G., 1991. Developments in automatic text retrieval. Science 253, 974–979.
Salton, G., McGill, M., 1983. Introduction to Modern Information Retrieval. McGraw-Hill Book Company.
Usui, S., 2003a. Neuroinformatics research for vision science: NRV project. Biosystems 71, 189–193.
Usui, S., 2003b. Visiome: neuroinformatics research in vision project. Neural Netw. 16, 1293–1300.