Electronic editor: automatic content-based sequential compilation of newspaper articles


Neurocomputing 43 (2002) 91–106

www.elsevier.com/locate/neucom

Ville Ollikainen a,∗, Christer Bäckström a, Samuel Kaski b

a VTT Information Technology, P.O. Box 1204, 02044 VTT, Finland
b Neural Networks Research Centre, Helsinki University of Technology, P.O. Box 5400, 02015 HUT, Finland

∗ Corresponding author. E-mail address: ville.ollikainen@vtt.fi (V. Ollikainen).

Abstract

New information carriers, such as electronic books and MP3 players, can be utilized for displaying customized content. Using these carriers, however, only browsing forwards and backwards is easy. The crucial question in making these carriers user-friendly is then to construct an order of presentation that enhances readability. We have developed a tool that uses the self-organizing map algorithm of Kohonen to automatically organize a collection of text articles into a meaningful content-based sequential order. The article sequence constructed by the system was compared to the sequences made by 21 humans, and in our small-scale case study they were comparable. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Content; Compilation; Order; Self-organizing map

1. Introduction

The publishing sector has changed radically in recent decades. Earlier, certain content, for example, a newspaper article, was prepared for a certain publication and not used in other contexts. The content was organized manually into a suitable order. Now that electronic publishing is more common, it has become easier to publish the same content simultaneously through multiple channels and to reuse the same content later. The content is stored in a database as individual articles and retrieved when needed. The compilation is still largely a manual process, but the order of presentation is less critical whenever there is a possibility to provide hyperlinks for the end-user.



Fig. 1. An electronic book, for example, Rocket Book by NuvoMedia, Inc., provides an easy means for browsing forwards and backwards.

By using internet-style hyperlinks, the end-user may easily control the order of reading interactively.

New information carriers, such as electronic books (Fig. 1), electronic paper, WAP phones, electrically erasable ink, and MP3 players, will be utilized extensively as reusable carriers for the content. The display capacity of most of these carriers is, however, limited in that they mainly provide a means of browsing forwards and backwards. Hyperlinks are either more cumbersome to use or even impossible to implement. These media are linear, resembling radio and television broadcasts: the content is presented sequentially. Under these circumstances a carefully compiled sequential order enhances readability.

At the same time the increased supply has created a need for customized content that is tailored to meet each user's individual needs. User-specific filters can be created to select relevant information from the database or databases. As publications become increasingly personalized, the profitability of manual compilation becomes questionable. In personalized publication, the material has to be compiled for end-users, according to their personal interest profiles, and presented fluently on a suitable information carrier. The smaller the target group of a compilation, the greater the relative costs of manual labor.

In a meaningful sequence, articles covering the same topic should be located close to each other (Fig. 2). Furthermore, the topics should follow each other in a meaningful order, and the last article of the previous topic should preferably have something in common with the first article of the next topic.


Fig. 2. A sequential presentation compiled from different sources. As a minimal requirement, articles covering the same topics should be close to each other.

The automatic compilation must be based purely on the textual content since, in general, the content comes from various sources and there is no common external information that could help in the process.

The topic of this study has been to develop a tool that will automatically organize a collection of text articles into a meaningful content-based sequential order. In general, the meaningfulness of the order depends on two kinds of information: explicit and implicit. Explicit information can be derived directly from the content. For instance, two articles covering oil prices can easily be deduced from the terminology to belong to the same subject matter, as can two articles on Formula 1 racing. Implicit information, on the other hand, is based on general knowledge and is open to various interpretations. As an example of implicit information, the topics Christmas and Bethlehem may have a logical connection that is obvious to human readers. In this work, we aim at an automatic system which puts explicitly similar articles close to each other in a sequential order. This task is made more complicated by the fact that we cannot directly measure the fluency of such an order, which ultimately should be an end-user's subjective impression.

The material used for the study was the news database of more than 20,000 news articles and TV news topics created in the IMU project [18] of the Technical Research Centre of Finland (VTT). The material was produced mainly for an electronic book and a speech synthesizer, with the aim of building a narrative proceeding smoothly from one topic to the next.

The technical goal of the present work is to construct a viable automatic system for ordering documents that meets the urgent practical need for such a system. Thus, the system can be considered a proof of concept. The viability of the system will be judged empirically by comparing the order it has made with orders made by humans.


The system also needs to fulfill a set of technical conditions: it must be fast enough; the organizing process needs to be repeatable for new articles without changing the organizing criteria unnecessarily; and, ideally, it should be possible at a later date to complement the system with methods that analyze the document sets further, for example, suggesting topic changes based on the detected cluster structure. The specific technical choices that were made to fulfill these criteria will be justified in this paper, but an empirical search for optimal solutions will be left to subsequent papers.

2. The ordering problem

Assuming that the set of documents is fixed and suitable pairwise distances between the documents are known, the problem of ordering the documents according to their similarity is analogous to the well-known traveling salesman problem (TSP). In the TSP the shortest path that travels through all the cities is sought; here the cities are simply replaced by the documents. If the documents are then ordered according to their location on the shortest path, the overall distance between neighboring documents will be minimized. Unfortunately, solving the TSP is very hard; the problem is NP-complete and must in practice be solved using approximations.

TSP problems have previously been tackled successfully [1,2] with the self-organizing map (SOM) algorithm of Kohonen [9,10]. Besides providing a solution to the TSP, the SOM has the additional advantage that it can potentially be extended into a data exploration tool and even a user interface, as in the WEBSOM system [6,7,12,13]. The WEBSOM organizes text document collections on SOM-based interactive graphical displays that aid the user in browsing, searching, and filtering texts. The present system differs from the WEBSOM not only in many technical details but also in that its main purpose is to organize the articles into a sequential order for their presentation in sequential media. The WEBSOM articles published so far have not taken account of the restrictions imposed by the medium. The need for both kinds of systems is urgent, and the example provided by the WEBSOM ensures that a SOM-based solution for the sequential ordering problem has potential for subsequent extensions.

Several kinds of clustering algorithms have been applied to text documents, from established standard methods (for a review see [14]) to recent developments such as "distributional clustering", i.e. clustering of text documents based on the distribution of words in them, and vice versa [4,15,17]. Plain clustering does not, however, solve the TSP-type problem of lining the documents up for their presentation on a sequential medium.

The problem of ordering the documents is additionally complicated by the need for suitable distance measures: if the distances do not reflect the perceived similarity of the document topics, the automatic methods will be useless. Several kinds of distance measures have been proposed; for a review see [14]. The well-known latent semantic indexing method [3] is perhaps the best-known example of methods that use the "latent structure" in the term and document space to derive meaningful distance measures.


Fig. 3. The compilation process. Ordering of articles produces an order index for each article. When a presentation is compiled, the articles are sorted by the index.

Such ideas have recently been extended by using information geometric methods [5]. Moreover, methods such as the distributional clustering mentioned above implicitly use distance metrics derived from the co-occurrences of terms and documents when performing their task.

The main goal of the present paper is to construct one possible method for organizing text articles into a topically sensible sequential order, in order to show that such a system is viable. The system will be compared empirically with human experts. Our main contribution is a proof of concept: we construct a system for a novel task for which, to our knowledge, no solution exists so far, and verify its usefulness by empirical comparison with human experts using methods introduced in this paper. We have adopted an extensible SOM-based method for solving the TSP, and we argue that the SOM-based system fulfills the special requirements of our task. We will use one possible simple distance measure for assessing the similarity between text documents. The system will be described in detail in Section 3. Other alternative technical choices will be investigated later, assuming the method is proven viable in the empirical validation.

3. The compilation process

The compilation process can be divided into four parts: extraction of the key data required to create a sequential order, preprocessing of the key data, organizing the material into a sequential order on the basis of the preliminarily processed data, and compiling a presentation from the sequential material (Fig. 3).


The organizing process is performed as a batch run, which starts a few minutes after new articles have arrived in the IMU database. In each batch, the most recent 100 articles are organized. As a result of organizing the articles, an order index is returned to the database for every article. In the compilation stage the articles are sorted by the order index to get the desired sequential order.

3.1. Extracting key data

Key words are extracted from the contents of the articles to form the feature vectors of the articles. In the IMU project, good results have been obtained by using the nouns appearing in the articles as key words, and we decided to use the same method here. The nouns were extracted from the sentences by using a morphological analysis program (Morfo of Kielikone Oy [8]) that has been designed for Finnish-language vocabularies; similar programs exist for other languages. On average, one article contains fewer than 100 key words.

3.2. Preprocessing key data

It is particularly important that the information to be analyzed has been appropriately preprocessed. The objective is to extract the most essential data from all the available information, as the order will be based on all the extracted data. It is almost as important not to discard useful information. A definite choice was made to pass through too much rather than too little data. This choice was made at an early phase of development, when it became obvious that too much filtering deteriorated performance. We decided to use all the nouns in the articles as key words, excluding only those that occurred within a single article only. Such words would not have contributed to the organizing process. By gathering more words from the articles, the chances of obtaining stray words from a common terminology, and thus also obtaining implicit information, increase.

Because the objective was to present a proof of concept, we tried to keep the preprocessing simple without compromising functionality. Hence we decided to use the traditional vector space model [16] of information retrieval, with binary encoding of the word occurrences. Table 1 shows an example of the feature vectors of nine articles. If a key word appears in an article, the value of the corresponding component in the feature vector is 1.0; otherwise the value is 0.0.

It is usual in information retrieval systems to weight the words with, for instance, their frequency of occurrence in the article times the inverse of their frequency of occurrence in different documents. In this study the number of articles processed in one batch was too small to compute inverse document frequencies. Moreover, component values directly or inversely proportional to the frequency of a word gave subjectively inferior results compared to the binary coding. It is plausible that there is less noise in binary coding when the documents are relatively short. In our case an optimal coding (and likewise an optimal extraction of key data and an optimal preprocessing as a whole) would extract and weight information in the same, unknown manner as a human reader does when comparing the similarity of two newspaper articles.
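To make the encoding concrete, the following is a minimal sketch of the binary vector-space representation described above. It assumes each article has already been reduced to its list of extracted nouns; the function and variable names are ours, not those of the original implementation.

```python
# Minimal sketch of the binary vector-space encoding (assumed interface:
# each article is a list of extracted nouns; names are illustrative).
import numpy as np

def build_binary_vectors(articles):
    """articles: list of lists of key words (nouns), one list per article.
    Returns (X, vocab): one row per article, one column per key word
    that occurs in at least two articles."""
    # Count in how many articles each noun occurs.
    doc_freq = {}
    for nouns in articles:
        for w in set(nouns):
            doc_freq[w] = doc_freq.get(w, 0) + 1
    # Discard words confined to a single article: they cannot
    # contribute to pairwise similarity.
    vocab = sorted(w for w, df in doc_freq.items() if df >= 2)
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(articles), len(vocab)))
    for a, nouns in enumerate(articles):
        for w in set(nouns):
            if w in index:
                X[a, index[w]] = 1.0  # binary coding: presence only
    return X, vocab
```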


Table 1. Examples of feature vectors.

Fig. 4. A 1-dimensional self-organizing map, SOM. The circles around the map units mark a neighborhood (here of radius 2) around the winner.

To conclude, we preferred to maintain simplicity rather than to work on more sophisticated methods.

In the example shown in Table 1 the feature vectors point at the corners of a hypercube in 7-dimensional space. In practice, 100 articles retrieved from the IMU database will provide 1000–1200 key words, and the feature vectors are, therefore, 1000–1200-dimensional.

3.3. Organizing the articles into a sequence by Kohonen's self-organizing map

The sequential ordering is related to the traveling salesman problem: how to visit a set of cities with minimal mileage. In the ordering process two similar articles are located close to each other in the N-dimensional feature space and should be visited consecutively, in the same way as nearby locations in the traveling salesman problem. The SOM can be used to construct such an order (for detailed accounts of solving the TSP with SOMs, see [1,2]).

A 1-dimensional SOM consists of a row of regularly spaced map units (cf. Fig. 4). The SOM may be higher-dimensional, but in the present application we only need a 1-dimensional SOM. There is a model vector m_i attached to each unit i. The dimensionality of the model vectors is equal to the dimensionality of the inputs, and as a result of the SOM learning algorithm the model vectors form an ordered regression in the input space: map units that are neighbors on the map have similar model vectors, and the density of the model vectors in the input space reflects the density of the input data.


The original version of the SOM algorithm consists of the repeated application of two steps. For a randomly chosen input vector x, first the winning unit c, the unit for which the Euclidean distance between the input x and the model vector m_c is the smallest, is chosen according to

    c = arg min_i ‖x − m_i‖.                                        (1)

The winning unit and its neighbors on the map are then adapted towards the input x. When these steps are applied iteratively with different randomly chosen inputs x while narrowing the neighborhood gradually, the map units gradually specialize in representing different kinds of inputs in an ordered fashion: neighbors represent similar domains in the input space. A more detailed description of the algorithm, and of the newer, more efficient batch version, can be found in [10].

The amount of adaptation in each iteration t is defined by a coefficient α(t):

    m_i(t + 1) = m_i(t) + α(t) h_{ci}(t) [x(t) − m_i(t)].           (2)

Here h_{ci}(t) is the neighborhood function discussed in more detail below. The coefficient α(t) decreases linearly from its initial value α_0 to zero as the iteration count t reaches the total number of iterations N_iter. Here we use the following formula:

    α(t) = α_0 (N_iter − t) / N_iter.                               (3)

The neighborhood function h_{ci}(t) was defined as a bubble neighborhood: if the distance between the map units c and i is smaller than a certain radius r, h_{ci}(t) is one; otherwise it is zero. The map units within the radius are thus adapted by the same amount, given by α(t), while the map units outside the radius remain as they are. The neighborhood radius r decreases linearly from its original value r_0 to one as the iteration count t reaches the total number of iterations N_iter:

    r(t) = 1 + (r_0 − 1)(N_iter − t) / N_iter.                      (4)

In this study the feature vectors described in Section 3.2 were used as the inputs x. The maps, consisting of 100 neurons, were computed using the SOM_PAK program package [11]. The SOM was organized in two phases: in the first phase the number of iterations N_iter was 1000, α_0 = 0.45 and r_0 = 18. The second phase was a fine-tuning with N_iter = 10,000, α_0 = 0.02 and r_0 = 3.

After the SOM has become organized, every feature vector is entered into the map once again. In this final process every article gets an order index: the order index of each article is the index of the corresponding winner neuron. Similar articles have winner neurons close to each other, and thus their order indexes are close to each other as well. To get the articles into a sequential order, they are sorted by the order index, as described in Section 3.4.
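For concreteness, the following is a minimal NumPy sketch of the two training phases and of the final assignment of order indexes. The original system used the SOM_PAK package, so this is a simplified stand-in following Eqs. (1)-(4), not the authors' implementation.

```python
# Simplified 1-dimensional SOM with a bubble neighborhood, Eqs. (1)-(4).
import numpy as np

def train_1d_som(X, n_units=100, n_iter=1000, alpha0=0.45, r0=18.0,
                 M=None, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    if M is None:                       # random initialization by default
        M = rng.random((n_units, X.shape[1]))
    for t in range(n_iter):
        x = X[rng.integers(len(X))]                    # random input vector
        c = np.argmin(np.linalg.norm(M - x, axis=1))   # winner unit, Eq. (1)
        alpha = alpha0 * (n_iter - t) / n_iter         # learning rate, Eq. (3)
        r = 1.0 + (r0 - 1.0) * (n_iter - t) / n_iter   # radius, Eq. (4)
        hood = np.abs(np.arange(n_units) - c) <= r     # bubble neighborhood
        M[hood] += alpha * (x - M[hood])               # adaptation, Eq. (2)
    return M

def order_indexes(X, M):
    """Order index of each article = index of its winner unit."""
    return np.array([np.argmin(np.linalg.norm(M - x, axis=1)) for x in X])

# Two phases as in the text: rough organization, then fine-tuning.
# M = train_1d_som(X, n_iter=1000, alpha0=0.45, r0=18.0)
# M = train_1d_som(X, n_iter=10000, alpha0=0.02, r0=3.0, M=M)
# idx = order_indexes(X, M)  # sorting the articles by idx gives the sequence
```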


3.3.1. Incremental learning

Normally, when computing a SOM, the values of the model vectors are initialized either randomly or evenly spaced along the principal components of the data. The goal of the computation is then to find the optimal configuration of the model vectors. In the present application the goal is to compute easily digestible batches by organizing the latest 100 articles, as it was not feasible to process all 20,000 articles in the IMU database every time. On the other hand, if the batches were organized without any guidance, there would be no guarantee that the orders in consecutive batches would resemble each other. This would be a drawback, because a user collecting more than 100 articles would get order indexes from separate overlapping batches, and they would not be compatible with each other. A satisfactory order should reflect the contents of the articles, but additionally the processing of successive batches should be based on similar criteria.

To attain this goal, an incremental learning procedure was developed. In the incremental learning process we try to gently advise the SOM to make consecutive orders alike. This is done by picking out the vector components of those key words that already existed in the previous material and not initializing the respective values with random numbers, but using their existing values instead. Only the vector components related to new key words are initialized in the normal way (a small sketch of this initialization is given at the end of Section 3.4 below). This method does not prevent the map from becoming freely organized, but gives it a chance to develop from the state it arrived at last time. The incremental learning was introduced to obtain similarity between consecutive batches, thereby making it possible to run smaller batches that have a tendency to produce similar order indexes. The incremental learning procedure will be studied in more detail in further papers.

3.4. Compiling the presentation

The presentation is compiled from the database using a separate program, which collects the articles from the IMU database according to the order index and composes the material in a form suitable for each presentation platform. Three different information carriers were used:

• Web browser (HTML),
• Rocket Book electronic book (see Fig. 1),
• Windows Media Player (WAV).

WAV files were created by a speech synthesizer (Mikropuhe by Timehouse Oy [19]). They can be recorded for an MP3 player or written onto an audio CD. The presentation platform is selected from a pull-down menu in the compilation program. It is also possible to set the desired number of the latest articles to download, or to choose the desired period of time. If the output format is Rocket Book, the content is downloaded directly to the unit through a docking station. If it is WAV, the appropriate speech synthesizer DLLs are called. Plain HTML is possible as well. Fig. 5 shows the user interface of the compilation program.


Fig. 5. User interface of the compilation program.
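The sketch promised in Section 3.3.1: a minimal, illustrative version of the incremental initialization, under the assumption that the trained model vectors and vocabulary of the previous batch are available. The names are ours.

```python
# Sketch of the incremental initialization of Section 3.3.1: model-vector
# components of key words seen in the previous batch keep their learned
# values; only the components of new key words are initialized randomly.
import numpy as np

def incremental_init(prev_M, prev_vocab, new_vocab, rng=None):
    """prev_M: model vectors of the previous batch,
    shape (n_units, len(prev_vocab)). Returns initial model vectors
    for the new batch, shape (n_units, len(new_vocab))."""
    rng = np.random.default_rng(0) if rng is None else rng
    M = rng.random((prev_M.shape[0], len(new_vocab)))  # default: random
    prev_index = {w: j for j, w in enumerate(prev_vocab)}
    for j, w in enumerate(new_vocab):
        if w in prev_index:                     # key word existed before:
            M[:, j] = prev_M[:, prev_index[w]]  # reuse its learned values
    return M
```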

4. Case study

The quality of the article sequence constructed by the SOM was evaluated by comparing it with sequences constructed manually by human subjects. Twenty-one people participated in the test. Each participant was given 11 newspaper articles retrieved from the IMU database, printed on paper and shuffled, and the goal was to order the articles according to the similarity of their content. All articles were from the same day. All extraneous information, such as the date, time stamps, and database indices, was removed from the printed articles. The instructions were to create an order that would allow reading to progress as naturally as possible from the first to the last article. The SOM was treated as if it were the 22nd participant.

4.1. Measures of similarity of the sequences

We assessed the similarity of pairs of sequences, constructed by two different participants, using three measures. The first measure focuses on global similarity and the third on local similarity. The second measure puts more weight on local similarity, but takes account of global aspects as well.

Denote the number of participants (22) by N and the number of articles (11) by A. Denote by I(i, k) the index, or ordinal, given by participant k to article i.


Table 2. An example of article sequences made by two participants, k and l

Heading                       i    I(i, k)   I(i, l)
Kyläläiset ryhtyvät…          1    9         7
SAK virittelee…               2    1         5
Kaavojen käsittely…           3    8         8
Asuntopula hidastaa…          4    11        9
Hämeenkyröön puuhataan…       5    10        10
Narkomaanien hoito…           6    6         4
Abdua ei ole unohdettu        7    7         6
Vahvuusrooli…                 8    2         3
CD-romin elävä kuva…          9    3         2
Tribadien yö…                 10   4         1
Glamouria ja hiuslakkaa       11   5         11

Table 3. An example of a distance matrix, D(k), for participant k

i\j    1   2   3   4   5   6   7   8   9  10  11
 1     0   8   1   2   1   3   2   7   6   5   4
 2     8   0   7  10   9   5   6   1   2   3   4
 3     1   7   0   3   2   2   1   6   5   4   3
 4     2  10   3   0   1   5   4   9   8   7   6
 5     1   9   2   1   0   4   3   8   7   6   5
 6     3   5   2   5   4   0   1   4   3   2   1
 7     2   6   1   4   3   1   0   5   4   3   2
 8     7   1   6   9   8   4   5   0   1   2   3
 9     6   2   5   8   7   3   4   1   0   1   2
10     5   3   4   7   6   2   3   2   1   0   1
11     4   4   3   6   5   1   2   3   2   1   0

An example of the article sequences made by two participants is presented in Table 2.

The distances between each pair of articles in the sequence constructed by participant k are collected into the A × A distance matrix D(k). The entry in the i-th row and j-th column of this matrix is defined by d_{ij}(k) = |I(i, k) − I(j, k)|. The value d_{ij}(k) therefore measures how far apart articles i and j are in the sequence constructed by participant k. If the articles were located next to each other, the value would be 1. The distance matrices are, of course, symmetric, and their diagonals consist of zeros. As an example, the distance matrix of participant k of Table 2 is presented in Table 3.

The distance matrices were then used to measure how closely the sequences constructed by two participants, say k and l, match each other. We used three different kinds of measures. The first, a global one, is the average of the differences


in the distances of all article pairs in the sequences,

    E_1(k, l) = Σ |d_{ij}(k) − d_{ij}(l)| / A²,                     (5)

where the sum is taken over all pairs (i, j). Compared to the first measure, the second measure puts less weight on global differences by taking the square roots of the distances before calculating the difference,

    E_2(k, l) = Σ |d_{ij}(k)^{1/2} − d_{ij}(l)^{1/2}| / A²,         (6)

where the sum is again taken over all pairs (i, j). The third, local measure measures how often the article pairs that occur consecutively in one sequence occur consecutively in the other as well. The measure is defined by

    E_3(k, l) = (1/2) #{(i, j) : |I(j, k) − I(i, k)| = 1 and |I(j, l) − I(i, l)| = 1},   (7)

where #{ } denotes the cardinality of the set de:ned within the parentheses. By using factor 1=2; E3 describes how many consecutive articles k and l have in common. Finally, we computed the average deviance of each participant from all the other participants, for each of the three measures:  E1 (k; l)=N; (8) E1 (k) = l

E2 (k) =



E2 (k; l)=N;

(9)

E3 (k; l)=N:

(10)

l

E3 (k) =

 l
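As an illustration of how Eqs. (5)-(10) can be computed, the following sketch (our own code, not part of the original system) uses the two sequences of Table 2 as test data.

```python
# Sketch of the sequence-similarity measures of Eqs. (5)-(7).
import numpy as np

def distance_matrix(I):
    """D[i, j] = |I(i) - I(j)|, cf. Table 3; I holds each article's ordinal."""
    I = np.asarray(I, dtype=float)
    return np.abs(I[:, None] - I[None, :])

def E1(Ik, Il):  # global measure, Eq. (5)
    return np.abs(distance_matrix(Ik) - distance_matrix(Il)).sum() / len(Ik) ** 2

def E2(Ik, Il):  # square-root measure, Eq. (6)
    return np.abs(np.sqrt(distance_matrix(Ik))
                  - np.sqrt(distance_matrix(Il))).sum() / len(Ik) ** 2

def E3(Ik, Il):  # local measure, Eq. (7)
    # Pairs consecutive in both sequences; each unordered pair appears
    # twice in the ordered count, hence the factor 1/2.
    both = (distance_matrix(Ik) == 1) & (distance_matrix(Il) == 1)
    return both.sum() / 2

# The two sequences of Table 2:
I_k = [9, 1, 8, 11, 10, 6, 7, 2, 3, 4, 5]
I_l = [7, 5, 8, 9, 10, 4, 6, 3, 2, 1, 11]
print(E1(I_k, I_l), E2(I_k, I_l), E3(I_k, I_l))

# Per-participant deviances, Eqs. (8)-(10), average a measure over all
# participants l, e.g. sum(E1(I_k, I_l) over l) / N.
```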

These measures show how closely the sequence constructed by a participant, according to the corresponding measure, resembled the sequences constructed by the other participants.

In addition to these person-specific measures, we also calculated article-specific measures. The motivation is that some articles are more difficult to put into an order than others. The aim here is to study whether the SOM made similar mistakes to those of the human participants. With respect to the three measures, the article-specific deviances for participant k are

    EA_1(k, i) = Σ_l Σ_j |d_{ij}(k) − d_{ij}(l)| / (AN),            (11)

    EA_2(k, i) = Σ_l Σ_j |d_{ij}(k)^{1/2} − d_{ij}(l)^{1/2}| / (AN),   (12)

    EA_3(k, i) = Σ_l #{j : |I(j, k) − I(i, k)| = 1 and |I(j, l) − I(i, l)| = 1} / N,   (13)


Table 4. Summary of test results

Measure   Preference          Mean of the          Mean of       Deviation   Median of     Expectation of    Deviation
                              "participant" SOM    mean values   (s)         mean values   random results    (σ)
E_1       Smaller is better   2.03                 2.08          0.14        2.21          2.93              2.22
E_2       Smaller is better   0.53                 0.56          0.04        0.56          1.532             0.765
E_3       Greater is better   3.38                 3.56          0.9         4             0.263             0.541

where #{·} again denotes the cardinality of the set defined within the braces. The corresponding mean values over participants are

    EA_1(i) = Σ_k EA_1(k, i) / N,                                   (14)

    EA_2(i) = Σ_k EA_2(k, i) / N,                                   (15)

    EA_3(i) = Σ_k EA_3(k, i) / N.                                   (16)
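Continuing the sketch above, the article-specific deviances and their means (Eqs. (11) and (14); EA_2 and EA_3 follow the same pattern) could be computed as follows, again with illustrative names of our own.

```python
# Sketch of EA_1(k, i) of Eq. (11) and the per-article means EA_1(i) of
# Eq. (14); `orders` is a list of N ordinal arrays, one per participant
# (the SOM counted as one participant).
import numpy as np

def ea1_matrix(orders):
    D = [np.abs(np.asarray(I, float)[:, None] - np.asarray(I, float)[None, :])
         for I in orders]
    N, A = len(orders), len(orders[0])
    out = np.zeros((N, A))
    for k in range(N):
        for l in range(N):
            # Row sums over j of |d_ij(k) - d_ij(l)|, normalized by A*N.
            out[k] += np.abs(D[k] - D[l]).sum(axis=1) / (A * N)
    return out

# ea1_per_article = ea1_matrix(orders).mean(axis=0)   # Eq. (14)
```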

4.2. Results

Table 4 summarizes the results obtained with each measure. The mean deviance of the SOM from the other participants ("Mean of the 'participant' SOM") was in all cases very close to the mean deviation between all participants ("Mean of mean values"). In order to quantify this closeness, we computed how much two entirely random sequences would differ on average ("Expectation of random results"), and what the standard deviation of the random differences would be.

We additionally measured article-specific results. As far as the first and second measures were concerned, the SOM had problems with the same articles as an average person did. In the third measure there was no obvious resemblance. Fig. 6 illustrates the article-specific results of EA_1, EA_2 and EA_3.

5. Discussion

This study has focused on a problem that is not widely known. There is no absolute answer to the question of what the correct order of newspaper articles in a sequential presentation should be; it is more or less a personal and subjective issue. In this study, we have ordered articles automatically based on a relatively simple and computationally advantageous measure of similarity, and we have shown that


Fig. 6. Article-specific results using measures EA_1, EA_2 and EA_3. In measures EA_1 and EA_2 the SOM had problems with the same articles as an average participant. In EA_3 the correspondence was less clear.

the results are of the same quality as orderings by human experts.

We have introduced and applied three measures to evaluate how closely one article sequence resembles other article sequences. The first measure (linear) was most sensitive to differences on the macro level; macro-level differences include, for example, differences in the order of whole clusters. The third measure (consecution) is sensitive only to micro-level differences, i.e. differences in which articles are consecutive within the clusters. In comparison to the first measure, the second measure (square-root based) gives more weight to micro-level differences, but takes the macro level into account as well.

A decision was made to use all the nouns in the articles as key words, excluding only those that occurred within a single article only. This is a


straightforward method, presumably language independent, and it does not need any predefined word sets or rules for selecting words. It also provides some chance of discovering common terminology and thus of obtaining implicit information as well. Its cost, however, is an increased noise level, caused by the most commonly used nouns. This may be the reason why the SOM did not attain the participant average with the micro-level consecution measure. This phenomenon will be a subject for further studies, in which the possibilities of computing inverse document frequency-type weightings and different metrics will be investigated as well.

We will later also pay more attention to the properties of the incremental learning procedure. In addition to other benefits, it could provide us with some temporal attributes for the key words. These temporal attributes may give us some idea of the importance of different words.

Further development would presumably benefit from cognitive studies that would clarify the criteria that human beings use when sorting material to form a fluent presentation. Without this knowledge the sequential ordering has a number of unknown factors, which we tried to eliminate by keeping the system simple, but without compromising its functionality. Having succeeded in this endeavor, we believe that our study is helpful as a baseline for future studies. Furthermore, we believe that the three measures introduced here make it possible to evaluate the improvements of further development.

References

[1] B. Angéniol, G. de La Croix Vaubois, J.-Y. Le Texier, Self-organizing feature maps and the traveling salesman problem, Neural Networks 1 (1988) 289–293.
[2] M. Budinich, A self-organizing neural network for the traveling salesman problem that is competitive with simulated annealing, Neural Comput. 8 (1996) 416–424.
[3] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, Indexing by latent semantic analysis, J. Am. Soc. Inf. Sci. 41 (1990) 391–407.
[4] T. Hofmann, J. Puzicha, M. Jordan, Learning from dyadic data, in: M. Kearns, S. Solla, D. Cohn (Eds.), Advances in Neural Information Processing Systems, vol. 11, Morgan Kaufmann, San Mateo, CA, 1998, pp. 466–472.
[5] T. Hofmann, Learning the similarity of documents: an information-geometric approach to document retrieval and categorization, in: S. Solla, T. Leen, K.-R. Müller (Eds.), Advances in Neural Information Processing Systems, vol. 12, MIT Press, Cambridge, MA, 2000, pp. 914–920.
[6] T. Honkela, S. Kaski, K. Lagus, T. Kohonen, Newsgroup exploration with WEBSOM method and browsing interface, Technical Report A32, Helsinki University of Technology, Laboratory of Computer and Information Science, FIN-02150 Espoo, Finland, 1996.
[7] S. Kaski, T. Honkela, K. Lagus, T. Kohonen, WEBSOM—self-organizing maps of document collections, Neurocomputing 21 (1998) 101–117.
[8] Kielikone Oy, http://www.kielikone.fi/english/.
[9] T. Kohonen, Self-organized formation of topologically correct feature maps, Biol. Cybernet. 43 (1982) 59–69.
[10] T. Kohonen, Self-Organizing Maps, Springer, Berlin, Germany, 1995 (third, extended edition 2001).
[11] T. Kohonen, J. Hynninen, J. Kangas, J. Laaksonen, SOM_PAK: the self-organizing map program package, Technical Report A31, Helsinki University of Technology, Laboratory of Computer and Information Science, FIN-02150 Espoo, Finland, 1996.


[12] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, A. Saarela, Self-organization of a massive document collection, IEEE Trans. Neural Networks 11 (2000) 574–585.
[13] K. Lagus, T. Honkela, S. Kaski, T. Kohonen, WEBSOM for textual data mining, Artif. Intell. Rev. 13 (1999) 345–364.
[14] C. Manning, H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, 1999.
[15] F. Pereira, N. Tishby, L. Lee, Distributional clustering of English words, Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, 1993, pp. 183–190.
[16] G. Salton, M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
[17] N. Slonim, N. Tishby, Document clustering using word clusters via the information bottleneck method, Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000, pp. 208–215.
[18] C. Södergård, M. Aaltonen, S. Hagman, M. Hiirsalmi, T. Järvinen, E. Kaasinen, T. Kinnunen, J. Kolari, J. Kunnas, A. Tammela, Integrated multimedia publishing: combining TV and newspaper content on personal channels, Comput. Networks 31 (1999) 1111–1128.
[19] Timehouse Oy, http://www.timehouse.fi.

Ville Ollikainen received the M.Sc. degree in Technical Physics in 1989 from Helsinki University of Technology. He worked in the technical R&D of MTV Finland in 1991–1998, with the main responsibility for developing broadcast automation and interactive teletext systems. Since 1999 he has been working at the Technical Research Centre of Finland as a senior research scientist focusing on media integration. Neural networks have a substantial role in his postgraduate studies.

Christer Bäckström received the M.Sc. degree in the Laboratories of Digital Arts and Control Theory at Helsinki University of Technology (HUT). He spent some years at HUT working in the Department of Control Theory. Since 1987 he has been working at the Technical Research Centre of Finland as a research scientist. His main interests are in image analysis, 3D graphics and neural networks.

Samuel Kaski received his D.Sc. (Tech) degree in Computer Science from Helsinki University of Technology in 1997. He is currently Professor of Computer Science at the Laboratory of Computer and Information Science (Neural Networks Research Centre), Helsinki University of Technology. His main research areas are neural computation and data mining.