A neural model that implements probabilistic topics


To appear in: Neurocomputing (www.elsevier.com/locate/neucom)
PII: S0925-2312(15)01068-1
DOI: http://dx.doi.org/10.1016/j.neucom.2015.07.061
Reference: NEUCOM15844

Received 6 March 2015; revised 21 May 2015; accepted 18 July 2015.

Cite this article as: Álvaro Cabana, Eduardo Mizraji, Juan C. Valle-Lisboa, A neural model that implements probabilistic topics, Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2015.07.061

A neural model that implements probabilistic topics

Álvaro Cabana (a,b), Eduardo Mizraji (a), Juan C. Valle-Lisboa (a,b,*)

(a) Facultad de Ciencias, Universidad de la República, Iguá 4225, Montevideo 11400, Uruguay
(b) Facultad de Psicología, Universidad de la República, Tristán Narvaja 1674, Montevideo 11200, Uruguay
(*) Corresponding author

Abstract

We present a neural network model that can execute some of the procedures used in the information sciences literature. In particular, we offer a simplified notion of topic and show how to implement it using neural networks based on the Kronecker tensor product. We show that the topic-detecting mechanism is related to Naive Bayes statistical classifiers, and that it is able to disambiguate the meaning of polysemous words. We evaluate our network in a text categorization task, obtaining performance levels comparable to Naive Bayes classifiers, as expected. Hence, we propose a simple, scalable neural model capable of dealing with machine learning tasks while retaining biological plausibility and probabilistic transparency.

Keywords: topic models, neural networks, text categorization, Kronecker product

1. Introduction

Data-driven approaches to natural language processing rely on the extraction of meaningful knowledge from large amounts of data. Many of these techniques have also been applied as models of human cognition, in particular as models of lexical, semantic and pragmatic knowledge. In recent years, tools like Latent Semantic Analysis (LSA) [5], HAL [25, 24], generative topic models [4, 7], and lately BEAGLE [16] have been developed and proved to be successful in modeling several cognitive activities [15, 10].

The fact that these tools match some capabilities of human cognition constitutes a good opportunity to understand what is required to implement complex data processing abilities such as those seen in humans [2]. Nevertheless, the fact that some methods are good at processing language-related material does not mean that the way they operate is necessarily the same as that of brains.

Given the specificity of the problems these tools tackle, their capacities might be limited to particular applications [46], and so, despite the impressive success of recent machine learning applications, there is still much to understand about how the brain solves similar challenges. Clearly, many of these algorithms cannot be considered brain-like procedures due to their weak biological plausibility. Moreover, the hardware used to run those algorithms differs from the brain in many respects, to the extent that many of them would run extremely slowly in the brain, making them unfeasible as neurally plausible models of human cognition. What we propose is to use the insights gained from successful information sciences methods and explore their possible implementation in a neural-based framework. As our efforts are directed at understanding the neural implementation of cognitive functions, we have been studying the relationship between these methods and neural network models [50, 31, 34]. In particular, we have shown that extensions (see below) of associative matrix models overlap both in capacities and formal properties with LSA. Our approach is related to the work of Serrano et al. [43, 44] on cognitive models of reading, although we use neural network models instead of spreading activation on semantic networks.

A recent advancement in this field has been the advent of deep learning methods, a powerful approach (or set of approaches) to machine learning and natural language processing. It is structured around coupled hierarchical layers (or modules) that, via different learning procedures, extract and store structured information. In these computational procedures, information reaches higher levels of abstraction through the connections between these layers (or modules) ([3]). Many deep learning procedures use Artificial Neural Networks (ANN), which acquire knowledge by means of unsupervised algorithms, that is, learning algorithms reflecting the self-organization of the sensory inputs involved in language, together with supervised learning algorithms linking the hierarchical modules ([47]).

Usually the ANN are modules containing hidden layers, and the supervised learning is based on gradient descent algorithms (e.g. backpropagation, BP). A related influential approach was developed by Hinton and collaborators (see, for instance, [45]) using multi-modular systems with an unsupervised learning phase carried out by Deep Boltzmann Machines. In the present work we describe a modular device that processes contexts using the Kronecker product between context vectors and key patterns. This kind of formalism can be thought of as another building block for deep learning approaches, where layers are conceived as modules that can be connected with other modules in order to build a deep learning hierarchy. We recently described how this modeling approach can be related to factual fMRI data obtained from the brain alterations associated with language impairments in schizophrenia ([48]). In addition, from the computational point of view, the use of the Kronecker product can have some advantages. On the one hand, in each module, any multi-layer perceptron trainable by algorithms of the BP type can be replaced by a one-layer network expanded by the extra dimensions produced by the Kronecker product and trainable with the Widrow-Hoff algorithm, another gradient descent procedure that usually exhibits more rapid convergence (see for instance [38] and [49]). On the other hand, the factorization implied by the Kronecker product allows, in many cases, a significant reduction in computational complexity ([14], [13]). There have been some recent attempts to use tensor operations (such as the Kronecker product) in deep learning architectures (see [12]), which adds to the relevance of this type of model for machine learning. A further interest in using Kronecker products in neural modeling emerges from the fact that they belong to a set of powerful matrix operations with remarkable algebraic properties ([23]).

These properties allow the construction of mathematical theories able to describe the dynamics of complex hierarchical neural networks. An example of this kind of mathematical construction shows how the neural computation of order relations in the processing of natural language, usually coded in simple propositions, can be understood as a three-level hierarchical device that transports concrete factual conceptualized data embedded in a particular contextual query (e.g. "Is this cat larger than this dog?") towards abstract neural modules capable of providing the correct answer ([32], [33]).

When we pursue our objective of inspiring neural models in machine learning techniques, the many techniques available pose something of a challenge, since most of them have something of value to add, but the connections between them are seldom made explicit. We want to move past this isolation of approaches toward unification in a powerful theory. To this end, here we explore the connection between probabilistic models and neural network models in the context of topic detection. The link between neural models and statistical inference has been recognized from the beginning of current neural modeling efforts [36, 30], and given the relationship between LSA and the other methods mentioned above, it should come as no surprise that neural models are related to probabilistic topic models. Nevertheless, there are several recent developments that call for renewed interest in connecting both worlds. First, the last few years have seen a tremendous development of probabilistic models and associated methods. Second, the amount of data that can be used to test these methods has increased. In a more theoretical vein, there are ongoing debates in the literature that oppose probabilistic models and neural network models [41, 29, 8, 28].

In the present work we propose a basic neural architecture displaying a probabilistic topic identification capacity. In the second section we discuss the notion of topic. Then we present the model and show how it can be used to achieve word sense disambiguation. In the third part we show how our model is related to probabilistic models. In the fourth section we submit our model to the stringent tests of text categorization benchmarks, showing its potential as a viable implementation of the information sciences approaches to topic detection. The main result is that one version of our model implements a definite statistical procedure.

Despite its simplicity, our model has a reasonable capacity to categorize texts. It is not our goal to match current state-of-the-art machine learning algorithms, a highly developed and dynamic research field. However, the discovery of neurally plausible procedures can open promising new research avenues and provide building blocks for future models capable of exploiting the potential of the human brain to decode highly structured linguistic patterns.

2. The notion of topic

Before describing our neural model of topic detection, let us discuss what we mean by topic. Although different disciplines refer to different variants of the concept of topic, we assume that most share the notion that a topic is a brief description of a certain domain of knowledge. Within this broad definition there surely co-exist many conceptions. For instance, for probabilistic topic models [4] a topic is a probability distribution over words, whereas for vector space methods it is a subspace of the semantic space [50], and still for others it is a text summary of the domain of knowledge involved [18]. Walter Kintsch assimilates the notion of topic to the representation of a text in a semantic space such as LSA. As such, the topic is what we already know of a particular text or passage. In his own words ([17]): "The LSA representation of a text, therefore, tells us what the text is about in terms of what we already know. It is therefore, the old information in a discourse, its theme or topic..."

The distinction made here is commonly made in everyday language use. Texts have a topic or theme, but then they tell us something new about that topic. To include the notion of topic as a neural entity, we conceive it as something close to a schema or frame, and in some simplified sense, close to a subspace in a vector space. Nevertheless, certain manifestations of topics can be well captured by probabilistic models. Indeed, a restricted domain of knowledge implies a certain probability distribution of words. In our model we consider that there are topic vectors, with an internal structure that is irrelevant for the present purposes.

3. The neural model of topic detection

To instantiate the notion of topic mentioned in the previous section, we will use context-dependent matrix memory modules (CDMM), in particular a recurrent variant that has been used to model language phenomena [49].

Figure 1: Schematic depiction of the topic detection model. Circled crosses stand for the Kronecker product between incoming vectors.

In a CDMM, a memory module receives two vectors, the input and the context, and the module associates an output to the Kronecker product of the inputs. The Kronecker product of matrices A = [a_ij] and B = [b_ij] is defined as

    A ⊗ B = [a_ij B],

and can be applied to matrices of any dimensions. It is valid for conformable matrices A, C and B, D that

    (A ⊗ B)^T (C ⊗ D) = A^T C ⊗ B^T D,

thus for vectors u, v and t, p

    (u ⊗ t)^T (v ⊗ p) = ⟨u, v⟩⟨t, p⟩.

The Kronecker operation allows a flexible, context-dependent association between pairs of inputs and outputs.
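As a quick numerical check (ours, not part of the original text), the following NumPy sketch verifies the mixed-product identity above on random conformable matrices and on vectors; numpy.kron implements the Kronecker product.

    import numpy as np

    rng = np.random.default_rng(0)

    # Random conformable matrices: A^T C needs A, C with the same number of rows;
    # B^T D needs B, D with the same number of rows.
    A, C = rng.standard_normal((4, 3)), rng.standard_normal((4, 2))
    B, D = rng.standard_normal((5, 3)), rng.standard_normal((5, 2))

    lhs = np.kron(A, B).T @ np.kron(C, D)
    rhs = np.kron(A.T @ C, B.T @ D)
    assert np.allclose(lhs, rhs)        # (A x B)^T (C x D) = A^T C x B^T D

    # For vectors the identity reduces to a product of scalar products.
    u, v = rng.standard_normal(4), rng.standard_normal(4)
    t, p = rng.standard_normal(6), rng.standard_normal(6)
    assert np.isclose(np.kron(u, t) @ np.kron(v, p), (u @ v) * (t @ p))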

The advantages and properties of these models have been described elsewhere [39]. Here we will illustrate our ideas with the simple model depicted in Figure 1. The model receives a stream of words that are mapped to an internal representation in the form of a vector. Each word is mapped to one orthogonal vector, irrespective of its meaning. This implies a dimensionality close to the number of words used, even if some degree of correlation between word vectors is tolerated. These vectors are then fed one by one to a topic selector module (M_1) together with the current topic; the output of this module is the next topic. The module M_1 is a CDMM trained to associate each word vector with different topics. In particular, the associative matrix of this module is

    M_1 = Σ_{i=1}^{n} t_i ( t_i ⊗ Σ_{j=1}^{m} Ω_j(i) p_j )^T,    (1)

where the t_i are topic vectors, the p_j are internal representations of words and the Ω_j(i) are weights that measure the importance of word j in topic i (see below). The output can be either the bare output of the matrix-vector product,

    t(s) = M_1 × (t(s-1) ⊗ p(s)),    (2)

or

    t(s) = M_1 × (t(s-1) ⊗ p(s)) + β t(s-1),    (3)

where the output is linearly combined with the previous output by means of a weighting parameter β.

In order to illustrate how the topic selector module can help in language processing tasks, we use a simplified version of the interpreter module. For this purpose we use the following output context-dependent matrix memory

    M_2 = Σ_{i=1}^{n} Σ_{j=1}^{m} c_ij (t_i ⊗ p_j)^T.    (4)

When the Kronecker product of the incoming word and the diagnosed topic enters module 2, the output is

    m_kl = M_2 × (t_k ⊗ p_l),    (5)

which is the interpretation of word l in the topic k.
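To make the construction concrete, the following sketch builds a toy version of M_1 (equation (1)) and M_2 (equation (4)) with NumPy. The vocabulary size, the weights Ω and the coefficients c_ij are invented for this illustration and are not taken from the paper.

    import numpy as np

    n_topics, n_words = 2, 3               # hypothetical toy sizes
    T = np.eye(n_topics)                   # orthogonal topic vectors t_i (rows)
    P = np.eye(n_words)                    # orthogonal word vectors p_j (rows)
    Omega = np.array([[0.7, 0.2, 0.1],     # Omega[i, j]: weight of word j in topic i
                      [0.1, 0.1, 0.8]])
    C = np.arange(1, n_topics * n_words + 1).reshape(n_topics, n_words)  # arbitrary c_ij

    # Equation (1): M_1 = sum_i t_i (t_i x sum_j Omega_j(i) p_j)^T
    M1 = sum(np.outer(T[i], np.kron(T[i], Omega[i] @ P))
             for i in range(n_topics))

    # Equation (4): M_2 = sum_ij c_ij (t_i x p_j)^T, stored here as a 1-D row vector.
    M2 = sum(C[i, j] * np.kron(T[i], P[j])
             for i in range(n_topics) for j in range(n_words))

    # Equation (2): serial update, starting from a uniform mixture of topics.
    t = np.full(n_topics, 0.5)
    for word in [0, 2]:                    # present word 0, then word 2
        t = M1 @ np.kron(t, P[word])
    print("topic output:", t)

    # Equation (5): with one-hot vectors and scalar c_ij, the readout is just the
    # stored coefficient c_kl for the dominant topic k and the incoming word l.
    k = np.argmax(t)
    print("interpretation of word 2 in topic", k, ":", M2 @ np.kron(T[k], P[2]))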

3.1. Three modes of operation of the topic selector

As we discussed above, in this model topic vectors have no internal structure; they only work as markers or pointers to the correct association in other modules. This is inherited from the simple workings of module M_2, and can be made more complex, but the simplicity allows us to concentrate on the way topics are selected. Due to the multiplicative nature of the CDMM modules, if there is no input the output of M_1 would be null. Hence, we add the requirement that if there is no input, the output of M_1 is a mixture of topics that we call the white vector:

    w = Σ_i ε_i t_i.    (6)

When a word (internally represented by p_j) enters M_1, it joins either the white vector or the previous topic vector, and their Kronecker product triggers the next topic vector.

In the first operation mode, which we call the serial mode, the outcome of a sequence of words is a linear combination of the topic vectors that lie at the intersection of the topics that each word in the sequence is relevant to. This can be readily seen if expression (2) is successively presented with words j whose weight within a topic i is Ω_j(i). After k presentations the topic diagnosed is

    t(k) = Σ_i ε_i t_i Π_{j=1}^{k} Ω_j(i),    (7)

thus the output is a linear combination of those topics for which the presented words are all relevant, that is, a topic is included in the output if and only if it lies at the intersection of the sets of topics associated with each word.

We discuss the weight coefficients below, after presenting a probabilistic interpretation of the topic selector.

The second operation mode involves the arrival of many words at once (instead of a perfect sequence), so the processing of the words occurs in parallel; hence we call this the parallel mode. Consider for instance that some words are retained in a short-term memory, and the input to M_1 is a linear combination of word vectors. If r words are retained, the input to M_1 is p_s ⊗ t with p_s = Σ_{j=1}^{r} p_{h(j)}, where h(j) is a function that assigns the correct index to each retained word. In this case, the topic selector yields

    t(r) = Σ_i ε_i t_i ( Σ_{j=1}^{r} Ω_{h(j)}(i) ),    (8)

which has to be compared with equation (7). In essence, in the case of equation (7) a topic is included in the output if all words p_1 ... p_r are relevant to it, whereas in equation (8) a topic is included in the output if p_1 or p_2 or ... or p_r is relevant to it.

The third operation mode implements something in between the serial and the parallel modes, and is in fact what we show in equation (3). After k words are presented the output would be

    t(k) = Σ_i ε_i t_i Π_{j=1}^{k} (Ω_j(i) + β).    (9)

The parameter β controls how much of the history is retained, and in some sense interpolates between the serial presentation of words and the parallel presentation of words. When β = 1, the k products generate (among other terms) the sum of the weights of each of the presented words, as in the parallel mode, and the product of all the weights, as in the serial mode. When β = 0, equation (9) contains only the products of the weights, as in the serial mode. We call this third mode of operation the mixed mode.
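The difference between the modes is easy to see numerically. The sketch below evaluates equations (7)-(9) on invented weights and an arbitrary β; it is only meant to show that the serial output vanishes for any topic that misses one of the presented words, while the parallel output needs only one relevant word.

    import numpy as np

    eps = np.array([0.5, 0.3, 0.2])               # hypothetical topic priors eps_i
    Omega = np.array([[0.6, 0.4, 0.0],            # Omega[i, j]: weight of word j in topic i
                      [0.5, 0.0, 0.5],
                      [0.2, 0.3, 0.5]])
    words = [0, 1]                                # indices of the presented words
    beta = 0.3

    # Equation (7), serial mode: product of the weights of the presented words.
    serial = eps * np.prod(Omega[:, words], axis=1)

    # Equation (8), parallel mode: sum of the weights of the presented words.
    parallel = eps * np.sum(Omega[:, words], axis=1)

    # Equation (9), mixed mode: product of (weight + beta).
    mixed = eps * np.prod(Omega[:, words] + beta, axis=1)

    for name, t in [("serial", serial), ("parallel", parallel), ("mixed", mixed)]:
        print(f"{name:9s} topic output: {np.round(t, 4)}")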

3.2. The connection of the topic selector with statistical classifiers

It is instructive to analyze the workings of the topic selector in connection with classical devices used in the language processing community. To this end we consider the meaning of the weights Ω_j(i) in a probabilistic interpretation. When the system has experienced a representative corpus, a convenient estimation of the weights is given by

    Ω_j(i) = n_j(i) / Σ_{s=1}^{m} n_s(i),    (10)

where n_u(i) (u = j, s) is the number of times the word u appears in topic i, and m is the number of words in the corpus. In this interpretation, the weights represent the probability of selecting a word j when generating the discourse, given that the topic is i. If the white vector topic is defined as in equation (6), that is,

    w = Σ_i t_i ε_i  with  ε_i = Σ_j n_j(i) / Σ_{l=1}^{n} Σ_{s=1}^{m} n_s(l),    (11)

then the normalized output is

    t(k) / ||t(k)|| = [ Σ_i ε_i Ω_1(i) Ω_2(i) ⋯ Ω_k(i) t_i ] / √( Σ_i ε_i² Π_{s=1}^{k} Ω_s(i)² ).    (12)

If we interpret ε_i as the a priori probability of selecting topic i, the normalized output is exactly the same as the expected normalized topic of a multinomial classifier, which is in effect a version of a naive Bayesian classifier. Specifically, denoting the conditional expectancy of topic t_i, given that the word p_1 is encountered α_1 times, p_2 is encountered α_2 times, ..., p_m is encountered α_m times, as E(t_i | α_1 p_1, α_2 p_2, ..., α_m p_m), we have

    E(t_i | α_1 p_1, α_2 p_2, ..., α_m p_m) = (1/Z) Σ_i t_i P(t_i) Π_{j=1}^{m} q_ji^{α_j},    (13)

where P(t_i) is the a priori probability of topic i, q_ji is the probability of selecting word p_j given that the chosen topic is t_i, and Z is a normalizing factor. This demonstrates that in its simplest version the topic selector behaves as a multinomial classifier. Multinomial classifiers are classical devices in natural language processing, and their capabilities and limitations are well known [40]. As all Bayesian classifiers, they can work surprisingly well under more general conditions than those implied by their definitions (i.e. conditional independence, see [20]). In spite of that, there are cases where other types of functioning are required, something that could reasonably be implemented using the alternative operation modes, and in particular equation (8).
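The equivalence can also be checked numerically. In the sketch below (synthetic counts, not the corpora used in the paper), Ω_j(i) and ε_i are estimated as in equations (10)-(11), and the serial-mode output for a bag of words is compared with a multinomial Naive Bayes posterior computed independently in log space; the two agree up to normalization.

    import numpy as np

    rng = np.random.default_rng(1)
    counts = rng.integers(1, 50, size=(3, 6)).astype(float)   # n_j(i): counts of word j in topic i

    Omega = counts / counts.sum(axis=1, keepdims=True)        # equation (10)
    eps = counts.sum(axis=1) / counts.sum()                   # equation (11)

    doc = [0, 2, 2, 5]                                        # word indices of a toy document

    # Serial-mode topic output (equation (7)): eps_i * prod_over_doc Omega_w(i)
    t = eps * np.prod(Omega[:, doc], axis=1)

    # Multinomial Naive Bayes posterior, computed independently in log space:
    # log P(t_i) + sum_j alpha_j log q_ji, with q_ji = Omega_j(i) and P(t_i) = eps_i.
    alpha = np.bincount(doc, minlength=Omega.shape[1])
    log_post = np.log(eps) + alpha @ np.log(Omega).T
    posterior = np.exp(log_post - log_post.max())
    posterior /= posterior.sum()

    assert np.allclose(t / t.sum(), posterior)                # same result up to normalization
    print("normalized topic output:", np.round(t / t.sum(), 4))
    print("NB posterior           :", np.round(posterior, 4))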

3.3. Illustration of the topic selector: Word Sense Disambiguation

In order to illustrate how the topic selector disambiguates meanings, we resort to a toy example we used in a previous paper about the workings of LSA [50], and we refer the reader to that text for the details. We use a subset of the OHSUMED collection [11] consisting of 110 documents with 881 words gathered from the Medline abstracts collection. Each document is categorized as pertaining to one or more of a set of ten topics, and we used this information to train the topic model. As an example, consider the word INFARCTION, which is mainly present in two of the topics; one of the topics (topic 2 in our data) is mainly concerned with cerebral problems and another (topic 3) mainly concerned with cardiac problems. Two other words, CAROTID and PRESSURE, help to disambiguate whether INFARCTION refers to a cardiac or a cerebral infarction. In Table 1 we present the normalized weights of each of the words in the two relevant topics. Notice that in this case the word INFARCTION is associated more strongly with a cerebral context. The weight of the word CAROTID is also higher in the cerebral context, whereas the word PRESSURE is stronger in the cardiac topic. Table 1 represents three cases: a) the word INFARCTION enters with the white vector; b) the word INFARCTION follows CAROTID; and c) the word INFARCTION follows PRESSURE. The last column shows that despite the initial weight being biased toward the brain infarction interpretation, the initial presentation of the word PRESSURE is enough to change this bias and make the interpretation of a heart infarction more likely. The model can be made more flexible by incorporating different weights in the module M_2 (equation (4)), but we did not pursue the issue further.
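The example can be reproduced approximately from the weights in Table 1 alone. The sketch below applies the serial update to the three sequences, assuming equal priors ε for the two topics (an assumption of ours); small discrepancies with the table are expected because its values are rounded.

    import numpy as np

    # Normalized weights from Table 1: rows are topics (T2 cerebral, T3 cardiac),
    # columns are the words INFARCTION, CAROTID, PRESSURE.
    Omega = np.array([[0.8, 0.8519, 0.065],
                      [0.2, 0.1481, 0.8750]])
    eps = np.array([0.5, 0.5])          # assumed equal priors for this illustration
    words = {"INF": 0, "CAR": 1, "PRES": 2}

    def serial_output(sequence):
        """Serial-mode output (equation (7)) for a word sequence starting at the white vector."""
        t = eps.copy()
        for w in sequence:
            t = t * Omega[:, words[w]]
        return t / np.linalg.norm(t)    # normalized as in equation (12)

    for seq in (["INF"], ["CAR", "INF"], ["PRES", "INF"]):
        t = serial_output(seq)
        print("w,", ",".join(seq), "->", np.round(t, 4),
              " BI:HI ratio =", round(t[0] / t[1], 2))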

4. Topic selector performance: Text Categorization

To better assess the capability of the model to recognize topics properly, we tested the topic selector module on a text categorization task. In the field of information retrieval, text categorization is concerned with the assignment of predefined categories to novel texts [52], and usually deals with supervised methods and data collections that include separate document sets for training and testing. In this context, it can be used as a reference task for evaluating topic selector performance through the correspondence of predefined categories to distinct topics.

    Word    Weight T2   Weight T3
    INF     0.8         0.2
    CAR     0.8519      0.1481
    PRES    0.065       0.8750

    Sequence        Topic output            Likelihood
    w, INF          0.97 BI + 0.2425 HI     4.00 BI : 1.00 HI
    w, CAR, INF     0.9991 BI + 0.043 HI    23.23 BI : 1.00 HI
    w, PRES, INF    0.2747 BI + 0.9615 HI   1.00 BI : 3.5 HI

Table 1: Example of word disambiguation mentioned in the text. Above: each word is associated with a weight in topics 2 (cerebral problems) and 3 (heart problems). Below: for each of the three sequences (leftmost column), the topic output and the likelihood of each infarction concept are given. INF: infarction, CAR: carotid, PRES: pressure, BI: brain infarction, HI: heart infarction; w corresponds to the white vector.

Here, we use three standard datasets from the text categorization literature: the ModApte split of the Reuters-21578 corpus [1], the OHSUMED 87-91 Heart Diseases subset of the Medline abstracts collection (OHSUMED HD-119) [11, 51], and the RCV1-v2 corpus [21]. The Reuters-21578 dataset is composed of 7768 training documents and 3019 test documents consisting of short articles from the Reuters news agency, categorized into one or more of 90 distinct categories such as acq (mergers & acquisitions), wheat, oil, etc. After removing stop words and splitting hyphenated words, we obtained 10000 unique terms occurring more than three times, from which we selected the 9000 with the highest χ² statistic (see for example [52]). The OHSUMED dataset consists of 12823 training documents and 3758 test documents categorized into one or more of 103 (out of 119 possible) MeSH (Medical Subject Headings) hierarchical categories descending from the Heart Diseases category. After preprocessing as before, we obtained 14000 unique terms occurring more than three times, from which we selected the 3000 with the highest χ² statistic. The RCV1-v2 corpus contains 804414 Reuters news articles from August 1995 to August 1996, categorized into one or more of 103 hierarchical topic categories. Using the token lists obtained by [21], we performed two analyses with the corpus. First, we evaluated our algorithm on the original task performed in [21], that is, splitting the corpus into 23149 training documents and 781265 test documents, using 17139 unique terms occurring more than 5 times. To further compare our method to other neural classifiers, we obtained 1000 random partitions of 794414 training and 10000 test documents using the 10000 most frequent words, and calculated the mean performance over these partitions, following [45].
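Term selection by the χ² statistic, as used above, can be sketched as follows. This is a generic illustration rather than the exact preprocessing pipeline of the paper: for each term and category, χ² is computed from the 2x2 contingency table of term presence versus category membership, and terms are ranked by their scores.

    import numpy as np

    def chi2_term_category(presence, labels):
        """chi^2 of the 2x2 table (term present/absent x in/out of category).

        presence: boolean array (n_docs,), term occurs in the document
        labels:   boolean array (n_docs,), document belongs to the category
        """
        n = len(labels)
        a = np.sum(presence & labels)          # present, in category
        b = np.sum(presence & ~labels)         # present, not in category
        c = np.sum(~presence & labels)         # absent, in category
        d = np.sum(~presence & ~labels)        # absent, not in category
        num = n * (a * d - b * c) ** 2
        den = (a + c) * (b + d) * (a + b) * (c + d)
        return num / den if den > 0 else 0.0

    # Toy usage: rank a small random vocabulary for one category.
    rng = np.random.default_rng(2)
    docs_terms = rng.random((50, 20)) < 0.3    # boolean document-term presence matrix
    in_cat = rng.random(50) < 0.4              # boolean category membership
    scores = np.array([chi2_term_category(docs_terms[:, j], in_cat)
                       for j in range(docs_terms.shape[1])])
    top_terms = np.argsort(scores)[::-1][:5]   # keep the highest-scoring terms
    print("top term indices:", top_terms)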

4.1. Performance indicators

Two standard measures of performance in document retrieval are precision (P) and recall (R), which relate to positive predictive value and sensitivity, respectively, and can be defined as (see [52]):

    P = (# found and correct instances) / (# total found instances)

    R = (# found and correct instances) / (# total correct instances)

Given a document, text categorization algorithms may provide a ranking of category scores, or a binary category assignment. In the former case, precision and recall can be evaluated as relating to which categories are found relevant for a given document, that is, categories whose ranking scores are above a certain threshold. For high thresholds, precision may be high but recall may be low, and the opposite may be true for low thresholds. The overall behavior of the classifier for all documents can be integrated into a measure for the whole dataset, such as the interpolated 11-point averaged precision, which averages the precision values obtained at recall levels of 0, 0.1, 0.2, ..., and 1 [26].

Conversely, if the task involves finding relevant documents for a given category, a standard measure is the mean average precision. First, the average precision for a given category is evaluated at 1, 2, 3, ..., k relevant documents retrieved, where k is the number of total relevant documents for that category. Then, the mean average precision is calculated over all the categories.

In the case of binary classifiers, precision and recall may be evaluated as relating to which documents are found relevant for a given category. Values for each category can be integrated through micro- or macro-averaging, that is, considering as instances all documents for all categories and then computing the global measures, or simply averaging per-category precision and recall values. The first method gives equal weight to all documents, and hence more weight to categories that have many documents in them, whereas the second gives equal weight to all classes regardless of their size. For binary category assignment, then, instead of computing 11-point averaged precision we use (following [52] among others) another way of capturing the balance between precision and recall in a single score, the macro-averaged F measure [26]:

    F = 2PR / (P + R),

which is the harmonic mean of P and R.
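For reference, these measures can be computed directly from binary assignment matrices. The sketch below (toy data, not the benchmark results) implements per-category precision, recall and F, and the macro- and micro-averaged F scores described above.

    import numpy as np

    def prf(pred, true):
        """Precision, recall and F for one category (boolean vectors over documents)."""
        found_correct = np.sum(pred & true)
        p = found_correct / max(np.sum(pred), 1)
        r = found_correct / max(np.sum(true), 1)
        f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        return p, r, f

    def macro_micro_f(pred, true):
        """pred, true: boolean arrays of shape (n_docs, n_categories)."""
        macro = np.mean([prf(pred[:, c], true[:, c])[2] for c in range(true.shape[1])])
        # Micro-averaging pools all (document, category) decisions before computing F.
        _, _, micro = prf(pred.ravel(), true.ravel())
        return macro, micro

    # Toy usage with random assignments.
    rng = np.random.default_rng(3)
    true = rng.random((100, 5)) < 0.2
    pred = rng.random((100, 5)) < 0.25
    print("macro-F = %.3f, micro-F = %.3f" % macro_micro_f(pred, true))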

4.2. Implementation

To use the topic selector module for text categorization we identified topics with categories in the dataset, and obtained the Ω_j(i) and ε_i values from the training documents. Then, for each document in the test set, we calculated the topic output t(k) at each word k. The output was averaged over all words (except for a lead-in of 1/5 of the total length in documents that had more than 10 words), and we used the average topic weights as the category ranking for each document. Using a varying threshold, we converted the weights of the category rankings to binary category assignments. The reported F values correspond to the thresholds that yielded the highest performances.

In the case of document 2 that belongs to

two dierent categories, it is shown that for the module implementing the mixed mode, the topic output components associated with both categories alternatively attain very high levels. The results of using a topic module implementing the mixed mode for categorization of the test documents of both datasets are summarized in 13

Figure 2: Topic module output at each word for two documents from the Reuters collection (see text), using two modes of operation: namely, the mixed mode of equation (9) and the parallel mode of equation (8). Each trace represents the value of one component of the topic output, and the two traces that attain the highest peaks are identified by the numbers adjacent to them. Since each component corresponds to a topic/category, the figure illustrates the category assignment the module has selected for the document at each word.

The relationship between performance and the β parameter is shown in Figure 3. It is clear that performance is maximal for small values of the parameter, and at different values for the two performance indicators. In comparison with other categorization procedures, it can be seen that the topic selector performs slightly better than simple Bayesian classifiers described in the literature, and comparably to implementations such as k-Nearest Neighbors (kNN) or Linear Least Squares Fit [22, 52]. Nevertheless, the performance does not match that of some state-of-the-art techniques for text categorization, such as support vector machines or hierarchical methods (not shown) [53, 22].

Figure 3: Performance of the topic module in text categorization tasks for the Reuters and OHSUMED collections (see text), as a function of the β parameter.

The results for the RCV1-v2 corpus are summarized in Table 3. In the original split by [21], as was the case with the previous datasets, the topic model fell behind other algorithms, especially support vector machines. However, the macro-F obtained was not very far from the one obtained by k-nearest neighbors. For the random splits, the mean average precision (mavgP) obtained was far smaller than that obtained by advanced neural and non-neural devices such as Latent Dirichlet Allocation and Deep Boltzmann Machines [45]. That the model does not reach the other models' level of performance is surely due to the fact that we are using a single layer that diagnoses topics. This means the model is only able to find the most important categories and not the smallest ones or the ones deepest in the category tree. Some evidence that this is indeed the case is that we found a correlation between topic frequency and average precision (Pearson's ρ = 0.822, p-value < 10^-20). The topic module performs far better for the more frequent categories than for the less frequent ones, where it fares poorly.

5. Discussion

The main objective of this work is to connect one particular type of neural network with probabilistic models. We have been working for many years on context-dependent memory modules based on the Kronecker product. This type of network allows for a clear probabilistic interpretation. To demonstrate the viability of the proposed model as a part of a language processing network, we show how some text processing tasks can be implemented in this model.

    Reuters-21578 ModApte
                    11pt-AvgP   macro-F
    kNN             0.93 (1)    0.59 (2)
    Naive Bayes     -           0.47 (2)
    SVM             -           0.64 (2)
    topic module    0.91        0.48

    OHSUMED HD-119
                    11pt-AvgP   macro-F
    LLSF            0.78 (3)    0.55 (3)
    SVM             -           0.58 (4)
    topic module    0.67        0.47

Table 2: 11-point averaged precision and macro-F scores for the topic module and other commonly used text categorization methods. References: (1) [52], (2) [22], (3) [51], (4) [53]. kNN: k-nearest neighbors, LLSF: linear least squares fit, SVM: support vector machines.

    RCV1-v2 - Original partition
                    micro-F     macro-F
    kNN             0.765       0.56
    SVM             0.816       0.607
    topic module    0.46        0.54

    RCV1-v2 - Srivastava et al. partition
                    mavgP
    LDA             0.351
    DocNADE         0.417
    DBM             0.453
    topic module    0.12

Table 3: Performance of the topic module and other commonly used text categorization methods for the RCV1-v2 corpus. Micro-F and macro-F scores are reported for the original partition [21], and mean average precision for the partitions used by [45]. kNN: k-nearest neighbors, SVM: support vector machines, LDA: latent Dirichlet allocation, DBM: deep Boltzmann machines.

In that sense, our network should be considered as a building block for a multi-modular neural network, which has many similarities with current models of lexical and semantic representation based on large vector spaces. We show that, in its most restricted form, our model works as a Naive Bayes classifier [20]. With some particular ways of working it can surpass Naive Bayes in some of the tasks (in particular in the text categorization tasks, see Table 2). In previous work we have shown how the type of neural model we use is related to methods like Latent Semantic Analysis (e.g. [31]). The upsurge of methods based on deep learning has reinstated neural networks as one of the most important tools for machine learning [42]. It is thus no surprise that the methods based on deep learning are much more successful than our single-module network. Nevertheless, the fact that in a more stringent test the model still fares well, at least with the most frequent categories, suggests that it can be a building block for multi-modular models. In those models the most general and frequent categories would be recognized by one module like our M_1, and then other modules would deal with sub-categories or exceptions. In order to do so, each topic and word can be associated with several sub-topics in another module, and in this way finer distinctions can be learned. We are currently pursuing this strategy, but here our intention is to present the basic module and its probabilistic interpretation. An important advantage of the model presented here is that it shows in a transparent way how a neural model can be used to implement probabilistic topic models. Most neural instantiations of Bayesian models use localized coding where each neuron performs a basic Bayesian computation [6]. A nice feature of our model is that the Bayesian computation can be performed in parallel using a distributed representation. There is no limit in principle to the number of modules that can be added to the network, and so, provided a good training procedure is devised, multi-modular models could match the observed behavior of probabilistic topic models. Ultimately we believe that complex cognitive processes are the result of the dialogue of several modules, of which our topic detector is a simplified version [48]. One restriction we try to honor is to maintain the biological plausibility of our neural implementations.

In this way we want to find the relevant features that make a model work, in order to understand the type of neural structures and procedures that certain cognitive assumptions entail. Of course, our method has several limitations. In particular, learning the topics may be a difficult task, since a sense- or topic-tagged corpus is needed. We are currently analyzing some procedures that may overcome this problem, leading for instance to new synaptic modification rules. Although in the current formulation the performance of the topic selector module in text categorization tasks was below state-of-the-art methods, there is ample space for improvement. We did not want to pursue an unrealistic biology, so we did not include any complex computations. We are currently introducing well-known (and existing) nonlinear mechanisms to enhance our model's performance. For instance, disparate analyses of cognitive problems have shown that some form of competition can be used to implement either-or computations that are at the core of several cognitive operations [19, 37].

Although we insist on their biological plausibility, it could be argued that our models are unrealistic [27, 37]. For instance, in many of our models the presence of some memory buffers that retain inputs and/or topic information is implicitly assumed. This points to the necessity of analyzing the way neural tissue could be organized to create a writable temporary memory, but this transcends the objective of this work. The other assumptions we make are usually accepted by the neurobiological community. For instance, at the basis of our approach is the idea that the collective activity of several neurons underlies cognition. Furthermore, it is usually assumed that activity is related to the action potential firing rate. Given that the basic units of our model are a simple variety of sigma-pi units [49] with linear activation functions, we believe these models can be implemented using spiking neuron models. Somewhat more objectionable is the fact that our implementations suppose a special type of connectivity and multiplying synapses in order to compute the Kronecker product, but we have argued that some of these assumptions can be relaxed [39]. In particular, the requirement of a special connectivity to implement the Kronecker product is not very critical. What is required is a certain capacity to modulate the gain of signals by other inputs to a network, something that is used in other models [9].

Our Kronecker product based memories can be seen as convenient theoretical tools; that is, they are particular instances of a law of qualitative structure, referred to by Newell and Simon [35] as a basis for the explanation of cognition. In that sense, our target here was not precision in machine learning tasks, but an attempt to show that networks of multiplicative context modules can be used to model some cognitive functions, while retaining at the same time their biological plausibility and some level of transparency that allows the connection with probabilistic models to be made explicit. We believe that by connecting several of these modules we can maintain this transparency while scaling up their computational power. We are currently investigating this kind of model.

6. Acknowledgements

JCVL and EM acknowledge the partial financial support of PEDECIBA and CSIC-UdelaR. AC was supported by PEDECIBA and ANII.

References

[1] Chidanand Apté, Fred Damerau, and Sholom M. Weiss. Automated learning of decision rules for text categorization. ACM Trans. Inf. Syst., 12:233-251, 1994.
[2] Francesco P. Battaglia, Gideon Borensztajn, and Rens Bod. Structured cognition and neural systems: from rats to language. Neuroscience & Biobehavioral Reviews, 36(7):1626-1639, 2012.
[3] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.
[4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(4-5):993-1022, 2003.
[5] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[6] S. Deneve. Bayesian spiking neurons I: inference. Neural Computation, 20(1):91-117, 2008.
[7] Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl. 1):5228-5235, 2004.
[8] T.L. Griffiths, N. Chater, C. Kemp, A. Perfors, and J.B. Tenenbaum. Probabilistic models of cognition: exploring representations and inductive biases. Trends in Cognitive Sciences, 14(8):357-364, 2010.
[9] Stephen Grossberg. Neural models of normal and abnormal behavior: what do schizophrenia, parkinsonism, attention deficit disorder, and depression have in common? Volume 121, chapter 21, pages 375-406. Elsevier Science BV, 1999.
[10] M. Hare, M. Jones, C. Thomson, S. Kelly, and K. McRae. Activating event knowledge. Cognition, 111(2):151-167, 2009.
[11] W. Hersh, C. Buckley, T.J. Leone, and D. Hickam. OHSUMED: an interactive retrieval evaluation and new large test collection for research. In Proceedings of the 17th Annual ACM SIGIR Conference, pages 192-201, 1994.
[12] B. Hutchinson, L. Deng, and D. Yu. Tensor deep stacking networks. IEEE Transactions on Pattern Analysis & Machine Intelligence, 35(8):1944-1957, 2013.
[13] J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. arXiv preprint arXiv:1503.05671, 2015.
[14] A.K. Jain. Fundamentals of Digital Image Processing. Prentice-Hall, New Jersey, 1989.
[15] M. Jones, W. Kintsch, and D. Mewhort. High-dimensional semantic space accounts of priming. Journal of Memory and Language, 55(4):534-552, 2006.
[16] M.N. Jones and D.J.K. Mewhort. Representing word meaning and order information in a composite holographic lexicon. Psychological Review, 114(1):1-37, 2007.
[17] W. Kintsch. On the notions of theme and topic in psychological process models of text comprehension. In M. Louwerse and W. van Peer, editors, Thematics: Interdisciplinary Studies, pages 157-170, 2002.
[18] W. Kintsch and T.A. Van Dijk. Toward a model of text comprehension and production. Psychological Review, 85(5):363-394, 1978.
[19] Teuvo Kohonen. The 'neural' phonetic typewriter. Computer, 21(3):11-22, 1988.
[20] D.D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. In Claire Nédellec and Céline Rouveirol, editors, Lecture Notes in Computer Science, volume 1398, pages 4-15. Springer, 1998.
[21] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5:361-397, 2004.
[22] Fan Li and Yiming Yang. A loss function analysis for classification methods in text categorization. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), pages 472-479, Washington, DC, 2003.
[23] S. Liu and G. Trenkler. Hadamard, Khatri-Rao, Kronecker and other matrix products. International Journal of Information and System Sciences, 4:160-177, 2008.
[24] K. Lund and C. Burgess. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instrumentation, and Computers, 28(2):203-208, 1996.
[25] K. Lund, C. Burgess, and R.A. Atchley. Semantic and associative priming in high-dimensional semantic space. In J.D. Moore and J.F. Lehman, editors, Proceedings of the 17th Annual Conference of the Cognitive Science Society, pages 660-665. Lawrence Erlbaum, 1995.
[26] C.D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
[27] G.F. Marcus. Rethinking eliminative connectionism. Cognitive Psychology, 37(3):243-282, 1998.
[28] G.F. Marcus. Neither size fits all: Comment on McClelland et al. and Griffiths et al. Trends in Cognitive Sciences, 14(8):346, 2010.
[29] J.L. McClelland, M.M. Botvinick, D.C. Noelle, D.C. Plaut, T.T. Rogers, M.S. Seidenberg, and L.B. Smith. Letting structure emerge: connectionist and dynamical systems approaches to cognition. Trends in Cognitive Sciences, 14(8):348-356, 2010.
[30] J.L. McClelland and D.E. Rumelhart, editors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations. MIT Press, 1986.
[31] E. Mizraji. Neural memories and search engines. International Journal of General Systems, 37(6):715-732, 2008.
[32] Eduardo Mizraji and Juan Lin. Logic in a dynamic brain. Bulletin of Mathematical Biology, 73:373-397, 2011.
[33] Eduardo Mizraji and Juan Lin. Modeling spatial-temporal operations with context-dependent associative memories. Cognitive Neurodynamics, 2015.
[34] Eduardo Mizraji, Andrés Pomi, and Juan Valle-Lisboa. Dynamic searching in the brain. Cognitive Neurodynamics, 3:401-414, 2009.
[35] Allen Newell and Herbert A. Simon. Computer science as empirical inquiry: Symbols and search. Communications of the ACM, 19(3):113-126, 1976.
[36] Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267-273, 1982.
[37] Randall C. O'Reilly. Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm. Neural Computation, 8(5):895-938, 1996.
[38] Y.H. Pao. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, MA, 1989.
[39] A. Pomi Brea and E. Mizraji. Memories in context. BioSystems, 50(3):173-188, 1999.
[40] S.E. Robertson and Karen Sparck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129-146, 1976.
[41] Timothy T. Rogers and James L. McClelland. Précis of Semantic Cognition: A Parallel Distributed Processing Approach. Behavioral and Brain Sciences, 31(6):689, 2008.
[42] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85-117, 2015.
[43] J. Ignacio Serrano, M. Dolores del Castillo, and A. Iglesias. Dealing with written language semantics by a connectionist model of cognitive reading. Neurocomputing, 72(4):713-725, 2009.
[44] J. Ignacio Serrano, M. Dolores del Castillo, Ángel Iglesias, and Jesús Oliva. Assessing aspects of reading by a connectionist model. Neurocomputing, 72(16):3659-3668, 2009.
[45] N. Srivastava, R.R. Salakhutdinov, and G.E. Hinton. Modeling documents with deep Boltzmann machines. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, Bellevue, WA, USA, August 11-15, 2013.
[46] Cheston Tan, Joel Z. Leibo, and Tomaso Poggio. Throwing down the visual intelligence gauntlet. In Machine Learning for Computer Vision, pages 1-15. Springer, 2013.
[47] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384-394. Association for Computational Linguistics, 2010.
[48] J.C. Valle-Lisboa, A. Pomi, A. Cabana, B. Elvevåg, and E. Mizraji. A modular approach to language production: Models and facts. Cortex, 55:61-76, 2014.
[49] J.C. Valle-Lisboa, F. Reali, H. Anastasía, and E. Mizraji. Elman topology with sigma-pi units: an application to the modeling of verbal hallucinations in schizophrenia. Neural Networks, 18(7):863-877, 2005.
[50] Juan C. Valle-Lisboa and E. Mizraji. The uncovering of hidden structures by Latent Semantic Analysis. Information Sciences, 177(19):4122-4147, 2007.
[51] Yiming Yang. An evaluation of statistical approaches to MEDLINE indexing. In Proceedings of the AMIA Annual Fall Symposium, pages 358-362, 1996.
[52] Yiming Yang. An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1-2):69-90, 1999.
[53] Yongwook Yoon, C. Lee, and G.G. Lee. An effective procedure for constructing a hierarchical text classification system. Journal of the American Society for Information Science and Technology, 57(3):431-442, 2006.