Accepted Manuscript
Unsupervised Morphological Segmentation based on Affixality Measurements
Carlos-Francisco Méndez-Cruz, Alfonso Medina-Urrea, Gerardo Sierra
PII: S0167-8655(16)30234-3
DOI: 10.1016/j.patrec.2016.09.001
Reference: PATREC 6635
To appear in:
Pattern Recognition Letters
Received date: 11 December 2015
Accepted date: 1 September 2016
Please cite this article as: Carlos-Francisco Méndez-Cruz, Alfonso Medina-Urrea, Gerardo Sierra, Unsupervised Morphological Segmentation based on Affixality Measurements, Pattern Recognition Letters (2016), doi: 10.1016/j.patrec.2016.09.001
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Research Highlights (Required)
• A new method for unsupervised morphological segmentation is presented.
• The method is based on a combination of affixality measurements.
• The method performed well for Spanish multi-slot morphology.
• In an empirical evaluation, the new method outperformed Morfessor and ParaMor.
• Results show that our method is competitive for Spanish morphological segmentation.
Pattern Recognition Letters journal homepage: www.elsevier.com
Unsupervised Morphological Segmentation based on Affixality Measurements
Carlos-Francisco Méndez-Cruz a,c,∗∗, Alfonso Medina-Urrea b, Gerardo Sierra a
a Instituto de Ingeniería UNAM, Circuito Escolar s/n, col. Ciudad Universitaria, del. Coyoacán, D.F., 04510, Mexico
b El Colegio de México A.C., Camino al Ajusco, num. 20, col. Pedregal de Sta. Teresa, del. Tlalpan, D.F., 10740, Mexico
c Center for Genomics Sciences UNAM, Av. Universidad, s/n, col. Chamilpa, Cuernavaca, Morelos, 62210, Mexico
ABSTRACT
In this paper, we present a method for unsupervised morphological segmentation of multi-slot morphology based on affixality measurements. These measurements quantify three linguistic characteristics of affixes: 1) they combine with many low-frequency word-bases (high combinatorial capacity); 2) although they are relatively few, they help to maximize the size of a lexicon (economy principle), i.e. speakers know more words by remembering fewer morphological items; and 3) they are very frequent, so they contain less information than word-bases (entropy), i.e. borders between affixes and stems can be detected by finding entropy peaks. Several experiments combining these measurements were conducted to find the best way to apply them to data. The best strategy consists of successive segmentation wherever the average of the affixality measurements surpasses a threshold of 0.5. We also compared this strategy with state-of-the-art methods for unsupervised morphological segmentation (Morfessor and ParaMor). Our method outperformed these methods when tested on a hand-made corpus. Results indicate that our proposal is competitive, at least for the morphological segmentation of Spanish words.
© 2016 Elsevier Ltd. All rights reserved.
1. Introduction
In general terms, linguistic studies of morphology seek to describe how the words of a human language are structured and to explain the phenomena involved in the processes of forming these words (Haspelmath, 2002). Although interest in morphology has been profuse, it was the linguistic school of structuralism that first emphasized the study of morphological units (Hockett, 1971) and the development of systematic manual procedures for discovering them (Nida, 1949). That linguistic school proposed that words are formed by morphemes as the minimum units of meaning. Furthermore, there is a wide discussion about the concept of word, due to the lack of generalized criteria to define what it is across human languages (Anderson, 1985). Also, it has been debated whether or not words are formed by morphemes and whether these are meaningful units at all (Anderson, 1992).
∗∗ Corresponding author: Tel.: +52-(777)-313-2063; e-mail: [email protected] (Carlos-Francisco Méndez-Cruz), [email protected] (Alfonso Medina-Urrea), [email protected] (Gerardo Sierra)
The term morph was proposed in linguistics to refer to a representation of a morpheme (Hockett, 1971). In computational linguistics, this term has been adopted as an orthographic realization of a morpheme (Sproat, 1992). Consequently, for this paper, we adopt the terms word and morph as basic units of morphological analysis, and we assume that words are formed by morphs. Also, it is important to note that we deal in this paper only with concatenative morphology, which essentially deals with the addition of phonological or orthographic material (word segments) to word bases. In this kind of morphology, it is generally accepted that words are formed by a central element named stem, which undergoes inflectional or word-formation processes (Haspelmath, 2002). Segments concatenated before the stem are called prefixes, whereas segments concatenated after the stem are named suffixes. Both of them are known as affixes. Broadly speaking, inflection is a morphological process that produces word paradigms to express syntactic features (gender, number, tense). Word class (noun, verb) and lexical meaning are not changed by inflectional processes: run vs run-s, book vs book-s. On the contrary, word-formation processes (derivation, compounding, and incorporation) often produce changes
Most approaches, [...], explicitly or implicitly target languages which have (close to) one-slot morphology, that is, a word (or stem) typically takes not more than one prefix and not more than one suffix. Many [...] languages deviate more or less from this model.
2. Related work
In fact, even the most studied languages, such as English, French, Spanish, or German, do not have one-slot morphology. See, for example, the Spanish words niñ-o-s (children), sentiment-al-ismo (sentimentality) and anti-rre-elec-cion-ista (any person or institution opposed to electing somebody more than once for a political office). Nevertheless, most unsupervised methods proposed for Spanish segmentation have treated this language as having one-slot morphology (an exception is, for example, ParaMor). Thus, it is our objective to propose a method to fill this gap. In summary, it is necessary to further advance research in unsupervised morphological segmentation for discovering all morphs of a word. Thus, we present in this paper an unsupervised method that infers stems and all suffixes of Spanish words. The paper is organized as follows: in the next section (2), we present related work on ULM, especially for Spanish; then, we describe our method and segmentation strategies in section 3; in section 4, we include the description of the corpus and present experiments, results, and a discussion; finally, a brief conclusion is given.
manner, for example Meya (1986), Klenk and Langer (1989), and Flenner (1994). Subsequently, Gelbukh et al. (2004, 2008) proposed a segmentation method applied to Spanish that uses genetic algorithms. This method inferred one suffix per word and was extended to involve paradigms. Another approach that uses Spanish as an experimental language was proposed by Monson et al. (2004, 2007). That approach focused on inflectional and one-slot morphology. The objective was to learn inflectional classes. The method was named ParaMor and was able to identify 92% of all Spanish inflectional suffixes (Monson et al., 2007, 2008a). ParaMor was extended to segment words by using the discovered inflectional classes (only one segmentation). Finally, ParaMor advanced towards the analysis of Spanish multi-slot morphology. However, no quantitative evaluation was reported (Monson et al., 2008b). To finish this section, we review the family of methods called Morfessor, which have been developed since 2002 for the multi-slot morphological analysis of Finnish (Creutz and Lagus, 2007). As far as we know, none has been tested for Spanish morphology; however, their unsupervised nature makes them comparable to our method. The two initial methods of this family were named Morfessor Baseline (Creutz and Lagus, 2002; Creutz, 2003). That approach outperformed Linguistica (Goldsmith, 2001, 2006) for Finnish, but not for English. The next proposed method was Morfessor Categories-ML (Creutz and Lagus, 2004). In this method, each morph is associated with one of three categories: prefix, suffix, and stem; additionally, constraints related to sequences of morphs (morphotactics) are expressed by Hidden Markov Models (HMMs). Also, some linguistic assumptions about affixes and stems are represented in a probabilistic manner: affixes are combined with many different morphs, stems are part of a set bigger than the set of affixes, and stems are longer than affixes.
The method was applied to a corpus of 16 million Finnish words, semiautomatically segmented. It achieved a precision of 79%. Later, an improvement was proposed adopting the Maximum a Posteriori (MAP) framework (Creutz and Lagus, 2005). The central idea was to maximize the conditional probability P(lexicon|corpus). The method was named Morfessor Categories-MAP and it creates a hierarchical lexicon of morphs. For Finnish, it outperformed previous methods and Linguistica. The last member of the Morfessor family is Morfessor FlatCat (Grönroos et al., 2014), which shares components with Morfessor Baseline and Morfessor Categories-MAP. The best results for this method were obtained with semisupervised learning. The Morfessor approach employs morphotactic constraints, an aspect missing from our method. The two approaches employ linguistic characteristics to describe morphological units, some strategy of segment predictability (entropy, perplexity, transition probabilities), and some measures of morph combination.
in word class and lexical meaning; their motivation is semantic rather than syntactic. Examples of derivation are: play vs play-er (vb > noun), ill vs ill-ness (adj > noun). Unsupervised Learning of Morphology (ULM) is a well-established area of Computational Linguistics. The methods of ULM try to induce a morphological description from a raw-text corpus with the minimum of a priori linguistic information and the least possible supervision. In spite of the numerous works in ULM, there is room for improvement within this area. As Hammarström and Borin (2011, p. 332) point out:
Many proposals for inducing morphological descriptions from corpora have emerged since Harris' approach (1955). As there are some papers which review most of them, we prefer to highlight methods applied to Spanish. Medina-Urrea (2000) mentions some methods proposed by some of Harris' contemporaries; most of them were based on finding morphological boundaries by means of character frequencies. Creutz and Lagus (2002, 2004, 2005) and Creutz (2003) present reviews of methods from the nineties, and Goldsmith (2010) offers a wide review of methods in word and morphological segmentation. Finally, Hammarström and Borin (2011) present a survey with more than 200 studies. As far as we know, the first method for automatic morphological analysis that experiments with Spanish corresponds to De Kock and Bossaert (1974, 1978). The idea behind their method was the quantification of economic relations among morphological units. Thereafter, some approaches to the morphological analysis of Spanish were proposed in a supervised
3. Description of the method
Our approach is based on the proposal by Medina-Urrea (2000, 2007), which segments words into a stem and only one
This measure allows for the validation of segmentations as signatures (Goldsmith, 2001) or groups of words (Bernhard, 2006).
3.2. Measure of economy
This measure is based on the work of De Kock and Bossaert (1974, 1978), which quantifies economic relations among morphological units. It was inspired by the Economy Principle of language, which can be expressed as follows: the fewer high-frequency morphological units a language has, and the more lexical items those units combine to form, the more economical the language is at the morphological level. Thus, a word $i$ cut into two segments after the $j$th character can be represented as $a_{i,j}::b_{i,j}$ (the concatenation of segments $a_{i,j}$ and $b_{i,j}$ forms word $i$). If $a_{i,j}$ belongs to a large set of segments of low frequency, while $b_{i,j}$ belongs to a small set of very frequent segments, then $a_{i,j}$ would be a stem and $b_{i,j}$ would be an affix. Let $A_{i,j}$ be the set of word beginnings that precede $b_{i,j}$ in words found in the corpus and let $B_{i,j}$ be the set of word endings that follow segment $a_{i,j}$ to form other words in the corpus. Therefore, $a_{i,j} \in A_{i,j}$ and $b_{i,j} \in B_{i,j}$. Some of the members of $A_{i,j}$ might be actual prefixes, so let $A^{p}_{i,j}$ be this set of prefixes, which is a subset of $A_{i,j}$. Similarly, some members of $B_{i,j}$ might be actual suffixes, so let $B^{s}_{i,j}$ be this set of suffixes, which is a subset of $B_{i,j}$. If we let $k_{i,j}$ be the measure of economy of the segmentation $a_{i,j}::b_{i,j}$, we will have two possible values, depending on whether $a_{i,j}$ is a prefix and $b_{i,j}$ is a stem ($k^{p}$) or $a_{i,j}$ is a stem and $b_{i,j}$ a suffix ($k^{s}$). The following formulations are normalized possibilities to measure this kind of economy:
3.1. Measure of squares
prefix or one suffix to infer catalogs of affixes. We selected this method because it has been used for Spanish (Romance) and some unrelated languages such as Chuj (Mayan), Czech (Slavic), and Tarahumara (Uto-Aztecan). Thus, we are going to be able to test our proposal in several languages in the future. The idea behind this method is to quantify some linguistic characteristics of affixes across languages: 1) they combine with many low frequency word-bases (see the measure of squares below), 2) although affixes are relatively few, they help to maximize the size of a lexicon, i.e. speakers know more words by remembering fewer morphological items, (see the measure of economy) and 3) affixes are very frequent, so they contain less information than word-bases, i.e. borders between affixes and stems can be detected by finding entropy peaks (see the measure of entropy). In essence, an affix is more likely to function grammatically (i.e. be a marker of tense, mode, grammatical number or gender). Thus, we will say its information content is more grammatical (i.e. affixes contain less information because they are frequent; they carry information mostly about grammatical structure). Also, a stem’s information content deals more with the meaning carried by the whole of the message (written or spoken) in which it occurs, so its amount of information is necessarily greater (i.e. word-bases are less frequent and therefore more informative). In short, affixes have a more grammatical function than stems within a word, just as function words have a more grammatical function than content words within discourse. For measuring this kind of grammaticality within words, Medina-Urrea proposed measuring the following features of borders between affixes and stems: number of squares, economy of segmentation, and entropy, which will be briefly described below. 
Averaging these quantities results in a measurement of what we will call affixality, which is used to discover boundaries between stems and affixes (Medina-Urrea, 2000).
The concept of squares was first described by Greenberg (1957) as a structure of four segments which combine to form four words found in a corpus. For example, perfection-ism, modern-ism, perfection-ist, modern-ist are four words formed by the segments modern-, perfection-, -ism and -ist. Given the $i$th word in a corpus, a square is a set of four word-segments, two word-beginnings ($a$ and $a'$) and two word-endings ($b$ and $b'$), which combined form four words found in the corpus: $a::b$, $a::b'$, $a'::b$ and $a'::b'$.¹ The $i$th word of a corpus cut into two segments after the $j$th character can be represented as $a_{i,j}::b_{i,j}$ (the concatenation of segments $a_{i,j}$ and $b_{i,j}$ forms word $i$). Let the measure of squares, $c_{i,j}$, be the number of squares that can be found in the corpus given the word that is formed by segments $a_{i,j}$ and $b_{i,j}$.
1 We use the symbol :: as a word segmentation mark, the symbol - as a linguistic standard segmentation mark, the symbol + to concatenate labels, and the symbol ∼ as an automatic segmentation mark.
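As a minimal sketch of the measure of squares, the following Python function counts, for one cut of one word, the pairs (a', b') that complete a square with a and b. The name `count_squares`, the toy word list, and the decision to count each distinct pair (a', b') as one square are assumptions made for illustration; the text above does not specify the counting procedure in more detail.

```python
from collections import defaultdict

def count_squares(words, word, j):
    """Count pairs (a2, b2) such that a::b, a::b2, a2::b and a2::b2
    are all in the vocabulary, where a = word[:j] and b = word[j:]."""
    vocab = set(words)
    a, b = word[:j], word[j:]

    # Index every possible cut: word beginning -> set of endings seen with it.
    endings_of = defaultdict(set)
    for w in vocab:
        for cut in range(1, len(w)):
            endings_of[w[:cut]].add(w[cut:])

    ends_a = endings_of.get(a, set())
    if b not in ends_a:
        return 0  # a::b itself is not a word of the corpus

    squares = 0
    for a2, ends_a2 in endings_of.items():
        if a2 == a or b not in ends_a2:
            continue
        # Every other ending shared by a and a2 closes one square.
        squares += len((ends_a & ends_a2) - {b})
    return squares

toy_corpus = ["perfectionism", "modernism", "perfectionist", "modernist"]
print(count_squares(toy_corpus, "modernism", 6))  # cut modern::ism
```

On this toy list the cut modern::ism finds the single square formed with perfection- and -ist, whereas an arbitrary cut such as mod::ernism finds none.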
$$k^{p}_{i,j} = 1 - \frac{|A_{i,j}| - |A^{p}_{i,j}|}{|B^{s}_{i,j}|} \qquad (1)$$

$$k^{s}_{i,j} = 1 - \frac{|B_{i,j}| - |B^{s}_{i,j}|}{|A^{p}_{i,j}|} \qquad (2)$$
This idea of comparing frequencies of stems and affixes is indeed present in different ways in other methods such as Creutz and Lagus (2005) and Bernhard (2006).
3.3. Measure of entropy
This measure is associated with the notion of predictability of segments established by Harris (1955), which was accomplished by counting phonemes following or preceding a given word segmentation. The idea behind this procedure is that a greater number of phonemes implies greater uncertainty about what follows or precedes a word segment. Hafer and Weiss (1974) first tested Claude Shannon's entropy as a measure of this kind of uncertainty and carried out several experiments exploring different ways to apply this and other methods. In what follows, we will describe how to calculate entropy at each word segmentation following Medina-Urrea (2000). This measurement is also present in other methods (such as Creutz and Lagus, 2005; Bernhard, 2006).
$$h(a_{i,j} :: L_{i,j}) = -\sum_{k=1}^{|L_{i,j}|} p(l_{i,j,k}) \times \log_2 p(l_{i,j,k}) \qquad (3)$$
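Equation (3) can be illustrated with a short Python sketch. It assumes that p(l) is estimated as the share of distinct word endings (types) that begin with character l; the source does not state here whether types or token frequencies are used, so this choice, the function name `boundary_entropy`, and the toy verb forms are all illustrative assumptions. Only the left-to-right direction is shown.

```python
import math
from collections import Counter

def boundary_entropy(words, a):
    """Shannon entropy (in bits) of the characters that follow the word
    beginning `a`, computed over the distinct endings attached to `a`."""
    endings = {w[len(a):] for w in set(words)
               if w.startswith(a) and len(w) > len(a)}
    first_chars = Counter(e[0] for e in endings)
    total = sum(first_chars.values())
    return -sum((n / total) * math.log2(n / total)
                for n in first_chars.values())

toy_forms = ["cantar", "cantas", "cantan", "canto"]
print(round(boundary_entropy(toy_forms, "cant"), 3))   # endings ar, as, an, o
print(round(boundary_entropy(toy_forms, "canta"), 3))  # endings r, s, n
```

A beginning followed by many equally likely characters yields a high value, which is the kind of peak the method looks for.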
here with suffixes only. In what follows, we will evaluate some possible heuristic strategies, using the affixality index, in order to determine the best one. Medina-Urrea (2000, p. 108) discussed some ways to use the AF index to find segmentations. For instance, a threshold can be used to find valid segmentations: 1) select the best segmentation of a word when AF > threshold; 2) look for the next best segmentation, above the threshold, recursively to the left; or 3) look for the next best segmentation, above the threshold, recursively to the right.
Table 2. Affixality measurements for the Spanish word niños
Again, entropy is also bidirectional, depending on whether we are dealing with prefixes ($h^{p}_{i,j}$) or suffixes ($h^{s}_{i,j}$).
3.4. Affixality index
$$AF(s_x) = \frac{\dfrac{c_x}{\max c} + \dfrac{k_x}{\max k} + \dfrac{h_x}{\max h}}{3} \qquad (4)$$
where $c_x$, $k_x$, and $h_x$ are, respectively, the measure of squares, economy, and entropy of segment $s_x$. The method takes a list of words as input to calculate the affixality index for each possible segmentation within a word. Affixality is calculated from left to right of the word (beginning to end) to obtain a prefix, and from right to left to obtain a suffix. Up until now, the method has been used to obtain catalogs of prefixes and suffixes by cutting words where the highest affixality index is found, as long as it surpasses a manually selected threshold. At this point, in order to avoid extremely short stems, a minimum length in characters can be required. An example of these measurements for the Spanish word pasteles (cakes) is shown in Table 1. The highest value of entropy suggests the segmentation PAST∼ELES, whereas the highest values of squares, economy, and affixality index suggest the segmentation PASTEL∼ES, which corresponds to the valid Spanish plural suffix -es in the affix catalog.

Table 1. Affixality measurements for the Spanish word pasteles
        P     A     S     T     E     L     E     S
h_x     -     0     0     1.67  2.22  1.62  2.03  1.36
c_x     -     0     0     0     72    0     929   160
k_x     -     0     0     0     0     0     0.99  0.94
AF      -     0     0     0.25  0.33  0.24  0.97  0.57
Right to left affixality
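The affixality index of Equation (4) can be sketched in a few lines of Python using the pasteles values of Table 1. Two points are assumptions rather than statements of the source: each measure is normalized by its maximum within the word, and each value is read as scoring the cut placed immediately before the letter it is aligned with. The function name `affixality` is hypothetical.

```python
def affixality(h, c, k):
    """Average of the three measures after dividing each one by its
    maximum (assumed here to be the maximum within the word)."""
    def norm(values):
        top = max(values)
        return [v / top if top > 0 else 0.0 for v in values]
    return [sum(triple) / 3 for triple in zip(norm(h), norm(c), norm(k))]

# Table 1 values for "pasteles"; position i scores the cut before letter i + 1.
h = [0, 0, 1.67, 2.22, 1.62, 2.03, 1.36]
c = [0, 0, 0, 72, 0, 929, 160]
k = [0, 0, 0, 0, 0, 0.99, 0.94]

af = affixality(h, c, k)
best = max(range(len(af)), key=af.__getitem__)
word = "pasteles"
print(word[:best + 1] + "~" + word[best + 1:])  # cut with the highest AF
print([round(x, 2) for x in af])
```

Under these assumptions the highest index falls on the cut PASTEL∼ES and the computed values stay close to the AF row of Table 1, which suggests, but does not prove, that the normalization is word-internal.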
3.5. Affixality measurements for multi-slot morphology
As we mentioned above, a one-slot analysis is insufficient to deal with the morphological complexity of many languages. Thus, we will describe a method to infer all suffixes of a word. In Spanish, the main derivational processes and all inflectional processes are accomplished by suffixation. Therefore, we will deal
Right to left affixality (AF) values for the word niños (Table 2):
        N     I     Ñ     O     S
AF      -     0.33  0.26  0.88  0.93
We can observe the effect of establishing a threshold of 0.5 and exploring one direction or the other in the word niños (children + masc.) in Table 2. The highest peaks (AF > 0.5) indicate two segmentations, NIÑ∼O∼S, which happen to correspond to real morphological borders. If we look only at the best one, 0.93 (in boldface), we have NIÑO∼S (boy + pl.), and we can look left for the next best segmentation above 0.5, which is 0.88 (in italics), leaving us with NIÑ∼O∼S (child + masc. + pl.). Alternatively, we can look to the right of ∼S and find no further segmentation; so the direction used when looking for a segmentation is in fact relevant. Also, notice that, since the three values measure different aspects of affixes, they may point to different segmentations (as was the case for the word pasteles above). See, for example, the measurements calculated for the Spanish word dibujante (draftsman or draftswoman) in Table 3. The highest values of entropy and squares, and the second highest of economy, separate the stem DIBUJ- from the agentive derivational suffix -ANTE, which is the correct segmentation. However, the highest economy value points to the probable but incorrect occurrence of the enclitic TE, which attaches to verbal forms. This discordance may be explained by the fact that entropy tends either to assume that a frequent string of characters is always a morph, regardless of its context of occurrence, or to point at borders between stems and affixes, while the economy measure selects the outermost inflectional affixes. Also, Spanish inflectional affixes occur as suffixes. They can be regarded as the most economical because they are the most frequent and they belong to a relatively small set of affixes. So the measure of economy can be combined with the measure of squares to validate segmentations.
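The effect of direction described for niños can be reproduced with a small Python sketch. It takes the best cut above the threshold and then keeps looking for the next best cut only on one side of it. The function names, the encoding of cut positions (value i scores the cut before letter i + 1), and the strict comparison AF > threshold are illustrative assumptions; the AF values are those reported for Table 2.

```python
def successive_cuts(af, threshold=0.5, direction="left"):
    """Pick the best cut above the threshold, then repeat the search only
    to the left (or right) of the cut just made."""
    cuts = []
    lo, hi = 0, len(af)  # candidate cut positions are range(lo, hi)
    while lo < hi:
        best = max(range(lo, hi), key=lambda i: af[i])
        if af[best] <= threshold:
            break
        cuts.append(best)
        if direction == "left":
            hi = best
        else:
            lo = best + 1
    return sorted(cuts)

def apply_cuts(word, cuts, mark="~"):
    """Insert the segmentation mark at the chosen cut positions."""
    pieces, start = [], 0
    for i in cuts:
        pieces.append(word[start:i + 1])
        start = i + 1
    pieces.append(word[start:])
    return mark.join(pieces)

word = "ni\u00f1os"                    # niños
af_ninos = [0.33, 0.26, 0.88, 0.93]    # cuts before I, Ñ, O, S
print(apply_cuts(word, successive_cuts(af_ninos, direction="left")))
print(apply_cuts(word, successive_cuts(af_ninos, direction="right")))
```

Searching to the left recovers both boundaries, while searching to the right stops after the first one, matching the contrast between NIÑ∼O∼S and NIÑO∼S discussed above.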
For example, in the word cantaremos (we will sing), the highest value of economy segments the rightmost inflectional suffix, -MOS (Table 4), which is the morph of the 1st. person plural of Spanish verbs (Alcoba, 1999). Additionally, combined measurements of entropy and squares point to the stem of the verb (cant-). So, finding a combination of measurements for discovering stems and still another combination for detecting suffixes may be important to work with agglutinative languages, where words are formed by
Medina-Urrea (2003) suggested a normalized affixality index (AF) of a word segment $s_x$ as the average of three normalized measurements as follows:
We can interpret this measure as the amount of information carried by segments: if affixes are less informative than word-bases, peaks of entropy within a word can be used to identify morphological boundaries (a peak of entropy must signal the beginning of a stem). We use a reformulation of Shannon's method. Given a word segmentation $a_{i,j}::b_{i,j}$, and the set $B_{i,j}$ containing all word endings occurring in a corpus attached after segment $a_{i,j}$, let $L_{i,j}$ be the set of the characters with which the elements of $B_{i,j}$ begin and let $l_{i,j,k}$ be the $k$th element of $L_{i,j}$ ($l_{i,j,k} \in L_{i,j}$). The measure of entropy $h_{i,j}$ is obtained by:
Table 3. Affixality measurements for the Spanish word dibujante
        D     I     B     U     J     A      N     T     E
h_x     -     0     0     1.09  1.00  2.65   0.58  1.78  1.25
c_x     -     0     0     0     0     36238  2233  6499  0
k_x     -     0     0     0     0     0.56   0     0.99  0
Right to left affixality
Table 4. Affixality measurements for the Spanish word cantaremos
        C     A     N     T     A       R     E      M      O     S
h_x     -     1.89  0.99  2.17  2.73    1.41  1.52   0.86   1.21  1.30
c_x     -     3     0     303   274560  9016  12750  25991  0     0
k_x     -     0     0     0.15  0.92    0.81  0.93   0.99   0     0
Right to left affixality
erage the measures in question, while S2 and S4 multiply them. The rest of the strategies (S5 to S16) apply only one combination of measurements. Also, direction of segmentation, condition to segment and operation to combine measurements vary as specified in Table 5. The last four strategies (S13 to S16) do not use direction of segmentation because segmentation is based on values higher than 0.5.
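The threshold-based strategies can be sketched as follows in Python, using the cantaremos values of Table 4. The sketch normalizes each measure by its maximum within the word, combines the chosen measures by average or product, and cuts at every position whose score exceeds 0.5. The word-internal normalization, the reading of each value as the cut before its letter, and the names `combine` and `cuts_above` are assumptions for illustration, not the authors' implementation.

```python
from math import prod

def combine(measures, operation="avg"):
    """Normalize each measure by its maximum, then combine the measures
    position by position with an average ("avg") or a product ("prd")."""
    normed = []
    for values in measures:
        top = max(values)
        normed.append([v / top if top > 0 else 0.0 for v in values])
    rows = list(zip(*normed))
    if operation == "avg":
        return [sum(row) / len(row) for row in rows]
    return [prod(row) for row in rows]

def cuts_above(scores, threshold=0.5):
    """All cut positions whose combined score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]

# Table 4 values; position i scores the cut before letter i + 1 of "cantaremos".
h = [1.89, 0.99, 2.17, 2.73, 1.41, 1.52, 0.86, 1.21, 1.30]
c = [3, 0, 303, 274560, 9016, 12750, 25991, 0, 0]
k = [0, 0, 0.15, 0.92, 0.81, 0.93, 0.99, 0, 0]

print(cuts_above(combine([h, c, k], "avg")))  # three measures averaged
print(cuts_above(combine([h, k], "avg")))     # entropy and economy averaged
print(cuts_above(combine([h, c, k], "prd")))  # three measures multiplied
```

On this single word, averaging entropy and economy alone produces more cuts than averaging all three measures, and the product produces the fewest; this is only an illustration on one example and says nothing about corpus-level performance.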
sequences of stems and affixes. Given these differences, it is important to evaluate different segmentation strategies. In Table 5, the sixteen strategies that we evaluated are displayed. Essentially, these are combinations of the following elements: 1) affixality measurements applied and their order of application, 2) direction of segmentation, 3) condition to segment (highest value vs threshold) and 4) operation to combine measurements (product vs average).
4. Experiments and Results
Table 5. Summary of segmentation strategies
      MEASUREMENTS                           DIR   CON    OPE
S1    First h_x and c_x, then h_x and k_x    →     Max    Avg
S2    First h_x and c_x, then h_x and k_x    →     Max    Prd
S3    First h_x and k_x, then h_x and c_x    ←     Max    Avg
S4    First h_x and k_x, then h_x and c_x    ←     Max    Prd
S5    h_x, c_x, and k_x                      ←     Max    Avg
S6    h_x, c_x, and k_x                      →     Max    Avg
S7    h_x, c_x, and k_x                      ←     Max    Prd
S8    h_x, c_x, and k_x                      →     Max    Prd
S9    h_x and k_x                            ←     Max    Avg
S10   h_x and k_x                            →     Max    Avg
S11   h_x and k_x                            ←     Max    Prd
S12   h_x and k_x                            →     Max    Prd
S13   h_x, c_x, and k_x                      -     >0.5   Avg
S14   h_x, c_x, and k_x                      -     >0.5   Prd
S15   h_x and k_x                            -     >0.5   Avg
S16   h_x and k_x                            -     >0.5   Prd
DIR = direction of segmentation; CON = condition to segment; OPE = operation to combine measurements
→ = left to right; ← = right to left
Max = highest value; Avg = average; Prd = product
Strategies S1 and S2 identify the stem by cutting at the point of highest value of entropy (h x ) and squares (c x ) combined. Then, they identify suffixes by moving to the right and segmenting where the next highest value of entropy (h x ) and economy (k x ) appear. S3 and S4 identify first the rightmost suffix with the highest value of entropy (h x ) and economy (k x ) combined. Then, they identify any remaining suffixes, and the stem, by moving progressively to the left, detecting the highest values of entropy (h x ) and squares (c x ). Finally, strategies S1 and S3 av-
4.1. Datasets
A list of 975,250 Spanish word-forms was gathered to perform experiments and evaluation. This list contains: 1) the word-forms previously used in a supervised morphological analyzer for Spanish (Gelbukh and Sidorov, 2003),² 2) the headwords of the Mexican Spanish dictionary Diccionario del Español de México (DEM) (Lara, 2010) and 3) all word-forms found in the Corpus del Español Mexicano Contemporáneo (CEMC) (Lara et al., 1979). Since we were unable to find a gold standard of Spanish morphologically segmented words, in order to determine how much suffixal morphology our method learned, we compiled a test corpus of 1,600 word-forms selected from the CEMC³ and segmented each word manually. The few available segmented corpora contain words segmented once (stem + suffix) or, at the most, twice (prefix + stem + suffix). In order to gather an appropriate sample of inflectional and derivational processes, the words of our test corpus were segmented following some Spanish morphological studies (Ambadiang, 1999; Moreno de Alba, 1986; Alcoba, 1999; Beniers, 2004) and the rules described by the DEM⁴ (Lara, 2010). The distribution of these phenomena in our test corpus was: noun inflection 5% (76), noun derivation 53% (855), verb inflection 31% (490), and verb derivation 11% (180).
2 This handmade list includes many Spanish verbs in all possible inflected forms as well as all possible enclitic combinations. This substantially increases the number of word-forms.
3 http://www.corpus.unam.mx/morfotactica/testDatsetSpanish.zip
4 http://dem.colmex.mx/
including some additional features (Virpioja et al., 2013). Second, we tested Morfessor Categories-MAP. Third, we included Morfessor FlatCat, the latest release of this family. Finally, we added ParaMor.⁷ By using the development set, we performed a tuning stage to find the best parameters of these methods. In the case of Morfessor 2.0, we tested different values of the likelihood weight (α): 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, and used the development set to equalize precision and recall (dev). Additionally, three types of frequency dampening (dam) were used (none, log, ones). Morfessor Categories-MAP was tuned by testing some values of the perplexity threshold (per): 5, 10, 50, 75, 100, 150, 200, 250, 300. Regarding Morfessor FlatCat, we tested per = 5, 10, 50, 75, 100, 150, 200, 250, and 300 with α = 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, and 1.0. In the case of ParaMor, we tested different sizes of training corpus (size): 50K, 100K, 150K, 200K, 250K, 300K, and 350K. Also, we adjusted type length (=2) and language (=GENERIC).
4.3. Results and discussion
We show the top five segmentation strategies according to the F-measure of the BPR evaluation method in Table 6. Strategies S13 and S15 obtained the first two places. Both strategies segment wherever the averaged affixality measurements surpass 0.5. The difference between them rests on the fact that S13 combines all measurements ($c_x$, $h_x$ and $k_x$) while S15 combines only entropy ($h_x$) and economy ($k_x$). The BPR scores show that S13 undersegments (its precision exceeds its recall), while S15 oversegments (its recall exceeds its precision). We decided to use both strategies for the comparison with other methods.
The word-forms from our test corpus were eliminated from the initial list and randomly divided into two parts: 1) a development set of 800 word-forms for selecting the best segmentation strategy and tuning the state-of-the-art methods and 2) a held-out set of 800 word-forms for final evaluation. The remaining 973,650 word-forms were used as the training set. Spanish has an interesting variety of suffixes. For example, the 855 noun derivational forms compiled by Moreno de Alba (1986) display 143 orthographic variations (allomorphs) of 79 derivational suffixes. Furthermore, verb inflection includes approximately 26 different suffixes (see Alcoba, 1999) whose combination allows us to express a Spanish verb in one of more than 40 different inflectional forms (for example, canta-r, cant-a-s, cant-a, cant-a-mos, cant-á-is, cant-a-n, cant-e, cant-e-s, cant-e-mos and so on). Verb forms were also segmented according to the inflectional models of the DEM, which recommends a model with less segmentation (cant-ar, cant-as, cant-a, cant-amos, cant-áis, cant-an, cant-e, cant-es, cant-emos and so forth). The lowest variety of suffixes belongs to noun inflection (gender: -a, -o, -e, number: -s, -es) and verb derivation (-ear, -ecer, -ificar, -izar). In addition, Spanish, like many other languages, has regular and irregular morphological processes. A regular process does not modify the stem of a word when affixes are concatenated, for example elimin-ar vs elimina-ción (eliminate/elimination) or cant-ar vs cant-amos (to sing/we sing). In an irregular process, the stem undergoes phonological/orthographic changes such as agua vs acuá-tico (water/aquatic) or med-ir vs mid-o (to measure/I measure). We decided to explore the performance of the method taking into consideration irregular forms, so 17% of the test corpus includes examples of this kind. Needless to say, this made our evaluation more challenging.
Finally, we included word-forms with a sequence of derivational and inflectional suffixes, such as defect-uos-o-s (defective + masc. + pl.). Also, even though enclitic pronouns (me, te, se, nos, lo, etc.) are grammatically and semantically different from affixes, they are graphically attached at the end of many verbal forms. So we included examples exhibiting all their possible combinations, e.g. d´ar-te-lo (I give it to you). Compounds were ignored.
4.2. Experiments and evaluation
We performed an evaluation to find the best segmentation strategy. First, the initial statistics needed to calculate the affixality measurements were generated from the training set. Then, the words of the development set were segmented using all the strategies listed above in Table 5. Afterwards, an evaluation method was applied to obtain precision, recall, and F-measure scores. Following Virpioja et al. (2011), we chose the measurement of boundary positions (BPR).5 Furthermore, we compared the best strategies with those of the state-of-the-art methods for unsupervised morphological segmentation. We tested some Morfessor approaches.6 First, we used Morfessor 2.0, which is based on Morfessor Baseline,
5 Scripts in http://research.ics.tkk.fi/events/morphochallenge/
6 http://www.cis.hut.fi/projects/morpho/.
Table 6. The top five segmentation strategies on development set

| | S1 | S2 | S3 | S13 | S15 |
|---|---|---|---|---|---|
| BPR Precision | 0.780 | 0.780 | 0.773 | 0.734 | 0.674 |
| BPR Recall | 0.586 | 0.586 | 0.599 | 0.657 | 0.704 |
| BPR F-measure | 0.669 | 0.669 | 0.675 | 0.693 | 0.689 |
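The boundary-based precision, recall, and F-measure reported in these tables can be illustrated with a small sketch. It is not the Morpho Challenge script: it assumes micro-averaging over all boundaries and, when a word has several gold analyses, picks the alternative that shares the most boundaries with the prediction. Both choices, as well as the function names, are simplifying assumptions, so the numbers it produces may differ from those of the official scripts.

```python
def boundaries(segmentation, sep="-"):
    """Return the set of character offsets at which a segmentation cuts a word."""
    positions = set()
    offset = 0
    morphs = segmentation.split(sep)
    for morph in morphs[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def bpr_scores(predicted, gold, sep="-"):
    """Micro-averaged boundary precision, recall, and F-measure.

    predicted: dict mapping a word to one segmentation string.
    gold: dict mapping a word to a list of alternative gold segmentations.
    """
    hits = n_predicted = n_gold = 0
    for word, segmentation in predicted.items():
        pred = boundaries(segmentation, sep)
        # Assumption: score against the gold alternative that agrees best,
        # preferring the one with fewer boundaries when there is a tie.
        best = max(
            (boundaries(g, sep) for g in gold[word]),
            key=lambda g: (len(pred & g), -len(g)),
        )
        hits += len(pred & best)
        n_predicted += len(pred)
        n_gold += len(best)
    precision = hits / n_predicted if n_predicted else 1.0
    recall = hits / n_gold if n_gold else 1.0
    return precision, recall, f_measure(precision, recall)
```

As a usage example, scoring the predictions `g-a-t-a` and `lob-o-s` against the gold analyses `gat-a` and `lob-o-s` yields three correct boundaries out of five predicted and three expected, that is, a precision of 0.6 and a recall of 1.0.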
From Tables 5 and 6, we can observe that averaging the measurements (Avg) seems better than multiplying them and that using a threshold of 0.5 is preferable to using the highest values. Also, when looking for the next segmentation, any direction seems appropriate.
Regarding evaluation against other methods, we first present the best parameters found for each one. We show only the two best results for each method according to F-measure. These were selected for final evaluation. For Morfessor 2.0 (Table 7), the best parameters were α = 0.6 and α = 0.5 with dam = ones in both cases. Morfessor Categories-MAP performed well with per = 10 and per = 150 according to BPR (Table 8). The combination of α = 0.3 with per = 250 (flat 28) and per = 300 (flat 29) worked well for Morfessor FlatCat according to BPR (Table 9). Finally, ParaMor gained in F-measure with larger training sets (Table 10).
7 http://www.cslu.ogi.edu/monsonc/ParaMor.html.
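The strategy choices compared in Tables 5 and 6 (averaging versus multiplying the measurements, and a fixed threshold versus the highest value) can be sketched as follows. The sketch assumes that each candidate boundary already carries its affixality measurements normalized to the range 0 to 1; that normalization, the function names, and the reading of "highest values" as picking the single best-scoring position are assumptions made for illustration only, and the scores in the usage note are invented.

```python
from math import prod
from statistics import mean


def combine(measurements, how="avg"):
    """Combine the affixality measurements of one candidate boundary."""
    if how == "avg":
        return mean(measurements)
    if how == "prod":
        return prod(measurements)
    raise ValueError("unknown combination: %r" % how)


def threshold_boundaries(scores_by_position, how="avg", threshold=0.5):
    """Cut at every position whose combined score exceeds the threshold."""
    return sorted(
        position
        for position, measurements in scores_by_position.items()
        if combine(measurements, how) > threshold
    )


def best_boundary(scores_by_position, how="avg"):
    """Cut only at the position with the highest combined score."""
    return max(
        scores_by_position,
        key=lambda position: combine(scores_by_position[position], how),
    )
```

With the invented scores `{2: (0.1, 0.2, 0.3), 4: (0.9, 0.7, 0.8), 5: (0.6, 0.5, 0.7)}` for a word such as cantamos, averaging with a threshold of 0.5 selects positions 4 and 5 (cant-a-mos), whereas multiplying keeps only position 4, which illustrates why a product of values below 1 tends to undersegment.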
Table 7. The best parameters for Morfessor 2.0 on development set

| | DAM | α | BPR Precision | BPR Recall | BPR F-measure |
|---|---|---|---|---|---|
| mor2 1 | ones | 0.5 | 0.668 | 0.532 | 0.592 |
| mor2 2 | ones | 0.6 | 0.681 | 0.538 | 0.601 |

DAM=dampening

Table 8. The best parameters for Morfessor Categories-MAP on development set

| | PER | BPR Precision | BPR Recall | BPR F-measure |
|---|---|---|---|---|
| cat 1 | 10 | 0.690 | 0.575 | 0.627 |
| cat 4 | 150 | 0.701 | 0.567 | 0.627 |

PER=perplexity

Table 9. The best parameters for Morfessor FlatCat on development set

| | PER | α | BPR Precision | BPR Recall | BPR F-measure |
|---|---|---|---|---|---|
| flat 28 | 250 | 0.3 | 0.712 | 0.569 | 0.633 |
| flat 29 | 300 | 0.3 | 0.706 | 0.564 | 0.627 |

PER=perplexity

Table 10. The best parameters for ParaMor on development set

| | SIZE | BPR Precision | BPR Recall | BPR F-measure |
|---|---|---|---|---|
| par 4 | 200k | 0.822 | 0.451 | 0.582 |
| par 5 | 300k | 0.827 | 0.456 | 0.588 |

SIZE=size of training corpus

Final evaluation was carried out on the held-out set. Results for BPR show that our approach outperformed the other methods (Table 11). Like the other methods, our approach achieved higher precision than recall (except S15), that is, it undersegments. Nevertheless, our recall is higher than that of the others. Thus, a strength of our approach is that it regularly selects correct boundaries. Until we carry out an evaluation for other languages, a weakness of our approach is the possibility of overfitting to Spanish. Morfessor's performance could be explained by the fact that it was developed for morphologically richer languages than Spanish, that is, agglutinative and polysynthetic ones. To be fair, it is necessary to test our method with this kind of language too. Nevertheless, this family of methods was competitive. Also, ParaMor was a strong competitor (second place in BPR) because the Spanish language was involved in its development. In this case, more experimentation with a larger training corpus is required.

Table 11. Results comparing all methods on held-out set

| | S13 | S15 | par 5 | cat 4 | par 4 | flat 29 | cat 2 | flat 28 | mor2 1 | mor2 2 |
|---|---|---|---|---|---|---|---|---|---|---|
| BPR Precision | 0.736 | 0.683 | 0.825 | 0.711 | 0.817 | 0.700 | 0.690 | 0.686 | 0.674 | 0.675 |
| BPR Recall | 0.688 | 0.728 | 0.530 | 0.581 | 0.513 | 0.574 | 0.563 | 0.562 | 0.542 | 0.531 |
| BPR F-measure | 0.711 | 0.705 | 0.645 | 0.639 | 0.631 | 0.631 | 0.620 | 0.618 | 0.601 | 0.595 |

We randomly selected a few examples of segmentation from the best competitor of each approach (Table 12). Verb derivation was well solved by our approach in these examples (espej∼e∼ar, capital∼iz∼a, com∼ieron); however, some nouns were oversegmented (g∼a∼t∼a, mur∼al∼la). In comparison to our method, ParaMor left some words without segmentation.

Table 12. Random examples of segmentation from held-out set

| Gold standard | cat 4 | flat 29 | par 5 | mor2 1 | S13 |
|---|---|---|---|---|---|
| espej-e-ar, espej-e-a-r | espeje∼ar | espej∼ear | espejear | espej∼e∼ar | espej∼e∼ar |
| observ-atorio | observ∼atorio | observ∼atorio | observatorio | observ∼a∼torio | observatorio |
| capital-iz-a, | capital∼iza | capital∼iza | capitaliza | capital∼iza | capital∼iz∼a |
| gat-a | gata | gat∼a | g∼a∼t∼a | gat∼a | g∼a∼t∼a |
| lob-o-s | lo∼bos | lo∼bos | lob∼o∼s | lo∼bos | lob∼o∼s |
| dos-ific-ar, dos-ific-a-r | dosific∼ar | dosific∼ar | dosificar | d∼osific∼ar | dosific∼ar |
| reun-ion-es | reun∼iones | reun∼iones | reun∼i∼ones | reun∼iones | reun∼io∼nes |
| juegu-en, juegu-e-n | jue∼guen | jue∼gue∼n | jue∼gu∼e∼n | jue∼guen | juegu∼e∼n |
| com-ieron, com-ie-ro-n | com∼ieron | com∼ieron | com∼ie∼r∼on | com∼ieron | com∼ieron |
| mur-alla | mural∼la | mur∼al∼la | mural∼l∼a | mur∼al∼la | mur∼al∼la |
5. Conclusion
We presented a method based on affixality measurements for unsupervised morphological segmentation that seeks to go beyond one-slot morphology (one which deals with one prefix or one suffix) and head into a multi-slot morphology. Our approach consists in segmenting when the average of the affixality measurements surpasses a threshold of 0.5. These measurements are: 1) the number of squares (combinatorial possibilities of affixes and bases), 2) the ratio of words formed with an affix (in syntagmatic relation with word-bases) to alternating affixes (those in paradigmatic relation with the affix), and 3) the information content of the word-bases to which the affix attaches (entropy).
We compared our results with those of some state-of-the-art methods for unsupervised morphological segmentation. These results confirm that our method is a competitive option for the morphological segmentation of Spanish, since it outperformed all tested methods. It remains necessary to apply and evaluate our method on morphologically richer languages, especially those used in the Morpho Challenge, and in some applications.
Acknowledgments
We thank El Colegio de México A.C. for allowing us to use the CEMC. This work was supported by CONACyT grants: 34744/174764 and CB-2012/178248.
References
Alcoba, S., 1999. La flexión verbal, in: Bosque, I., Demonte, V. (Eds.), Gramática descriptiva de la lengua española. Espasa-Calpe, RAE, Madrid. volume 3, pp. 4305–4366.
Ambadiang, T., 1999. La flexión nominal. Género y número, in: Bosque, I., Demonte, V. (Eds.), Gramática descriptiva de la lengua española. Espasa-Calpe, RAE, Madrid. volume 3, pp. 4843–4913.
Anderson, S., 1985. Typological distinction in word formation, in: Shopen, T. (Ed.), Language Typology and Syntactic Description. Grammatical Categories and the Lexicon. Cambridge University Press, Cambridge. volume III, pp. 3–56.
Anderson, S., 1992. A-Morphous Morphology. Cambridge University Press, Cambridge.
Beniers, E., 2004. La formación de verbos en el español de México. volume 54. Universidad Nacional Autónoma de México, México.
Bernhard, D., 2006. Unsupervised morphological segmentation based on segment predictability and word segments alignment, in: Proc. of the Pascal Challenges Workshop on the Unsupervised Segmentation of Words into Morphemes, pp. 19–23.
Creutz, M., 2003. Unsupervised segmentation of words using prior distributions of morph length and frequency, in: Proc. of the 41st Annual Meeting of the ACL, ACL. pp. 280–287.
Creutz, M., Lagus, K., 2002. Unsupervised discovery of morphemes, in: Proc. of the Workshop on Morphological and Phonological Learning of ACL-02, Philadelphia, SIGPHON-ACL. pp. 21–30.
Creutz, M., Lagus, K., 2004. Induction of a simple morphology for highly-inflecting languages, in: Proc. of 7th Meeting of the ACL Special Interest Group in Computational Phonology, pp. 43–51.
Creutz, M., Lagus, K., 2005. Inducing the morphological lexicon of a natural language from unannotated text, in: Int. and Interdisciplinary Conf. on Adaptive Knowledge Representation and Reasoning, AKRR05. pp. 106–113.
Creutz, M., Lagus, K., 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Trans. Speech Lang. Process 4.
De Kock, J., Bossaert, W., 1974. Introducción a la lingüística automática en las lenguas románicas. Gredos, Madrid.
De Kock, J., Bossaert, W., 1978. The morpheme: an experiment in quantitative and computational linguistics. Van Gorcum, Amsterdam.
Flenner, G., 1994. Ein quantitatives morphsegmentierungssystem für spanische wortformen, in: Klenk, U. (Ed.), Computation Linguae II. Franz Steiner, Stuttgart. volume 83 of ZDL-Beiheft, pp. 31–62.
Gelbukh, A., Alexandrov, M., Han, S.Y., 2004. Detecting inflection patterns in natural language by minimization of morphological model, in: Progress in Pattern Recognition, Image Analysis and Applications. Springer, Berlin Heidelberg, pp. 432–438.
Gelbukh, A., Sidorov, G., 2003. Approach to construction of automatic morphological analysis systems for inflective languages with little effort, in: Computational linguistics and intelligent text processing. Springer, Berlin Heidelberg, pp. 215–220.
Gelbukh, A., Sidorov, G., Lara-Reyes, D., Chanona-Hernandez, L., 2008. Division of spanish words into morphemes with a genetic algorithm, in: Natural Language and Information Systems. Springer, Berlin Heidelberg, pp. 19–26.
Goldsmith, J., 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics 27, 153–198.
Goldsmith, J., 2006. An algorithm for the unsupervised learning of morphology. Natural Language Engineering 12, 353–371.
Goldsmith, J., 2010. Segmentation and morphology, in: The Handbook of Computational Linguistics and Natural Language Processing. Wiley-Blackwell, Oxford, pp. 364–393.
Greenberg, J.H., 1957. Essays in linguistics. University of Chicago Press, Chicago.
Grönroos, S.A., Virpioja, S., Smit, P., Kurimo, M., 2014. Morfessor flatcat: An hmm-based method for unsupervised and semi-supervised learning of morphology, in: Proc. of COLING, the 25th International Conference on Computational Linguistics, pp. 1177–1185.
Hafer, M.A., Weiss, S.F., 1974. Word segmentation by letter successor varieties. Information Storage and Retrieval 10, 371–385.
Hammarström, H., Borin, L., 2011. Unsupervised learning of morphology. Computational Linguistics 37, 309–350.
Harris, Z.S., 1955. From phoneme to morpheme. Language 31, 190–222.
Haspelmath, M., 2002. Understanding morphology. Oxford University Press, New York.
Hockett, C.F., 1971. Curso de lingüística moderna. EUDEBA, Buenos Aires.
Klenk, U., Langer, H., 1989. Morphological segmentation without a lexicon. Literary and Linguistic Computing 4, 247–253.
Lara, L.F., 2010. Diccionario del español de México. El Colegio de México, México.
Lara, L.F., Ham Chande, R., García Hidalgo, M.I., 1979. Investigaciones lingüísticas en lexicografía. volume 89. Colegio de México, México.
Medina-Urrea, A., 2000. Automatic discovery of affixes by means of corpus: A catalog of spanish affixes. Journal of Quantitative Linguistics 7, 97–114.
Medina-Urrea, A., 2003. Investigación cuantitativa de afijos y clíticos del español de México. Glutinometría en el Corpus del Español Mexicano Contemporáneo. Ph.D. thesis. El Colegio de México. México.
Medina-Urrea, A., 2007. Affix discovery by means of corpora: Experiments for spanish, czech, ralámuli and chuj, in: Mehler, A., Köhler, R. (Eds.), Aspects of Automatic Text Analysis. Springer, Berlin Heidelberg. volume 209, pp. 277–299.
Meya, M., 1986. Morphologische analyse des spanischen, in: Schwarz, C., Thurmair, G. (Eds.), Informationslinguistische Texterschließung. Georg Olms Verlag, Zürich. volume 4, pp. 134–156.
Monson, C., Carbonell, J., Lavie, A., Levin, L., 2007. Paramor: Minimally supervised induction of paradigm structure and morphological analysis, in: Proc. of 9th Meet. of the ACL Special Interest Group in Computational Morphology and Phonology, ACL. pp. 117–125.
Monson, C., Carbonell, J., Lavie, A., Levin, L., 2008a. Paramor: Finding paradigms across morphology, in: Advances in Multilingual and Multimodal Information Retrieval. Springer, Berlin Heidelberg, pp. 900–907.
Monson, C., Lavie, A., Carbonell, J., Levin, L., 2004. Unsupervised induction of natural language morphology inflection classes, in: Proc. of the 7th Meet. of the ACL Special Interest Group in Computational Phonology, ACL. pp. 52–61.
Monson, C., Lavie, A., Carbonell, J., Levin, L., 2008b. Evaluating an agglutinative segmentation model for paramor, in: Proc. of the 10th Meet. of ACL Special Interest Group on Computational Morphology and Phonology, ACL. pp. 49–58.
Moreno de Alba, J.G., 1986. Morfología derivativa nominal en el español de México. Universidad Nacional Autónoma de México, México.
Nida, E.A., 1949. Morphology. The Descriptive Analysis of Words. The University of Michigan, Ann Arbor.
Sproat, R., 1992. Morphology and Computation. The MIT Press, Cambridge, London.
Virpioja, S., Smit, P., Grönroos, S.A., Kurimo, M., 2013. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline. Technical Report. Aalto University. Helsinki.
Virpioja, S., Turunen, V.T., Spiegler, S., Kohonen, O., Kurimo, M., 2011. Empirical comparison of evaluation methods for unsupervised learning of morphology. Traitement Automatique des Langues 52, 45–90.