Language Sciences 57 (2016) 21–33

Contents lists available at ScienceDirect

Language Sciences journal homepage: www.elsevier.com/locate/langsci

A corpus study on identification and semantic classification of light verb constructions in Persian: the case of light verb xordan ‘to eat/collide’

Ramin Golshaie*

Iranian Research Institute for Information Science and Technology (IranDoc), Tehran, Iran

Article info

Article history: Received 17 May 2015; received in revised form 18 April 2016; accepted 17 May 2016; available online 11 June 2016

Abstract

One of the challenging problems in the domain of Persian light verb constructions (LVCs) is to discover and classify the multiple senses of a light verb (LV). This is important because the productive use of LVs in Persian, which leads to the formation of novel LVCs, can be explained with reference to these established LV senses. Identifying LVCs in the first place is another problem, and a complex one: not only is there no objective criterion for their identification, but the constituent elements of some LVCs can also be split by interposed linguistic units, which makes them difficult to identify. This paper addresses these two issues using corpus methodology. To identify LVCs, the LV xordan, with two unrelated meanings ‘to eat/collide’, was chosen for analysis and the corresponding LVCs were extracted from a sampled 50-million-word corpus based on a measure of collocational association. The extracted LVCs consisted of frequent compositional and idiomatic noun-verb (N-V) patterns found in the corpus. Corpus examinations revealed that frequent compositional N-V sequences have constructional meanings and need to be recognized as LVCs. Finally, to discover the LV senses, 700 concordance lines of the extracted LVCs were studied and classified based on a behavioral profile analysis of their corpus usage patterns. According to the results of the behavioral profile analysis, two constructional senses, EAT and COLLIDE, coexist under xordan, each subsuming its own semantically related LVCs. The findings, while supporting the overall constructionist assumptions on the polysemy network of LV senses, necessitate a reconsideration of constructionhood criteria in Persian LVCs alongside the process of identification and classification of senses.

© 2016 Elsevier Ltd. All rights reserved.

Keywords: Light verb constructions; Corpus linguistics; Behavioral profile analysis; Collocations; Semantics

* P.O. Box: 13185-1371, 1090 Enqelab Ave., Tehran, Iran. Tel.: +98 (0)21 66 4949 80. E-mail address: [email protected].
http://dx.doi.org/10.1016/j.langsci.2016.05.002
0388-0001/© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

Persian light verb constructions1 (LVCs)2 have been the focus of much controversial research in recent decades. These multiword expressions are composed of a non-verbal element (NV), which is usually a noun or an adjective, and a light verb (LV), which is semantically impoverished compared to its ‘heavy’ counterpart. In studying Persian LVCs, various aspects including syntactic properties, separability/non-separability, event structure, productivity, and semantic characteristics of LVCs have been investigated by researchers (e.g. Barjasteh, 1983; Dabir-Moghaddam, 1997; Folli et al., 2005; Haji-Abdolhosseini, 2000; Karimi, 1997; Karimi-Doostan, 1997, 2005, 2011; Megerdoomian, 2001; Mohammad and Karimi, 1992; Vahedi-Langrudi, 1996; Tabaian, 1979). Of these major themes, the topic of productivity (making new LVCs by adjoining a new NV to an already existing LV) continues to be of special interest for researchers.

1 Also known as ‘compound verbs’ or ‘complex predicates’ in the literature.
2 The following abbreviations have been used in the article: 1/2/3 = person marker; ACC = accusative marker; lit. = literal; LV = light verb; LVC = light verb construction; N = noun; NV = non-verbal element; PAST = past tense; PL = plural; POSS = possessive; PROG = progressive; SG = singular; V = verb. Other abbreviations have been defined in the text or footnote where necessary.

In recent studies on Persian LVCs (Family, 2006, 2008), productivity is explained in the cognitive-constructional framework. LVCs are considered form-meaning pairings, and an LV is said to have different constructions in which specific semantically related NVs can appear. In other words, according to this approach, each LV in Persian forms networked clusters of polysemous meanings for which specific types of NVs are appropriate.

While the constructionist account seems to provide a solid scaffolding for explaining the productive behavior of Persian LVCs, some challenges still need to be addressed. One major problem is to empirically validate the constructionist account by finding a systematic method for discovering and classifying LV senses in Persian. For example, the LV xordan (studied in this paper) has two unrelated meanings, namely ‘to eat’ and ‘to collide’. These two meanings seem to play an equal role in the semantics of the resulting LVCs, but in the literature these senses have been lumped together when studying LVCs, and no clear semantic classification has been provided as to which sense is associated with which LVCs. Identifying LVCs among N-V sequences is another challenge, given that some N-V sequences are ambiguous between verb phrases and LVCs. Furthermore, the NV and LV in most LVCs can be separated by interposed linguistic elements, which complicates the process of identifying LVCs.
It is proposed that these problems can be dealt with in a more effective and systematic way using corpus methods. Corpus-linguistic methods have been used to study multiword expressions (including LVCs) in a variety of languages such as English (Stevenson et al., 2004), Chinese (Huang et al., 2014; Lin et al., 2014), and Urdu (Ahmed, 2010; Ahmed and Butt, 2011). In Persian, except for some studies focused on NLP3 applications (e.g. Gerdes and Samvelian, 2008; Rasooli et al., 2011), no significant corpus study has been carried out on LVCs. In the present study, corpus-based quantitative methods will be used to identify and semantically classify Persian LVCs containing the LV xordan. First, in order to identify and extract LVCs (including separable4 ones) from the corpus, the mutual information measure (Church and Hanks, 1990; Stubbs, 1995) will be used as an indicator of collocational association. It will be shown that frequent compositional LVCs have constructional meanings that distinguish them from phrasal N-V sequences. Second, based on a behavioral profile analysis (Gries, 2006, 2010) of the extracted LVCs, it will be shown that the LV xordan simultaneously subsumes two major constructional senses (namely EAT and COLLIDE), each with its own semantically related LVCs.

In the next section, a brief review of the relevant literature is presented. Section 3 introduces the methods, the corpus, and the procedures employed in the extraction and semantic classification of LVCs. Section 4 presents the results of the statistical analysis of the data. Section 5 discusses the main findings in comparison with previous research and implications for reconsidering constructionhood criteria in Persian LVCs. Finally, Section 6 concludes the article by pointing to the main findings of the study.

2. Background

Persian, contrary to languages like English, has fewer than 200 simple verbs (Sadeghi, 1993). For this reason, LVCs in Persian play an important role in expressing various verbal notions inexpressible by simple verbs. Although LVs may have a heavy or full semantic content in other contexts of use, they contribute less to the meaning of LVCs (see Jesperson, 1965). For example, compare the heavy and light verb usages of Persian xordan (‘to eat’) in the following sentences:
(1)

a. Maryam sib râ xor-d.
   Maryam apple ACC eat-PAST.3SG
   ‘Maryam ate the apple.’

b. Maryam xeili ghosse xor-d.
   Maryam very grief eat-PAST.3SG
   ‘Maryam grieved very much.’

In sentence (1a), xord ‘ate’ is used as a heavy verb taking Maryam as its subject and sib ‘apple’ as its direct object. In sentence (1b), however, xord is not used in its heavy meaning. In this example, the LV xord together with the NV ghosse ‘grief’ is functioning as an LVC in the sentence which lexically means ‘to grieve’. In sentence (1b), the semantic relation between the NV and LV is not transparent compared to the semantic transparency existing between the direct object and the verb in (1a). Cases like (1b) can be considered typical instances of LVCs in Persian since the meaning of the construction is idiomatic or non-compositional.5

3 Natural Language Processing.
4 In separable LVCs, NV and LV can be syntactically separated by intervening constituents.
5 The terms ‘non-compositional’, ‘idiomatic’, ‘constructional’ and ‘unpredictable’ have been used interchangeably throughout the article to refer to a meaning that is not the compositional sum of the meanings of its component parts.

In cases where the LVC has a compositional meaning, deciding whether an N-V sequence constitutes an LVC or simply a verb phrase has been challenging. Some researchers have argued that the N-V sequences such as sobhâne xordan (lit. breakfast-eating, ‘to have breakfast’) and ghazâ xordan (lit. food-eating, ‘to have food’) in which N can be analyzed as the direct object of the V, are not LVCs but verb phrases (e.g. Seifollahi and Tabibzadeh, 2013; Tabatabaie, 2005). For example, Tabatabaie (2005) argues that LVCs function as a whole semantic unit, while in verb phrases the meaning of the whole expression is compositional. Furthermore, he introduces the substitution test as another criterion that can distinguish verb phrases from LVCs. According to this test, in verb phrases the nominal element (N) can be substituted with other Ns without having the semantic relationship between the N and V changed. Thus, for example, the noun ghazâ in ghazâ xordan (lit. food-eating, ‘eating’) can be substituted with other Ns belonging to the category of edibles such as sobhâne ‘breakfast’, nâhâr ‘lunch’, and chây ‘tea’. Dabir-Moghaddam (1997), on the other hand, considers some compositional N-V sequences as LVCs in which the direct object (N) is incorporated into the verb. He argues that there is a semantic difference between these LVCs and their nonincorporated counterpart. For instance, consider the following example discussed by Dabir-Moghaddam: (2)

a. bache-hâ ghazâ-esh-ân râ xor-d-and
   child-PL food-POSS.3-PL ACC eat-PAST-3PL
   ‘The children ate their food.’

b. bache-hâ ghazâ xor-d-and
   child-PL food eat-PAST-3PL
   ‘The children did food-eating.’

According to Dabir-Moghaddam’s analysis, the direct object in (2a) loses its grammatical endings and is incorporated into the verb to create an intransitive LVC in (2b). The meaning of ghazâ is referential in (2a) but not in (2b). Dabir-Moghaddam also notes that in some ‘to eat’-related LVCs the extra meaning of the construction is more conventionalized and distanced from the core ‘to eat’ meaning. For example, the LVC shirini xordan (lit. sweets-eating, ‘betrothing’) has acquired a ritualized meaning in the context of the marriage proposal ceremony. Since in Iranian traditions sweets are served after the marriage proposal has been successful, the LVC has developed an extra meaning denoting an accepted proposal.

Dabir-Moghaddam also makes similar observations about prepositional phrases that become LVCs through the process of incorporation. For example, he considers the LVC zamin xordan (lit. ground-collide, ‘to hit the ground/to fall’) to be the incorporated version of the prepositional phrase be zamin xordan, in which the preposition be ‘to’ is removed and the prepositional object zamin ‘ground’ is incorporated into the verb. He does not discuss the semantic differences between these prepositional and incorporated versions in detail, but his previous arguments imply that the incorporated compound verb zamin xordan constitutes a conceptual whole different from the prepositional phrase be zamin xordan.

Assessing the validity of these two contrasting views on whether compositional N-V sequences should be recognized as LVCs or as verb phrases can be regarded as the problem of identifying LVCs in general, and it requires us to recall the definition of constructions before advancing to a solution. Constructions have been defined as form-meaning pairings stored in the lexicon with varying degrees of abstractness/specificity, with some aspects of their form or meaning being unpredictable (Goldberg, 1995; Kay, 1995; Kay and Fillmore, 1999).
Additionally, sufficiently frequent patterns are also considered to be stored as constructions even if they are fully predictable (Goldberg, 2003a, 2006, 2009). This latter criterion takes semantic predictability into account and, therefore, can be used in determining the constructionhood of the disputed compositional (predictable) N-V sequences. Thus, based on this criterion and the fact that idiomatic LVCs too involve frequent NV + LV patterns, it is plausible to assume that significantly frequent co-occurrences of N-V sequences are a reliable indication of LVCs. Frequent N-V sequences can be identified by extracting collocations (i.e. frequently co-occurring linguistic elements) from the corpus. In the case of compositional N-V patterns, a corpus-based analysis of these expressions should also allow a detailed investigation and discovery of their possible constructional meaning. Importantly, it is also possible to search the corpus for distanced co-occurrence patterns beyond the usual adjacent N-V sequences. This is essential because there are separable LVCs in Persian that allow some syntactic or lexical elements (such as the accusative marker râ or adjectives) to be inserted between the NV and LV, whereas in non-separable ones such insertions are not possible (e.g. Barjasteh, 1983). Separable LVCs, contrary to inseparable ones, pose problems for the automatic extraction of LVCs from large corpora, since the NV and LV are separated by other elements and recognizing them as parts of a single verbal expression becomes challenging for computers.

After these challenges of identifying LVCs are addressed, the next step involves discovering and classifying LV senses, which is important for explaining productivity in Persian LVCs. Productivity of LVCs in Persian has recently been studied within the cognitive-constructional framework (Family, 2006, 2008, 2011; Goldberg, 2003b).
Family (2006) attempts to explain the way Persian speakers disambiguate LVs in different LVCs. For example, she notes that the LV keshidan ‘to pull’ can be used with ‘hooka’, ‘hash’, ‘cigarette’, ‘pipe’ and everything that can be smoked. She suggests that the meaning of keshidan refers to the metaphorical act of pulling some substance (i.e. smoke) out of something (pipe, cigarette, etc.). On the other hand, in LVCs such as jâdde keshidan ‘make a road’, divâr keshidan ‘build a wall’, and narde keshidan ‘put up a fence’, keshidan involves building something that is extended in space with the NV being the object that is drawn out.

According to Family (2014), each LV has different constructions called ‘clusters of productivity’, and each construction is composed of some semantically related NVs. These constructions seem to profile a specific aspect of the LV’s heavy meaning. That is to say, LVs in Persian LVCs retain some central characteristics of their heavy-verb meaning, thus forming a polysemy network by being connected to the central meaning of the LV. The network formed by the different meanings of LVs makes the semantics of new constructions predictable. In her approach to Persian LVCs, LVs are not devoid of meaning (see also Brugman, 2001), but contribute to the shared semantics of the clusters of productivity. On this account, the semantic unpredictability of an LVC is the result of the holistic nature of its meaning, which is not the compositional sum of the meaning of an LV plus an NV. On the other hand, since clusters of productivity have their own constructional meaning, new LVCs can be formed based on the existing established meanings of LVCs. In sum, there are two related characteristics of LVCs in Persian that contribute to productivity: (1) NVs of a given LV form semantically related clusters, and (2) LVs retain some central characteristics of their heavy-verb meaning.

I propose that analyzing the behavioral profile of usage patterns (in which the detailed contextual meanings and properties associated with LVCs are studied and recorded) and subsequently classifying these usage patterns based on their similarity would ultimately lead us to the multiple sense clusters formed around an LV. This empirical bottom-up approach would reveal to what extent theoretical claims are confirmed by corpus data. In this paper, we will focus on the LV xordan. It is the eighth most frequent LV in Persian according to Karimi-Doostan (1997).
Generally, xordan has two main meanings, namely ‘to eat’ and ‘to collide/hit’, of which ‘to eat’ is the most salient sense that immediately comes to mind out of its sentential context (Family, 2008). In this sense, it is a transitive verb and takes a volitional subject. When used in the ‘collide’ sense, xordan functions as an intransitive verb taking a non-volitional subject. The LV xordan was chosen for two reasons. First, it has relatively few senses compared with the rest of the LVs in Persian, which makes the analyses feasible. Second, for the two unrelated meanings ‘to eat’ and ‘to collide’ associated with this LV, no clear semantic classification is provided in the literature as to which LVC is based on which sense.

The present study attempts to address the problems of identifying LVCs as well as discovering and classifying senses of the LV xordan in Persian using corpus methodology. The question of interest regarding the extraction of LVCs is which N-V sequences are identified as LVCs based on corpus data. The second question is how to differentiate constructional senses of the LV based on its corpus usage patterns. To address the first question, the extraction of LVCs will be operationalized as the extraction of collocations. Since the NV and LV in LVCs regularly co-occur, they are assumed to form collocational associations. To calculate the strength of association between NV and LV, the measure of mutual information (Church and Hanks, 1990; Stubbs, 1995) will be employed, which is suitable for identifying idioms and compounds. The second question, which pertains to the differentiation of LV senses, will be answered by using a corpus-linguistic method known as behavioral profile analysis (Gries, 2006, 2010).
In this method, all the contextual information of a given linguistic element, including morphological, syntactic and semantic characteristics, is studied and recorded in a behavioral profile table, and the table is then submitted to a statistical cluster analysis to see how similarly the linguistic elements behave based on their corpus usage patterns. The theoretical basis of this context-based similarity goes back to the Distributional Hypothesis, which originated in linguistics (Harris, 1954) and was later tested and developed into computational models in cognitive science (Landauer and Dumais, 1997; McDonald and Ramscar, 2001). According to this hypothesis, linguistic elements with similar usage patterns are similar in meaning. The successful application of behavioral profile analysis in several synonymy and polysemy studies (e.g. Gries, 2006, 2010; Gries and Divjak, 2009; Gries and Otani, 2010; Liu and Espino, 2012) makes it an appropriate method to address the problem at hand, assuming that the various senses of a given LV in Persian form polysemy networks.

3. Method

In order to extract the intended LVCs systematically and retrieve the LV senses, two corpus-based methods were used: computing collocational associations and behavioral profile analysis. In the following subsections, the properties of the corpus used and the procedures for extracting LVCs and retrieving LV senses are explained.

3.1. The corpus

The corpus used for this study was a 50-million-word sample (approximately one-third) of the larger Hamshahri 2 corpus,6 a standard Persian text collection compiled by the Database Research Group of the University of Tehran (AleAhmad et al., 2009) in the period of 1996–2007. Hamshahri7 is an Iranian daily newspaper with cultural, political, economic, social, scientific, etc. content. In order to search the corpus and calculate some basic statistics, the AntConc program (Anthony, 2011) was used. AntConc is a corpus tool with Unicode support and capabilities such as performing contextual searches and calculating frequencies and collocations.

6 http://ece.ut.ac.ir/dbrg/hamshahri/fadownload.html.
7 http://www.hamshahrionline.ir/.

3.2. Identification of LVCs

To identify LVCs containing the LV xordan, the corpus-linguistic concept of collocation was employed. The definition adopted in the present study for a collocation is ‘a co-occurrence pattern that exists between two items that frequently occur in proximity to one another – but not adjacently.’ (McEnery and Hardie, 2012: 123). According to this definition, nonadjacent words can also be picked up as collocations, which can solve the problem of identifying separable LVCs. Since LVCs are recurring patterns of NV + LV, their identification can be facilitated by extracting collocations that have the LV xordan as their fixed element.

To calculate the strength of association between collocates, the statistical measure of mutual information (MI) was used (Church and Hanks, 1990; Stubbs, 1995). The MI of two words n and c is directly proportional to their joint co-occurrence and the corpus size (N), and inversely proportional to the independent occurrences of the words n and c. According to Church and Hanks (1990), MI values higher than 3 are a reliable indicator of significant collocations.

Extracting LVCs by calculating significant collocations required the various conjugated forms8 of the LV xordan to be considered as search terms. Since the corpus was in plain text form without any annotation, including morphological annotations such as verb lemmas,9 it was not possible to retrieve verb conjugations based on verb lemmas. To remedy this issue, the base form xor (which is shared by all the conjugations) was used as a string search term for significant collocates. The MI cut-off point for significant collocations was set to 4, so that collocations with MI > 4 were considered significant.
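As an illustration of the association measure described above, the following minimal Python sketch computes MI as log2(f(n,c) × N / (f(n) × f(c))), the usual formulation following Church and Hanks (1990), and counts left-context collocates of any token containing a base string. The function names, the toy tokens (given in simplified transliteration rather than Persian script), and the counts are hypothetical and are not taken from the study; any window-size adjustment that a particular concordancer may apply is ignored. The default span of five words mirrors the window used in this section.

```python
import math
from collections import Counter


def mutual_information(joint, freq_node, freq_collocate, corpus_size):
    """MI in bits: log2(joint * N / (f(node) * f(collocate)))."""
    return math.log2(joint * corpus_size / (freq_node * freq_collocate))


def left_window_collocates(tokens, base, span=5):
    """Count tokens found up to `span` words to the left of any token containing `base`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if base in token:
            counts.update(tokens[max(0, i - span):i])
    return counts


# Hypothetical toy data in simplified transliteration, not corpus figures.
tokens = "maryam xeili ghosse xord va ali ghaza ra ba ajale xord".split()
collocates = left_window_collocates(tokens, base="xor", span=5)

# Invented counts chosen so that the score lands exactly on the cut-off of 4.
score = mutual_information(joint=8, freq_node=16, freq_collocate=32, corpus_size=1024)
is_significant = score > 4  # the study keeps collocations with MI > 4
```

Because the base string is matched as a substring, unrelated words that happen to contain it would also be retrieved, so a collocate list produced this way still needs manual checking against the concordances.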
Furthermore, in order for separable LVCs (in which the NV and LV are separated by interposed constituents) to be included in the analysis, the significant collocations were searched for in a window span of 5 words (Gries, 2013: 339) to the left context of the search keyword.

3.3. Analysis of behavioral profiles

After LVCs were extracted by calculating significant collocations, the next stage involved studying and recording the morphological, syntactic, and semantic characteristics of the LVCs by reading the concordances, i.e. the sentential contexts of the LVCs. This procedure is known as behavioral profile analysis (Gries, 2006, 2010), during which all the contextual information of a given linguistic element (here LVCs), including morphological, syntactic, and semantic characteristics, is studied and tagged in a behavioral profile table. This table is ultimately used for determining which elements have similar usage patterns in the corpus. The data recorded in the behavioral profile table are then submitted to a statistical analysis.

In this study, the behavioral profile approach is employed rather differently compared to studies such as Gries (2006). For example, in Gries (2006) the major senses of ‘to run’ are identified from dictionaries and WordNet and are then attributed to various corpus usages of the verb ‘to run’. Ultimately, these senses are clustered based on their usage patterns quantified in the behavioral profile table. In the present study, however, the LV senses were not identified beforehand, as they are not systematically listed in dictionaries. Instead, the form of the LV was held constant, as in other polysemy studies, but rather than clustering pre-identified senses of the LV, the LVCs (NV + LV collocations) formed with the LV were clustered. This provided the opportunity to simulate Family’s (2006) strategy of identifying senses in a bottom-up manner, i.e. to start from LVCs and arrive at LV senses.
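A behavioral profile table of this kind can be sketched as follows, assuming each concordance line has been hand-tagged with one level per ID-tag. The sketch converts tag counts into relative frequencies per LVC and compares two profiles by Euclidean distance, which is one possible basis for a subsequent cluster analysis. The function names and the tagged examples are invented for illustration and are not the study's data.

```python
import math
from collections import Counter, defaultdict


def behavioral_profiles(tagged_lines):
    """Map each LVC to the relative frequency of every (ID-tag, level) pair."""
    counts = defaultdict(Counter)
    totals = Counter()
    for lvc, tags in tagged_lines:
        totals[lvc] += 1                  # one concordance line studied
        counts[lvc].update(tags.items())  # each observed (ID-tag, level) scores 1
    return {
        lvc: {pair: n / totals[lvc] for pair, n in pair_counts.items()}
        for lvc, pair_counts in counts.items()
    }


def euclidean(p, q):
    """Euclidean distance between two profiles; unobserved levels count as 0."""
    pairs = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in pairs))


# Hypothetical tagging of four concordance lines (illustration only).
tagged_lines = [
    ("ghosse xordan", {"tense": "past"}),
    ("ghosse xordan", {"tense": "present"}),
    ("sili xordan", {"tense": "past"}),
    ("sili xordan", {"tense": "past"}),
]
profiles = behavioral_profiles(tagged_lines)
distance = euclidean(profiles["ghosse xordan"], profiles["sili xordan"])
```

Storing only the observed (ID-tag, level) pairs keeps the table sparse; levels that never occur for an LVC are treated as relative frequency 0 when distances are computed.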
Gries (2010) distinguishes four stages in analyzing behavioral profiles of linguistic elements:

1. Retrieval of all or a random sample of word lemmas in the form of concordance lines.
2. Analyzing and tagging the linguistic characteristics of lemmas. Following Atkins (1987), Gries calls these characteristics ID-tags, which can include morphological, syntactic, semantic, etc. information. The ID-tags do not have specific and predefined features and can be defined based on a researcher’s needs.
3. Converting the data of stage 2 to a co-occurrence table. In the co-occurrence table, the relative frequency of each lemma co-occurring with each ID-tag is recorded.
4. Analysis of co-occurrence data using statistical methods such as cluster analysis.

8 Various inflected forms of a verb.
9 Canonical or dictionary form of a word or set of words.

Table 1
ID-tags used in behavioral profile analysis of Persian LVCs.

Type of ID-tag   ID-tag                             Levels of ID-tag
Morphological    tense                              present, past, future, NA*
                 mood                               infinitive, declarative, subjunctive, interrogative, imperative
                 aspect                             simple, progressive, perfect, NA
                 number                             singular, plural, NA
                 person                             1st, 2nd, 3rd, NA
                 negation                           negative, affirmative
Syntactic        type of verb                       transitive, intransitive
                 number of subject                  singular, plural, NA
                 object type                        direct object, prepositional, clausal, intra-LVC, NA
                 adpositions and conjunctions       be ‘to’, be gune ‘in the manner of’, be sude ‘in favor of’, barâye ‘for’, tavassote ‘by’, bâ ‘with’, az ‘from’, bar ‘on/about’, tâ ‘in order to’, dar ‘in’, ke ‘that’ (conj.), ruye ‘on’, (dar) moqâbele ‘against’, râ (accusative marker), NA
                 NV attachment                      attached, detached
                 NV modifiers                       clitic, intra-LVC adjective, NA
Semantic         LVC meaning                        physical, imaginable, abstract, NA
                 NV meaning**                       emotion/regret, motion, rotation, damage/wound, label/mark, physical deformation, link, edible, body part, being exposed to, sharp-tip weapon, target of collision, projectable object, deception, NA
                 connotation                        negative, positive, NA
                 subject semantics                  animate, human, concrete, abstract, NA
                 subject semantic role              agent, patient, experiencer, experienced, NA
                 type of contact in LVC             physical, metaphorical, NA
                 direction of image schema in LVC   from NV to subject, from subject to NV, NA
                 subject has control                yes, no, NA
                 rapid event                        yes, no, NA
                 subject movement                   physical, metaphorical, NA
                 discourse genre                    political, sports, medical, cultural/intellectual, crime/accident, life, social, NA

*NA = not applicable; ** some of the semantic features of NVs have been based on Family (2006).

In the present study, three types of ID-tags were used: morphological, syntactic, and semantic. Table 1 shows the types of tags and their levels. In Table 1, morphological ID-tags are divided into six categories of tense, mood, aspect, number, person, and negation, with each ID-tag category having various levels. The syntactic ID-tags record the syntactic information of the LVC, including transitivity, subject plurality, object type, adpositions and conjunctions, NV attachment (NV (in)separability from LV), and NV modifiers. Finally, semantic information is captured by semantic ID-tags of NV meaning (partially based on Family’s [2006] classification), connotation (whether the LVC has negative/positive connotation), subject semantics, subject semantic role, type of contact in LVC (whether NV metaphorically/physically makes contact with LV), direction of image schema in LVC (whether the direction of motion is from NV to LV or vice versa), subject control (whether the agent has control over the
action), rapid event (whether the event is rapid), subject movement (whether the subject moves metaphorically or physically), and discourse genre (the genre in which the LVC appears).

During the analysis of LVC concordances, a single level of each ID-tag took the value 1 (meaning that the feature was present in the LVC in that particular context of use) and the remaining levels of that ID-tag were coded as 0. After the values of the ID-tag levels for each LVC were summed up in the table, their relative frequencies were calculated by dividing the summed-up value of an ID-tag level for each LVC by the total number of concordances studied for that LVC. The result was a number between 0 and 1. For each of the 35 LVCs, 20 concordance lines were randomly selected from the corpus, studied, and ID-tagged according to the features defined in Table 1. This means that a total of 700 concordance lines (out of 8300 LVC occurrences altogether) were studied and tagged.

The resulting behavioral profile table (containing the relative frequencies of ID-tag levels) was submitted to hierarchical cluster analysis to obtain a classification of LV senses. Hierarchical cluster analysis is a family of techniques for clustering data and presenting them in a tree diagram (Baayen, 2008). To do the cluster analysis, a freely available R-based application (Jensen, 2013) was used. In cluster analysis, two parameters have to be specified: the distance metric and the amalgamation method. Following Gries and Divjak (2009), the Euclidean metric and the Ward method were used as the distance metric and amalgamation method, respectively.

4. Results and analysis

The significant collocates of the LV xordan that made up meaningful LVCs with the LV were sorted based on their frequency. Table 2 shows 35 significant collocates of the LV xordan that were later submitted to behavioral profile analysis. The data presented in Table 2 are notable in two respects.
First, both idiomatic and compositional LVCs can be seen in the data: compositional LVCs such as zamin xordan ‘to hit the ground/to fall’, ghazâ xordan ‘to eat food’, and nâhâr xordan ‘to have lunch’ in which the NV can be perceived as the direct object or prepositional object of the LV. On the other hand, LVCs such as ghute xordan ‘to float’, tarak xordan ‘to crack’, and mahak xordan ‘to be assayed’ in which NV cannot be straightforwardly perceived as the object of LV, has an idiomatic meaning. Second, two unrelated senses of ‘to eat’ and ‘to collide’ can be observed in the LVCs extracted from corpus. In LVCs such as ghazâ xordan ‘to eat food’, and chây xordan ‘to drink tea’, the LV xordan has the meaning of ‘to eat’ whereas in LVCs sili xordan ‘to be slapped’, zamin xordan ‘to hit the ground/to fall’, etc. the LV has ‘to collide’ meaning. In other LVCs such as tarak xordan ‘to crack’, tâb xordan ‘to swing’, and gereh xordan ‘to be tied to’, it is less clear which sense of ‘to eat’ or ‘to collide’ is involved. A semantic classification of the LV senses would reveal how these LVCs are connected to the two unrelated senses of xordan. But before we advance to semantic classification of senses, we need to get a closer look at some of the compositional LVCs in the corpus in search for any kind of unpredictable meaning. The corpus instances of the compositional LVCs were further examined. In the case of LVCs with a literal ‘to eat’ sense like ghazâ xordan ‘to eat food’, nâhâr xordan ‘to have lunch’, and sobhâne xordan ‘to have breakfast’, it was noticed that the action


Table 2 Significant collocates (NVs) of the LV xordan.

Collocations (LVCs) containing xordan | Literal translation | Meaning | Frequency (a)
be cheshm xordan | to eye-xordan | to catch eye (by chance) | 2248
shekast xordan | defeat-xordan | to be defeated | 1361
ghazâ xordan | food-xordan | to eat food | 582
raqam xordan | number-xordan | to happen | 416
gereh xordan | knot-xordan | to be tied to | 398
be dard xordan | to pain-xordan | to come in handy | 303
tekân xordan | movement-xordan | to jerk | 291
peyvand xordan | link-xordan | to be tied to | 260
zarbe xordan | hit-xordan | to be hit | 247
gol xordan | goal-xordan | to receive a goal | 238
(be) zamin xordan | (to) ground-xordan | to hit the ground/to fall | 191
kotak xordan | smack-xordan | to be smacked | 144
barham xordan | to-one-another-xordan | to be unsettled | 131
qasam xordan | swear-xordan | to swear | 127
farib xordan | trick-xordan | to be deceived | 115
latme xordan | damage-xordan | to be damaged | 112
xâk xordan | soil-xordan | to gather dust | 100*
sobhâne xordan | breakfast-xordan | to have breakfast | 90*
xat xordan | cross-out-xordan | to be struck off | 83
afsus xordan | regret-xordan | to regret | 79
xune del xordan | blood heart-xordan | to suffer (emotionally) | 77
ghosse xordan | grief-xordan | to grieve | 68
taassof xordan | lament-xordan | to lament | 68*
chây xordan | tea-xordan | to drink tea | 61
sogand xordan | swear-xordan | to swear | 58
nâhâr xordan | lunch-xordan | to have lunch | 50
tâb xordan | swing-xordan | to swing | 50
bargasht xordan | return-xordan | to be returned (checks) | 49
be hadaf xordan | to target-xordan | to hit the target | 48
ghute xordan | float-xordan | to float | 47
hasrat xordan | yearning-xordan | to yearn | 47
gul xordan | trick-xordan | to be deceived | 44
tarak xordan | crack-xordan | to crack | 43
mahak xordan | touchstone-xordan | to be assayed | 41
sili xordan | slap-xordan | to be slapped | 33

(a) Frequency counts marked with an asterisk (*) have been modified after studying concordances to exclude irrelevant non-LVC co-occurrences.

characterized by these LVCs was of a habitual and non-referential nature. The semantic difference between these LVCs and other types of eating is that the NV and the LV are conceptually integrated, comparable to actions denoted by intransitive verbs such as walking, sleeping, and running. The case of compositional LVCs with the ‘to collide’ sense is somewhat different. For example, corpus instances of the LVC (be) zamin xordan were examined (see example 3). Of the 119 corpus instances, 46 were in prepositional-phrase form (be zamin xordan) and 73 were without the preposition. Two senses, ‘to fall by losing balance/control’ and ‘to collide with the ground’, can be distinguished in the usage data. Examples (3a–b) below show the prototypical usages of these senses in the corpus:

(3) a. u râ rahâ kar-d-am sar-ash be zamin xor-d
       him ACC free do-PAST-1SG head-POSS.3SG to ground collide-PAST.3SG
       ‘I let him fall and his head hit the ground.’

    b. amâ aghlab zamin mi-xor-d va az digarân aghab mi-oft-âd
       but often ground PROG-collide-PAST.3SG and from others behind PROG-fall-PAST.3SG
       ‘But [he] often fell down and lagged behind.’

The corpus analysis revealed that the non-prepositional string zamin xordan (3b) was used almost exclusively in the sense ‘to fall by losing balance/control’. Table 3 summarizes the results. According to Table 3, the LVC zamin xordan is used with the meaning ‘to fall by losing balance/control’ in 97% of cases (75% literal and 22% metaphorical), while be zamin xordan is used almost equally in both senses, ‘to fall by losing balance/control’ and ‘to collide with the ground’. In 22% of its usages, zamin xordan has the metaphorical meaning ‘to fail’, which seems to be an extension of the meaning ‘to fall by losing balance/control’. Thus, the meaning of zamin xordan is not totally


Table 3 Distribution of senses for zamin xordan and be zamin xordan in the corpus.

Pattern | Literal translation | Frequency | ‘To fall by losing balance/control’ sense: total | ‘To fall by losing balance/control’ sense: literal | ‘To fall by losing balance/control’ sense: metaphorical | ‘To collide with the ground’ sense
zamin xordan | ground-collide | 73 | 71 (97%) | 55 (75%) | 16 (22%) | 2 (3%)
be zamin xordan | to-ground-collide | 46 | 26 (56%) | 24 (52%) | 2 (4%) | 20 (44%)

compositional in the sense of merely saying that something collides with the ground. Here the additional meaning ‘falling by losing balance/control’ is not predictable from the meanings of the component parts.

In sum, the two types of compositional LVCs in our data, those with the ‘to eat’ sense and those with the ‘to collide’ sense, seem to have varying degrees of unpredictability. The compositional LVCs with the literal ‘to eat’ sense, such as nâhâr xordan, chây xordan, and ghazâ xordan, have acquired a non-referential and conceptually independent meaning, which amounts to a low degree of unpredictability. The compositional LVC zamin xordan, which belongs to the ‘to collide’ sense, has acquired the constructional meaning ‘to fall by losing balance’, which is distinguishable from its primary compositional meaning. Thus, the semantic unpredictability of zamin xordan appears to be more pronounced than that of the LVCs with the literal sense ‘to eat’.

Once the LVCs had been extracted, the next step was to identify the major senses of the LV based on its corpus usage patterns. After the behavioral profile table for the 35 LVCs had been prepared, it was submitted to hierarchical cluster analysis with the aim of calculating the similarity distance between cases (here LVCs) based on multiple variables (here ID-tags). Fig. 1 shows the resulting dendrogram. In the dendrogram (Fig. 1), the LVCs are spread along the x-axis. The y-axis, labeled ‘Height’, measures the degree of closeness of the clusters: the lower the point at which two clusters are joined, the more similar they are. In Fig. 1, all senses of the LV xordan are first divided into two broad clusters labeled 1 and 2. The LVCs in cluster 1 revolve around the ‘to eat’ meaning of the LV xordan, physically or metaphorically. Thus, the first dichotomy in the dendrogram seems to distinguish the EAT10 sense of xordan from the other senses of the LV. The clusters and sub-clusters with their identified senses are provided in Table 4.
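The behavioral profile table that feeds the cluster analysis can be illustrated with a minimal Python sketch. The annotations, the level names (`yes`, `no`, `physical`, `none`), and the helper names `behavioral_profile` and `euclidean` are hypothetical; they only show how coding one level per ID-tag turns into relative frequencies and how two profiles can be compared with the Euclidean metric. The values are invented and are not taken from the corpus.

```python
import math
from collections import Counter

# Hypothetical annotations: each concordance line records the one level
# observed for each ID-tag (that level is coded 1, all other levels 0).
annotations = {
    "sogand_xordan": [
        {"rapid_event": "no", "subject_movement": "none"},
        {"rapid_event": "no", "subject_movement": "none"},
        {"rapid_event": "no", "subject_movement": "none"},
        {"rapid_event": "yes", "subject_movement": "none"},
    ],
    "sili_xordan": [
        {"rapid_event": "yes", "subject_movement": "none"},
        {"rapid_event": "yes", "subject_movement": "physical"},
        {"rapid_event": "yes", "subject_movement": "none"},
        {"rapid_event": "yes", "subject_movement": "none"},
    ],
}


def behavioral_profile(lines):
    """Relative frequency of every (ID-tag, level) pair over the lines."""
    counts = Counter()
    for line in lines:
        for tag, level in line.items():
            counts[(tag, level)] += 1
    return {pair: n / len(lines) for pair, n in counts.items()}


def euclidean(p, q):
    """Euclidean distance between two profiles; unseen levels count as 0."""
    pairs = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in pairs))


profiles = {lvc: behavioral_profile(lines) for lvc, lines in annotations.items()}
distance = euclidean(profiles["sogand_xordan"], profiles["sili_xordan"])
```

With 20 concordance lines per LVC, as in the present study, every relative frequency in such a table is a multiple of 0.05.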
Cluster 1 accommodates three seemingly unrelated senses: ‘to eat’, ‘to swear’, and ‘to regret/grief’. The pattern of amalgamation shows that the sub-clusters ‘to eat’ and ‘to swear’ are amalgamated first, and the resulting cluster then joins the ‘regret/grief’ sub-cluster. The corpus instances suggest that the NV categories of cluster 1 share the following features:

1. The subject in the LVCs of cluster 1 is animate and mainly human.
2. The actions characterized by the LVCs of cluster 1 are mainly voluntary, so the subject can exert control over them. The LVCs of the ‘regret/grief’ sub-cluster, though not fully voluntary, can be considered semi-voluntary, since the subject who passively experiences the feeling is capable of exerting conscious control over their state of mind and changing it, depending on their meta-cognitive skills.

All the LVCs of cluster 1 share some kind of ‘to eat’ meaning, literally or metaphorically. The LVCs nâhâr xordan ‘to have lunch’, chây xordan ‘to drink tea’, sobhâne xordan ‘to have breakfast’, and ghazâ xordan ‘to eat food’ all share the literal meaning ‘to eat’. The sub-cluster that immediately joins the literal ‘to eat’ category has two members with the meaning ‘to swear’: sogand xordan and qasam xordan. The grouping of the ‘to swear’ and ‘to eat’ sub-clusters would be motivated by the fact that, according to the Moein Persian dictionary, the LV xordan in the LVCs meaning ‘to swear’ was used in its literal ‘to eat’ meaning in the past.11 Finally, the last sub-cluster to amalgamate with the ‘eating-swearing’ cluster encompasses LVCs with the meaning ‘to regret’ or ‘to yearn’: ghosse xordan ‘to grieve’, hasrat xordan ‘to yearn’, taassof xordan ‘to lament’, xune del xordan ‘to suffer (emotionally)’, and afsus xordan ‘to regret’. In this sub-cluster, the metaphorical ‘to eat’ meaning is explicitly spelled out in the LVC xune del xordan (literally ‘to eat heart blood’). The metaphor conceptualizes emotional suffering in terms of physical suffering and bleeding. The other member LVCs of the ‘regret/grief’ sub-cluster give no lexical clue to the ‘to eat’ meaning of xordan.

Cluster 2 (see Fig. 1) has seemingly more varied senses than cluster 1, but the main sense of this cluster, shared by most of its members, is the COLLIDE sense. Cluster 2 is divided into two main sub-clusters, 2.1 and 2.2, which are distinguished by whether their subject is moving or stationary. Within cluster 2, seven senses can be identified. There is also a cluster (labeled [uncategorized]) with varied senses for which no interpretable similarity was found, though the general characteristics of cluster 2, i.e. the ‘to collide’ meaning and ‘subject being stationary’, are applicable to this uncategorized cluster. In describing the major characteristics of cluster 2 that distinguish it from cluster 1, the following corpus-based observations can be made:

10 Senses written in capital letters denote the two major constructional senses that emerged in the cluster dendrogram.

11 In ancient Iran, sogand/qasam xordan had been used literally. In fact, sogand was a variant of the Avestan saokenta (meaning ‘sulfur’), which a person suspected of lying had to drink in order to prove he/she was not lying. If the person survived the chemical substance, it was proof of his/her telling the truth.


[Figure 1 image: ‘Cluster Dendrogram’. The leaves are the 35 LVCs listed in Table 2, and the vertical axis, labeled ‘Height’, runs from 0 to 14. The top-level split separates cluster 1 from cluster 2, which divides into sub-clusters 2.1 and 2.2.]

Fig. 1. Dendrogram for the cluster analysis of 35 LVCs containing the LV xordan.
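How a dendrogram like Fig. 1 arises from a behavioral profile table can be sketched in a few lines of Python. The sketch below implements agglomerative clustering with Ward's criterion on Euclidean distances through the Lance-Williams update. The four two-dimensional profiles and the function name `ward_linkage` are hypothetical; the study itself used an R-based application (Jensen, 2013), whose exact Ward variant and merge heights may differ from this illustration.

```python
import math
from itertools import combinations


def ward_linkage(points):
    """Agglomerative clustering with Ward's criterion (Lance-Williams update).

    points maps a label to a tuple of floats. Returns the list of merges as
    (members_a, members_b, height), in the order in which they happen.
    """
    clusters = {i: (label,) for i, label in enumerate(points)}
    vectors = list(points.values())
    # Squared Euclidean distances between all singleton clusters.
    d2 = {
        (i, j): sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j]))
        for i, j in combinations(range(len(vectors)), 2)
    }
    merges = []
    next_id = len(vectors)
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest Ward dissimilarity.
        (i, j), dij = min(d2.items(), key=lambda item: item[1])
        ni, nj = len(clusters[i]), len(clusters[j])
        updated = {}
        for k in clusters:
            if k in (i, j):
                continue
            nk = len(clusters[k])
            dki = d2[(min(k, i), max(k, i))]
            dkj = d2[(min(k, j), max(k, j))]
            # Lance-Williams formula for Ward's method on squared distances.
            updated[(k, next_id)] = (
                (ni + nk) * dki + (nj + nk) * dkj - nk * dij
            ) / (ni + nj + nk)
        merges.append((clusters[i], clusters[j], math.sqrt(dij)))
        # Drop distances to the two merged clusters, add those to the new one.
        d2 = {pair: v for pair, v in d2.items() if i not in pair and j not in pair}
        d2.update(updated)
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges


# Invented two-dimensional profiles, for illustration only.
profiles = {
    "sogand_xordan": (1.0, 0.9),
    "qasam_xordan": (1.0, 1.0),
    "sili_xordan": (0.2, 0.0),
    "zarbe_xordan": (0.0, 0.0),
}
merges = ward_linkage(profiles)
heights = [height for _, _, height in merges]
top_split = {frozenset(merges[-1][0]), frozenset(merges[-1][1])}
```

The last merge corresponds to the top of the dendrogram: in this toy data it joins the two ‘to swear’ LVCs with the two ‘be hit’ LVCs, and the heights grow from one merge to the next, which is why lower junctions indicate more similar clusters.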

Table 4 Main senses of the LV xordan based on cluster analysis.

Cluster | Senses | LVC
(1) EAT, Subj. Human | to eat | nâhâr xordan, chây xordan, sobhâne xordan, ghazâ xordan
(1) EAT, Subj. Human | to swear | sogand xordan, qasam xordan
(1) EAT, Subj. Human | to regret/grief | ghosse xordan, hasrat xordan, taassof xordan, xune del xordan, afsus xordan
(2) COLLIDE, (2.1) Subj. in Motion | to come to senses (metaphorical) | be dard xordan, be cheshm xordan
(2) COLLIDE, (2.1) Subj. in Motion | be linked to | gereh xordan, peyvand xordan
(2) COLLIDE, (2.1) Subj. in Motion | to collide (from subject to target) | be hadaf xordan, (be) zamin xordan, barham xordan
(2) COLLIDE, (2.1) Subj. in Motion | be caused to move/deform (suddenly) | tekân xordan, tâb xordan, tarak xordan, ghute xordan
(2) COLLIDE, (2.2) Subj. Stationary | be hit/damaged | zarbe xordan, latme xordan, sili xordan, xâk xordan
(2) COLLIDE, (2.2) Subj. Stationary | be marked or labeled (metaphorical) | raqam xordan, bargasht xordan, xat xordan
(2) COLLIDE, (2.2) Subj. Stationary | be deceived | gul xordan, farib xordan
(2) COLLIDE, (2.2) Subj. Stationary | [uncategorized] | gol xordan, kotak xordan, mahak xordan, shekast xordan

1. The subject in the LVCs of cluster 2 is not necessarily animate or human.
2. The actions realized by the LVCs in the second cluster are mostly non-voluntary and out of the control of the subject.

Sub-clusters 2.1 and 2.2 underline different aspects of the COLLIDE sense of the LV xordan. In sub-cluster 2.1 the property of ‘movement’, be it physical or metaphorical, is dominant.12 In fact, the subject in the LVCs of this sub-cluster is involved in a kind of real or fictive motion. In sub-cluster 2.2, on the other hand, the subject is stationary and is affected by an external force, physically or metaphorically.

5. Discussion

In the LVCs extracted from the corpus, both compositional and idiomatic LVCs were present. While the semantic unpredictability of idiomatic LVCs is easy to notice, detecting unpredictable aspects of meaning (i.e. constructional meaning) is not so straightforward in the case of compositional LVCs. At the same time, the corpus analysis of some of the compositional LVCs revealed that semantic unpredictability comes in varying degrees in different LVCs. The degree of unpredictability in the LVCs with the literal ‘to eat’ meaning under the EAT sense is at the lowest level (and might even be considered non-existent), while an LVC like zamin xordan under the COLLIDE sense has a higher degree of unpredictability. How can this gradable predictability, which fluctuates between the fully predictable and fully unpredictable extremes, be explained in Persian LVCs, given that previous studies, including Family (2006), have focused on the unpredictability criterion of constructionhood?

12 It may be noticed that some LVCs in Table 4 (e.g. tarak xordan ‘to crack’) do not seem to belong to a cluster labeled with a property such as ‘Subj. in Motion’. This can be considered normal given the possibility of noise in the small number of data points analyzed in this study. Thus, we assume that the cluster labels show the general semantic tendency of the LVCs and are not to be interpreted categorically.


The findings suggest the need to reconsider the definition and identification of LVCs in Persian. For example, in Family’s (2006) work on Persian LVCs, unpredictability is central to the definition of constructions, following Goldberg (1995). However, as pointed out in the Background section, frequency of co-occurrence is also taken into consideration in the more recent conception of constructions. Consider once again Goldberg’s (2006: 5) updated definition of constructions:

‘Any linguistic pattern is recognized as a construction as long as some aspect of its form or function is not strictly predictable from its component parts or from other constructions recognized to exist. In addition, patterns are stored as constructions even if they are fully predictable as long as they occur with sufficient frequency.’

This updated view of constructions can account for the different aspects of the idiomatic and compositional N-V sequences identified as LVCs in Table 2. The idiomatic LVCs are captured by the first part of the definition, which highlights unpredictable patterns, and the compositional LVCs are captured by the second part, which highlights frequent predictable patterns. However, it needs to be emphasized that no frequent compositional LVC was found in the corpus that was ‘fully predictable’ in meaning in the sense of Goldberg’s definition. The point here is that the core meaning of frequent compositional patterns might be fully predictable, but careful corpus examination, as our results suggest, can unveil some aspects of unpredictability. It is also worth mentioning that the issue of full predictability has been treated cautiously within the usage-based framework (Bybee, 1985, 2006, 2010). For example, Bybee (2006: 4) uses the term ‘largely predictable’ in discussing frequently co-occurring sequences and points out that ‘these sequences of words must have memory storage despite being largely predictable in form and meaning’. In particular, we find the semantic unpredictability of frequent patterns better foregrounded in corpus-linguistic terms: ‘Collocations [as frequent patterns] are not fully compositional [or predictable] in that there is usually an element of meaning added to the combination’ (Manning and Schütze, 1999: 151). In short, whether constructions can be fully predictable or not needs further empirical research, but, as the results of this study show, Persian LVCs constitute a continuum ranging from semantically unpredictable constructions to compositionally predictable ones.

Our findings support Dabir-Moghaddam’s (1997) view on the constructional status of compositional LVCs, including those with the COLLIDE sense such as zamin xordan and those with the literal ‘to eat’ meaning under the EAT sense such as nâhâr xordan ‘to have lunch’, sobhâne xordan ‘to have breakfast’, and ghazâ xordan ‘to eat food’, in which the NVs are considered to be non-referential.13 Conversely, Tabatabaie’s (2005) standpoint on using the compositionality criterion to distinguish LVCs from verb phrases is not supported, since it was shown that frequent compositional expressions have constructional meanings too. An important implication is that the constructional meanings of compositional LVCs may not be easily accessible to intuition, and uncovering them requires extensive corpus analysis.

The results obtained from the corpus-based classification of LV senses suggest that the EAT sense of xordan needs to be recognized as an LV and distinguished from the COLLIDE sense. The fact that xordan has two unrelated meanings is captured by the major split between the two main clusters in the dendrogram. This analysis can also better account for the meaning of the LVCs derived from the EAT sense and those derived from the COLLIDE sense. In fact, by looking at the clusters one can see which LVCs are attached to which sense.

The cluster analysis also confirms the overall constructional basis of sense interrelations in Persian LVCs proposed by Family (2006, 2008, 2014). In her account, the various LVCs of a single LV are clustered into different polysemous categories (or ‘clusters of productivity’), and the members of each category share semantic similarities. Comparably, this semantic similarity can be seen in the closely clustered LVCs resulting from the behavioral profile analysis, which means that their usage patterns were similar in the corpus. The present study departs from Family (2006) in the constructional senses identified and their interrelations. In Family (2006, 2008), high-frequency LVCs such as sobhâne xordan ‘to have breakfast’ and nâhâr xordan ‘to have lunch’ (related to the EAT sense of the LV xordan) are considered marginal cases of LVCs (see Family, 2008) due to their semantic transparency and thus have no place in the network representation of the senses (see Fig. 2). By contrast, the corpus evidence showed that these EAT-based LVCs with largely compositional meanings have developed some subtle unpredictable meanings, which, together with their high frequency, satisfy the constructionhood condition, so they need to be considered verbal constructions.

The use of corpus data in extracting LVCs also rules out the risk of including suspicious or non-existent data in the analysis. For example, based on our corpus and Google searches, it was found that the LVC azâb

13 It is relevant to mention a question raised by one of the reviewers: many other non-LV verbs, such as xândan ‘to read’, can also take non-referential objects (ketâb xândan, lit. book-reading, ‘to read a book’) similar to ghazâ xordan ‘to eat food’ or nâhâr xordan ‘to eat lunch’, hence this non-referentiality would not be specific to LVCs. I would argue that, viewed from a corpus-linguistic perspective, nâhâr xordan is a collocation, and collocational meanings are ‘gestaltic’: they mean more than the semantic sum of their parts. On this principle, xândan may not always function as a heavy verb. If searched for in a large corpus, ketâb xândan, like nâhâr xordan, would come up as a collocation. It would denote a repeated habitual activity, like nâhâr xordan, with a holistic meaning. This holistic meaning is more salient in expressions like dars xândan (lit. course-reading, ‘to study’). In dars xândan, the meaning of xândan is distanced from the heavy ‘reading’ meaning and might involve writing, reciting, and preparing, in any form, for an exam or a course. Similarly, in expressions such as namâz xândan ‘to say one’s prayer’, fâtehe xândan ‘to recite Fatiha’, and doâ xândan ‘to pray’, xândan has the constructional meaning ‘to recite a religious verse’, which is different from but polysemously related to ‘reading’. Moreover, according to the more recent frequency-based view of constructions (Goldberg, 2006, 2009) discussed earlier, ketâb xândan and any frequently co-occurring verbal expression would satisfy the constructionhood condition. In conclusion, there seems to be no sharp boundary between LVs and heavy verbs: some verbs may be at the beginning of the path to acquiring LV status and stand somewhere between the two extremes.


xordan

SUFFERING | azâb [torment] ghosse [grief] gij [dizziness] …

USURPING | reshve [bribe] nozul [interest] savâri [riding]

AFFECTED | farib [trick] sadame [damage] sili [slap] …

MOTION | tekân [motion] ghalt [somersault] pich [twist] …

Fig. 2. Major senses of the LV xordan in Family (2006, 2008).

xordan (torment-xordan) included in Family’s analyses (Fig. 2) is apparently non-existent, which would undermine the validity of the NVs used in that study.

The behavioral profile analysis method, with its usage-based approach, proves promising for classifying and interrelating various LV senses. The process of assigning LVCs to different sense clusters is based on the contextual behavior of the LVCs, captured by morphological, syntactic, and semantic criteria that are quantified as ID-tags. LV senses whose usage behaviors are similar in the corpus are identified as similar and clustered together. In the case of the LV xordan, behavioral profile analysis reveals some semantic associations between senses more clearly. For example, in analyzing the first cluster (EAT sense) of the LV xordan, it was shown (based on historical meanings and some lexical clues in the LVCs) that the ‘swear’ and ‘regret/grief’ senses are metaphorical extensions of the core EAT sense of the LV xordan.14 This semantic relationship is captured in the cluster tree by the proximity of these senses and their placement within a larger cluster. This association is absent in Family’s (2006, 2008) analysis of sense classes (Fig. 2).

The present study has taken the first steps towards applying quantitative corpus-based methods to controversial areas of Persian linguistics such as LVCs. The data analyzed and the results obtained in this study, however, are by no means comprehensive or conclusive. There were some limitations in the methodology that can be addressed in future investigations of the topic. First, the corpus used as the source of LVCs was a journalistic text collection. This means that our data did not include everyday speech or colloquial language, which can severely bias the LVCs found in the corpus towards those that are frequent in journalistic genres. One consequence of this limitation is that the results obtained from the extraction and analysis of the LV senses may not be generalizable to all registers of Persian. In future investigations of the subject, a broad-coverage corpus comprising various language styles would produce more realistic results. Second, only a small number of data points were studied for the 35 LVCs (700 concordance lines). For each LVC, 20 concordance lines (or corpus instances) were studied, which is a small number from a statistical perspective. Limiting the number of instances analyzed for each LVC puts the results at risk by increasing the noise in the data. The more instances are analyzed and ID-tagged, the more reliable the statistical analysis would be. In the present study, the emphasis was on covering as many LVCs as possible, since the classification of the various LV senses was of primary interest. Given that hand-tagging 700 corpus instances was already a laborious and time-consuming task, increasing the number of data points (e.g. to 1400), though statistically desirable, would have made the analysis practically difficult. In future studies, this limitation can be overcome by increasing the number of corpus instances analyzed per LVC.

6. Conclusion

In this article, corpus-based methods were used to identify Persian LVCs containing the LV xordan and to classify their constructional senses. The collocational measure used for the extraction of LVCs proved successful in spotting these multiword expressions. Further, it was argued that semantically compositional N-V collocations should be considered LVCs on the basis of their frequency of use and the varying degrees of unpredictability associated with their meaning. This implies that uncovering the constructional meanings of compositional LVCs requires corpus analysis, because such subtle meanings are not usually accessible to intuition. Further, it was shown that the LV xordan has two major unrelated senses, EAT and COLLIDE, and that the meanings of the derived LVCs are motivated by one of these two main senses. The sense clusters were calculated by statistically analyzing the output of a procedure called behavioral profile analysis, by means of which LVCs’ similarities were

14 Also see Link (2013) and Nguyen (2013) for similar metaphorical extensions related to ‘regret/grief’ sense in Chinese and Vietnamese respectively. The conceptual metaphor EMOTIONS ARE LIQUID seems to be responsible for the formation of these extensions.


captured by morphological, syntactic, and semantic features extracted from their corpus usage patterns. The application of behavioral profile analysis to retrieve the LV senses from the corpus proved to be a promising first step towards addressing problematic aspects of Persian light verb constructions using corpus-linguistic methods. By comparing the findings of the present study with those of Family (2006), it was shown that the corpus-based methods of extracting collocations and analyzing behavioral profiles can be advantageous with regard to the reliability of sense extraction criteria and the capability of pointing to possible semantic motivations underlying the interrelations among various senses. Finally, the need to reconsider constructionhood conditions in Persian LVCs was discussed, and it was suggested that the noncompositionality criterion used for identifying LVCs should be supplemented by the more recent frequency-based approach to constructions, under which frequent compositional N-V sequences would be regarded as LVCs.

Acknowledgment

I would like to thank two anonymous reviewers for their insightful and constructive comments on an earlier version of this paper.

References

Ahmed, T., 2010. The interaction of light verbs and verb classes of Urdu. In: Interdisciplinary Workshop on Verbs: the Identification and Representation of Verb Features.
Ahmed, T., Butt, M., 2011. Discovering semantic classes for Urdu N-V complex predicates. In: Proceedings of the International Conference on Computational Semantics (IWCS 2011), pp. 305–309.
AleAhmad, A., Amiri, H., Darrudi, E., Rahgozar, M., Oroumchian, F., 2009. Hamshahri: a standard Persian text collection. J. Knowl. Based Syst. 22 (5), 382–387.
Anthony, L., 2011. AntConc (Version 3.2.2.1) [Computer Software]. Waseda University, Tokyo, Japan. Retrieved from http://www.antlab.sci.waseda.ac.jp/antconc_index.html.
Atkins, B.T.S., 1987. Semantic ID tags: corpus evidence for dictionary senses. In: Proceedings of the Third Annual Conference of the UW Center for the New Oxford English Dictionary, pp. 17–36.
Barjasteh, D., 1983. Morphology, Syntax, and Semantics of Persian Compound Verbs: a Lexicalist Approach. Ph.D. dissertation. University of Illinois at Urbana-Champaign.
Baayen, R.H., 2008. Analyzing Linguistic Data: a Practical Introduction to Statistics Using R. Cambridge University Press, Cambridge.
Brugman, C., 2001. Light verbs and polysemy. Lang. Sci. 23, 551–578.
Bybee, J., 1985. Morphology: a Study of the Relation between Meaning and Form. John Benjamins Publishing Company, Amsterdam.
Bybee, J., 2006. From usage to grammar: the mind’s response to repetition. Language 82 (4), 711–733.
Bybee, J., 2010. Language, Usage and Cognition. Cambridge University Press, Cambridge.
Church, K., Hanks, P., 1990. Word association norms, mutual information and lexicography. Comput. Linguist. 16 (1), 22–29.
Dabir-Moghaddam, M., 1997. Compound verbs in Persian. Stud. Linguist. Sci. 27, 25–59.
Family, N., 2006. Explorations of Semantic Space: the Case of Light Verb Constructions in Persian. PhD dissertation. EHESS, Paris.
Family, N., 2008. Mapping semantic spaces: a constructionist account of the “light verb” xordan ‘eat’ in Persian. In: Vanhove, M. (Ed.), From Polysemy to Semantic Change: Towards a Typology of Lexical Semantic Associations. John Benjamins, pp. 139–161.
Family, N., 2011. Verbal islands in Persian. Folia Linguist. 45 (1), 1–30.
Family, N., 2014. Semantic Spaces of Persian Light Verbs: a Constructionist Account. Brill, Leiden.
Folli, R., Harley, H., Karimi, S., 2005. Determinants of event type in Persian complex predicates. Lingua 115, 1365–1401.
Gerdes, K., Samvelian, P., 2008. A statistical approach to Persian light verb constructions. In: Proceedings of the 27th Conference on Lexis and Grammar. L’Aquila, Italy.
Goldberg, A., 1995. Constructions: a Construction Grammar Approach to Argument Structure. University of Chicago Press, Chicago.
Goldberg, A., 2003a. Constructions: a new theoretical approach to language. Trends Cogn. Sci. 7 (5), 219–224.
Goldberg, A., 2003b. Words by default: the Persian complex predicate construction. In: Francis, E., Michaelis, L. (Eds.), Mismatch: Form-function Incongruity and the Architecture of Grammar. CSLI Publications, pp. 83–112.
Goldberg, A., 2006. Constructions at Work: the Nature of Generalizations in Language. Oxford University Press, Oxford.
Goldberg, A., 2009. The nature of generalization in language. Cogn. Linguist. 20 (1), 93–127.
Gries, S.T., 2006. Corpus-based methods and cognitive semantics: the many meanings of to run. In: Gries, S.T., Stefanowitsch, A. (Eds.), Corpora in Cognitive Linguistics: Corpus-based Approaches to Syntax and Lexis. Mouton de Gruyter, Berlin/New York, pp. 57–99.
Gries, S.T., 2010. Behavioral profiles: a fine-grained and quantitative approach in corpus linguistics. Ment. Lex. 5, 323–346.
Gries, S.T., 2013. Statistics for Linguistics with R: a Practical Introduction, second ed. De Gruyter Mouton, Berlin.
Gries, S.T., Divjak, D.S., 2009. Behavioral profiles: a corpus-based approach towards cognitive semantic analysis. In: Evans, V., Pourcel, S.S. (Eds.), New Directions in Cognitive Linguistics. John Benjamins, Amsterdam/Philadelphia, pp. 57–75.
Gries, S.T., Otani, N., 2010. Behavioral profiles: a corpus-based perspective on synonymy and antonymy. ICAME J. 34, 121–150.
Haji-Abdolhosseini, M., 2000. Event types in the generative lexicon: implications for Persian compound verbs. In: Toronto Working Papers in Linguistics (Proceedings of Niagara Linguistic Society), vol. 19, pp. 25–38.
Harris, Z., 1954. Distributional structure. Word 10 (23), 146–162.
Huang, C.R., Lin, J., Jiang, M., Xu, H., 2014. Corpus-based study and identification of Mandarin Chinese light verb variations. In: Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, pp. 1–10.
Jensen, K.E., 2013. Clusterizor: an R-based Cluster Analysis Program for Linguistic Analysis (Computer programme).
Jesperson, O., 1965. A Modern English Grammar on Historical Principles Part IV: Morphology. George Allen and Unwin Ltd., London.
Karimi, S., 1997. Persian complex verbs: idiomatic or compositional. Lexicology 3 (2), 273–318.
Karimi-Doostan, Gh, 1997. Light Verb Constructions in Persian. PhD dissertation. University of Essex.
Karimi-Doostan, Gh, 2005. Light verbs and structural case. Lingua 115, 1737–1756.
Karimi-Doostan, Gh, 2011. Separability of light verb constructions in Persian. Stud. Linguist. 65, 70–95.
Kay, P., 1995. Construction grammar. In: Verschueren, J., Östman, J.O., Blommaert, J. (Eds.), Handbook of Pragmatics. John Benjamins, Amsterdam/Philadelphia, pp. 171–177.
Kay, P., Fillmore, Ch, 1999. Grammatical constructions and linguistic generalizations: the what’s X doing Y? construction. Language 75 (1), 1–33.
Landauer, T.K., Dumais, S.T., 1997. A solution to Plato’s problem: the latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychol. Rev. 104 (2), 211–240.
Lin, J., Xu, H., Jiang, M., Huang, C.R., 2014. Annotation and classification of light verbs and light verb variations in Mandarin Chinese. In: Proceedings of the Workshop on Lexical and Grammatical Resources for Language Processing, pp. 75–82.

Link, P., 2013. An Anatomy of Chinese: Rhythm, Metaphor, Politics. Harvard University Press.
Liu, D., Espino, M., 2012. Actually, genuinely, really, and truly: a corpus-based behavioral profile study of near-synonymous adverbs. Int. J. Corpus Linguist. 17 (2), 198–228.
Manning, C., Schütze, H., 1999. Foundations of Statistical Natural Language Processing. MIT Press.
McDonald, S., Ramscar, M., 2001. Testing the distributional hypothesis: the influence of context on judgments of semantic similarity. In: Proceedings of the 23rd Annual Conference of the Cognitive Science Society, pp. 611–616.
McEnery, T., Hardie, A., 2012. Corpus Linguistics: Method, Theory and Practice. Cambridge University Press, Cambridge.
Megerdoomian, K., 2001. Event structure and complex predicates in Persian. Can. J. Linguist. 46 (1/2), 97–125.
Mohammad, J., Karimi, S., 1992. Light verbs are taking over: complex verbs in Persian. In: Proceedings of the Western Conference on Linguistics (WECOL), vol. 5, pp. 195–212.
Nguyen, N.L., 2013. The EMOTION IS LIQUID metaphor in English and Vietnamese: a contrastive analysis. In: Procedia – Social and Behavioral Sciences (5th International Conference on Corpus Linguistics (CILC2013)), vol. 95, pp. 363–371.
Rasooli, M.S., Faili, H., Minaei-Bidgoli, B., 2011. Unsupervised identification of Persian compound verbs. In: Batyrshin, I., Sidorov, G. (Eds.), MICAI 2011, Part I, LNAI 7094, pp. 394–406.
Sadeghi, A.A., 1993. Dar bâre-ye fe’l-hâ-ye ja’li dar zabân-e farsi (On denominative verbs in Persian). In: Proceedings of the Zabân-e Farsi va Zabân-e Elm (Persian Language and the Language of Science) Seminar. Iran University Press, Tehran, pp. 236–246 (In Persian).
Seifollahi, M., Tabibzadeh, O., 2013. Che zanjire-hâ-yi fe’l-e morakkab nistand? (What strings are not compound verbs?) Farhangnevisi 5 & 6, 93–104 (In Persian).
Stevenson, S., Fazly, A., North, R., 2004. Statistical measure of the semi-productivity of light verb constructions. In: Proceedings of the ACL Workshop on Multiword Expressions: Integrating Processing, pp. 1–8.
Stubbs, M., 1995. Collocations and semantic profiles: on the cause of the trouble with quantitative studies. Funct. Lang. 2.
Tabaian, H., 1979. Persian compound verbs. Lingua 47, 189–208.
Tabatabaie, A., 2005. Fe’l-e morakkab dar zabân-e farsi (Compound verb in Persian). Name-ye Farhangestan 26, 26–34 (In Persian).
Vahedi-Langrudi, M., 1996. The Syntax, Semantics and Argument Structure of Complex Predicates in Modern Farsi. PhD dissertation. University of Ottawa.