Morphological evaluation and sentiment analysis of Punjabi text using deep learning classification


Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx


Jaspreet Singh a,*, Gurvinder Singh b, Rajinder Singh a, Prithvipal Singh a

a Department of Computer Science, Guru Nanak Dev University, Amritsar 143005 Punjab, India
b Department of Computer Science, Faculty of Engineering and Technology, Guru Nanak Dev University, Amritsar 143005 Punjab, India

Article info

Article history: Received 19 January 2018; Revised 12 March 2018; Accepted 4 April 2018; Available online xxxx

Abstract

Morphological processing of Indian languages has been one of the fastest-growing areas of Natural Language Processing (NLP) over the last decade. The evaluation of Asian languages is highly relevant to text mining and information retrieval, and the morphological evaluation of a text can be employed for extraction and classification of knowledge. This paper combines morphological evaluation and sentiment prediction of Punjabi language text. The textual data concern farmer suicide cases reported in the Punjab state of India. The pre-processing phase of this study involves morphological evaluation and normalization of Punjabi words to their respective canonical forms. The next phase trains and tests a deep neural network model on the refined Punjabi tokens obtained from the earlier phase. The proposed model classifies Punjabi tokens into four negatively oriented classes tailored for farmer suicide cases. The average accuracies of sentiment prediction obtained after 10-fold cross validation are 93.85%, 88.53%, 83.3%, and 95.45% for the four respective classes. The proposed framework yields satisfactory results on 275 Punjabi text documents, with an overall accuracy of 90.29% for sentiment classification.

© 2018 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

The digital content of Indian languages is growing rapidly owing to advances in language modelling and wider access to the Internet. Considerable research in computational linguistics is therefore required to solve real-world problems in native languages. This work deals with the evaluation and analysis of Punjabi language content concerned with farmer suicides reported in local newspapers. Suicides among Punjabi farmers are rising alarmingly in the Punjab state of India (Jagbani E-Newspaper; Chardikala News; Ajit Weekly; Daily Pehredar; Doaba Headlines; Jan Jagrati; Nawan Zamana; Punjabi Jagran; Punjabi Tribune; Punjab Times). The motivation for this work comes from a study at Punjabi University, Patiala, which surveyed seven districts of Punjab and collected farmer suicide reports from January 2010 to December 2016. The survey gathered 1309 cases of farmer and labourer suicides to evaluate the causes behind this distress among Punjabi farmers (The Hindu, 2017). We obtained the textual data for farmer suicide cases reported between 1 January 2017 and 30 November 2017 from the online Punjabi news websites listed as sources in Table 12.

Section 2 lists four morphological features of Punjabi text. The four morphological typologies used to examine farmer suicide reports are presented in Section 3. Section 4 introduces four statistical features of text that are used to quantify Punjabi sentences and words. A summary of the literature on the evaluation of Punjabi text is given in Table 11 of Section 5. Section 6 highlights the challenges and issues in morphological evaluation and sentiment prediction of Punjabi text. The data collection approach and the pre-processing phase of the proposed methodology are presented in Section 7. The framework of the proposed model is described in Section 8, followed by results and analysis in Section 9, and Section 10 presents concluding remarks.

* Corresponding author. E-mail addresses: [email protected] (J. Singh), gurvinder.dcse@gndu.ac.in (G. Singh). Peer review under responsibility of King Saud University.

2. Morphological features of Punjabi text


Morphology is the branch of linguistics concerned with the formation of words and phrases and with the interaction of words among themselves. Identifying and processing the morphological features of a language is required for some real-life tasks. This paper employs four morphological features of Punjabi text to carry out sentiment analysis of farmer suicides in the Punjab state of India.

https://doi.org/10.1016/j.jksuci.2018.04.003 1319-1578/© 2018 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Please cite this article in press as: Singh, J., et al. Morphological evaluation and sentiment analysis of Punjabi text using deep learning classification. Journal of King Saud University – Computer and Information Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.04.003

Table 3 Lexemes in Punjabi text.
S. No. | Punjabi Lexemes | Equivalent English Expressions
1 | ਆਤਮਘਾਤ; ਆਤਮ-ਹੱਤਿਆ; ਜੀਵਨ ਲੀਲਾ ਸਮਾਪਤ ਕਰ ਲੈਣਾ; ਫਾਹਾ ਲੈ ਲੈਣਾ; ਜਹਿਰੀਲਾ ਪਦਾਰਥ ਪੀ ਕੇ; ਸਲਫਾਸ ਖਾ ਲਈ | Suicide or self-killing
2 | ਮਾਨਸਿਕ ਤਣਾਅ; ਸੂਦਖੋਰਾਂ ਵੱਲੋਂ ਦਬਾਅ; ਆੜਤੀ ਦਾ ਪਰੇਸ਼ਾਨ ਕਰਨਾ; ਬੈਂਕ ਦੀ ਨੀਲਾਮੀ ਦਾ ਡਰ; ਜਮੀਨ ਦੀ ਕੁਰਕੀ | Negative or stressed state of mind due to liabilities

2.1. Structure of Punjabi words and tokens

A word in Punjabi, as in English, is delimited by white space and punctuation symbols (Liberman, 2009). Words are the smallest units of a sentence that carry semantic information. Examples of simple Punjabi sentences and their words are given in Table 1. The first sentence there starts with the word ‘ਕੀ’ (What), which makes it interrogative, as in English. Bigram phrases such as ‘ਆਤਮ-ਹਤਿਆ’ (suicide), ‘ਮੁੱਖ ਕਾਰਨ’ (main reason) and ‘ਮਾਨਸਿਕ ਤਣਾਅ’ (mental stress), taken together, introduce complex lexicographic structure in Punjabi. Linguistic terms such as ‘ਵਧਿਆ ਹੋਇਆ’ (acute), the end-of-sentence character ‘।’ (ਡੰਡੀ), which corresponds to the period ‘.’ in English, and ‘ਮੁੱਖ ਕਾਰਨ’ followed by ‘?’ contribute to semantic text entailment in the sentiment analysis of farmer suicide cases.
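The delimiting rules described in Section 2.1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of '?' and '!' as additional sentence terminators, and the set of edge punctuation are assumptions made for illustration only.

```python
import re

# Sentence terminators: the Gurmukhi danda plus '?' and '!' (assumed for this sketch).
SENTENCE_END = re.compile(r"[।?!]")
# Punctuation stripped from token edges; hyphens inside compounds are kept.
EDGE_PUNCT = ",;:\"'()‘’“”"


def split_sentences(text):
    """Split raw text into sentences at the danda, '?' or '!'."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def tokenize(sentence):
    """Split a sentence on white space and strip punctuation from token edges."""
    tokens = (t.strip(EDGE_PUNCT) for t in sentence.split())
    return [t for t in tokens if t]


sentences = split_sentences("ਕਰਜ, ਅਤੇ ਤਣਾਅ। ਮੁੱਖ ਕਾਰਨ?")
words = [tokenize(s) for s in sentences]
```

Keeping the hyphen inside a token preserves compounds such as ‘ਆਤਮ-ਹਤਿਆ’ as a single unit, which matches how they are listed in Table 1.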

Table 4 Allomorphs in Punjabi text.
S. No. | Concatenated form | Contracted form | Equivalent English term
1 | ਨਸ਼ੇ ਦੀ ਲੱਤ ਲੱਗਣਾ | ਨਸ਼ੇੜੀ | Drug addicted
2 | ਬੈਂਕ ਦੁਆਰਾ ਜਮੀਨ ਦੀ ਨਿਲਾਮੀ | ਕੁਰਕੀ | Land reimbursement
3 | ਬੈਂਕ ਦਾ ਕਰਜਾ ਸਿਰ ਚੱੜ ਜਾਣਾ | ਕਰਜਈ | Debtor
4 | ਫਸਲ ਮਾਰੀ ਜਾਣੀ | ਘਾਟਾ ਪੈਣਾ | Crop failure
5 | ਕਿਸੇ ਪਾਸਿਉ ਵੀ ਪੈਸੈ ਦੀ ਆਸ ਨਾ ਹੋਣੀ | ਨਿਰਾਸ਼ਾ | Hopeless

2.2. Punjabi morphemes

As noted in Section 2.1, Punjabi words are delimited by white space and punctuation symbols. The canonical forms of words obtained after removal of suffixes and prefixes are termed morphemes. Morphemes are the root words that carry semantic information. Table 2 shows examples of Punjabi morphemes taken from words used in the farmer suicide dataset.

2.3. Punjabi Lexemes

Sometimes phrases and words carry the same semantics but take different alternative forms. Each form of a word or phrase introduces a concept in the sentence. Words and phrases with such alternate forms are known as lexemes. The Punjabi lexemes used in farmer suicide reports are listed in Table 3.

Table 1 Words in Punjabi text.
S. No. | Punjabi Sentence taken from dataset | English Equivalent | Punjabi Words
1 | ਕੀ ਹੈ ਪੰਜਾਬ ਦੇ ਕਿਸਾਨਾ ਦੀ ਆਤਮ-ਹਤਿਆ ਦਾ ਮੂੱਖ ਕਾਰਨ? | What is the main reason behind the farmer suicide in Punjab? | ਪੰਜਾਬ, ਕਿਸਾਨਾ, ਆਤਮ-ਹਤਿਆ, ਮੁੱਖ, ਕਾਰਨ
2 | ਕਿਸਾਨ ਦੀ ਆਤਮ-ਹਤਿਆ ਕਰਨ ਪਿੱਛੇ ਮੁੱਖ ਕਾਰਨ ਹੈ ਕਰਜ, ਅਤੇ ਕਰਜੇ ਕਰਕੇ ਵਧਿਆ ਹੋਇਆ ਮਾਨਸਿਕ ਤਣਾਅ। | The main reason behind the farmer suicides is debt and acute mental stress caused by debt. | ਕਿਸਾਨ, ਆਤਮ-ਹਤਿਆ, ਮੁੱਖ, ਕਾਰਨ, ਕਰਜ, ਕਰਜੇ, ਵਧਿਆ, ਮਾਨਸਿਕ, ਤਣਾਅ

Table 2 Morphemes in Punjabi text.
S. No. | Punjabi Word | Morpheme | English Equivalent
1 | ਕਰਜਾ, ਕਰਜਾਈ, ਕਰਜਈ, ਕਰਜਦਾਰ, ਕਰਜੇ | ਕਰਜ | Debt
2 | ਜਮੀਨੀ, ਬੇਜਮੀਨੀ, ਬੇਜਮੀਨੀਆ, ਜਮੀਨਦਾਰ, ਜਮੀਨਦਾਰੀ | ਜਮੀਨ | Land
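The mapping from inflected words to morphemes shown in Table 2 can be mimicked with a very small lookup-based sketch. The paper does not specify its normalization rules, so the code below is only a stand-in: it assumes a hypothetical list of known roots (here just the two from Table 2) and returns the longest root contained in a word, which covers both suffixed forms and the prefixed form ‘ਬੇਜਮੀਨੀ’.

```python
# Root morphemes copied from Table 2; a real system would need a full lexicon
# and proper affix rules. This list is an assumption for illustration.
ROOTS = ["ਕਰਜ", "ਜਮੀਨ"]


def to_morpheme(word, roots=ROOTS):
    """Return the longest known root contained in `word`, else the word itself."""
    matches = [r for r in roots if r in word]
    return max(matches, key=len) if matches else word


normalized = [to_morpheme(w) for w in ["ਕਰਜਦਾਰ", "ਬੇਜਮੀਨੀ", "ਪਿੰਡ"]]
```

Substring matching is deliberately naive: it would over-match short roots in unrelated words, so it should be read as an illustration of the input and output of normalization rather than a usable stemmer.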

2.4. Punjabi Allomorphs

Allomorphs are the alternative forms of morphemes. Like English, many Asian languages contain contracted forms of concatenated word groups. These contracted forms are sometimes termed one-word substitutions for the group of ordered terms (Beesley and Karttunen, 2003). Table 4 presents five examples of Punjabi allomorphs, along with their equivalent English terms, obtained from the dataset.

3. Morphological typology in Punjabi language text used in farmer suicide reports

The typology of Punjabi deals with the study and classification of qualitative features of words and their morphemes. Punjabi typologies give qualitative relations between words and their morphs, similar to typology in English. Although Punjabi has many typologies, this paper considers only four, taken from the farmer suicide dataset.

3.1. Isolation/Analytical association

When a single word in a sentence carries multiple morphs in different contexts, it is said to be isolated, or to have analytical bindings with other concepts. Sentiment analysis depends largely on the isolation or binding of context terms with multiple morphs in different concepts. Table 5 gives an example of

Table 5 Typology in Punjabi text.
Punjabi Language Sentence | English Equivalent | Isolated Typology
ਤਲਵੰਡੀ ਸਾਬੋ ਵਿੱਚ ਕਰਜੇ ਦੀ ਪੰਡ ਹੇਠ ਦੱਬੇ ਇੱਕ ਹੋਰ ਕਿਸਾਨ ਨੇ ਖੁਦਕੁਸ਼ੀ ਕਰ ਲਈ। | One more farmer from village Talwandi Saabo committed suicide due to a huge load of debt. | (ਪੰਡ) – Accumulated debt; (ਪੰਡ) – A sack of fodder


a Punjabi sentence taken from the dataset that has isolated typology (the example word ‘ਪੰਡ’ has isolated typology).

3.2. Synthetic morphemes

Like English, Punjabi contains synthetic morphemes other than one-word substitutions. Synthetic phrases are not standard phrases commonly used in a language; their usage can vary from topic to topic and across contexts. Table 6 lists a few examples of linguistic terms along with their synthetic morphs and equivalent English terms.

3.3. Agglutinative morphemes

Agglutinative morphemes are linguistic phrases that have a single usage function at a certain point in a report. Punjabi has this feature when reporting an event. Close investigation of the farmer suicide reports revealed that agglutinative morphemes occur frequently in them. Table 7 gives an example of a Punjabi sentence extracted from the dataset, together with its English equivalent and its agglutinative morphemes.

3.4. Concatenative morphemes

Concatenative morphemes follow one another in successive order to produce meaningful phrases. As in English, Punjabi concatenates simple terms to form semantically correct concatenative morphemes, as shown in Table 8 (Munro and Manning, 2010).

Table 6 An example of synthetic morphemes in Punjabi text.
Linguistic Expression in Punjabi | Synthetic morpheme | English Equivalent
ਪੈਸਿਆਂ ਦੀ ਘਾਟ | ਗਰੀਬੀ | Financial crisis
ਖੁਦਕੁਸ਼ੀਆਂ ਦਾ ਸਿਲਸਿਲਾ | ਲੜੀਵਾਰ ਖੁਦਕੁਸ਼ੀਆਂ | Successive suicides
ਕੀਟਨਾਸ਼ਕ ਪੀ ਕੇ ਜਿੰਦਗੀ ਸਮਾਪਤ ਕਰ ਲਈ | ਆਤਮਦਾਹ | Self-Killing

Table 7 An example of agglutinative morphemes in Punjabi text.
Punjabi text from www.jagbani.com | Equivalent English | Agglutinative morphemes
ਗੁਰਦਾਸਪੁਰ ਦੇ ਪਿੰਡ ਅਲਾਵਲਪੁਰ ਦੇ ਕਿਸਾਨ ਰੇਸ਼ਮ ਸਿੰਘ (੩੦) ਵੱਲੋਂ ਕਰਜੇ ਤੋਂ ਪਰੇਸ਼ਾਨ ਹੋ ਕੇ ਖੁਦਕੁਸ਼ੀ ਕਰਨ ਦਾ ਸਮਾਚਾਰ ਪ੍ਰਪਿਤ ਹੋਇਆ ਹੈ। ਸੂਤਰਾਂ ਅਨੁਸਾਰ ਮਿਲੀ ਜਾਣਕਾਰੀ ਮੁਤਾਬਕ ਕਿਸਾਨ ਦੇ ਸਿਰ ਤਿਨ ਲੱਖ ਦਾ ਕਰਜਾ ਸੀ, ਜਿਸ ਦੇ ਚਲਦਿਆਂ ਉਹ ਮਾਨਸਿਕ ਤੌਰ ਤੋਂ ਪਰੇਸ਼ਾਨ ਰਹਿੰਦਾ ਸੀ। | A farmer named Resham Singh (30) from Alawalpur village of Gurdaspur district has committed suicide after being distressed by debt. According to sources, the farmer had a debt of 3 lakh, due to which he was mentally perturbed. | (ਸੂਤਰਾਂ ਅਨੁਸਾਰ) According to sources; (ਮਾਨਸਿਕ ਤੌਰ ਤੋਂ) Mentally; (ਜਿਸ ਦੇ ਚਲਦਿਆਂ) Due to which

Table 8 An example of concatenative morphemes in Punjabi text.
Punjabi text from News report | Equivalent English | Concatenative morphemes
ਆਰਥਿਕ ਤੰਗੀ ਅਤੇ ਚਿੱਟੀ ਮੱਖੀ ਕਾਰਨ ਤਬਾਹ ਹੋਈ ਨਰਮੇ ਦੀ ਫਸਲ ਕਾਰਨ ਗਰੀਬ ਕਿਸਾਨ ਵੱਲੋਂ ਜਿਹਰੀਲਾ ਪਦਾਰਥ ਪੀਣ ਦਾ ਸਮਾਚਾਰ ਪ੍ਰਾਪਤ ਹੋਇਆ ਹੈ। | News has been received of a poor farmer drinking poison because of financial hardship and the failure of the cotton crop following an attack of white bees. | (ਜਿਹਰੀਲਾ ਪਦਾਰਥ) Poison; (ਚਿੱਟੀ ਮੱਖੀ) White Bees; (ਆਰਥਿਕ ਤੰਗੀ) Financial crisis

4. Statistical features of Punjabi text

Words, phrases and sentences in any natural language are used to express thoughts, viewpoints and emotions in an ordered manner. The most intuitive units of any language are words, through which phrases and linguistic expressions are defined. Quantification of the morphological features of words is required in the feature extraction phase of morphological processing and sentiment analysis (Nidhi and Gupta, 2012a; Jain and Saini, 2015). This study considers four statistical features of Punjabi text, viz. Sentence Length (SL), Punjabi Term Frequency-Inverse Sentence Frequency (TF-ISF), Punjabi Nouns, and Common Punjabi-English Nouns (CPEN).

4.1. Sentence length (SL)

SL is defined as the ratio of the number of words in an arbitrarily chosen sentence to the number of words in the longest sentence, as shown in Eq. (1). Text from farmer suicide reports is observed to contain longer sentences than ordinary Punjabi writing in novels and textbooks. The reason is that negatively oriented farmer suicide reports state the causes of mental stress and the surrounding circumstances within the same sentence, which makes the sentence longer.

\[ \mathrm{SL} = \frac{\#\ \text{of words in a sentence}}{\#\ \text{of words in the longest sentence}} \tag{1} \]

where $0 < \mathrm{SL} \le 1$; the upper bound is reached only by the longest sentence itself.

4.2. Term frequency-inverse sentence frequency (TF-ISF)

The importance of a word in a document depends upon its frequency inside its sentence. The term frequency TF(w, s) is defined in Eq. (2).

\[ \mathrm{TF}(w, s) = \text{number of times word } w \text{ appears in sentence } s \tag{2} \]

The importance of a word w over the whole corpus is determined through the Inverse Sentence Frequency ISF(w) given in Eq. (3).

\[ \mathrm{ISF}(w) = \log \frac{|S|}{\mathrm{SF}(w)} \tag{3} \]

Here |S| is the total number of sentences in the corpus and SF(w) is the number of sentences in which the word w appears. The TF-ISF feature, given in Eq. (4), highlights the significance of a word inside a sentence of the corpus.

\[ \mathrm{TFISF}(w, s) = \mathrm{TF}(w, s) \times \mathrm{ISF}(w) \tag{4} \]
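Equations (1) to (4) translate directly into code. The sketch below is a minimal illustration, assuming sentences are already tokenized into lists of words and that the logarithm in Eq. (3) is the natural logarithm (the source does not state the base); the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def sentence_length_feature(sentences):
    """Eq. (1): words in each sentence divided by words in the longest sentence."""
    longest = max(len(s) for s in sentences)
    return [len(s) / longest for s in sentences]


def tf_isf(sentences):
    """Eqs. (2)-(4): one {word: TF-ISF score} dictionary per sentence."""
    n_sentences = len(sentences)
    # SF(w): number of sentences in which w appears at least once.
    sf = Counter(w for s in sentences for w in set(s))
    scores = []
    for s in sentences:
        tf = Counter(s)  # TF(w, s): occurrences of w in this sentence
        scores.append({w: tf[w] * math.log(n_sentences / sf[w]) for w in tf})
    return scores


# Toy corpus of two tokenized sentences (illustrative only).
corpus = [
    ["ਕਿਸਾਨ", "ਕਰਜ", "ਕਰਜ", "ਤਣਾਅ"],
    ["ਕਿਸਾਨ", "ਖੁਦਕੁਸ਼ੀ"],
]
sl = sentence_length_feature(corpus)
scores = tf_isf(corpus)
```

In this toy corpus the word ‘ਕਿਸਾਨ’ occurs in every sentence, so its ISF and therefore its TF-ISF score is zero, while ‘ਕਰਜ’, which occurs twice in one sentence only, receives the highest score.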

4.3. Punjabi Nouns

Nouns are significant words in a language; they carry the subject information of the context around which a sentence is oriented. In total, 15,445 Punjabi nouns were identified during morphological processing of the farmer suicide dataset in the pre-processing phase of the experiments. Examples of Punjabi nouns with their English equivalents are given in Table 9.

4.4. Common Punjabi-English Nouns (CPEN)

The modernization of Punjabi over the last few decades has brought English nouns into Punjabi usage. During pre-processing of the farmer suicide dataset, it was observed that a few nouns used by Punjabi speakers are phonetically the same as


Table 9 Punjabi nouns obtained from the farmer suicide dataset.
Punjabi Noun | Equivalent English Translation
ਕਰਜ | Debt
ਆਤਮਘਾਤ | Self-Killing
ਸੂਦਖੋਰ | Revenue Agent
ਫਾਹਾ | Hanging Rope
ਵਿਆਜ | Interest
ਜਹਿਰੀਲਾ ਪਦਾਰਥ | Poison
ਪਿੰਡ | Village
ਪੰਡ | Burden/Stress
ਕੀਟਨਾਸ਼ਕ | Insecticide
ਖੁਦਕੁਸ਼ੀ | Suicide

Table 10 CPENs obtained from the farmer suicide dataset.
Punjabi Noun (CPEN) | English Equivalent
ਰਿਪੋਰਟ | Report
ਬੈਂਕ | Bank
ਜਲੰਧਰ | Jalandhar
ਯੂਨੀਅਨ | Union
ਮਸ਼ੀਨ | Machine
ਰੀਪਰ | Reaper
ਟਰੱਕ | Truck
ਲੱਖ | Lakh
ਬਰੇਕ | Break
ਸਲਫਰ | Sulphur

nouns in the English language. Some examples of CPEN terms taken from the dataset are presented in Table 10.
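The paper does not state how the Punjabi Nouns and CPEN features are quantified. One plausible reading, sketched below purely as an assumption, is a lexicon lookup that reports the share of tokens in a sentence found in each list. The two sets contain only a few entries copied from Tables 9 and 10, and the function name is hypothetical.

```python
# Small illustrative lexicons; entries are taken from Tables 9 and 10.
PUNJABI_NOUNS = {"ਕਰਜ", "ਫਾਹਾ", "ਵਿਆਜ", "ਪਿੰਡ", "ਪੰਡ"}
CPEN = {"ਰਿਪੋਰਟ", "ਰੀਪਰ", "ਟਰੱਕ", "ਲੱਖ"}


def noun_features(tokens):
    """Return (share of Punjabi nouns, share of CPEN terms) for one sentence.

    Normalizing by sentence length is an assumption of this sketch, not
    something the source specifies.
    """
    if not tokens:
        return (0.0, 0.0)
    nouns = sum(t in PUNJABI_NOUNS for t in tokens)
    cpen = sum(t in CPEN for t in tokens)
    return (nouns / len(tokens), cpen / len(tokens))


features = noun_features(["ਰੀਪਰ", "ਕਰਜ", "ਅਤੇ", "ਵਿਆਜ"])
```

Raw counts would work equally well as features; the ratio form is used here only so that the values are comparable with SL, which also lies between 0 and 1.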

5. Related work

Morphological processing and sentiment analysis are emerging research areas in computational linguistics for low-resource Asian languages. Punjabi is one such low-resource language, and very little work has been done on its morphological evaluation during the last decade. A summary of the research findings on morphological evaluation to date is given in Table 11.

6. Challenges & issues involved in morphological processing and sentiment analysis in Punjabi

The main aim of sentiment analysis and morphological processing of Punjabi text is to achieve high classification accuracy. Several issues and challenges stand in the way (Hamdi et al., 2016; Ahmed et al., 2013); they fall into two broad categories, viz. Punjabi-specific challenges and general linguistic issues. This study faced the Punjabi-specific challenges while implementing morphological processing in the pre-processing phase, whereas the general linguistic issues were more prominent in the sentiment analysis phase, while implementing the proposed framework.

The complexity of Punjabi stems from its morphological structure, seen in the complexity of word roots and the diverse formation of sentences (Kaur, 2017). Punjabi is spoken in five adjoining states of Punjab, which has given rise to many dialects. The three most prominent are Maajhi (spoken in Amritsar, Gurdaspur and Tarn-Taran districts), Malwai (Bathinda, Patiala, Firozpur and Moga), and Doabi (Jalandhar, Hoshiarpur and Nakodar) (Kaur et al., 2017). Punjabi poetry was also encountered during the dataset collection and pre-processing phases of this study. Poetry in Punjabi brings an ordered arrangement of phrases and words, generally applied to sarcasm about social and cultural aspects of Punjab (Kaur and Saini, 2017; Mourad and Darwish, 2013; Eshrag Refaee, 2014). Another Punjabi-specific challenge relates to ontology: the morphological mapping of Punjabi text to its respective classes is a complex, NP-hard problem (Nidhi and Gupta, 2012b; Kaur and Sharma, 2016). The key issue for Punjabi text processing is the limited availability of resources such as Punjabi corpora, software libraries and Punjabi recognition tools (Kaur et al., 2010; Gupta and Lehal, 2011; Nidhi and Gupta, 2012a; Gupta, 2013). The 184 Punjabi stop-words introduced by Kaur and Saini (2016) are used in the pre-processing phase of the proposed work.

Besides the Punjabi-specific issues, some general linguistic issues are responsible for low accuracy in text classification. Spam and sarcasm are general linguistic issues in sentiment classification found by Liu (2012) for the Arabic language. Domain specificity determines the sentiment of text, as observed by Varghese and Jayasree (2013). Processing the farmer suicide dataset brings this domain dependency into the present work. Sometimes the implicit sentiment of text cannot be determined because of the objective nature of Punjabi phrases (Abdul-Mageed et al., 2011; Abdul-Mageed and Diab, 2012). For example: “ਫਾਹਾ ਲੈ ਕੇ ਜਾਂ ਜਹਿਰੀਲੀ ਦਵਾਈ ਪੀ ਕੇ ਮੌਤ ਨੂੰ ਗਲੇ ਲਗਾਉਣਾ ਕੋਈ ਸੌਖਾ ਕੰਮ ਨਹੀ।” This sentence carries no sentiment, but the proposed system classifies it into the negative class C1 because of the presence of terms like “ਜਹਿਰੀਲੀ ਦਵਾਈ” and “ਮੌਤ” (Arora and Kaur, 2015; Kaur and Kaur, 2015). The degree of negativity in a report is always fuzzy. For example: “ਆੜਤੀ ਅਤੇ ਬੈਂਕ ਦਾ ਕਰਜਾ ਵਿਆਜ-ਸਹਿਤ ਵੱਧ ਰਿਹਾ ਸੀ”. Here the terms “ਕਰਜਾ” (loan) and “ਵਿਆਜ” (interest) are not quantified as numerical figures, which makes the statement fuzzy (Wang et al., 2015). Fig. 1 summarizes the key challenges and issues related to Punjabi text classification.

7. Data collection and pre-processing

Extracting meaningful text from raw web data is the most important and integral task of the pre-processing phase. Most datasets available from established sources such as Kaggle (Kaggle Datasets) and the UCI machine learning repository (UCI machine learning repository) are in English, and it is hard to find datasets for a specific region such as the Punjab state of India on these sites. We therefore used the authenticated websites of local newspapers such as Jagbani, Ajitweekly, Punjabi Jagran, Punjabi Tribune, Nawan Zamana, Pehredar, and Punjab Times. Since this study is concerned with sentiment analysis of farmer suicide cases, the terms ‘ਖੁਦਕੁਸ਼ੀ’ (suicide), ‘ਜਮੀਨ’ (Land), ‘ਕਰਜਾ’ (Loan), ‘ਆਰਥਿਕ ਤੰਗੀ’ (Financial Crises) etc. were used to extract Punjabi text related to farmer suicide cases. The suicide reports were available along with their date of publication; we therefore divided the documents month-wise into eleven text files (January to November 2017). The details of data collection are given in Table 12 and a statistical description of the dataset in Table 13.

The Punjabi text obtained from the various online sources was diverse in font styles and font sizes. We therefore systematically converted all font styles into Asses (Punjabi Font) for uniformity (using www.punjabiconverter.com) and finally into Unicode (Punjabi converter), as depicted in Fig. 2. The eleven Unicode files then went through stop-word removal and punctuation removal. It is generally believed that stop-words and punctuation symbols do not carry information that affects the overall sentiment of a document. Morphemes were then extracted from the normalized Punjabi lexemes, followed by identification of dialectal words from the three Punjabi dialects discussed earlier (Kaur et al., 2017).
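The stop-word and punctuation removal steps can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' pipeline: the stop-word set holds only a handful of common Punjabi function words rather than the 184-word list of Kaur and Saini (2016), punctuation is detected through Unicode general categories, and hyphens are kept so that compounds stay intact.

```python
import unicodedata

# Tiny illustrative stop-word set; the study uses a 184-word list instead.
STOP_WORDS = {"ਦੀ", "ਦਾ", "ਦੇ", "ਹੈ", "ਅਤੇ", "ਨੇ"}


def is_punctuation(ch):
    """True for any character in a Unicode punctuation category (P*)."""
    return unicodedata.category(ch).startswith("P")


def preprocess(text):
    """Replace punctuation (except '-') with spaces, then drop stop-words."""
    cleaned = "".join(
        " " if is_punctuation(ch) and ch != "-" else ch for ch in text
    )
    return [t for t in cleaned.split() if t not in STOP_WORDS]


tokens = preprocess("ਕਿਸਾਨ ਦੀ ਖੁਦਕੁਸ਼ੀ ਦਾ ਕਾਰਨ ਹੈ ਕਰਜ।")
```

The danda ‘।’ falls in the Unicode category Po, so it is removed together with commas and question marks, while Gurmukhi vowel signs are combining marks and are left untouched.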
Words from the Malwai and Doabi dialects such as ‘ਗੱਬੇ’ (Center), ‘ਬਿਆਜ’ (Interest), and ‘ਬੀਰ’ (Brother) are replaced with their canonical Maajhi equivalents ‘ਵਿਚਾਲੇ’ (Center), ‘ਵਿਆਜ’ (Interest), and ‘ਵੀਰ’ (Brother) respectively. The refined set of Punjabi word


tokens are collected in an output text file as shown in Fig. 3 (Šilić et al., 2007).

8. Proposed framework for sentiment analysis of Punjabi text

The refined Punjabi tokens from the output.txt file are loaded with a loader function of NLTK (Natural Language Toolkit) under Python 3.6, using UTF-16 encoding. The decoded Punjabi text is restored to its original form in the Python shell. Class assignment of Punjabi word tokens is done manually for the initial ten tokens. Four classes are designed along with their respective sentiment scores in Table 13 (Hrala and Kral, 2013). Four features (discussed in Section 4), viz. SL, TF-ISF, Punjabi Nouns and CPEN, are extracted from the labelled Punjabi tokens. The word vectorization

Table 11 Summary of recent articles for Punjabi text processing.
Author(s) (Year of Publication) | Description | Approach/Algorithm | Dataset | Performance
Kaur et al. (2017) | Study of three Punjabi dialects viz. Majhi, Maalvi and Doabi (Kaur et al., 2017) | Punjabi dialects conversion system using Rule based Morphological Analyser | Developed nine dictionaries containing about 6000 dialectal words and validated the system over 11,000 Punjabi words taken from novels and articles | 96.58% accuracy rate for Majhi, 96.48% for Maalvi and 97.54% for Doabi
Kaur (2017) | Performed optical character recognition for Punjabi Gurmukhi text (Kaur, 2017) | Image acquisition, recognition and classification algorithms using Matlab's toolbox | Handwritten images of Punjabi text manually taken from Punjabi writers | Yield satisfactory results in classification of Punjabi characters
Kaur and Saini (2017) | Classification of Punjabi poetry into four categories of Punjabi poems (Kaur and Saini, 2017) | Ten models are trained and tested for poetry classification using machine learning algorithms | Manually obtained 240 Punjabi poems through www.punjabikavita.com, www.punjabizm.com, www.punjabimaaboli.com | 50.63% accuracy using Hyperpipes, 52.92% for kNN, 52.75% for NB and 58.79% for SVM
Sahani et al. (2016) | Marathi text categorization for general text documents and News documents (Sahani et al., 2016) | Proposed LINGO algorithm for cluster Label Induction and cluster content discovery | 24 general category documents and 33 News documents manually collected from online web sources | Rand Measure of 95.83% for general documents and 93.93% for news documents
Hanumanthappa and Narayana Swamy (2016) | Text categorization of Indian language documents and keyword extraction (Hanumanthappa and Narayana Swamy, 2016) | Proposed machine learning algorithms on KNN, NB, C4.5 (J48) for Indian languages | Kannada, Tamil and Telugu language documents (100 for each) manually obtained from web sources | Accuracy of 93% using kNN, 97.33% for J48 and 97.66% for NB classifiers
Kaur and Saini (2016) | A report on Punjabi Stop words (Kaur and Saini, 2016) | Using Linguistic approach from traditional methods | Punjabi language documents taken from Punjabi articles, news, novels and books | 184 Punjabi stop-words are manually obtained
Hentschel and Pal (2015) | Punjabi language text, audio, and video data crowdsourcing online (Hentschel and Pal, 2015) | Proposed a website containing crowd-sourced Punjabi language content | Initial archive consists of Punjabi language text and videos obtained through interview of 8 Punjabi speakers from different disciplines | Significant performance of system in accumulation of Punjabi content
Kaur and Sharma (2016) | Automatic graph based system for Punjabi domain ontological development (Kaur and Sharma, 2016) | Proposed system generates domain ontology of Punjabi text | Punjabi documents (1000 in number) concerned with agriculture, health, entertainment, politics, and sports manually obtained from web | Significant results in correct classification of Punjabi documents
Salesky and Shen (2014) | Evaluation of four languages viz. English, Dari, Pashto, and Arabic in terms of their morphology and grammar (Salesky and Shen, 2014) | Trained and tested a Machine learning model using SVM and Margin Infused Relaxed Algorithm | 1390 documents taken from DLIFLC (Defence Languages Institute Foreign Language Centre) | 77% reduction in MSE (Mean Squared Error) for Pashto language
Gupta (2013) | Automatic normalization of Punjabi words for NLP applications (Gupta, 2013) | Rule based algorithm for normalization of Punjabi Nouns | 50 Punjabi news articles manually collected from online news sources | Significant normalization of Punjabi Nouns with 1.562% variation in spellings
Nidhi and Gupta (2012a) | Classification of Punjabi text into predefined 8 classes (Nidhi and Gupta, 2012a) | Proposed a classification algorithm using ontology based technique | Obtained 150 Punjabi text documents from web sources like www.likhari.org, www.jagbani.com, www.ajitweekly.com, etc. | Yield satisfactory results in terms of correct classification
Nidhi and Gupta (2012b) | Classification of Punjabi text using domain based ontology and Hybrid technique | Proposed an algorithm using Ontology based and Hybrid method for pre-processing; trained and tested Machine learning algorithm for classification | Corpus of 180 Punjabi news articles extracted from online web sources | 85% accuracy in classification for Ontology based and Hybrid, 71% for Centroid based, and 64% for Naïve Bayes classification
Gupta and Lehal (2011) | Summarization of Punjabi text using feature selection and weight learning (Gupta and Lehal, 2011) | Estimation of text features and their weights is done using regression method | 50 Punjabi text documents taken from online Punjabi news websites | Obtained significant results in text summarization
Kaur et al. (2010) | Categorization of Punjabi Synsets and syntactic evaluation of Punjabi WordNet relations (Kaur et al., 2010) | Morphological evaluation of semantic relations using traditional NLP techniques | Hindi Synsets (35000 in number) taken for evaluation of Punjabi Synsets | Significant concept identification in Punjabi language
Kaumar and Goyal (2010) | Developed a Hindi-Punjabi machine translation system using Hindi-Punjabi parallel corpus (Kaumar and Goyal, 2010) | Coined Hindi-Punjabi translation system using NLP techniques | Obtained 50 k sentences from web | Proposed system was found 94.5% accurate in terms of translation
Bhatia and Sharma (2009) | Developed Punjabi to Universal Network of Languages (UNL) convertor using morphological processing of Punjabi text (Bhatia and Sharma, 2009) | Rule based conversion algorithm was used to design convertor | Punjabi sentences from books, novels and Punjabi articles taken for defining morphological rules | Proposed system performed significantly in conversion
source: http://h2plearnpunjabi.org


Fig. 1. Challenges and Issues in Morphological Processing and Sentiment Analysis.

Table 12 Statistics of Farmer Suicide Dataset.
Sr. No | Month (2017) | Number of Farmer Suicide Cases
1 | November | 25
2 | October | 21
3 | September | 23
4 | August | 24
5 | July | 21
6 | June | 23
7 | May | 27
8 | April | 31
9 | March | 31
10 | February | 24
11 | January | 25
Sources (all months): Jagbani (Jagbani E-Newspaper), Ajitweekly (Ajit Weekly), Rojana Spokesman (Rojana Spokesman), Punjabi Jagran (Punjabi Jagran), Punjab Kesri (Jagbani E-Newspaper), Punjabi Tribune (Punjabi Tribune), Chardikala News (Chardikala News), Nawan Zamana (Nawan Zamana), Pehredar (Daily Pehredar), Jan Jagrati (Jan Jagrati), Doaba Headlines (Doaba Headlines) and Punjab Times (Punjab Times)

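Table 13 Statistics of features obtained from Dataset.

| Category | Features | Number of Instances |
|---|---|---|
| Morphological Features | Punjabi Tokens | 5218 |
| Morphological Features | Lexemes | 1209 |
| Morphological Features | Morphemes | 1812 |
| Morphological Features | Allomorphs | 198 |
| Morphological Typology | Isolation/Analytical Association | 154 |
| Morphological Typology | Synthetic Morphemes | 71 |
| Morphological Typology | Agglutinative Morphemes | 238 |
| Morphological Typology | Concatenative Morphemes | 413 |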

of Punjabi word tokens is done through the CBOW (Continuous Bag of Words) method, followed by assignment of initial weights and sentiment scores to the word vectors using the sklearn, tensorflow and theano libraries of Python 3.6. ReLU and sigmoid activation functions are used during the training of the deep neural network (DNN) model. The first ten successive runs and test vectors are fit for the validation of the DNN model, followed by 10-fold cross validation of sentiment prediction. Fig. 4 below shows the working of the proposed model for sentiment classification of farmer suicide cases in Punjab.

The proposed model trains the DNN using a recurrent neural network to investigate Punjabi text for sentiment prediction. The text from the output.txt file is split into two subsets in the ratio 80:20. The former subset is used for training the DNN and the latter for testing the classification. The Punjabi text in the training subset is labelled manually by human annotators, while the testing set is kept unlabelled. NLTK's conventional functions are used to extract the four features, viz. SL, TFISF, Punjabi Nouns and CPEN. Punjabi sentences are delimited by the '|' character and Punjabi words, like English words, are delimited by the 'space' character. These word tokens are fed to the proposed DNN model in the sequence $w_1, w_2, w_3, \ldots, w_n$. The output of the model yields Punjabi sentences tagged with their respective sentiment classes. Equations (5) and (6) represent the left context of word $(w)_i$ as $c(w)^l_i$ and the right context as $c(w)^r_i$, respectively.

$$c(w)^l_i = W^l\, c(w)^l_{i-1} + W^{sl}\, e(w)_{i-1} \tag{5}$$

$$c(w)^r_i = W^r\, c(w)^r_{i+1} + W^{sr}\, e(w)_{i+1} \tag{6}$$

Here $e(w)_{i-1}$ and $e(w)_{i+1}$ are the word vectors of the words immediately to the left and right of the target word $(w)_i$, respectively. The mappings from one contextual hidden layer to the next are represented by $W^l$ and $W^r$, respectively. $W^{sl}$ and $W^{sr}$ are the semantic matrices that map the target word's context to the words on its left and right, respectively. The forward pass of scanning word


Fig. 2. Steps of Punjabi text collection.

vectors from left to right computes all the $c(w)^l_i$ values, whereas the backward pass computes all the $c(w)^r_i$ values. The DNN builds a fixed window for the word embedding vector $(x)_i$ of word $(w)_i$, as given in Eq. (7) below.

$$(x)_i = \left[\, c(w)^l_i \;\; e(w)_i \;\; c(w)^r_i \,\right] \tag{7}$$

Further, the activation function used by the DNN (Eq. (8)) is the hyperbolic tangent, tanh, applied to a linear combination of the context vectors and the weight matrix along with a bias $b$ in the next hidden layer.

$$(y)^2_i = \tanh\left( (W)^2 (x)_i + (b)^2 \right) \tag{8}$$

This activation is performed over every neuron in the input layer of the DNN, as shown in Eq. (9) below. Here $(y)^2_i$ represents a semantic vector, and the max-pooling layer picks its maximum.

$$(y)^3 = \max_{1 \le i \le n} (y)^2_i \tag{9}$$

Lastly, Eq. (10) gives the latent semantic word vector obtained by a linear combination of the max-pooling layer's output and the updated weight matrix in the output layer of the DNN model.

$$(y)^4 = (W)^4 (y)^3 + (b)^4 \tag{10}$$

The semantic vectors obtained from the output layer of the DNN are validated using the softmax classifier given in Eq. (11).

$$P_j = \frac{e^{(y)^4_j}}{\sum_{k=1}^{n} e^{(y)^4_k}} \tag{11}$$

Here $P_j$ gives the probability that word vector $(w)_i$ belongs to the $j$th class of text, where $1 \le j \le 4$.

Fig. 3. Data Pre-Processing Phase of Proposed model.

9. Results and analysis
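As an illustration of the model described in Section 8, the forward computation of Eqs. (5) to (11) can be traced with a minimal pure-Python sketch. The dimensions, the random weights, the zero boundary contexts and the element-wise reading of the max in Eq. (9) are assumptions made for this example only; they are not taken from the authors' implementation, which is reported to use tensorflow and theano.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def vadd(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def rand_matrix(rows, cols, rng):
    """Random matrix; stands in for trained weights (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def forward(E, p):
    """Class probabilities for a sequence of word vectors E = [e(w)_1, ..., e(w)_n]."""
    n = len(E)
    c = len(p["Wl"])
    # Boundary contexts are set to zero vectors (assumption of this sketch).
    cl = [[0.0] * c for _ in range(n)]
    cr = [[0.0] * c for _ in range(n)]
    for i in range(1, n):                      # left-to-right scan, Eq. (5)
        cl[i] = vadd(matvec(p["Wl"], cl[i - 1]), matvec(p["Wsl"], E[i - 1]))
    for i in range(n - 2, -1, -1):             # right-to-left scan, Eq. (6)
        cr[i] = vadd(matvec(p["Wr"], cr[i + 1]), matvec(p["Wsr"], E[i + 1]))
    y2 = []
    for i in range(n):
        x = cl[i] + E[i] + cr[i]               # concatenation, Eq. (7)
        z = vadd(matvec(p["W2"], x), p["b2"])
        y2.append([math.tanh(v) for v in z])   # Eq. (8)
    y3 = [max(col) for col in zip(*y2)]        # max over word positions, Eq. (9)
    y4 = vadd(matvec(p["W4"], y3), p["b4"])    # Eq. (10)
    m = max(y4)                                # shift for numerical stability
    exps = [math.exp(v - m) for v in y4]
    total = sum(exps)
    return [e / total for e in exps]           # softmax, Eq. (11)


# Hypothetical sizes: 7 words, embedding size 5, context size 3, hidden size 6.
rng = random.Random(0)
n_words, d, c, h, n_classes = 7, 5, 3, 6, 4
params = {
    "Wl": rand_matrix(c, c, rng), "Wsl": rand_matrix(c, d, rng),
    "Wr": rand_matrix(c, c, rng), "Wsr": rand_matrix(c, d, rng),
    "W2": rand_matrix(h, 2 * c + d, rng), "b2": [0.0] * h,
    "W4": rand_matrix(n_classes, h, rng), "b4": [0.0] * n_classes,
}
E = rand_matrix(n_words, d, rng)
probs = forward(E, params)
print(probs)
```

The returned list holds one probability per class C1 to C4; with trained rather than random weights, the index of the largest entry would be the predicted class.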

The topic of farmer suicides in Punjab is itself negative in sentiment polarity. Therefore, in order to classify farmer suicide cases, four negatively oriented classes have been designed. The negative polarity increases from class C1 to class C4, with the sentiment score ranging from 0 to 1. An in-depth investigation of farmer suicide cases identified seven social and economic factors behind suicide, viz. loan stress, daughter's marriage, crop failure, drug addiction in the family, mortgage of land, unemployment and gambling. The description of the four classes is shown in Table 14 below. The relation between the age of the farmer and the number of farmers reported for suicide during the period from 01 January 2017 to 30 November 2017 is shown in Table 15. The maximum number of suicide cases was found in the age category of 51 to 60, which reveals the socio-economic conditions of young farmers. The two most prominent factors, viz. stress of loan and marriage of daughter, are involved in most of the suicide cases of the age category


Fig. 4. Architecture of Proposed System.

Table 14 Four Classes of Farmer Suicide cases (F1: Loan Repayment, F2: Daughter's Marriage, F3: Crop Failure, F4: Drug Addiction, F5: Land Mortgage, F6: Unemployment, F7: Gambling).

| Farmer Suicide Class | Sentiment Score | Description (English) | Description (Punjabi) |
|---|---|---|---|
| C1 | 0.00–0.25 | F1 only | ਸਿਰਫ ਕਰਜਾ। |
| C2 | 0.26–0.50 | F1 + F2 | ਕਰਜਾ ਅਤੇ ਧੀ ਦੇ ਵਿਆਹ। |
| C3 | 0.51–0.75 | F1 + F2 + F3 + F4 | ਕਰਜਾ, ਧੀ ਦਾ ਵਿਆਹ, ਫਸਲ ਅਤੇ ਨਸ਼ਾ। |
| C4 | 0.76–1.0 | F1 + F2 + F3 + F4 + F5 + F6 + F7 | ਕਰਜਾ, ਧੀ ਦਾ ਵਿਆਹ, ਫਸਲ ਅਤੇ ਨਸ਼ਾ, ਜਮੀਨ, ਬੇਰੋਜਗਾਰੀ ਅਤੇ ਜੂਆ। |
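The score bands of Table 14 amount to a simple lookup from sentiment score to class. The sketch below shows one way to express it; the function name `suicide_class` is hypothetical, and assigning scores that fall between two printed bands (for example 0.255) to the higher band is an assumption, since the table does not say how such values are handled.

```python
# Score bands copied from Table 14: (class label, lower bound, upper bound).
CLASS_BANDS = [
    ("C1", 0.00, 0.25),
    ("C2", 0.26, 0.50),
    ("C3", 0.51, 0.75),
    ("C4", 0.76, 1.00),
]


def suicide_class(score: float) -> str:
    """Return the Table 14 class label for a sentiment score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("sentiment score must lie between 0 and 1")
    # Only the upper bounds are compared, so a score lying in the gap
    # between two bands is assigned to the higher band (assumption).
    for label, _low, high in CLASS_BANDS:
        if score <= high:
            return label
    return CLASS_BANDS[-1][0]


for s in (0.23, 0.69):
    print(s, suicide_class(s))
```

The two scores used in the loop are the ones reported for the example sentences in Table 18.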

Table 15 Suicide cases reported in Punjabi Newspapers.

| Age of Farmer | Number of Suicide Cases |
|---|---|
| Below 50 | 68 |
| 51–60 | 116 |
| 61–70 | 62 |
| 71–80 | 23 |
| Above 81 | 6 |

from 51 to 60. On the other hand, very few suicide cases were reported in the elderly age groups of 71 to 80 and above 81. Table 16 gives the month-wise classification of farmer suicide cases. The average number of suicide cases per month is around 25, and the largest number of suicide cases is classified in class C1, which indicates that the major cause of stress among farmers is loan and financial hardship. The number of suicide cases above average was found in the months of April and September. Although there is hardly any work available on Punjabi text classification for a social issue like farmer suicide, some researchers have performed categorization and classification of Indian languages for different benchmarks, as shown in Table 17 below. The proposed model is validated by comparing its accuracy and techniques with those used for text classification in the benchmarks. Table 18 shows two examples of Punjabi sentences processed using the proposed methodology. Here the first sentence is classified in class C1 due to the presence of the word 'ਕਰਜਾ' (Loan), while the second sentence is classified in class C3 due to the presence of the terms 'ਧੀ ਦਾ ਵਿਆਹ' (Daughter's Marriage) and 'ਨਸ਼ੇ ਦੀ ਲੱਤ' (Drug addiction).


Table 16 Month-wise Classification matrix for Farmer Suicide cases. Here T represents the total number of cases in a class and C gives the number of cases correctly classified by the proposed framework.

| Unicode pre-processed dataset file | Class C1 T | Class C1 C | Class C2 T | Class C2 C | Class C3 T | Class C3 C | Class C4 T | Class C4 C |
|---|---|---|---|---|---|---|---|---|
| D1_Jan_2017 | 13 | 11 | 7 | 7 | 3 | 3 | 2 | 2 |
| D2_Feb_2017 | 11 | 11 | 8 | 7 | 4 | 2 | 1 | 1 |
| D3_Mar_2017 | 15 | 14 | 6 | 6 | 7 | 7 | 3 | 3 |
| D4_Apr_2017 | 19 | 15 | 7 | 4 | 3 | 3 | 2 | 2 |
| D5_May_2017 | 13 | 13 | 9 | 8 | 3 | 2 | 2 | 1 |
| D6_Jun_2017 | 11 | 10 | 7 | 6 | 4 | 3 | 1 | 1 |
| D7_Jul_2017 | 11 | 10 | 6 | 6 | 3 | 3 | 1 | 1 |
| D8_Aug_2017 | 14 | 14 | 3 | 3 | 4 | 3 | 3 | 3 |
| D9_Sep_2017 | 16 | 15 | 2 | 2 | 3 | 3 | 2 | 2 |
| D10_Oct_2017 | 12 | 12 | 6 | 5 | 2 | 2 | 1 | 1 |
| D11_Nov_2017 | 15 | 15 | 7 | 5 | 2 | 1 | 1 | 1 |

Table 17 Validation of proposed model with existing benchmarks.

| Benchmarks | Accuracy of text classification | Techniques used |
|---|---|---|
| Punjabi Poetry classification by Kaur and Saini (2017) | 52.75% using NB, 58.79% using SVM, and 52.92% using KNN | Naïve Bayes, Support Vector Machine, and K-Nearest Neighbours |
| Marathi Text categorization by Sahani et al. (2016) | 94.83% of Rand Measure for General Category and 93.93% for News category documents | LINGO Algorithm for Marathi language |
| Indian languages text categorization by Hanumanthappa and Narayana Swamy (2016) | 93% of overall average accuracy of text classification | C4.5, NB, and KNN |
| Domain based Punjabi text classification by Nidhi and Gupta (2012b) | 85% of accuracy for both ontology based and hybrid approach, 71% for Centroid based, and 64% for NB | Ontology Based, Hybrid Approach, Centroid based text classifications, and Naïve Bayes classifier for learning |
| Proposed model for morphological processing and sentiment classification of Punjabi text using DNN classifier | Maximum accuracy achieved is 95.45% | Deep Neural Network and Morphological Punjabi text classification |
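The T and C counts of Table 16 can be pooled over the eleven monthly files to obtain a per-class and an overall ratio of correctly classified cases. The sketch below does this with plain Python; the variable and function names are illustrative. These pooled ratios are a simple cross-check on the table and need not coincide with the averaged 10-fold cross-validation accuracies reported in the paper, which are computed differently.

```python
CLASSES = ("C1", "C2", "C3", "C4")

# One (T, C) pair per class for each monthly file, copied from Table 16.
TABLE_16 = {
    "D1_Jan_2017": ((13, 11), (7, 7), (3, 3), (2, 2)),
    "D2_Feb_2017": ((11, 11), (8, 7), (4, 2), (1, 1)),
    "D3_Mar_2017": ((15, 14), (6, 6), (7, 7), (3, 3)),
    "D4_Apr_2017": ((19, 15), (7, 4), (3, 3), (2, 2)),
    "D5_May_2017": ((13, 13), (9, 8), (3, 2), (2, 1)),
    "D6_Jun_2017": ((11, 10), (7, 6), (4, 3), (1, 1)),
    "D7_Jul_2017": ((11, 10), (6, 6), (3, 3), (1, 1)),
    "D8_Aug_2017": ((14, 14), (3, 3), (4, 3), (3, 3)),
    "D9_Sep_2017": ((16, 15), (2, 2), (3, 3), (2, 2)),
    "D10_Oct_2017": ((12, 12), (6, 5), (2, 2), (1, 1)),
    "D11_Nov_2017": ((15, 15), (7, 5), (2, 1), (1, 1)),
}


def pooled_counts(table):
    """Sum total (T) and correctly classified (C) cases per class."""
    totals = {cls: [0, 0] for cls in CLASSES}
    for row in table.values():
        for cls, (t, c) in zip(CLASSES, row):
            totals[cls][0] += t
            totals[cls][1] += c
    return {cls: tuple(pair) for cls, pair in totals.items()}


def accuracy(total, correct):
    """Percentage of correctly classified cases."""
    return 100.0 * correct / total


counts = pooled_counts(TABLE_16)
for cls in CLASSES:
    t, c = counts[cls]
    print(f"{cls}: {c}/{t} correct, {accuracy(t, c):.2f}%")

all_total = sum(t for t, _ in counts.values())
all_correct = sum(c for _, c in counts.values())
print(f"overall: {all_correct}/{all_total} correct, "
      f"{accuracy(all_total, all_correct):.2f}%")
```

The grand total of T values equals the 275 cases listed month by month in Table 12, which is a useful consistency check on the reconstructed table.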

Table 18 Classification Example Processed using Proposed Framework.

| Example Punjabi Text | Sentiment Score | Suicide Class |
|---|---|---|
| ਕਿਸਾਂਨਾ ਵੱਲੋਂ ਕਰਜੇ ਤੋਂ ਦੁੱਖੀ ਹੋ ਕੇ ਕੀਤੀਆਂ ਜਾ ਰਹੀਆਂ ਖੁਦਕੁਸ਼ੀਆਂ ਦੇ ਕਾਰਨ ਮਾਨਸਾ ਦੇ ਸ਼ਹਿਰ ਬੁਢਲਾਡਾ ਦੇ ੬੦ ਸਾਲਾ ਕਿਸਾਨ ਸੁਖਮ ਸਿੰਘ ਪੁੱਤਰ ਕਪੂਰ ਸਿੰਘ ਨੇ ਕਰਜੇ ਤੋਂ ਤੰਗ ਆ ਕੇ ਰੇਲ ਗੱਡੀ ਹੇਠ ਆ ਕੇ ਖੁਦਕੁਸ਼ੀ ਕਰ ਲਈ। ਮਰਨ ਵਾਲੇ ਕਿਸਾਨ ਦੇ ਸਿਰ ਬੈਂਕ ਤੇ ਸੋਸਾਈਟੀ ਦਾ ਕਰੀਬ ੩.੫ ਲੱਖ ਰੁਪਏ ਦਾ ਕਰਜਾ ਸੀ ਤੇ ਉਸ ਕੋਲ ਸਿਰਫ ੪ ਕਿੱਲੇ ਜਮੀਨ ਸੀ। | 0.23 | C1 |
| ਤਲਵੰਡੀ ਸਾਬੋ ਵਿਚ ਕਰਜੇ ਦੇ ਬੋਝ ਹੇਠ ਦੱਬੇ ਇਕ ਹੋਰ ਕਿਸਾਨ ਨੇ ਖੁਦਕੁਸ਼ੀ ਕਰ ਲਈ। ਪਿੰਡ ਜੱਗਾ ਰਾਮ ਤੀਰਥ ਦੇ ਨਿਰਭੈ ਸਿੰਘ ਨੇ ਫਾਹਾ ਲਾ ਕੇ ਮੌਤ ਨੂੰ ਗਲੇ ਲਾ ਲਿਆ। ਘੱਟਨਾ ਰਾਤ ੧੧ ਵਜੇ ਦੇ ਕਰੀਬ ਦੀ ਹੈ। ਦਸਿਆ ਜਾ ਰਿਹਾ ਹੈ ਕਿ ਕਿਸਾਨ ਦੇ ਸਿਰ ਕਰੀਬ ੧੨ ਲੱਖ ਦਾ ਕਰਜਾ ਸੀ। ਕਿਸਾਨ ਦੀਆਂ ਦੋ ਧੀਆਂ ਤੇ ਇਕ ਪੁੱਤਰ ਸੀ। ਇਕ ਧੀ ਦਾ ਵਿਆਹ ਕਰਜਾ ਚੁੱਕ ਕੇ ਕੀਤਾ ਸੀ ਅਤੇ ਪੁੱਤਰ ਨਸ਼ਿਆਂ ਦੀ ਬਹਿਣੀ ਬਹਿ ਗਿਆ ਸੀ। | 0.69 | C3 |

10. Conclusion

This work has accomplished morphological processing and sentiment prediction of farmer suicide cases reported in the Punjabi language on the Punjabi news websites mentioned in Table 12. We have extracted 275 suicide cases in Punjab from 01 January 2017 to 30 November 2017 and categorized the cases according to the age of the farmer, as shown in Table 15. Classification showed that farmers in the 51–60 age group accounted for the highest number of suicides, while very few cases were noticed in the age groups of 71–80 and above 81. Furthermore, the classification reveals that most of the suicide cases fell in the first two classes. The mean classification accuracy observed after 10 successive epochs of the proposed framework is 90.29%, and the average values of the parameters SL and TFISF were 0.89 and 3.82, respectively. These average values are higher than the values of SL and TFISF for normal Punjabi text in books, novels, and articles. This outcome suggests that the main reasons behind the distress of Punjabi farmers are loan stress and daughter's marriage. These two factors are intertwined and reflect the socio-economic status of farmers in Punjab. Hence it is the combined responsibility of the common man, religious bodies, and the government of Punjab to devise appropriate measures for reducing suicidal tendencies among Punjabi farmers. The future scope of this research work is to investigate more complex features, viz. cultural, personal, religious and geological importance, and also to extend the evaluation to more causes behind the agony of Punjabi farmers. Moreover, this study has not considered the grammatical errors that are often present in reported Punjabi text. The extended version of this work will consider grammatical errors.

References

(n.d.). Retrieved from Punjabi converter: www.punjabiconverter.com
(n.d.). Retrieved from UCI machine learning repository: https://archive.ics.uci.edu/ml/index.php
(n.d.). Retrieved from Kaggle Datasets: https://www.kaggle.com
(n.d.). Retrieved from Ajit Weekly: http://www.ajitjalandhar.com/
(n.d.). Retrieved from Rojana Spokesman: https://rozanaspokesman.com/
(n.d.). Retrieved from Punjabi Jagran: http://punjabi.jagran.com/
(n.d.). Retrieved from Punjabi Tribune: http://punjabitribuneonline.com/
(n.d.). Retrieved from Chardikala News: http://charhdikala.com/
(n.d.). Retrieved from Nawan Zamana: http://nawanzamana.in/
(n.d.). Retrieved from Daily Pehredar: http://www.dailypehredar.com/
(n.d.). Retrieved from Jan Jagrati: http://www.dailyjanjagriti.com/
(n.d.). Retrieved from Doaba Headlines: http://www.doabaheadlines.co.in/home/
(n.d.). Retrieved from Punjab Times: http://www.dailypunjabtimes.com/
(2017, June). Retrieved from The Hindu: http://www.thehindu.com
Abdul-Mageed, M., Diab, M.T., 2012. AWATIF: A Multi-Genre Corpus for Modern Standard Arabic Subjectivity and Sentiment Analysis. In: International Conference on Language, Resources and Evaluations, pp. 3907–3914. Istanbul.
Abdul-Mageed, M., Diab, M.T., Korayem, M., 2011. Subjectivity and sentiment analysis of modern standard Arabic. Association for Computational Linguistics, pp. 587–591. Portland.
Ahmed, S., Pasquier, M., Qadah, G., 2013. Key Issues in Conducting Sentiment Analysis on Arabic Social Media Text. In: 9th International Conference on Innovations in Information Technology (IIT). Abu Dhabi: IEEE.
Arora, P., Kaur, B., 2015. An Approach for Sentiment Analysis of Punjabi Text. Int. J. Inf. Technol. Comput. Sci. Perspect. 4 (2), 1464–1470.


Bhatia, P., Sharma, R., 2009. Role of Punjabi Morphology in Designing Punjabi-UNL Enconvertor. ICAC3 '09. Mumbai: ACM.
Salesky, Elizabeth, Shen, W., 2014. Exploiting Morphological Grammatical and Semantic Correlates for Improved Text Difficulty Assessment. Association for Computational Linguistics, Maryland, USA.
Eshrag Refaee, V.R., 2014. An Arabic Twitter Corpus for Subjectivity and Sentiment Analysis. In: International Conference on Language Resources and Evaluation, pp. 2268–2273. Reykjavik.
Gupta, V., 2013. Automatic Normalization of Punjabi Text. Int. J. Eng. Trends Technol. 6 (7), 353–357.
Gupta, V., Lehal, G.S., 2011. Feature Selection and Weight Learning for Punjabi Text Summarization. Int. J. Eng. Trends Technol. 2 (2), 45–48.
Hamdi, A., Shaban, K., Zainal, A., 2016. A Review on Challenging Issues in Arabic Sentiment Analysis. J. Comput. Sci. 12 (9), 471–481.
Hanumanthappa, Narayana Swamy, M., 2016. Indian Language Text Documents Categorization and Keyword Extraction. Int. J. Control Theory Appl. 9 (3), 1473–1481.
Hentschel, J., Pal, J., 2015. Sada Vehra: A Framework for Crowdsourcing Punjabi Language Content. ACM, Singapore.
Hrala, M., Kral, P., 2013. Evaluation of Document Classification Approaches. In: Proceedings of the 8th International Conference on Computer Recognition Systems CORES 2013. 226. Milkov: Advances in Intelligent Systems and Computing, Springer.
Jagbani E-Newspaper. (n.d.). Retrieved from Jagbani: http://jagbani.punjabkesari.in
Jain, U., Saini, K., 2015. Punjabi Text Classification using Naive Bayes Algorithm. Int. J. Curr. Eng. Technol. 5 (6), 3777–3780.
Kaumar, P., Goyal, V., 2010. Development of Hindi Punjabi Parallel Corpus using Existing Hindi Punjabi Machine Translation System. IITM'10. Allahabad: ACM.
Kaur, A., Singh, P., Kaur, K., 2017. Punjabi Dialects Conversion System for Majhi, Malwai and Doabi Dialects. ICCMS. Canberra, Australia: ACM.
Kaur, G., Kaur, K., 2015. Sentiment Analysis on Punjabi News Articles Using SVM. Int. J. Sci. Res. 6 (8), 414–421.
Kaur, J., 2017. Classification of Printed and Handwritten Gurmukhi text using labeling and Segmentation technique. Int. J. Scientific Res. Publ. 7 (1).
Kaur, J., Saini, J.K., 2017. Punjabi Poetry Classification: The Test of 10 Machine Learning Algorithms. In: International Conference on Machine Learning and Computing. Singapore: ACM.
Kaur, J., Saini, J.K., 2016. Punjabi Stop Words: A Gurmukhi, Shahmukhi and Roman Scripted Chronicle. ACM, Indore, India.
Kaur, R., Sharma, S., 2016. Semi Automatic Domain Ontology Graph Generation System in Punjabi. ACM, Udaipur, India.
Kaur, R., Sharma, R., Preet, S., Bhatia, P., 2010. Punjabi Wordnet Relations and Categorization of Synsets. Mumbai: ICON-2010 IIT Kharagpur.
Beesley, Kenneth R., Karttunen, L., 2003. Finite State Morphology. CSLI Studies in Computational Linguistics, CA, USA.
Liberman, M., 2009. Morphology, Linguistics-001, Lecture-7. Retrieved from http://www.ling.upenn.edu/courses/fall2009/ling001/morphology.html
Liu, B., 2012. Sentiment Analysis and Opinion Mining (1st ed.). (G. Hirst, Ed.) Chicago, United States of America: Morgan & Claypool Publishers.
Mourad, A., Darwish, K., 2013. Subjectivity and Sentiment Analysis of Modern Standard Arabic and Arabic Microblogs. In: 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 55–64. Atlanta: Association for Computational Linguistics.
Munro, R., Manning, C.D., 2010. Subword Variation in Text Message Classification. In: HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 510–518. Los Angeles: ACM.
Nidhi, Gupta, V., 2012a. Algorithm for Punjabi Text Classification. Int. J. Comput. Appl. 37 (11), 30–35.
Nidhi, Gupta, V., 2012b. Domain Based Classification of Punjabi Text Documents using Ontology and Hybrid Based Approach. Computational Linguistics, Mumbai, India.
Sahani, A., Sarang, K., Umredkar, S., Patil, M., 2016. Automatic Text Categorization of Marathi Language Documents. Int. J. Comput. Sci. Inf. Technol., 2297–2301.
Šilić, A., Chauchat, J.-H., Bašić, B.D., Morin, A., 2007. N Grams and Morphological Normalization in Text Classification: a Comparison on a Croatian-English Parallel Corpus. In: Portuguese Conference on Artificial Intelligence. 4874. Portuguese: Lecture Notes in Computer Science, Springer, Berlin, Heidelberg.
Varghese, R., Jayasree, M., 2013. Aspect based Sentiment Analysis using support vector machine classifier. IEEE, pp. 1581–1586. Mysore.
Wang, B., Huang, Y., Wu, X., Li, X., 2015. A Fuzzy Computing Model for Identifying Polarity of Chinese Sentiment Words. Comput. Intell. Neurosci. 2015 (2), 1–13.
