HIGHLIGHTS

- The largest unlabeled Turkish dataset and word vectors were created.
- Text classification was applied with deep neural networks on another Turkish multiclass dataset.
- The effect of using pre-trained word embeddings with deep neural networks was investigated.
- The performances of deep neural networks and word embedding methods were compared and analyzed.
- The accuracy rate was improved using Turkish pre-trained word vectors with transfer learning.
Improving the Accuracy Using Pre-Trained Word Embedding on Deep Neural Networks for Turkish Text Classification

Murat AYDOĞAN a,*, Ali KARCI b

a Department of Computer Technology, Genc Vocational School, Bingol University, 12000, Bingol, TURKEY
b Department of Computer Engineering, Faculty of Engineering, Inonu University, 44280, Malatya, TURKEY

* Corresponding author. E-mail address: [email protected] (M. Aydoğan)
ABSTRACT
Today, extreme amounts of data are produced, and this is commonly referred to as Big Data. A significant amount of big data is composed of textual data, and as such, text processing has correspondingly increased in its importance. This is especially true given the development of word embedding and other groundbreaking advances in this field. However, when studies on text processing and word embedding are examined, it can be seen that while there have been many studies oriented toward world languages, especially English, there has been an insufficient level of study specific to the Turkish language. As a result, Turkish was chosen as the target language for the current study. Two Turkish datasets were created for this study. Word vectors were trained using the Word2Vec method on an unlabeled large corpus of approximately 11 billion words. Using these word vectors, text classification was applied with deep neural networks on a second dataset of 1.5 million examples and 10 classes. The current study employed the Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN), and its Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) variants as deep neural network architectures. The performances of the word embedding methods used in this study, their effects on the rate of accuracy, and the success of the deep neural network architectures were then analyzed in detail. When studying the experimental results, it was determined that the GRU and LSTM methods were more successful compared to the other deep neural network models used in this study. The results showed that using pre-trained word vectors (PWVs) improved the accuracy of the deep neural networks by approximately 5% to 7%. The datasets and word vectors of the current study will be shared in order to contribute to the Turkish language literature in this field.

Keywords: Deep learning, Word embedding, Turkish text classification, Text processing
1. Introduction

Natural language processing studies have gained in importance recently with the increase of textual data and have become a popular topic of study in the published literature. Natural language processing techniques are used in fields of work such as text classification, sentiment analysis, machine translation, information retrieval, summarization, and the development of question-answer systems [1-6]. In recent years, deep neural network architectures have brought about significant innovations and have presented successful solutions to natural language processing problems, just as they have created positive impacts in numerous other fields [7]. Deep learning, whose foundations reach back to the 1980s, saw its progress hampered in the 1990s, due in particular to hardware-related problems, but has today become the most popular sub-branch of artificial intelligence studies. Applications continue to be developed using deep learning techniques in many fields such as computer vision, natural language processing, and autonomous vehicles [8-10].

Text classification remains one of the most studied natural language processing problems. Especially with the development of word embedding, a significant increase has been seen in natural language processing research. Whilst various text analyses have been conducted in many global languages, primarily English, studies are still at an inadequate level for the Turkish language. Despite groundbreaking advancements and academic research in text processing, languages other than English have not been able to keep up with anywhere near the same level of published research. Especially for indigenous languages coming from different language families, it is not easy to reproduce similar studies. Although the number of text processing studies in the Turkish language has seen an increase, the number of word embedding studies is still limited.

The primary focus of the current study is text classification performed for the Turkish language using pre-trained word embeddings and popular deep learning networks. For this aim, two Turkish datasets were produced, of which one was labeled and the other unlabeled. The TF-IDF method was used in order to create a stop word list for the Turkish language, which was subsequently applied to these datasets in a preprocessing phase. Word embeddings were trained on the largest dataset using a framework custom built around the intricacies and nuances of the Turkish language. These approaches were then tested on a multiclass text classification case using state-of-the-art deep learning frameworks with custom configurations. The main purpose of the current study is to improve the accuracy rate using pre-trained word embeddings. The accuracy rate was shown to have improved by between 5% and 7% on almost all models using the presented method.

The remainder of the paper is organized as follows. Section 2 presents the related works, while Section 3 provides detailed information about the deep neural network architectures and word embedding methods used in the study. Information is then provided about the materials in Section 4, with the results of the experiments shared in Section 5. The findings procured as a result of the conducted studies are interpreted in Section 6, and finally, conclusions and future works are presented in Section 7.
2. Related Work

This study is related to two research topics: word embedding for the Turkish language, and the use of deep learning with pre-trained word vectors (PWVs) for the purposes of text classification. When studies related to these topics are examined, the following work stands out. Kılınç et al. created a Turkish dataset called TTC-3600, which was comprised of six categories with data collected from various news sites; they also applied a text classification process with different machine-learning algorithms to the dataset [11]. The convolutional neural network (CNN) architecture is a method of deep learning that was recommended for multidimensional inputs and particularly for two-dimensional visual data; however, successful results were also obtained in studies in the field of natural language processing [12]. Collobert and Weston expressed in their first study that convolutional neural network architectures initially designed for image-processing studies could also be used to address natural language-processing problems [13]. Kim et al. performed a text classification process on various datasets in their study, and successful results were obtained with a single-layer convolutional neural network architecture [14]. A study by Zhang et al. on text classification using a character-level convolutional neural network architecture revealed a powerful method that could be used in the resolution of text classification problems [15]. A study conducted by Lai et al. applied text classification with a recurrent convolutional neural network architecture; the recommended model combined a recurrent neural network architecture for contextual information with a convolutional neural network architecture for text embedding [16].

Another model used in this study is the recurrent neural network and two of its different types, the Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures. Recurrent neural networks [17] are neural networks with hidden layers that can analyze streams of data and are suited to the resolution of problems whose output is based on previous calculations. Recurrent neural networks have, for this reason, been shown to be rather successful in the solution of different problems, particularly those based on natural language processing [18]. Due to certain issues that emerged during learning in recurrent neural networks, different versions such as LSTM [19] and GRU [20], as researched in the current study, have been published in the literature. Lee and Dernoncourt used the LSTM model – a type of recurrent neural network architecture – together with the convolutional neural network architecture in their study and obtained successful findings with various datasets [21].

Another important development in the field of natural language processing is artificial neural network-based word embedding methods. The most popular method in this field is the Word2Vec method, which was developed by Mikolov et al. [22-23]. There are two different approaches for the Word2Vec method, which are known as the Continuous Bag of Words (CBOW) and Skip-gram (SG) algorithms. Another important word embedding method is the Global Vectors for Word Representation (Glove) method that was developed at Stanford University [24]. The various word embedding methods, and their effect on classification success, are examined in detail within the current study.

3. Methods

3.1 Deep Neural Networks
Deep learning is a specialized branch of machine learning within the broader concept of artificial intelligence. When reviewing the literature, the first neural network was the perceptron algorithm [25]. This network was formed from an input and an output layer. By adding more than one layer to these neural networks, the goal was to capture more complex relationships [26-27]. This newly created architecture was called a deep neural network architecture. Deep neural networks have also been described as the multilayered, multi-neuron extension of classic artificial neural networks [28-29]. Along with the use in recent years of graphics processing units that can process data faster and more powerfully, the cost of fast processing has fallen, which has resulted in deep neural networks becoming more popular [30].
Convolutional Neural Networks (CNN)

CNN architectures were developed as a special type of multilayer perceptron and were designed with inspiration drawn from the human visual system. CNN is a deep neural network architecture that was initially recommended for image-processing studies [14]. When CNN architectures are used in the field of natural language processing rather than image processing, the values expressed within the matrices become sentences or documents instead of pixels. Each row of the matrix is a vector that embeds a word, and these vectors are known as word embeddings. With the assumption that each word is expressed with a 100-dimensional vector, for a sentence comprised of 10 words the size of the input matrix will be 10x100 [36]. While the filter is slid step by step over the image in image-processing studies, the filter in text processing is slid over entire rows of the matrix (words). Therefore, the width of the filter is the same as the dimension of the word vector [37-40].
Recurrent Neural Networks (RNN)

Recurrent neural networks are a type of artificial neural network used to process sequential information that varies over time [16,41]. Because they operate on data with a sequential flow, they are commonly preferred in areas such as speech, music, and video processing, but primarily for language-processing problems [42-45]. Different from classic artificial neural network structures, the output of each step is fed back as input to the subsequent step [46].
Long Short-Term Memory (LSTM)

The LSTM method provides solutions to the exploding and vanishing gradient problems that the simple RNN architecture faces. It is also an approach that has been preferred in recent years because it generally provides better results than the basic RNN architecture [19,50]. An LSTM cell is built around a cell state and the forget, input, and output gates. The cell state is the channel through which information flows uninterrupted from one cell to another; if the LSTM cells are considered to be connected in sequence, the flow of information between the cells is ensured through this channel [51]. With the forget gate, the LSTM decides which information will be forgotten from the cell state and thus not transferred to the other cells; the information to be forgotten is decided by applying the sigmoid function to the incoming input together with the information coming from the previous state [52]. With the input gate, it is decided which information will be stored in the cell state and transferred to the other cells; the cell state is also updated after the processes in this step are completed and is prepared to be transferred to the subsequent cell [53]. With the output gate, it is decided what the output value must be; however, the entire value coming from the cell state is not provided as the output, as a filter is first applied to the cell state [54].
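For reference, the standard LSTM cell can be written with the following textbook equations (not taken from the paper's figures), where σ is the sigmoid function, x_t the current input, h_{t-1} the previous hidden state, and c_t the cell state:

```latex
% Standard LSTM cell equations (textbook formulation)
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f)          && \text{forget gate}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i)          && \text{input gate}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o)          && \text{output gate}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c)   && \text{candidate cell state}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t    && \text{cell state update}\\
h_t &= o_t \odot \tanh(c_t)                         && \text{hidden state / output}
\end{aligned}
```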
Gated Recurrent Unit (GRU)

GRU is one of the most commonly used varieties of LSTM. When studying the GRU structure, it can be seen that the most important change is the removal of the cell state. A second important change is that the forget and input gates, which were two separate gates, have been combined in the form of a single update gate. With these two changes, the GRU has a simpler structure compared to the LSTM [55-59].
Bidirectional LSTM (BLSTM)

The bidirectional LSTM is a type of LSTM that was developed to further increase model performance in data classification processes, specifically within sequence structures. In the BLSTM structure, two LSTMs are trained rather than one on the input sequence for problems in which the input sequence has time steps [61]. The values x0, x1, x2, ..., xn are used to embed the input vectors. Of the LSTMs trained here, the first receives the input sequence itself, and the second receives the inverted form of that sequence. In this regard, the goal is to perform faster and better training for the problem. After the training process is completed, the values coming from the two LSTM cells are combined in order to produce the result [62].

3.2 Word Embedding

How the texts within a corpus are embedded is one of the most critical points in text-processing studies and the most important input for a network. In its simplest definition, the conversion of text into numerical data is called word embedding. The same text can be converted to different numerical values in different ways [63-66]. Word2Vec, developed in 2013 by Mikolov et al., is an estimation-based word embedding method essentially built on the principle of learning word representations through artificial neural networks. The model is based on the principle of estimating a target word with reference to the words taken as input. The Word2Vec model is composed of two different models, namely CBOW and SG. Experiments were also conducted with the Glove model, another word embedding method, within the scope of the current study [22,23].
Fig. 1. Relation captured by word embedding [60].
CBOW Model (Continuous Bag of Words)

In the CBOW model, the words within the window that are not at the center are taken as input, and the word at the center is estimated as the output. This process continues until the sentence ends (see Fig. 2). The value shown as w(t) is the output value to be estimated and is located at the center, while the values shown as w(t-2), ..., w(t+2) are the input values around the center, depending on the preferred window size (window_size) [22,23].
Skip-Gram Model

The Skip-gram (SG) model processes data in the inverse manner to the CBOW model. The word at the center is taken as the input, and the words not at the center are estimated as the output. This process continues until the sentence ends (see Fig. 3) [22,23].
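As an illustration, a minimal sketch of training such vectors with the gensim library (4.x API) is given below; the corpus file name, window size, and min_count value are illustrative assumptions rather than the authors' exact settings:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream the corpus from disk; each line is assumed to be one preprocessed,
# space-separated Turkish sentence (the file name is a placeholder).
sentences = LineSentence("dataset1_corpus.txt")

# sg=0 selects the CBOW algorithm, sg=1 selects Skip-gram (SG).
model = Word2Vec(
    sentences,
    vector_size=100,   # 100-dimensional vectors, as in the study
    window=5,          # context window size (assumed value)
    min_count=5,       # ignore very rare words (assumed value)
    sg=0,              # 0 = CBOW, 1 = SG
    workers=4,
)

model.wv.save_word2vec_format("turkish_word2vec_cbow.txt")
print(model.wv.most_similar("türk", topn=10))   # nearest neighbours, cf. Fig. 5
```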
Fig. 2. CBOW method

Fig. 3. SG method
Glove

Glove is another word embedding method, whose name is the acronym for “Global Vectors for Word Representation.” It is the most commonly utilized method in the field of natural language processing after the Word2Vec method. It was developed by Pennington et al. at Stanford University in California. However, it is less successful than the Word2Vec method in identifying words with close meanings. Glove is trained on global word-word co-occurrence counts and therefore allows statistics to be used more effectively. The Glove model produces a word space model with 75% accuracy on an analogical dataset [24].
4. Materials

Two Turkish datasets were created in the scope of the current study. The first dataset (Dataset1) is a large corpus created from unlabeled Turkish texts for the purpose of training word vectors. The second dataset (Dataset2) was created for the process of classification and was used together with the acquired word vectors in deep neural networks.
Dataset1

One of the important goals of this study was to test the effect of PWVs on the success of the model. For this aim, the largest corpus within Turkish studies was created. The primary objective in the creation of this dataset was both to improve the classification success within the study and to publicly share the resulting word vectors in order to contribute to the Turkish literature through future studies. The corpus was created with the Python library “Beautiful Soup” [67], which was used to retrieve data from the Internet based on written queries. The corpus was created from Turkish Wikipedia articles, and no spelling errors were found in it. As one of the largest corpora within studies conducted in this field, the created dataset has a size of 60 GB and comprises a total of 22,090,767 lines, 528,087,641 sentences, and 10,562,752,820 individual words that were turned into tokens within the scope of the current study. Dataset1 and the word vectors created from this study will be publicly shared online so as to contribute to the literature on Turkish language text processing studies.
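The paper does not include the crawling code itself; the following is a rough sketch of how page text can be retrieved with requests and Beautiful Soup (the URL and output file name are placeholders):

```python
import requests
from bs4 import BeautifulSoup

def fetch_paragraphs(url):
    """Download a page and return the plain text of its <p> elements."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(separator=" ", strip=True) for p in soup.find_all("p")]

# Placeholder URL: any Turkish Wikipedia article page could be processed this way.
paragraphs = fetch_paragraphs("https://tr.wikipedia.org/wiki/Ankara")
with open("dataset1_corpus.txt", "a", encoding="utf-8") as out:
    out.write("\n".join(paragraphs) + "\n")
```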
Dataset2

The second dataset in this study was created in order to test performance on the Turkish text classification problem, and was used in deep neural network models together with the word vectors trained on Dataset1. Within this scope, feedback from customers and vendors regarding a clothing store was collected. Messages received through call centers and via the website were labeled by operators. Thus, a large and comprehensive dataset comprising 1.5 million samples and 10 classes was created as Dataset2. Table 1 shows sample records with their ID, description, and class, and Table 2 lists the categories of Dataset2.

Table 1 Structure of Dataset2

ID    | Description | Class
00328 | Satın aldığım ürün büyük geldi küçük beden ile değiştirmek istiyorum (The product I bought is too big, I would like to exchange it for a smaller size) | 103
00329 | Aldığım ürün için ödememi taksit seçeneği ile yaptım fakat detaylarımda göremiyorum, bilgi verirseniz sevinirim... teşekkürler (I paid for the product in installments but cannot see it in my details, I would appreciate some information... thanks) | 106
00330 | Aldığımız Eteği çok beğendik teşekkürler kargo sorunsuzdu… (We really liked the skirt we bought, thanks, the shipping was problem-free) | 107
00331 | merhaba,musterımız 22051399 sıparısıne aıt kargosu alıcı teslım almadıgı ıcın musterımız bu konu hakkında en kısa sure ıcerısınde dönüş beklemekedir. (Hello, since the recipient did not receive the shipment for our customer's order 22051399, the customer expects a response on this matter as soon as possible) | 104
00332 | Müşteri Teslimat Yapılmamış Sipariş: 001245 (Customer delivery not made, order: 001245) | 101
00333 | Ürün gayet güzel Kısa surede elime geçti teşekkürler (The product is very nice, it arrived quickly, thanks) | 107

Table 2 Categories of Dataset2

category_name           | class
Sipariş / Order         | 101
Gecikme / Delay         | 102
İade / Return           | 103
Kargo / Cargo           | 104
Ödeme / Payment         | 105
Şikayet / Complaint     | 106
Teşekkür / Appreciation | 107
İptal / Cancel          | 108
Ürün / Product          | 109
Satıcı / Dealer         | 110
5. Experimental Results
In the experimental studies, the data was first prepared for processing by applying the preprocessing steps to Dataset1 and Dataset2. Later, the training process was performed using Dataset1, and the word vectors were created. Classification was then applied on Dataset2 with the deep neural network models detailed in the Methods section and their variations (hyperparameter tuning). The successes of these models were compared and analyzed. Finally, the PWVs were used together with the deep neural networks, and their effects on success were studied. Fig. 4 illustrates the general architecture of the research.
Fig. 4. Overall architecture of the study (Dataset1: 10.5 billion words, 528 million sentences; Dataset2: 1.5 million samples, 10 classes)
All experimental studies were developed using the Python programming language on the Anaconda platform [68]. An Intel Xeon E5-2630 2.20 GHz CPU with 64 GB of memory was used with the Microsoft Windows Server 2012 R2 operating system.
Preprocessing

The following preprocessing steps were applied:
- The text within the text documents was divided into sentences by referring to periods.
- Characters other than letters (e.g., numbers, symbols, punctuation marks) were stripped (cleaned) from the sentences.
- Sentences were separated into words using the spaces between them.
- Words were separated into tokens and reduced to their roots using a stemmer.
- Useless words, known as stop words, were eliminated.

However, while stop word lists have previously been created for other languages, a list of stop words had to be created within the current study, as no satisfactory list specific to the Turkish language was found in the literature.
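A minimal sketch of these steps in Python is given below; the stemming step is omitted because the stemmer used is not specified, and the stop word list shown is only a small excerpt of the list described in the next paragraphs:

```python
import re

def preprocess(text, stop_words):
    """Split raw text into cleaned, tokenized sentences."""
    sentences = []
    for raw in text.split("."):                                   # split into sentences at periods
        cleaned = re.sub(r"[^a-zçğıöşüA-ZÇĞİÖŞÜ ]", " ", raw)     # keep only (Turkish) letters and spaces
        tokens = cleaned.lower().split()                          # split into word tokens at spaces
        tokens = [t for t in tokens if t not in stop_words]       # drop stop words
        if tokens:
            sentences.append(tokens)
    return sentences

stop_words = {"için", "de", "da", "bu", "ve"}   # excerpt of the TF-IDF derived list (Table 3)
print(preprocess("Bu ürün için teşekkürler. Kargo 2 günde geldi.", stop_words))
# [['ürün', 'teşekkürler'], ['kargo', 'günde', 'geldi']]
```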
The second section of the corpus prepared specifically for the study, Dataset1, was used in the creation of the list of stop words. The reason for this was that this section of the corpus is smaller and contained no writing errors. The TF-IDF method was used in the creation of the stop word list [69]. The term frequency (TF) was calculated as shown in Eq. (1), where n_t is the number of occurrences of the word t in the text and N is the total number of words in the text.

TF(t) = n_t / N    (1)

The IDF value was calculated using Eq. (2), where D is the total number of documents and d_t is the number of documents containing the word t.

IDF(t) = log(D / d_t)    (2)

For the creation of the TF-IDF weight matrix, Eq. (3) was used.

Weight(t) = TF(t) × IDF(t)    (3)

When Table 3 is examined, the frequencies of the most common words within the corpus can be seen. It can also be seen how close the weight values are to 0 when the TF-IDF values are calculated (as in Table 3), and therefore how common those words are within the corpus. This reveals that some words within the corpus are not deemed to be specific in their usage. Therefore, a total of 250 words with a TF-IDF score between 0 and 0.1 were specified as stop words.
Table 3 TF-IDF weight matrix for Turkish stop words

WORD   | TF                     | IDF                          | WEIGHT
'için' | 126215/4742527 = 0.026 | log(387179/126215) = 0.486   | TF*IDF = 0.012
'de'   | 98689/4742527 = 0.020  | log(387179/98689) = 0.593    | TF*IDF = 0.011
'da'   | 68652/4742527 = 0.014  | log(387179/68652) = 0.751    | TF*IDF = 0.010
'bu'   | 44037/4742527 = 0.009  | log(387179/44037) = 0.944    | TF*IDF = 0.008
've'   | 26249/4742527 = 0.005  | log(387179/26249) = 1.168    | TF*IDF = 0.005
As previously specified, there are approximately 10.5 billion token words within the corpus prepared for the current study. A list of stop words was created and implemented, and 848,908,720 words contained within the developed Turkish stop word list were subsequently stripped out (cleaned) from the corpus prepared for the model. The prepared list of stop words will be shared publicly online, along with the other files created within the current study, so as to offer a contribution to the Turkish text processing literature.
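A sketch of the described stop word selection using Eqs. (1)-(3) is shown below; the exact selection procedure (scanning the most frequent words first and keeping at most 250) is an assumption consistent with the description above:

```python
import math
from collections import Counter

def tfidf_stop_words(documents, threshold=0.1, max_words=250):
    """Return the most frequent words whose corpus-level TF-IDF weight is below the threshold.

    documents: list of tokenized documents (lists of words).
    """
    all_tokens = [t for doc in documents for t in doc]
    total_words = len(all_tokens)                            # N in Eq. (1)
    term_counts = Counter(all_tokens)                        # n_t per word
    doc_counts = Counter(t for doc in documents for t in set(doc))   # d_t per word
    n_docs = len(documents)                                  # D in Eq. (2)

    stop_words = []
    for word, count in term_counts.most_common():            # most frequent words first
        tf = count / total_words                             # Eq. (1)
        idf = math.log10(n_docs / doc_counts[word])          # Eq. (2), log base 10 as in Table 3
        if tf * idf < threshold:                             # Eq. (3): weight close to 0
            stop_words.append(word)
        if len(stop_words) == max_words:
            break
    return stop_words
```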
Pre-trained Word Vectors (PWVs)

The form of numerical embedding used for the words is perhaps the most critical point of the text processing problem. The two most popular word embedding methods today are the Word2Vec [22] and Glove [24] methods, which embed the words within a corpus as multidimensional vectors. The current study converted words into 100-dimensional vectors using these two methods on Dataset1, which was created specifically to perform word embedding. These vectors were then provided as input to the deep neural network architectures with the goal of improving model performance, and their effects were studied.
Fig. 5. Turkish word vectors in Dataset1
When studying Fig. 5, the word vectors most similar to the word “Turkish” are listed and shown two-dimensionally. It can be seen that a successful vectorization process was performed, because all of the listed words are related to the word “Turkish”.

Out-of-Vocabulary (OOV)
The purpose of this method is to detect words that do not exist within the vocabulary. As mentioned in the Materials section, Dataset1 (used for word embedding training) is the largest corpus ever created in Turkish language studies and was found to contain no spelling errors. Therefore, it is accepted that the words excluded from the vocabulary are either incorrect or misspelled. For the detection of out-of-vocabulary words, the following method was proposed. After the words in the dataset on which the classification process will be performed are converted to tokens, the numerical values of the words, in the form of vectors, are taken from the vocabulary and transferred by the transfer learning method. The critical point at this stage is that the values of the vectors are initially set to 0 (zero). The vectors of the words found in the vocabulary are then overwritten with their learned values, while the words that are not in the vocabulary remain as zero (0). In other words, when the process of converting words into vectors is complete, the vectors that are made up entirely of zeros denote words that are considered out-of-vocabulary. Fig. 6 illustrates the general architecture of the OOV method.
Fig. 6. Overview of OOV method
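A minimal sketch of this zero-initialized matrix construction is given below; the word_index and word_vectors objects (e.g., a Keras tokenizer index and a gensim KeyedVectors) are assumed inputs:

```python
import numpy as np

def build_embedding_matrix(word_index, word_vectors, dim=100):
    """Create a zero-initialized matrix and fill in rows for in-vocabulary words only.

    word_index   : dict mapping token -> integer id (e.g. from a Keras Tokenizer).
    word_vectors : mapping token -> 100-dimensional vector (e.g. gensim KeyedVectors).
    Rows that stay all-zero after the loop correspond to out-of-vocabulary words.
    """
    matrix = np.zeros((len(word_index) + 1, dim))      # index 0 reserved for padding
    for word, idx in word_index.items():
        if word in word_vectors:                       # in-vocabulary: overwrite the zeros
            matrix[idx] = word_vectors[word]
    return matrix

def is_oov(matrix, idx):
    """A word is treated as out-of-vocabulary if its vector is still all zeros."""
    return not matrix[idx].any()
```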
CNN
An illustration of the CNN model used in the current study is shown in Fig. 7. The weight values were first loaded into the model via the “embedding layer”, and two convolutional operations were then applied. There were a total of 64 neurons (filters) placed on the first convolutional layer and 64 on the second. On the output layer, the final layer, there are 10 neurons, corresponding to the number of classes in the dataset. While the ReLU activation function was used within the model, the softmax function was used in the output layer, and “categorical_crossentropy” was chosen as the loss function.
Fig. 7. Structure of CNN model
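The following is a sketch of a CNN configuration consistent with this description, written against the Keras API; the kernel sizes, pooling choices, and vocabulary size are illustrative assumptions, and loading the PWVs through a constant initializer is one possible way of applying them:

```python
import numpy as np
from tensorflow.keras import Sequential, initializers
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dense

# embedding_matrix: the PWV matrix built as in the OOV sketch above
# (a random placeholder is used here so the sketch is self-contained).
vocab_size, embedding_dim, num_classes = 50000, 100, 10
embedding_matrix = np.random.rand(vocab_size, embedding_dim)

model = Sequential([
    Embedding(vocab_size, embedding_dim,
              embeddings_initializer=initializers.Constant(embedding_matrix)),
    Conv1D(64, 5, activation="relu"),    # first convolutional layer, 64 filters (kernel size assumed)
    MaxPooling1D(2),
    Conv1D(64, 5, activation="relu"),    # second convolutional layer, 64 filters
    GlobalMaxPooling1D(),
    Dense(num_classes, activation="softmax"),   # 10 output neurons, one per class
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```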
When studying the success of the model, the validation accuracy values at the end of the 10 epochs can be seen in Fig. 8. The CNN model was first trained without the PWVs, and a success value was obtained. Later, the effect of the PWVs on the success of the model, one of the important goals of the current study, was analyzed. When considering the results, it was seen that model success improved when using the PWVs. The most successful performance was achieved with the Word2Vec method and the CBOW algorithm.
Fig. 8. Validation accuracy graph of CNN model
When considering the test data, the model developed with word vectors trained using the SG algorithm of the Word2Vec method exhibited the second-best performance. It was seen that word vectors trained with the Glove method performed worse than those of the Word2Vec methods. However, it should be noted that regardless of the method, the use of PWVs improved the model’s success.
LSTM & BLSTM The structure of the LSTM model used in the current study was developed as shown in Fig. 9. A total of 256 neurons were used in the developed model. Dropout was added after the LSTM layer in order to prevent overfitting [70]. The added dropout value was implemented as 0.2. Softmax function was used in the calculation of the output value. In the BLSTM structure, differently from the LSTM model, the LSTM cells were trained in a multifaceted manner.
Fig. 9. Structure of LSTM & BLSTM model
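Under the same assumptions, the LSTM/BLSTM configuration described above can be sketched as follows; the Bidirectional wrapper is one way of obtaining the two-directional training described for the BLSTM:

```python
from tensorflow.keras import Sequential, initializers
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, Dropout, Dense

def build_lstm_model(embedding_matrix, num_classes=10, bidirectional=False):
    """LSTM with 256 units, 0.2 dropout and a softmax output, optionally wrapped bidirectionally."""
    recurrent = LSTM(256)
    if bidirectional:
        recurrent = Bidirectional(recurrent)   # BLSTM: the sequence is processed in both directions
    model = Sequential([
        Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                  embeddings_initializer=initializers.Constant(embedding_matrix)),
        recurrent,
        Dropout(0.2),                          # dropout value used in the paper
        Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```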
When studying the results shown in Fig. 10 and considering the test data, it can be seen that the BLSTM model was more successful than the LSTM model. It was also seen that, with the use of PWVs, the success of both the LSTM model and the BLSTM structure improved. The Word2Vec method again demonstrated more successful results than the Glove method for these models. Although the CBOW and SG algorithms provided results that were relatively close to each other, the CBOW was again shown to be more successful.
Fig. 10. Validation accuracy of LSTM & BLSTM models
GRU

A three-layered structure was used in the GRU model employed in the current study. There were a total of 512 neurons placed on the first layer and 64 neurons placed on the second layer. On the output layer, the final layer, there are 10 neurons, corresponding to the number of classes in the dataset. A dropout layer was added between the first two layers in order to prevent overfitting [70]. Fig. 11 illustrates the architecture of the GRU model.
Fig. 11. Structure of GRU model
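A sketch of this stacked GRU configuration, with the embedding layer set up as in the earlier examples and the dropout rate assumed, could look as follows:

```python
from tensorflow.keras import Sequential, initializers
from tensorflow.keras.layers import Embedding, GRU, Dropout, Dense

def build_gru_model(embedding_matrix, num_classes=10):
    """Three-layered GRU model: 512-unit GRU, 64-unit GRU, and a 10-neuron softmax output."""
    model = Sequential([
        Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                  embeddings_initializer=initializers.Constant(embedding_matrix)),
        GRU(512, return_sequences=True),   # first layer, 512 neurons
        Dropout(0.2),                      # dropout between the first two layers (rate assumed)
        GRU(64),                           # second layer, 64 neurons
        Dense(num_classes, activation="softmax"),   # output layer
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```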
Fig. 12. Validation accuracy graph of GRU model
When studying Fig. 12, it can be seen that the success of the GRU model was significantly improved when the previously trained word vectors were added to the model. The model’s success after loading the vectors trained with the CBOW and SG methods was similar, but the values from the CBOW method were the more successful.

RNN

The Recurrent Neural Network (RNN) architecture was designed to be three-layered. There were a total of 256 neurons placed on the first layer and 64 neurons placed on the second layer. There were 10 neurons on the final layer as the output layer. Dropout was used before the final layer in order to prevent overfitting [70]. The output values were calculated in the final layer using the softmax function. Fig. 13 illustrates the structure of the RNN model.
Fig. 13. Structure of RNN model
Fig. 17. Validation accuracy graph of RNN model
When studying Fig. 13, it can be seen that the RNN architecture was not particularly successful. The pre-trained vectors’ effect in the RNN model was not as significant as that seen in the other models. However, it can be said that the CBOW algorithm improved the success, albeit at a low rate.
CNN-LSTM

The study also created an architecture in which the CNN and LSTM structures, two of the previously used deep neural network models, were applied together. After the PWVs were given to the network as input, the convolution and max pooling steps were applied. After completion of this process, the LSTM layer was added, and a dropout layer was also included in order to prevent overfitting. There were 10 neurons on the final layer as the output layer [71,72]. The architecture of the model is illustrated in Fig. 14.
Fig. 14. Structure of CNN-LSTM model
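Under the same assumptions, the hybrid CNN-LSTM configuration can be sketched as follows (the filter, kernel, and LSTM unit counts are not given in the paper and are placeholders):

```python
from tensorflow.keras import Sequential, initializers
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dropout, Dense

def build_cnn_lstm_model(embedding_matrix, num_classes=10):
    """PWV embedding -> convolution -> max pooling -> LSTM -> dropout -> softmax output."""
    model = Sequential([
        Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                  embeddings_initializer=initializers.Constant(embedding_matrix)),
        Conv1D(64, 5, activation="relu"),   # convolution step (filter count and kernel size assumed)
        MaxPooling1D(2),                    # max pooling step
        LSTM(128),                          # LSTM layer (unit count assumed)
        Dropout(0.2),                       # dropout against overfitting (rate assumed)
        Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```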
When studying the test results, it can be seen that the success of the CNN-LSTM model improved significantly with the use of PWVs. Although the SG model demonstrated more successful results in the early epochs, the CBOW model was shown to be more successful in the later epochs. Fig. 15 illustrates the validation accuracy in graphical format.
Fig. 15. Validation accuracy graph of CNN-LSTM model
6. Results and Discussion
Preprocessing was applied first on the two Turkish datasets created in the scope of the current study. Using the TF-IDF method, a list of stop words unique to the Turkish language was created. Next, the stop words were eliminated from the two datasets by filtering out the words on the created list. The data was thus prepared for processing after completion of the preprocessing. The word vectors were trained on Dataset1, and text classification was applied with deep neural network models on Dataset2.

The effect of PWVs on the success of the models was studied in the research, and word vectors were created using the CBOW, SG, and Glove word embedding methods. In addition, nearly all of the deep neural networks were trained on the Turkish dataset, and their test results were then evaluated with the early stopping technique based on the “validation accuracy” and “validation loss” graphics. The Adam optimizer [73] was used as the optimizer, categorical_crossentropy was used as the loss function, a value of 64 was used as the batch_size, and 150 was used as the sentence_length in each of the experimental studies. Reviewing the results, the model that exhibited the most successful performance was the GRU model, while the BLSTM model took second place, followed by the LSTM and CNN-LSTM models, which each exhibited similar levels of performance (see Fig. 16). This performance is notable given the data pollution, writing errors, and incorrect labeling present in Dataset2, which contains real data collected from actual customers. Word2Vec and CBOW were used for the PWVs as they performed better than the other methods.
Fig. 16. Validation accuracy and loss graph of models with PWV
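The shared training configuration described above (Adam, categorical_crossentropy, batch size 64, sentence length 150, early stopping on the validation metrics) can be sketched as follows; the tokenized inputs, one-hot labels, and the patience value are assumptions:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping

SENTENCE_LENGTH = 150   # sentence_length value reported in the paper

def train(model, train_token_ids, y_train, val_token_ids, y_val):
    """Train a compiled model with the shared configuration described above.

    train_token_ids / val_token_ids: lists of token-id sequences from a tokenizer (assumed).
    y_train / y_val: one-hot encoded labels for the 10 classes (assumed).
    """
    X_train = pad_sequences(train_token_ids, maxlen=SENTENCE_LENGTH)   # pad/truncate to 150 tokens
    X_val = pad_sequences(val_token_ids, maxlen=SENTENCE_LENGTH)
    early_stop = EarlyStopping(monitor="val_loss", patience=2,         # patience value assumed
                               restore_best_weights=True)
    return model.fit(X_train, y_train,
                     validation_data=(X_val, y_val),
                     batch_size=64,                                    # batch size from the paper
                     epochs=10,                                        # epoch count as in Fig. 8
                     callbacks=[early_stop])
```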
In the experimental studies, it was seen that increasing the number of layers in the RNN-based models (GRU, LSTM, and simple RNN) was less beneficial than increasing the number of neurons. The bidirectional structures were more successful compared with the plain RNN and LSTM structures. The ROC curves [74,75] of each of the models used in the current study are shown in Fig. 17, where the GRU model again exhibited the best performance.
Fig. 17. ROC curves (true positive rate vs. false positive rate) of models with PWV
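Per-class ROC curves such as those in Fig. 17 can be computed in a one-vs-rest fashion with scikit-learn, as in the following sketch; y_true and y_prob stand for the integer test labels and the softmax outputs of a trained model:

```python
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc

def per_class_roc(y_true, y_prob, n_classes=10):
    """Compute one-vs-rest ROC curves and AUC values for a multiclass classifier.

    y_true: integer class labels of the test samples (assumed).
    y_prob: (n_samples, n_classes) softmax outputs of a trained model (assumed).
    """
    y_bin = label_binarize(y_true, classes=np.arange(n_classes))   # one-vs-rest targets
    curves = {}
    for c in range(n_classes):
        fpr, tpr, _ = roc_curve(y_bin[:, c], y_prob[:, c])
        curves[c] = (fpr, tpr, auc(fpr, tpr))
    # Micro-averaged curve over all classes as a single summary per model
    fpr_micro, tpr_micro, _ = roc_curve(y_bin.ravel(), y_prob.ravel())
    curves["micro"] = (fpr_micro, tpr_micro, auc(fpr_micro, tpr_micro))
    return curves
```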
When studying the performances of the PWVs, it can clearly be seen that the Word2Vec model was more successful than the Glove model. When studying the success of each of the deep neural network models, the networks trained with vectors from the Word2Vec model were more successful in all cases. The CBOW algorithm within the Word2Vec method demonstrated better performance compared to the SG algorithm, even though the results were close for some models. This can be interpreted as the CBOW model performing better because the corpus (Dataset1) used in the study was significantly large; with a smaller corpus, the SG algorithm would probably have been more successful. When the CBOW and SG algorithms are compared, the analysis shows that CBOW is more successful in the representation of common words in a corpus, while the SG algorithm is more successful in the representation of rare words. Therefore, the validation accuracy values for the PWVs in Table 4 are those obtained with the CBOW algorithm. Based on the epoch numbers for the models, the test results from both before and after the PWVs were used are shown in Table 4, together with information regarding the extent to which the success of each model improved.

Table 4 Performance improvements for all models with pre-trained word vectors (PWV)

Model    | Validation Accuracy (without PWV) | Validation Accuracy (with PWV) | Performance Improvement | Time of Training (s)
GRU      | 78.85% | 85.82% | 6.97% | 32,580
BLSTM    | 78.55% | 83.80% | 5.25% | 36,400
LSTM     | 77.27% | 82.79% | 5.52% | 24,890
CNN-LSTM | 76.10% | 79.80% | 3.70% | 26,100
CNN      | 75.69% | 77.03% | 3.34% | 7,810
RNN      | 68.66% | 71.10% | 2.44% | 13,790
As can be seen in Table 4, the GRU model exhibited the best performance and was therefore also the model whose success value improved the most using PWVs. Considering the performance improvement column alone, it can be seen that model performance improved significantly. It was most clearly seen at the end of the study that the networks trained using PWVs were the more successful, and that their success values had improved. It is therefore considered beneficial to use PWVs created from different common and large datasets in natural language processing studies. Finally, the training times for the models are also given in Table 4. When reviewing these values, despite the GRU and BLSTM models having the highest success values, it can be seen that they each underwent a slower training process than the other models. As BLSTM is structurally a type of LSTM model trained in both directions, its training period is therefore longer (see Table 4). The fastest training process was performed using the CNN model, followed by the RNN model.
Table 5 Classification performance with PWV

Model    | Accuracy | Precision | Recall | F-Score
GRU      | .85      | .84       | .80    | .81
BLSTM    | .83      | .84       | .74    | .78
LSTM     | .82      | .83       | .74    | .78
CNN-LSTM | .79      | .71       | .80    | .75
CNN      | .77      | .68       | .80    | .73
RNN      | .71      | .61       | .75    | .67
Table 5 shows the results of classification performance with PWVs for each model, in descending order of accuracy. The best results obtained in the current study were achieved by the GRU model, with an accuracy of 0.85, a precision of 0.84, and a recall of 0.80.
Fig. 18. Validation accuracy results (%) of the RNN, CNN, CNN-LSTM, LSTM, BLSTM, and GRU models for different learning rate values (0.001, 0.005, 0.01, 0.05)
It was observed during the experimental work that one of the most important deep learning model parameters affecting the accuracy value was the learning rate [76]. Various learning rate values were tested for each of the deep neural network models (see Fig. 18). The optimum tuning of the learning rate parameter significantly affects model performance. In the current study’s experiments, the most appropriate learning rate value was found to be 0.005.

7. Conclusion and Future Works
With the emergence of word embedding, significant improvements have been seen in the field of natural language processing. In the current study, the largest Turkish word embedding effort to date was applied to a Turkish language corpus, which was then used for the purposes of text classification. Deep neural networks were preferred in the classification phase considering the size of the data. Deep learning is considered an important field of research that has recently seen an increase in its level of popularity. Many different methods based on deep neural networks have been recommended in the deep learning literature, and some of these have provided quite successful results, particularly in certain fields of research; the field of natural language processing is one such important research field.

In the current study, two Turkish datasets were created. The first dataset was a very large corpus that contained 528 million sentences, 22 million lines, and a total of 11 billion words of unlabeled data gathered from Turkish Wikipedia articles. Due to its size, the dataset required approximately 60 GB of storage. The second dataset was a multiclass Turkish dataset consisting of 1.5 million records and 10 classes, and was produced for the purposes of text classification. On the large corpus (Dataset1), the word vectors were trained using the Word2Vec method. The word vectors were then transferred by transfer learning, and text classification was applied on Dataset2. With this method, the existing accuracy value was improved upon in the range of approximately 5-7%. In addition, as a result of the current study’s experiments, deep neural network architectures were compared, and the most successful deep neural network structure was found to be the GRU model. When studying the effects of PWVs on the deep neural network models, it was determined that the success of the models was improved by using PWVs. When considering the improvement rates on model performance, the current study showed that the most successful word embedding method was the CBOW algorithm of the Word2Vec model.

As a result, the effect of PWVs on Turkish natural language learning using deep neural networks was studied within the scope of the current research, and the results were analyzed in detail. The findings and identifications acquired from this study are expected to aid future studies on deep learning in the fields of Turkish text processing and natural language processing, and the word vectors and datasets created in this study will contribute to the current literature on Turkish text processing.
In future studies, the researchers are planning to extend the corpora and word embeddings used in the current study. Additionally, the researchers plan to develop spelling correction functionality that uses word embeddings, out-of-vocabulary detection, and auto-encoders for the purposes of improving data quality.

References
[1] M. Jiang, Y. Liang, X. Feng, X. Fan, Z. Pei, Y. Xue, R. Guan, Text classification based on deep belief network and softmax regression. Neural Comput. Appl, (2018) (29) 61–70. [2] C.C. Aggarwal, C. Zhai, A survey of text classification algorithms. In Mining Text Data; Springer: Berlin/Heidelberg, Germany, (2012) pp. 163–222. [3] M. Kaytan, D. Hanbay, Effective Classification of Phishing Web Pages Based on New Rules by Using Extreme Learning Machines. Anatolian Science - Bilgisayar Bilimleri Dergisi, (2017) 2 (1) 15-36. [4] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuglu, P. Kuksa, Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research (2011) 12:2493–2537. [5] K. Arkhipenko, I. Kozlov, J. Trofimovich, K. Skorniakov, A. Gomzin, and D. Turdakov, Comparison of neural network architectures for sentiment analysis of Russian tweets,(2016) In Proceedings of Dialogue 2016. [6] D. Küçük, N. Arıcı, Doğal dil işlemede derin öğrenme uygulamaları üzerine bir literatür çalışması. International Journal of Management Information Systems and Computer Science, (2018) 2(2):76-86. [7] M. Ayyüce Kizrak, B. Bolat, Derin Öğrenme ile Kalabalık Analizi Üzerine Detaylı Bir Araştırma. Bilişim Teknolojileri Dergisi, (2018) 11 (3), 263-286. DOI: 10.17671/gazibtd.419205. [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative Adversarial Nets, Advances in Neural Information Processing Systems, 2672-2680 (2014). [9] A. Şeker, B. Diri, H. H. Balık, Derin Öğrenme Yöntemleri ve Uygulamaları Hakkında Bir İnceleme, Gazi Mühendislik Bilimleri Dergisi, 3(3), 47-64 (2017). [10] S. Yildirim, T. Yildiz, A comparative analysis of text classification for Turkish language. Pamukkale University Journal Of Engineering Sciences-Pamukkale Universitesi Muhendislik Bilimleri Dergisi, (2018) 24(5) 879-886. [11] D. Kılınç, A. Özçift, F. Bozyigit, P. Yıldırım, F. Yücalar, E. Borandag, TTC-3600: A new benchmark dataset for Turkish text categorization. Journal of Information Science, 43(2), 174-185, (2015). [12] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324 (1998). [13] R. Collobert, J. Weston, A unified architecture for natural language processing: Deep neural networks with multitask learning. International Conference on Machine Learning (ICML), (2008) 160-167. [14] Y. Kim, Convolutional neural networks for sentence classification, (2014) arXiv preprint arXiv:1408.5882. [15] X. Zhang, J. Zhao, Y. Lecun, Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, (2015) 649-657. [16] S. Lai, L. Xu, K. Liu, J. Zhao, Recurrent convolutional neural networks for text classification. AAAI Conference on Artificial Intelligence, (2015) 2267-2273. [17] R. J. Williams, D. Zipser, A learning algorithm for continually running fully recurrent neural networks. Neural Computation, (1989) 1(2), 270-280. [18] D. Ravì, C. Wong, F. Deligianni, M. Berthelot, J. Andreu-Perez, B. Lo, G. Z. Yang, Deep learning for health informatics. IEEE journal of Biomedical and Health Informatics, (2017) 21(1), 4-21. [19] S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Computation, (1997) 9(8) 1735-1780. [20] Y. Cho, L. K. Saul, Kernel methods for deep learning. Advances in Neural Information Processing Systems, (2009) 342-350. [21] J. Y. Lee, F. 
Dernoncourt, Sequential short-text classification with recurrent and convolutional neural networks, (2016) arXiv preprint arXiv:1603.03827. [22] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space. Proceedings of Workshop at ICLR. Scottsdale, (2013). [23] Q. Le, T. Mikolov, Distributed representations of sentences and documents. 31st International Conference on Machine Learning, China, 2014. [24] J. Pennington, R. Socher, C. Manning, Glove: Global vectors for word representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Qatar, 2014. [25] F. Rosenblatt, The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, (1958) 65(6), 386.
[26] L. Deng, D. Yu, Deep learning: methods and applications. Found Trends Signal Process, (2014) 7(3-4):197387. [27] Y. Qi, S. G. Das, R. Collobert, J. Weston, Deep learning for character-based information extraction. European Conference on Information Retrieval, (2014) 668-674. [28] R. Socher, A. Perelygin, J. Wu, Recurrent deep models for semantic compositionality over a sentiment treebank. Paper presented at: Conference on Empirical Methods in Natural Language Processing, (2013) Seattle. [29] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classification, (2016) arXiv preprint arXiv:1607.01759. [30] R. Socher, Y. Bengio, C. D. Manning, Deep learning for NLP (without magic), Tutorial Abstracts of ACL, (2012) 5. [31] W. Yin, K. Kann, M. Yu, H. Schütze, Comparative study of cnn and rnn for natural language processing, (2017) arXiv preprint arXiv:1702.01923. [32] M. Wang, Z. Lu, H. Li, W. Jiang, and Q. Liu, A Convolutional Architecture for Word Sequence Prediction, Acl-2015, pp. 1567.1576, (2015). [33] N. Kalchbrenner, E. Grefenstette, P. Blunsom, A Convolutional Neural Network for Modelling Sentences, (2014) In Proceedings of ACL 2014. [34] C. N. dos Santos, B. Xiang, and B. Zhou, Classifying Relations by Ranking with Convolutional Neural Networks, (2015) arXiv preprint arXiv: 1504 06580. [35] Z. Bitvai, T. Cohn, Non-Linear Text Regression with a Deep Convolutional Neural Network, In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (2015) (Volume 2: Short Papers) (Vol. 2, pp. 180-185). [36] M. M. Lopez, J. Kalita, Deep Learning applied to NLP, (2017) arXiv preprint arXiv:1703.03091. [37] A. Severyn, A. Moschitti, Modeling Relational Information in Question-Answer Pairs with Convolutional Neural Networks, Arxiv, (2016). [38] C. N. Dos Santos, M. Gatti, Deep convolutional neural networks for sentiment analysis of short texts. International Conference on Computational Linguistics (COLING), (2014) 69-78. [39] A. Conneau, H. Schwenk, L. Barrault, Y. Lecun, Very deep convolutional networks for natural language processing. arXiv preprint arXiv:1606.01781 (2016). [40] R. Johnson, T. Zhang, Effective use of word order for text categorization with convolutional neural networks, (2014) arXiv preprint arXiv:1412.1058. [41] N. T. Vu, H. Adel, P. Gupta, H. Schutze, Combining Recurrent and Convolutional Neural Networks for Relation Classification, (2016) arXiv preprint arXiv:1605.07333. [42] P. Liu, X. Qiu, X. Huang, Recurrent neural network for text classification with multi-task learning, (2016) arXiv preprint arXiv:1605.05101. [43] I. Sutskever, J. Martens, G. E. Hinton, Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), (2011) (pp. 1017-1024). [44] A. Graves, Generating sequences with recurrent neural networks, (2013) arXiv preprint arXiv:1308.0850. [45] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, S. Khudanpur, Recurrent neural network based language model, In Eleventh annual conference of the international speech communication association, (2010). [46] Z. C. Lipton, J. Berkowitz, C. Elkan, A critical review of recurrent neural networks for sequence learning, (2015) arXiv preprint arXiv:1506.00019. [47] L. Arras, G. Montavon, K. R. Müller, W. Samek, Explaining recurrent neural network predictions in sentiment analysis, (2017) arXiv preprint arXiv:1706.07206. [48] A. 
Graves, A. R. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, (2013) (pp. 6645-6649) IEEE. [49] S. Hochreiter, The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, (1998) 6(02), 107-116. [50] M. Tan, C. D. Santos, B. Xiang, B. Zhou, LSTM-based deep learning models for non-factoid answer selection, (2015) arXiv preprint arXiv:1511.04108. [51] K. S. Tai, R. Socher, C. D. Manning, Improved semantic representations from tree-structured long short-term memory networks, (2015) arXiv preprint arXiv:1503.00075. [52] M. Sundermeyer, R. Schlüter, H. Ney, LSTM neural networks for language modeling. In Thirteenth annual conference of the international speech communication association, (2012). [53] Y. Wang, M. Huang, L. Zhao, Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 conference on empirical methods in natural language processing, (2016) (pp. 606-615).
[54] C. Zhou, C. Sun, Z. Liu, F. Lau, A C-LSTM neural network for text classification, (2015) arXiv Preprint. arXiv1511.08630. [55] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, (2014) arXiv preprint arXivdean:1412.3555. [56] D. Tang, B. Qin, T. Liu, Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of EMNLP, (2015) 1422–1432. [57] Z. Wu, S. King, Investigating gated recurrent networks for speech synthesis, In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (2016) (pp. 5140-5144). IEEE. [58] M. Schuster, K. K. Paliwal, Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, (1997) 45(11) 2673-2681. [59] M. Sundermeyer, T. Alkhouli, J. Wuebker, H. Ney, Translation modeling with bidirectional recurrent neural networks, In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014) (pp. 14-25). [60] https://www.tensorflow.org/tutorials/Word2Vec (accessed 20 April 2019) [61] B. Plank, A. Søgaard, Y. Goldberg, Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss, (2016) arXiv preprint arXiv:1604.05529. [62] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging, (2015) arXiv preprint arXiv:1508.01991. [63] T. Mikolov, I. Sutskever, K. C. 0010, G. Corrado, J. Dean, Distributed Representations of Words and Phrases and their Compositionality, AAAI Spring Symposium AI Technologies for Homeland Security 200591-98, vol. cs.CL, pp. 3111–3119, (2013). [64] T. Luong, R. Socher, C. D. Manning, Better word representations with recurrent neural networks for morphology, in CoNLL, (2013) pp. 104–113. [65] B. Ay Karakuş, M. Talo, İ. R. Hallaç, G. Aydin, Evaluating deep learning models for sentiment classification, Concurrency and Computation: Practice and Experience, (2018) 30(21), e4783. [66] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, (2016) arXiv preprint arXiv:1607.04606 [67] L. Richardson, Beautiful Soup Documentation (2017). [68] F. Chollet, Keras, https://keras.io/,(accessed 20 April 2019) [69] S. Jones, Karen, A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, (1972) 28(1) 11-21. [70] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, JMach Learn Res. 2014, (2014) 15(1):1929-1958. [71] J. Wang, L. C. Yu, K. R. Lai, X. Zhang, 2016. Dimensional sentiment analysis using a regional CNN-LSTM model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (Vol. 2, pp. 225-230). [72] S. Vosoughi, P. Vijayaraghavan, D. Roy, Tweet2vec: Learning tweet embeddings using character-level cnnlstm encoder-decoder, In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, (2016) (pp. 1041-1044). [73] D. P. Kingma, J. Ba, Adam: a method for stochastic optimization, (2014). arXiv Prepr. arXiv1412.6980. [74] A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn. (1997) 30 1145–1159. [75] D. J. Hand, R. J. Till, A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach. Learn, (2001) 45 171–186. [76] J. 
Schmidhuber, Deep learning in neural networks: An overview. Neural networks, (2015) 61, 85-117.