Ontology-based automatic identification of public health-related Turkish tweets

Computers in Biology and Medicine 83 (2017) 1–9 Contents lists available at ScienceDirect Computers in Biology and Medicine journal homepage: www.el...


Computers in Biology and Medicine 83 (2017) 1–9

Contents lists available at ScienceDirect

Computers in Biology and Medicine journal homepage: www.elsevier.com/locate/compbiomed

Ontology-based automatic identiﬁcation of public health-related Turkish tweets

Emine Ela Küçük a, Kürşad Yapar b, Dilek Küçük c,⁎, Doğan Küçük d

a Department of Public Health, Faculty of Health Sciences, Giresun University, Giresun, Turkey
b Department of Medical Pharmacology, Faculty of Medicine, Giresun University, Giresun, Turkey
c Electrical Power Technologies Group, TÜBİTAK Energy Institute, Ankara, Turkey
d Department of Computer Engineering, Gazi University, Ankara, Turkey

ARTICLE INFO

Keywords: Public health; Twitter; Social media analysis; Health informatics; Automatic text processing

ABSTRACT

Social media analysis, such as the analysis of tweets, is a promising research topic for tracking public health concerns including epidemics. In this paper, we present an ontology-based approach to automatically identify public health-related Turkish tweets. The system is based on a public health ontology that we have constructed through a semi-automated procedure. As the last stage of ontology development, the ontology concepts are expanded through a linguistically motivated relaxation scheme to increase coverage before being integrated into our system. The ultimate lexical resource, which includes the terms corresponding to the ontology concepts, is used to filter the Twitter stream so that a plausible tweet subset, including mostly public health-related tweets, can be obtained. Experiments are carried out on two million genuine tweets and promising precision rates are obtained. A Web-based interface is also implemented within the course of the current study so that the related public health staff can track the results of this identification system. Hence, the current social media analysis study has both technical and practical contributions to the significant domain of public health.

1. Introduction

With the advent of social media tools like Twitter, Facebook, and Instagram, and their widespread daily use, social media analysis research has been boosted for different purposes. These purposes include trend analysis, opinion mining (sentiment analysis) for brands/products, and tracking epidemics/diseases, to name a few. Similar to the last item, in this paper we focus on the use of social media for automatic tracking of public health-related tweets in order to help public health experts determine the current public health concerns in a timely manner. In a related review paper [1], it is emphasized that Twitter offers quite significant opportunities for monitoring public health compared to the traditional mechanisms employed in the discipline. For instance, instead of the manual, slow, and time-consuming data collection methods that the conventional systems employ, real-time monitoring and statistics on public health can be achieved with reduced costs through the use of natural language processing and machine learning techniques on Twitter and other social media [1]. Furthermore, public health events such as disease outbreaks, epidemics, and health threats can be localized through the related Twitter facilities [1].

⁎ Corresponding author. E-mail addresses: [email protected] (E.E. Küçük), [email protected] (K. Yapar), [email protected] (D. Küçük), [email protected] (D. Küçük). http://dx.doi.org/10.1016/j.compbiomed.2017.02.001 Received 1 July 2016; Received in revised form 1 February 2017; Accepted 3 February 2017. 0010-4825/© 2017 Elsevier Ltd. All rights reserved.

Accordingly, related work on tweet analysis for disease or epidemic/pandemic surveillance includes [2], where a number of regression models are applied in order to identify influenza-related tweets and several conclusions are drawn after the comparison of these models. In [3], an SVM classifier is employed to detect health-related tweets where the training data of the classifier includes 5128 labeled tweets. After the application of this classifier to a set of about 11.7 million tweets, the authors produced a set of 1.63 million tweets related to public health, with high precision [3]. In [4], tweets are examined to track disease activity and public sentiment towards influenza during the Influenza A H1N1 pandemic in the US in 2009. As a result of their analysis, the authors point out that Twitter traffic can be used to estimate disease activity in real time and that tweets can be used to measure disease-related public concern [4]. In another similar work published in the same year, tweets are analyzed to discover mentions of ailments, track illnesses over time, localize illnesses, and analyze medication use and symptoms, among others [5]. The authors conclude that Twitter is a very promising application platform for public health research [5]. In [6], the authors analyze tweets for the purposes of Dengue (an infectious disease) surveillance. It is shown that spatial and temporal prediction of Dengue epidemics can be performed on Twitter [6]. Related tweets are also analyzed in a study to reveal statistics related to dental pain and actions taken against the pain; hence, it is also argued that Twitter is a fruitful platform to be utilized by dental professionals [7]. In another study, tweets are analyzed to track drug and alcohol use as a showcase to determine public health-related topics on Twitter by means of topic models [8]. Another study on influenza surveillance in social media attempts to distinguish between real influenza infection mentions and Twitter chatter not reporting actual infections [9]. It is concluded that the employed approach leads to promising results, with an accuracy of about 85% for the detection of the weekly change in the direction (increase or decrease) of influenza prevalence in the real influenza epidemic in 2012–2013 [9]. In [10], the authors point out that tweets on influenza may be reporting actual infections or may be stating awareness/fear. Hence, they propose a learning algorithm to classify tweets into these two classes and conclude that deep content analysis like theirs is necessary for influenza surveillance as well as other similar tasks on Twitter [10]. In [11], a minimally-supervised learning algorithm is developed to determine the everyday jargon used to express influenza-like illnesses, and this jargon is later used to form Twitter queries targeting tweets reporting these illnesses. High correlation rates are reported between their tweet trends and similar trends obtained by the traditional surveillance services [11]. An extension of this latter system to be applicable to other syndromes is presented in [12] and similar evaluation results are reported. In [13], a Naive Bayes classifier is trained for influenza-like illness detection in Portuguese tweets. The authors compare their results with the corresponding data from Influenzanet [14] and report high correlation rates [13].
More recently, in [15], the authors propose a machine learning approach which combines the data from social media, search, and conventional data sources to nowcast and forecast influenza activity. And in [16], an online tool called FluOutlook is presented which performs influenza forecasting by combining current and historical influenza data, social media, and different forecast models.

In this paper, we propose an ontology-based system for automatic tracking of public health-related tweets in Turkish.1 The main contributions of the proposed system are listed below:

• The system is based on a public health ontology covering significant concepts including disease names, related symptoms, and medication categories. This ontology is created semi-automatically within the course of the current study, and during the construction procedure the ontology concepts are expanded through linguistically-motivated expansion schemes so that a large list of terms related to public health is obtained. This term list distilled from the ontology is made publicly available for research purposes.
• The evaluations are performed on two randomly compiled tweet sets (of one million tweets each, collected during two distinct consecutive 20-day periods) without providing any search criterion except the language.
• We discuss the evaluation results and provide error analyses which can be utilized as a guide during further studies.
• To the best of our knowledge, this is the first study to analyze Turkish tweets for public health surveillance purposes. Most of the previous work is carried out on English tweets and only one study is on Portuguese tweets [13], but it is a significant research issue to carry out analysis experiments on social media content in other languages, or on multilingual content.
• As a proof-of-concept, a Web-based interface is implemented so that the related public health experts can use the interface for public health surveillance through Twitter.

The rest of the paper is organized as follows: In Section 2, the details of building the public health ontology from scratch through a semi-automated procedure are described. The ultimate system based on this ontology for public health surveillance is presented in Section 3 together with its overall evaluation results on the data sets and a brief description of the Web interface of the system. Section 4 includes more detailed evaluation results regarding the system performance. Discussions on these evaluation results and also on the results of an SVM-based experiment are provided in Section 5. Finally, Section 6 concludes the paper with a summary of main points and future research directions.

1 A preliminary version of this paper is presented in [17].

2. Semi-automatic construction of the Turkish public health ontology

Domain ontologies are important semantic information sources covering the concepts, relations, and rules within the domain under consideration [18]. Various significant domain ontologies have been proposed so far in the literature, such as the gene ontology [19], the bioinformatics ontology [20], and the protein ontology [21]. In order to be used in our ultimate Twitter tracking system, we have constructed a public health ontology in Turkish comprising significant public health concepts. Our approach is a semi-automated one including manual, automatic, and semi-automatic stages performed in a pipelined manner, as depicted in Fig. 1. The manual stages of the ontology development process are depicted with white boxes while the fully automated stages are shown with green boxes, to differentiate the two. The only semi-automatic stage, Stage 6, is depicted with a light-green box. Similar to the study described in [22], we have used Wikipedia as a semantic resource together with other Web resources to facilitate our ontology development process. The main motivation for building this ontology is to compile a wide-coverage list of public health terms to be used by our ultimate system. The details of the development stages of our ontology, given in Fig. 1, are provided below. The number of ontology terms present at the end of each stage is shown within boxes connected to the stages with dashed arrows.

• Stage 1: The authors of the study, some of whom are domain experts, have determined the main concepts of our public health ontology as GeneralPublicHealth, Disease, Symptom, and Medication.2 The subconcepts of some of these main concepts are manually determined, such as Epidemic, Hospital, Emergency, and Vaccination as the subconcepts of GeneralPublicHealth. Similarly, under the concept of Medication, generic medication types such as Antibiotic and Antihistamine are included.
• Stage 2: In this automatic stage, Wikipedia articles and other Web resources are automatically processed to determine the subconcepts of Disease and Symptom. Pages including lists of related concepts are especially considered, such as 〈https://tr.wikipedia.org/wiki/Hastal%C4%B1k_isimleri_listesi〉, a Wikipedia page which provides a list of disease names in Turkish. At this stage, we also include English resources as diseases may sometimes be expressed with their English names within tweets.
• Stage 3: The automatic extraction procedure might have extracted terms with writing errors, mainly due to character encoding problems, particularly during the extraction of terms having one of the six Turkish characters with diacritics (ç, ğ, ı, ö, ş, and ü). Another source of errors propagated from the previous stage is that community-created data sources like Wikipedia might include erroneous information which needs manual correction. These errors are corrected during this manual stage.

2 To improve the readability and comprehensibility of our paper, the ontology concepts are given in English. However, when there is a need for actual examples, actual concepts in Turkish will be utilized.


Fig. 1. The development process of the public health ontology in Turkish.

• Stage 4: Several relevant public health-related terms like felç olmak (meaning “to be paralyzed”) and uyuz olmak (meaning “to have scabies”) are commonly used metaphorically as idiomatic expressions in informal texts. For instance, felç olmak is used to mean “to stop functioning” in the sentence “Trafik felç oldu” (meaning “The traffic is stuck”). Similarly, in informal texts, uyuz olmak is used to express feelings of dislike for something/someone. Through a manual revision of the current version of the ontology, disease names which are highly likely to be used metaphorically are eliminated from the ontology by the authors.
• Stage 5: As noted above for Stage 3, the Turkish alphabet has six characters with diacritics (ç, ğ, ı, ö, ş, ü). It is reported in the literature that the corresponding characters (c, g, i, o, s, u) are often used instead of the ones with diacritics in informal texts [23]. In order to cover such forms of the public health terms as well, for each ontology concept having at least one character with a diacritic, another version of the same concept, in which all characters with diacritics are replaced with their corresponding characters given above, is automatically added to the ontology as a synonym of the existing concept. For instance, due to the existing concept şeker hastalığı (meaning “diabetes”), seker hastaligi is automatically included in the ontology during this stage. At the end of this stage, a total of 889 terms (concepts and their synonyms) can be obtained from the ontology. To provide informative statistics regarding this state of the ontology: 494 out of the 889 terms are original ontology concepts, 127 synonyms are added to this set, and for the resulting 621 terms, 268 new terms are added at the end of the diacritics-based expansion procedure performed at this stage.
• Stage 6: In Turkish, nouns can be inflected with inflectional suffixes such as case markers. In this study, we have implemented a semi-automatic inflectional expansion approach through which new forms of the existing ontology concepts are produced by adding the inflected forms of the concepts with the five case suffixes (accusative (ACC), dative (DAT), locative (LOC), ablative (ABL), and genitive (GEN)) and the instrumental suffix.3 Similar to the previous case, these inflected forms are added as the synonyms of the existing corresponding concepts. The implemented expansion scheme considers the related linguistic phenomena when adding these suffixes to the ontology concepts. To illustrate, for the existing ontology concept grip (meaning “flu”), the following six inflected forms are included: gribi (‘flu+ACC’), gribe (‘flu+DAT’), gripte (‘flu+LOC’), gripten (‘flu+ABL’), gribin (‘flu+GEN’), and griple (‘with flu’). The tool for this expansion scheme is implemented within the scope of the paper as a rule-based system, since the corresponding inflections are governed by a mostly definite set of rules in Turkish and hence the expected error rate of the tool is considerably low. Yet, the terms obtained as the result of the diacritics-based expansion scheme of the previous stage and the terms in English are not natural language constructs for Turkish. Therefore, the inflections of only these terms are reviewed by the authors and the errors encountered are manually corrected. In order not to add an inflected form homonymous to a common word in Turkish that is not related to public health, an automated postprocessing step is employed to check each new inflected form against a list of well-formed word forms in Turkish and discard the new form if it is found in the list. To provide detailed information on this final set, 747 Turkish terms are inflected (the others are from different languages, most of them being in English) with the aforementioned suffixes leading to 4482 new forms, and two of these inflected forms turn out to be homonymous to common words and are hence discarded. Therefore, the resulting ontology includes 889 terms from the previous stage and 4480 inflected forms introduced at the end of this stage, leading to a total of 5369 ontology terms.
• Stage 7: At this final stage, all of the resulting terms corresponding to the ontology concepts and their synonyms are compiled into a single lexical resource to be used by our ultimate identification system. This resource is also made publicly available for research purposes at 〈http://user.ceng.metu.edu.tr/~e120329/Turkish_Public_Health_Ontology.csv〉 in Comma Separated Values (CSV) format. The lines in this file indicate all of the relationships (associations) between all of the final ontology terms.

3 These are some of the most common inflectional suffixes in Turkish [24].
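The automatic parts of Stages 5 and 6 can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the authors' tool: the names `fold_diacritics`, `inflect`, and `expand`, the simplified vowel-harmony and consonant-softening rules, and the `common_words` argument are all hypothetical and introduced only for illustration. Turkish morphology has many exceptions (for example, compound nouns that already carry a possessive suffix, or words whose final consonant does not soften) that this sketch ignores, which is consistent with the manual review step described for Stage 6.

```python
# Illustrative sketch only; helper names and rules are assumptions, not the paper's tool.

# Stage 5: replace the six Turkish characters with diacritics by their plain counterparts.
FOLD = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")

VOWELS = "aeıioöuü"
BACK_VOWELS = "aıou"
ROUNDED_VOWELS = "oöuü"
VOICELESS = "fstkçşhp"
SOFTENING = {"p": "b", "ç": "c", "t": "d", "k": "ğ"}


def fold_diacritics(term):
    """Return the diacritics-free variant of a term (Stage 5)."""
    return term.translate(FOLD)


def inflect(term, soften=True):
    """Return six inflected forms of a noun using simplified harmony rules (Stage 6)."""
    last_vowel = next(ch for ch in reversed(term) if ch in VOWELS)
    back = last_vowel in BACK_VOWELS
    rounded = last_vowel in ROUNDED_VOWELS
    a = "a" if back else "e"  # two-way vowel harmony
    i = ("u" if rounded else "ı") if back else ("ü" if rounded else "i")  # four-way
    d = "t" if term[-1] in VOICELESS else "d"  # consonant assimilation in -DA / -DAn
    if term[-1] in VOWELS:
        # Vowel-final stems take a buffer consonant (y or n) before the suffix.
        return {
            "ACC": term + "y" + i,
            "DAT": term + "y" + a,
            "LOC": term + d + a,
            "ABL": term + d + a + "n",
            "GEN": term + "n" + i + "n",
            "INS": term + "yl" + a,
        }
    stem = term
    if soften and term[-1] in SOFTENING:
        # Final p/ç/t/k may soften before a vowel-initial suffix (grip -> gribi).
        stem = term[:-1] + SOFTENING[term[-1]]
    return {
        "ACC": stem + i,
        "DAT": stem + a,
        "LOC": term + d + a,
        "ABL": term + d + a + "n",
        "GEN": stem + i + "n",
        "INS": term + "l" + a,
    }


def expand(terms, common_words=frozenset()):
    """Add folded variants, then inflected forms that are not common words."""
    lexicon = set(terms) | {fold_diacritics(t) for t in terms}
    for base in list(lexicon):
        for form in inflect(base).values():
            if form not in common_words:
                lexicon.add(form)
    return lexicon
```

For the paper's own example, `inflect("grip")` yields the six forms listed under Stage 6, and passing a set of ordinary Turkish word forms as `common_words` mimics the postprocessing check that discards homonymous forms.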
Apart from the first line of the file, which denotes information about each column, each line corresponds to an individual ontology term. The values given for each term at each line include the following information, in sequence:

– Is-Diacritics-Expansion: A value as Yes/No which denotes whether this term is obtained as a result of the diacritics-based expansion procedure described in Stage 5.
– Diacritics-Expansion-Of: If the value of the previous field is Yes, this field contains the nominative form of the original term on which the diacritics-based expansion procedure is applied.
– Is-Synonym: A value as Yes/No which denotes whether this term is a synonym of another term of the ontology.
– Synonym-Of: If the value of the previous field is Yes, this field contains the nominative form of the original term of which the current term is a synonym.
– Original-Nominative-Form: The nominative form of the original ontology term.
– Accusative-Form: The original term's inflected version with the accusative case marker.
– Dative-Form: The original term's inflected version with the dative case marker.
– Locative-Form: The original term's inflected version with the locative case marker.
– Ablative-Form: The original term's inflected version with the ablative case marker.
– Genitive-Form: The original term's inflected version with the genitive case marker.
– Instrumental-Form: The original term's inflected version with the instrumental suffix.

The schematic representation of some main concepts of the public health ontology is presented in Fig. 2, for illustrative purposes. The schema shows the English translations of the concepts in parentheses. The aforementioned four main concepts (GeneralPublicHealth, Disease, Symptom, and Medication) and their subconcepts are shown with different colors. As pointed out in the last item above, all terms that can be extracted from the ontology, i.e., the concept names and the synonyms added at the ends of the diacritics-based expansion and inflectional expansion stages, are made available as a single resource of 5369 entries. This resource is also utilized by our tweet identification system to be described in the following section.

Fig. 2. Schematic representation of sample concepts in the constructed public health ontology.

3. Automatic identification of public health-related tweets

3.1. System description and overall evaluations

Our automatic identification system, which aims to determine public health-related tweets, scans the incoming Twitter stream (with the sole language criterion of “Turkish”) and classifies a tweet as related to public health if it contains at least one of the entries in its lexical resource derived from the public health ontology. During the whole evaluation phase, the automatically classified tweets are manually reviewed and marked as “related” or “not related” by the first author of this paper (who is an expert of the public health domain), in order to create the answer key.

In order to evaluate the performance of the system, we have first downloaded a tweet set of one million tweets (“First Data Set”). This data set includes random tweets from the consecutive 20 days between February 25 and March 16, 2015. During the collection of this data set, 50,000 tweets are downloaded on each day, using the Twitter Streaming API [25]. First, we have evaluated the system using only the first version of the ontology terms (obtained at the end of Stage 5 in Fig. 1) on this data set. This initial version identified 1052 tweets (∼0.1%) as related to public health. 819 of the identified tweets are directly related to public health while 233 of them are false positives, hence the precision of the system is found as 78%.4 Next, we have evaluated the final version of the system utilizing the final form of the ontology (obtained at the end of Stage 7 in Fig. 1) on the “First Data Set”. This time, the system identified 1500 tweets (∼0.15%), where 1103 of them are correct and 397 of them are incorrect identifications. Hence the precision of the final form of the system on this data set is found as 73.5%. We can conclude that the automatic expansion stages of the ontology development process increase the volume of the tweets identified from ∼0.1% to ∼0.15% in return for a decrease of 4.5 percentage points in precision.5 Some of the main ontology concepts that are frequently found by the system in the “First Data Set” include grip (‘flu’), hastane (‘hospital’), bogaz ağrısı (‘sore throat’), alerji (‘allergy’), domuz gribi (‘swine flu’), nezle (‘cold’), and their expanded forms. These are expected signs of related public health problems due to the season of the data set (late winter, early spring). The actual frequencies of the corresponding tweets with the top-12 frequent ontology terms are provided in the first three columns of Table 1.

4 This initial version of the system and this evaluation phase are described in [17], as previously mentioned in Section 1.

5 As part of future work, a comprehensive assessment study can be carried out with a group of public health experts to determine the acceptable or optimal precision and recall rates for the tweet identification problem. Within the course of this prospective study, the experts can be asked to assess the utility of different tweet sets (obtained with different system settings, and hence different precision-recall rates) for their public health surveillance purposes.

Table 1. The top-12 frequent terms observed in the relevant tweets in the tweet data sets.

First data set                              | Second data set
Term          Term (in English)  Frequency  | Term          Term (in English)  Frequency
grip          flu                261        | grip          flu                182
hastane       hospital           112        | hastane       hospital           112
hastaneye     hospital+DAT       94         | hastaneye     hospital+DAT       109
hastanede     hospital+LOC       84         | hastanede     hospital+LOC       73
acil servis   emergency          56         | migren        migraine           63
acil servisi  emergency+ACC      23         | hastaneden    hospital+ABL       31
gripten       flu+ABL            22         | karın ağrısı  stomach ache       24
bogaz ağrısı  sore throat        22         | alerji        allergy            23
hastaneden    hospital+ABL       22         | acil servis   emergency          21
migren        migraine           15         | pansuman      dressing           16
alerji        allergy            14         | nezle         cold               15
kramp         cramp              13         | obezite       obesity            14

In order to observe the performance of the system during a different seasonal period, we have similarly compiled another data set of one million tweets during the period between August 18 and September 6, 2015 (“Second Data Set”).6 The final form of the system determined 1455 tweets (∼0.15%), where 1010 of them are correct and 445 of them are incorrect, hence the precision is 69.4%. Some of the frequent concepts that lead to the identification of the corresponding tweets are grip (‘flu’), hastane (‘hospital’), migren (‘migraine’), karın ağrısı (‘stomach ache’), and obezite (‘obesity’), among others. The actual frequencies of the tweets with the top-12 frequent terms are given in the last three columns of Table 1. Flu and related diseases are common in this data set as well but less frequent compared to the first data set. Other types of disorders and symptoms such as migraine, stomach ache, and obesity seem to be increasingly mentioned in the tweets, and seasonal conditions might be one of the reasons for this phenomenon.

As it is not practical, we have not calculated the overall recall rates during our evaluation. Nevertheless, to observe the coverage of the system on a random sample, we have compiled a subset of the whole data sets by extracting the first 250 tweets from each of the 50,000 tweets per day, leading to a data set of 10,000 tweets (published in 40 days) which corresponds to 0.5% of the two evaluation data sets in size. The authors of this study have annotated each of the tweets in this set and 65 tweets (0.65%) are found to be relevant to public health, while the remaining ones are considered irrelevant. By automated matching with the previously obtained and classified results of our system, it is found that 12 of these 65 tweets have also been identified by our system. Therefore, the recall of our system on this random subsample is found as 18.5%. At the end of this experiment, our system has identified 13 tweets as relevant, one of them being a false positive. Therefore, the precision of our system on this data set is found as 92.3%, which is a favorable precision rate. Though the recall value is low, in order not to expose the system users to many false positives, a high-precision system intuitively seems more favorable although it may have a low recall rate. Besides, in some studies like [3], high precision is favored over recall and a tagger with a precision of 90.4% and a recall of 32.0% is built for the tweet identification task. Yet, we believe that optimal precision/recall rates for this task can best be determined through an analysis study with the experts of the domain, as stated in an earlier footnote. The main reasons that lead to the missing of the 53 tweets by the system are provided below:

• In some tweets, the terms signifying public health-related phenomena are inflected into forms apart from the inflections considered in the current study. Additionally, some related terms in tweets are expressed in verb forms, instead of the noun phrases mostly considered by the system. As mentioned as a line of future work in the last paragraph of the current subsection, the integration of a lemmatizer can help reduce such problems.
• In other tweets, the related terms are expressed with writing errors and hence the corresponding tweets are missed by the system. A tweet normalization procedure, mentioned in the last paragraph below, can be used to address this problem.
• In a few tweets, the public health-related phenomena are expressed with idiomatic expressions which are hard to determine by automated means.
• Again in a few tweets, the public health-related terms belong to the set of terms excluded during Stage 4 of the ontology development process in Fig. 1, and hence the corresponding tweets are missed. Stage 4 has been employed to increase the precision of the overall system and its employment leads to the missing of some relevant tweets, as expected.
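The identification rule of this subsection (a tweet is related if it contains at least one lexicon entry) and the reported precision and recall figures can be illustrated with a short sketch. The code below is an assumption-laden illustration rather than the system's actual implementation: the helper names (`tr_lower`, `build_matcher`, `matched_terms`, `is_health_related`, `precision`, `recall`) are hypothetical, and whole-word regular-expression matching is one plausible reading of "contains", since the paper does not state how matching is performed.

```python
import re


def tr_lower(text):
    """Lowercase with Turkish casing: plain str.lower() maps "I" to "i", not "ı"."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def build_matcher(terms):
    """Compile one whole-word pattern over all lexicon entries (assumed matching policy)."""
    ordered = sorted({tr_lower(t) for t in terms}, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")


def matched_terms(tweet, matcher):
    """Return the lexicon entries found in the tweet, in order of appearance."""
    return matcher.findall(tr_lower(tweet))


def is_health_related(tweet, matcher):
    """Classify a tweet as related if it contains at least one lexicon entry."""
    return matcher.search(tr_lower(tweet)) is not None


def precision(true_positives, false_positives):
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    return true_positives / (true_positives + false_negatives)


if __name__ == "__main__":
    # Tiny hypothetical lexicon; the real resource has 5369 entries.
    matcher = build_matcher(["grip", "hastane", "hastaneye", "acil servis"])
    print(matched_terms("GRİP oldum, hastaneye gidiyorum", matcher))
    # Figures reported in this subsection, recomputed from the raw counts.
    print(round(100 * precision(819, 233), 1))   # initial ontology, First Data Set
    print(round(100 * precision(1103, 397), 1))  # final ontology, First Data Set
    print(round(100 * precision(1010, 445), 1))  # final ontology, Second Data Set
    print(round(100 * recall(12, 53), 1))        # random subsample of 10,000 tweets
```

Counting the output of `matched_terms` over the relevant tweets (for instance with `collections.Counter`) would give frequency lists of the kind shown in Table 1.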

Before moving to the subsection describing the interface of the ultimate system, we should note that an alternative approach to the problem of automatic identification of public health-related tweets in an inflected language would be to use the bare forms of the ontology concepts and process the tweet contents instead, for possible inflected forms and diacritics-based word modifications. This alternative strategy would require the incorporation of a lemmatizer during tweet processing, and possible errors originating from the lemmatization procedure should also be considered accordingly. We have left this alternative approach as a plausible future research work based on our current study. Similarly, considering the diacritics-based word modifications and possible writing errors in tweets, tweet normalization as a form of tweet preprocessing is another alternative option to be considered as part of future work. Yet, current tweet normalization methods may fall short of expectations, since in studies such as [23,26], it is pointed out that only up to 1% improvement in F-Measure7 can be obtained with normalization for the named entity recognition task.

3.2. The user interface of the system

As an initial prototype, we have implemented a Web-based interface to our system where the prospective users of the interface are the public health staff in related organizations. A snapshot of the initial interface page illustrating the system results on the two tweet data sets is presented in Fig. 3. The interface enables its users to choose between Turkish and English as the interface language, and in the snapshot the interface language is selected as English for illustrative purposes. Through the interface, the frequencies of the relevant tweets can be observed as time-series charts. Additionally, the top-12 most frequent terms that lead to the identification of the corresponding tweets are displayed as bar charts, thereby facilitating the tracking of the current public health-related phenomena all over the country. In Fig. 3, the time-series and bar charts on the upper part of the page correspond to the system's results on the first data set while the charts on the lower part of the page correspond to the second data set. The interface also facilitates the exploration of the ontology concepts in tabular form in another page accessed through the “Edit the Ontology” button on the page shown in Fig. 3. The snapshot of this page is shown in Fig. 4. This page enables the system operators to make changes to the ontology through the addition of new concepts and the update/deletion of existing ontology concepts. As part of future work, the interface can be improved to display the system outcomes in an online fashion so that the domain experts can be informed of the important public health phenomena in a timely manner. Adding geographical display facilities, such as conveniently showing the tweets on the country map based on their location information, would be a useful feature of the ultimate interface of the online system. Another prospective feature for the interface is the ability to review the actual tweets on a selected date or for a selected public health term.

Fig. 3. A snapshot of a page of the system interface illustrating system results.

6 The period between February 25 and March 16 (of the First Data Set) is mostly part of winter time in Turkey while the period between August 18 and September 6 (of the Second Data Set) can be considered part of summer time.

7 F-Measure is usually calculated as the harmonic mean of precision and recall.
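The F-Measure mentioned in footnote 7 is defined there only in words. Written out, with P for precision and R for recall (symbols introduced here for illustration), the usual balanced form is shown below. The second line is a worked illustration that is not reported in the paper: it simply plugs in the precision (12/13) and recall (12/65) obtained on the random subsample of Section 3.1.

```latex
F = \frac{2 \, P \, R}{P + R}

F_{\text{subsample}}
  = \frac{2 \cdot \frac{12}{13} \cdot \frac{12}{65}}{\frac{12}{13} + \frac{12}{65}}
  = \frac{4}{13} \approx 0.31
```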

4. Detailed evaluation results

As presented in the previous section, the final form of the automatic tweet identification system detects a total of 2955 tweets (∼0.15%) in two million random Turkish tweets (“First Data Set” + “Second Data Set”), of which 2113 are relevant and 842 are irrelevant. Therefore, the overall precision of the system is 71.5%, which is a promising result. The main reasons for the false positives are elaborated below:

• As previously mentioned, some disease names can be used metaphorically in informal texts. Although we had eliminated some such terms during Stage 4 of our ontology development process (Fig. 1), some phrases including disease names, like verem olmak (“to have tuberculosis”), are used in tweets to emphasize anger or sadness. Tweets with such phrases account for more than half of the false positives.

• In some tweets, though a public health-related term is used relevantly, the author expresses a concern regarding their health instead of reporting a genuine problem. For instance, some tweets express that the person is afraid of having a disease or anticipates having one soon. Such cases also constitute a significant source of false positives.

• Some public health-related terms are used as part of the proper names of places, as in Karantina Kafe (“Quarantine Cafe”). There are several instances of this case among the false positives. As part of future work, a list of patterns (such as Kafe (‘Cafe’)) that may follow the ontology terms can be compiled so that the matching sequences (such as Karantina Kafe) can be excluded during the identification phase.

• As presented in Section 3.1, the precision obtained when the system is evaluated with the initial version of the ontology (after Stage 5 in Fig. 1) is 78%, while the precision obtained with the final version of the ontology is 73.5% on the “First Data Set”. This is because some of the inflected ontology terms lead to a considerable number of irrelevant tweets in addition to the relevant ones. For instance, the six newly added inflected forms of the term hastane (‘hospital’) lead to the identification of 199 relevant and 87 irrelevant tweets. The irrelevant tweets usually include chatter such as jokes or exaggerations about being taken to or staying at hospitals, usually published after sports matches. Although the number of relevant tweets identified increases considerably, the precision of the identifications due to these inflected forms of hastane is 69.6%. Hence, the overall precision decrease due to all such inflected terms leading to false positives is 4.5 percentage points (from 78% to 73.5%).
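The exclusion-pattern idea proposed above for place names can be sketched as follows. This is a hedged illustration of one possible design, not the system described in the paper: the term and follower lists are hypothetical placeholders containing only words mentioned in the text, the function name `find_health_terms` is invented for the example, and matching is done on single lowercase tokens.

```python
import re

# Hypothetical, illustrative lists: the text names only the follower pattern
# "Kafe"; a real deployment would compile much longer lists.
ONTOLOGY_TERMS = ["karantina", "hastane"]
EXCLUDED_FOLLOWERS = ["kafe"]


def find_health_terms(tweet, terms=ONTOLOGY_TERMS, followers=EXCLUDED_FOLLOWERS):
    """Return the ontology terms matched in a tweet, skipping any match that is
    immediately followed by an excluded word such as 'Kafe'."""
    # Note: str.lower() does not apply Turkish casing rules for I/İ;
    # this sketch ignores that detail.
    tokens = re.findall(r"\w+", tweet.lower())
    matched = []
    for i, token in enumerate(tokens):
        if token not in terms:
            continue
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if next_token in followers:
            continue  # e.g. "Karantina Kafe" is a place name, not a health report
        matched.append(token)
    return matched
```

With these lists, a tweet containing "Karantina Kafe" yields no match, while a tweet that uses karantina without a following excluded word is still identified. Multi-word ontology terms and inflected followers would need a more elaborate matcher.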

In order to observe the performance of our tweet identification system for each main ontology concept, we provide the frequencies of the relevant and irrelevant tweets within the system output in Figs. 5 and 6, for the first and second data sets, respectively. These frequencies are provided individually for each main ontology concept: General Public Health, Disease, Symptom, and Medication. Along with these frequency bar charts on the left sides of the figures, the corresponding precision rates for each ontology concept are also provided as bar charts. These evaluation results of the system for each ontology concept are discussed in the upcoming section.

Fig. 4. A snapshot of a page of the system interface illustrating the content of the public health ontology in Turkish.

Fig. 5. The frequencies of relevant and irrelevant tweets in the system output and the corresponding precision rates for the first data set for each ontology concept.

Fig. 6. The frequencies of relevant and irrelevant tweets in the system output and the corresponding precision rates for the second data set for each ontology concept.
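The precision rates reported in this section follow directly from the counts of relevant and irrelevant tweets. The sketch below recomputes the two rates for which the text gives explicit counts. The helper names `precision` and `precision_by_group` are assumptions made for this example; the same helper could be applied to the per-concept counts plotted in Figs. 5 and 6, which are not reproduced here.

```python
def precision(relevant, irrelevant):
    """Share of identified tweets that are relevant."""
    identified = relevant + irrelevant
    if identified == 0:
        raise ValueError("precision is undefined when no tweets are identified")
    return relevant / identified


def precision_by_group(counts):
    """Map each group name to its precision, given (relevant, irrelevant) pairs."""
    return {group: precision(rel, irr) for group, (rel, irr) in counts.items()}


# Counts taken from the text of this section.
reported_counts = {
    "all terms, both data sets": (2113, 842),
    "inflected forms of hastane": (199, 87),
}
for group, value in precision_by_group(reported_counts).items():
    print(f"{group}: {value:.1%}")
# prints:
# all terms, both data sets: 71.5%
# inflected forms of hastane: 69.6%
```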

5. Discussion

The detailed evaluation results, including the frequencies and the corresponding precision rates for each ontology concept given in Figs. 5 and 6, lead to the following conclusions regarding the contributions of these concepts to the system performance:

• The concepts of General Public Health and Disease lead to a larger number of tweets (relevant + irrelevant) than the other concepts, while the concept of Medication leads to the lowest overall number of tweets on both data sets.

• The use of the terms derived from the Disease concept leads to more precise identifications than the other concepts. The second most useful concept in terms of precision is General Public Health on both data sets.

• The performance contributions of the terms derived from the Symptom and Medication concepts are considerably lower. In particular, since the frequency of the tweets identified through the Symptom terms is considerably high, we can conclude that this concept is a significant source of the false positives.

Several studies on Twitter mining for health-related topics employ machine learning approaches, as reviewed in Section 1. We have also considered training an SVM classifier following the study presented in [3]. Preferably, a balanced training data set is required to arrive at a high-performance SVM classifier. Previously, in Section 3.1, while calculating recall on a subset of 10,000 tweets, we found that the percentage of tweets related to public health is about 0.65% in this subset. Producing a balanced and annotated training set of reasonable size is quite time-consuming and labor-intensive due to this scarcity of relevant tweets in randomly compiled tweet sets. Therefore, we have used the results of our system evaluation to form the training data set for our classifier. We have taken the true positives of our system on both data sets, which amount to a total of 2113 positively classified tweets. Next, we have randomly extracted 2113 negatively classified tweets from the annotated tweet subset of 10,000 entries that was used to calculate the recall of our system, as presented in Section 3.1. Hence, we have compiled a training corpus of 4226 tweets, each of which is annotated either as RELEVANT (50%) or IRRELEVANT (50%). After automated preprocessing to filter out the stopwords in the tweets of this training data set, we have used unigrams as the features of our SVM classifier. We have used the SMO (sequential minimal optimization) classifier of the open-source data mining platform Weka [27], which uses the SMO algorithm [28] to train an SVM classifier with a linear kernel. 10-fold cross-validation on this training data set results in quite high performance rates: a precision of 95% and a recall of 94.6%, leading to an overall F-Measure of 94.8%. These performance rates are quite favorable and they suggest that:

• Joint utilization of our ontology-based identification system and a machine learning approach can lead to considerably improved performance results, through the use of the system's output within the training data set of the machine learning algorithm.

• Our original approach and system can be considered more of a knowledge-based approach, emphasizing the compilation of the ontology concepts, compared to the previous related work (such as [3,10]), which emphasizes the employment of machine learning algorithms and the tuning of the related features utilized by these algorithms.

• In addition to our SVM-based experiment, other automated classifiers commonly employed for English tweets can also be applied to the problem of public health surveillance on Turkish tweets. Accordingly, further experiments should be performed with different classifier types and annotated data sets.

• Future work on Turkish tweets may consider and exploit the peculiarities of the Turkish language, such as its rich morphological constructs, as we have done by including some of the common inflected forms of the health-related concepts in our public health ontology (see Stage 6 in Section 2). For instance, some of the prospective features to be used by the automated classifiers may cover such peculiarities.

We have also tested our trained SVM model on our annotated subset of 10,000 entries, which is quite unbalanced. The SVM model achieves precision and recall rates of 99.1% each, which are again quite favorable. Again using the Weka platform, we have tested NaiveBayes and J48 (decision tree) classifiers for comparison purposes and observed the following:

• 10-fold cross-validation of the NaiveBayes classifier on the training data set has resulted in a precision and recall of 91.5% each. The evaluation of the NaiveBayes classifier on the test data set of 10,000 tweets, after being trained on the training set, has resulted in a precision of 99.1%, a recall of 91.4%, and an F-Measure of 94.9%.

• 10-fold cross-validation of the J48 classifier on the training data set has resulted in a precision of 95.5%, a recall of 95.2%, and an F-Measure of 95.2%. The evaluation of the J48 classifier on the test data set, after being trained on the training set, has resulted in a precision and recall of 99.1% each, similar to the results of the SVM classifier under the same settings.

These results indicate that the SVM and J48 classifiers achieve comparable results and outperform the NaiveBayes algorithm in our settings. This observation is in line with the findings reported in related work such as [3,4,13,15], where SVM and/or decision tree classifiers are used for public health surveillance on social media.

As part of future work, we plan to increase the performance of our system by considering these observations regarding the errors. There might also be many relevant tweets which have gone unnoticed, but due to the large size of our data sets, it is not feasible to scan the whole sets and calculate the actual recall rates, although we have calculated the recall values for a subset of the data sets. Yet, further studies should also consider increasing the coverage of the system so that the number of relevant but unnoticed tweets can be decreased. Lastly, it should be noted that the system utilizes the Twitter Streaming API, which provides a 1% sample of the current tweets. Therefore, other ways to increase the representativeness of the incoming social media stream should be considered in the future [29].

6. Conclusion

Social media analysis has gained significant research attention, especially due to topics like trend analysis and opinion mining. Automatic tracking of public health-related phenomena like disease outbreaks or epidemics through social media analysis is also an important research issue. In this paper, we present an ontology-based system to automatically identify Turkish tweets related to public health. First, we have built a wide-coverage public health ontology in a semi-automated manner and then used the list of terms extracted from this ontology to implement the system. We have evaluated our system on two genuine and large tweet sets compiled during distinct times of the year and achieved promising performance rates. We have also implemented a Web-based graphical user interface for the system as a proof-of-concept. We believe that our study is a significant contribution to the related literature on social media analysis for public health surveillance and on semi-automated ontology construction for health-related domains. Future work based on the current study includes increasing the precision and coverage rates of the system, adding relevant features to its user interface, and making the whole system execute in an online manner on data from different social media sites.

Conflict of interest

None declared.

References

[1] M. Dredze, How social media will change public health, IEEE Intell. Syst. 27 (4) (2012) 81–84.
[2] A. Culotta, Towards Detecting Influenza Epidemics by Analyzing Twitter Messages, in: Proceedings of the First Workshop on Social Media Analytics, 2010, pp. 115–122.
[3] M.J. Paul, M. Dredze, A Model for Mining Public Health Topics from Twitter, Tech. rep., Johns Hopkins University, Baltimore, MD (2011).
[4] A. Signorini, A.M. Segre, P.M. Polgreen, The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic, PLOS ONE 6 (5) (2011) e19467.
[5] M.J. Paul, M. Dredze, You Are What You Tweet: Analyzing Twitter for Public Health, in: Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, 2011, pp. 265–272.
[6] J. Gomide, A. Veloso, W. Meira Jr, V. Almeida, F. Benevenuto, F. Ferraz, M. Teixeira, Dengue Surveillance Based on a Computational Model of Spatio-temporal Locality of Twitter, in: Proceedings of the ACM International Web Science Conference, 2011, pp. 1–8.
[7] N. Heaivilin, B. Gerbert, J. Page, J. Gibbs, Public health surveillance of dental pain via twitter, J. Dent. Res. 90 (9) (2011) 1047–1051.
[8] K.W. Prier, M.S. Smith, C. Giraud-Carrier, C.L. Hanson, Identifying Health-Related Topics on Twitter: An Exploration of Tobacco-Related Tweets as a Test Topic, in: Social Computing, Behavioral-Cultural Modeling and Prediction, Springer, 2011, pp. 18–25.
[9] D.A. Broniatowski, M.J. Paul, M. Dredze, National and local influenza surveillance through twitter: an analysis of the 2012-2013 influenza epidemic, PLOS ONE 8 (12) (2013) e83672.
[10] A. Lamb, M.J. Paul, M. Dredze, Separating Fact from Fear: Tracking Flu Infections on Twitter, in: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies, 2013, pp. 789–795.
[11] F. Gesualdo, G. Stilo, E. Agricola, M.V. Gonfiantini, E. Pandolfi, P. Velardi, A.E. Tozzi, Influenza-like illness surveillance on twitter through automated learning of naïve language, PLOS ONE 8 (12) (2013) e82489.
[12] P. Velardi, G. Stilo, A.E. Tozzi, F. Gesualdo, Twitter mining for fine-grained syndromic surveillance, Artif. Intell. Med. 61 (3) (2014) 153–163.
[13] J.C. Santos, S. Matos, Analysing twitter and web queries for flu trend prediction, Theor. Biol. Med. Model. 11 (Suppl 1) (2014) S6.
[14] D. Paolotti, A. Carnahan, V. Colizza, K. Eames, J. Edmunds, G. Gomes, C. Koppeschaar, M. Rehn, R. Smallenburg, C. Turbelin, S. van Noort, A. Vespignani, Web-based participatory surveillance of infectious diseases: the influenzanet participatory surveillance experience, Clin. Microbiol. Infect. 20 (1) (2014) 17–21.
[15] M. Santillana, A.T. Nguyen, M. Dredze, M.J. Paul, E.O. Nsoesie, J.S. Brownstein, Combining search, social media, and traditional data sources to improve influenza surveillance, PLOS Comput. Biol. 11 (10) (2015) e1004513.
[16] Q. Zhang, C. Gioannini, D. Paolotti, N. Perra, D. Perrotta, M. Quaggiotto, M. Tizzoni, A. Vespignani, Social Data Mining and Seasonal Influenza Forecasts: The FluOutlook Platform, in: Proceedings of ECML/PKDD, 2015, pp. 237–240.
[17] E. Küçük, K. Yapar, D. Küçük, D. Küçük, Automatic Identification of Public Health Related Turkish Tweets, in: Proceedings of the 9th European Public Health Conference, 2016.
[18] A. Gómez-Pérez, M. Fernández-López, O. Corcho, Ontological Engineering, 3rd edition, Springer-Verlag, London, 2004.
[19] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, J.T. Eppig, et al., Gene ontology: tool for the unification of biology, Nat. Genet. 25 (1) (2000) 25–29.
[20] R. Stevens, C. Goble, I. Horrocks, S. Bechhofer, Building a Bioinformatics Ontology Using OIL, IEEE Transactions on Information Technology in Biomedicine 6 (2).
[21] D.A. Natale, C.N. Arighi, W.C. Barker, J.A. Blake, C.J. Bult, M. Caudy, H.J. Drabkin, P. D'Eustachio, A.V. Evsikov, H. Huang, et al., Protein ontology: a structured representation of protein forms and complexes, Nucleic Acids Res. 39 (suppl 1) (2011) D539–D545.
[22] D. Küçük, Y. Arslan, Semi-automatic construction of a domain ontology for wind energy using wikipedia articles, Renew. Energy 62 (2014) 484–489.
[23] D. Küçük, R. Steinberger, Experiments to Improve Named Entity Recognition on Turkish Tweets, in: EACL Workshop on Language Analysis for Social Media, 2014, pp. 71–78.
[24] A. Göksel, C. Kerslake, Turkish: A Comprehensive Grammar, Routledge, 2005.
[25] Twitter, Twitter Streaming APIs (May 2016) 〈https://dev.twitter.com/streaming/overview〉.
[26] L. Derczynski, D. Maynard, N. Aswani, K. Bontcheva, Microblog-genre Noise and Impact on Semantic Annotation Accuracy, in: Proceedings of the 24th ACM Conference on Hypertext and Social Media, 2013, pp. 21–30.
[27] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, WEKA data mining software: an update, ACM SIGKDD Explor. Newsl. 11 (1) (2009) 10–18.
[28] J.C. Platt, Fast training of support vector machines using sequential minimal optimization, Adv. Kernel Methods (1999) 185–208.
[29] F. Morstatter, J. Pfeffer, H. Liu, K.M. Carley, Is the Sample Good Enough? Comparing Data from Twitter's Streaming API with Twitter's Firehose, in: Proceedings of ICWSM, 2013.
