A distant supervision based approach to medical persona classification


Journal of Biomedical Informatics 94 (2019) 103205

Contents lists available at ScienceDirect

Journal of Biomedical Informatics journal homepage: www.elsevier.com/locate/yjbin

Nikhil Pattisapu, Manish Gupta, Ponnurangam Kumaraguru, Vasudeva Varma ⁎

Information Retrieval and Extraction Lab, Kohli Center for Intelligent Systems, International Institute of Information Technology Hyderabad, 500032, India

ARTICLE INFO

Keywords: Distant supervision, Medical social media, Persona, Medical personae, Deep learning, Convolutional neural network, Long short term memory network, Hierarchical attention network

ABSTRACT

Identifying the medical persona from a social media post is critical for drug marketing, pharmacovigilance and patient recruitment. Medical persona classification aims to computationally model the medical persona associated with a social media post. We present a novel deep learning model for this task which consists of two parts: Convolutional Neural Networks (CNNs), which extract highly relevant features from the sentences of a social media post, and average pooling, which aggregates the sentence embeddings to obtain a task-specific document embedding. We compare our approach against standard baselines, such as Term Frequency - Inverse Document Frequency (TF-IDF) and averaged word embedding based methods, and against popular neural architectures, such as CNN-Long Short Term Memory (CNN-LSTM) and Hierarchical Attention Networks (HANs). Our model achieves an improvement of 19.7% in classification accuracy and 20.1% in micro F1 measure over the current state-of-the-art. We eliminate the need for manual labeling by employing a distant supervision based method to obtain labeled examples for training the models. We thoroughly analyze our model to discover cues that are indicative of a particular persona. In particular, we use first derivative saliency to identify the salient words in a given social media post.

1. Introduction

Social media has emerged as one of the main technologies for sharing information, ideas, interests, opinions and experiences. It has also transformed many interactions and communications in the medical and healthcare domains. Medical Social Media (MSM), which is the subset of social media restricted to medical and healthcare domains, has recently emerged as a rich source of vital information [1]. MSM includes healthcare portals,1 medical blogs,2 tweets,3 collaborative documents and medical wikis. It contains diversified knowledge about a plethora of healthcare specific topics, such as a patient's experience with a particular drug [2], a consultant's advice about a disease [3], a researcher's finding about a new medical compound [4] and a caretaker's firsthand experience of caring for a patient [5]. MSM is increasingly assuming a greater role in several organizations such as hospitals [6,7], clinics [8,9] and pharmaceutical firms [10].

Recent research reveals that MSM can be used for many tasks such as drug marketing, pharmacovigilance and public health monitoring. However, identifying the medical persona corresponding to an MSM post is of paramount importance for these applications. The task of medical persona classification aims to identify the medical persona associated with a particular MSM post. MSM contributors usually belong to a variety of personae, such as patients, caretakers, consultants, researchers, medical journalists and pharmacists [1,11]. Table 1 shows excerpts from MSM posts and the medical personae associated with them. Although identifying the medical persona is beneficial for hospitals, clinics, pharmaceutical firms, government bodies and non-profit organizations, here we present the use-cases specific to a pharmaceutical firm.

• The health profiles of patients can be used to discover the commonly occurring conditions of a disease and the unmet needs of a drug, and to use this information to design effective clinical trials and marketing campaigns. They can also be used to discover adverse effects of a particular drug and thereby enhance drug safety guidelines for patients.
• The profiles of caretakers can be used to help them with precautions and instructions while administering a particular drug. They can also be used to help caretakers form targeted communities, discuss their issues and improve their quality of life.
• The profiles of researchers can be used to obtain the latest research insights about a particular drug or disease area and use them in various stages of a drug pipeline.
• The profiles of consultants can be used to recruit them as key opinion leaders in a specific drug or disease area. A firm can gather consultants' opinions about a drug and request them to promote it.
• The profiles of pharmacists can be used to gather information about drug dosage, interactions and therapeutic effects of a particular drug. A firm can also use them to find information about a competitor's drug pipeline and landscape.
• The profiles of journalists can be used to gather information about the quality of life of patients in a geographic region.

Table 1. Excerpts from MSM posts and the medical personae associated with them.

| Excerpt from MSM Posts | Medical Persona |
|---|---|
| I am feeling awful pain in my head. I am feeling as if I’ve hit a train. | Patient |
| My mother forgets her spectacles every few min. I suspect she is suffering from dementia. | Caretaker |
| the presence of inflammation in the sinus determines the name of the sinusitis, like maxillary sinusitis etc | Consultant |
| spironolactone has been used for 30 years as a diuretic. Spironolactone is a synthetic steroid structurally related | Pharmacist |
| a multicenter phase ii study of carboplatin and paclitaxel for advanced thymic carcinoma | Researcher |
| A new report reveals that about 37 million americans experience migraines, some of them daily | Journalist |

⁎ Corresponding author. E-mail addresses: [email protected] (N. Pattisapu), [email protected] (M. Gupta), [email protected] (P. Kumaraguru), [email protected] (V. Varma).
1 http://kevinmd.com/blog/.
2 https://lunaoblog.blogspot.com.
3 https://twitter.com/SmaIIArms/status/825095345453543424.

https://doi.org/10.1016/j.jbi.2019.103205
Received 27 September 2018; Received in revised form 28 April 2019; Accepted 6 May 2019. Available online 11 May 2019.
1532-0464/ © 2019 Elsevier Inc. All rights reserved.

For most of these applications, identifying the medical persona associated with a social media post is critical as it narrows the search space. For instance, information about unknown adverse reactions of a drug can be extracted exclusively from patient and caretaker posts. The task of medical persona classification is challenging for several reasons.

• Lack of adequate labeled data to train deep learning models.
• Identifying useful features requires a lot of domain expertise. Also, feature extraction is non-trivial.
• MSM posts contain incorrect spellings, incorrect grammar and non-standard abbreviations which degrade the performance of Natural Language Processing (NLP) and medical information extraction tools.

The task of medical persona classification was first introduced by Pattisapu et al. [12]. They discussed several methods based on manual feature engineering, pre-trained word embeddings and deep learning. However, their dataset consisted of only 1581 blogs, which is too small to train deep learning models. In this work, we propose deep learning based methods to computationally identify the medical persona associated with a social media post. We propose a distant supervision based approach to obtain labeled data for this task without incurring additional labeling costs. We use this data to train our deep learning models. We use the human labeled data provided by Pattisapu et al. [12] to evaluate our models. Our main contributions in this work are listed below.

• We create a new dataset for persona classification using distant supervision which is five times as large as the manually created dataset of [12].
• We propose a novel deep learning based hierarchical neural architecture for this task.
• We thoroughly analyze our model to unearth the hidden cues in a post that are indicative of a particular persona.

2. Typical text classification approaches

In this section, we treat persona classification from MSM as a generic text categorization problem and discuss the most commonly used approaches. We use these approaches as baselines for our problem. In the first baseline, we represent each social media post (a.k.a. document) as a row vector $d = [\mathrm{TFIDF}(w_i)],\ 1 \le i \le |V|$, where $w_i$ is the $i$th word in the vocabulary of size $|V|$. In the second baseline, we represent each social media post as the averaged pre-trained embedding of all the words present in it. These representations are commonly used as baselines for several text classification tasks such as topic categorization [13], sentiment analysis [14,15] and authorship attribution [16,17]. In the third, fourth and fifth baselines, we obtain sentence representations using ELMo embeddings [18], the Universal Sentence Encoder [19] and BERT [20] respectively, and subsequently average the sentence embeddings to obtain the document embedding. We then train a multi-class SVM using these representations. The sentence representations from ELMo embeddings and the Universal Sentence Encoder were shown to be useful for several downstream tasks including text classification [21].

3. Recent neural models for document classification

Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) with pre-trained word embeddings such as Word2Vec [22] or GloVe [23] are regularly used for document classification tasks [24–28]. Usually, Gated Recurrent Units (GRU) or Long Short Term Memory Networks (LSTM) are used as recurrent units instead of vanilla RNNs. We compare our approach (see Section 5) against CNN [26] and LSTM based models for text classification. Recently, neural models that exploit the hierarchical nature of a document have gained popularity [29,30]. In these models, task-specific features are first extracted from local text regions (such as sentences); the representations of several text regions are then aggregated to form document representations (global features). In this work, we use two such popular architectures as baselines for medical persona classification.

• CNN-LSTM
• Hierarchical Attention Networks

3.1. CNN-LSTM

Fig. 1 shows the CNN-LSTM model, where Convolutional Neural Networks (CNN) are used to extract task-specific features from a sequence of word embeddings and an LSTM is used to learn a document embedding from sentence embeddings. A plain softmax classifier is used on top of the document embedding to categorize it into one of the predefined categories. The sequence of word embeddings can either constitute a linguistic unit such as a sentence or merely constitute a text region. In this model, the weights of the CNN are shared across the text regions. Several architectures also use RNNs to extract task-specific features from text regions. The choice between CNNs and RNNs to model a sentence depends purely on the task at hand. Generally, CNNs are good at extracting local and position-invariant features, and RNNs are good at encoding structure-dependent semantics of the whole input [31].

Fig. 1. Architecture of the CNN-LSTM model. Colored viewing advised. The sample convolution filters of size d × 2, d × 3, d × 4, d × 5 are shown in blue, orange, green and purple colors respectively, where d is the dimensionality of pre-trained word embedding.

3.2. Hierarchical Attention Networks

Hierarchical Attention Networks [32], a.k.a. HAN (depicted in Fig. 2), use a network architecture similar to that of CNN-LSTM, with two major differences. First, HANs use RNNs to extract task-specific features from a sentence. Second, HAN hypothesizes that for a particular classification task some words are more important than others and, similarly, some sentences are more important than others. HANs model the importance of words and sentences using a mechanism called attention. For this, they introduce two additional parameters (matrices) which capture the useful words and sentences in an input document. These matrices are randomly initialized and are jointly learnt along with the other parameters of the network. One of the main reasons for the popularity of HANs is their ability to differentiate important sentences or words from others.

Fig. 2. Architecture of the HAN model.

4. Challenges in adapting current neural models to persona classification

CNN-LSTM and HAN models rely on LSTMs to model the document embedding from sentence embeddings. Recent research reveals that CNNs exhibit better performance than LSTMs (or GRUs), especially when the classification task is essentially a keyphrase recognition task, as in sentiment detection and question-answer matching settings [31]. We argue that persona classification is also essentially a keyphrase recognition task, wherein a few keyphrases determine the persona. Consider the excerpts shown in Table 2. Observe that the bold faced sentences alone can determine the medical persona corresponding to an MSM post. Additionally, the order of sentences has little effect on the persona.

Because both CNN-LSTM and HANs use LSTMs at the document level, they introduce additional parameters which have to be learnt, often using limited training data. HANs further rely on two additional matrices for incorporating the sentence and word attention mechanisms. The large number of parameters in these models might cause them to overfit the training data, which in turn might cause poor generalization on test data. Apart from this, longer sequences are known to adversely affect the performance of LSTMs.

Table 2. Excerpts from MSM posts. Bold faced sentences indicate crucial information for persona classification.

| Persona | Excerpt from MSM |
|---|---|
| Caretaker | Cancer is a dangerous disease. Cancer kills more than a million people every year. **My brother is fighting this monster as well.** People suffering from cancer should exercise a lot of care. |
| Caretaker | Cancer kills more than a million people every year. **My brother is fighting this monster as well.** People suffering from cancer should exercise a lot of care. Cancer is a dangerous disease. |
| Caretaker | People suffering from cancer should exercise a lot of care. Cancer is a dangerous disease. Cancer kills more than a million people every year. **My brother is fighting this monster as well.** |
| Consultant | Cancer is a dangerous disease. Cancer kills more than a million people every year. **In my career of 20 years, I’ve seen hundreds of cancer patients.** People suffering from cancer should exercise a lot of care. |
| Consultant | Cancer kills more than a million people every year. **In my career of 20 years, I’ve seen hundreds of cancer patients.** People suffering from cancer should exercise a lot of care. Cancer is a dangerous disease. |
| Consultant | People suffering from cancer should exercise a lot of care. Cancer is a dangerous disease. Cancer kills more than a million people every year. **In my career of 20 years, I’ve seen hundreds of cancer patients.** |

5. Approach

One of the advantages of CNN-LSTM and HANs is that they exploit the hierarchical nature of a document by sharing the weights that are used to construct sentence embeddings. This reduces the number of parameters and results in faster training.

Fig. 3. Architecture of our Sent CNN. Colored viewing advised. The sample filters of size d × 2, d × 3, d × 4, d × 5 are shown in blue, orange, green and purple colors respectively, where d is the dimensionality of pre-trained word embedding. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

In this work we propose a neural model named Sent CNN (Fig. 3) which uses convolutional neural networks for extracting task-specific features from each sentence. The resulting sentence embeddings are aggregated using average pooling to form a document embedding. Finally, we use a plain softmax classifier on the document embedding, which assigns a score to each persona category. The weight sharing mechanism of our model is similar to that of the CNN-LSTM and HAN models, i.e. the CNN filters which are used to obtain sentence embeddings are shared across the sentences. We use the ADAM [33] stochastic optimizer to update the network parameters and minimize the categorical cross entropy loss.
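The forward pass described above (shared CNN filters per sentence, average pooling over sentences, softmax over personae) can be illustrated with a minimal pure-Python sketch. It is not the authors' PyTorch implementation: the ReLU activation, max-over-time pooling within a sentence, the absence of bias terms in the filters, the toy embedding size and the random weights are all assumptions made for illustration.

```python
import math
import random


def conv_max_pool(sentence, filt):
    """Slide one filter (n rows of d weights) over a sentence (list of d-dim
    word vectors) and return the largest ReLU activation.
    ReLU and max-over-time pooling are assumptions, not taken from the paper."""
    n = len(filt)
    best = 0.0  # starting at 0 makes max() act as ReLU followed by max pooling
    for start in range(len(sentence) - n + 1):
        window = sentence[start:start + n]
        act = sum(w * x for row, word in zip(filt, window) for w, x in zip(row, word))
        best = max(best, act)
    return best


def sentence_embedding(sentence, filters):
    # The same filters are applied to every sentence (weight sharing).
    return [conv_max_pool(sentence, f) for f in filters]


def document_embedding(sentences, filters):
    # Average pooling over sentence embeddings.
    embs = [sentence_embedding(s, filters) for s in sentences]
    return [sum(col) / len(embs) for col in zip(*embs)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sent_cnn_forward(sentences, filters, W, b):
    """Return one probability per persona category."""
    doc = document_embedding(sentences, filters)
    logits = [sum(w * x for w, x in zip(row, doc)) + bias for row, bias in zip(W, b)]
    return softmax(logits)


# Toy setup: 4-dim embeddings, one filter each of width 2, 3, 4, 5, six personae.
random.seed(0)
d, n_classes = 4, 6
filters = [[[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
           for n in (2, 3, 4, 5)]
W = [[random.uniform(-1, 1) for _ in filters] for _ in range(n_classes)]
b = [0.0] * n_classes
doc = [[[random.uniform(-1, 1) for _ in range(d)] for _ in range(length)]
       for length in (6, 3, 8)]  # three sentences of different lengths
probs = sent_cnn_forward(doc, filters, W, b)
```

Because the sentence embeddings are averaged, the output of this sketch does not depend on the order of the sentences, which matches the observation in Section 4 that sentence order has little effect on the persona.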

6. Dataset

For the medical persona classification task, we fetch posts from medical social media blogs and forums. We create two datasets – gold and distant. In gold data, each post is examined by a human annotator and assigned a persona category. In distant data, we use heuristically motivated methods to get automatically labeled data. For gold data collection, we use a set of the fifty most popular (drug, disease name) pairs obtained from drugs.com as queries. We query the Twingly blog search API4 to retrieve the matching blogs corresponding to each query. We randomly sample fifty blogs per query from the retrieved posts. Posts containing ill-parsed content, advertisements, duplicate or near-duplicate content are removed using standard techniques. The resulting blogs were examined by the human annotators and assigned one of the following persona labels: Patient, Caretaker, Consultant, Pharmacist, Researcher and Journalist. The inter-annotator agreement, Cohen’s kappa, between the four annotators was found to be 0.7.

To perform distant data collection we use a variety of heuristics. For obtaining patient and consultant posts we fetch all posts of kevinmd.com which are associated with the categories Patient and Physician respectively. The posts of caregiving.com and drugs.com are used as labeled examples for the caretaker and pharmacist categories. Articles belonging to drugs.com/news are fetched to be used as labeled examples for the journalist category. Similarly, pubmed abstracts are used as labeled examples for the researcher category. Fig. 4 shows excerpts of distant MSM posts corresponding to each persona category. Table 3 details various characteristics of gold and distant data. The dataset and source code are made freely available to the public.5

4 https://developer.twingly.com/resources/search/.
5 https://drive.google.com/open?id=1EXLNjbRT0rbDwi91NYR16N0Dyr1pEoD.

7. Experiments and results

We split the distant data into training (90%) and validation (10%).
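The source-based labelling heuristics of Section 6 and the 90/10 split of the distant data can be sketched as follows. This is only an illustration under stated assumptions: the helper names `distant_label` and `train_val_split`, the way a kevinmd.com category is passed in, the host-matching rules (including how a PubMed URL is recognized) and the random seed are hypothetical and are not taken from the released source code.

```python
import random
from urllib.parse import urlparse


def distant_label(url, category=None):
    """Assign a persona label from the source of a post (hypothetical helper).
    Returns None when no heuristic applies."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parsed.path.lower()
    if host == "kevinmd.com":
        # Site category decides between patient and consultant posts.
        return {"patient": "Patient", "physician": "Consultant"}.get((category or "").lower())
    if host == "caregiving.com":
        return "Caretaker"
    if host == "drugs.com":
        # News articles are journalist posts; the rest are pharmacist posts.
        return "Journalist" if path.startswith("/news") else "Pharmacist"
    if "pubmed" in host or "/pubmed" in path:  # assumption: how PubMed URLs look
        return "Researcher"
    return None


def train_val_split(items, val_fraction=0.1, seed=0):
    """Shuffle and split into (training, validation) lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = int(round(len(items) * val_fraction))
    return items[n_val:], items[:n_val]


train, val = train_val_split(range(100))
```

Labelling by source is what makes the supervision "distant": no post is read by an annotator, so label noise depends entirely on how well each source matches its persona.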

Fig. 4. Excerpts from distant data. (a) Patient (b) Consultant (c) Caretaker (d) Pharmacist (e) Researcher (f) Journalist.

Table 3. Characteristics of Gold (G) and Distant (D) data. $w_{min}$, $w_{max}$, $w_{avg}$, $w_{std}$ denote the minimum, maximum, average and standard deviation of the number of words in a social media post; $n_{docs}$ denotes the number of documents.

| Category | $w_{min}$ (G) | $w_{min}$ (D) | $w_{max}$ (G) | $w_{max}$ (D) | $w_{avg}$ (G) | $w_{avg}$ (D) | $w_{std}$ (G) | $w_{std}$ (D) | $n_{docs}$ (G) | $n_{docs}$ (D) |
|---|---|---|---|---|---|---|---|---|---|---|
| Patient | 13 | 27 | 1256 | 4633 | 347 | 1219 | 247 | 752 | 319 | 1274 |
| Caretaker | 18 | 14 | 1309 | 3612 | 398 | 626 | 301 | 465 | 63 | 639 |
| Consultant | 13 | 61 | 1488 | 5023 | 568 | 1511 | 311 | 605 | 361 | 2020 |
| Pharmacist | 17 | 118 | 1380 | 3770 | 444 | 819 | 328 | 403 | 194 | 973 |
| Researcher | 43 | 12 | 1355 | 2408 | 474 | 278 | 348 | 156 | 75 | 2176 |
| Journalist | 18 | 17 | 1733 | 3020 | 455 | 1086 | 367 | 948 | 162 | 466 |

The human labeled gold data is split into 10 folds and is reserved for evaluating the performance of the classifiers. Table 5 details the comparative performance of all models averaged over the 10 folds. Table 6 details the category-wise performance of these models averaged over the 10 folds. For tuning the hyperparameters of all the aforementioned models, we used a grid search over the hyperparameter space and chose the configuration with the highest macro-averaged F1 on the validation set. Table 4 shows the optimum values of the hyperparameters of Sent CNN. In order to account for class imbalance, we use category weights while training all the models. These weights are inversely proportional to the category frequencies in the training data. For all neural models, viz. CNN, LSTM, CNN-LSTM, HAN and Sent CNN, we use a dropout layer to ensure that the model generalizes well and does not overfit on distant data. Fig. 5 shows the performance of Sent CNN across several dropout rates. The highlighted region shows that a dropout rate of 5% is optimal for our task. We also conducted experiments with stopwords removed and found that this reduces both the classification accuracy and the micro F-score by 0.4%. Fig. 6 shows the training and validation loss curves of the Sent CNN model across multiple epochs. We use scikit-learn [34] for implementing the SVM + TF-IDF and SVM + Averaged word embedding baselines. All the experiments involving neural networks were realized using PyTorch [35] on a machine with 24 CPU cores, 128 gigabytes of RAM and 8 NVIDIA 1080 GPUs. The training times for the CNN-LSTM, HAN and Sent CNN models were 3.5, 4.5 and 2.5 h respectively.

Fig. 5. Performance of Sent CNN model across various dropout rates.

Fig. 6. Training and Validation Loss of Sent CNN model per Epoch.

Table 4. Optimum values of various hyperparameters for Sent CNN. Patience is a parameter which indicates the number of epochs with no improvement after which the learning rate will be reduced.

| Hyperparameter | Optimum Value |
|---|---|
| Convolution filter sizes | 2,3,4,5 |
| Total number of filters | 400 |
| Max sentence length | 15 |
| Number of epochs | 10 |
| Mini batch size | 16 |
| Patience | 3 |
| Dropout rate | 0.05 |

Table 5. The comparative performance of the proposed approach and baselines. P, R and F1 denote Precision, Recall and F-Measure.

| Approach | Accuracy | Macro P | Macro R | Macro F1 | Micro P | Micro R | Micro F1 |
|---|---|---|---|---|---|---|---|
| SVM + TFIDF | 0.388 | 0.364 | 0.366 | 0.325 | 0.391 | 0.391 | 0.391 |
| SVM + AvgEmb | 0.421 | 0.392 | 0.426 | 0.380 | 0.423 | 0.423 | 0.423 |
| SVM + ELMo | 0.212 | 0.221 | 0.221 | 0.221 | 0.124 | 0.165 | 0.108 |
| SVM + USE | 0.289 | 0.292 | 0.292 | 0.292 | 0.134 | 0.172 | 0.101 |
| SVM + BERT | 0.306 | 0.220 | 0.192 | 0.154 | 0.306 | 0.306 | 0.306 |
| CNN | 0.352 | 0.381 | 0.384 | 0.315 | 0.356 | 0.356 | 0.356 |
| LSTM | 0.292 | 0.205 | 0.194 | 0.139 | 0.297 | 0.297 | 0.297 |
| CNN-LSTM | 0.319 | 0.308 | 0.282 | 0.263 | 0.324 | 0.324 | 0.324 |
| HAN | 0.354 | 0.401 | 0.367 | 0.339 | 0.357 | 0.357 | 0.357 |
| Sent CNN | 0.504 | 0.401 | 0.478 | 0.407 | 0.508 | 0.508 | 0.508 |

8. Analysis and discussion

Table 5 shows that Sent CNN outperforms other methods across various metrics. We found that the Sent CNN model significantly (p < 0.01, paired t-test) outperforms the rest of the models. Specifically, Sent CNN outperforms the three best performing baselines shown in Table 5: SVM + AvgEmb ($p = 0.0008$, paired t-test), CNN ($p = 5.59 \times 10^{-6}$, paired t-test) and HAN ($p = 2.79 \times 10^{-5}$, paired t-test). We observe that Sent CNN outperforms other models despite using a simpler architecture. We identify two key reasons for this. First, the Sent CNN model uses CNN filters with shared weights to extract important task-specific features. This reduces the number of trainable parameters, avoids overfitting on distant data and enables the model to generalize well on human labeled data. Second, CNNs are efficient at extracting task-specific features (such as important bigrams and trigrams) using a small amount of training data. On the other hand, sophisticated models are efficient at encoding topic information (what is the blog about?) but are inefficient at encoding persona-specific information (who has written the blog?).

We also observe that the CNN-LSTM and HAN based models lag behind the Sent CNN model by more than 15% in the micro-averaged F1 score. We analyzed these models based on how well they categorize the human labeled blogs (gold dataset). We discovered that, unlike Sent CNN, both the CNN-LSTM and HAN based models were incorrectly tagging lengthy blogs which contained a large number of sentences. This indicates that these models fail to learn efficient document representations for this task if the number of input sentences is high. Additionally, we observed that several blogs lacked proper punctuation, which led to poor sentence tokenization, which in turn adversely affected the performance of these models.

Table 6 shows that the precision, recall and F-measure of all methods for the Caretaker persona are low. Upon manual inspection, we found that the reason for this is the huge semantic gap between the Caretaker documents in distant and gold data. Table 7 shows a few caretakers’ sentences obtained from distant and gold data. Observe that, unlike distant data, sentences extracted from gold data discuss actual caregiving experience. We also notice that Sent CNN results in low recall (0.064) and therefore low F1 (0.104) for the Consultant category (refer Table 6). We found that Sent CNN wrongly labels several Consultant posts as Patient, which results in a poor recall for the Consultant persona and a very high recall (0.909) for the Patient persona. Upon deeper inspection, we found that this is due to significant semantic overlap between Patient and Consultant posts, primarily because both were obtained from the same source, kevinmd.com.

In order to gain an understanding of what exactly our model is learning, we use the following two techniques.

• N-grams with the highest contribution
• Salient sections of input

N-grams with the highest contribution - In this approach, we first determine the n-grams that generate the highest activation values aggregated over all CNN filters (Eq. 1). Next, we determine the n-grams which result in the highest activation values for each filter $f_k$, i.e. $\{\arg\max_{ngram} f(W_{ngram} * f_k)\ \forall f_k\}$, where $W_{ngram}$, $f_k$, $*$ and $f$ represent the word embedding matrix for an n-gram, the $k$th filter matrix, the convolution operator and the activation function respectively. Both these kinds of n-grams reveal important clues about our model.

$$\mathrm{score}(ngram) = \sum_{f_k} f(W_{ngram} * f_k) \tag{1}$$

Tables 8 and 9 show the n-grams with the highest overall activation and the n-grams with the highest activation per filter respectively. Table 8 shows that the bigrams with the words says or said have the highest activation overall; similarly, several bigrams containing the word professor have a very high activation value. This might be due to a combination of two reasons: first, a lot of research posts are written by academicians and, second, academicians tend to use the term professor frequently. A similar pattern is visible in trigrams as well. Interestingly, the four-grams capture drug sequences, such as oxybutynin ditropan oxytrol tolterodine, which are usually found in pharmacists’ posts.6 Table 9 shows the n-grams with the highest activation per filter. Observe that there are several n-grams which occur more than once, for instance, doi 10 and 1 doi. This is because these n-grams generated the highest activation for multiple filters. Observe that a lot of bigrams and trigrams in Table 9 contain the word doi or a date such as jul 15. This might be because a lot of research posts cite a journal article's digital object identifier (doi) together with its date of publication or indexing. Similarly, journalist posts contain information about the date of publishing, such as Tuesday, Dec 23. Several bigrams contain personal pronouns such as I. Usually, patients’ and caretakers’ posts narrate firsthand experience about a disease, drug or event. Therefore, these are rich in personal pronouns.

Salient sections of input - Li et al. [36] define the salience score $S(e)$ as $|l(e)|$ where $l(e) = \frac{\partial (S_c)}{\partial e}$, $e$ is the input word embedding and $S_c$ is the class score. We adapt this method to find the salient words in an input for a particular category (such as patient). Eq. (2) shows the score corresponding to word $w_k$ w.r.t. class $c$, where $e(w_k)$ is the embedding corresponding to $w_k$ and $K$ is the maximum number of words in the input sentence. $\mathrm{score}(w_k)$ captures the salience of the word $w_k$ in a sentence.

$$\mathrm{score}(w_k) = \frac{e^{s_{w_k}}}{\sum_{j=1}^{K} e^{s_{w_j}}}, \qquad s_{w_k} = \left| \frac{\partial (S_c)}{\partial (e(w_k))} \right| \tag{2}$$

We now use Eq. (2) to discover various interesting properties of our model. In order to visualize salience per word we use text heatmaps. Figs. 7 and 8 show the salient sections of the same input but for different classes. Observe that the word me has a higher salience if the input was sourced from a patient post as compared to a consultant’s post. We find that the words that represent the class names, or morphological variants of class names, or words which are highly similar to a class name were salient. Figs. 9–14 show that the words such as patient,

6 Upon further inspection, we found that majority of the pharmacist posts in our gold data contain portions of advertisements which are aimed at selling drugs. Often, they have the structure such as Get flat 25% discount on ionamine bontril phendimetrazine tartrate online, buy now.
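The normalization in Eq. (2) can be illustrated with a short sketch that turns per-word gradients of the class score into salience scores. The gradients below are made-up numbers rather than values from a trained model, and reducing each gradient vector to a scalar with the L2 norm is an assumption about how the magnitude in Eq. (2) is taken.

```python
import math


def salience_scores(grads):
    """grads[k] is the gradient of the class score S_c with respect to the
    embedding e(w_k). Returns the softmax-normalized salience per word."""
    # Assumption: the magnitude of each gradient vector is its L2 norm.
    s = [math.sqrt(sum(g * g for g in grad)) for grad in grads]
    m = max(s)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative input: five words with hypothetical 2-dim gradients.
words = ["my", "brother", "is", "fighting", "cancer"]
grads = [[0.9, -0.4], [0.6, 0.8], [0.0, 0.1], [0.3, 0.0], [0.2, -0.2]]
scores = salience_scores(grads)
most_salient = words[scores.index(max(scores))]
```

Since the scores of a sentence sum to one, they can be mapped directly to heatmap colour intensities of the kind shown in the salience figures.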

Table 6. Category-wise performance of our proposed approach and baselines.

| Cat | Metric | SVM + TFIDF | SVM + AvgEmb | SVM + ELMo | SVM + USE | SVM + BERT | CNN | LSTM | CNN-LSTM | HAN | Sent CNN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Caretaker | P | 0.100 | 0.070 | 0.024 | 0.021 | 0.000 | 0.108 | 0.000 | 0.133 | 0.133 | 0.050 |
| Caretaker | R | 0.014 | 0.116 | 0.099 | 0.029 | 0.000 | 0.121 | 0.000 | 0.059 | 0.044 | 0.011 |
| Caretaker | F1 | 0.025 | 0.079 | 0.037 | 0.018 | 0.000 | 0.108 | 0.000 | 0.082 | 0.067 | 0.018 |
| Consultant | P | 0.295 | 0.303 | 0.299 | 0.320 | 0.316 | 0.320 | 0.301 | 0.300 | 0.226 | 0.256 |
| Consultant | R | 0.256 | 0.166 | 0.434 | 0.913 | 0.837 | 0.024 | 0.877 | 0.414 | 0.288 | 0.064 |
| Consultant | F1 | 0.273 | 0.213 | 0.322 | 0.472 | 0.456 | 0.043 | 0.445 | 0.346 | 0.251 | 0.104 |
| Journalist | P | 0.576 | 0.354 | 0.078 | 0.095 | 0.258 | 0.339 | 0.242 | 0.497 | 0.640 | 0.654 |
| Journalist | R | 0.271 | 0.792 | 0.274 | 0.027 | 0.094 | 0.589 | 0.046 | 0.138 | 0.525 | 0.618 |
| Journalist | F1 | 0.360 | 0.488 | 0.121 | 0.041 | 0.136 | 0.430 | 0.077 | 0.212 | 0.569 | 0.628 |
| Patient | P | 0.684 | 0.714 | 0.307 | 0.300 | 0.112 | 0.806 | 0.000 | 0.395 | 0.684 | 0.670 |
| Patient | R | 0.570 | 0.532 | 0.146 | 0.023 | 0.036 | 0.391 | 0.000 | 0.422 | 0.187 | 0.909 |
| Patient | F1 | 0.622 | 0.601 | 0.138 | 0.041 | 0.053 | 0.525 | 0.000 | 0.406 | 0.293 | 0.770 |
| Pharmacist | P | 0.327 | 0.447 | 0.000 | 0.050 | 0.416 | 0.275 | 0.476 | 0.288 | 0.347 | 0.361 |
| Pharmacist | R | 0.502 | 0.518 | 0.000 | 0.005 | 0.134 | 0.796 | 0.116 | 0.217 | 0.693 | 0.752 |
| Pharmacist | F1 | 0.394 | 0.478 | 0.000 | 0.009 | 0.200 | 0.409 | 0.175 | 0.243 | 0.461 | 0.487 |
| Researcher | P | 0.194 | 0.473 | 0.038 | 0.018 | 0.216 | 0.435 | 0.207 | 0.240 | 0.378 | 0.421 |
| Researcher | R | 0.580 | 0.421 | 0.034 | 0.037 | 0.047 | 0.386 | 0.122 | 0.445 | 0.460 | 0.517 |
| Researcher | F1 | 0.282 | 0.424 | 0.035 | 0.023 | 0.076 | 0.387 | 0.145 | 0.295 | 0.395 | 0.440 |
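As a sanity check on how the category-wise and macro-averaged figures relate, the Sent CNN values of Table 6 can be averaged over the six personae. The sketch below assumes that macro averaging means the unweighted mean of the per-category scores; under that assumption the results come out within about 0.001 of the macro-averaged Sent CNN row of Table 5 (0.401, 0.478, 0.407), with the small gaps plausibly due to rounding and fold averaging.

```python
# Per-category (precision, recall, F1) of Sent CNN, copied from Table 6.
sent_cnn = {
    "Caretaker":  (0.050, 0.011, 0.018),
    "Consultant": (0.256, 0.064, 0.104),
    "Journalist": (0.654, 0.618, 0.628),
    "Patient":    (0.670, 0.909, 0.770),
    "Pharmacist": (0.361, 0.752, 0.487),
    "Researcher": (0.421, 0.517, 0.440),
}


def macro_average(per_class):
    """Unweighted mean of each metric over the categories."""
    n = len(per_class)
    return tuple(sum(v[i] for v in per_class.values()) / n for i in range(3))


macro_p, macro_r, macro_f1 = macro_average(sent_cnn)
```

Note that the macro F1 here is the mean of the per-category F1 values, not the harmonic mean of macro precision and macro recall; only the former agrees with Table 5. Because every category counts equally, the weak Caretaker and Consultant scores pull the macro averages well below the accuracy.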

Table 7. Sample Caretakers’ Sentences from Distant and Gold Data.

| Sample sentences from Caretakers’ post (Distant Data) | Sample sentences from Caretakers’ post (Gold Data) |
|---|---|
| Did you have a caregiving experience in your childhood. | lately she keeps her eyes closed most of the time. |
| it s hard to imagine life without caregiving. | he is very fatigued now and he is losing his hair. |
| Thank you all for very supportive and I am very thankful to have found this site. | taking, lipitor, was totally unnecessary and was lowering her cholesterol. |
| I tell her she still is a a caregiver to me. | maybe she had a clot in her brain. |

Table 8 N-grams with the highest activations overall. All N-grams are comma separated.

N = 2: physician said, blumenfeld says, psychiatry says, frieden said, researchers said, psychiatry asserts, assistant professor, adjunct professor, collaborator professor, immunology professor, emeritus professor

N = 3: davies associate professor, blumenfeld md director, professor belz said, phd ensign professor, clinical assistant professor, state epidemiologist said, woolsey woolsey says, dr blumenfeld says, hiring unqualified assistants, said chief executive, miller pharmd professor, md emeritus professor, md emeritus professor

N = 4: dec 11 310 22, 2015 jul 26 pii, androl jul 2012 14, jan apr 21 1, jun 26 6 651, jun 1 151 6, jul aug 35 4, 2015 apr 6 doi, zoloft sertraline hydrochloride paxil, fluocinolone betamethasone dipropionate fluocinonide, risedronate actonel zoledronate zometa, fosamax ibandronate boniva risedronate, oxybutynin ditropan oxytrol tolterodine, urispas oxybutynin ditropan oxytrol, timoptic ocudose timolol timoptic, methylparaben ethylparaben propylparaben butylparaben, alendronate fosamax ibandronate boniva, flavoxate urispas oxybutynin ditropan, ditropan oxytrol tolterodine detrol, ionamine bontril phendimetrazine tartrate, ibandronate boniva risedronate actonel, oxybutynin ditropan tolterodine detrol

Table 9 N-grams with the highest activations per filter. All N-grams are comma separated.

N = 2: doi 10, doi 10, doi 10, doi 10, doi 10, 1 doi, 1 doi, 1 doi, com doi, print doi, rating i, i review, i boght, i review, i said, i appreciate, ive liked, i review, i concur, i loathed, jul 2012, jun 1, jul 2013, aug 2014, fda said, said Monday, study said, researchers said, researchers said, researchers said, said chief, fda said, said Thursday, said jeff

N = 3: fda said on, notification fda said, state epidemiologist said, fda said wockhardt, the fda said, ministry said recently, said the researchers, 2015 patient education, 2015 jun 1, 2015 jun 1, epidemiology apr 2014, date of aug, published nov 2015, Tuesday dec 23, 16 2016 announced, a physician employed, physicians office or, physician is to, a physicians supervision, patient pdf physician, a physicians supervision, a physician or, headache yesterday i, chronic fighters i, review ive had, i blog about, i review my, air filter i, 0 i have, most helpful i

N = 4: ibandronate boniva risedronate actonel, ibandronate boniva risedronate actonel, zoloft lexapro effexor cymbalta, zoloft lexapro effexor cymbalta, prozac celexa zoloft cymbalta, apixaban pradaxa dabigatran xarelto, fluoxetine prozac paroxetine paxil, ibandronate boniva risedronate actonel, benadryl tylenol sinus tavist, methylparaben ethylparaben propylparaben butylparaben, benadryl tylenol pm unisom, dihydrochloride 0 5 mg, detrol or solifenacin vesicare, zoloft lexapro effexor cymbalta
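Lists such as those in Tables 8 and 9 can be produced by scoring every n-gram in a corpus against a convolution filter and keeping the highest-scoring ones. The following is a minimal sketch of that idea, not the authors' implementation: the two-dimensional embeddings, the filter weights, the ReLU non-linearity and the helper names (`ngrams`, `filter_activation`, `top_ngrams`) are all illustrative assumptions.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def filter_activation(ngram, embeddings, weights, bias=0.0):
    """ReLU activation of one width-n filter applied to one n-gram.

    `weights` holds one weight vector per position in the n-gram.
    """
    total = bias
    for word, w in zip(ngram, weights):
        total += sum(wi * ei for wi, ei in zip(w, embeddings[word]))
    return max(0.0, total)


def top_ngrams(sentences, embeddings, weights, bias=0.0, k=3):
    """Rank the distinct n-grams of a corpus by their filter activation."""
    n = len(weights)  # filter width equals the n-gram length
    scored = {}
    for tokens in sentences:
        for gram in ngrams(tokens, n):
            if gram not in scored:
                scored[gram] = filter_activation(gram, embeddings, weights, bias)
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return [(" ".join(gram), act) for gram, act in ranked[:k]]


# Toy (hypothetical) embeddings and a bigram filter that fires on "<source> said".
embeddings = {
    "researchers": [1.0, 0.0],
    "fda": [0.8, 0.1],
    "said": [0.0, 1.0],
    "i": [-1.0, 0.0],
    "review": [0.0, -1.0],
}
weights = [[1.0, 0.0], [0.0, 1.0]]
sentences = [["researchers", "said"], ["fda", "said"], ["i", "review"]]

top = top_ngrams(sentences, embeddings, weights, k=2)
print(top)
```

Running the same ranking once per filter would give a per-filter list in the style of Table 9, while ranking over the maximum activation across all filters would give an overall list in the style of Table 8.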

Fig. 7. Salient sections of an input for Patient Class.

Fig. 8. Salient sections of an input for Consultant Class.

Fig. 9. Salient sections of an input for Patient Class.

Fig. 10. Salient sections of an input for Consultant Class.

Fig. 11. Salient sections of an input for Caretaker Class.

Fig. 12. Salient sections of an input for Journalist Class.

Fig. 13. Salient sections of an input for Researcher Class.

Fig. 14. Salient sections of an input for Pharmacist Class.

Fig. 15. Salient sections of various inputs for Patient Class.

Fig. 16. Salient sections of various inputs for Caretaker Class.

Fig. 17. Salient sections of various inputs for Caretaker Class.

Fig. 18. Salient sections of various inputs for Consultant Class.

Fig. 19. Salient sections of various inputs for Pharmacist Class.

Fig. 20. Salient sections of various inputs for Researcher Class.

Fig. 21. Salient sections of various inputs for Journalist Class.

consultant, nurse, caregiver, journalist, pharmacy are highly salient for the corresponding classes. Interestingly, we observed that the word patient has a high salience score even when it occurs in a Consultant’s post. For the Patient persona, abusive words such as shit and fck have high salience scores (refer Fig. 15), as do words with negative sentiment such as doomed, die and broken. This might be because patients increasingly express displeasure with their condition or treatment on MSM. For the caretaker persona, words with positive sentiment, such as hope, recover, pray, friends, god, prayer and mercy, were found to be salient (refer Fig. 16). This might be because the users of caregiving.com share positive experiences with caregiving and usually adopt a positive tone.⁷

⁷ MSM posts for the caretaker category were obtained by fetching publicly available posts from caregiving.com. In such cases, the content moderation policies of the particular website significantly influence the salience scores. For instance, if all posts containing abusive language are removed by the website owners, then abusive words would tend to have low salience scores, which might not reflect the true characteristics of the caretaker persona.


For the caretaker persona, we also find high salience scores for words that describe a relationship, such as mother, father and sister (refer Fig. 17). This is because, in most posts in our training corpus, caretakers were relatives of the patients.

Interestingly, for the consultant and pharmacist personae, our model does not find drug or disease names to be salient (refer Figs. 18 and 19). This might seem counter-intuitive. However, it indicates that mere drug or disease names are not important cues for distinguishing personae.

For the researcher persona, terms denoting money, dates, times, time periods and percentage symbols were found to be highly salient (refer Fig. 20). This is because many research articles in our training dataset contained a date of indexing and reported improvements over previous methods, usually quantified with numbers and percentage symbols; for instance, "We beat the best results by 7%". Since we used PubMed abstracts as distant data for the researcher persona, the salience scores for words such as pubmed and abstract were also very high. Additionally, we find that words such as professor, department, laboratory and university were salient. This might be because many research articles are authored by academicians in universities (refer Fig. 20).

For the journalist persona, we found that words corresponding to dates and locations were highly salient (refer Fig. 21).
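The salience scores discussed above come from first derivative saliency, which measures how strongly the class score responds to a small change in each word's embedding. The sketch below illustrates the idea on a toy scoring function using central finite differences. It is an illustration under stated assumptions rather than the paper's code: the scoring function (`toy_score`, a max-pooled ReLU unit), the embeddings, and the use of the L2 norm to reduce each gradient vector to a single per-word number are all hypothetical choices, and a real model would obtain the gradient by automatic differentiation.

```python
import math


def numerical_gradient(score_fn, embeddings, i, eps=1e-5):
    """Central-difference gradient of score_fn w.r.t. the i-th word embedding."""
    grad = []
    for d in range(len(embeddings[i])):
        plus = [list(e) for e in embeddings]
        minus = [list(e) for e in embeddings]
        plus[i][d] += eps
        minus[i][d] -= eps
        grad.append((score_fn(plus) - score_fn(minus)) / (2 * eps))
    return grad


def saliency(score_fn, embeddings):
    """Per-word saliency: L2 norm of the gradient for each word embedding."""
    scores = []
    for i in range(len(embeddings)):
        grad = numerical_gradient(score_fn, embeddings, i)
        scores.append(math.sqrt(sum(g * g for g in grad)))
    return scores


# Hypothetical class score: one ReLU unit per word followed by max-pooling.
W = [1.0, 1.0]


def toy_score(embeddings):
    activations = [max(0.0, sum(w * x for w, x in zip(W, e))) for e in embeddings]
    return max(activations)


tokens = ["i", "feel", "doomed"]
toy_embeddings = [[0.1, 0.2], [0.3, 0.1], [0.9, 0.8]]  # made-up values

scores = saliency(toy_score, toy_embeddings)
most_salient = tokens[scores.index(max(scores))]
print(most_salient, scores)
```

Because max-pooling passes the gradient only to the word that wins the pooling step, this toy example assigns all of the saliency to a single word; with several filters and average pooling over sentences, the scores would be spread across more words.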


9. Conclusions and future work

In this work, we explored several supervised machine learning models for medical persona classification in social media. We discussed two popular neural architectures, CNN-LSTM and HANs, which are commonly used for text classification tasks, and the challenges they face when used for medical persona classification. We introduced a model which overcomes these challenges while retaining the strengths of these models. To minimize manual labeling costs, we trained our model using purely heuristically obtained data. We found that our model outperforms all other models by a significant margin across multiple metrics. We conducted a thorough analysis of our model to unearth the hidden cues in a post that are indicative of a particular persona. Specifically, we used recent visualization techniques in NLP to identify salient words in a particular social media post.

In future, we would like to analyze and compare the robustness of our model with other models by studying the impact of wrongly labeled data on model performance. We would also like to use the visualization mechanism discussed in Section 8 to understand in detail the main reasons for the failure of CNN-LSTM and HAN. In our current experiments, we worked with limited sources of distant data; in future, we would like to extend our experiments to much larger and more diverse datasets. It would be of great interest to compare the performance of all models listed in Table 5 across varying training data sizes. We also want to quantify the adaptability of our models to multiple domains such as tweets and Facebook posts. We can easily apply our model to tweets by treating a tweet as an MSM post consisting of one sentence. To circumvent manual labeling costs, we used purely distant data for training our models. We would like to conduct an in-depth study to understand whether our model is capturing the persona or merely the characteristics of the source from which the labeled examples were obtained.

Declaration of Competing Interest

None.
