Artificial Intelligence in Medicine 56 (2012) 19–25
Proactive screening for depression through metaphorical and automatic text analysis

Yair Neuman a,∗, Yohai Cohen b,1, Dan Assaf a, Gabbi Kedma a

a Department of Education, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
b Gilasio Coding, Tel-Aviv, Israel
Article history: Received 14 December 2010; received in revised form 22 March 2012; accepted 13 June 2012.

Keywords: Depression; Mental health; Automatic screening; Natural language processing
Abstract

Objective: Proactive and automatic screening for depression is a challenge facing the public health system. This paper describes a system for addressing this challenge.

Materials and method: The system implementing the methodology, Pedesis, harvests the Web for metaphorical relations in which depression is embedded and extracts the relevant conceptual domains describing it. This information is used by human experts to construct a "depression lexicon". The lexicon is used to automatically evaluate the level of depression in a text, or whether the text deals with depression as a topic.

Results: Tested on three corpora of questions addressed to a mental health site, the system provides a 9% average improvement in predicting whether a question deals with depression. Tested on a corpus of blogs, the system provides an 84.2% correct classification rate (p < .001) in deciding whether a post includes signs of depression. Comparing the system's predictions to the judgments of human experts, we achieved an average 78% precision and 76% recall.

Conclusion: Depression can be automatically screened for in texts, and the mental health system may benefit from this screening ability.

© 2012 Elsevier B.V. All rights reserved.
∗ Corresponding author. Tel.: +972 8 6461844; fax: +972 8 6472897. E-mail addresses: [email protected] (Y. Neuman), [email protected] (Y. Cohen).
1 Tel.: +972 54 7926 997.
http://dx.doi.org/10.1016/j.artmed.2012.06.001

1. Introduction

The prevalence of depression in Western society [1] puts heavy constraints on the ability of the health system to provide individual diagnosis and treatment. Therefore, despite the difficulties associated with screening and diagnosis of depression [2,3], a preliminary phase of screening for depression is inevitable. In this context, the expanding popularity of the Web and various forms of social media has introduced new platforms for addressing this challenge. Screening for depression through online questionnaires [4] was only a first step in this direction. However, this process is passive in the sense that the individual is fully responsible for gaining access to and actively participating in the screening procedure. The price of using the online questionnaire is therefore self-selection and the exclusion of relevant subjects. A different, complementary, and proactive approach may actively analyze texts written by individuals, such as posts published in their personal blogs, and identify signs of depression in the text through automatic text analysis.

The aim of this paper is to present a system that screens for depression in texts. The system is based on a novel approach for identifying the meaning associated with a target term [5] through metaphorical analysis. However, because of space limitations and the specific focus of the paper, we focus only on providing empirical evidence for the system's ability to screen for depression in texts. Such an application may be highly relevant, for instance, to mental health agencies seeking a tool for automatic screening for depression. The rationale of this idea is as follows. A mental health agency may provide subjects with free access to our screening system. Given the subject's permission, the system may proactively and automatically screen for signs of depression in texts written by the subject, such as posts he or she writes in a personal blog. When the system identifies signs of depression, it may inform the subject and offer the opportunity to complete a short online questionnaire. If the questionnaire, as a second phase of screening, also identifies symptoms of depression, the subject is advised to consult a mental health expert for professional diagnosis and treatment. Through this graded procedure the mental health system may proactively and economically screen for depression in a massive population before more exact, albeit more expensive, steps are taken.
At this point we must add an important qualification. Identifying signs of depression in a text does not necessarily indicate that the subject who wrote the text is depressed. To determine whether the subject is depressed, one has to diagnose him or her. In this study, we did not diagnose real patients but only analyzed textual data. Therefore, we limited ourselves to the modest aim of identifying depression-associated terms in the text. Nevertheless, even this modest aim has relevance, as detailed below.

The paper is organized as follows. First, we briefly present depression as a mood disorder and the difficulties of automatically screening for depression in free texts, that is, unstructured texts written by the subject with no instructions or a specific diagnostic purpose. Second, we present the unique solution that we have developed for addressing this challenge. Third, we introduce "Pedesis", a system for identifying metaphors associated with a target term, and its specific implementation for identifying signs of depression in texts. Finally, we present several tests on two different corpora that provide empirical evidence supporting Pedesis' screening ability.
2. Depression as a mood disorder

By definition, mood disorders have a disturbance of mood as a predominant feature. Depression disorders have different subcategories. In typical mild, moderate, or severe depressive episodes the patient suffers from lowering of mood, reduction of energy, and decrease in activity [6]; the capacity for enjoyment, interest, and concentration is reduced, and marked tiredness is common. Several other characteristics include ideas of guilt or worthlessness and recurrent thoughts of death. In this study, we do not consider the clinical particularities of depression as a mood disorder, but study depression in the most general sense.

Some work has been done on automated analysis of depression in texts [7,8]. However, this work draws on psychiatric sources rather than on free text, uses symptoms as defined by experts in a top-down manner rather than bottom-up signs of depression produced by ordinary people, and was designed with the aim of helping individuals to study their own problems rather than of automatic screening for depression. For instance, Wu et al. [8] propose a framework for mining depressive symptoms and the semantic relations between them. They use the Hamilton depression rating scale (HDRS) as a predefined list of symptoms to identify depression. To evaluate their system they collected consultation records from PsychPark, a virtual psychiatric clinic from Taiwan. Three experienced psychiatrists analyzed a set of 306 records that served as a gold standard. It is not quite clear what exactly they were asked to evaluate, but it is clear that by analyzing consultation records only with the aim of mining symptoms, their system cannot be directly used for screening for depression in free text. Another study by this group [7] aims to present a "novel mechanism to automatically retrieve the relevant consultation documents with respect to users' problems" [7, p. 818] and does not aim at screening for depression either.

Pestian et al. [9] used natural language processing (NLP) tools to classify suicide notes. However, their brief paper does not concern the task of screening for depression; it focuses on suicidal patterns, and depression is only one of the predictors they use. There are several problems with this paper. First, the "ontology" they use to measure depression is neither detailed nor validated. Second, as a criterion they use a very small (random?) sample of 33 notes from "completers", those who completed suicide, and compare them to matched "simulators" who were asked, as a comparison group, to write suicide notes. This comparison is irrelevant, as the diagnostic ability of automatic "screening"
for suicide should identify in advance those who would attempt suicide. In this context, the only relevant evaluation of the system is through the precision and recall of its identification of potential attempters at time T1 against the criterion of their real behavior (attempted suicide or not) in a predefined future time window (e.g., 7 months after signs of suicidal intentions). In other words, such a system should identify through textual analysis whether a subject might attempt suicide in the future and compare its prediction to real future behavior. This methodology of evaluation has not been used in their paper.

Jarrold et al. [10] present another attempt to predict depression based on linguistic features. This study too is limited in many respects. First, it does not provide a new tool for screening, as its first aim is to "replicate findings of prior studies demonstrating that depression is associated with a higher frequency of self-focused words". In addition, it aims to evaluate the diagnostic potential of machine learning using the above linguistic features and to show how the diagnosis of depression varies as a function of conversational context [10, p. 304]. As this short paper includes neither a method section, nor a detailed presentation of the machine learning algorithm, nor results reported in conventional terms of precision and recall, the reader has no means of evaluating the paper's value and contribution to the effort to build a system for automatic screening for depression.
3. Identifying depression in texts

In the context of automatic textual affect sensing, there are a number of classical approaches, such as "keyword spotting", "lexical affinity", "statistical natural language processing", and "hand-crafted models" [11]. When one does not have a large tagged training corpus at hand, and wants to rely on lexical analysis per se, that is, the analysis of words in the text, one faces a major and general problem shared by all lexical approaches to textual affect sensing. The problem is that in natural language there is no one-to-one correspondence between a word and the concept it represents [12]. This means that (1) the same linguistic unit can mean different things in different contexts (i.e., the word-to-concept relation) and (2) the same concept may be represented by different linguistic means (i.e., the concept-to-word relation). For instance, "depression" may be used to denote a mood disorder as well as economic depression. Differentiating different senses of the same word is addressed in the NLP literature by various models of word sense disambiguation (WSD). In this paper, we address the second problem only, that is, the problem of the concept-to-word relation, as the disambiguation problem is solved through other means, as will be described later. In the context of depression, the concept-to-word problem is that although depression is a well-defined mood disorder with well-defined symptoms and criteria [13], the expression of depression by human beings cannot be easily and automatically identified in a text, because there are many linguistic means of expressing depression, means that cannot be easily exhausted by a short list of symptoms identified in advance by a group of experts.
The clinical articulation of depression provides us with a list of symptoms identified through experts' consensus, such as the one summarized in the Diagnostic and Statistical Manual of Mental Disorders (DSM) [13]. This is a top-down articulation of depression through its "symptomatology". However, this list of symptoms cannot represent the whole spectrum of linguistic means ordinary people use to express depression. In this context, it may be of great importance to elucidate the meaning of depression as it is experienced and expressed by people themselves, from the "first-person perspective". In other words, we may be interested in the "phenomenology" [14] of depression rather than
in the symptomatology of depression, and in a bottom-up process of elucidating the meaning of depression by harvesting "collective intelligence" [15], rather than in a top-down process determined by experts. Such an approach may provide us with a better lexicon for identifying depression in texts.

We must emphasize that we have nothing against the symptoms identified by experts. In fact, we accept them as those guiding the diagnosis of depression. However, for the specific task of automatically screening for depression through the analysis of free text, we believe one should start "bottom-up" from the analysis of the way ordinary people describe their experience. By identifying the variety of ways people describe experience, we may gain a more comprehensive picture of the linguistic means people use to address the experience of depression. In this way, we may build a lexicon that approximates the variety of means people use to describe the same experience, and through this description solve the concept-to-word relation problem.

To address the challenge of building a depression lexicon, we must (1) bridge the gap between the linguistic and the conceptual realms of depression and (2) find a way to build the lexicon in a bottom-up manner. Bridging the gap between the linguistic and conceptual realms addresses the difficulty of concept-to-word mappings. Building the lexicon in a bottom-up manner is important for validly representing the private experience of people. It should be noted that the idea of bottom-up construction of an emotional lexicon is not in itself new. For instance, Mohammad and Turney [16] recently developed EmoLex using Amazon's Mechanical Turk. That lexicon was constructed using the judgments of people who were paid for the task.
However, the approach we present for bottom-up construction of the lexicon is different, because we harvest the Web for relevant information (detailed below) without using a select group of judges and without asking for their conscious judgment of a word. Neuman and Nave [5] argued that one way of addressing the challenge of identifying the meaning of a concept is by analyzing conceptual mapping relations in which the target concept (e.g., depression) is embedded. A rather common psycholinguistic vehicle for mapping between concepts is the conceptual metaphor, which people use to describe one concept in terms of another. For instance, the metaphor "Anger is like a volcano" involves a mapping function from "volcano" to "anger". It was further argued that by carefully studying the metaphors people naturally and explicitly use to describe a given term, we may address the concept-to-word mapping problem: the metaphors cover a variety of concept-to-language mappings and may therefore help us identify the variety of linguistic means people use to describe the same experience, whether depression or anger.

Let us explain this point further. The experience of depression may be represented in a text through a variety of linguistic means. One may say, for instance, "I feel that I'm being absorbed by a black hole". This utterance includes no explicit reference to depression, nor to the depression symptoms used by mental health practitioners, at least as summarized in the DSM. The question is how we can infer the existence of depression in the text given the variety of linguistic means one may use to express it. The answer we give in this paper is in terms of metaphorical analysis.
Our basic argument is that by collecting a large number of metaphors people naturally use to describe the experience of depression, we may identify the “semantic space” of depression and use this space for building a lexicon that will be used for screening for depression. Although this is our main assumption/argument, one may consider it a hypothesis that is confirmed by the current study. The next sections present the construction of this lexicon and the way it was used for screening for depression.
4. Pedesis: a system for building the depression lexicon

Based on Neuman and Nave [5], Pedesis [17] is a system for identifying the lexicon representing a target concept by harvesting the Web for metaphors in which the target term (e.g., depression) is embedded, and for identifying signs of depression in text. The system was built in several phases through automatic and manual analysis of textual data. As the system comprises thousands of lines of code, here we present only a simple and schematic description of its construction and structure. For identifying depression in texts the reader may use the lexicon produced by Pedesis, which is available from the authors upon request.

Pedesis is written in C# and takes as input a target concept T. This is the concept we would like to understand, in our case "depression". The system uses a search engine and looks for web pages containing the expression "T is like *", where * is a wildcard. Although this pattern does not exhaust the richness of metaphorical expressions, it provides a useful starting point. One can argue about whether the simple pattern "X is like Y" identifies metaphorical patterns or only similes indicating surface-structure similarities. The full theoretical justification for using this pattern is presented in Neuman and Nave [5]; moreover, manually analyzing thousands of sentences identified by the above pattern, we found that the overwhelming majority concerned metaphors rather than surface-structure similarities or similes per se. By metaphor we mean any deep-structure mapping between concepts according to the norms accepted in the study of metaphor [18–21]. For this specific paper we used depression as the target concept and sought the pattern "depression is like *". Again, this is a simple pattern that does not cover all the metaphorical relations associated with depression.
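As a concrete illustration of this harvesting step, the sketch below applies the "depression is like *" pattern to a few sentences and pulls out the candidate source domains. The function name, the regular expression, and the sample sentences are our own illustration, not part of Pedesis.

```python
import re

# Hypothetical sketch of the "T is like *" harvesting step: given sentences
# retrieved from the Web, keep those matching the simile pattern and pull out
# the candidate source domain (the text after "is like").
PATTERN = re.compile(r"\bdepression is like\s+(.+?)[.!?]", re.IGNORECASE)

def extract_sources(sentences):
    """Return the raw source-domain strings from sentences matching the pattern."""
    sources = []
    for s in sentences:
        m = PATTERN.search(s)
        if m:
            sources.append(m.group(1).strip())
    return sources

sentences = [
    "Depression is like living in a room of pain.",
    "She said depression is like a big cage!",
    "The Great Depression was a severe economic downturn.",
]
print(extract_sources(sentences))  # → ['living in a room of pain', 'a big cage']
```

Note that the third sentence is skipped because it does not match the simile pattern at all; filtering out economic senses of "depression" is a separate step in the system.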
During the period the paper was under review, we developed a state-of-the-art algorithm for differentiating between metaphorical and non-metaphorical language [22]. This algorithm may be incorporated in future developments of the system, although the simple pattern we used here provides good results. As "depression" may also denote economic depression, we did not retrieve Web pages in which the string "econo" appeared within a window of 5 words to the left or right of our target term. As we manually verified, this filtering heuristic prevented the inclusion of metaphors associated with economic depression. We used Bing, Microsoft's search engine [23], and for practical reasons arbitrarily limited our search to 20,000 search results. All the unique URLs were gathered, added to the database (DB), and marked for download. The sentence module runs on each file, extracts the sentences that include the search term, and marks them for the NLP module in the DB, applying a procedure to determine sentence boundaries. The NLP module parses the sentences using the Stanford parser and produces a dependency representation of each sentence [24]. A dozen rules are used for extracting the relevant domain from each dependency representation. For instance, the sentence "Depression is like living in a room of pain" is represented as follows:

nsubj(is-2, depression-1)
prepc_like(is-2, living-4)
det(room-7, a-6)
prep_in(living-4, room-7)
prep_of(room-7, pain-9)

The relevant domain is "living in a room of pain". First, we consider each line as a relation between two arguments, ARG1 and ARG2. For instance, prep_in(living-4, room-7) is a relation (i.e., prep_in) between ARG1 (living) and ARG2 (room). In the above case, we apply the following rule. If prepc_like is followed by another
22
Y. Neuman et al. / Artificial Intelligence in Medicine 56 (2012) 19–25
prep relation (excluding prep_like itself), and ARG2 of prepc_like is the same as ARG1 of the second relation, then include ARG1 and ARG2 of the second prep relation in the domain, with their relation specification (e.g., "in" in prep_in) in between. This process is repeated for any prep relation that (1) is located under the prepc_like relation and (2) is derived from it through a chain of shared arguments. This rule extracts the following domains and combines them in the following steps:

1. Living in room
2. Room of pain
3. Final domain: Living in room of pain

We manually identified, in the domains of depression and in their explanations, words and phrases that represent the experience of depression. For brevity, phrases of length 1 (e.g., cancer), length 2 (e.g., black hole), or longer will all be called "phrases". For instance, a subject who used the metaphor "depression is like a big cage" explained it as follows: "You are locked inside a cage where you feel no happiness". From this explanation we extracted the phrases "locked" and "no happiness" and added them to our lexicon. Another subject used the metaphor "depression is like a cold darkness" and explained it by saying "You feel scared and lonely". From this explanation we extracted the phrases "scared" and "lonely". These phrases were entered into the Corpus of Contemporary American English (COCA) [25], and for each of our depression words and phrases we retrieved its synonyms and the synonyms of those synonyms, in other words, the second-order synonyms. For verbs we retrieved the lemmatization of each word and its synonyms. For the phrases we used the COCA to design paraphrases by substituting the nouns in the phrases with their synonyms. Using the above procedure, we constructed a lexicon of 1723 phrases associated with depression.
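To make the chaining rule concrete, here is a minimal pure-Python sketch that walks the collapsed dependencies of the example sentence and assembles the source domain. The triple representation and the function are our own mock of the parser output and rule, not the system's C# code.

```python
# Sketch of the domain-extraction rule over collapsed Stanford dependencies.
# Each dependency is a (relation, ARG1, ARG2) triple; the parse below is the
# one given in the text for "Depression is like living in a room of pain".
deps = [
    ("nsubj", "is", "depression"),
    ("prepc_like", "is", "living"),
    ("det", "room", "a"),
    ("prep_in", "living", "room"),
    ("prep_of", "room", "pain"),
]

def extract_domain(deps):
    """Chain prep_* relations starting from the ARG2 of prepc_like."""
    # the dependent of prepc_like seeds the source domain
    frontier = next(arg2 for rel, arg1, arg2 in deps if rel == "prepc_like")
    words = [frontier]
    grew = True
    while grew:
        grew = False
        for rel, arg1, arg2 in deps:
            # follow a prep relation whose ARG1 matches the current chain end
            if rel.startswith("prep_") and arg1 == frontier:
                words += [rel.split("_", 1)[1], arg2]  # e.g. "in", "room"
                frontier = arg2
                grew = True
    return " ".join(words)

print(extract_domain(deps))  # → "living in room of pain"
```

The first pass picks up "living in room" via prep_in, the second "room of pain" via prep_of, yielding the combined final domain.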
The most frequent phrases found in the Web search were: dark (14), disease (9), pain (8), quicksand (6), black hole (5), cancer (5), box (5), emotional (5), life (4), death (4), black (4), and cloud (4). This lexicon is supposed to provide a good approximation of the semantic space of "depression". We are well aware that by identifying the various domains of the metaphor "Depression is like *", we have encountered a long tail of idiosyncratic metaphors. However, dominant metaphors were identified and idiosyncratic metaphors were filtered out through expert analysis. Future development of the system should include a component through which metaphorical domains are automatically clustered and idiosyncratic metaphors are removed. In sum, our lexicon construction procedure involves the following steps:

1. Perform a Web search for "depression is like *".
2. Take the text from the first 20,000 documents returned by the search engine and apply a parser. From the dependency representations extract "domains" (i.e., phrases or single words that are the source of "depression").
3. Build a list of phrases from the output of step 2 and generate a lexicon from (a) this list, plus (b) the synonyms and the synonyms of those synonyms, and (c) the relevant phrases that appear in the explanations of the metaphors.

The lexicon we produced deliberately excluded the string "depress". The reason is simple: we wanted to test our ability to screen for depression in free text without any explicit expression of depression in the text. It is quite trivial that a text containing the word "depressed" associated with the first-person pronoun (e.g., "I am depressed") may signal depression. We are well aware, though, that a lexicon that also includes explicit symptoms of depression as well as the string "depress" may better screen for
depression. In this sense, the results presented in this paper are an underestimation of our ability to screen for depression.

5. Study 1: the mental help site corpus

If our depression lexicon has screening ability, then at a minimum it should efficiently identify whether depression is a topic in a text. To test this ability, we used a corpus of questions addressed to three columnists at Mentalhelp.net [26], a web site that aims to promote mental health and wellness and gets 1.2–1.4 million page views each month [email correspondence with Mark Dombeck, Mentalhelp.net, June 18, 2010].

5.1. Method

We analyzed three columns. "Ask Dr. Dombeck" (N = 484 questions) and "Ask Dr. Schwartz" (N = 470 questions) are two columns that respond to questions about psychotherapy and mental health problems from the clinical perspective of these experts. "Ask Anne" (N = 642 questions) is a relationship advice column written under the pseudonym "Anne". Overall, we automatically analyzed 1596 questions addressed to these columnists. It must be emphasized again that we did not analyze the texts to identify signs of depression that may indicate a depressive mood of the writer, but more specifically whether the question concerns depression as a topic.

For this task, we used a very simple and straightforward criterion of depression as a topic. If the question included the string "depress-" (e.g., depressed or depressive), we assumed the question concerns issues of depression. A question that included "depress-" was marked as D, and ¬D otherwise. A preliminary analysis of a sample of the corpus verified the validity of our criterion. In each corpus we randomly sampled 10% of the questions marked as D and 10% of the questions marked as ¬D. Through manual expert analysis, we found that almost all of the questions marked as D dealt with depression in one way or another. In contrast, in the ¬D category only 15% (Ask Dr. Dombeck), 12% (Ask Dr.
Schwartz), and 8% (Ask Anne) of the questions dealt with depression. These findings suggest that our criterion is biased against Pedesis and leads to an underestimation of the system's performance. A majority of the questions concerning depression (75% in Ask Dr. Dombeck and Ask Dr. Schwartz, and 69% in Ask Anne) dealt with the subject who asked the question; the rest concerned family members and friends.

5.2. Analysis and results

To screen for depression in the questions, we used only one measure: the number of different phrases from our depression lexicon appearing in the question. This measure was titled DepScore. The process was fully automatic. First, we marked the questions as D or ¬D. Then we automatically analyzed the questions using the above measure. Each corpus was analyzed separately. To test the screening efficiency of the system, we used the median of DepScore: a value above the median scored "1" and a value equal to or below the median scored "0". Using the chi-square test on the cross-tabulated data, statistically significant associations between DepScore and depression were found for the corpora of Anne (χ2 = 20.89, p < .001), Dombeck (χ2 = 12.11, p = .001), and Schwartz (χ2 = 11.56, p = .001). As medical/psychological diagnosis involves Bayesian inference, we organized our data in a 2 × 2 table of depression in the questions (D vs. ¬D) against the categorical test result of our diagnosis ("1" or "0"). The results of the Bayesian analysis are presented in Table 1.
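The scoring and testing steps of this study can be sketched as follows. The lexicon fragment and the example values are illustrative only (the paper does not publish the cross-tabulated counts for Study 1); the chi-square helper implements the standard Pearson statistic for a 2 × 2 table.

```python
# Sketch of the Study 1 analysis: binarize DepScore at the median, then test
# association with the D / not-D criterion via a 2x2 chi-square. The lexicon
# and the example scores below are illustrative, not the paper's data.
def dep_score(text, lexicon):
    """Number of *different* lexicon phrases appearing in the text."""
    t = text.lower()
    return sum(1 for phrase in lexicon if phrase in t)

def median_split(scores):
    """Score '1' above the median, '0' at or below it."""
    s = sorted(scores)
    n = len(s)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    return [1 if x > median else 0 for x in scores]

def chi2_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 table [[a, b], [c, d]] (no correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

lexicon = ["black hole", "quicksand", "lonely", "no happiness"]
print(dep_score("I feel lonely, like I'm in a black hole.", lexicon))  # → 2
print(median_split([0, 1, 3, 4, 2]))  # → [0, 0, 1, 1, 0]
print(chi2_2x2(30, 10, 10, 30))       # a strongly associated table → 20.0
```

A significance test on the resulting statistic would then be read off the chi-square distribution with one degree of freedom, as in the corpus analyses reported above.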
Table 1
Bayesian analysis of Pedesis performance.

Corpus                                 Prior probability (%)   Posterior probability (%)   95% confidence interval
Corpus 1: Ask Dr. Schwartz (N = 470)   26                      34                          30–39
Corpus 2: Ask Dr. Dombeck (N = 484)    38                      48                          43–54
Corpus 3: Ask Anne (N = 642)           18                      27                          25–34
The prior probabilities for depression in the questions are 26% (Schwartz), 38% (Dombeck), and 18% (Anne). The probability of a question dealing with depression given the system's positive test result is 27% for Anne (a 9% improvement over the prior probability), 48% for Dombeck (10% improvement), and 34% for Schwartz (8% improvement). Therefore, the average improvement in prediction over the prior is 9%. These results indicate not only a statistically significant association between DepScore and the criterion but also an average improvement in diagnosis over the base rate. With these results we continue to the next study.

6. Study 2: the blog authorship corpus

The blog authorship corpus was built by Schler et al. [27] for academic use and is available from these authors upon request. The corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. It incorporates a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7250 words per person. As posts may include quotations from previous posts, we cleaned the posts to ensure that no post includes text that appeared in previous posts. Out of this corpus, we selected posts with a number of words ranging from 10 to 200 and removed "garbage" posts. Overall we had 398,691 posts by 17,031 bloggers.

6.1. Method

To test the screening ability of Pedesis, we identified 83 posts expressing signs of depression. The posts were identified by searching for the root depress- and by manually verifying that the post's author testifies that he or she is in depression. In addition, we manually identified 100 posts in which no sign of depression was evident. These posts were used as a control group.

6.2. Analysis and results

We used a binary logistic regression analysis with depression as the criterion and DepScore as the independent variable. The regression was statistically significant (χ2 = 124.81, p < .001), with an 84.2% correct classification rate.
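A minimal sketch of this kind of analysis follows, with a one-predictor logistic model fit by plain gradient descent on synthetic DepScore values. The data, learning rate, and iteration count are our own illustration, not the 183 hand-labelled posts.

```python
import math

# Sketch of the Study 2 analysis: a one-predictor binary logistic regression
# (depressed vs. not) fit by gradient descent. The synthetic scores below are
# illustrative; the paper's data are the 83 + 100 hand-labelled blog posts.
def fit_logistic(xs, ys, lr=0.1, iters=3000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iters):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted probability
            gw += (p - y) * x
            gb += (p - y)
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def accuracy(w, b, xs, ys):
    """Correct classification rate at the 0.5 probability threshold."""
    preds = [1 if 1.0 / (1.0 + math.exp(-(w * x + b))) > 0.5 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

# hypothetical DepScore values: depressed posts (y = 1) tend to score higher
xs = [0, 1, 1, 2, 2, 3, 5, 6, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print(round(accuracy(w, b, xs, ys), 2))
```

A statistics package would report the model chi-square and significance alongside the classification rate; the sketch only shows the fit and the rate itself.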
The 84.2% rate is a 29.6% improvement over the base rate (54.6%). To evaluate the diagnostic performance of Pedesis, we used the mean of DepScore and categorized our data into high (above the average) and low (equal to or below the average). We then cross-tabulated the results against the criterion (1 = depressed, 0 = non-depressed); see Table 2.

Table 2
Cross-tabulated data of Pedesis prediction by criterion.

                      Criterion
Pedesis prediction    Non-depressed    Depressed
Non-depressed         97               26
Depressed             3                57

Examining the diagnostic ability of Pedesis, we found that its sensitivity was 0.69 and its specificity 0.97. The prior probability for depression was 45%. The positive posterior probability was 95% (95% CI 86–98) and the negative posterior probability 21% (95% CI 16–27). In other words, ∼1 in 1.1 of the cases identified by Pedesis as having signs of depression were depressed, and ∼1 in 1.3 of the cases where Pedesis identified the text as having no signs of depression (i.e., a negative test) were non-depressed. These results present clear evidence of the screening performance of Pedesis.

Latent semantic analysis (LSA) is a powerful theory and methodology for identifying the semantic space of a concept [28,29]. Therefore, it was a natural choice to construct a depression lexicon using LSA. We used the online version of LSA [30] and the nearest-neighbor procedure to extract the maximum number (1500) of words closest to the word "depressed". As the word "depression" may be associated with economic depression, we used only the word "depressed". Using the LSA lexicon and exactly the same procedure we used for Pedesis, we obtained almost identical results, except that while Pedesis reached 0.69 sensitivity, LSA reached only 0.59.

The above results provide empirical support for the diagnostic ability of Pedesis. This ability may be improved by including additional variables in the regression. For instance, as the use of the first-person pronoun "I" has been argued to differentiate between depressive and non-depressive people [31], we used the percentage of first-person pronouns ("Percent I") in each post as an additional variable. When "Percent I" was included in the regression analysis, Pedesis' prediction improved to a 90.7% correct classification rate (χ2 = 162.77, p < .001). Adding "Percent I" to LSA improved the prediction to a lesser extent (χ2 = 160.49, p < .001; 87.4% correct classification rate). It must be noted that the comparison to LSA is somewhat "unfair", as the seed word for the LSA lexicon is "depressed", which has a direct reference to the mental situation.
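The diagnostic quantities reported for this study follow directly from the Table 2 counts; a minimal sketch of the arithmetic:

```python
# Recomputing the diagnostic quantities of Study 2 from the Table 2 counts
# (criterion in columns, Pedesis prediction in rows).
tn, fn = 97, 26   # predicted non-depressed: criterion non-depressed / depressed
fp, tp = 3, 57    # predicted depressed:     criterion non-depressed / depressed

total = tn + fn + fp + tp
prior = (tp + fn) / total            # P(depressed) in the sample
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)                 # positive posterior probability
p_dep_given_neg = fn / (fn + tn)     # negative posterior probability

print(round(prior, 2), round(sensitivity, 2), round(specificity, 2),
      round(ppv, 2), round(p_dep_given_neg, 2))
# → 0.45 0.69 0.97 0.95 0.21, matching the values reported in the text
```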
If we add to the binary logistic regression analysis a measure of whether the string "depress" originally appeared in the post, then Pedesis yields a 92.9% correct classification rate (χ2 = 183.27, p < .001) and outperforms the prediction yielded by LSA.

7. Study 3: testing Pedesis through human experts

We have shown that our depression lexicon provides significant screening ability. Another way of testing Pedesis is to define a different measure of depression and test it against experts' judgments.

7.1. Method

Based on our "depression lexicon", we built a different depression measure, Dep Scale*, computed as the sum of several variables. The first variable is the percentage of phrases drawn from our lexicon out of the total number of phrases in the text. The second variable is the percentage of different phrases from our lexicon out of the total number of phrases in the text; it was aimed at solving the problem of repetition evident in posts (e.g., "I'm sad, I'm sad, I'm sad . . ."). We also used the percentage of first-person pronouns in each post as a third variable. In addition, we included five word categories of the linguistic inquiry and word count (LIWC) [32,33]. These categories were intuitively considered as relevant for the
Table 3
Precision and recall in identifying depressive posts.

          Recall    Precision
Judge 1   69        98
Judge 2   72        87
Judge 3   78        49
Judge 4   85        76
Average   76        78
screening procedure. The categories are: negative emotion, present (i.e., use of the present tense), sad, insight, and feel. Dep Scale* was defined as a linear function of weighted variables, with the weight of each variable set according to our intuitive evaluation of its relative importance in understanding depression.

7.2. Analysis and results

We automatically scored each post in the corpus and then applied an extreme-group design as follows. We ranked the posts according to Dep Scale* and selected the 1000 posts that scored highest. Next, we sorted this 1000-post sample and identified the 100 posts that scored highest on Dep Scale* (the High level of Dep Scale* list, abbreviated as the H list) and the 100 posts that scored lowest (the Low level of Dep Scale* list, abbreviated as the L list). For instance, three posts that scored high on the Dep Scale* measure are:

1. "I feel completely lost not to mention alone I've never felt more alone".
2. "I feel so sad and I can't figure out why help".
3. "Some days I wish I could just I don't know escape everything maybe just go for a . . ."

We hypothesized that if Pedesis can validly identify signs of depression by differentiating between the H list and the L list, then it should present high levels of recall when tested against human judgment. Four psychologists participated in the experiment designed to test this hypothesis. Three of the psychologists each have more than 20 years of experience; the fourth graduated several years ago. Two of the psychologists are women. Each psychologist received a booklet with a random mixture of the posts scored highest and lowest by Pedesis and was asked to judge 175 posts; the remaining 25 posts were duplicates included to assess the internal reliability of the judgments. The posts were cleaned of all forms of the root "depress" to avoid trivial judgment.
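The Dep Scale* construction described above can be sketched as a weighted linear combination of the three text-based variables. The mini-lexicon and the weights below are invented for illustration; the paper's lexicon was expert-built and its weights were set by the authors' intuitive judgment, and the LIWC category counts are omitted here because LIWC is a proprietary resource.

```python
# Illustrative Dep Scale*-style score: lexicon coverage, lexicon diversity,
# and first-person-pronoun rate, each expressed as a percentage of tokens.
# DEPRESSION_LEXICON and WEIGHTS are toy stand-ins, not the paper's values.
import re

DEPRESSION_LEXICON = {"sad", "lost", "alone", "empty", "hopeless"}
WEIGHTS = {"pct_lexicon": 2.0, "pct_distinct": 3.0, "pct_i": 1.0}

def dep_scale(text: str) -> float:
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = [t for t in tokens if t in DEPRESSION_LEXICON]
    pct_lexicon = 100 * len(hits) / len(tokens)        # variable 1: coverage
    pct_distinct = 100 * len(set(hits)) / len(tokens)  # variable 2: penalizes repetition
    pct_i = 100 * tokens.count("i") / len(tokens)      # variable 3: first-person pronoun
    return (WEIGHTS["pct_lexicon"] * pct_lexicon
            + WEIGHTS["pct_distinct"] * pct_distinct
            + WEIGHTS["pct_i"] * pct_i)
```

Because repetition raises coverage but not diversity, a post repeating one lexicon word scores lower than a post using several distinct lexicon words, which is the motivation given for the second variable. The extreme-group design then ranks all posts by this score and keeps the top and bottom 100 of the 1000 highest-scoring posts.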
The psychologists, who were blind to the experiment's aims and hypotheses, were asked to rate the level of depression expressed by each post on a 4-point Likert scale ("0" – not depressed, "1" – mildly depressed, "2" – moderately depressed, and "3" – severely depressed). These categories correspond to the categories used by the DSM. Inter-judge reliability was high by the norms of psychology (Cronbach's alpha = .86).

We considered each expert as representing the "actual" state of the world and as a valid measure of depression, and measured Pedesis' performance against this criterion. First, we grouped depression scores 1, 2, and 3 into a single category indicating that depression is evident in the post. Next, we compared the cases in which Pedesis' assignment (High or Low) matched the posts judged by the experts as "Depressed" or "Non-depressed". The system's performance in terms of precision and recall is shown in Table 3. On average, Pedesis performed with 78% precision and 76% recall. The judges classified, on average, 80 posts as "depressed" and 95 as "non-depressed".
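The evaluation protocol just described, collapsing expert ratings 1–3 into a single "depressed" category and scoring the system's H/L assignment against each judge, can be sketched as follows. The function name and the toy data are illustrative, not from the paper.

```python
# Sketch of the Study 3 scoring: expert ratings on the 0-3 scale are
# collapsed to a binary criterion (1-3 = depressed), and the system's
# H-list membership is scored against it per judge.

def precision_recall(expert_ratings, system_high):
    """expert_ratings: per-post 0-3 scores from one judge;
    system_high: per-post booleans, True if the system placed the post
    on the H (high Dep Scale*) list."""
    depressed = [r >= 1 for r in expert_ratings]  # collapse ratings 1, 2, 3
    tp = sum(d and h for d, h in zip(depressed, system_high))
    fp = sum((not d) and h for d, h in zip(depressed, system_high))
    fn = sum(d and (not h) for d, h in zip(depressed, system_high))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: 4 posts rated by one judge.
ratings = [3, 0, 2, 1]            # depressed, not, depressed, depressed
high = [True, True, True, False]  # system H-list membership
p, r = precision_recall(ratings, high)  # tp=2, fp=1, fn=1: both 2/3
```

Averaging such per-judge scores over the four judges yields the figures reported in Table 3.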
8. Discussion

To date, the judgment of a mental health expert has no technological substitute. However, under the constraints imposed by reality, a massive, personal process of diagnosis cannot be afforded in a reasonable time and at low cost. In this context, creative technological solutions should be considered, at least for a preliminary screening process. Technologies of voice recognition [34] and automatic text analysis, such as the one presented in this paper, may be used to improve the performance of the health system. These technologies should not be considered a magic substitute for the human expert, but as part of incremental efforts to improve the screening process promoted by the US Preventive Services Task Force [35]. In this context, this paper adds a tool to the arsenal of public health systems.

As we mentioned in the introduction, Pedesis may be freely given to blog users, who may use it as a first tool for screening for depression. In addition, mental health agencies may use it to identify depression as a topic in an online consultation service and immediately refer a question dealing with depression to the relevant expert. The system may also be used to monitor the level of depression in a given population. Under conditions of economic crisis, social turbulence, or armed conflict, it is important for decision makers to monitor the depression level of the population and to offer relevant solutions for optimal coping with the problem.

We are well aware of the gap between the "virtual" world of social media and the "real" world, but as a working assumption, and with the necessary qualifications, we see no reason to ignore signs of depression reported by a subject even if these signs appear through the platform of social media.
Whether these signs are valid signs of depression is an open question, which is why we offered in the introduction a stepwise methodology starting with our automated system, moving to the use of the questionnaire, and ending with the human expert. Such a graded procedure may contribute to the economical and valid diagnosis of depression. Despite the simplicity of the idea presented in this paper, it has never been applied or tested before. Although the system we present is simple and effective, it relies on simple lexical analysis only. It could probably be improved by using pre-defined symptoms of depression and a more complex algorithm for metaphorical analysis. Recently, we developed a state-of-the-art algorithm for identifying metaphorical language and other technologies for semi-automatic analysis of data in the clinical setting [36]. These tools can greatly improve the automatic screening of depression and provide a supporting tool for the indispensable work of psychiatrists, clinical psychologists, and other mental health practitioners.
Acknowledgments

This research was supported by the Israel Ministry of Defense. The authors would like to thank Peter Turney for his constructive reading of a previous draft; Mark Dombeck and Mentalhelp.net for their permission to use the questions addressed to their site; Itzik Dayan, Neil Natanel, and Dror Ofer for their good advice; Doron Havazelet for his trust and support; and the anonymous reviewers for their helpful comments.
References

[1] Pratt LA, Brody DJ. Depression in the United States household population, 2005–2006. NCHS Data Brief no. 7; September 2008.
[2] Gilbody S, Sheldon T, House A. Screening and case-finding instruments for depression: a meta-analysis. Canadian Medical Association Journal 2008;178:997–1003.
[3] Mitchell AJ. Why do clinicians have difficulties detecting depression? In: Mitchell AJ, Coyne JC, editors. Screening for depression in clinical practice: an evidence-based guide. Oxford: Oxford University Press; 2009. p. 57–75.
[4] Houston TK, Cooper LA, Thi Vu H, Kahn J, Toser J, Ford DE. Screening the public for depression through the internet. Psychiatric Services 2001;52:362–7.
[5] Neuman Y, Nave O. Metaphor-based meaning excavation. Information Sciences 2009;179:2719–28.
[6] World Health Organization. Multiaxial presentation of the ICD-10 for use in adult psychiatry. Cambridge, UK: Cambridge University Press; 1997.
[7] Yu L-C, Wu C-H, Jang F-L. Psychiatric document retrieval using a discourse-aware model. Artificial Intelligence 2008;173:817–29.
[8] Wu C-H, Yu L-C, Jang F. Using semantic dependencies to mine depressive symptoms from consultation records. IEEE Intelligent Systems 2005;20:50–8.
[9] Pestian JP, Matykiewicz P, Grupp-Phelan J. Using natural language processing to classify suicide notes. In: BioNLP. 2008. p. 96–7.
[10] Jarrold WL, Peintner B, Yeh E, Krasnow R, Javitz HS, Swan GE. Language analytics for assessing brain health: cognitive impairment, depression and pre-symptomatic Alzheimer's disease. Brain Informatics: Lecture Notes in Computer Science 2010;6334:299–307.
[11] Liu H, Lieberman H, Selker T. A model of textual affect sensing using real-world knowledge. In: International conference on intelligent user interfaces. 2003. p. 12–5.
[12] Neuman Y. Reviving the living: meaning making in living systems. Oxford, UK: Elsevier; 2008.
[13] American Psychiatric Association (APA). Diagnostic and statistical manual of mental disorders. 4th ed. Washington, DC: APA; 1994.
[14] Petitmengin C. Describing one's subjective experience in the second person: an interview method for the science of consciousness. Phenomenology and the Cognitive Sciences 2006;5:229–69.
[15] Levy P. Collective intelligence. New York: Basic Books; 1997.
[16] Mohammad SM, Turney PD. Emotions evoked by common words and phrases: using Mechanical Turk to create an emotion lexicon. In: Proceedings of the NAACL HLT workshop on computational approaches to analysis and generation of emotion in text. 2010. p. 26–34.
[17] Neuman Y, Cohen Y, Kedma G, Nave O. Using web-intelligence for excavating the emerging meaning of target-concepts. In: Proceedings of the IEEE/WIC/ACM international conference on web intelligence. 2010. p. 22–5.
[18] Danesi M. Metaphorical "networks" and verbal communication: a semiotic perspective on human discourse. Sign Systems Studies 2003;31:341–63.
[19] Gentner D, Bowdle B, Wolff P, Boronat C. Metaphor is like analogy. In: Gentner D, Holyoak KJ, Kokinov BN, editors. The analogical mind: perspectives from cognitive science. Cambridge, MA: MIT Press; 2001. p. 199–253.
[20] Lakoff G, Johnson M. Philosophy in the flesh. New York: Basic Books; 1999.
[21] Sebeok TA, Danesi M. The forms of meaning: modeling systems theory and semiotic analysis. Berlin: Walter de Gruyter; 2000.
[22] Turney P, Neuman Y, Assaf D, Cohen Y. Literal and metaphorical sense identification through concrete and abstract context. In: Proceedings of the 2011 conference on empirical methods in natural language processing. 2011. p. 680–90.
[23] www.Bing.com [accessed: 01.09.11].
[24] de Marneffe M-C, Manning CD. The Stanford typed dependencies representation. In: COLING workshop on cross-framework and cross-domain parser evaluation. 2008. p. 1–8.
[25] Davies M. The 385+ million word Corpus of Contemporary American English (1990–2008+). International Journal of Corpus Linguistics 2009;14:159–90.
[26] http://www.mentalhelp.net [accessed: 01.09.11].
[27] Schler J, Koppel M, Argamon S, Pennebaker J. Effects of age and gender on blogging. In: Proceedings of the 2006 AAAI spring symposium on computational approaches for analyzing weblogs. 2006. p. 191–7.
[28] Landauer T, Foltz PW, Laham D. An introduction to latent semantic analysis. Discourse Processes 1998;25:259–84.
[29] Landauer T, McNamara DS, Dennis S, Kintsch W, editors. Handbook of latent semantic analysis. Hove, UK: Psychology Press; 2007.
[30] www.lsa.colorado.edu [accessed: 01.09.11].
[31] Rude S, Gortner E-M, Pennebaker J. Language use of depressed and depression-vulnerable college students. Cognition and Emotion 2004;18:1121–33.
[32] Pennebaker JW, Francis ME. Linguistic inquiry and word count. Mahwah, NJ: LEA; 2001.
[33] Tausczik YR, Pennebaker JW. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 2010;29:24–54.
[34] Rogers WH, Lerner D, Adler DA. Technological approaches to screening and case finding for depression. In: Mitchell AJ, Coyne JC, editors. Screening for depression in clinical practice: an evidence-based guide. Oxford: Oxford University Press; 2009. p. 143–54.
[35] Task Force Recommends Screening Adolescents for Clinical Depression [accessed: 01.09.11].
[36] Neuman Y, Assaf D, Cohen Y. A novel methodology for identifying emerging themes in small group dynamics. Bulletin of the Menninger Clinic 2012;76:53–68.