European Journal of Operational Research 202 (2010) 789–801
Stochastics and Statistics
On the predictive ability of narrative disclosures in annual reports

Ramji Balakrishnan a,*, Xin Ying Qiu b, Padmini Srinivasan c

a The University of Iowa, Tippie College of Business, Iowa City, IA 52246, USA
b Christopher Newport University, Luter School of Business, Newport News, VA 23606, USA
c The University of Iowa, Computer Science Department and Tippie College of Business, Iowa City, IA 52246, USA
Article info

Article history: Received 6 February 2009; Accepted 18 June 2009; Available online 30 June 2009

Keywords: Economics; Finance; Text mining; Capital markets
Abstract

We investigate whether narrative disclosures in 10-K and 10-K405 filings contain value-relevant information for predicting market performance. We apply text classification techniques from computer science to machine-code the text disclosures in a sample of 4280 filings by 1236 firms over five years. Our methodology develops a model using documents and actual performance for a training sample. This model, when applied to documents from a test set, leads to performance predictions. We find that a portfolio based on model predictions earns significantly positive size-adjusted returns, indicating that narrative disclosures contain value-relevant information. Supplementary analyses show that the text classification model captures information not contained in the document-level features of clarity, tone and risk sentiment considered in prior research. However, we find that the narrative score does not provide information incremental to traditional predictors such as size, market-to-book and momentum, but rather affects investors' use of price momentum as a factor that predicts excess returns.

© 2009 Elsevier B.V. All rights reserved.
1. Introduction

A primary use of accounting reports is to help investors evaluate an organization's financial prospects. Narratives are an important information source for analysts and a critical component in annual reports (Rogers and Grant, 1997). A majority of the financial analysts surveyed by the AIMR (2000) indicate that management discussion is a very or extremely important item when evaluating firm value. However, perhaps because of the relative costs of gathering and analyzing numerical versus textual data, most academic research has focused on the quantitative disclosures in annual reports. Moreover, because of flexibility in framing these disclosures with respect to choice of words and tone in addition to content, it is likely that the information in narratives is not fully impounded into contemporaneous prices (see Li, 2006 for additional observations in this regard).

In this study, we modify and apply techniques from the text classification branch of computer science to the narrative disclosures in 10-K and 10-K405 filings in order to predict market returns. In a training sample, we pair the narrative disclosure in the 10-K documents with the subsequent performance and use standard text classification techniques to build a predictive model. In particular, we define out- and under-performing firms as the top (bottom) 25% of all firms, and group firms into three classes (out-performing, average and under-performing) based on their actual performance from period t to t+1. We then use the text disclosed in period t (which relates to performance for the period t−1 to t) and the performance class to build a model that associates the text in a 10-K report for a period with the next period's performance. This automated text-classification exercise, which employs many features such as the frequency and count of words that are similar (dissimilar) across documents, yields a model that can classify the text for an arbitrary firm as to its predicted performance.

We test the model's predictive ability by applying it to the documents for period t+1 (this testing sample of documents relates to performance for t to t+1) to predict performance for the period t+1 to t+2. We use these predictions to form and maintain a portfolio. Specifically, for each year, our equally weighted portfolio buys stock in firms we predict to out-perform the market and sells predicted under-performers. The magnitude of the size-adjusted returns for the portfolio is then a joint test of the presence of value-relevant information in narrative disclosures and our ability to systematically extract it. (Of course, like the anomalies literature, our analysis also assumes that the information is not impounded immediately into prices.) Our portfolio yields an average size-adjusted return of 12.16% per year.

Our classifier is word-based, i.e., it extracts key words from the texts and uses these as features to build predictive models. We conduct additional tests to examine the extent to which
our (word-level) model adds to models built using (document-level) meta-features such as clarity and tone. Motivated by prior research, we add the following three meta-features to our model: fog index (Li, 2006), risk sentiment (Henry, 2006a) and optimistic tone (Davis et al., 2006). (As a check, we replicated the association between changes in clarity (the fog index) and market returns. A portfolio based on changes in clarity leads to a size-adjusted return of 10%, a magnitude similar to that reported in Li, 2006.) Adding these three document-level meta-features to the text classification model, however, leads to statistically similar returns (11.74% for the augmented model versus 12.16% for the raw model). In contrast, a model that contains only the three textual meta-features does not generate any excess return whatsoever. We also find that while the meta-features distinguish between average and extreme performance (as also predicted by our model), they are less able to distinguish the direction of the difference (i.e., between out- and under-perform), particularly for sub-samples of firms. That is, while the scores on the meta-features of out- and under-performing firms differ from those of the average firm (i.e., their texts are denser and reflect greater risk sentiment), the two groups do not differ from each other. We conclude that the word-level text classifier model we employ captures more information than is represented in the meta-features of clarity, risk sentiment, and tone, suggesting a fruitful role for word-based text classification methods in accounting and finance research.

Next, to gain insight into the source of the information content, we examine quantitative properties of firms in the predicted classes. These univariate comparisons indicate that the text-based disclosure score we develop is correlated with attributes such as size and market-to-book. Indeed, sub-sample analyses indicate a greater portfolio return for glamour versus value firms, and for small versus large firms. Thus, one explanation for our results is that our text-classification model captures firm attributes that may be readily computed using financial data. While the correlation between text disclosure and financial characteristics is an interesting finding in itself, it also is possible that the text disclosures affect the association between numeric dimensions and market performance. We examine these conjectures by regressing the excess returns on known factors such as earnings surprise, price momentum, firm size, and market-to-book ratio. We find no main effect for the score on narrative disclosures, suggesting that the disclosure score provides no new information over that provided by known financial factors. However, we find that the coefficient for the interaction term with price momentum reliably differs from zero. (The interaction with the market-to-book ratio is weakly significant.) We infer that differences in the disclosures across firms affect confidence in the numerical estimates, a finding consistent with firms with differing profiles following differing text disclosure strategies.

Our study makes both methodological and economic contributions.
Recent literature (e.g., Li, 2006; Davis et al., 2006; Henry, 2006a,b; Tetlock et al., 2006) that conducts large sample studies of the characteristics of narrative disclosures considers specific dimensions of narrative disclosure.1 Li (2006) examines how changes in a firm's fog index (a measure of readability) correlate with earnings prospects and persistence, thereby shedding light on managerial incentives to alter readability. In a levels study, Davis et al. (2006) consider the association between the tone (optimistic/pessimistic) of current reports and future ROA. Henry (2006a) conducts an event study that links the tone of press releases with market reactions.2 In contrast, our text-classification algorithm offers three key methodological advantages. First, it simultaneously considers all aspects of the disclosure, such as length and word choice, thereby avoiding the need to impose an external model to generate meta-features such as optimism or readability. Allowing an unconstrained relation lets the predictive model capture complex interactions among features. This attribute is particularly important for the analysis of text because the relations can occur at the word, sentence and/or document level. And yet (the second advantage), our approach is open to including meta-features such as the fog index, thereby helping us understand the information captured by meta-features. (We perform such an extension.) Third, our approach can be readily extended to include other text sources such as analyst forecasts, economic reports or industry analyses that also might be relevant for firm valuation and for predicting performance. Indeed, it is possible to differentially weight these sources in terms of their credibility, freshness, and so on, extensions that are not possible with current approaches that rely on features developed from external models.3

Economically, we show that current period disclosure quality is associated with future returns and that the disclosures affect the confidence in estimates.4 Our results indicate considerable benefits from research that refines such predictive models by increasing the dataset (e.g., adding economic forecasts), and by conditioning the model on parameters such as industry and product life cycle. Overall, the techniques we explore in this paper point toward a rich set of questions that parallel the use of numeric disclosures and examine the use of narrative disclosures by market participants as well as management incentives connected with such disclosures.

The rest of this paper proceeds as follows: Section 2 describes our research question and Section 3 provides an overview of the methodology. We discuss the sample selection process and provide sample descriptions in Section 4. We report results in Section 5 and offer concluding remarks in Section 6.

2. Background

Beginning with the seminal work by Ball and Brown (1968), a vast literature examines whether and how market participants employ financial reports to evaluate a firm's future performance, and thus, its value. Fields et al. (2001), Kothari (2001) and Healy and Palepu (2001) provide recent surveys. In contrast to the attention paid to the properties of and the information contained in financial data disclosed by firms, there is a paucity of research examining narrative disclosures. However, such narratives are an important information source for analysts and a critical component in annual reports.
Rogers and Grant (1997) found that the management discussion and analysis (MD&A) section in annual reports constituted the largest proportion of information cited by analysts. They state (p. 17), "[I]n total, the narrative portions of the annual report provide almost twice as much information as the basic financial statements."

1 There is also a research stream (Barron et al., 1999; Clarkson et al., 1999; Subramanian et al., 1993; Smith and Taffler, 2000) that primarily relies on hand-coded classification of a small sample of firms when investigating their research question.
2 Henry (2006b) considers a partitioning algorithm (CART) and shows that including data about key words and document style improves classification accuracy. She performs a 10-fold analysis on contemporaneous data. That is, the model is trained with 90% of the observations and tested on the remaining 10%. Thus, the model is not implementable because it uses data from the same period to predict returns. That is, she uses actual data from 1998 for 90% of firms to predict returns in 1998 for the remaining 10% of firms. In contrast, our approach and tests lead to an implementable approach in that we use actual data from 1998 to predict returns for 1999.
3 The disadvantage is that the underlying model is not transparent because it might be non-linear. Although not our focus, with additional structure and analyses, it is possible to determine the relative "weights" of the attributes.
4 Botosan (1997) and Botosan and Plumlee (2000), who follow the convention of using the AIMR ranking of corporate disclosure as a measure of disclosure quality, are notable exceptions.
Similarly, a survey of financial analysts by the Association for Investment Management and Research (AIMR, 2000) found that "86% of surveyed financial analysts indicated that management's discussion of corporate performance was an 'extremely' or 'very' important factor in assessing firm value" (AIMR, 2000).

Research corroborates practitioners' claims that the narrative in an annual report contains value-relevant information. For instance, using quality scores provided by analysts, Clarkson et al. (1999) find that the quality of forward-looking information in the MD&A directly relates to the firm's upcoming performance. Botosan (1997) studies the association between disclosure level and the cost of equity capital, and finds that voluntary disclosures substitute for analyst following in lowering the cost of capital. Bryan (1997) shows that discussions of future operations and planned capital expenditures are associated with one-period-ahead changes in sales, earnings per share, and capital expenditures. Barron et al. (1999) find that high MD&A quality (in terms of compliance with the disclosure requirements) reliably reduces errors in analysts' earnings forecasts. We also interpret the SEC's plain English disclosure rules as acknowledging the importance of narrative disclosures when evaluating earnings and cash flow (Firtel, 1999).

The source of the information content in narrative disclosures is subtle and hard to measure. Subramanian et al. (1993) find that well-performing firms used "strong" writing in their reports while poor performers' reports contained significantly more jargon or modifiers and were hard to read. Smith and Taffler (2000) identify thematic keywords from Chairman's statements and generate discriminant functions to predict company failure. Kohut and Segars (1992) study presidents' letters in annual reports and suggest that, as a communications stratagem, poorly performing firms tend to emphasize future opportunities over poor past financial performance. Lang and Lundholm (2000) find that "optimistic" pre-announcement disclosures of equity offerings lower the cost of equity capital.

Because of the difficulty in data collection and measurement, early studies that examine the qualitative aspects of disclosure usually employ hand-collected data and examine small samples. They also typically rely on experts to code the quality of disclosure (e.g., AIMR scores). Recognizing these limitations, Core (2001) suggests that computing measures of disclosure quality could greatly benefit from the techniques of other research areas such as computer science, computational linguistics, and artificial intelligence. There also is interest in developing analyses that test the information content and the predictive ability of narrative disclosures in a large-sample study with automatic coding of data. Recent research (Li, 2006; Henry, 2006a,b; Davis et al., 2006) has responded to this call. Typical examples include Li (2006), who shows that changes in the readability of the MD&A section are predictive of future returns, and Davis et al. (2006), who show that tone (a count of pessimistic versus optimistic words) is associated with future ROA. Note that, like our study, Li (2006) assumes that market price does not instantaneously impound the information contained in narrative disclosures. We view these papers as positing a relation between some dimension(s) of textual data and future performance.
Thus, these papers construct a measure (e.g., fog index, count of positive words) of the one dimension typically studied (readability, optimism), and use traditional statistical methodology such as OLS regressions to test the association between the measure and performance. The values of and relations among the parameter estimates form the basis for inferences about patterns in the data. Our innovation is the use of an algorithmic approach (see also Henry, 2006b) to develop a predictive model.5 Our approach, which draws from foundations in computer science, focuses on predictive accuracy and treats the data structure or pattern as an unknown. The goal is to let the algorithm "learn" the underlying model using the most relevant information from the entire set available. Thus, the focus is not on generating model parameters but on fitting the best possible model.

Such an approach confers at least three advantages. First, we can simultaneously consider many different aspects of the disclosure such as length, readability and word choice, thereby avoiding the need to specify ex ante the meta-features of interest such as optimism or readability. Such an unconstrained relation lets the predictive model discover and capture complex interactions among features. This attribute is particularly important for the analysis of text because the relations can occur at the word, sentence and/or document level. Indeed, we can (and do so in our extensions) include document-level meta-features such as the fog index, thereby helping us understand the information captured by meta-features. Second, the approach can be easily extended to include other information sources such as economic reports. Including such data is particularly useful because market participants parse the annual report in the broader economic context and in light of the other information available to them.6 Indeed, current developments in computer science allow for models that differentially weight information sources in terms of their credibility, freshness, and so on. Third, we can use the model to identify sub-sets of the population that systematically differ in terms of the information content of their disclosure.7

Because of these advantages, the use of algorithmic text classification models is widespread in diverse areas such as marketing, biomedicine, music, law and web crawlers (Dave et al., 2003; Popa et al., 2007; Pérez-Sancho et al., 2005; Thompson, 2001; Pant and Srinivasan, 2005), although their use in finance and accounting is nascent. The primary disadvantage is that the method does not readily yield parameters that we could use to assess the statistical/economic significance of individual dimensions and/or sources. While possible, such analyses require the researcher to impose considerably more structure and are left open for future research.8
5 Our method differs from the CART method in Henry (2006b) in that we do not sequentially add measures of constructs to partition the data. Rather, the entire set of words is used to construct a model. 6 For instance, Asquith et al. (2006) examine the information content of qualitative analysis provided by equity analysts. 7 As an example, consider a model that tests the ability of film reviews to predict box office receipts. We can then identify the reviewers whose reviews consistently outperform reviews by other reviewers. Studying this sub-sample of reviews then can help us understand the features that make a review more predictive of box office success. Similarly, we can use this methodology to find sub-sets of firms whose narrative disclosures are more informative regarding market and/or accounting performance. We can then study these disclosures to glean the reason why. 8 The two approaches are complements. The algorithmic approach can potentially help identify the constructs and an outline of the model. We can then employ traditional statistical methodology to fit the model and identify parameters.
3. Methodology

We focus on whether we can use narrative disclosures to construct measures that predict firms' performance. Constructing such an algorithm requires that we define (1) a method to quantitatively represent a document's narrative disclosure, (2) measures of a firm's performance, and (3) a model that will enable the use of the disclosure measure in step (1) to predict performance as defined in step (2). We address these issues next. (See Appendix A for a non-technical description of the text classification problem; Mahinovs and Tiwari (2007) provide an accessible review of the literature. See http://videolectures.net/mlas06_cohen_tc/ (accessed on 7/8/08) and Sebastiani (2002) for an in-depth review of the area.)

Briefly, text classification is the task of automatically putting documents into predefined categories. (A ready example is assigning news articles by topic such as politics, sports or culture.) This classification task comprises several steps, the first of which is text representation. For this step, we employ standard text representation techniques used in computer science, with suitable modifications for financial reports. Consistent with the literature (Sebastiani, 2005), we first stem the words in a document to their morphological roots (e.g., running is stemmed to run) and eliminate common words such as a and the. We then represent the document as a vector of stems using the "bag of words" approach. The approach is so named because it uses all the terms (stems) in a document regardless of the order or position of the terms. Loosely, the set of "independent variables" in the model is the set of all stemmed words. We can then map a document in n-space, treating each term as a dimension and using a numerical weight for each stem. This weight is usually a function of the frequency of the stem in the document and in the full collection of documents (Hand, 2001).9

Naturally, because the method treats each unique term as a separate dimension, this step leads to a large term space. Accordingly, the next step is to reduce the term space and generate a smaller vocabulary (loosely, to identify the words that have the greatest ability to distinguish among documents). This step is particularly important in our study because the term space generated from 10-K reports is of extremely high dimension. We employ the document frequency (DF) method to reduce the term space. This method ranks words by the number of documents that contain the word and uses a threshold level to reduce the number of words considered. Yang and Pedersen (1997) show that the DF method produces an overall efficiency gain by eliminating less informative terms and reducing the vocabulary size without sacrificing classification accuracy. Finally, we use the term frequency * inverse document frequency (TF*IDF) method (Singhal et al., 1996), the most commonly used weighting scheme, to estimate the term weights for the individual terms identified by the DF method. Intuitively, this weighting scheme assumes that the best descriptive terms for a given document are those that occur very often in the given document (term frequency) but not much in other documents (inverse document frequency) (Salton and Buckley, 1988). Note that the document frequency is calculated in the context of our collection of 10-K filing documents. Thus, these words will do well in separating the considered document from other documents. In this way, we represent each document as a point in the n-dimensional term space.
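As an illustration of this representation step (and not the authors' original code), the following Python sketch stems terms, removes common stop words, applies a DF threshold, and computes TF*IDF weights. The libraries (NLTK, scikit-learn), the threshold value, and the variable names are assumptions made for illustration only.

```python
# Illustrative sketch of the text representation step: stemming, stop-word
# removal, DF-based vocabulary pruning, and TF*IDF weighting.
# Assumes NLTK and scikit-learn are available; the DF threshold is arbitrary.
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def stem_tokens(text):
    # Map each word to its morphological root (e.g., "running" -> "run").
    return [stemmer.stem(token) for token in text.lower().split()]

vectorizer = TfidfVectorizer(
    tokenizer=stem_tokens,   # bag of stems; order and position are ignored
    stop_words="english",    # drop common words such as "a" and "the"
    min_df=10,               # DF threshold: keep terms in at least 10 documents
)

# documents = [...]                        # one string per 10-K narrative
# X = vectorizer.fit_transform(documents)  # each row: a document in term space
```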
Step 2 in our method is to identify the predictive attribute of interest. We focus our analysis on size-adjusted returns because the market return is the metric of most interest to shareholders, analysts and other users.10 This performance measure becomes the (n+1)th dimension associated with a document. In this context, note that predicting a specific value of a certain performance measure is a harder task than predicting a category of a performance measure, because a real-value prediction is more granular than a category prediction. As an exploratory study, we start with a coarser approach and classify firms into three classes relative to their peers: under-performing, average, and out-performing. Each year, we rank the firms by their actual performance for the next year, and use the 25th and the 75th percentiles to define the cutoffs for the three classes.

For step 3 in our method, ideally, we could develop a mapping between a firm's disclosure vector (as developed in step 1) and the performance measure (in step 2). The classical statistical approach (which includes studies that examine one or more specified aspects of the text) then finds parameters that fit a specified model to the data. Our approach differs in that we do not adopt a model or specify the attributes of interest. Rather, akin to a neural net, we let the data-driven text-classification algorithm "learn" the potentially non-linear and multi-faceted relation between the text attributes and future returns. Essentially, the model seeks to construct a hyperplane in the n-dimensional term space that best separates the data points as per their categories.11

Once we "train" the model, we apply it to a hold-out sample (in our case, to the annual reports for the next year). The output from this analysis is a prediction for each firm in the hold-out sample as to its category: out-perform, average or under-perform. We then construct equally weighted portfolios based on these predictions. That is, we allocate the same dollar amount to two sets of firms: we buy firms predicted to out-perform and sell firms expected to under-perform. The size-adjusted return earned by the portfolio is our measure of incremental value and the predictive ability of narrative disclosures.

3.1. Design

For our design, a data point represents the results from a particular measure and year. We draw the training document set and the test document set from adjacent years. We use documents that report performance for the period t−1 to t (available at time t) to build the predictive model.12 We then apply the model to the documents reporting results for year t (available at time t+1) to predict the performance category for the period t+1 to t+2.
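The next sketch (again hypothetical, with made-up column names) illustrates the three-class labeling by yearly 25th/75th percentiles, the walk-forward train/test split, and the long-short portfolio return described above.

```python
# Illustrative sketch (not the authors' code): yearly three-class labeling,
# walk-forward train/test split, and the long-short portfolio return.
# Column names (year, text, next_year_sar, pred) are hypothetical.
import pandas as pd

def label_classes(returns):
    """Label each firm as under / average / out using 25th and 75th percentiles."""
    lo, hi = returns.quantile([0.25, 0.75])
    return pd.cut(returns, bins=[-float("inf"), lo, hi, float("inf")],
                  labels=["under", "average", "out"])

def portfolio_return(test_df):
    """Equally weighted long-short return: buy predicted out, sell predicted under."""
    long_ret = test_df.loc[test_df.pred == "out", "next_year_sar"].mean()
    short_ret = test_df.loc[test_df.pred == "under", "next_year_sar"].mean()
    return long_ret - short_ret

# Walk-forward use (df holds one row per firm-year):
# train = df[df.year == t]
# y_train = label_classes(train["next_year_sar"])
# ... fit a text classifier on train["text"] and y_train ...
# test = df[df.year == t + 1]
# test["pred"] = ...                      # model predictions for year t+1 documents
# annual_portfolio_sar = portfolio_return(test)
```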
9 In accounting, Smith and Taffler (2000) and Hussainey et al. (2003) show that counts of keywords are related to bankruptcy and to the association between current earnings and future stock returns.
10 Unlike the accounting metrics, the return metric impounds other information not reflected in the firm's financial statements because market prices are based on forward-looking information (Kothari, 2001). Thus, the market return is the hardest to predict. On the other hand, a firm's management exercises greater control over accounting data. Even though there is ongoing debate on whether earnings management is generally opportunistic or strategic (e.g., Arya et al., 1998, 2003), there is broad consensus that firms employ discretionary accruals to manage reported income. Such practices add noise to the accounting measures we consider.
11 The method is ideally suited for binary classifications. Because we have three classes, we perform three two-way classifications and combine the predictions to generate an overall classification. See the last paragraph of Appendix A for details.
12 The number of years to consider when building a predictive model is an interesting question. We could use all available data to construct the model, weighting recent years more. We use a conservative approach and only employ the most recently available information. In essence, our approach assumes that the patterns unearthed in last year's annual report would hold for the current year's annual report, and can help predict performance in the forthcoming year.
Fig. 1. Overview of design. [The figure shows a timeline from year t−1 to year t+2, linking each year's annual report (Doc) to the following year's size-adjusted return (SAR) through steps A, B and C below.]

where:
SARt = size-adjusted return cumulated from April 1 of year t to March 31 of year t+1.
Doct−1 = annual report for year t−1, usually available in March of year t.

A. For firms in year t−1, build a predictive model using the firms' SAR in year t (i.e., size-adjusted return cumulated from April of year t to March of year t+1) and the annual reports for year t−1, which are usually published in March of year t.
B. For firms in year t, apply the predictive model built in step A to the annual reports for year t, which are published in March of year t+1, and predict the class of SAR performance of these firms in year t+1, i.e., the three-class SAR (size-adjusted return cumulated from April of year t+1 to March of year t+2).
C. On March 31 of year t+1, given a set of predicted out-performing firms and a set of predicted under-performing firms from step B, we sell the under-performing firms' stocks at a total value of (for example) 10 million dollars and buy the out-performing firms' stocks with a total value of 10 million dollars. In both the buying and selling transactions, we allocate equal values of stock among the firms. On March 31 of year t+2, we sell the stocks of the out-performing firms and buy the stocks of the under-performing firms. If our prediction was correct, this transaction should generate a non-negative profit.
(Notice that the standard 10-fold validation in text classification, as in Witten and Frank (2000), is not sensible in this context because we build a single model for each year in the sample.) Based on the classification, we examine whether an implementable trading strategy based on predictions from our model earns a positive size-adjusted return. Such a test is interesting because predictive accuracy is a relatively coarse performance metric. Further, portfolio returns have an endogenous cost of prediction successes and failures. Finally, a returns test is the appropriate measure to examine whether there is incremental information content in the narrative disclosures relative to the information impounded into contemporaneous prices.

We calculate a portfolio return as the average size-adjusted return difference between the out-performing firms and the under-performing firms for each year. We report results for a 25–50–25% cut-off for defining the three classes of out-performing, average, and under-performing firms. (We verified robustness with a 10–80–10 cut-off.) We calculate a portfolio-level return for a buy-and-hold strategy (see Fig. 1). Specifically, consider the model constructed using documents for the year ending 12/31/1999 (data available March 2000) and calendar year 2000 performance (data available in March 2001). We apply this model to documents available in March 2001, make predictions, and measure the cumulative size-adjusted return for the portfolio from April 1, 2001 to April 1, 2002. We verified that such a strategy is implementable in that the documents are available before 3/31. Further, because the SAR for a random portfolio is zero by construction, this return is the incremental return relative to constructing a random portfolio.

We perform two analyses to understand the source of any excess return. Our first approach checks whether our disclosure score is picking up known document-level features such as the fog index or risk sentiment. For each document, we add these features to the term space and construct a new model, and use the predictions of the augmented model to construct portfolios. If these meta-features are incrementally useful, the predictions of the augmented model should exhibit greater returns relative to the base model. Our second approach employs cross-sectional regressions. We estimate:
SAR = a + b1 Dummy + b2 Size + b3 MTB + b4 PM + b5 Surprise + b6 Size × Dummy + b7 MTB × Dummy + b8 PM × Dummy + b9 Surprise × Dummy + error,

where:
SAR = size-adjusted buy-and-hold return for the year;
Dummy = 1 if the firm is classified as out-performing and 0 for predicted under-performing firms (average firms are excluded from this analysis);
Size = the size of the firm, measured as the natural logarithm of total assets;
MTB = market-to-book ratio (a valuation proxy), using the closing market price as of the start of the holding period;
PM = price momentum, measured as the SAR for the six months preceding the start of the holding period;
Surprise = actual EPS minus forecast EPS, where the forecast is the latest available consensus analyst forecast.

Our choice of the regressors stems from studies (e.g., Jegadeesh et al., 2004) that examine the incremental information content of analyst forecast revisions after controlling for factors known to affect returns. In the above regression, a positive coefficient for b1 is consistent with the narrative disclosures providing incremental value-relevant information to market participants. A non-zero interaction term is consistent with the narrative disclosure altering the confidence market participants place in the numeric estimates.
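A minimal way to estimate this regression, assuming the variables have been assembled into a pandas data frame with a firm identifier for clustering, is sketched below using statsmodels; the column names are illustrative, not taken from the paper.

```python
# Illustrative estimation of the cross-sectional regression with interactions.
# Column names are assumptions; standard errors are clustered by firm,
# as described in the notes to Table 4.
import statsmodels.formula.api as smf

formula = ("SAR ~ Dummy + Size + MTB + PM + Surprise"
           " + Size:Dummy + MTB:Dummy + PM:Dummy + Surprise:Dummy")

# df: one row per firm-year, predicted out- and under-performers only
# result = smf.ols(formula, data=df).fit(
#     cov_type="cluster", cov_kwds={"groups": df["firm_id"]})
# print(result.summary())
```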
4. Data and descriptive statistics

The primary data for our experiments include the firms' financial data, size-adjusted returns, and the firms' annual reports. To increase the homogeneity of firms in the sample, we restrict the sample to firms in the manufacturing industry (SIC codes 2000 to 3999) with December as the fiscal year-end month. The sample period is from 1997 to 2002 (we include return data for 2003 as well). We ensure data integrity and accuracy by using the values for gvkey from the COMPUSTAT database, permno from the CRSP database, and OFTIC from I/B/E/S to identify a unique firm. We collect financial data for a total of 1236 unique firms. Each annual report has an accession code as its unique identifier.
Table 1. Descriptive statistics.

Panel A: Sample selection (number of firm-years)
Total number of documents (1997–2002): 4755
Do not have SAR data: (295)
Potentially useful in SAR exercise: 4460
Truncated extreme observations: (180) [trimmed top and bottom 1% of observations]
Documents used to develop the SAR prediction model: 4280 [statistics presented in Table 1]
Loss due to 1-year-ahead prediction: (751) [we do not have predictions for 1997, the first year with documents]
Documents with portfolio experiment results: 3529 [results presented in Tables 2, 3, and Panel A of Table 4]
No data for sub-sample classification: (459)
Available for sub-sample analysis: 3070
Trimmed for extreme observations: (124) [top and bottom 1% of observations removed for each variable]
Net available for sub-sample analysis: 2946 [results presented in Panel A of Table 3]
Used only extreme quintiles in regressions: 1473
Lost due to missing data: 406
Used in regression: 1007 [results presented in Panel B of Table 4]

Panel B: Sample characteristics
Item                     N (firm-years)   Mean       Median    25th percentile   75th percentile
Sales (millions)         4099             $2749.66   $336.44   $65.93            $1666.91
Net assets (millions)    4099             $3153.43   $385.90   $95.52            $1735.89
ROE                      3073             3.48%      7.77%     −10.64%           18.09%
EPS                      4255             $0.248     $0.63     $0.18             $1.39
Size-adjusted return     4280             2.23%      12.91%    −41.11%           18.48%
Market-to-book ratio     4098             1.95       1.18      0.64              2.36

Panel C: Industry composition (SIC codes: number of firms in sample)
20–25: 99; 26: 38; 27: 32; 28: 276; 29–32: 59; 33: 43; 34: 38; 35: 171; 36: 185; 37: 55; 38: 199; 39: 23. Total: 1236.
We manually download from Mergent Online the accession codes of the annual reports for each firm from 1997 to 2002. We then automatically retrieve the annual reports from the EdgarScan website using the downloaded accession codes. There are 10 different submission types for annual reports: 10K (10-K filings), 10K405 (10-K filings where the regulation S-K Item 405 box on the cover page is checked), 10K405A (amendments to 10K405), 10KA (amendments to 10-K filings), 10KSB (10-K filings for small businesses), 10KSBA (amendments to 10KSB), 10KSB40 (optional form for small businesses where the regulation S-B Item 405 box on the cover page is checked), 10KSB40A (amendments to 10KSB40), 10KT (10-K transition reports), and 10KTA (amendments to 10-K transition reports). We focus on the major submission types of 10K and 10K405. Our final set of usable documents with matching financial performance measures comprises 4280 annual reports from 1236 firms published in years 1997 to 2002. Using the CRSP database, we calculate the size-adjusted cumulative return as the size-adjusted buy-and-hold return cumulated for 12 months from April 1 of the fiscal year to the next April. We verify that the relevant documents are available, and that the strategy is implementable.

4.1. Sample description

Table 1, panel A provides the number of observations considered for each analysis. We begin with 4755 documents but make only 3529 predictions because of missing data and because of the lagging nature of the predictive model. We also trim the top and bottom 1% of observations on size-adjusted returns and other classification variables to reduce the influence of outlying observations.13 We use 3070 observations in the sub-sample analysis because we could not collect the classification data required for 1997.

Panel B provides descriptive data for our sample, with each observation representing a firm-year. Over all years, the average firm has mean sales of $2749 million and median sales of $336 million, indicating the presence of several large firms in the sample. The average ROE is 3.48%, while the median is 7.77%. The mean and median values for the market-to-book ratio are 1.95 and 1.18, respectively. Panel C of Table 1 provides the industry breakdown for sample firms. We do not find any significant clustering of industries specific to our sample. Tests (not reported) do not reveal any systematic difference between the spread of firms in our sample and the distribution of all COMPUSTAT firms from the relevant SIC codes.

13 In the accounting and finance literatures, such trimming is standard when dealing with security returns. The average return in the bottom (top) 1% is close to −100% (well over +100%), which is not representative of average returns.
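For clarity, the size-adjusted buy-and-hold return used throughout can be sketched as follows; the monthly frequency and the benchmark series (returns on a portfolio of similar-sized firms) are assumptions for illustration, not details taken from the paper.

```python
# Illustrative computation of a size-adjusted buy-and-hold return (SAR):
# compound the firm's monthly returns over the April-to-March window and
# subtract the compounded return of a size-matched benchmark portfolio.
import numpy as np

def buy_and_hold(monthly_returns):
    # Compound simple monthly returns over the holding period.
    return float(np.prod(1.0 + np.asarray(monthly_returns)) - 1.0)

def size_adjusted_return(firm_monthly, size_benchmark_monthly):
    return buy_and_hold(firm_monthly) - buy_and_hold(size_benchmark_monthly)
```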
Table 2. Average return difference between predicted out-performing and under-performing firms. Model based on documents for year t and performance for year t+1; tested on documents for year t+1 and performance for year t+2.

Panel A: Portfolio returns
Year      25–50–25% performance class definition (%)   10–80–10 performance class definition (%)
1998      −2.46                                         4.69
1999      63.68                                         42.62
2000      −36.84                                        −45.73
2001      19.60                                         19.11
2002      16.82                                         12.25
Average   12.16                                         6.59

Panel B: Number of out-performing and under-performing firms predicted, as in Panel A
          25–50–25% definition                  10–80–10 definition
Year      Predicted out    Predicted under      Predicted out    Predicted under
1998      98               126                  45               54
1999      110              83                   61               41
2000      206              110                  78               46
2001      97               170                  55               54
2002      97               112                  56               37
Total     608              601                  295              232

Panel C: Portfolio returns (sub-sample analysis)
Year      Large size (%)   Small size (%)   Value (%)   Glamour (%)
1999      18.08            88.40            7.76        83.36
2000      −18.78           −41.08           −17.49      −46.66
2001      16.61            0                3.68        15.04
2002      11.08            9.58             11.90       13.85
Average   6.75             14.23            1.46        16.39

Notes: 1. Cell entries represent the portfolio-level buy-and-hold size-adjusted return for a year beginning April 1 and ending March 31. The portfolio is long on the predicted out-perform firms and short on the predicted under-performers. 2. The performance class specification relates to the performance cutoffs used to define the classes in the training sample.
We also do not find systematic differences in key firm characteristics. Thus, our results appear generalizable, albeit only to manufacturing firms. We use the size-adjusted cumulative return as the key metric of firms' financial performance. As noted earlier, this metric contains the market response information that is generally not reflected in the financial statements.

5. Results

The dependent variable in our analysis is a portfolio size-adjusted return, rebalanced each year. We construct an equally weighted buy-and-hold portfolio that sells the predicted under-performing firms and invests in the predicted out-performing firms. We ensure that we employ an implementable strategy by verifying that all of the documents were available before April. We calculate annual returns for the prediction period (April to April). For robustness, we replicate the analysis both for the 25–50–25 partition (reported) and for the 10–80–10 partition of the sample for identifying out- and under-performing firms.

Panel A of Table 2 presents results on the cumulative size-adjusted return by year. For both partitions, we find a significant return for every year except 1998 and 2000 (when we find a significantly negative portfolio return). One reason for this anomaly might be the considerable turbulence experienced by financial markets during 2000 (see, for example, Barber et al., 2003). On average, we find an annual excess return of 12.16% using the 25–50–25 partition and 6.59% using the 10–80–10 partition for developing the model. These estimates are consistent with earlier research that hints at the considerable information content of narrative disclosures. These results also suggest that the market has difficulty in immediately parsing the information content of the disclosures, meaning that this information shows up in the return for the next year.14

In panel B, we report the number of firms classified as out- and under-performing, by year, for each of our partitions. These data show that, while the predictive model was constructed using a 25–50–25 partition of actual performance, the number of firms predicted to out- or under-perform is not 25% of the hold-out sample. For instance, only 608 firm-years are predicted to out-perform, when a naïve expectation is for 882 (= 3529 total observations classified × 0.25). (Using a proportions test, this difference is statistically significant.) Thus, as is intuitive, our predictive model is better able to pick up "extreme" differences from the average firm relative to smaller differences.

Panel C of Table 2 presents results for sub-samples of firms. We investigate two partitions, based on market-to-book and on firm size. For the first set of results, we determined the median market-to-book value for each year. We then partitioned sample firm-years into the value or glamour categories based on their value relative to the median value for the relevant year.
14 Inspection suggests that general market volatility affects participants' ability to gainfully use narrative disclosures to predict future performance. Systematic analysis of this inference is hampered because we only have five observations. Extending the analysis to more years (and/or using quarterly reports) is one way to obtain enough data to test this conjecture.
Table 3. Supplementary analysis. Columns: all firms; firms partitioned by market-to-book (glamour versus value firms, separate analysis for each group); firms partitioned by size (large versus small firms, separate analysis for each group).

Panel A: Semantic values by firm partitions (mean values of each attribute)

                                  All firms   Glamour firms   Value firms   Large firms   Small firms
All firms
  N (firm-years)                  3529        1473            1472          1473          1472
  Fog index                       18.41       18.41           17.97         17.82         18.57
  Risk sentiment                  28.81       29.23           24.37         27.30         26.20
  Tone                            0.401       0.404           0.392         0.401         0.391
Firms predicted to out-perform
  N (firm-years)                  608         249             129           217           192
  Fog index                       18.515      18.59           17.49         18.02         18.65
  Risk sentiment                  31.939      33.76           30.92         37.21         30.92
  Tone                            0.399       0.406           0.384         0.404         0.394
Firms predicted to under-perform
  N (firm-years)                  601         141             181           203           167
  Fog index                       18.740      18.86           17.96         18.30         18.81
  Risk sentiment                  34.600      37.95           32.41         35.77         34.70
  Tone                            0.393       0.400           0.383         0.383         0.387

t-Tests of differences
Firms predicted to out-perform versus all firms
  Fog index                       2.34*       1.06            0.68          0.75          0.02
  Risk sentiment                  4.65***     3.01***         3.32***       3.31***       2.84**
  Tone                            1.99        0.34            2.46          0.22          1.42
Firms predicted to under-perform versus all firms
  Fog index                       6.19***     3.44***         1.02          2.75**        1.83
  Risk sentiment                  7.40***     3.69***         4.25***       5.06***       4.38***
  Tone                            4.25***     1.19            1.02          2.73**        2.65**
Predicted out-performers versus predicted under-performers
  Fog index                       2.46*       1.95            1.19          1.53          1.29
  Risk sentiment                  1.57        1.82            0.65          1.30          1.80
  Tone                            2.78**      0.80            0.15          1.47          0.87

Panel B: Average return difference between predicted out-performing and under-performing firms. Model (augmented with three document-level meta-features) based on documents for year t and performance for year t+1; tested on documents for year t+1 and performance for year t+2.
Year      25–50–25% performance class definition (%)
1998      −1.85
1999      64.65
2000      −37.48
2001      17.01
2002      16.39
Average   11.74

Notes: 1. Variable definitions are as follows: Risk sentiment = sum of risk term frequencies. Tone = (optimism term frequency − pessimism term frequency) / (optimism term frequency + pessimism term frequency). Fog index = a measure of readability, calculated as 0.4 × [(words/sentences) + 100 × (words with more than two syllables / words)]. 2. Entries in Panel A are the raw values. We performed a log transformation when including the three textual features in the model. 3. Cell entries in Panel B represent the portfolio-level buy-and-hold size-adjusted return for a year beginning April 1 and ending March 31. The portfolio is long on the predicted out-perform firms and short on the predicted under-performers. 4. The performance class specification relates to the performance cutoffs used to define the classes in the training sample.
* p < 0.05. ** p < 0.01. *** p < 0.001.
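To make the three meta-features concrete, a rough sketch of how they could be computed follows; the word lists are hypothetical placeholders (published optimism, pessimism and risk dictionaries would be used in practice), and the syllable counter is a simple heuristic rather than the measure used in the cited studies.

```python
# Illustrative sketch of the document-level meta-features defined in the
# notes to Table 3. Word lists are hypothetical placeholders.
import re

OPTIMISTIC = {"growth", "improve", "strong", "gain"}            # placeholder list
PESSIMISTIC = {"decline", "loss", "weak", "adverse"}            # placeholder list
RISK_TERMS = {"risk", "uncertainty", "litigation", "volatile"}  # placeholder list

def count_syllables(word):
    # Rough vowel-group heuristic; a production system would use a lexicon.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def meta_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    complex_words = [w for w in words if count_syllables(w) > 2]
    fog = 0.4 * (len(words) / max(1, len(sentences))
                 + 100.0 * len(complex_words) / max(1, len(words)))
    opt = sum(w in OPTIMISTIC for w in words)
    pes = sum(w in PESSIMISTIC for w in words)
    tone = (opt - pes) / (opt + pes) if (opt + pes) else 0.0
    risk = sum(w in RISK_TERMS for w in words)
    return {"fog_index": fog, "tone": tone, "risk_sentiment": risk}
```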
We then re-estimated the textual model for each of the sub-samples separately. We repeated the exercise for size, using total assets as the measure. Our textual model indicates greater value relevance in the disclosures made by glamour firms and by small firms. Portfolios based on the model predictions have a size-adjusted return of 14.23% on average for small firms but only 6.75% for large firms. Similarly, expectations about future growth drive the valuations of glamour firms more than those of value firms. Again, we find size-adjusted returns of 16.39% (1.46%) when we form portfolios for glamour (value) firms. In other words, our findings show that firms grouped on readily observable metrics such as size and market-to-book ratio follow detectably different text disclosure strategies. (However, our analysis does not speak to the dimensions in which the disclosures differ, a matter for additional research in this area.)
Table 4. Sample characteristics and incremental returns. Model based on documents for year t and performance for year t+1; tested on documents for year t+1 and performance for year t+2. Sample firm-years from the SAR implementable experiment.

Panel A: Sample characteristics, mean (median)
Item                            Predicted under-perform (N = 601)   Predicted average-perform (N = 2320)   Predicted out-perform (N = 608)
Assets ($ million)              $1736.93 (170.88)                   $3625.59 (488.60)                      $3715.88 (482.53)
Sales ($ million)               $1389.62 (91.72)                    $3136.83 (473.51)                      $3125.95 (376.17)
EPS                             $0.305 (0.231)                      $0.746 (0.74)                          $0.699 (0.72)
Market-to-book                  1.636 (1.042)                       1.718 (1.012)                          3.124 (1.953)
Leverage                        0.441 (0.396)                       0.493 (0.505)                          0.447 (0.445)
Earnings surprise               0.153 (0.175)                       0.249 (0.1)                            0.099 (0)
Price momentum                  0.332 (0.39)                        0.017 (0.067)                          0.696 (0.315)
Size-adjusted return (annual)   0.057 (0.173)                       0.009 (0.097)                          0.019 (0.126)

Panel B: Incremental information content
SAR = a + b1 Dummy + b2 Size + b3 MTB + b4 PM + b5 Surprise + b6 Size × Dummy + b7 MTB × Dummy + b8 PM × Dummy + b9 Surprise × Dummy + error

Item                             Regression model 1         Regression model 2
                                 Estimate    t-Value        Estimate    t-Value
Intercept                        0.068       0.87           0.042       0.36
Dummy for model prediction       0.054       1.08           0.043       0.26
Log (total assets)               0.014       1.17           0.011       0.56
Log (market-to-book)             0.003       0.12           0.027       0.75
Earnings surprise                0.001       0.36           0.004       0.68
Price momentum                   0.083       3.11***        0.035       1.23
Dummy × log (total assets)                                  0.002       0.08
Dummy × log (market-to-book)                                0.071       1.65
Dummy × earnings surprise                                   0.011       1.39
Dummy × price momentum                                      0.291       4.00***
N                                1007                       1007
Adjusted R-square                0.009                      0.023
F-value                          2.71**                     3.54***

Notes: 1. Variable definitions are as follows: SAR = size-adjusted buy-and-hold return for the year. Dummy = 1 if the firm is classified as out-performing and 0 for predicted under-performing firms; average firms are excluded from this analysis. Size = the size of the firm, measured as the natural logarithm of total assets. MTB = market-to-book ratio (a valuation proxy), using the closing market price as of the start of the holding period. PM = price momentum, measured as the SAR for the six months preceding the start of the holding period. Surprise = actual EPS minus forecast EPS, where the forecast is the latest available consensus analyst forecast. 2. Test statistics employ cluster-adjusted standard errors to control for multiple observations from the same firm.
** p < 0.01. *** p < 0.001.
5.1. Link to meta-features

It is possible that our text classification model is merely replicating the previously known association between document-level meta-features (e.g., clarity, tone) and future performance. Panel A of Table 3 presents descriptive data for the three features we study, both for the full sample and for the sub-samples we consider. (In this table, each firm-year is a separate observation.) We report the relevant t-statistics at the bottom of this panel. The first column of this panel reports data relating to the entire sample for which we obtain performance predictions. Relative to the average firm, we find that firms predicted to out-perform have a denser text (fog index of 18.515 versus 18.41, t = 2.34) and more words expressing risk (risk sentiment of 31.93 versus 28.81, t = 4.65), but have a similar tone. We find a similar pattern for firms predicted to under-perform, with even the tone turning more pessimistic. Thus, firms in the "tails" of the distribution of predicted performance differ from the average firm. A first conclusion is that our classification model picks up firms that differ systematically on the meta-features of tone, clarity and risk sentiment. However, we have weaker evidence for this conclusion when we compare the feature scores for firms predicted to out-perform with those predicted to under-perform.
We find that predicted under-performers have a marginally less readable document and slightly greater pessimism. The two groups express similar risk profiles. These comparisons suggest that text disclosures do have information content relating to performance, and that meta-features can help us identify the extremes. The comparisons in the columns highlight that we cannot simply use the meta-features to replace the model. This inability arises because meta-features are of less use in distinguishing the direction of the performance differential, the key attribute of interest. Data in the next four columns (for the sub-samples of glamour, value, large and small firms) provide additional evidence that supports our inference. For all four sub-samples, the predicted under- and out-performing firms have greater scores for risk sentiment relative to the average firm. However, we do not find differences between the sets of firms predicted to out- and under-perform for any sub-sample or measure. We conclude that while the meta-features are picking up differences in style and tone that are systematically related to the performance differentials predicted by our model, they seem unable to distinguish the nature of the performance differential.15

For an additional test of whether our predictive model captures more than the meta-features, we refit the predictive model including the meta-features in the term space. As shown in Panel B of Table 3, we obtain a similar (11.74%) return from the augmented model. More importantly, we also fit a model using only the meta-features. Such a model should, in theory, produce the same predictive ability if the meta-features contained all of the information in the documents. However, we find that such a parsimonious representation of a document (as three meta-features of clarity, tone and risk sentiment) has no explanatory power at all (results not tabled). Overall, we conclude that our text classification model is capturing features not picked up by the selected meta-features.

5.2. Incremental information content

Panel A of Table 4 returns to the full sample analysis. This table provides descriptive data on the firms predicted in the three classes, for the 25–50–25 classification. Relative to the average firm in the out-perform sample, the firms in the under-perform sample are reliably smaller, have lower market-to-book ratios, and are less profitable (as measured by EPS). This distinction provides additional evidence about the information content of disclosures because the classification does not use any numeric item. The text in the annual reports is enough to identify a distinct sample of under- and out-performing firms.

Panel B of Table 4 provides results that speak to the relation between the information in the narratives and the information in quantitative disclosures. In particular, it is of interest to know whether the information in the narrative disclosure is subsumed by, or is incremental to, the information in the quantitative disclosure. The first column reports results for a model that considers main effects only. We find that the coefficient for the model prediction (for "Dummy") is not reliably different from zero. Thus, the text disclosure does not appear to provide value-relevant information incremental to that provided by known factors but rather captures known features.
We find that large firms earn smaller returns (Reinganum, 1981), and that a high market-to-book ratio presages lower returns as well (Fama and French, 1992). However, once we account for all other factors, our data do not show the expected relation between price momentum and excess returns (Jegadeesh and Titman, 1993). Interestingly, we note that the univariate comparison in Panel A is significant at the 5% level (the price momentum is 0.019 for predicted out-performers versus 0.332 for the predicted under-performing firms). The regression estimate, however, indicates that the incremental effect (after accounting for other factors such as market-to-book, size and earnings momentum) is negative.

The second column in this panel reports results for a complete model that includes interaction terms for the model prediction (a binary variable) with established factors. We continue to find an insignificant main effect for our model's prediction. However, as indicated by the significant interaction terms, the disclosure score is weakly informative as to whether the glamour/value partition will continue to yield excess returns for the next period. Moreover, the interaction with price momentum is significant. That is, the disclosure score indicates that the effect of price momentum for firms predicted to out-perform is reliably smaller than for the average firm. Thus, while Jegadeesh and Titman (1993) show abnormal returns to buying winners and selling losers, our results suggest the possibility of finer partitions.16 Accordingly, one interpretation of our results is that narrative disclosures can help identify firms with negative price momentum that reverses over the next year. Stated differently, narrative disclosures could help identify whether the price momentum will persist into the next period or reverse.

6. Conclusions

This paper is part of a nascent literature that explores the narrative disclosures made by firms and complements the established literature that considers the ability of numeric data to predict market performance (e.g., the accrual or the post-earnings-announcement drift anomalies). Most prior studies of textual disclosures have relied on expert classification, thereby limiting sample sizes and the kinds of questions that could be asked. This study demonstrates a methodology for large-scale text mining of the narrative disclosures in annual reports. Even a relatively simple model, when applied to the narrative data alone, successfully predicts future accounting and market performance.

There are several limitations to our approach. Our methodology only allows for limited economic insight into what characteristics of the disclosure lead to certain predictions (see, e.g., Li, 2006). We also employ a simple "bag of words" approach, without paying attention to the context of the usage of specific words. Further, we limit ourselves to the disclosures in the annual report and thus restrict the information that market participants would employ. However, these limitations can be addressed using some of the emerging techniques in text mining (see, e.g., Pant and Srinivasan, 2005).

We could expand this study along several dimensions. The first is the use of text mining models that consider attributes such as tone, phrasing and so on. The second avenue is to augment the disclosures with additional disclosures such as press releases.
15 We note that, in a changes analysis, Li (2006) shows the predictive ability of the fog index (we replicate this finding as well). The other studies do not focus on market performance in an implementable way. We also note that, considering all firms, the group of glamour firms has greater risk sentiment and a more optimistic tone (p < 0.001 for both comparisons) relative to the scores recorded for value firms. We also find that larger firms tend to have greater readability, but their count of risk-related words is also higher.

16 These results hold even if we discard the data for the year 2000 in the analysis. We also note that portfolios formed on price momentum generate abnormal negative returns after an initial holding period (Jegadeesh and Titman, 1993).

We also could overweight economic predictions, such as news releases from the Federal Reserve and sector-specific forecasts by trade associations. A third
avenue is to identify the nature of the differences in disclosures by large versus small firms, and by glamour versus value firms, as our results demonstrate that these sub-sets follow differing strategies. We also could study extreme observations (e.g., a high positive score but a highly negative return) to identify features that diminish the informativeness of text disclosures. Finally, it is of interest to examine the time it takes market participants to impound textual information. While we have focused on annual returns, we conjecture that studies examining shorter time frames might reveal sharper differences, whereas additional economic noise would wipe out the effect over longer time frames. The relation is unlikely to be monotonic, however, because quantitative data likely dominate returns over very short (intra-day or a few days) intervals.

Acknowledgements

We thank Mort Pincus, Cristi Gleason, Paul Hribar, the editor, two anonymous reviewers, and workshop participants at the University of Iowa and Christopher Newport University for helpful comments. Xin Ying Qiu also acknowledges contributions from members of her thesis committee.

Appendix A. Building text classifiers for prediction

Text classification is a core activity in information science. The goal is to assign each text to one (or more) of a given set of categories. As an example, we may be interested in classifying a news article using the categories of sports, health, famous persons, entertainment, gardening, real estate or finance; the article may belong to sports alone, or to both the famous persons and sports categories. Trained individuals may perform text classification manually. Alternatively, classification may be accomplished using computational tools called text classifiers. The design and evaluation of algorithms for automatic text classification has been a highly active field of research for several decades. The field is now mature to the point that text classifiers are used not only to decide conceptual categories (as in the above example) but also to capture more subtle human phenomena such as sentiment; classifiers are being used to identify sentences that are speculative (versus presenting ideas with confidence), to label sentence tone as positive, negative or neutral, and so on. Developments in these more subtle realms in part motivate our use of text classifiers to predict market performance.

The automatic methods employed in text classification derive predominantly from research in machine learning, a subfield of artificial intelligence. Major examples include text classification algorithms based on support vector machines (as in this paper), neural nets, decision trees and association rules. Of these, Support Vector Machine (SVM) based algorithms are among the most effective (Sebastiani, 2002, p. 49). A given classification problem generally (but not always) starts with training data that have been classified by some reliable mechanism, such as an expert, into one of two classes. Alternatively, we can use a known outcome, such as next period's return, to classify the text. An SVM represents each example in the training data as a vector in an n-dimensional space and proceeds to find an (n-1)-dimensional hyperplane that separates the two classes. This strategy produces a linear classifier. Here, the parameter n represents the number of features considered.
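As a concrete, minimal sketch of the linear SVM approach just described (not the authors' actual implementation, which uses the SVM-Light package), the following Python fragment trains a two-class classifier on a handful of invented narratives and applies it to a new one. The scikit-learn classes, the toy documents, and the labels are all illustrative assumptions.

# Minimal illustrative sketch only: scikit-learn's LinearSVC stands in for the
# SVM-Light tool used in the paper; documents and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical training narratives paired with next-period performance classes.
train_docs = [
    "revenue growth exceeded expectations and operating margins improved",
    "strong demand and new product launches drove record earnings",
    "impairment charges and declining sales pressured operating results",
    "liquidity constraints and covenant violations raise going-concern doubts",
]
train_labels = ["out-perform", "out-perform", "under-perform", "under-perform"]

# Represent each document as a weighted term vector; the vocabulary size is n.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)

# Fit a linear SVM, i.e., find a maximum-margin separating hyperplane.
classifier = LinearSVC()
classifier.fit(X_train, train_labels)

# Classify a new, unseen narrative.
X_new = vectorizer.transform(["cost overruns and weakening demand hurt results"])
print(classifier.predict(X_new))  # expected to print ['under-perform']

In the paper's setting, the training labels come from realized next-period performance classes rather than hand labels, and the term weighting is the atn scheme described in Appendix A.1 rather than the default TF-IDF weighting used in this sketch.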
In text classification problems, n can therefore be fairly large, consisting of every non-trivial word in the collection of texts being classified. Because many candidate hyperplanes are likely to exist, SVMs are additionally designed to achieve the best or maximum separation (also called the margin) between the two classes of the training data; that is, the distance between the nearest points of the two classes on either side of the separating hyperplane is maximized. The trained classifier may then be applied to new data, classifying each new text into one of the two classes.

Several key extensions have been made to the basic linear SVM. For instance, when a clean separation between the two classes of points is not possible, soft margins allow for some amount of classification error through the use of slack variables; the SVM then maximizes the margin while minimizing the error. In addition, researchers often employ one of several functions to transform the initial n-dimensional space. The classifier then looks for a separating hyperplane in this transformed space, a hyperplane that may be non-linear in the original space. This strategy may be useful in cases where linear classifiers are not sufficiently effective. Several such "kernel functions" for transforming the initial space, including polynomial and sigmoid functions, are available in implementations of SVM tools. In this paper we use the basic linear SVM classifier.17

SVMs are designed mainly for solving binary (two-class) classification problems. Since our research problem is to classify documents into three classes, we extend the SVM to a multi-class setting: we perform one-against-rest classification for each class and combine the results to make a final decision. The computing time for this option is linear in the number of classes. That is, we produce a total of three binary (one-against-rest) SVM models and assign to each document the class whose model generates the highest predictive score.

A.1. Document representation

In information retrieval and text classification research, the most common approach to encode (or represent) a text document is to model it as a vector of weighted terms. There are generally three aspects to consider when constructing such a document model: (1) What are the terms in the vector? Are they all the words from the document set, or phrases, or some transformation of the words or phrases? (2) How many terms do we need to construct the document representation? Do we use all the defined terms, or a subset of the terms? And, if we want only a subset of the terms, how do we select this subset and why?
17 We build our classifiers using the SVM-Light implementation of Support Vector Machines with default parameter settings and a linear kernel function (see http://svmlight.joachims.org/).
(3) How do we construct a weighting scheme for the terms in the document vector that best indicates the terms' relative informativeness and importance with respect to representing the document?

In addressing the first question, that of defining the terms to represent a document, the most widely used "bag of words" approach starts with the complete vocabulary in the training corpus (the set of words used as "independent variables" in the model). Functional or connective words, such as "a, hence, and, the," are considered stop words and are generally removed, since they are assumed to have no information content. Stemming (e.g., mapping connecting and connected to connect) is sometimes performed to remove suffixes and map words to their morphological roots. Researchers have explored other, more complex textual representations (e.g., Peng and Schuurmans, 2003; Dumais et al., 1998; Apte et al., 1994). While each method has its strengths and weaknesses, more complex definitions have not been shown to be superior to the basic "bag of words" approach in solving classification problems. In this study, we use the stemmed words of the document corpus to construct the document vector representation.

Since the term space generated from our 10K report collection is of extremely high dimension, we need to reduce the term space and generate smaller vocabularies. The benefits of such a reduced term space include better generalization ability of the model, savings in computing time, and possibly better interpretation and understanding of the predictive features. Most term selection methods either compute statistical feature scores to select high-scoring terms or apply simpler feature selection algorithms from machine learning research (e.g., Yang and Liu, 1999; Larkey, 1998; Yang and Pedersen, 1997). We use the document frequency (DF) threshold method for reducing the term space. Relative to other methods, this method (which employs a count of the number of 10K filing documents in our collection that use a given word) is efficient at eliminating less informative terms and reducing the vocabulary size without sacrificing classification accuracy.

Researchers have used many ways to calculate term weights in document vectors. The term frequency * inverse document frequency, or TF*IDF, is the most commonly used weighting scheme for estimating the usefulness of a given term as a descriptor of a document. Its interpretation is that the best descriptive terms of a given document are those that occur very often in that document (high term frequency, TF) but rarely in the other documents (high inverse document frequency, IDF). In our previous study, we explored several constructions of TF*IDF weights. The best performer is the atn weight, formulated as:
\[
\mathrm{atn} = \left(0.5 + 0.5\,\frac{tf}{\max tf}\right)\ln\frac{N}{n},
\]
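As a purely hypothetical numerical illustration (the counts are invented, not drawn from our collection): for a term with raw frequency 3, a maximum term frequency of 6, and appearing in 10 of N = 100 documents,

\[
\mathrm{atn} = \left(0.5 + 0.5 \times \tfrac{3}{6}\right)\ln\tfrac{100}{10} = 0.75 \times 2.303 \approx 1.73 .
\]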
In this formula, tf is the raw term frequency of term i in the document; max tf is the maximum term frequency for a term in the document collection; N is the total number of documents in the collection; and n is the number of documents containing term i. We therefore report results using only the atn weighting scheme for the terms in the document vector.

References

Apte, C., Damerau, F.J., Weiss, S.M., 1994. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems 12 (3), 233–251.
Arya, A., Glover, J., Sunder, S., 1998. Earnings management and the revelation principle. Review of Accounting Studies, 7–34.
Arya, A., Glover, J., Sunder, S., 2003. Are unmanaged earnings always better for shareholders? Accounting Horizons 17, 111–116.
Association for Investment Management and Research (AIMR), 2000. AIMR Corporate Disclosure Survey: A Report to AIMR. Fleishman-Hillard Research, St. Louis, MO.
Asquith, P., Mikhail, M., Au, A., 2006. Information content of equity analyst reports. Journal of Financial Economics 75, 245–282.
Ball, R., Brown, P., 1968. An empirical evaluation of accounting income numbers. Journal of Accounting Research 6 (2), 159–178.
Barber, B., Lehavy, R., McNichols, M., Trueman, B., 2003. Reassessing the returns to analysts' stock recommendations. Financial Analysts Journal 59 (2), 16–18.
Barron, O., Kile, C., O'Keefe, T., 1999. MD&A quality as measured by the SEC and analysts' earnings forecasts. Contemporary Accounting Research 16 (Spring), 75–109.
Botosan, C., 1997. Disclosure level and the cost of equity capital. The Accounting Review 72, 323–349.
Botosan, C., Plumlee, M., 2000. Disclosure level and expected cost of equity capital: An examination of analysts' rankings of corporate disclosures and alternative methods for estimating the cost of capital. Working paper, The University of Utah.
Bryan, S.H., 1997. Incremental information content of required disclosures contained in management discussion and analysis. The Accounting Review 72 (2), 285–301.
Clarkson, P., Kao, J., Richardson, G., 1999. Evidence that management discussion and analysis (MD&A) is a part of a firm's overall disclosure package. Contemporary Accounting Research 61, 111–134.
Core, J.E., 2001. Firms' disclosures and their cost of capital: A discussion of a review of the empirical disclosure literature. Journal of Accounting and Economics 31, 441–456.
Dave, D., Lawrence, S., Pennock, D.M., 2003. Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In: Proceedings of the 12th International World Wide Web Conference (WWW 2003), ACM, pp. 519–528.
Davis, A., Piger, J., Sedor, L., 2006. Beyond the numbers: An analysis of optimistic and pessimistic language in earnings press releases. Working paper, Washington University in St. Louis.
Dumais, S.T., Platt, J., Heckerman, D., Sahami, M., 1998. Inductive learning algorithms and representations for text categorization. In: Proceedings of CIKM-98, Seventh ACM International Conference on Information and Knowledge Management, pp. 148–155.
Fama, E., French, K., 1992. The cross section of expected stock returns. Journal of Finance 47, 427–465.
Fields, T., Lys, T., Vincent, L., 2001. Empirical research in accounting choice. Journal of Accounting and Economics 31 (1–3).
Firtel, K., 1999. Plain English: A reappraisal of the intended audience of disclosure under the Securities Act of 1933. Southern California Law Review 72, 851–889.
Hand, D., Mannila, H., Smyth, P., 2001. Principles of Data Mining. MIT Press, Cambridge, MA.
Healy, P., Palepu, K.G., 2001. Information asymmetry, corporate disclosure, and the capital markets: A review of the empirical disclosure literature. Journal of Accounting and Economics 31 (1–3), 405–440.
Henry, E., 2006a. Market reaction to verbal components of earnings press releases: Event study using a predictive algorithm. Journal of Emerging Technologies in Accounting 3, 1–19.
Henry, E., 2006b. Are investors influenced by how earnings releases are written? Working paper, University of Miami.
Hussainey, K., Schleicher, T., Walker, M., 2003. Undertaking large-scale disclosure studies when AIMR-FAF ratings are not available: The case of prices leading earnings. Accounting and Business Research 33 (4), 275–294.
Jegadeesh, N., Kim, J., Krische, S.D., Lee, C.M.C., 2004. Analyzing the analysts: When do recommendations add value? The Journal of Finance 59 (3), 1083–1124.
Jegadeesh, N., Titman, S., 1993. Returns to buying winners and selling losers: Implications for stock market efficiency. Journal of Finance 48, 65–91.
Kohut, G., Segars, A., 1992. The president's letter to stockholders: An examination of corporate communication strategy. Journal of Business Communication 29 (1), 7–21.
Kothari, S.P., 2001. Capital markets research in accounting. Journal of Accounting and Economics 31 (1–3).
Lang, M., Lundholm, R., 2000. Voluntary disclosure during equity offerings: Reducing information asymmetry or hyping the stock? Contemporary Accounting Research 17, 623–662.
Larkey, L.S., 1998. Automatic essay grading using text categorization techniques. In: Proceedings of ICML-98, 12th International Conference on Machine Learning, pp. 90–95.
Li, F., 2006. Annual report readability, current earnings and earnings persistence. Working paper, University of Michigan, Ann Arbor.
Mahinovs, A., Tiwari, A., 2007. Text classification method review. In: Roy, R., Baxter, D. (Eds.), Decision Engineering Report Series. Mimeo, Cranfield University, UK.
Pant, G., Srinivasan, P., 2005. Learning to crawl: Comparing classifier schemes. ACM Transactions on Information Systems 23 (4), 430–462.
Peng, F., Schuurmans, D., 2003. Combining naive Bayes and n-gram language models for text categorization. In: Proceedings of the 25th European Conference on Information Retrieval Research (ECIR 2003).
Pérez-Sancho, C., Iñesta, J.M., Calera-Rubio, J., 2005. A text categorization approach for music style recognition: Pattern recognition and image analysis. Lecture Notes in Computer Science 3523, 649–657.
Popa, S., Zeitouni, K., Gardarin, G., Nakache, D., Métais, E., 2007. Text categorization for multi-label documents and many categories. In: Proceedings of the 12th IEEE International Symposium on Computer-Based Medical Systems (CBMS'07), IEEE Computer Society, Washington, DC, pp. 421–426.
Reinganum, M., 1981. Misspecification of capital asset pricing: Empirical anomalies based on earnings' yields and market values. Journal of Financial Economics 9, 19–46.
Rogers, K., Grant, J., 1997. Content analysis of information cited in reports of sell-side financial analysts. Journal of Financial Statement Analysis 3, 17–30.
Salton, G., Buckley, C., 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24 (5), 513–523.
Sebastiani, F., 2002. Machine learning in automated text categorization. ACM Computing Surveys 34 (1), 1–47.
Sebastiani, F., 2005. Text categorization. In: Zanasi, A. (Ed.), Text Mining and its Applications to Intelligence, CRM and Knowledge Management. WIT Press, Southampton, UK, pp. 109–129.
Singhal, A., Buckley, C., Mitra, M., 1996. Pivoted document length normalization. In: Proceedings of the 1996 ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 21–29.
Smith, M., Taffler, R.J., 2000. The chairman's statement: A content analysis of discretionary narrative disclosures. Accounting, Auditing & Accountability Journal 13 (5), 624–646.
Subramanian, R., Insley, R.G., Blackwell, R.D., 1993. Performance and readability: A comparison of annual reports of profitable and unprofitable corporations. Journal of Business Communication 30, 50–61.
Tetlock, P., Saar-Tsechansky, M., Macskassy, S., 2006. More than words: Quantifying language to measure firms' fundamentals. Working paper, University of Texas at Austin.
Thompson, P., 2001. Automatic categorization of case law. In: Proceedings of the 8th International Conference on Artificial Intelligence and Law, ACM, pp. 70–71.
Witten, I., Frank, E., 2000. Data Mining. Morgan Kaufmann Publishers, San Francisco.
Yang, Y., Liu, X., 1999. A re-examination of text categorization methods. In: Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, pp. 42–49.
Yang, Y., Pedersen, J.O., 1997. A comparative study on feature selection in text categorization. In: Proceedings of the 14th International Conference on Machine Learning, pp. 412–420.