On the predictive ability of narrative disclosures in annual reports


European Journal of Operational Research 202 (2010) 789–801


Stochastics and Statistics

Ramji Balakrishnan a,*, Xin Ying Qiu b, Padmini Srinivasan c

a The University of Iowa, Tippie College of Business, Iowa City, IA 52246, USA
b Christopher Newport University, Luter School of Business, Newport News, VA 23606, USA
c The University of Iowa, Computer Science Department and Tippie College of Business, Iowa City, IA 52246, USA

Article info

Article history: Received 6 February 2009; Accepted 18 June 2009; Available online 30 June 2009

Keywords: Economics, Finance, Text mining, Capital markets

Abstract

We investigate whether narrative disclosures in 10-K and 10-K405 filings contain value-relevant information for predicting market performance. We apply text classification techniques from computer science to machine-code the text disclosures in a sample of 4280 filings by 1236 firms over five years. Our methodology develops a model using documents and actual performance for a training sample. This model, when applied to documents from a test set, leads to a performance prediction. We find that a portfolio based on model predictions earns significantly positive size-adjusted returns, indicating that narrative disclosures contain value-relevant information. Supplementary analyses show that the text classification model captures information not contained in the document-level features of clarity, tone and risk sentiment considered in prior research. However, we find that the narrative score does not provide information incremental to traditional predictors such as size, market-to-book and momentum, but rather affects investors' use of price momentum as a factor that predicts excess returns. © 2009 Elsevier B.V. All rights reserved.

1. Introduction

A primary use of accounting reports is to help investors evaluate an organization's financial prospects. Narratives are an important information source for analysts and a critical component of annual reports (Rogers and Grant, 1997). A majority of the financial analysts surveyed by the AIMR (2000) indicate that management discussion is a very or extremely important item when evaluating firm value. However, perhaps because of the relative costs of gathering and analyzing numerical versus textual data, most academic research has focused on the quantitative disclosures in annual reports. Moreover, because of flexibility in framing these disclosures with respect to choice of words and tone in addition to content, it is likely that the information in narratives is not fully impounded into contemporaneous prices (see Li, 2006 for additional observations in this regard).

In this study, we modify and apply techniques from the text classification branch of computer science to the narrative disclosures in 10-K and 10-K405 filings in order to predict market returns. In a training sample, we pair the narrative disclosure in the 10-K documents with the subsequent performance and use standard text classification techniques to build a predictive model. In particular, we define out- and under-performing firms as the top (bottom) 25% of all firms, and group firms into three classes (out-performing, average and under-performing) based on their actual performance from period t to t+1. We then use the text disclosed in period t (which relates to performance for the period t-1 to t) and the performance class to build a model that associates the text in a 10-K report for a period with the next period's performance. This automated text-classification exercise, which employs many features such as the number, frequency, and count of words that are similar (dissimilar) across documents, yields a model that can classify the text for an arbitrary firm as to its predicted performance.

We test the model's predictive ability by applying it to the documents for period t+1 (this testing sample of documents relates to performance for t to t+1) to predict performance for the period t+1 to t+2. We use these predictions to form and maintain a portfolio. Specifically, for each year, our equally weighted portfolio buys stock in firms we predict to out-perform the market and sells predicted under-performers. The magnitude of the size-adjusted returns for the portfolio is then a joint test of the presence of value-relevant information in narrative disclosures and our ability to systematically extract it. (Of course, like the anomalies literature, our analysis also assumes that the information is not impounded immediately into prices.) On average, our portfolio yields a size-adjusted return of 12.16% per year.

Our classifier is word-based, i.e., it extracts key words from the texts and uses these as features to build predictive models.

790

R. Balakrishnan et al. / European Journal of Operational Research 202 (2010) 789–801

We conduct additional tests to examine the extent to which our (word-level) model adds to models built using (document-level) meta-features such as clarity and tone. Motivated by prior research, we add the following three meta-features to our model: the fog index (Li, 2006), risk sentiment (Henry, 2006a) and optimistic tone (Davis et al., 2006). (As a check, we replicated the association between changes in clarity (the fog index) and market returns. A portfolio based on changes in clarity leads to a size-adjusted return of 10%, a magnitude similar to that reported in Li, 2006.) Adding these three document-level meta-features to the text classification model, however, leads to statistically similar returns (11.74% for the augmented model versus 12.16% for the raw model). In contrast, a model that contains only the three textual meta-features does not generate any excess return whatsoever. We also find that while the meta-features distinguish between average and extreme performance (as also predicted by our model), they are less able to distinguish the direction of the difference (i.e., between out- and under-performance), particularly for sub-samples of firms. That is, while the scores on the meta-features of out- and under-performing firms differ from those of the average firm (i.e., their reports are denser and reflect greater risk sentiment), the two groups do not differ between themselves. We conclude that the word-level text classifier model we employ captures more information than is represented in the meta-features of clarity, risk sentiment, and tone, suggesting a fruitful role for word-based text classification methods in accounting and finance research.

Next, to gain insight into the source of the information content, we examine quantitative properties of firms in the predicted classes. These univariate comparisons indicate that the text-based disclosure score we develop is correlated with attributes such as size and market-to-book. Indeed, sub-sample analyses indicate a greater portfolio return for glamour versus value firms, and for small versus large firms. Thus, one explanation for our results is that our text classification captures firm attributes that may be readily computed using financial data. While the correlation between text disclosure and financial characteristics is an interesting finding in itself, it also is possible that the text disclosures affect the association between numeric dimensions and market performance. We examine these conjectures by regressing the excess returns on known factors such as earnings surprise, price momentum, firm size, and market-to-book ratio. We find no main effect for the score on narrative disclosures, suggesting that the disclosure score provides no new information over that provided by known financial factors. However, we find that the coefficient for the interaction term with price momentum reliably differs from zero. (The interaction with market-to-book ratio is weakly significant.) We infer that differences in the disclosures across firms affect confidence in the numerical estimates, a finding consistent with firms with differing profiles following differing text disclosure strategies.

Our study makes both methodological and economic contributions.
Recent literature (e.g., Li, 2006; Davis et al., 2006; Henry, 2006a,b; Tetlock et al., 2006) that conducts large-sample studies of the characteristics of narrative disclosures considers specific dimensions of narrative disclosure.1 Li (2006) examines how changes in a firm's fog index (a measure of readability) correlate with earnings prospects and persistence, thereby shedding light on managerial incentives to alter readability. In a levels study, Davis et al. (2006) consider the association between the tone (optimistic/pessimistic) of current reports and future ROA. Henry (2006a) conducts an event study that links the tone of press releases with market reactions.2 In contrast, our text-classification algorithm offers three key methodological advantages. First, it simultaneously considers all aspects of the disclosure such as length and word choice, thereby avoiding the need to impose an external model to generate meta-features such as optimism or readability. Allowing an unconstrained relation lets the predictive model capture complex interactions among features. This attribute is particularly important for the analysis of text because the relations can occur at the word, sentence and/or document level. And yet (the second advantage), our approach is open to including meta-features such as the fog index, thereby helping us understand the information captured by meta-features. (We perform such an extension.) Third, our approach can be readily extended to include other text sources such as analyst forecasts, economic reports or industry analyses that also might be relevant for firm valuation and for predicting performance. Indeed, it is possible to differentially assign weights to these sources in terms of their credibility, freshness, and so on; such extensions are not possible with current approaches that rely on features developed from external models.3

Economically, we show that current-period disclosure quality is associated with future returns and that the disclosures affect the confidence in estimates.4 Our results indicate considerable benefits from research that refines such predictive models by increasing the dataset (e.g., adding economic forecasts), and by conditioning the model on parameters such as industry and product life cycle. Overall, the techniques we explore in this paper point toward a rich set of questions that parallel the use of numeric disclosures and examine the use of narrative disclosures by market participants as well as management incentives connected with such disclosures.

The rest of this paper proceeds as follows: Section 2 describes our research question and Section 3 provides an overview of the methodology. We discuss the sample selection process and provide sample descriptions in Section 4. We report results in Section 5 and offer concluding remarks in Section 6.

2. Background

Beginning with the seminal work by Ball and Brown (1968), a vast literature examines whether and how market participants employ financial reports to evaluate a firm's future performance, and thus, its value. Fields et al. (2001), Kothari (2001) and Healy and Palepu (2001) provide recent surveys. In contrast to the attention paid to the properties of and the information contained in financial data disclosed by firms, there is a paucity of research examining narrative disclosures. However, such narratives are an important information source to analysts and a critical component of annual reports.
Rogers and Grant (1997) found that the management discussion and analysis (MD&A) section in annual reports constituted the largest proportion of information cited by analysts. They state (p. 17), "[I]n total, the narrative portions of the annual report provide almost twice as much information as the basic financial statements."

1 There is also a research stream (Barron et al., 1999; Clarkson et al., 1999; Subramanian et al., 1993; Smith and Taffler, 2000) that primarily relies on hand-coded classification of a small sample of firms when investigating their research question.
2 Henry (2006b) considers a partitioning algorithm (CART) and shows that including data about key words and document style improves classification accuracy. She performs a 10-fold analysis on contemporaneous data. That is, the model is trained with 90% of the observations and tested on the remaining 10%. Thus, the model is not implementable because it uses data from the same period to predict returns. That is, she uses actual data from 1998 for 90% of firms to predict returns in 1998 for the remaining 10% of firms. In contrast, our approach and tests lead to an implementable strategy in that we use actual data from 1998 to predict returns for 1999.
3 The disadvantage is that the underlying model is not transparent because it might be non-linear. Although not our focus, with additional structure and analyses, it is possible to determine the relative "weights" of the attributes.
4 Botosan (1997) and Botosan and Plumlee (2000), who follow the convention of using the AIMR ranking of corporate disclosure as a measure of disclosure quality, are notable exceptions.

R. Balakrishnan et al. / European Journal of Operational Research 202 (2010) 789–801

791

Similarly, a survey of financial analysts by the Association for Investment Management and Research (AIMR, 2000) found that "86% of surveyed financial analysts indicated that management's discussion of corporate performance was an 'extremely' or 'very' important factor in assessing firm value" (AIMR, 2000).

Research corroborates practitioners' claims that the narrative in an annual report contains value-relevant information. For instance, using quality scores provided by analysts, Clarkson et al. (1999) find that the quality of forward-looking information in the MD&A directly relates to the firm's upcoming performance. Botosan (1997) studies the association between disclosure level and the cost of equity capital, and finds that voluntary disclosures substitute for analyst following in lowering the cost of capital. Bryan (1997) shows that discussions of future operations and planned capital expenditures are associated with one-period-ahead changes in sales, earnings per share, and capital expenditures. Barron et al. (1999) find that high MD&A quality (in terms of compliance with the disclosure requirements) reliably reduces errors in analysts' earnings forecasts. We also interpret the SEC's plain English disclosure rules as acknowledging the importance of narrative disclosures when evaluating earnings and cash flow (Firtel, 1999).

The source of the information content in narrative disclosures is subtle and hard to measure. Subramanian et al. (1993) find that well-performing firms used "strong" writing in their reports while poor performers' reports contained significantly more jargon or modifiers and were hard to read. Smith and Taffler (2000) identify thematic keywords from Chairman's statements and generate discriminant functions to predict company failure. Kohut and Segars (1992) study presidents' letters in annual reports and suggest that, as a communications stratagem, poorly performing firms tend to emphasize future opportunities over poor past financial performance. Lang and Lundholm (2000) find that "optimistic" pre-announcement disclosures of equity offerings lower the cost of equity capital.

Because of the difficulty in data collection and measurement, early studies that examine the qualitative aspects of disclosure usually employ hand-collected data and examine small samples. They also typically rely on experts to code the quality of disclosure (e.g., AIMR scores). Recognizing these limitations, Core (2001) suggests that computing the measure of disclosure quality could greatly benefit from the techniques of other research areas such as computer science, computational linguistics, and artificial intelligence. There also is interest in developing analyses that test the information content and the predictive ability of narrative disclosures in a large-sample study with automatic coding of data. Recent research (Li, 2006; Henry, 2006a,b; Davis et al., 2006) has responded to this call. Typical examples include Li (2006), who shows that changes in the readability of the MD&A section are predictive of future returns, and Davis et al. (2006), who show that tone (a count of pessimistic versus optimistic words) is associated with future ROA. Note that, like our study, Li (2006) assumes that the market price does not instantaneously impound the information contained in narrative disclosures. We view these papers as positing a relation between some dimension(s) of textual data and future performance.
Thus, these papers construct a measure (e.g., fog index, count of positive words) of the (typically single) dimension studied (readability, optimism), and use traditional statistical methodology such as OLS regressions to test the association between the measure and performance. The values and relations among the parameter estimates form the basis for inferences about patterns in the data. Our innovation is the use of an algorithmic approach (see also Henry, 2006b) to develop a predictive model.5 Our approach, which draws from foundations in computer science, focuses on predictive accuracy and treats the data structure or pattern as an unknown. The goal is to let the algorithm "learn" the underlying model using the most relevant information from the entire set available. Thus, the focus is not on generating model parameters but on fitting the best possible model. Such an approach confers at least three advantages:

• We can simultaneously consider many different aspects of the disclosure such as length, readability and word choice, thereby avoiding the need to specify ex ante the meta-features of interest such as optimism or readability. Such an unconstrained relation lets the predictive model discover and capture complex interactions among features. This attribute is particularly important for the analysis of text because the relations can occur at the word, sentence and/or document level. Indeed, we can (and do so in our extensions) include document-level meta-features such as the fog index, thereby helping us understand the information captured by meta-features.

• The approach can be easily extended to include other information sources such as economic reports. Including such data is particularly useful because market participants parse the annual report in the context of the broader economy and the other information available to them.6 Indeed, current developments in computer science allow for models that differentially weight information sources in terms of their credibility, freshness, and so on.

• We can use the model to identify sub-sets of the population that systematically differ in terms of the information content of their disclosure.7

Because of these advantages, the use of algorithmic text classification models is widespread in diverse areas such as marketing, biomedicine, music, law and web crawlers (Dave et al., 2003; Popa et al., 2007; Pérez-Sancho et al., 2005; Thompson, 2001; Pant and Srinivasan, 2005), although their use in finance and accounting is nascent. The primary disadvantage is that the method does not readily yield parameters that we could use to assess the statistical/economic significance of individual dimensions and/or sources. While possible, such analyses require the researcher to impose considerably more structure and are left open for future research.8

5 Our method differs from the CART method in Henry (2006b) in that we do not sequentially add measures of constructs to partition the data. Rather, the entire set of words is used to construct a model. 6 For instance, Asquith et al. (2006) examine the information content of qualitative analysis provided by equity analysts. 7 As an example, consider a model that tests the ability of film reviews to predict box office receipts. We can then identify the reviewers whose reviews consistently outperform reviews by other reviewers. Studying this sub-sample of reviews then can help us understand the features that make a review more predictive of box office success. Similarly, we can use this methodology to find sub-sets of firms whose narrative disclosures are more informative regarding market and/or accounting performance. We can then study these disclosures to glean the reason why. 8 The two approaches are complements. The algorithmic approach can potentially help identify the constructs and an outline of the model. We can then employ traditional statistical methodology to fit the model and identify parameters.


3. Methodology

We focus on whether we could use narrative disclosures to construct measures that predict firms' performance. Constructing such an algorithm requires that we define (1) a method to quantitatively represent a document's narrative disclosure, (2) measures of a firm's performance, and (3) a model that will enable the use of the disclosure measure in step (1) to predict performance as defined in step (2). We address these issues next. (See Appendix A for a non-technical description of the text classification problem; Mahinovs and Tiwari (2007) provide an accessible review of the literature. See http://videolectures.net/mlas06_cohen_tc/ (accessed on 7/8/08) and Sebastiani (2002) for an in-depth review of the area.)

Briefly, text classification is the task of automatically putting documents into predefined categories. (A ready example is assigning news articles by topic such as politics, sports or culture.) This classification task comprises several steps, the first of which is text representation. For this step, we employ standard text representation techniques used in computer science, with suitable modifications for financial reports. Consistent with the literature (Sebastiani, 2005), we first stem the words in a document to their morphological roots (e.g., running is stemmed to run) and eliminate common words such as a and the. We then represent the document as a vector of stems using the "bag of words" approach. The approach is so named because it uses all the terms (stems) in a document regardless of the order or position of the terms. Loosely, the set of "independent variables" in the model is the set of all stemmed words. We can then map a document in n-space, treating each term as a dimension and using a numerical weight for each stem. This weight is usually a function of the frequency of the stem in the document and in the full collection of documents (Hand, 2001).9

Naturally, because the method treats each unique term as a separate dimension, this step leads to a large term space. Accordingly, the next step is to reduce the term space and generate a smaller vocabulary (loosely, to identify the words that have the greatest ability to distinguish among documents). This step is particularly important in our study because the term space generated from 10-K reports is of extremely high dimension. We employ the document frequency (DF) method to reduce the term space. This method ranks words by the number of documents that contain the word and uses a threshold level to reduce the number of words considered. Yang and Pedersen (1997) show that the DF method produces an overall efficiency gain by eliminating less informative terms and reducing the vocabulary size without sacrificing classification accuracy. Finally, we use the term frequency * inverse document frequency (TF*IDF) method (Singhal et al., 1996), the most commonly used weighting scheme, for estimating the term weights for individual terms identified by the DF method. Intuitively, this weighting scheme assumes that the best descriptive terms for a given document are those that occur very often in the given document (term frequency) but not much in other documents (inverse document frequency) (Salton and Buckley, 1988). Note that the document frequency is calculated in the context of our collection of 10-K filing documents. Thus, these words will do well in separating the considered document from other documents. In this way, we represent each document as a point in the n-dimensional term space.
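As an illustration, the representation step (stemming, stop-word removal, a document-frequency cutoff, and TF*IDF weighting) could be implemented along the following lines. This is a minimal sketch using NLTK and scikit-learn; the library choices, the min_df threshold, and the placeholder filing texts are our own illustrative assumptions, not the implementation used in the paper.

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def stem_tokens(text):
    # Keep alphabetic tokens and stem each to its morphological root (running -> run).
    return [stemmer.stem(tok) for tok in text.split() if tok.isalpha()]

vectorizer = TfidfVectorizer(
    tokenizer=stem_tokens,   # "bag of words": order and position of terms are ignored
    stop_words="english",    # drop common words such as "a" and "the"
    min_df=2,                # document-frequency cutoff; a much higher threshold would
                             # be used on the full collection of 10-K filings
)

filings = [
    "The company expects revenue growth in the coming year.",
    "Revenue declined and the company expects further losses.",
]
X = vectorizer.fit_transform(filings)   # one row per filing in the n-dimensional term space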
Step 2 in our method is to identify the predictive attribute of interest. We focus our analysis on size-adjusted returns because the market return is the metric of most interest to shareholders, analysts and other users.10 This performance measure becomes the (n+1)th dimension associated with a document. In this context, note that predicting a specific value of a performance measure is a harder task than predicting a category of that measure, because a real-valued prediction is more granular than a category prediction. As an exploratory study, we start with a coarser approach and classify firms into three classes relative to their peers: under-performing, average, and out-performing. Each year, we rank the firms by their actual performance for the next year, and use the 25th and the 75th percentiles to define the cutoffs for the three classes.

For step 3 in our method, ideally, we could develop a mapping between a firm's disclosure vector (as developed in step 1) and the performance measure (in step 2). The classical statistical approach (which includes studies that examine one or more specified aspects of the text) then finds parameters that fit a specified model to the data. Our approach differs in that we do not adopt a model or specify the attributes of interest. Rather, akin to a neural net, we let the data-driven text-classification algorithm "learn" the potentially non-linear and multi-faceted relation between the text attributes and future returns. Essentially, the model seeks to construct an n-dimensional hyperplane that best separates the data points as per their categories.11 Once we "train" the model, we apply it to a hold-out sample (in our case, to the annual reports for the next year). The output from this analysis is a prediction for each firm in the hold-out sample as to its category: out-perform, average or under-perform. We then construct equally weighted portfolios based on these predictions. That is, we allocate the same dollar amount to two sets of firms – we buy firms predicted to out-perform and sell firms expected to under-perform. The size-adjusted return earned by the portfolio is our measure of incremental value and the predictive ability of narrative disclosures.

3.1. Design

For our design, a data point represents the results from a particular measure and year. We draw the training document set and the test document set from adjacent years. We use documents that report performance for the period t-1 to t (available at time t) to build the predictive model.12 We then apply the model to the documents reporting results for year t (available at time t+1) to predict the performance category for the period t+1 to t+2. (Notice that the standard 10-fold validation in text classification, as in Witten and Frank (2000), is not sensible in this context because we build a single model for each year in the sample.)
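A sketch of steps 2 and 3 follows: label each training firm by its next-year size-adjusted return (bottom 25% / middle 50% / top 25%) and fit a linear support vector machine (the classifier family used in this paper; see Appendix A) on the TF*IDF vectors. The use of pandas and scikit-learn's LinearSVC, and the assumed inputs named below, are illustrative choices rather than the authors' exact code.

import pandas as pd
from sklearn.svm import LinearSVC

def label_performance(next_year_sar: pd.Series) -> pd.Series:
    # Three classes relative to peers in the same year: bottom 25%, middle 50%, top 25%.
    lo, hi = next_year_sar.quantile(0.25), next_year_sar.quantile(0.75)
    return pd.cut(next_year_sar,
                  bins=[-float("inf"), lo, hi, float("inf")],
                  labels=["under-perform", "average", "out-perform"])

# Assumed inputs: X_train / X_test are the TF*IDF matrices for year-t and year-t+1
# filings (previous sketch); sar_next holds each training firm's realized
# size-adjusted return for the following year.
y_train = label_performance(sar_next)

clf = LinearSVC()            # linear SVM; the three classes are handled via two-way splits
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)   # "out-perform", "average", or "under-perform" per firm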

9 In accounting, Smith and Taffler (2000) and Hussainey et al. (2003) show that counts of keywords are related to bankruptcy and to the association between current earnings and future stock returns. 10 Unlike the two accounting metrics, the return metric impounds other information not reflected in the firm's financial statements because market prices are based on forward-looking information (Kothari, 2001). Thus, the market return is the hardest to predict. On the other hand, a firm's management exercises greater control over accounting data. Even though there is ongoing debate on whether earnings management is generally opportunistic or strategic (e.g., Arya et al., 1998, 2003), there is broad consensus that firms employ discretionary accruals to manage reported income. Such practices add noise to the accounting measures we consider. 11 The method is ideally suited for binary classifications. Because we have three classes, we perform three two-way classifications and combine the predictions to generate an overall classification. See the last paragraph of Appendix A for details. 12 The number of years to consider when building a predictive model is an interesting question. We could use all available data to construct the model, weighting recent years more. We use a conservative approach and only employ the most recently available information. In essence, our approach assumes that the patterns unearthed in last year's annual report would hold for the current year's annual report, and can help predict performance in the forthcoming year.


Fig. 1. Overview of design. The timeline marks years t-1, t and t+1, the annual report documents Doc(t-1) and Doc(t), the size-adjusted returns SAR(t) and SAR(t+1), and the three steps A, B and C described below.

where: SAR(t) = size-adjusted return cumulated from April 1 of year t to March 31 of year t+1; Doc(t-1) = annual report for year t-1, usually available in March of year t.

A. For firms in year t-1, build a predictive model of year t using the firms' SAR in year t (i.e., size-adjusted return cumulated from April of year t to March of year t+1) and the annual reports for year t-1, which are usually published in March of year t.

B. For firms in year t, apply the predictive model built in step A to the annual reports for year t, which are published in March of year t+1, and predict the class of SAR performance of these firms in year t+1, i.e., the 3-class of SAR (size-adjusted return cumulated from April of year t+1 to March of year t+2).

C. On March 31 of year t+1, given a set of predicted out-performing firms and a set of predicted under-performing firms from step B, we sell the under-performing firms' stocks at a total value of (for example) 10 million dollars and buy the out-performing firms' stocks with a total value of 10 million dollars. In both the buying and selling transactions, we allocate equal values of stock among the firms. On March 31 of year t+2, we sell the stocks of the out-performing firms and buy the stocks of the under-performing firms. If our prediction was correct, this transaction should generate a non-negative profit.
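The portfolio step in Fig. 1 reduces to a simple calculation: equal-weight long the predicted out-performers, short the predicted under-performers, and take the average size-adjusted return (SAR) difference as the portfolio return. A minimal sketch follows; the column names (prediction, sar) are placeholders for illustration.

import pandas as pd

def portfolio_return(holdout: pd.DataFrame) -> float:
    # holdout: one row per firm with its predicted class and realized size-adjusted
    # return for the holding year (April 1 of year t+1 to March 31 of year t+2).
    long_leg = holdout.loc[holdout["prediction"] == "out-perform", "sar"]
    short_leg = holdout.loc[holdout["prediction"] == "under-perform", "sar"]
    # Equal weights within each leg; firms predicted "average" are not traded.
    return long_leg.mean() - short_leg.mean()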

Based on the classification, we examine whether an implementable trading strategy based on predictions from our model earns a positive size-adjusted return. Such a test is interesting because predictive accuracy is a relatively coarse performance metric. Further, portfolio returns have an endogenous cost of prediction successes and failures. Finally, a returns test is the appropriate measure to examine whether there is incremental information content in the narrative disclosures relative to the information impounded into contemporaneous prices. We calculate a portfolio return as the average size-adjusted return difference between the out-performing firms and the under-performing firms for each year. We report results for a 25–50–25% cut-off for defining the three classes of out-performing, average, and under-performing firms. (We verified robustness with a 10–80–10 cut-off.) We calculate a portfolio-level return for a buy-and-hold strategy (see Fig. 1). Specifically, consider the model constructed using documents for the year ending 12/31/1999 (data available March 2000) and calendar year 2000 performance (data available in March 2001). We apply this model to documents available in March 2001, make predictions, and measure the cumulative size-adjusted return for the portfolio from April 1, 2001 to April 1, 2002. We verified that such a strategy is implementable in that the documents are available before 3/31. Further, because the SAR for a random portfolio is zero by construction, this return is the incremental return relative to constructing a random portfolio.

We perform two analyses to understand the source of any excess return. Our first approach checks whether our disclosure score is picking up known document-level features such as the fog index or risk sentiment. For each document, we add these features to the term space and construct a new model, and use the predictions of the augmented model to construct portfolios. If these meta-features are incrementally useful, the predictions of the augmented model should exhibit greater returns relative to the base model. Our second approach employs cross-sectional regressions. We estimate:

SAR = α + β1 Dummy + β2 Size + β3 MTB + β4 PM + β5 Earnings Surprise + β6 Size × Dummy + β7 MTB × Dummy + β8 PM × Dummy + β9 Earnings Surprise × Dummy + error,

where:
SAR = size-adjusted buy-and-hold return for the year;
Dummy = 1 if the firm is classified as out-performing and 0 for predicted under-performing firms (average firms are excluded from this analysis);
Size = the size of the firm, measured as the natural logarithm of total assets;
MTB = market-to-book ratio (a valuation proxy), using the closing market price as of the start of the holding period;
PM = price momentum, measured as the SAR for the six months preceding the start of the holding period;
Surprise = actual EPS − forecast EPS, where the forecast is the latest available consensus analyst forecast.

Our choice of regressors stems from studies (e.g., Jegadeesh et al., 2004) that examine the incremental information content of analyst forecast revisions after controlling for factors known to affect returns. In the above regression, a positive coefficient β1 is consistent with the narrative disclosures providing incremental value-relevant information to market participants. A non-zero interaction term is consistent with the narrative disclosure altering the confidence market participants place in the numeric estimates.
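One way to estimate this cross-sectional regression is with statsmodels' formula interface, with standard errors clustered by firm (mirroring the cluster-adjusted statistics reported in Table 4). The DataFrame layout and column names below are illustrative assumptions.

import statsmodels.formula.api as smf

# df: one row per firm-year in the two extreme predicted classes, with columns
# SAR, Dummy (1 = predicted out-perform, 0 = predicted under-perform),
# Size, MTB, PM, Surprise, and a firm identifier used for clustering.
model = smf.ols("SAR ~ Dummy * (Size + MTB + PM + Surprise)", data=df)
result = model.fit(cov_type="cluster", cov_kwds={"groups": df["firm"]})
print(result.summary())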

4. Data and descriptive statistics

The primary data for our experiments include the firms' financial data, size-adjusted returns, and the firms' annual reports. To increase the homogeneity of firms in the sample, we restrict the sample to firms in the manufacturing industry (SIC codes 2000 to 3999) with December as the fiscal year-end month. The sample period is from 1997 to 2002 (we include return data for 2003 as well). We ensure data integrity and accuracy by using the values for gvkey from the COMPUSTAT database, permno from the CRSP database, and OFTIC from I/B/E/S to identify a unique firm. We collect financial data for a total of 1236 unique firms.


Table 1. Descriptive statistics.

Panel A: sample selection
Item                                            Number of firm-years   Notes
Total number of documents (1997–2002)           4755
Do not have SAR data                            (295)
Potentially useful in SAR exercise              4460
Truncated extreme observations                  (180)                  Trimmed top and bottom 1% of observations
Documents used to develop SAR prediction model  4280                   Statistics presented in Table 1
Loss due to 1-year-ahead prediction             (751)                  We do not have predictions for 1997, the first year with documents
Documents with portfolio experiment results     3529                   Results presented in Tables 2, 3, and Panel A of Table 4
No data for sub-sample classification           (459)
Available for sub-sample analysis               3070
Trimmed for extreme observations                (124)                  Top and bottom 1% of observations removed for each variable
Net available for sub-sample analysis           2946                   Results presented in Panel A of Table 3
Used only extreme quintiles in regressions      1473
Lost due to missing data                        (406)
Used in regression                              1007                   Results presented in Panel B of Table 4

Panel B: sample characteristics
Item                    N (firm-years)   Mean       Median    25th percentile   75th percentile
Sales (millions)        4099             $2749.66   $336.44   $65.93            $1666.91
Net assets (millions)   4099             $3153.43   $385.90   $95.52            $1735.89
ROE                     3073             3.48%      7.77%     10.64%            18.09%
EPS                     4255             $0.248     $0.63     $0.18             $1.39
Size-adjusted return    4280             2.23%      12.91%    41.11%            18.48%
Market-to-book ratio    4098             1.95       1.18      0.64              2.36

Panel C: industry composition
SIC codes   Number of firms in sample
20–25       99
26          38
27          32
28          276
29–32       59
33          43
34          38
35          171
36          185
37          55
38          199
39          23
Total       1236

Each annual report has an accession code as its unique identifier. We manually download from Mergent Online the accession codes of the annual reports for each firm from 1997 to 2002. We then automatically retrieve the annual reports from the EdgarScan website using the downloaded accession codes. There are 10 different submission types for annual reports: 10K (10-K filings), 10K405 (10-K filings where the regulation S-K Item 405 box on the cover page is checked), 10K405A (amendments to 10K405), 10KA (amendments to 10-K filings), 10KSB (10-K filings for small business), 10KSBA (amendments to 10KSB), 10KSB40 (optional form for small business where the regulation S-B Item 405 box on the cover page is checked), 10KSB40A (amendments to 10KSB40), 10KT (10-K transition report), and 10KTA (amendments to 10-K transition reports). We focus on the major submission types of 10K and 10K405. Our final set of usable documents with matching financial performance measures is 4280 annual reports from 1236 firms published in years 1997 to 2002. Using the CRSP database, we calculate the size-adjusted cumulative return as the size-adjusted buy-and-hold return cumulated for 12 months from April 1 of the fiscal year to the next April. We verify that the relevant documents are available, and that the strategy is implementable.

4.1. Sample description

Table 1, Panel A provides the number of observations considered for each analysis. We begin with 4755 documents but make only 3529 predictions because of missing data and because of the lagging nature of the predictive model. We also trim the top and bottom 1% of observations on size-adjusted returns and other classification variables to reduce the influence of outlying observations.13 We use 3070 observations in the sub-sample analysis because we could not collect the classification data required for 1997.

Panel B provides descriptive data for our sample, with each observation representing a firm-year. Over all years, the average firm has mean sales of $2749 million and median sales of $336 million, indicating the presence of several large firms in the sample. The average ROE is 3.48%, while the median is 7.77%. The mean and median values for the market-to-book ratio are 1.95 and 1.18, respectively. Panel C of Table 1 provides the industry breakdown for sample firms. We do not find any significant clustering of industries specific to our sample. Tests (not reported) do not reveal any systematic difference between the spread of firms in our sample and the distribution of all COMPUSTAT firms from the relevant SIC codes.

13 In the accounting and finance literatures, such trimming is standard when dealing with security returns. The average return in the bottom (top) 1% is close to (well over) −100% (+100%), which is not representative of average returns.
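Before turning to the results, a minimal sketch of the size-adjusted buy-and-hold return computation described above: cumulate twelve monthly returns (April through March) for the firm and subtract the same cumulation for a size-matched benchmark portfolio. The column names (ret, bench_ret) and the monthly frequency are assumptions for illustration.

import pandas as pd

def size_adjusted_return(monthly: pd.DataFrame) -> float:
    # monthly: twelve rows (April through March) with the firm's monthly return
    # ("ret") and the return of its size-matched benchmark portfolio ("bench_ret").
    firm_bh = (1.0 + monthly["ret"]).prod() - 1.0
    bench_bh = (1.0 + monthly["bench_ret"]).prod() - 1.0
    return firm_bh - bench_bh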


Table 2. Average return difference between predicted out-performing and under-performing firms. Model based on documents for year t and performance for year t+1; tested on documents for year t+1 and performance for year t+2.

Panel A: portfolio returns
Year      With 25–50–25% performance class definition (%)   With 10–80–10% performance class definition (%)
1998      -2.46                                              4.69
1999      63.68                                              42.62
2000      -36.84                                             -45.73
2001      19.60                                              19.11
2002      16.82                                              12.25
Average   12.16                                              6.59

Panel B: number of out-performing and under-performing firms predicted, as in Panel A
          With 25–50–25% performance class definition        With 10–80–10% performance class definition
Year      Predicted out-performing   Predicted under-performing   Predicted out-performing   Predicted under-performing
1998      98                         126                          45                         54
1999      110                        83                           61                         41
2000      206                        110                          78                         46
2001      97                         170                          55                         54
2002      97                         112                          56                         37
Total     608                        601                          295                        232

Panel C: portfolio returns (sub-sample analysis)
Year      Large size (%)   Small size (%)   Value (%)   Glamour (%)
1999      18.08            88.40            7.76        83.36
2000      -18.78           -41.08           -17.49      -46.66
2001      16.61            0.00             3.68        15.04
2002      11.08            9.58             11.90       13.85
Average   6.75             14.23            1.46        16.39

Notes: 1. Cell entries represent the portfolio-level buy-and-hold size-adjusted return for a year beginning April 1 and ending March 31. The portfolio is long the predicted out-perform firms and short the predicted under-performers. 2. The performance class specification relates to the performance cutoffs used to define classes in the training sample.

We also do not find systematic differences in key firm characteristics. Thus, our results appear generalizable, albeit only to manufacturing firms. We use the size-adjusted cumulative return as the key metric of firms' financial performance. As noted earlier, this metric contains the market response information that is generally not reflected in the financial statements.

5. Results

The dependent variable in our analysis is a portfolio size-adjusted return, rebalanced each year. We construct an equally weighted buy-and-hold portfolio that sells the under-performing firms and invests in the predicted out-performing firms. We ensure that we employ an implementable strategy by verifying that all of the documents were available before April. We calculate annual returns for the prediction period (April to April). For robustness, we replicate the analysis both for the 25–50–25 partition (reported) and for the 10–80–10 partition of the sample for identifying out- and under-performing firms.

Panel A of Table 2 presents results on the cumulative size-adjusted return by year. For both partitions, we find a significant return for every year except 1998 and 2000 (when we find a significantly negative portfolio return). One reason for this anomaly might be the considerable turbulence experienced by financial markets during 2000 (see, for example, Barber et al., 2003). On average, we find an annual excess return of 12.16% using the 25–50–25 partition and 6.59% using the 10–80–10 partition for developing the model. These estimates are consistent with earlier research that hints at the considerable information content of narrative disclosures. These results also suggest that the market has difficulty in immediately parsing the information content of the disclosures, meaning that this information shows up in the return for the next year.14

In Panel B, we report the number of firms classified as out- and under-performing, by year, for each of our partitions. These data show that, while the predictive model was constructed using a 25–50–25 partition of actual performance, the actual number predicted to out- or under-perform is not 25% of the hold-out sample of firms. For instance, only 608 firm-years are predicted to out-perform, whereas a naïve expectation is 882 (= 3529 total observations classified × 0.25). (Using a proportions test, this difference is statistically significant.) Thus, as is intuitive, our predictive model is better able to pick up "extreme" differences from the average firm relative to smaller differences.

Panel C of Table 2 presents results for sub-samples of firms. We investigate two partitions, based on market-to-book and on firm size. For the first set of results, we determined the median market-to-book value for each year. We then partitioned sample firm-years into the value or glamour categories based on their value relative to the median for the relevant year.
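The proportions comparison mentioned above (608 predicted out-performers against a naïve expectation of 25% of the 3529 classified observations) can be run, for example, with statsmodels; the specific call is our illustrative choice, not the authors' code.

from statsmodels.stats.proportion import proportions_ztest

stat, pvalue = proportions_ztest(count=608, nobs=3529, value=0.25)
print(f"z = {stat:.2f}, p = {pvalue:.4g}")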

14 Inspection suggests that general market volatility affects participants' ability to gainfully use narrative disclosures to predict future performance. Systematic analysis of this inference is hampered because we have only 5 annual observations. Extending the analysis to more years (and/or using quarterly reports) is one way to obtain enough data to test this conjecture.


Table 3. Supplementary analysis.

Panel A: semantic values by firm partitions (mean values of attribute). Glamour and value firms are partitioned by market-to-book; large and small firms are partitioned by size; a separate analysis is performed for each group.

                              Full sample   Glamour firms   Value firms   Large firms   Small firms
All firms
  N (firm-years)              3529          1473            1472          1473          1472
  Fog index                   18.41         18.41           17.97         17.82         18.57
  Risk sentiment              28.81         29.23           24.37         27.30         26.20
  Tone                        0.401         0.404           0.392         0.401         0.391

Firms predicted to out-perform
  N (firm-years)              608           249             129           217           192
  Fog index                   18.515        18.59           17.49         18.02         18.65
  Risk sentiment              31.939        33.76           30.92         37.21         30.92
  Tone                        0.399         0.406           0.384         0.404         0.394

Firms predicted to under-perform
  N (firm-years)              601           141             181           203           167
  Fog index                   18.740        18.86           17.96         18.30         18.81
  Risk sentiment              34.600        37.95           32.41         35.77         34.70
  Tone                        0.393         0.400           0.383         0.383         0.387

t-tests of differences
Firms predicted to out-perform versus all firms
  Fog index                   2.34*         1.06            0.68          0.75          0.02
  Risk sentiment              4.65***       3.01***         3.32***       3.31***       2.84**
  Tone                        1.99          0.34            2.46          0.22          1.42

Firms predicted to under-perform versus all firms
  Fog index                   6.19***       3.44***         1.02          2.75**        1.83
  Risk sentiment              7.40***       3.69***         4.25***       5.06***       4.38***
  Tone                        4.25***       1.19            1.02          2.73**        2.65**

Predicted out-performers versus predicted under-performers
  Fog index                   2.46*         1.95            1.19          1.53          1.29
  Risk sentiment              1.57          1.82            0.65          1.30          1.80
  Tone                        2.78**        0.80            0.15          1.47          0.87

Panel B: average return difference between predicted out-performing and under-performing firms. Model (augmented with three document-level meta-features) based on documents for year t and performance for year t+1; tested on documents for year t+1 and performance for year t+2.
Year      With 25–50–25% performance class definition (%)
1998      -1.85
1999      64.65
2000      -37.48
2001      17.01
2002      16.39
Average   11.74

Notes:
1. Variable definitions are as follows:
   Risk: sum of risk term frequency.
   Tone: (optimism term frequency − pessimism term frequency)/(optimism term frequency + pessimism term frequency).
   Fog index: a measure of readability, calculated as 0.4 × [(words/sentences) + 100 × (words with more than two syllables/words)].
2. Entries in Panel A are the raw values. We performed a log transformation when including the three textual features in the model.
3. Cell entries in Panel B represent the portfolio-level buy-and-hold size-adjusted return for a year beginning April 1 and ending March 31. The portfolio is long the predicted out-perform firms and short the predicted under-performers.
4. The performance class specification relates to the performance cutoffs used to define classes in the training sample.
* p < 0.05. ** p < 0.01. *** p < 0.001.
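For concreteness, the three document-level meta-features defined in the Table 3 notes could be computed along the following lines. The optimism, pessimism, and risk word lists shown are placeholders; the paper draws these constructs from prior work (Davis et al., 2006; Henry, 2006a; Li, 2006), and the syllable counter is a crude approximation used only for illustration.

import re

OPTIMISM = {"growth", "improve", "strong"}        # placeholder word lists
PESSIMISM = {"decline", "weak", "loss"}
RISK = {"risk", "uncertainty", "litigation"}

def count_syllables(word: str) -> int:
    # Crude vowel-group count, used only to flag "complex" words (3+ syllables).
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def meta_features(text: str) -> dict:
    words = re.findall(r"[a-z']+", text.lower())
    n_words = max(1, len(words))
    n_sentences = max(1, len(re.findall(r"[.!?]+", text)))
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)

    fog = 0.4 * (n_words / n_sentences + 100.0 * complex_words / n_words)
    opt = sum(w in OPTIMISM for w in words)
    pes = sum(w in PESSIMISM for w in words)
    tone = (opt - pes) / (opt + pes) if (opt + pes) else 0.0
    risk_sentiment = sum(w in RISK for w in words)
    return {"fog_index": fog, "risk_sentiment": risk_sentiment, "tone": tone}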

We then re-estimated the textual model for each of the sub-samples separately. We repeated the exercise for size, using total assets as the measure. Our textual model indicates greater value relevance in the disclosures made by glamour firms and by small firms. Portfolios based on the model predictions have a size-adjusted return of 14.23% on average for small firms but only 6.75% for large firms. Similarly, expectations about future growth drive the valuations of glamour firms more than those of value firms. Again, we find size-adjusted returns of 16.39% (1.46%) when we form portfolios for glamour (value) firms. In other words, our findings show that firms grouped on readily observable metrics such as size and market-to-book ratio follow detectably different text disclosure strategies. (However, our analysis does not speak to the dimensions in which the disclosures differ, a matter for additional research in this area.)


Table 4. Sample characteristics and incremental returns. Model based on documents for year t and performance for year t+1; tested on documents for year t+1 and performance for year t+2. Sample firm-years from the SAR implementable experiment.

Panel A: sample characteristics, mean (median)
Item                            Predicted under-perform (N = 601)   Predicted average-perform (N = 2320)   Predicted out-perform (N = 608)
Assets ($ million)              $1736.93 (170.88)                   $3625.59 (488.60)                      $3715.88 (482.53)
Sales ($ million)               $1389.62 (91.72)                    $3136.83 (473.51)                      $3125.95 (376.17)
EPS                             $0.305 (0.231)                      $0.746 (0.74)                          $0.699 (0.72)
Market-to-book                  1.636 (1.042)                       1.718 (1.012)                          3.124 (1.953)
Leverage                        0.441 (0.396)                       0.493 (0.505)                          0.447 (0.445)
Earnings surprise               0.153 (0.175)                       0.249 (0.1)                            0.099 (0)
Price momentum                  0.332 (0.39)                        0.017 (0.067)                          0.696 (0.315)
Size-adjusted return (annual)   0.057 (0.173)                       0.009 (0.097)                          0.019 (0.126)

SAR = α + β1 Dummy + β2 Size + β3 MTB + β4 PM + β5 Earnings Surprise + β6 Size × Dummy + β7 MTB × Dummy + β8 PM × Dummy + β9 Earnings Surprise × Dummy + error

Panel B: incremental information content
                                  Regression model 1        Regression model 2
Item                              Estimate     t-Value      Estimate     t-Value
Intercept                         0.068        0.87         0.042        0.36
Dummy for model prediction        0.054        1.08         0.043        0.26
Log (total assets)                0.014        1.17         0.011        0.56
Log (market-to-book)              0.003        0.12         0.027        0.75
Earnings surprise                 0.001        0.36         0.004        0.68
Price momentum                    0.083        3.11***      0.035        1.23
Dummy × log (total assets)                                  0.002        0.08
Dummy × log (market-to-book)                                0.071        1.65
Dummy × earnings surprise                                   0.011        1.39
Dummy × price momentum                                      0.291        4.00***
N                                 1007                      1007
Adjusted R-square                 0.009                     0.023
F-value                           2.71**                    3.54***

Notes:
1. Variable definitions are as follows: SAR = size-adjusted buy-and-hold return for the year. Dummy = 1 if the firm is classified as out-performing and 0 for predicted under-performing firms; average firms are excluded from this analysis. Size = the size of the firm, measured as the natural logarithm of total assets. MTB = market-to-book ratio (a valuation proxy), using the closing market price as of the start of the holding period. PM = price momentum, measured as the SAR for the six months preceding the start of the holding period. Surprise = actual EPS − forecast EPS, where the forecast is the latest available consensus analyst forecast.
2. Test statistics employ cluster-adjusted standard errors to control for multiple observations from the same firm.
** p < 0.01. *** p < 0.001.

5.1. Link to meta-features

It is possible that our text classification model is merely replicating the previously known association between document-level meta-features (e.g., clarity, tone) and future performance. Panel A of Table 3 presents descriptive data for the three features we study, both for the full sample and for the sub-samples we consider. (In this table, each firm-year is a separate observation.) We report the relevant t-statistics at the bottom of this panel. The first column of this panel reports data relating to the entire sample for which we obtain performance predictions. Relative to the average firm, we find that firms predicted to out-perform have denser text (fog index of 18.515 versus 18.41, t = 2.34) and have more words expressing risk (risk sentiment of 31.93 versus 28.81, t = 4.65) but have a similar tone. We find a similar pattern for firms predicted to under-perform, with even the tone turning more pessimistic. Thus, firms in the "tails" of the distribution of predicted performance differ from the average firm. A first conclusion is that our classification model picks up firms that differ systematically on the meta-features of tone, clarity and risk sentiment. However, we have weaker evidence for this conclusion when we compare the feature scores for firms predicted to out-perform with those predicted to under-perform.


We find that predicted under-performers have marginally less readable documents and slightly greater pessimism. The two groups express similar risk profiles. These comparisons suggest that text disclosures do have information content relating to performance, and that meta-features can help us identify the extremes. The comparisons across the columns highlight that we cannot simply use the meta-features to replace the model. This inability arises because meta-features are of less use in distinguishing the direction of the performance differential, the key attribute of interest. Data in the next four columns (for the sub-samples of glamour, value, large and small firms) provide additional evidence that supports our inference. For all four sub-samples, the predicted under- and out-performing firms have greater scores for risk sentiment relative to the average firm. However, we do not find differences between the sets of firms predicted to out- and under-perform, for every sub-sample and for every measure. We conclude that while the meta-features are picking up differences in style and tone that are systematically related to the performance differentials predicted by our model, they seem unable to distinguish the nature of the performance differential.15

For an additional test of whether our predictive model captures more than the meta-features, we refit the predictive model including the meta-features in the term space. As shown in Panel B of Table 3, we obtain a similar (11.74%) return from the augmented model. More importantly, we also fit a model using only the meta-features. Such a model should, in theory, produce the same predictive ability if the meta-features contained all of the information in the documents. However, we find that such a parsimonious representation of a document (as three meta-features of clarity, tone and risk sentiment) has no explanatory power at all (results not tabled). Overall, we conclude that our text classification model is capturing features not picked up by the selected meta-features.

5.2. Incremental information content

Panel A of Table 4 returns to the full-sample analysis. This table provides descriptive data on the firms predicted in the three classes, for the 25–50–25 classification. Relative to the average firm in the out-perform sample, the firms in the under-perform sample are reliably smaller, have lower market-to-book ratios, and are less profitable (as measured by EPS). This distinction provides additional evidence about the information content of disclosures because the classification does not use any numeric item. The text in the annual reports is enough to identify a distinct sample of under- and out-performing firms.

Panel B of Table 4 provides results that speak to the relation between the information in the narratives and the information in quantitative disclosures. In particular, it is of interest to know whether the information in the narrative disclosure is subsumed by or is incremental to the information in the quantitative disclosure. The first column reports results for a model that considers main effects only. We find that the coefficient for the model prediction (for "dummy") is not reliably different from zero. Thus, the text disclosure does not appear to provide value-relevant information incremental to that provided by known factors but rather captures known features.
We find that large firms earn smaller returns (Reinganum, 1981), and that a high market-to-book ratio presages lower returns as well (Fama and French, 1992). However, once we account for all other factors, our data do not show the expected relation between price momentum and excess returns (Jegadeesh and Titman, 1993). Interestingly, we note that the univariate comparison in Panel A is significant at the 5% level (the price momentum is 0.019 for predicted out-performers versus 0.332 for the predicted under-performing firms). The regression estimate, however, indicates that the incremental effect (after accounting for other factors such as market-to-book, size and earnings momentum) is negative.

The second column in this panel reports results for a complete model that includes interaction terms for the model prediction (a binary variable) with established factors. We continue to find an insignificant main effect for our model's prediction. However, as indicated by the significant interaction terms, the disclosure score is weakly informative as to whether the glamour/value partition will continue to yield excess returns for the next period. Moreover, the interaction with price momentum is significant. That is, the disclosure score indicates that the effect of price momentum for firms predicted to out-perform is reliably smaller than for the average firm. Thus, while Jegadeesh and Titman (1993) show abnormal returns to buying winners and selling losers, our results suggest the possibility of finer partitions.16 Thus, one interpretation of our results is that narrative disclosures can help identify firms with negative price momentum that reverses over the next year. Stated differently, narrative disclosures could help identify whether the price momentum will sustain into the next period or will reverse.

6. Conclusions

This paper is part of a nascent literature that explores the narrative disclosures made by firms and complements an established literature that considers the ability of numeric data to predict market performance (e.g., the accrual or the post-earnings-announcement drift anomalies). Most prior studies of textual disclosures have relied largely on expert classification, thereby limiting sample sizes and the kinds of questions that could be asked. This study demonstrates a methodology for large-scale text mining of the narrative disclosures in annual reports. Even a relatively simple model, when applied to the narrative data alone, successfully predicts future accounting and market performance.

There are several limitations of our approach. Our methodology only allows for limited economic insight into what characteristics of the disclosure lead to certain predictions (see, e.g., Li, 2006). We also employ a simple 'bag of words' approach, without paying attention to the context of the usage of specific words. Further, we limit ourselves to the disclosures in the annual report and thus restrict the information set that market participants would employ. However, these limitations can be addressed using some of the emerging techniques in text mining (see, e.g., Pant and Srinivasan, 2005).

We could expand this study along several dimensions. The first is the use of text mining models that consider attributes such as tone, phrasing and so on. The second avenue is to augment the disclosures with additional disclosures such as press releases.
We also could incorporate economy-wide predictions, such as news releases from the Federal Reserve, and sector-specific forecasts by trade associations.

15 We note that, in a changes analysis, Li (2006) shows the predictive ability of the fog index (we replicate this finding as well). The other studies do not focus on market performance in an implementable way. We also note that, considering all firms, the group of glamour firms has greater risk sentiment and a more optimistic tone (p < 0.001 for both comparisons) relative to the scores recorded for value firms. We also find that larger firms tend to have greater readability, but their count of risk-related words is also higher.

16 These results hold even if we discard the data for the year 2000 from the analysis. We also note that portfolios formed on price momentum generate abnormal negative returns after an initial holding period (Jegadeesh and Titman, 1993).


A third avenue is to identify the nature of the differences in disclosures by large versus small firms, and by glamour versus value firms, as our results suggest that these sub-sets follow differing strategies. We also could study extreme observations (e.g., a high positive score but a large negative return) to identify features that diminish the informativeness of text disclosures. Finally, it is of interest to examine the time it takes market participants to impound textual information. While we have focused on annual returns, we conjecture that studies examining shorter time frames might find sharper differences, whereas additional economic noise would wipe out the effect over longer time frames. The relation is not likely monotonic, however, because quantitative data likely dominate returns over very short (intra-day or a few days) intervals.

Acknowledgements

We thank Mort Pincus, Cristi Gleason, Paul Hribar, the editor, two anonymous reviewers, and workshop participants at the University of Iowa and Christopher Newport University for helpful comments. Xin Ying Qiu also acknowledges contributions from members of her thesis committee.

Appendix A. Building text classifiers for prediction

Text classification is a core activity in information science. The goal is to assign each text to one (or more) of a given set of categories. As an example, we may be interested in classifying a news article using the categories of sports, health, famous persons, entertainment, gardening, real estate or finance. The article could belong to sports, or it could belong to both the famous persons and sports categories. Trained individuals may perform text classification manually. Alternatively, classification may be accomplished using computational tools called text classifiers. The design and evaluation of algorithms for automatic text classification has been a highly active field of research for several decades. The field is now mature to the point that text classifiers are used not only to decide conceptual categories (as in the above example) but also to capture more subtle human phenomena such as sentiment: classifiers are being used to identify sentences that are speculative (versus those presenting ideas with confidence), to label sentence tone as positive, negative or neutral, and so on. Developments in these more subtle realms in part motivate our use of text classifiers to predict market performance.

The automatic methods employed in text classification derive predominantly from research in machine learning, a subfield of artificial intelligence. Major examples include text classification algorithms based on support vector machines (as in this paper), neural nets, decision trees and association rules. Of these, Support Vector Machine (SVM) based algorithms are amongst the most effective (Sebastiani, 2002, p. 49). A given classification problem generally (but not always) starts with some training data that has been classified by a reliable mechanism, such as an expert, into one of two classes. Alternatively, we can use a known outcome, such as next period's return, to classify the text. An SVM represents each example in the training data as a vector in an n-dimensional space and proceeds to find an (n − 1)-dimensional hyperplane that separates the two classes. This strategy produces a linear classifier. Here, the parameter n represents the number of features considered.
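To make these mechanics concrete, the following minimal sketch trains a linear SVM on toy term vectors and applies it to a new document. It uses the open-source scikit-learn library and invented feature values purely for illustration; the study itself uses the SVM-Light implementation noted later in this appendix.

```python
# Minimal linear-SVM sketch: each document is a vector of term weights,
# and the classifier learns a separating hyperplane in that term space.
# Illustrative only -- the data and library choice are not from the paper.
from sklearn.svm import LinearSVC

# Toy training data: four "documents" in a 3-term space, labeled
# 1 (out-perform) or 0 (under-perform).
X_train = [[0.9, 0.1, 0.0],
           [0.8, 0.2, 0.1],
           [0.1, 0.9, 0.7],
           [0.0, 0.8, 0.9]]
y_train = [1, 1, 0, 0]

clf = LinearSVC()            # linear kernel, maximum-margin hyperplane
clf.fit(X_train, y_train)

# Apply the trained classifier to a new, unseen document vector.
new_doc = [[0.7, 0.3, 0.2]]
print(clf.predict(new_doc))            # predicted class, e.g. [1]
print(clf.decision_function(new_doc))  # signed distance from the hyperplane
```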
Thus, in text classification problems, n can be fairly large, consisting of every non-trivial word in the collection of texts being classified. Because many candidate hyperplanes are likely to exist, SVMs are additionally designed to achieve the best or maximum separation (also called the margin) between the two classes of the training data. That is, the distance between the separating hyperplane and the nearest training point in either class is maximized. The "trained" classifier may then be applied to new data, classifying each new text into one of the two classes.

Several key extensions have been made to the basic linear SVM. For instance, when a clean separation between the two classes of points is not possible, soft margins allow for some amount of classification error through the use of slack variables; the SVM then aims to maximize the margin while minimizing the error. In addition, researchers often employ one of several functions to transform the initial n-dimensional space. The classifier then looks for a separating hyperplane in this transformed space, a hyperplane that may be non-linear in the original space. This strategy may be useful in cases where linear classifiers are not sufficiently effective. Several such "kernel functions" to transform the initial space are available in implementations of SVM tools, including polynomial and sigmoid functions. In this paper we use the base linear SVM classifier.17

SVMs are designed mainly for solving binary or two-class classification problems. Since our research problem is to classify documents into three classes, we consider options to extend SVMs to multi-class problems. We perform one-against-rest classification for each class and combine the results to make a final decision; the computing time for this option is linear in the number of classes. That is, we produce three binary (one-against-rest) SVM models and assign to each document the class whose model generates the highest predictive score.

A.1. Document representation

In information retrieval and text-classification research, the most common approach to encode (or represent) a text document is to model it as a vector of weighted terms. There are generally three aspects to consider when constructing such a document model: (1) What are the terms in the vector? Are they all the words from the document set, or phrases, or some transformation of the words or phrases? (2) How many terms do we need to construct the document representation? Do we use all the defined terms, or a subset of the terms? And, if we want only a subset of the terms, how do we select this subset and why? (3) How do we construct a weighting scheme for the terms in the document vector, so as to best indicate the terms' relative informativeness and importance with respect to representing the document?

17 We build our classifiers using the SVM-Light implementation of Support Vector Machines with default parameter settings and a linear kernel function (see http://svmlight.joachims.org/).


In addressing the first question of defining the terms that represent a document, the most widely used "bag of words" approach starts with the complete vocabulary in the training corpus (the set of words used as "independent variables" in the model). Functional or connective words such as "a, hence, and, the" are considered stop words and are generally removed, since they are assumed to carry no information content. Stemming (e.g., treating connecting or connected as the same as connect) is sometimes performed to remove suffixes and map words to their morphological roots. Researchers have explored other, more complex textual representations (e.g., Peng and Schuurmans, 2003; Dumais et al., 1998; Apte et al., 1994). While each method has its strengths and weaknesses, more complex definitions have not been shown to be superior to the basic "bag of words" approach in solving classification problems. In this study, we use the stemmed words of the document corpus to construct the document vector representation.

Since the term space generated from our 10K report collection is of extremely high dimension, we need to reduce the term space and generate smaller vocabularies. The benefits of such a reduced term space include better generalization ability of the model, savings in computing time, and possibly better interpretation and understanding of the predictive features. Most term selection methods either compute statistical feature scores to select high-scoring terms or apply simpler feature selection algorithms from machine learning research (e.g., Yang and Liu, 1999; Larkey, 1998; Yang and Pedersen, 1997). We use the document frequency (DF) threshold method for reducing the term space. Relative to other methods, this method (which counts the number of 10K filings in our collection that use a given word) is efficient at eliminating less informative terms and reducing vocabulary size without sacrificing classification accuracy.

Researchers have used many ways to calculate term weights in document vectors. The term frequency * inverse document frequency, or TF*IDF, is the most commonly used weighting scheme for estimating the usefulness of a given term as a descriptor of a document. Its interpretation is that the best descriptive terms of a given document are those that occur very often in that document (high term frequency, or TF) but rarely in the other documents (high inverse document frequency, or IDF). In our previous study, we explored several constructions of TF*IDF weights. The best performer is the atn weight, formulated as:

\mathrm{atn} = \left( 0.5 + 0.5\,\frac{tf}{\max tf} \right) \ln\!\left( \frac{N}{n} \right),

where tf is the raw term frequency; max tf is the maximum term frequency for the term in the document collection; N is the total number of documents in the collection; and n is the number of documents containing the given term. Therefore, we report results only using atn as our weighting scheme for the terms in the document vector.
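To make the representation and weighting concrete, the sketch below builds a bag-of-words vocabulary, prunes it with a document-frequency threshold, and computes atn weights. It is an illustration only: the corpus, stop-word list and DF cutoff are invented, stemming is omitted for brevity, and max tf is taken within each document (a common convention), so it is not a reproduction of the paper's exact pipeline.

```python
# Sketch of the document model described above: bag of words, a document-
# frequency (DF) threshold to prune the vocabulary, and atn term weights
#   atn = (0.5 + 0.5 * tf / max_tf) * ln(N / n).
# Illustrative only; the corpus, stop words and threshold are invented.
import math
from collections import Counter

docs = ["revenues increased and margins improved",
        "revenues declined amid increased competition",
        "litigation risk increased substantially"]
stop_words = {"a", "and", "amid", "the"}

# Tokenize and drop stop words (a full pipeline would also stem each token).
tokenized = [[w for w in d.lower().split() if w not in stop_words] for d in docs]

# Document frequency: the number of documents containing each term.
N = len(tokenized)
df = Counter(term for doc in tokenized for term in set(doc))

# DF threshold: keep only terms that appear in at least two documents.
vocab = [t for t, count in df.items() if count >= 2]

def atn_vector(doc_terms):
    """atn weight for each retained vocabulary term in one document."""
    tf = Counter(doc_terms)
    max_tf = max(tf.values())   # largest term frequency within this document
    return {t: (0.5 + 0.5 * tf[t] / max_tf) * math.log(N / df[t])
            for t in vocab if t in tf}

print(atn_vector(tokenized[0]))   # e.g. {'revenues': 0.405..., 'increased': 0.0}
```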

References

Apte, C., Damerau, F.J., Weiss, S.M., 1994. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems 12 (3), 233–251.
Arya, A., Glover, J., Sunder, S., 1998. Earnings management and the revelation principle. Review of Accounting Studies, 7–34.
Arya, A., Glover, J., Sunder, S., 2003. Are unmanaged earnings always better for shareholders? Accounting Horizons 17, 111–116.
Association for Investment Management and Research (AIMR), 2000. AIMR Corporate Disclosure Survey: A Report to AIMR. Fleishman-Hillard Research, St. Louis, MO.
Asquith, P., Mikhail, M., Au, A., 2006. Information content of equity analyst reports. Journal of Financial Economics 75, 245–282.
Ball, R., Brown, P., 1968. An empirical evaluation of accounting income numbers. Journal of Accounting Research 6 (2), 159–178.
Barber, B., Lehavy, R., McNichols, M., Trueman, B., 2003. Reassessing the returns to analysts' stock recommendations. Financial Analysts Journal 59 (2), 16–18.
Barron, O., Kile, C., O'Keefe, T., 1999. MD&A quality as measured by the SEC and analysts' earnings forecasts. Contemporary Accounting Research 16 (Spring), 75–109.
Botosan, C., 1997. Disclosure level and the cost of equity capital. The Accounting Review 72, 323–349.
Botosan, C., Plumlee, M., 2000. Disclosure level and expected cost of equity capital: An examination of analysts' rankings of corporate disclosures and alternative methods for estimating the cost of capital. Working paper, The University of Utah.
Bryan, S.H., 1997. Incremental information content of required disclosures contained in management discussion and analysis. The Accounting Review 72 (2), 285–301.
Clarkson, P., Kao, J., Richardson, G., 1999. Evidence that management discussion and analysis (MD&A) is a part of a firm's overall disclosure package. Contemporary Accounting Research 16, 111–134.
Core, J.E., 2001. Firms' disclosures and their cost of capital: A discussion of a review of the empirical disclosure literature. Journal of Accounting and Economics 31, 441–456.
Dave, K., Lawrence, S., Pennock, D.M., 2003. Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In: Proceedings of the 12th International World Wide Web Conference (WWW 2003), ACM, pp. 519–528.
Davis, A., Piger, J., Sedor, L., 2006. Beyond the numbers: An analysis of optimistic and pessimistic language in earnings press releases. Working paper, Washington University in St. Louis.
Dumais, S.T., Platt, J., Heckerman, D., Sahami, M., 1998. Inductive learning algorithms and representations for text categorization. In: Proceedings of CIKM-98, Seventh ACM International Conference on Information and Knowledge Management, pp. 148–155.
Fama, E., French, K., 1992. The cross-section of expected stock returns. Journal of Finance 47, 427–465.
Fields, T., Lys, T., Vincent, L., 2001. Empirical research in accounting choice. Journal of Accounting and Economics 31 (1–3).
Firtel, K., 1999. Plain English: A reappraisal of the intended audience of disclosure under the Securities Act of 1933. Southern California Law Review 72, 851–889.
Hand, D., Mannila, H., Smyth, P., 2001. Principles of Data Mining. MIT Press, Cambridge, MA.
Healy, P., Palepu, K.G., 2001. Information asymmetry, corporate disclosure, and the capital markets: A review of the empirical disclosure literature. Journal of Accounting and Economics 31 (1–3), 405–440.
Henry, E., 2006a. Market reaction to verbal components of earnings press releases: Event study using a predictive algorithm. Journal of Emerging Technologies in Accounting 3, 1–19.
Henry, E., 2006b. Are investors influenced by how earnings releases are written? Working paper, University of Miami.
Hussainey, K., Schleicher, T., Walker, M., 2003. Undertaking large-scale disclosure studies when AIMR-FAF ratings are not available: The case for prices leading earnings. Accounting and Business Research 33 (4), 275–294.
Jegadeesh, N., Kim, J., Krische, S.D., Lee, C.M.C., 2004. Analyzing the analysts: When do recommendations add value? The Journal of Finance 59 (3), 1083–1124.
Jegadeesh, N., Titman, S., 1993. Returns to buying winners and selling losers: Implications for stock market efficiency. Journal of Finance 48, 65–91.
Kohut, G., Segars, A., 1992. The president's letter to stockholders: An examination of corporate communication strategy. Journal of Business Communication 29 (1), 7–21.
Kothari, S.P., 2001. Capital markets research in accounting. Journal of Accounting and Economics 31 (1–3).
Lang, M., Lundholm, R., 2000. Voluntary disclosure during equity offerings: Reducing information asymmetry or hyping the stock? Contemporary Accounting Research 17, 623–662.
Larkey, L.S., 1998. Automatic essay grading using text categorization techniques. In: Proceedings of ICML-98, 12th International Conference on Machine Learning, pp. 90–95.
Li, F., 2006. Annual report readability, current earnings and earnings persistence. Working paper, University of Michigan, Ann Arbor.
Mahinovs, A., Tiwari, A., 2007. Text classification method review. In: Roy, R., Baxter, D. (Eds.), Decision Engineering Report Series. Mimeo, Cranfield University, UK.
Pant, G., Srinivasan, P., 2005. Learning to crawl: Comparing classifier schemes. ACM Transactions on Information Systems 23 (4), 430–462.


Peng, F., Schuurmans, D., 2003. Combining naive Bayes and n-gram language models for text categorization. In: Proceedings of the 25th European Conference on Information Retrieval Research (ECIR03).
Pérez-Sancho, C., Iñesta, J.M., Calera-Rubio, J., 2005. A text categorization approach for music style recognition: Pattern recognition and image analysis. Lecture Notes in Computer Science 3523, 649–657.
Popa, S., Zeitouni, K., Gardarin, G., Nakache, D., Métais, E., 2007. Text categorization for multi-label documents and many categories. In: Proceedings of the 12th IEEE International Symposium on Computer-Based Medical Systems (CBMS'07), IEEE Computer Society, Washington, DC, pp. 421–426.
Reinganum, M., 1981. Misspecification of the capital asset pricing: Empirical anomalies based on earnings' yield and market values. Journal of Financial Economics 9, 19–46.
Rogers, K., Grant, J., 1997. Content analysis of information cited in reports of sell-side financial analysts. Journal of Financial Statement Analysis 3, 17–30.
Salton, G., Buckley, C., 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24 (5), 513–523.
Sebastiani, F., 2002. Machine learning in automated text categorization. ACM Computing Surveys 34 (1), 1–47.
Sebastiani, F., 2005. Text categorization. In: Zanasi, A. (Ed.), Text Mining and its Applications to Intelligence, CRM and Knowledge Management. WIT Press, Southampton, UK, pp. 109–129.
Singhal, A., Buckley, C., Mitra, M., 1996. Pivoted document length normalization. In: Proceedings of the 1996 ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 21–29.
Smith, M., Taffler, R.J., 2000. The chairman's statement: A content analysis of discretionary narrative disclosures. Accounting, Auditing & Accountability Journal 13 (5), 624–646.
Subramanian, R., Insley, R.G., Blackwell, R.D., 1993. Performance and readability: A comparison of annual reports of profitable and unprofitable corporations. Journal of Business Communication 30, 50–61.
Tetlock, P., Saar-Tsechansky, M., Macskassy, S., 2006. More than words: Quantifying language to measure firms' fundamentals. Working paper, University of Texas at Austin.
Thompson, P., 2001. Automatic categorization of case law. In: Proceedings of the 8th International Conference on Artificial Intelligence and Law, ACM, pp. 70–71.
Witten, I., Frank, E., 2000. Data Mining. Morgan Kaufmann Publishers, San Francisco.
Yang, Y., Liu, X., 1999. A re-examination of text categorization methods. In: Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, pp. 42–49.
Yang, Y., Pedersen, J.O., 1997. A comparative study on feature selection in text categorization. In: Proceedings of the 14th International Conference on Machine Learning, pp. 412–420.