Advances in Accounting, incorporating Advances in International Accounting 29 (2013) 205–217
The effects of task outcome feedback and broad domain evaluation experience on the use of unique scorecard measures☆

Kip R. Krumwiede a,⁎, Monte R. Swain b, Todd A. Thornock c, Dennis L. Eggett d

a Robins School of Business, University of Richmond, Richmond, VA 23173, United States
b School of Accountancy, Brigham Young University, United States
c College of Business, Iowa State University, United States
d Department of Statistics, Brigham Young University, United States

☆ Data availability: Data and case materials are available and may be obtained by contacting the first author.
⁎ Corresponding author. Tel.: +1 804 287 1835.
E-mail addresses: [email protected] (K.R. Krumwiede), [email protected] (M.R. Swain), [email protected] (T.A. Thornock), [email protected] (D.L. Eggett).
http://dx.doi.org/10.1016/j.adiac.2013.05.002
Keywords: Performance evaluation; Outcome feedback; Balanced scorecard; Domain experience; Multiperiod studies; Experiment
Abstract

Prior research has found that division evaluators using balanced scorecards in a performance evaluation process relied almost solely on common measures and virtually ignored unique measures. Other studies have found certain situations in which measures that are unique to a particular division are not completely ignored. However, no study has addressed whether outcome feedback over a period of time can motivate evaluators to rely more on unique measures that are predictive of future financial results. Our study involving executives with varying levels of prior evaluation experience examines two factors that may lead to increased use of unique measures: task outcome feedback and broad domain evaluation experience. Results provide evidence of increased reliance on unique measures after multiple periods as evaluators receive outcome feedback showing the predictive value of these unique measures. Further, results indicate that unique measures are used more over time when the prior evaluation experience of the participants is relatively high.

© 2013 Elsevier Ltd. All rights reserved.
1. Introduction

Lipe and Salterio (2000) test the effects of common and unique measures on evaluations of two divisions of a clothing firm. Although participants were not asked to compare or rank the divisions, the authors find that evaluators rated the divisions almost solely on the measures common across the divisions and state that "performance on unique measures has no effect on the evaluation judgments" (p. 284). These findings continue to be troubling for proponents of the balanced scorecard approach who contend that evaluations of performance should include unique measures derived from an organization's own vision and strategy (Kaplan & Norton, 1996). However, improved performance evaluation is only one goal of the balanced scorecard (BSC) framework. The real impact of the BSC approach is purported to be strategic alignment and focus within an organization (Kaplan & Norton, 2001, 7–17). This issue is critical to BSC success. If unique performance measures are developed that capture a division's strategic focus, but the organization's managers are evaluated solely on measures that are common to all divisions throughout the corporation, then the managers will focus their efforts on excelling in those common measures (Hopwood, 1972; Kaplan & Norton, 1992).
Further, evaluators who ignore the unique measures, which are often leading measures of future performance (Lipe & Salterio, 2000), may not evaluate the division fairly or optimally. Various studies since the Lipe and Salterio (2000) study have found certain situations in which measures that are unique to a particular division are not completely ignored (Banker, Chang, & Pizzini, 2004; Dilla & Steinbart, 2005; Humphreys & Trotman, 2011; Libby, Salterio, & Webb, 2004; Roberts, Albright, & Hibbets, 2004). However, no study has addressed whether outcome feedback over a period of time can motivate evaluators to increase their reliance on unique measures that are predictive of future financial results. Given that unique performance data are a crucial characteristic of the BSC model, and that evaluations in actual organizations are complex and iterative processes that require learning by all participants, outcomes showing that unique measures are predictive of future performance should increase the influence of unique measures. The first purpose of this study is to test whether multiperiod tasks with outcome feedback involving professionals who have prior evaluation experience will lead to increased use of unique scorecard measures. Assuming that BSC proponents are correct in asserting the value of unique performance measures that are strategically aligned (Humphreys & Trotman, 2011), prior experience and improved knowledge structures should then support successful integration of unique measures in the process of evaluating managers. Hence, decision makers who already possess general domain knowledge and specific knowledge structures for performance evaluation tasks may have increased ability to recognize and use relevant information, such as the results of strategically linked unique performance measures provided in a BSC environment (Vera-Munoz, Kinney, & Bonner, 2001).
The second purpose of this study is to test whether evaluators with higher levels of prior evaluation experience will increase their use of unique measures more than those with less evaluation experience. Better understanding the effect of outcome feedback in multiperiod tasks and the effect of evaluation experience enlarges the perspective on the results reported by Lipe and Salterio.

Using Lipe and Salterio's (2000) experimental task with additional evaluation periods showing the predictive value of unique performance measures, we ask professionals with varying levels of prior performance evaluation experience to evaluate two divisions over four periods. The results provide evidence of increased reliance on unique performance measures after several periods of outcome feedback showing that these measures are indicative of future performance. Further, the unique measures are weighted more strongly over time when the participant's prior experience in performance evaluation tasks is relatively high.

The rest of this paper will proceed as follows. First, relevant behavioral and performance measurement literature supporting our hypotheses is examined. Next, the experimental research method is described and the results and analysis are presented. Finally, the study's contributions and limitations are discussed.

2. Motivation and hypotheses

2.1. Common vs. unique measures

Perhaps due to early successes by companies like DuPont and General Motors, conglomerate corporations have traditionally managed their diverse divisions using common financial benchmarks such as return on investment.1 However, more modern organizations have experienced challenges with using common financial measures to evaluate multiple divisions. These problems include stifled risk taking, unfair comparisons of units with unequal potential, and shortsighted decision making (Johnson & Kaplan, 1991). Partly in response to difficulties with using common financial metrics, many organizations are trying to better "balance" their financial performance measures by including nonfinancial unique measures of performance in their management process. A survey of 382 companies in 44 countries found that over 50% of the respondents use a BSC approach to performance measure tracking (Lawson, Stratton, & Hatch, 2006). A BSC approach strives to overcome problems inherent in traditional performance measurement sets with a strict financial focus (e.g., ROI) by combining financial measures of past performance with nonfinancial measures that communicate current efforts to pursue unique organization strategies, which are then expected to lead to improved financial performance in the future (Kaplan & Norton, 1996, 8). Evaluation processes that continue to compare diverse divisions using only common measures (that typically focus on financial performance) are contrary to the BSC approach.

When evaluators compare divisions with different products and strategies, performance measures that are unique to a division's individual products and strategies should be more informative to evaluators assessing that division's specific performance compared to measures that are common across all divisions. However, Lipe and Salterio (2000) find the opposite result. In their study, participants evaluated two divisions based on each division's scorecard.
Although each division's scorecard had an equal mix of common and unique measures, and even though the evaluators were not asked to compare or rank the divisions, the evaluators relied almost solely on measures common to both divisions in making their decisions. These results
1 Further information on the history of the DuPont ROI formula is available in Davis (1950, 7); reprinted in Johnson and Kaplan (1991, 85).
were replicated by Banker et al. (2004).2 In addition, Ittner, Larcker, and Meyer (2003) use archival data from a financial services firm and find evidence that BSC evaluators may overemphasize common financial performance measures and ignore other measures that were predictive of future financial results. Lipe and Salterio (2000) motivate their findings largely on the work of Slovic and MacPhillamy (1974) in which participants predicted which of two high school students would subsequently have the higher college freshman GPA. Although the unique measures were not ignored, the common measures consistently had more influence on the evaluators over multiple experiments. Focusing on common dimensions of alternatives is one example of a “simplification strategy” or a “heuristic” (Payne, Bettman, & Johnson, 1993). Since the Lipe and Salterio (2000) study, several studies have found situations where unique measures are not ignored in BSC division evaluations. Libby et al. (2004) find that requiring managers to justify their evaluations to superiors and providing third-party assurance reports to improve perceived quality of the measures leads to increased use of the unique measures. Roberts et al. (2004) find that unique measures have greater impact on the manager's overall evaluation when the evaluation process is disaggregated by first asking participants to rate performance on each of the 16 BSC measures. Banker et al. (2004), Kaplan and Wisner (2009), and Humphreys and Trotman (2011) all support the idea that stronger communication and articulation of the business strategy and strategic importance of the unique measures can lead to greater impact on evaluations by unique measures. Related studies find evidence of factors affecting the reliance on financial versus nonfinancial measures in BSC evaluations. These factors include use of BSC categories versus unformatted scorecards (Cardinaels & van Veen-Dirks, 2010; Lipe & Salterio, 2002), use of performance markers (i.e., +, −, =) (Cardinaels & van Veen-Dirks, 2010), the ambiguity tolerance of individual evaluators (Liedtka, Church, & Ray, 2008), and the focus of the evaluation (i.e., individual versus division performance) (Krumwiede, Eaton, Swain, & Eggett, 2008). Lau (2011) finds that nonfinancial measures affect performance through role clarity more than financial measures. Lipe and Salterio (2000) suggest some limitations to their study, such as participants' lack of involvement in choosing the measures and relatively low prior business experience. Dilla and Steinbart (2005) conducted essentially the same experiment as Lipe and Salterio but first covered the BSC topic in an undergraduate class, asked students to build a BSC for at least two organizations in class exercises, and tested students on the topic. After this task experience, they found that the students used both the common and unique measures in their BSC evaluations, although they still weighted the common measures more heavily. Commenting on Lipe and Salterio's results, Dilla and Steinbart (2005, 45) state, “decision makers initially resort to simplifying strategies when using the BSC. Decision makers who are more familiar with the BSC are expected to behave differently.” Cardinaels and van Veen-Dirks (2010) call for comparing and contrasting the BSC evaluations of more experienced managers, who have more developed knowledge of measurement properties and causal relationships, with the BSC evaluations of students. 
Another limitation of the Lipe and Salterio (2000) study is the single evaluative period. Companies establish balanced scorecards with the intent to use them over time in the organization (Kaplan & Norton, 1992, 1996). As such, expanding the setting of BSC evaluations to include multiple periods is critical to gain a more complete understanding of how decision makers actually use this evaluation tool.

2 Banker et al. (2004) were then able to attenuate the emphasis on common measures by educating their study participants on the linkage between the organization strategy and the unique measures.
2.2. Outcome feedback

Evaluators using balanced scorecards over multiple periods inevitably receive information about past performance outcomes that may affect future evaluation decisions. We refer to this information that becomes available over time as outcome feedback on the performance of the organization. Outcome feedback is pertinent to managerial accounting settings as it is often more visible, less ambiguous, and more readily available to managers and management accountants than is process feedback (Einhorn & Hogarth, 1981).3 Also, outcome feedback is often the only type of feedback available to individuals due to the nature of the task (e.g., in BSC evaluation settings) or the time constraints inherent within the task (Bonner & Walker, 1994).

Payne et al. (1993, 172) suggest that Slovic and MacPhillamy's (1974) finding that evaluators focus on information common among alternatives is an example of individuals using constructive processes (i.e., a heuristic using rules or beliefs stored in the decision maker's memory). Further, they suggest that as a decision maker's experience with a particular decision increases (e.g., a setting with multiple trials or periods that provides outcome feedback), the use of more exhaustive heuristics is more likely. These more exhaustive heuristics may involve the use of additional information available to the decision maker, such as unique information that is specific to alternatives.

Prior research on the effect of outcome feedback over multiple periods strongly supports the notion that feedback positively affects evaluation performance over time. This effect is, in part, attributed to the informational qualities of the feedback that allow evaluators to compare current performance with some standard (Annett, 1969; Luckett & Eggleton, 1991; Schmidt, Young, Swinnen, & Shapiro, 1989). In more recent research, Sprinkle (2000) finds experimental evidence that participants in a multiperiod, profit-maximizing decision task spend more processing time (e.g., accessing more outcome feedback) and perform better, but only after several periods of the experiment. Sprinkle concludes that increased processing time and better evaluation performance are a function of the fact that outcome feedback useful for belief revision was provided between decisions. Also, Hodge, Hopkins, and Wood (2010) find that the provision of accuracy-related information (i.e., feedback) improved nonprofessional investors' forecast performance over multiple trials.

These studies suggest possible implications for BSC decision making in evaluation contexts. Assuming that decision makers initially understand a theory or model relevant to the decision task at hand (Bonner & Walker, 1994), more effective use of division performance data may then occur after several evaluation periods with belief-revising outcome feedback provided between decisions. This is consistent with research showing that more frequent outcome feedback strengthens the causal link between outcome data and performance (Frederickson, Peffer, & Pratt, 1999; Ilgen, Fisher, & Taylor, 1979). One might argue that if this causal link is known, communicating it to evaluators would eliminate the bias for common measures. Humphreys and Trotman (2011) show that when all measures are linked in a strategy map and strategic information is provided,
3 Bonner and Walker (1994) define two main types of feedback relevant to decision making. Process feedback, which explains why an outcome has occurred, is distinguished from outcome feedback, which solely provides information about the outcome. While process feedback is often a preferred system for development of procedural knowledge, we focus this study on outcome feedback. Outcome feedback is generally defined as feedback on the outcome of a judgment providing information on the ‘correctness’ of a decision or action (Kessler & Ashton, 1981; Leung & Trotman, 2005, 2008; Luckett & Eggleton, 1991). While feedback in our study does not speak directly to the “correctness” of the participants' judgments, it does provide a benchmark for participants to determine indirectly the “correctness” of their judgment given that actual performance information is made available after the evaluation.
common measures bias can be reduced or even eliminated. Their experiments also show that when not all measures are linked in a strategy map, or if strategic information is not provided, common measures bias does persist. In practice, BSC performance measures are not always strategically linked (e.g., Ittner, Larcker, & Randall, 2003). Moreover, knowing that the link between unique and common measures exists cannot fully eliminate the common measure bias since other characteristics of the relation between unique and common measures remain uncertain, e.g., timing (Farrell, Kadous, & Towry, 2008), level of linearity (Farrell, Luft, & Shields, 2007), and level of magnitude (Luft & Shields, 2001). In addition, the evaluator may not only need to buy into the strategy but also need to see evidence of its efficacy before trusting measures that are unique to the division. Hence, in this study we explore the ability of outcome feedback based on actual data to reduce managers' uncertainty with respect to these strategically linked measures.

Our assertion is that individuals draw causal inferences from the information available, such as outcome feedback (or knowledge of results). Outcome feedback facilitates the evaluator making causal connections between performance measures and future performance (in order to more appropriately evaluate the manager). When unique measures are first used, there is little or no context for evaluating performance except using targets or goals. Hence, in the context of the current study, evaluators may initially evaluate performance based on benchmarking common measures against those of other divisions. However, it is expected that as evaluators become more familiar with the unique measures over time, they will infer that such measures provide useful information on future financial success. These evaluators may then expand their use of unique measures (assuming that the unique measures do in fact indicate future performance), which is essentially a more exhaustive heuristic. This is consistent with the notion that outcome feedback can serve a belief-revising role in decision making (Dearman & Shields, 2001; Gupta & King, 1997). Therefore, when unique measures are predictive of future division performance, we expect that outcome feedback over multiple evaluation periods will aid evaluators' increased use of unique measures in their evaluations. This leads to the following hypothesis:

H1. After multiple evaluation periods in which outcome feedback showing unique measures being predictive of future performance is provided for each period, evaluators will increase their use of unique performance measures in their evaluations.

The BSC model focuses on nonfinancial measures leading to improved financial performance in the future (Kaplan & Norton, 1996, 8). Like previous studies in this literature stream (e.g., Lipe & Salterio, 2000), the focus of this study is the reliance on unique measures (whether financial or nonfinancial) versus common measures (whether financial or nonfinancial) in evaluations of division presidents. The BSC focus on nonfinancial measures leading financial measures assumes a long-term relationship, and this relationship among leading and lagging measures involves all the nonfinancial and financial measures. Two important conditions must generally be present in order to observe an increased reliance on unique measures over time.
First, the unique measures should actually be leading indicators of future outcome performance, as suggested in the literature (Kaplan & Norton, 1996, 2001; Lipe & Salterio, 2000). Second, there should be little or no change over time in the selection of unique BSC measures to be used in the performance evaluation process, although actual organizations may add or delete measures from their original scorecards (Malina & Selto, 2001, 76; Kaplan & Norton, 2001, 59). However, even with these conditions present, prior research suggests that the effect of outcome feedback may be moderated by the level of experience of the evaluator (Bonner & Walker, 1994). We explore this possibility next.
2.3. Evaluation experience

Experienced individuals have an enhanced capacity to handle relevant information and to integrate acquired concepts and decision rules into their pre-existing knowledge structures, all of which results in improved procedural knowledge (Anderson, Kline Greeno, & Neves, 1981). However, results of prior psychological research strongly suggest that domain experience is an insufficient predictor of improved decision performance. Bonner and Lewis (1990) point out that "expert-like" decision performance is a function of experience, knowledge, and ability. Subsequent research suggests that the impact of experience on decision performance is not direct; rather, experience directly impacts knowledge, which then affects decision performance (Dearman & Shields, 2001; Libby, 1995; Vera-Munoz et al., 2001).

Prior research has shown that broad domain knowledge, when coupled with understanding rules (that establish a causal theory within the specialized domain) and outcome feedback on decisions, leads to improvements in procedural knowledge (Bonner & Walker, 1994). This increased procedural knowledge allows decision makers with more broad domain experience in evaluating performance to employ more sophisticated knowledge structures in evaluating decision performance as compared to decision makers with less broad domain experience. Evaluators using these more sophisticated knowledge structures for evaluating performance are more likely to recognize the predictive value of unique measures when provided with outcome feedback highlighting the relation between the unique measures and future performance, and thus increase their use of unique measures in evaluating performance. They are also able to filter out irrelevant information more clearly, allowing for quicker learning from the outcome feedback (which indicates the causal links). In sum, we propose that broad domain experience across various performance evaluation contexts, when coupled with outcome feedback between task events, will lead to an even more complete use of unique performance data as compared to the provision of outcome feedback without evaluation experience. The following hypothesis is proposed:

H2. After multiple evaluation periods in which outcome feedback showing unique measures being predictive of future performance is provided for each period, evaluators with higher levels of broad-based performance evaluation experience will increase their use of unique performance measures in their evaluations more than evaluators with lower levels of broad-based performance evaluation experience.

3. Method

We used Lipe and Salterio's (2000) experimental task with some additions necessary to test our hypotheses and control for other variables. As in the Lipe and Salterio (2000) study, participants take the role of a senior executive of a firm specializing in women's apparel (called AXT Corp. in our study). They read essentially the same case information as in Lipe and Salterio (2000) describing the mission, organizational strategy, and balanced scorecard implementation of AXT Corp. Provided next are the specific strategies, goals, and scorecards for two of the firm's largest divisions. The two divisions are RadWear, a retailer specializing in clothing for the "urban teenager," and WorkWear, which sells business uniforms through direct sales to business clients. The experiment used the same scorecard measures (including common and unique measures) and data for Year 1 as in the Lipe and Salterio (2000) study.
Participants evaluated the performance of each division manager from 0 (Reassign) to 100 (Excellent). Although this study is similar to Lipe and Salterio's (2000) in many ways, there are four main differences.
First, we used an Internet-distributed experiment rather than a pencil-and-paper method. There are several potential advantages of this method, including access to a geographically dispersed participant pool and potentially higher construct, external, and convergent validity (Alexander, Blay, & Hurtt, 2006; Bryant, Hunton, & Stone, 2004). Of course, there is also a lack of control inherent in an Internet-based experiment. Nevertheless, despite the potential benefits, there have been few Internet-based experiments in behavioral accounting research (Alexander et al., 2006), so this study may be useful for other researchers considering similar experiment designs.

Second, we added three more evaluation periods (four total). After participants made their evaluations for Year 1 (i.e., a replication of Lipe and Salterio), they received outcome feedback in the form of actual results for each of the scorecard measures for Year 2. After participants evaluated each division manager for Year 2, they next received actual results for Year 3, and so on through Year 4. The actual results (outcome feedback) provide clear data patterns demonstrating that the unique measures are leading indicators of future performance on the common measures. As suggested by H1, we expect that after several evaluation periods, participants will learn to observe the strong relationship between unique measures' outcomes and the subsequent period's common measures' outcomes and begin to rely more on unique measures.

Third, we created an additional division scenario to explore the possibility that division similarity might impact participants' use of unique performance measures. Prior experimental research on decision making suggests that as the similarity of alternatives increases, decision makers tend to increase the number of attributes they consider (Biggs, Bedard, Gaber, & Linsmeier, 1985; Bockenholt, Albert, Aschenbrenner, & Schmalhofer, 1991; Payne et al., 1993, 55). We explored this possibility by replacing WorkWear with a new division, RealWear, which is more similar strategically and operationally to RadWear than WorkWear.4 Subsequent analysis revealed no statistically significant difference (p > 0.10) in decision outcomes due to similarity of divisions.5 Therefore, this factor is collapsed in testing of the hypotheses. Table 1 provides the measures used for each division.

Finally, to address the possibility that participants might have been unclear about how unique performance measures relate to division strategy, we randomly gave half the participants the Lipe and Salterio (2000) version of strategy and performance measures and the other half a more articulated version. The more articulated version adds a vision statement for each division and strategic objectives for each of the four scorecard perspectives.6

4 RealWear is a clothing division for women in their twenties and was designed to be more similar to the market and strategy of RadWear than that of WorkWear. RadWear has set an aggressive strategy of growth through opening new stores and increasing brands that cater to low-mobility teenage girls. WorkWear focuses on growth by adding a few basic uniforms for men and also offering a catalog to facilitate repeat orders. Both RadWear and WorkWear sell clothing, but their customer segments, types of clothing, and sales methods differ substantially. On the other hand, RealWear, like RadWear, has an aggressive strategy of opening new stores, offering fashionable brands, and being the "clothing store of choice" for women in its demographic target group. To allow consistent comparison between RealWear and RadWear, both divisions' scorecards have the same measures in common with RadWear. In reality, as divisions become more similar, we might expect them to have more measures in common.
5 Using a mixed models analysis (see Results section and Footnote 14), when the variable DISSIM (1 = RadWear/RealWear; 2 = RadWear/WorkWear) was included in the model, we found no statistically significant difference (p > .40) in evaluations between the two combinations (both main and interaction effects with common and unique measures).
6 Lipe and Salterio (2000) provided a three-paragraph summary for each division's background and strategy. The strategic objectives were not particularly well articulated, presumably so as not to unduly influence the evaluator's reliance on particular measures. Our more articulated version of the division strategy adds a vision statement and strategic objectives for each of the four scorecard perspectives that link to the measures for each division.
Table 1
Common and unique measures for each division.

Financial
  Common measures: Return on sales; Sales growth
  RadWear unique measures: New store sales; Market share relative to retail space
  RealWear (a) unique measures: Revenue per employee; Receivables turnover
  WorkWear (a) unique measures: Revenues per sales visit; Catalog profits

Customer-related
  Common measures: Repeat sales; Customer satisfaction rating
  RadWear unique measures: Mystery shopper program rating; Returns by customers as a percent of sales
  RealWear (a) unique measures: Number of customer complaints; Average store quality rating
  WorkWear (a) unique measures: Captured customers; Referrals

Internal processes
  Common measures: Percent of returns to suppliers; Average markdowns
  RadWear unique measures: Average percent of major brand names per store; Percent of sales from new items
  RealWear (a) unique measures: Percent of sales from in-house designs; Inventory turns
  WorkWear (a) unique measures: Percent of orders filled within one week; Catalog orders with errors

Learning and growth
  Common measures: Hours of employee training per employee; Number of employee suggestions per employee
  RadWear unique measures: Average tenure of sales personnel; Percent of stores computerizing key functions
  RealWear (a) unique measures: Number of skill sets per employee; Employee turnover per year
  WorkWear (a) unique measures: Percent of sales managers with MBA degrees; Percent of order clerks with data base mgt certification
a Evaluation decisions on RealWear or WorkWear are compared to evaluations of RadWear (RealWear was designed to be more strategically and operationally similar to RadWear as compared to WorkWear). After determining that comparative evaluations of RealWear to RadWear were not statistically distinguishable from comparative evaluations of WorkWear to RadWear, the “Division Similarity” factor was removed from hypothesis testing design in order to allow consistent comparison of results to Lipe and Salterio (2000).
Balanced scorecard proponents argue that the strategic objectives should be well articulated and communicated for each of the four perspectives (Kaplan & Norton, 1996, 10), and this recommendation is supported by prior studies (Banker et al., 2004; Kaplan & Wisner, 2009). As with division similarity, we found no statistically significant difference (p > 0.10) in decision outcomes between the less articulated and more articulated versions.7 Therefore, this factor is also collapsed in the analysis of the hypotheses.
3.1. Experimental design and procedure
Participants were randomly assigned one of 16 (2 × 2 × 4) possible versions of the program: two division combinations (RadWear/ WorkWear or RadWear/RealWear), two levels of strategy articulation (articulated and non-articulated), and four potential performance sets (explained below). Participants evaluate the same two divisions for Years 1 through 4. For each of the four years, the data are manipulated similar to Lipe and Salterio (2000) in a 2 × 2 × 2 design: two divisions (RadWear/WorkWear or RadWear/RealWear), two measurement groups (common and unique), and two levels for each measurement group (“more favorable” and “less favorable”). Table 2, Panel A, illustrates how the data are manipulated across divisions for Year 1 and Year 2. As seen in Table 2, Panel B, there are four different performance sets. In each of the performance sets, the unique measures in each period are designed to “lead” common measure performance in the subsequent year. For example, in Performance Set 1, the sum of actual percent better than target for the eight unique measures for Division 1 in Year 1 is 85.0% (“more favorable”), which closely relates to the sum of actual percent better than target for the eight common measures in Year 2 (85.2%). As in Lipe and Salterio (2000), the actual results for Division 2 are in contrast to those of Division 1. For Division 2 in Performance Set 1, the sum of actual percent better than target for the unique measures in Year 1 is 52.0% (“less favorable”), which closely relates to the sum for the eight common measures in Year 2 (52.0%).8 Thus, due to the lead–lag relationships between unique and common measures, the results in Year 1 dictate the results for the remaining years and only four different performance sets are needed to test the hypotheses (2 divisions × 2 performance levels).
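The lead–lag structure of the performance sets can be summarized in a short sketch. The following Python snippet is our illustration only (the actual instrument was a Visual Basic 6 program, described below); the roughly 85% and 52% levels are the sums of actual percent better than target shown in Table 2.

    # Illustrative sketch only (not the authors' Visual Basic 6 instrument).
    # It reproduces the pattern in Table 2, Panel B: a division's unique-measure
    # level in year t becomes its common-measure level in year t + 1, and
    # Division 2 always takes the opposite level of Division 1.
    FAV, UNFAV = 85.0, 52.0  # approximate sums of "actual percent better than target"

    def build_performance_set(d1_common_year1, d1_unique_by_year):
        """Return the (C1, U1, C2, U2) sums for Years 1-4 of one performance set."""
        rows = []
        c1 = d1_common_year1
        for year, u1 in enumerate(d1_unique_by_year, start=1):
            c2 = FAV if c1 == UNFAV else UNFAV  # Division 2 mirrors Division 1
            u2 = FAV if u1 == UNFAV else UNFAV
            rows.append({"year": year, "C1": c1, "U1": u1, "C2": c2, "U2": u2})
            c1 = u1  # lead-lag rule: this year's unique level leads next year's common level
        return rows

    # Performance Set 1: Division 1 starts "more favorable" on both measure groups.
    for row in build_performance_set(FAV, [FAV, UNFAV, UNFAV, FAV]):
        print(row)
    # Year 1: C1 = 85, U1 = 85, C2 = 52, U2 = 52; Year 2: C1 = 85, U1 = 52, ...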
In this Internet-based experiment, a computer program written in Visual Basic 6 served as the experimental instrument and was set up as an electronic balanced scorecard and made available to participating managers and decision makers across a variety of organizations. This approach was designed to be representative of how actual scorecard evaluation systems are presented in practice (Silk, 1998). In the main menu of the computer program, participants were given access to background material providing some company history and explaining the balanced scorecard approach and reasons for using it. Next, each participant received information about each division's background, strategy, and scorecard measures. Then for each measure the participant was provided with the Year 1 target, the Year 1 actual performance result, and the Year 1 actual percent better than target. After completing the evaluation decision on each division manager for Year 1, the participant moved on to the evaluation decision for Year 2. In addition to the Year 2 target, the Year 2 actual performance result, and the Year 2 actual percent better than target for each measure, the participant was also given the Year 1 actual percent better than target as a comparison. Years 3 and 4 proceeded in a similar manner except that the results for Year 4 did not include the Year 1 actual percent better than target. Participants were not allowed to change their evaluations for a prior year, although they could access any of the other previous screens. By the time the participant reached the Year 4 evaluation decision, they had a clear view of the evolving performance pattern for each performance measure, presented in a format similar to the outline below.

Measure       Year 2 actual    Year 3 actual    Year 4 target     Year 4 actual     Year 4 actual
              % better         % better         performance       performance       % better
Measure X     X2%              X3%              X4t               X4a               X4%

7 Using a mixed models analysis (see Results section and Footnote 14), the variable NONARTIC (0 = articulated strategy; 1 = less articulated strategy) had no statistically significant impact (p > .10) on evaluations between the two versions (both main and interaction effects with common and unique measures). Of course, with the rather small sample size, we cannot rule out low statistical power as an explanation.
8 The scales used in this study to establish the two levels of actual percent better than target are the same as those used in Lipe and Salterio (2000). While the types of unique measures varied by division, the sums of measurement performance (i.e., the actual percent better than target) were constrained to be the same for all three divisions.
Table 2
Manipulation of performance on common and unique measures.

Panel A: Example of Performance Set 1, Years 1 and 2: RadWear Division.
Each row lists the metric type and goal in parentheses, followed by Year 1 target / actual / percent better than target and Year 2 target / actual / percent better than target.

Financial
  1. Return on Sales (Common, Exceed):                        24.0% / 26.0% / 8.33%      24.0% / 26.4% / 10.20%
  2. New store sales (Unique, Exceed):                        30% / 32.5% / 8.33%        30% / 31.3% / 4.33%
  3. Sales Growth (Common, Exceed):                           35.0% / 38.0% / 8.57%      35.0% / 38.1% / 8.86%
  4. Market share relative to retail space (Unique, Exceed):  $80 / $86.85 / 8.56%       $80 / $84.60 / 5.75%
Customer-related
  1. Mystery shopper program rating (Unique, Exceed):         85 / 96.0 / 12.94%         89.25 / 96.5 / 8.12%
  2. Repeat sales (Common, Exceed):                           35.0% / 34.0% / 13.33%     31.5% / 34.8% / 10.32%
  3. Returns by customers as % of sales (Unique, Below):      12% / 11.6% / 3.33%        12% / 11.7% / 2.92%
  4. Customer satisfaction rating (Common, Exceed):           92.0% / 95.0% / 3.26%      92.0% / 95.2% / 3.46%
Internal processes
  1. Returns to suppliers (Common, Below):                    6.0% / 5.0% / 16.67%       5.7% / 4.9% / 13.79%
  2. Average major brand names/store (Unique, Exceed):        32 / 37.0 / 15.63%         32 / 36.0 / 12.50%
  3. Average markdowns (Common, Below):                       16% / 13.5% / 15.63%       16% / 14.1% / 12.00%
  4. Sales from new market leaders (Unique, Exceed):          25.0% / 29.0% / 16.00%     26.3% / 28.4% / 8.00%
Learning and growth
  1. Average tenure of sales personnel (Unique, Exceed):      1.4 / 1.6 / 14.29%         1.4 / 1.5 / 7.14%
  2. Hours of employee training/employee (Common, Exceed):    15 / 17.0 / 13.33%         15 / 17.2 / 14.67%
  3. Stores computerizing (Unique, Exceed):                   85.0% / 90.0% / 5.88%      85.0% / 87.7% / 3.18%
  4. Employee suggestions/employee (Common, Exceed):          3.3 / 3.5 / 6.06%          3.3 / 3.7 / 11.87%

Sum total (percent better than target):
  Common measures:   Year 1: + 85.18%     Year 2: + 85.17%
  Unique measures:   Year 1: + 84.96%     Year 2: − 51.94%

Panel B: Manipulations for each performance set.

        Year 1                           Year 2                           Year 3                           Year 4
Set     C1      U1      C2      U2       C1      U1      C2      U2       C1      U1      C2      U2       C1      U1      C2      U2
1       85.2%   85.08%  52.0%   52.0%    85.2%   51.9%   52.0%   85.0%    52.0%   51.6%   85.2%   85.0%    52.0%   85.0%   85.2%   51.9%
2       52.0%   85.0%   85.2%   52.0%    85.2%   85.0%   52.0%   51.9%    85.2%   51.6%   52.0%   85.0%    52.0%   51.2%   85.2%   85.0%
3       85.2%   51.6%   52.0%   85.1%    52.0%   85.0%   85.2%   51.9%    85.2%   85.0%   52.0%   51.9%    85.2%   51.2%   52.0%   85.0%
4       52.0%   51.6%   85.2%   85.1%    52.0%   51.9%   85.2%   85.0%    52.0%   85.0%   85.2%   51.9%    85.2%   85.0%   52.0%   51.9%

Where C1 (C2) = sum of actual percent better than target for performance on common measures for Division 1 (Division 2), and U1 (U2) = sum of actual percent better than target for performance on unique measures for Division 1 (Division 2). In each performance set, the unique measures in each period are designed to "lead" common measure performance in the subsequent year. For example, in Performance Set 1, the sum of actual percent better than target for the eight unique measures for Division 1 in Year 1 is 85.0%, which closely relates to the sum of actual percent better than target for the eight common measures in Year 2 (85.2%).
The close lead–lag relationship between periods for the sums of actual percent better than target for unique and common measures serves as the outcome feedback manipulation that tests the hypotheses in this study. However, it should be noted that this close relationship may not be realistic given the lack of noise or delay in the lead–lag relation between unique and common measures typical of more realistic settings (Maiga & Jacobs, 2006; Said, HassabElnaby, & Wier, 2003). This is an example of the "simplifying assumptions" often required to test theories (Lipe & Salterio, 2000). Within each of the two divisions, the two measurement groups, common and unique, are each set at either "more favorable" or "less favorable" levels, and each participant evaluates two divisions. As in Lipe and Salterio (2000), actual outcomes are all better than target. The "actual percent better than target" is provided to make all measures comparable on the same scale (Slovic, Griffin, & Tversky, 1990). Both common and unique groups of performance measures have the same sum of actual percent better than target for the less favorable (52.0%) and more favorable (85.0%) manipulations.9

9 Although it is probably not realistic for all outcomes to be above target, it eliminates unexpected reactions as a result of not meeting targets (Kahneman & Tversky, 1979). The fact that both divisions exceed targets suggests that the economy is good, at least for their markets. Because it would also not be realistic to leave the targets unaltered during such a time period, we increase the targets for certain measures during the four time periods. The increases are uniform and balanced among common and unique measures, and the total percentage increase is the same for each division for each year and for each scenario.
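As a concrete check on this scale, the arithmetic behind "actual percent better than target" can be inferred from the Table 2, Panel A values (the paper does not state the formula explicitly). A minimal sketch, with "Below" goals handled by reversing the comparison:

    # Inferred from Table 2, Panel A; not code from the study's instrument.
    def pct_better_than_target(target, actual, goal="Exceed"):
        """Percent better than target; for "Below" goals, lower actuals are better."""
        if goal == "Exceed":
            return 100.0 * (actual - target) / target
        return 100.0 * (target - actual) / target

    # Return on Sales, Year 1: target 24.0%, actual 26.0% -> 8.33% (as in Table 2)
    print(round(pct_better_than_target(24.0, 26.0), 2))
    # Returns to suppliers, Year 1: target 6.0%, actual 5.0%, goal "Below" -> 16.67%
    print(round(pct_better_than_target(6.0, 5.0, goal="Below"), 2))

Summing these percentages across a division's eight common or eight unique measures gives the roughly 85% ("more favorable") and 52% ("less favorable") totals used in the manipulation.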
Two pretests of the experiment were conducted in a graduate management accounting course and in an MBA course. These pretests led to improvements in the computer program and clarification of the directional goal for all performance measures.10 Next, a panel of six managers and 19 academics with experience and perspective on performance measurement and evaluation decisions each appraised one of the performance sets. The evaluations by this panel of judges were averaged for each set. Before their pre-experiment instructions, the participants were told they would receive instant feedback at the end of the program showing how closely their evaluations matched our panel of judges' scores, and they would have a chance to win Amazon.com gift certificates based on their average difference from the experts. At the end of the computer program, the average difference between the judges' and participant's scores for each evaluation was reported to the participant. This post-experiment feedback served as a way to motivate the performance and effort of the participants during the experiment. In addition, similar to Lipe and Salterio (2000), the panel of judges was also asked to rate each of the performance measures on its relevance for evaluating the performance of the division manager using a scale of 1 (low relevance) to 10 (high relevance).

10 For example, we added a clarification on the "Balanced Scorecard Measures and Targets" screen for each division to indicate whether the goal for each performance measure was to exceed the target or get below the target. Some participants in the pretest expressed confusion about the goal for some measures.
No statistically significant differences (p > 0.05) were found in mean relevance ratings between common and unique measures for either of the case scenarios.11 Thus, it is not likely that any difference in participants' use of common and unique measures is due to differences in the perceived relevance of these sets of measures.

3.2. Participants

Executive managers from around the world were solicited for this study through a website hosted by BetterManagement.com, an organization devoted to providing information on various management innovations such as scorecarding.12 The editor of the website disseminated promotional announcements and the email address of one of the coauthors to its targeted membership in exchange for the coauthor team providing a web-based seminar reporting the results. Use of the BetterManagement.com website, in partnership with the associated management organization (also named BetterManagement.com), provided access to an audience that is generally experienced in performance measurement and evaluation and is knowledgeable about the BSC model. When a potential participant emailed the researcher to express a willingness to participate, the researcher replied back with a case ID code that identified which version of the program they would receive and a link to where they could access the program. Most of the participants accessed an online version of the program (n = 67). The rest (n = 19) downloaded the program to their own computer, completed it, and then emailed back an output (Excel) file.

A total of 148 responses to the simulation were received over a one-month period. Forty-six responses were eliminated because of incomplete participation or multiple responses from the same individual (we used the first complete response). An additional 16 responses were eliminated for those spending less than 10 min on the simulation.13 Thus, 86 usable responses were received. These participants spent an average of 67.4 min completing the experimental program. Table 3 reports the cell sizes (number of participants) across the 16 program versions.

Results from the post-experimental questionnaire provide a report of the demographic structure of our participant group (see Table 4). The average age for the participants was 40.4 years, and 24% are female. The participant group represents approximately 13 countries based on email addresses. Two-thirds of the participants are from the U.S. and 78% are from Western countries (we found no statistically significant differences in evaluation scores due to U.S. vs. non-U.S. nationality). They had an average of 16.2 years of professional work experience, 8.5 years evaluating employees, 4.6 years evaluating organizations, and 1.3 years using the BSC approach. Job titles include Manager (31%), Consultant (27%), Director (13%), Planning/Strategy/Analyst (10%), President/CEO (8%), and Miscellaneous (11%).

Because the organization providing respondents for this study was heavily focused on discussion and dissemination of BSC theory and practice, these participants generally understood the BSC model. In addition, the experimental instrument included a brief review of the BSC model.

11 The judges' mean relevance ratings for the scenario comparing RadWear and RealWear were 6.98 (common) and 6.74 (unique) for RadWear and 7.15 (common) and 6.91 (unique) for RealWear. For the scenario comparing RadWear and WorkWear, the average ratings were 8.01 and 7.95 for RadWear and 7.83 and 8.08 for WorkWear. Because of the rather small sample size, we cannot rule out low statistical power as an explanation for the lack of difference in these ratings.
12 BetterManagement.com is now owned and maintained by SAS (see http://www.sas.com/knowledge-exchange/).
13 An analysis of scores compared to the panel of judges shows a clear difference at around 10 min. For those participants spending 10 min or more in the experiment, the average difference from judges' evaluation scores is 11.14. The average difference increases to 16.86 for those spending less than 10 min in the experiment. We removed these 16 participants in order to reduce the potential impact of low effort on our results.
Table 3
Number of participants per cell.

Panel A: All participants by performance set (n = 86)
DISSIM          NONARTIC           Set 1   Set 2   Set 3   Set 4   Total
(1) RealWear    Articulated          7       6       4       5      22
                Nonarticulated       4       4       7       4      19
(2) WorkWear    Articulated          8       4       8       7      27
                Nonarticulated       8       5       3       2      18
Total                               27      19      22      18      86

Panel B: Participants providing years of experience evaluating employees (n = 53)
(1) RealWear    Articulated          4       3       4       1      12
                Nonarticulated       4       3       3       4      14
(2) WorkWear    Articulated          5       1       5       2      13
                Nonarticulated       7       3       3       1      14
Total                               20      10      15       8      53

Table values represent the number of participants for each version of the instrument. Participants were randomly assigned one of 16 (2 × 2 × 4) possible versions of the program. DISSIM indicates two division combinations (RadWear/WorkWear or RadWear/RealWear). NONARTIC indicates two levels of strategy articulation (articulated or nonarticulated). The four potential performance sets are described in Table 2.
Hence, we can assume that the participants had sufficient pre-experiment exposure to the BSC causal model, which then combined with outcome feedback during the experiment to create the procedural knowledge necessary for improved decision analysis using BSC data. Although the participants had relatively little actual experience with using the BSC in performance evaluations (i.e., subspecialty procedural knowledge), the relatively high level of prior experience evaluating employees was deemed sufficient to measure broad-based performance evaluation experience to test H2. Also, we believe that those managers with more experience evaluating employees should have broad domain procedural knowledge with respect to performance evaluation processes.

Four methods were used to motivate participation and effort in the experiment by these busy professionals. First, the BetterManagement.com website emphasized how their knowledge and skills would contribute to the BSC body of knowledge. Second, potential participants were told they would receive an early copy of the research report. Third, participants understood that they would receive instant feedback at the end of the program showing how closely their evaluations matched our panel of judges' scores. Fourth, participants had a chance to win Amazon.com gift certificates. The participant with the smallest average difference (2.96) compared to the panel of judges' scores received a $75 certificate. In addition, to further encourage participation, a randomly drawn participant received a $50 certificate.

3.3. Manipulation checks

The post-experimental questionnaire asked participants to rate their agreement with the same seven statements used in Lipe and Salterio (2000) as manipulation checks. Participants responded to these statements on an 11-point scale where zero equaled "strongly disagree" and ten equaled "strongly agree." As reported in Table 4, responses indicate that participants agreed that the two divisions were targeting different markets, that these divisions used some measures that were different from each other, and that it was appropriate for these divisions to use some different measures (p < 0.001; probability that mean equals 5.0, one-sample t-test, 2-tailed). They also agreed that the case was easy to understand, not difficult to do, and very realistic (p < 0.01). The only question with an overall neutral response was "Financial performance measures were emphasized in this case." This response is not surprising as many managers (and their companies) in actual practice typically expect to emphasize financial measures more than nonfinancial measures.

There were only a couple of differences across experimental treatments.
Table 4
Post-experimental questionnaire results (n = 86).

General information                                                        Mean (percent)
Education (highest degree received or in process)
  [1 = high school, 2 = AA, 3 = BS, 4 = MS]                                3.2 level
Gender [percent female]                                                    (23.7%)
Age                                                                        40.4 years
Full-time professional work experience                                     16.2 years

Work experience                                                            Mean
Using balanced scorecard                                                   1.3 years
Evaluating employees                                                       8.5 years
Evaluating organizations                                                   4.6 years
Retailing business                                                         0.5 years
Banking industry                                                           2.4 years
Internet business                                                          0.9 years

Additional questions
Please indicate the extent of your agreement with each of the following questions (slider scale range from 0 to 10; 0 = strongly disagree, 5 = agree, 10 = strongly agree).

Question                                                                                      Mean    S.D.    p-Value a
1. Financial performance measures were emphasized in this case.                              5.16    2.17    .528
2. The two divisions were targeting the same markets.                                        2.18    1.94    .000
3. The two divisions used some different performance measures.                               6.05    2.05    .000
4. It was appropriate for the two divisions to employ some different performance measures.   7.04    2.45    .000
5. The case was easy to understand.                                                          6.61    2.28    .000
6. The case was very difficult to do.                                                        2.70    2.47    .000
7. The case was very realistic.                                                              5.61    2.08    .013

a Probability that mean equals 5.0 (2-tailed) based on one-sample t-test.
The RadWear/WorkWear group felt more strongly than the RadWear/RealWear group that the two divisions were targeting different markets. This result is as expected since we designed the RealWear Division to be more similar to RadWear than WorkWear. Second, the group of participants receiving the more articulated strategies felt the case was more realistic than the group receiving the less articulated strategy. This result is interesting as it suggests that participants expected to see each division's vision and strategic objectives. Other than these two differences, there were no statistically significant differences in responses across experimental treatment groups (all p-values > 0.10; independent-sample t-tests, 2-tailed). Of course, due to the rather small sample size, we cannot rule out low statistical power as an explanation for the lack of difference in responses.
4. Results

Table 5 compares the results of a repeated-measures ANOVA analysis of evaluation scores for Year 1 with the results of Lipe and Salterio (2000). The results of our study generally replicate those of Lipe and Salterio, except that in this study DIVISION is highly statistically significant (p = .00). The computer program randomly assigned the presentation order of the divisions. But, for reasons that are not clear, evaluation scores were generally higher for the RadWear Division than for either the WorkWear or RealWear Division, although there was no discernible difference between WorkWear's and RealWear's scores. This difference does not seem to be a concern because the two-way interactions of DIVISION × COMMON and DIVISION × UNIQUE are similar to Lipe and Salterio's results.

Table 5
Least squares ANOVA of evaluation scores for Year 1 only (n = 86).

Variable                          df          SS          MS         F      Current study   L&S (2000) a
                                                                            p-value         p-value
Between subjects
  COMMON                           1        69.01       69.01      1.53        0.22            0.47
  UNIQUE                           1       142.35      142.35      3.16        0.08            0.65
  COMMON × UNIQUE                  1        10.54       10.54      0.23        0.63            0.21
  Error                           82    16,310.95      198.91
Within subjects
  DIVISION                         1       526.45      526.45     11.70        0.00            0.64
  DIVISION × COMMON                1       199.07      199.07      4.43        0.04            0.00
  DIVISION × UNIQUE                1        98.82       98.82      2.20        0.14            0.32
  DIVISION × COMMON × UNIQUE       1        42.29       42.29      0.94        0.34            0.33
  Error                           82      3688.37       44.98

Where dependent variable = evaluation score for each division; DIVISION = division evaluated [RadWear = 1; WorkWear or RealWear = 2]; COMMON = division favored by common measures [RadWear = 1; WorkWear or RealWear = 2]; UNIQUE = division favored by unique measures [RadWear = 1; WorkWear or RealWear = 2].
a Lipe and Salterio (2000).

4.1. Results for H1: outcome feedback

We compare the results for Years 1 and 4 in this study to assess the culminating impact of feedback data on the relationship between common and unique performance measures. Comparing Year 1 to Year 4 is most appropriate for this study as it is not until Year 4 that the participants have seen all possible combinations of unique/common measurement performance for each division (see Table 2). Further, Sprinkle (2000) finds that participants in his study accessed more feedback and performed better only after several periods of the experiment. Table 6 (Panel A) reports the results of a statistical analysis testing the impact of common and unique measures on evaluation scores for Year 1 versus Year 4 using a mixed models approach (Littell, Milliken, Stroup, & Wolfinger, 1996). A repeated measures ANOVA analysis as used in prior studies (e.g., Dilla & Steinbart, 2005; Lipe & Salterio, 2000) is not suitable for data such as ours, which involve two layers of repeated measures (i.e., two divisions over multiple years), random effects (i.e., two divisions out of all possible divisions), and fixed effects (i.e., common or unique measures). In this case, a mixed models approach improves generalizability and is more appropriate than a repeated measures ANOVA approach (Littell et al., 1996, 5–9).14
14 The mixed models analysis uses Proc Mixed in SAS and is based on likelihood ratio test iterations. The Proc Mixed procedure does not provide sums of squares or mean squares as in a conventional ANOVA. However, it does provide an F-value, degrees of freedom (numerator and denominator), and a related probability value for each variable. Also, a model chi-square estimating the significance of the model is computed by comparing its −2 log likelihood with that of a simple means model.
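As a rough, non-authoritative analogue of the Proc Mixed analysis described in footnote 14 (the authors used SAS; this sketch uses Python's statsmodels and assumes a hypothetical long-format data set with one row per participant, division, and year), the UNIQUE × YEAR test could be specified along the following lines:

```python
# Hedged sketch of a mixed model with a random intercept per evaluator; this is
# an approximation, not the authors' SAS Proc Mixed specification.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("evaluation_scores.csv")  # hypothetical long-format file

model = smf.mixedlm(
    "score ~ C(year) * common + C(year) * unique + division",  # fixed effects
    data=data,
    groups=data["participant_id"],  # random intercept for each evaluator
)
result = model.fit()
print(result.summary())  # the C(year):unique coefficient speaks to the UNIQUE x YEAR effect
```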
Table 6
Impact of common and unique measures on evaluation scores.

Panel A: mixed model analysis of evaluation scores for years 1 and 4 (n = 86).

Variable             df Numerator    df Denominator    F        p-Value
YEAR                 1               83                0.68     0.412
DIVISION             1               167               8.97     0.003
COMMON               1               167               26.60    0.000
UNIQUE               1               167               14.50    0.000
COMMON × YEAR        1               167               5.67     0.018
UNIQUE × YEAR        1               167               3.21     0.075
COMMON × UNIQUE      1               167               0.20     0.655
DIVISION × UNIQUE    1               167               0.23     0.633
DIVISION × COMMON    1               167               0.37     0.543

Where YEAR = year of evaluation [1, 4]; dependent variable = evaluation score for each division; DIVISION = division evaluated [RadWear = 1; WorkWear or RealWear = 2]; COMMON = common measures manipulation [less favorable = 0; more favorable = 1]; UNIQUE = unique measures manipulation [less favorable = 0; more favorable = 1]. Bold numbers in the original table are statistically significant at p < .10.

Panel B: mean evaluation scores by division for years 1 and 4 (n = 86).

                 Year 1                                      Year 4
Measures         Div. 1        Div. 2        Overall         Div. 1        Div. 2        Overall
Common
  Favorable      77.8 (10.9)   75.5 (11.0)   76.8 (10.9)     80.1 (12.9)   78.3 (12.3)   79.2 (12.5)
  Less fav.      76.9 (9.6)    71.9 (12.1)   74.0 (11.3)     74.4 (13.2)   72.6 (14.1)   73.6 (13.6)
  Difference     0.9           3.6           2.8*            5.7**         5.7**         5.6***
Unique
  Favorable      77.3 (11.0)   75.4 (10.5)   76.4 (10.7)     80.1 (11.0)   76.4 (13.9)   78.4 (12.6)
  Less fav.      77.5 (9.5)    71.7 (12.5)   74.4 (11.5)     73.6 (14.7)   75.0 (13.0)   74.4 (13.8)
  Difference     (0.2)         3.7           2.0             6.5**         1.4           4.0**

Table values are means (standard deviations) for evaluation scores. Common measures appear on both divisions' balanced scorecards. Unique measures are specific to each division's balanced scorecard. Favorable (Less fav.) means are computed by taking the average scores for each division when the Common (Unique) measures are favorable (less favorable), as defined in Table 2. (* difference is significant at p < .10; ** significant at p < .05; *** significant at p < .01; 2-tailed t-tests.)
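To make the construction of Panel B concrete, the following sketch (hypothetical column names; not the authors' code) computes the favorable versus less-favorable mean evaluation scores for the unique-measures manipulation in Year 4, together with a two-tailed t-test of the difference:

```python
# Sketch of how a Panel B cell can be reproduced from long-format scores:
# mean evaluation score when unique measures were favorable vs. less favorable.
import pandas as pd
from scipy import stats

data = pd.read_csv("evaluation_scores.csv")  # hypothetical long-format file
year4 = data[data["year"] == 4]

favorable = year4.loc[year4["unique"] == 1, "score"]
less_favorable = year4.loc[year4["unique"] == 0, "score"]

print("Favorable mean:     ", round(favorable.mean(), 1))
print("Less favorable mean:", round(less_favorable.mean(), 1))
print("Difference:         ", round(favorable.mean() - less_favorable.mean(), 1))

t_stat, p_value = stats.ttest_ind(favorable, less_favorable)  # two-tailed
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```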
As shown in Panel A, the mixed models analysis comparing Year 1 and Year 4 provides some support for H1: the UNIQUE × YEAR interaction is marginally statistically significant (p = 0.075). This marginal significance could be due in part to low statistical power. The COMMON × YEAR interaction is highly significant, suggesting that, overall, reliance on common measures increases from Year 1 to Year 4. In Table 6 (Panel B), we assess evaluators' relative use (i.e., impact) of common and unique measures by comparing mean evaluation scores when each division's common and unique measures were more favorable versus when they were less favorable. After four periods of outcome feedback, the impact of common measure outcomes appears to increase for both divisions. The impact of unique measure outcomes is much stronger for Division 1 than for Division 2. This pattern provides mixed support for H1. Next, we consider whether the level of broad-based domain evaluation experience was a factor in how the participants relied upon measures.

4.2. Results for H2: prior evaluation experience with outcome feedback

Participants' years of prior experience evaluating employees are used as a proxy for broad domain experience and knowledge structure (Vera-Munoz et al., 2001). Of the 86 participants, 53 reported their prior evaluation experience, so n = 53 for this analysis. EXPERIENCE is a continuous variable equal to years of prior work experience evaluating employees, ranging from zero to 32 with a mean of 8.5 years and a median of 5 years. Table 7 reports the results of a mixed models analysis testing the impact of common and unique measures on evaluation scores for Year 1 versus Year 4 and includes EXPERIENCE. H2, which posits that higher prior evaluation experience will lead to greater use of unique measures after multiple periods of outcome feedback, is tested with the three-way interaction UNIQUE × YEAR × EXPERIENCE.
As shown in Table 7, even with the relatively smaller n, the three-way interaction is highly statistically significant (p = 0.005). The three-way interaction COMMON × YEAR × EXPERIENCE is not statistically significant. These findings provide solid support for the idea that, even though evaluators gave similar weight to common measures as more feedback was received, the more experienced evaluators gave more weight over time to the more predictive unique measures. We might then infer that common measures are simple enough to learn to use without any substantial experience. Unique measures, on the other hand, may be more difficult to learn to use over time, likely due to their relatively complex leading relationship with future common performance outcomes.

Table 7
Mixed model analysis of evaluation scores for years 1 and 4 (n = 53).
Variable                        df Numerator    df Denominator    F       p-Value
YEAR                            1               49                0.30    0.584
DIVISION                        1               97                8.78    0.004
COMMON                          1               97                5.06    0.027
UNIQUE                          1               97                2.48    0.117
EXPERIENCE                      1               97                0.20    0.659
COMMON × YEAR                   1               97                2.34    0.130
UNIQUE × YEAR                   1               97                0.01    0.907
YEAR × EXPERIENCE               1               97                0.01    0.932
COMMON × EXPERIENCE             1               97                0.21    0.645
UNIQUE × EXPERIENCE             1               97                0.96    0.329
COMMON × UNIQUE                 1               97                0.18    0.677
DIVISION × UNIQUE               1               97                0.79    0.378
DIVISION × COMMON               1               97                0.42    0.519
UNIQUE × YEAR × EXPERIENCE      1               97                8.16    0.005
COMMON × YEAR × EXPERIENCE      1               97                0.01    0.938

YEAR = year of evaluation [1, 4]; dependent variable = evaluation score for each division; DIVISION = division evaluated [RadWear = 1; WorkWear or RealWear = 2]; COMMON = common measures manipulation [less favorable = 0; more favorable = 1]; UNIQUE = unique measures manipulation [less favorable = 0; more favorable = 1]; EXPERIENCE = prior years of work experience evaluating employees [mean = 8.5; median = 5]. Bold numbers in the original table are statistically significant at p < .05.
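The H2 test in Table 7 adds EXPERIENCE as a continuous covariate and examines the three-way UNIQUE × YEAR × EXPERIENCE interaction. Extending the earlier sketch (again with hypothetical column names, not the authors' SAS specification), the model could look like this:

```python
# Hedged sketch of the H2 model: the coefficient on the three-way
# year x unique x experience term is the quantity of interest.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("evaluation_scores.csv").dropna(subset=["experience"])

model = smf.mixedlm(
    "score ~ C(year) * unique * experience + C(year) * common * experience + division",
    data=data,
    groups=data["participant_id"],  # random intercept for each evaluator
)
result = model.fit()
print(result.summary())  # inspect the C(year):unique:experience coefficient
```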
Table 8
Effects of prior evaluation experience and unique measure outcome on mean evaluation scores for years 1 and 4.

Panel A: low evaluation experience (<5 years; n = 24)

                 Year 1                                      Year 4
Measures         Div. 1        Div. 2        Overall         Div. 1        Div. 2        Overall
Common
  Favorable      78.7 (12.3)   74.4 (13.4)   77.3 (12.6)     82.0 (18.7)   78.3 (8.7)    79.5 (12.6)
  Less fav.      78.6 (9.9)    71.9 (14.4)   74.1 (13.3)     75.5 (10.9)   72.3 (15.3)   74.4 (12.3)
  Difference     0.1           2.5           3.2             6.5           6.0           5.1
Unique
  Favorable      76.4 (12.0)   79.9 (13.8)   77.5 (12.4)     78.4 (10.8)   78.7 (14.6)   78.5 (12.2)
  Less fav.      83.4 (8.9)    69.1 (12.8)   73.9 (13.3)     76.7 (18.0)   74.6 (8.7)    75.5 (13.1)
  Difference     (7.0)         10.8*         3.6             1.7           4.1           3.0

Panel B: high evaluation experience (≥5 years; n = 29)

                 Year 1                                      Year 4
Measures         Div. 1        Div. 2        Overall         Div. 1        Div. 2        Overall
Common
  Favorable      76.7 (9.8)    77.7 (11.5)   77.0 (10.2)     79.0 (13.3)   77.1 (15.0)   78.1 (13.9)
  Less fav.      81.0 (6.6)    73.5 (9.1)    76.0 (9.0)      78.6 (13.4)   76.1 (14.8)   77.3 (13.9)
  Difference     (4.3)         4.2           1.0             0.4           1.0           0.8
Unique
  Favorable      77.8 (8.7)    74.6 (9.7)    76.2 (9.2)      82.4 (10.4)   79.1 (11.5)   80.7 (10.9)
  Less fav.      78.4 (9.5)    75.3 (10.7)   76.9 (10.0)     75.6 (14.8)   73.9 (17.4)   74.8 (15.8)
  Difference     (0.6)         (0.7)         (0.7)           6.8           5.2           5.9

Table values are means (standard deviations) for evaluation scores. Common measures appear on both divisions' balanced scorecards. Unique measures are specific to each division's balanced scorecard. Favorable (Less fav.) means are computed by taking the average scores for each division when the Common (Unique) measures are favorable (less favorable), as defined in Table 2. (* difference is significant at p < .10; ** significant at p < .05; *** significant at p < .01; 2-tailed t-tests.)
These findings are consistent with H2. To illustrate the three-way interactions, we distinguish relatively high personnel evaluation experience from low experience using the median value of five years as the cutoff. Of the 53 participants, 24 are classified in the “low experience” category (less than 5 years; mean = 1.7 years) and 29 in the “high experience” category (5 years or more; mean = 14.2 years). These two groups also differ significantly in experience evaluating organizations (2.4 years vs. 8.4 years, t = 3.14, p = 0.004, one-tailed). However, the groups do not differ in experience using a balanced scorecard (1.1 years vs. 1.5 years). Overall, the average difference between the judges' and participants' evaluation scores was 11.9 for the “low” experience group and 10.0 for the “high” experience group (one-tailed p = 0.10), suggesting that the latter group may have made somewhat more appropriate evaluations.

Table 8 provides means data comparing the evaluation decisions of the low experience group with those of the high experience group. These results only approximate the three-way interactions tested in Table 7, in which EXPERIENCE was a continuous variable. As shown in Table 8 (Panel A), the low experience group appears to have increased their reliance on common measures from Year 1 to Year 4 for both divisions (5.1 − 3.2 = 1.9). On the other hand, the low experience group does not demonstrate any change in how unique performance measures impacted their evaluations of the two divisions (3.0 − 3.6 = −0.6). In contrast, the high experience group clearly increased their reliance on unique measures for both divisions from Year 1 to Year 4 (5.9 − (−0.7) = 6.6). The impact of common measures for the high experience group does not appear to increase from Year 1 to Year 4 (0.8 − 1.0 = −0.2). Fig. 1 illustrates these results graphically and provides further support for H2.

4.3. Additional analyses

We further tested whether the high experience participants placed more importance on unique measures compared to low experience participants by comparing the responses of the two participant groups to post-experiment Statement 4, “It was appropriate for the two divisions to employ some different performance measures.” The mean response for the high experience participants was 7.93 versus 6.08 for the low experience group (p = 0.004, one-tailed). This result suggests that participants with higher broad domain evaluation experience felt more strongly than their less experienced counterparts that different (i.e., unique) measures were needed for the two divisions.

Fig. 1. Average impact on mean evaluation scores due to common and unique measures for years 1 and 4. Panel A: low experience participants (n = 24). Panel B: high experience participants (n = 29). Note: based on Table 8, “Impact on Scores” is computed by taking the average difference in mean scores for the two divisions when the common (unique) measures are favorable versus less favorable.
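The “impact on score” values plotted in Fig. 1 follow directly from the Table 8 means. A small sketch of that arithmetic (hypothetical column names; median split at five years as described above) is:

```python
# Sketch of the Fig. 1 quantities: for each experience group and year, the
# favorable-minus-less-favorable difference in mean scores, averaged across
# the two divisions, for the common and the unique manipulations.
import numpy as np
import pandas as pd

data = pd.read_csv("evaluation_scores.csv").dropna(subset=["experience"])
data["exp_group"] = np.where(data["experience"] >= 5, "high", "low")  # median split

def impact(df, manipulation):
    fav = df.loc[df[manipulation] == 1].groupby("division")["score"].mean()
    less = df.loc[df[manipulation] == 0].groupby("division")["score"].mean()
    return (fav - less).mean()  # average difference across the two divisions

for group in ["low", "high"]:
    for year in [1, 4]:
        subset = data[(data["exp_group"] == group) & (data["year"] == year)]
        print(f"{group} experience, year {year}: "
              f"common impact = {impact(subset, 'common'):.1f}, "
              f"unique impact = {impact(subset, 'unique'):.1f}")
```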
As an alternative proxy for broad-based evaluation experience, we tried replacing EXPERIENCE (both main and interaction effects) with full-time years of general work experience, experience using the balanced scorecard, experience evaluating organizations, and experience in the retail industry.15 None of these alternative proxies was statistically significant, suggesting that prior experience evaluating employees is the domain experience most relevant to the impact of unique measures in this study.

5. Conclusions

This study adds to the growing body of research on the effectiveness and limitations of the balanced scorecard model. We find evidence that outcome feedback and broad domain evaluation experience may help offset the tendency of evaluators to use a simplifying strategy of relying only on common measures in the evaluation process. There are decision situations in which unique scorecard measures should not be ignored, and the conditions that encourage their use must be understood and addressed. Otherwise, evaluators and managers may focus upon less strategically linked common measures, and the potential benefits of the balanced scorecard approach will be diminished.

This study extends the literature stream started by Lipe and Salterio (2000), who used a single evaluation period to study the impact of common versus unique measures in a balanced scorecard (BSC) context. Lipe and Salterio determined that evaluators of division performance respond strongly to measures that are common across divisions while demonstrating relatively little response to measures that are unique to specific divisions. The current study suggests that evaluators become increasingly responsive to unique measures in their evaluations as outcome feedback demonstrating the relevance of those measures to evaluation decisions is provided. Further, this study shows that subjects with relatively higher broad domain evaluation experience rely more on unique measures in their evaluations after several periods than subjects with lower broad domain experience. These results support and extend the findings of Dilla and Steinbart (2005), who report that procedural knowledge developed through experience in a specialized domain (i.e., developing scorecards in a classroom setting) helped students use more sophisticated decision processes (i.e., using both common and unique measures in the evaluations). Further research could examine whether evaluators with more specific BSC evaluation experience utilize unique measures more than those with only broad domain evaluation experience over multiple evaluation periods.

Taken together, the results of Lipe and Salterio (2000), Dilla and Steinbart (2005), and this study provide combined evidence consistent with the theoretical model of Bonner and Walker (1994). This model stresses that instruction and experience affect the acquisition of knowledge, particularly procedural knowledge, which is necessary for improved decision performance. These results are important to organizations implementing a BSC evaluation framework. Along with previous work (Banker et al., 2004; Dilla & Steinbart, 2005; Libby et al., 2004; Roberts et al., 2004), these findings provide support for the receptiveness of decision makers, especially those with prior evaluation experience, to the use of leading and unique performance measures tied closely to an organization's individual strategy.
This receptiveness is consistent with the underlying lead–lag linkage principle advocated in the BSC model (Kaplan & Norton, 1996, 2001).

There are a number of limitations to this study. Using an Internet-distributed experiment rather than a “pencil-and-paper” method has several potential advantages, including access to a geographically dispersed participant pool and potentially higher construct, external, and convergent validity. However, it also decreases control over the experiment. The Bettermanagement.com organization and website used to identify participants for this study caters to experienced managers and executives who are knowledgeable about performance measurement and evaluation practices. Using this more experienced group allowed us to test the effects of prior evaluation experience on the use of unique measures. However, it is also possible that an unidentified self-selection bias exists among the individuals who are targeted by this particular organization and who then chose to respond to our call for experiment participants. In addition, due to the rather small participant pool, especially for the prior evaluation experience analysis, we cannot rule out low statistical power as an explanation when no statistically significant effects were identified.

Using self-reported levels of prior experience evaluating employees is not a perfect measure of broad domain procedural knowledge with respect to performance evaluation processes. In hindsight, it would have been helpful to validate this measure with knowledge tests at the end of the experiment, such as those in Dearman and Shields (2001). However, the experiment was already fairly long for busy managers, and we felt that managers with more experience evaluating employees should have broad domain procedural knowledge with respect to performance evaluation processes.

Of course, the various pressures and incentives that would be present for real-world decision makers are not present in our experimental design. As discussed in Lipe and Salterio (2000), experimental studies attempting to model real-world phenomena require simplifying assumptions that impair realism. For example, the participants in this study received immediate outcome feedback. In actual practice, the effectiveness of feedback may be diminished by realistic time delays. The lagged effect of unique measures in one year on common measures in the following year over a four-year period may be difficult to learn and may require even more periods of feedback to develop procedural knowledge. Also, the consistently strong lead–lag relationship between unique and common measures across periods may be unrealistic (Maiga & Jacobs, 2006; Said et al., 2003). Although not readily transparent to participants, this strong relationship gives unique measures the best chance of being noticed by evaluators, but it may also limit the generalizability of the study. Nevertheless, the results of this study emphasize the benefits of validating and communicating the relationship between unique measures and future financial performance (Banker et al., 2004; Humphreys & Trotman, 2011; Ittner & Larcker, 2003).

Two other simplifying assumptions were required for this study. First, the balanced scorecards used in our case materials have the same measures over the four-year period. In reality, companies may add or delete measures from their original scorecards. Second, performance outcomes are manipulated exogenously. In the real world, performance may be viewed as an endogenous factor because the weights placed on various measures by evaluators will often affect future performance. Finally, this study focuses on the perspective of the evaluator rather than on the performer. Clearly, the behavioral impact on the organization or people being evaluated is another significant focus of the balanced scorecard approach.

15 For each alternative proxy for domain-specific knowledge, we used both the years of experience and a binary (0, 1) variable equal to one if the years of experience are greater than the median (otherwise zero).
Acknowledgments

We wish to thank Bruce Behn, Tim Eaton, Linda Flaming, Rosemary Fullerton, Marlys Lipe, Taylor Randall, Steven Salterio, and workshop participants at Brigham Young University and The University of Utah for their helpful comments on earlier drafts of this paper. We also thank Marlys Lipe and Steve Salterio for sharing their experimental materials with us. The research assistance of Amber Webb and the financial support of the BYU Marriott School's Rollins Center for eBusiness are also appreciated.
References

Alexander, R. M., Blay, A. D., & Hurtt, R. K. (2006). An examination of convergent validity between in-lab and out-of-lab Internet-based experimental accounting research. Behavioral Research in Accounting, 18, 207–217.
Anderson, J. R., Greeno, J. G., Kline, P. J., & Neves, D. M. (1981). Acquisition of problem-solving skill. In J. R. Anderson (Ed.), Cognitive skills and their acquisition (pp. 191–230). Hillsdale, NJ: Lawrence Erlbaum Associates.
Annett, J. (1969). Feedback and human behaviour. Baltimore, MD: Penguin Books.
Banker, R. D., Chang, H., & Pizzini, M. J. (2004). The balanced scorecard: Judgmental effects of performance measures linked to strategy. The Accounting Review, 79(1), 1–23.
Biggs, S. F., Bedard, J. C., Gaber, B. G., & Linsmeier, T. J. (1985). The effects of task size and similarity on the decision behavior of bank loan officers. Management Science, 31, 970–987.
Bockenholt, U., Albert, D., Aschenbrenner, M., & Schmalhofer, F. (1991). The effects of attractiveness, dominance, and attribute differences on information acquisition in multiattribute binary choice. Organizational Behavior and Human Decision Processes, 49, 258–281.
Bonner, S. E., & Lewis, B. L. (1990). Determinants of auditor expertise. Journal of Accounting Research, 28, 1–20 (Supplement).
Bonner, S. E., & Walker, P. L. (1994). The effects of instruction and experience on the acquisition of accounting knowledge. The Accounting Review, 69(January), 157–178.
Bryant, S. M., Hunton, J. E., & Stone, D. N. (2004). Internet-based experiments: Prospects and possibilities for behavioral accounting research. Behavioral Research in Accounting, 16, 107–129.
Cardinaels, E., & van Veen-Dirks, P. M. G. (2010). Financial versus non-financial information: The impact of information organization and presentation in a balanced scorecard. Accounting, Organizations and Society, 35, 565–578.
Davis, T. C. (1950). How the Du Pont organization appraises its performance. AMA Financial Management Series No. 94. New York: American Management Association.
Dearman, D. T., & Shields, M. D. (2001). Cost knowledge and cost-based judgment performance. Journal of Management Accounting Research, 13, 1–18.
Dilla, W. N., & Steinbart, P. J. (2005). Relative weighting of common and unique balanced scorecard measures by knowledgeable decision makers. Behavioral Research in Accounting, 17, 43–53.
Einhorn, H. J., & Hogarth, R. M. (1981). Behavioral decision theory: Processes of judgment and choice. Journal of Accounting Research, 19(1), 1–31.
Farrell, A. M., Kadous, K., & Towry, K. L. (2008). Contracting on contemporaneous versus forward-looking measures: An experimental investigation. Contemporary Accounting Research, 25(3), 773–802.
Farrell, A. M., Luft, J., & Shields, M. D. (2007). Accuracy in judging the nonlinear effects of cost and profit drivers. Contemporary Accounting Research, 24(4), 1139–1169.
Frederickson, J. R., Peffer, S. A., & Pratt, J. (1999). Performance evaluation judgments: Effects of prior experience under different performance evaluation schemes and feedback frequencies. Journal of Accounting Research, 37(1), 151–165.
Gupta, M., & King, R. R. (1997). An experimental investigation of the effect of cost information and feedback on product cost decisions. Contemporary Accounting Research, 14(1), 99–127.
Hodge, F. D., Hopkins, P. E., & Wood, D. A. (2010). The effects of financial statement information proximity and feedback on cash flow forecasts. Contemporary Accounting Research, 27(1), 101–133.
Hopwood, A. G. (1972). An empirical study of the role of accounting data in performance evaluation. Journal of Accounting Research (Supplement), 156–182.
Humphreys, K. A., & Trotman, K. T. (2011). The balanced scorecard: The effect of strategy information on performance evaluation judgments. Journal of Management Accounting Research, 23, 81–98.
Ilgen, D. R., Fisher, C. D., & Taylor, M. S. (1979). Consequences of individual feedback on behavior in organizations. Journal of Applied Psychology, 64(4), 349–371.
Ittner, C., & Larcker, D. F. (2003). Coming up short on nonfinancial performance measurement. Harvard Business Review (November), 1–9.
Ittner, C., Larcker, D. F., & Meyer, M. W. (2003). Subjectivity and the weighting of performance measures: Evidence from a balanced scorecard. The Accounting Review, 78(3), 725–758.
Ittner, C. D., Larcker, D. F., & Randall, T. (2003). Performance implications of strategic performance measurement in financial services firms. Accounting, Organizations and Society, 28, 715–741.
Johnson, H. T., & Kaplan, R. S. (1991). Relevance lost: The rise and fall of management accounting. Boston, MA: Harvard Business School Press.
Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica, 263–291.
Kaplan, R., & Norton, D. (1992). The balanced scorecard—Measures that drive performance. Harvard Business Review (January–February), 71–79.
Kaplan, R., & Norton, D. (1996). The balanced scorecard. Boston, MA: Harvard Business School Press.
Kaplan, R., & Norton, D. (2001). The strategy-focused organization. Boston, MA: Harvard Business School Press.
Kaplan, S. E., & Wisner, P. S. (2009). The judgmental effects of management communications and a fifth balanced scorecard category on performance evaluation. Behavioral Research in Accounting, 21(2), 37–56.
Kessler, L., & Ashton, R. H. (1981). Feedback and prediction achievement in financial analysis. Journal of Accounting Research, 19(1), 146–162.
Krumwiede, K., Eaton, T. V., Swain, M. R., & Eggett, D. (2008). A research note on the effects of financial and nonfinancial measures in balanced scorecard evaluations. Advances in Accounting Behavioral Research, 11, 155–177.
Lau, C. M. (2011). Nonfinancial and financial performance measures: How do they affect employee role clarity and performance? Advances in Accounting, 27, 286–293.
Lawson, R., Stratton, W., & Hatch, T. (2006). Scorecarding goes global. Strategic Finance, 87(9), 35–41.
Leung, P. W., & Trotman, K. T. (2005). The effects of feedback type on auditor judgment performance for configural and non-configural tasks. Accounting, Organizations and Society, 30, 537–553.
Leung, P. W., & Trotman, K. T. (2008). Effects of different types of feedback on the level of auditors' configural information processing. Accounting & Finance, 48, 301–318.
Libby, R. (1995). The role of knowledge and memory in audit judgment. In R. Ashton, & A. Ashton (Eds.), Judgment and decision-making research in accounting and auditing. Cambridge, U.K.: Cambridge University Press.
Libby, T., Salterio, S. E., & Webb, A. (2004). The balanced scorecard: The effects of assurance and process accountability on managerial judgment. The Accounting Review, 79(4), 1075–1094.
Liedtka, S. L., Church, B. K., & Ray, M. R. (2008). Performance variability, ambiguity intolerance, and balanced scorecard-based performance assessments. Behavioral Research in Accounting, 20, 73–88.
Lipe, M. G., & Salterio, S. E. (2000). The balanced scorecard: Judgmental effects of common and unique performance measures. The Accounting Review, 75(3), 283–298.
Lipe, M. G., & Salterio, S. E. (2002). A note on the judgmental effects of the balanced scorecard's information organization. Accounting, Organizations and Society, 27, 531–540.
Littell, R. C., Milliken, G. A., Stroup, W. W., & Wolfinger, R. D. (1996). SAS® system for mixed models. Cary, NC: SAS Institute Inc.
Luckett, P. F., & Eggleton, I. R. C. (1991). Feedback and management accounting: A review of research into behavioural consequences. Accounting, Organizations and Society, 16(4), 371–394.
Luft, J. L., & Shields, M. D. (2001). Why does fixation persist? Experimental evidence on the judgment performance effects of expensing intangibles. The Accounting Review, 76(4), 561–587.
Maiga, A. S., & Jacobs, F. A. (2006). Assessing the impact of benchmarking antecedents on quality improvement and its financial consequences. Journal of Management Accounting Research, 18, 97–123.
Malina, M. A., & Selto, F. H. (2001). Communicating and controlling strategy: An empirical study of the effectiveness of the balanced scorecard. Journal of Management Accounting Research, 13, 47–90.
Payne, J. W., Bettman, J. R., & Johnson, E. J. (1993). The adaptive decision maker. New York: Cambridge University Press.
Roberts, M. L., Albright, T. L., & Hibbets, A. R. (2004). Debiasing balanced scorecard evaluations. Behavioral Research in Accounting, 16, 75–88.
Said, A. A., HassabElnaby, H. R., & Wier, B. (2003). An empirical investigation of the performance consequences of nonfinancial measures. Journal of Management Accounting Research, 15, 193–223.
Schmidt, R. A., Young, D. E., Swinnen, S., & Shapiro, D. C. (1989). Summary knowledge of results for skill acquisition: Support for the guidance hypothesis. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15(2), 352–359.
Silk, S. (1998). Automating the balanced scorecard. Management Accounting (May), 38–44.
Slovic, P., Griffin, D., & Tversky, A. (1990). Compatibility effects in judgment and choice. In R. M. Hogarth (Ed.), Insights in decision making: Theory and applications. Chicago, IL: University of Chicago Press.
Slovic, P., & MacPhillamy, D. (1974). Dimensional commensurability and cue utilization in comparative judgment. Organizational Behavior and Human Performance, 11, 172–194.
Sprinkle, G. B. (2000). The effect of incentive contracts on learning and performance. The Accounting Review, 75(3), 299–326.
Vera-Munoz, S. C., Kinney, W. R., Jr., & Bonner, S. E. (2001). The effects of domain experience and task presentation format on accountants' information relevance assurance. The Accounting Review, 76(3), 405–429.
Kip R. Krumwiede, CMA, CPA, has a PhD in Accounting from The University of Tennessee and an MAcc degree from Brigham Young University. Before joining the accounting faculty of The University of Richmond in 2009, he served on the faculties of Washington State, BYU, and Boise State, and he has also worked in industry in a variety of accounting positions. His teaching and research emphasis is management accounting, and he has published many articles in both practitioner and academic journals.
Professor Monte Swain is the Deloitte Professor in the School of Accountancy at Brigham Young University. Since completing his PhD at Michigan State University in 1992, Professor Swain has researched and taught management accounting at BYU. His empirical research focuses on the impact of information structure on cognitive decision processes. In addition, he writes and consults on the effective use of strategic measurement systems in organizations.
Todd Thornock is an Assistant Professor of Accounting at Iowa State University. After completing his bachelor's and master's degrees in Accounting from Brigham Young University in 2002, he received his PhD from the University of Texas at Austin in 2011. Todd teaches managerial accounting and decision making to both undergraduate and graduate students. His research interests center on the effects of information (e.g., performance feedback, management reports) and controls on individual decision making.
Dennis L. Eggett is an Associate Research Professor in the Department of Statistics at Brigham Young University. Dennis received bachelor's and master's degrees in Statistics from Brigham Young University and a PhD in Statistics from North Carolina State University. In addition, he serves as Director of the Center for Collaborative Research and Statistical Consulting.