Using plural modeling for predicting decisions made by adaptive adversaries




Reliability Engineering and System Safety 108 (2012) 77–89


Dennis M. Buede*, Suzanne Mahoney, Barry Ezell, John Lathrop
Innovative Decisions, Inc., 1945 Old Gallows Rd., Suite 207, Vienna, VA 22182, USA


Abstract

Article history: Received 11 June 2011 Received in revised form 28 May 2012 Accepted 1 June 2012 Available online 15 June 2012

Incorporating an appropriate representation of the likelihood of terrorist decision outcomes into risk assessments associated with weapons of mass destruction attacks has been a significant problem for countries around the world. Developing these likelihoods gets at the heart of the most difficult predictive problems: human decision making, adaptive adversaries, and adversaries about which very little is known. A plural modeling approach is proposed that incorporates estimates of all critical uncertainties: who is the adversary, and what skills and resources are available to him; what information is known to the adversary, and what perceptions of the important facts are held by this group or individual; what does the adversary know about the countermeasure actions taken by the government in question; what are the adversary's objectives, and what are the priorities of those objectives; what would trigger the adversary to start an attack, and what kind of success does the adversary desire; how realistic is the adversary in estimating the success of an attack; and how does the adversary make a decision, and what type of model best predicts this decision-making process. A computational framework is defined to aggregate the predictions from a suite of models, based on this broad array of uncertainties. A validation approach is described that deals with a significant scarcity of data. © 2012 Elsevier Ltd. All rights reserved.

Keywords: Adaptive adversary; Probabilistic risk analysis; Plural analysis; Descriptive modeling of decisions

1. Introduction

Probabilistic risk analysis (PRA) is being used extensively to address not only the risks of engineered and natural structures, but also attacks by human adversaries [1–4]. There has been some criticism and extended discussion about the appropriateness of PRA for situations in which the threat is an adaptive, human adversary (hereafter called an adaptive adversary) [5–12]. The focus of this paper is not PRA and whether it is an appropriate approach for risk analysis involving adaptive adversaries. Rather, the focus is to develop a plural modeling framework, similar to suggestions made by Guikema and Aven [13], that addresses modeling adaptive adversaries for risk analysis so that the results can be used by whatever higher-level risk analysis method seems appropriate. The context for this paper is the threat risk assessments performed by the Department of Homeland Security for weapons of mass destruction (WMD). In particular, our illustrations will be for the bio-terrorism risk assessment, but the approach and comments apply to any risk analysis of this sort. The motivations for modeling terrorists as adaptive adversaries will be addressed extensively later in this paper. But to summarize for now, terrorists are not homogeneous but differ

* Corresponding author. Tel.: +1 703 861 3678; fax: +1 703 860 8639.
E-mail address: [email protected] (D.M. Buede).

0951-8320/$ - see front matter © 2012 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.ress.2012.06.002

widely in terms of motivations; decision-making information, skills, and processes; and organizational or personal psychology. In addition, there will likely be some interaction between what the terrorist (red) does and what the defending government (blue) does. For this paper we are focusing on strategic risk analyses of one to three years, so move-countermove aspects of this interaction will be fuzzy at best. For more tactical or operational risk analysis involving interchanges over minutes through months, it is much more critical to model these interactions between red and blue, making a time-dependent model of the adaptive adversary more important. The primary assertion associated with this research is that adaptive adversaries, acting as terrorists, cannot be assumed to be rational decision makers. Even more emphatically, modelers cannot presume to know which modeling approach best characterizes the decision-making outcomes of these individuals and groups, all of whom are different, some dramatically so. Nonetheless, the perspectives and motivations of these adaptive adversaries are critical to predicting their decision-making outcomes and need to be included in the modeling process. Our approach in this paper is to use multiple modeling methods, i.e., plural modeling. These modeling methods will consider the motivations or objectives of the adaptive adversaries, will address multiple decision-making styles, and will be conditioned on red's perceptions of red's capabilities as well as red's perceptions of the defensive actions that blue has taken. This approach is founded on a principle that has been learned many times in the military/



intelligence communities: that blue should not assume that red will do what blue would do in a given situation, often called "mirroring." In summary, our approach will categorize adaptive adversaries into multiple groups based on similarities of motivations and resources, decision-making characteristics, and psychology. The decision-making outcomes of each group will then be modeled by multiple simple, descriptive methods and aggregated into a probabilistic representation based upon the uncertainties of the situation, the red group's characteristics, and red's perceptions of its capabilities and of what defensive actions blue has taken. Our justification for this approach is that plural modeling (using relatively simple models) has proven, across many domains, to be more statistically accurate than a single model, no matter how complex that model is [14]. The data input requirements for these simple models will be similar and manageable for the task at hand. Finally, the experts providing these model inputs should be more comfortable providing the information required by these models than providing the output of the adversary choice models. This paper identifies several critical issues that must be addressed by any solution to this problem. Next we justify and define a plural modeling approach for computing the probabilities of the adversary's decision outcomes. Since this is a computationally intensive problem and the plural modeling solution may need to be inserted into any of a number of risk assessments, we define a computational framework for implementing the proposed plural modeling solution. Finally, we address the important issue of validation.

2. Issues associated with modeling adaptive adversaries

The largest issue in modeling adaptive adversaries for a risk analysis is coming to grips with how little is really known. Once this reality is accepted, the fact that a probabilistic model must be used is easy to accept. So the output of the adaptive adversary model is not going to be "X is going to happen." But even more importantly, just about everything about the adaptive adversary is uncertain: who is red, and is this adversary a group or individual? What does red know about blue's past actions and future intentions? What does red care about in mounting an attack (e.g., hurting blue, changing blue's policies, building a bigger red organization)? What kind of success does red need to justify an attack? What resources (financial, technical, personnel, etc.) does red possess for the attack? How realistic is red in estimating the success of an attack red may mount? How does red make a decision about what attack to mount, and does red's decision process unfold as intended or is it diverted to something new and unpredictable? What are the criteria that would cause red to activate the attack (is red waiting for the attack approach to be ready, or waiting for a trigger based on blue actions or something else)? The bottom line here is that there are many uncertainties, and they should all be modeled to capture blue's uncertainty about red's actions.

2.1. A structured approach for discussing this modeling problem

In this paper we will use influence diagrams to model decisions by red. There is a long history of archival literature on influence diagrams; see [15,16] for two early papers. An influence diagram is an acyclic directed graph with three types of nodes: decision nodes, random variable nodes and value nodes. Directed links pointing to a node indicate that the nodes from which the arrows emanate contain the parameters required to evaluate the destination node's function. Decision nodes, represented by boxes, have multiple, mutually exclusive and collectively exhaustive alternatives, shown as lines of text within the box. Arrows entering a decision node indicate the information available at the time of the decision. Random variable nodes or "chance nodes," represented by ovals, have multiple possible, mutually exclusive and collectively exhaustive states, shown as lines of text within the oval. The function associated with a random variable node represents the probabilistic dependence between the random variable and the nodes having arrows pointing to it. A value node, shown as a hexagon, represents a measurable objective (which could be a combination of objectives) and has an associated value function to calculate the measure. Arrows entering a value node indicate the variables and decisions that serve as parameters for the value function.

Fig. 1 is a simplification of the red influence diagram for red decisions. The box in the upper left represents a number of decisions that a particular red group (or individual) will make regarding an attack on blue. Examples of these decisions include the target to be attacked, the weapon type used in the attack, and the delivery mechanism. The specific decisions that are made (as well as their order) may be influenced by which group is being analyzed. The specific decision alternatives chosen will be affected by the preferences of the group. Red is uncertain about which countermeasures blue has implemented or plans to implement, as well as red's chances of success should an attack be initiated. Finally, red is uncertain about the consequences that may result from the attack if it is initiated. The bottom node in the figure involves a number of concepts. For each alternative red considers, it assesses the possible (uncertain) consequences in terms of the degree to which they further each of red's objectives. Then red assesses the overall value of the alternative by in some way combining how all those objectives are furthered into a single impression of overall value, accounting for the relative importance red associates with each objective. That sentence is deliberately engineered to avoid making any assumptions as to how, and how systematically, red incorporates all those considerations in deciding among alternatives.

Fig. 2 provides a more detailed representation of red's decision problem, showing some of the additional nodes that would be needed for a realistic analysis. There are three decision nodes in the upper left, one for attack initiation and two for the agent and target aspects of an attack. For both the agent and target aspects, only a small subset of the full spectrum of decision alternatives is shown. In the upper right, issues associated with red's perceptions of blue's countermeasures are shown. We will discuss the issue of perceptions in more detail later in the paper. Uncertainties about the consequences of a red attack are shown

Fig. 1. Simplification of the red decision problem.
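The three influence-diagram node types described in Section 2.1 can be sketched as simple data structures. This is our own minimal illustration; the class names, fields, and all numeric values (the countermeasure probabilities and the value function) are invented assumptions, not taken from the paper's models.

```python
# Minimal sketch of influence-diagram node types; names and numbers are
# illustrative assumptions, not the paper's actual model.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DecisionNode:
    name: str
    alternatives: list            # mutually exclusive, collectively exhaustive
    informed_by: list = field(default_factory=list)   # arrows into the node

@dataclass
class ChanceNode:
    name: str
    states: list
    probs: dict = field(default_factory=dict)  # P(state), possibly conditional

@dataclass
class ValueNode:
    name: str
    parents: list
    value_fn: Callable = None     # maps parent outcomes to a value

# A fragment in the spirit of Fig. 1: red's attack decision, red's uncertainty
# about blue countermeasures, and a value node depending on both.
attack = DecisionNode("Attack type", ["aerosol", "food supply", "none"])
counter = ChanceNode("Blue countermeasures", ["deployed", "not deployed"],
                     probs={"deployed": 0.6, "not deployed": 0.4})
red_value = ValueNode("Red value", [attack, counter],
                      value_fn=lambda alt, cm: 0.0 if alt == "none"
                      else (0.3 if cm == "deployed" else 0.9))

def expected_value(alt):
    """Expected value of one alternative, averaging over the chance node."""
    return sum(p * red_value.value_fn(alt, s) for s, p in counter.probs.items())
```

Evaluating a decision node then amounts to computing such an expectation for each alternative, exactly the roll-up the value node in Fig. 1 performs.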


[Node annotations from Fig. 2: red's prior belief and red's perception of blue action combine into a belief about the location and effectiveness of blue countermeasures; red's belief about its probability of success depends upon red's decisions as well as its belief about countermeasures; distributions over red preferences span total cost to red, change blue policy, recruit/further cause, and adhere to religious teachings; consequence distributions are derived from blue estimates given the scenario, adjusted for each red group type.]

Fig. 2. More detailed representation of the red decision problem. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 3. Detailed simplification of realistic influence diagram for red's decision problem. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

in the bottom right; the conditioning of these uncertainties on red's choices is not shown here to keep the diagram simple, but is shown in Fig. 3. Finally, the various consequences will impact the extent to which red's objectives are achieved. Examples of red's objectives are total cost (red can only afford certain capabilities), changing blue policy, furthering red's cause by increased recruiting or standing, and adhering to religious teaching. Of course some of

these will not be relevant to some groups or individuals but critical to other groups or individuals. Fig. 3 shows red’s decision problem in more detail in terms of arrows between nodes and variables needed for the value model. But this figure is still a simplification of the final model, which could have, e.g., more decision nodes and many more choices associated with each decision node. Also there will be some


intermediate chance nodes between the decision nodes and the consequence nodes that would mitigate the effects associated with the decisions being made. Note that a node for which type of group is undertaking the attack is shown at the bottom of this figure. There would be four to eight group types needed for a realistic analysis.

2.2. A discussion of the uncertainties associated with this modeling problem

These figures presuppose specific knowledge that we believe is generally available concerning groups or individuals who are likely to use WMD against large countries like the United States, though a great deal of uncertainty remains. Example uncertainties concern the skills and resources the red group or individual may have, what their fundamental objectives are [17], what targets they would consider and other aspects of their attack, and what the consequences (e.g., deaths, financial damage, impact on the economy) of such an attack might be. There is sufficient information among researchers and intelligence analysts to elicit probability distributions over the variables discussed here for each of four to eight red groups. The analysis can then be conducted for each group, and then aggregated across groups using probability distributions developed across the groups. A second category of knowledge presupposed by these figures concerns the critical but very difficult issue of the attackers' perceptions of what the consequences of an attack may be, as well as how the consequences will relate to achieving sufficient levels of satisfaction on their fundamental objectives. This issue of perceptions is much more difficult to address than red and blue. Horlick-Jones [18] discusses additional dimensions of perceptions that add to perception complexities, including technical, engineering, societal, moral and political, each impacting red's and blue's perceptions.
Renn [19] explains that the complexity resides in the fact that perception is a mental representation. We will address how to gather information on perceptions in Section 2.3. Before leaving this topic, we provide another influence diagram, Fig. 4, showing red's perceptions of blue's countermeasures decision. At the bottom of this figure

we show a possible representation of red's perceptions of blue's objectives and associated measures for those objectives. What were red's decision alternatives in Figs. 2 and 3 are now uncertainties for blue when making countermeasure decisions. Exactly how much discussion members of a red group might have about blue's decision making is a question that should be posed to experts. One of our plural modeling methods deals explicitly with these issues. Another category of knowledge necessary to perform calculations on these diagrams is what types of models are reasonable representations of how each of the red groups or individuals will decide which alternative to choose. Multiple objective decision analysis (MODA) [20] is a normative decision model, though it is at times used as a descriptive model. Satisficing [21], lexicographic reasoning [22], and prospect theory [23] have been suggested as more realistic descriptive models. Various forms of game theory are at times used for this situation, though the descriptive power of these highly rational modeling approaches is often called into question. Level-k game theory [24] fits nicely within the plural modeling approach being taken here. These decision models are explained in Section 3.2. There are deeper questions related to how to describe the decision process, such as: is the order in which decisions (e.g., threat, agent) are made important in predicting the final decision? Finally, there is the uncertainty of how reliably red follows some descriptive process. In a modeling sense, how noisy is red's behavior when compared to our best attempt at a descriptive model? Our approach to dealing with these issues is the plural modeling discussed in Section 1 and incorporated into the title of the paper.
We will use as many of these modeling methods as seems appropriate and develop probability distributions across the actions of each red group based upon Monte Carlo simulations for the previous uncertainties for each of these methods. Finally, there is the issue of blue’s reaction to red’s attack. First, is this reaction even important to red? Second, is there any deterrence effect that will keep red from trying to be too successful? These issues need to be addressed via elicitation and modeling. Two approaches that can be used are game theory and incorporating these issues into the fundamental objectives of red.
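Three of the descriptive choice rules named above (MODA-style weighted value, satisficing, and lexicographic reasoning) can be sketched in a few lines. The alternatives, attribute names, scores, weights, and aspiration levels below are all invented for illustration; they are not the paper's elicited values.

```python
# Toy implementations of three descriptive choice rules; all inputs are
# invented for illustration.

def moda_choice(alts, weights):
    """Multi-objective decision analysis: choose the highest weighted value."""
    def score(a):
        return sum(w * a["scores"][k] for k, w in weights.items())
    return max(alts, key=score)["name"]

def satisficing_choice(alts, aspirations):
    """Choose the first alternative meeting every aspiration level."""
    for a in alts:
        if all(a["scores"][k] >= t for k, t in aspirations.items()):
            return a["name"]
    return None

def lexicographic_choice(alts, priority_order):
    """Compare on the top-priority objective; later ones only break ties."""
    return max(alts,
               key=lambda a: tuple(a["scores"][k] for k in priority_order))["name"]

alternatives = [
    {"name": "target A", "scores": {"impact": 0.9, "feasibility": 0.2}},
    {"name": "target B", "scores": {"impact": 0.5, "feasibility": 0.8}},
]
```

Note that the three rules can disagree on the same inputs (here, lexicographic priority on impact favors target A while equal MODA weights favor target B), which is exactly why the plural approach runs them side by side rather than committing to one.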

Fig. 4. Red's perceptions of blue's decision on countermeasures. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


2.3. A discussion of implementation issues for this modeling problem

Some of the critical issues in modeling adaptive adversaries are (1) integrating this model into the risk assessment models that need probability distributions over adversary decision alternatives, (2) dealing with the computational size induced by the number of relevant WMD variables, (3) gathering information on red perceptions such that the results represent how red is thinking about its own decisions and not how blue would act if it were the terrorists, and finally (4) integrating the results of the many levels of uncertainty into one representation of the uncertainty faced by blue for a terrorist attack. The first issue deals with knowing what decision outcomes the broader risk assessment is addressing. The same WMD decision outcomes being addressed in the risk assessment must be part of the adaptive adversary model. But the context variables associated with these decision outcomes from the perspective of red groups must also be part of the adaptive adversary model. In addition, some way of addressing multiple categories of red groups or individuals must be included, along with the resources and skills of each group. The influence diagrams we show in Figs. 3 and 4 have samples of these details. The computational size of some of these WMD risk assessments is large enough to create some unique problems. For example, the Bioterrorism Risk Assessments conducted by the U.S. Department of Homeland Security have about 50,000 variations of agents, target categories, acquisition/production options, and dissemination options. Naturally, some combinations of these are not very consequential. The U.S. employs large, science-based simulations to estimate the consequences across all 50,000 combinations of options and more [8]. A terrorist organization may or may not have the computational resources or orientation to take a similar approach.
But more fundamentally, the 50,000-variation nature of the problem indicates the degree of complexity that could be involved. A terrorist organization could bring any of a number of analytic or less analytic approaches to bear on that complex a problem, in ways that are hard to anticipate. This brings us back to the important topic of perceptions. There are many sources for estimating the perceptions of terrorist organizations: intelligence reports; reports of interviews with terrorism-related detainees; reports of experts on terrorism psychology and decision making; media reports about previous WMD attacks, both successful and unsuccessful; media reports on the statements of ‘‘experts’’ or government spokespeople about WMD attack variables; and reports by technology media and papers in journals related to these WMD variables. The final topic is combining the many forms of uncertainty into a single probability distribution across the red decision outcomes, i.e., across red’s alternatives at a red decision node. We have demonstrated that influence diagrams can be created to represent the decision space, associated uncertainties, and objectives for each red group. Similarly, we have just described information sources that could provide rough probabilistic representations of the red perceptions of the possible consequences for blue of red attack options. We have even illustrated a modeling approach for thinking about how red might view blue’s decisions for fielding WMD countermeasures. Finally, we have described how uncertain the red decision-making process is, leading to formulating multiple models of the red decision-making process. We will describe how these models of decision-making processes can be implemented in the next section. Now we must address how probabilities across decision outcomes can be computed for any model of decision making. The two common approaches are Monte Carlo simulation and employing one or a combination of probabilistic choice models. 
The Monte Carlo approach randomly samples from the probability distributions over the arguments for a given decision


model, then calculates the decision outcome that would be chosen by that decision model in that Monte Carlo case, repeating that process for some large number of cases. The number of times a decision outcome is chosen, across Monte Carlo cases, is normalized to become the probability that decision outcome will be chosen, given that one of the decision outcomes is chosen. This approach has the advantage that if a specific decision outcome is always inferior to some other decision outcome, its probability will be zero. The second approach is to compute an expected value to red for each of the decision outcomes across all of the uncertainties, then use one or a combination of two widely recognized models for estimating choice probabilities among decision outcomes based on the set of expected values for all decision outcomes: the Luce model [25] and the Random Utility Model [26]. We can compactly present the choice probabilities for those two models in one equation:

p(D_i) = exp[a v(D_i) + b ln v(D_i)] / Σ_{j=1}^{n} exp[a v(D_j) + b ln v(D_j)]

where D_i represents the ith decision alternative available to red; p(D_i) is the probability the ith decision alternative will be selected by red, given that red picks one of the alternatives from the set D_j, j = 1 through n; v(D_i) represents red's value for the ith decision alternative; and a and b are constants that sum to 1. When a = 0 and b = 1, the equation is the Luce model. When a = 1 and b = 0, the equation is the random utility model. This combined-model equation will generate choice probabilities for methods that compute an expected value for each of the considered decision outcomes, such as MODA and Prospect Theory. (Note: Section 4 will address how the results of each of the several models can be aggregated into a final answer.)
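Both computational approaches can be sketched in a few lines. The decision rule inside the Monte Carlo loop (pick the highest-value alternative) and all numeric inputs are our illustrative assumptions; only the combined Luce/random-utility equation itself comes from the text above.

```python
import math
import random

def choice_probs(values, a, b):
    """Choice probabilities from the combined equation above.
    a=0, b=1 gives the Luce model; a=1, b=0 gives the random utility model.
    Values must be positive when b != 0, since the Luce term takes ln v."""
    weights = [math.exp(a * v + b * math.log(v)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def monte_carlo_probs(sample_case, n_cases=20000, seed=1):
    """The first approach: sample the uncertain inputs, record which
    alternative the decision model picks, and normalize the counts.
    `sample_case` is any function returning {alternative: value} per draw."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_cases):
        values = sample_case(rng)
        chosen = max(values, key=values.get)   # decision rule: pick best value
        counts[chosen] = counts.get(chosen, 0) + 1
    return {alt: c / n_cases for alt, c in counts.items()}
```

With a = 0, b = 1 the exponential and logarithm cancel, reproducing Luce's proportional rule; with a = 1, b = 0 the expression reduces to a logit over the expected values. The Monte Carlo version, by contrast, assigns zero probability to any alternative that loses in every sampled case.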

3. A plural modeling approach for predicting decision outcomes

Forecasters have shown repeatedly that aggregating across multiple models outperforms a good single model. Considerable literature has accumulated over the years regarding the combination of forecasts. The primary conclusion of this line of research is that forecast accuracy can be substantially improved through the combination of multiple individual forecasts. Furthermore, simple combination methods often work reasonably well relative to more complex combinations [14]. See also [27–30]. While most of these forecasters are working in simpler domains than predicting human decision making, these results should generalize to any forecasting domain. Guikema [31,32] shows that aggregate forecasts outperform individual forecasting models in forecasting the impacts of natural disasters on critical infrastructure in the context of homeland security risk analysis. In general, averaging across many models is likely to produce results close to a uniform probability distribution if the models are widely divergent. Answers that are close to a uniform distribution may not be viewed as helpful, but in many cases may be the correct answer. This is especially true if numerous models were used and their results varied widely. The general principle involved can be simply put: any one model makes a set of assumptions that deviates from the actual world processes being predicted. If different models deviate from the actual world in different ways, then it should not be


surprising that aggregating those models results in better predictions than any one of those models operating alone. There is nothing strictly inevitable about that reasoning, but the empirical findings cited above show that it widely holds true. That said, two concerns about plural modeling call for comment. First, plural modeling muddles the relationships between assumptions and results, which can be deduced by careful study of an individual model. Yet there is nothing in plural modeling that precludes the modeler from deducing those relationships in each individual model, then analyzing how those many relationships map into the aggregated behavior of the plural model. A second concern involves framing. Any model has a particular scope and set of assumptions. That scope and those assumptions have an effect on the descriptive performance of the model, which we will refer to here as a framing effect. In the case of the component models considered here, those assumptions include the range of alternatives considered, the value attributes of those alternatives, and the choice model employed. Plural modeling in general presents an opportunity to reduce framing effects by combining models with different framings. However, the plural modeling presented here has limitations on reducing those framing effects: all the component models aggregated here share the same range of alternatives and range of attributes. So, while the plural modeling presented here has the opportunity to reduce the framing effect of which particular choice model is used, by aggregating over several choice models, it does not reduce the framing effects of the ranges of alternatives and attributes considered. Certainly time has proven that predicting human decision making, especially that of adversaries (e.g., terrorists), is quite difficult and fraught with peril.
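The "simple combination" finding cited above can be shown in miniature: equally weighted averaging of several models' probability distributions over the same set of red decision outcomes. The two model outputs below are invented purely for illustration.

```python
# Equal-weight combination of model forecasts; all numbers are invented.

def average_forecasts(forecasts):
    """Average a list of probability distributions (dicts over the same
    outcomes, each summing to 1); the result also sums to 1."""
    outcomes = forecasts[0].keys()
    n = len(forecasts)
    return {o: sum(f[o] for f in forecasts) / n for o in outcomes}

moda_model = {"aerosol": 0.70, "food supply": 0.20, "none": 0.10}
satisficing_model = {"aerosol": 0.20, "food supply": 0.50, "none": 0.30}
combined = average_forecasts([moda_model, satisficing_model])
```

When the component models disagree sharply, as here, the combined distribution is pulled toward uniformity, which is exactly the behavior (and the caveat) discussed above.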
The research literature on building descriptive models of human behavior demonstrates that no single approach does very well [33]. For DARPA's ongoing Integrated Crisis Early Warning System (ICEWS) program, Innovative Decisions, Incorporated (IDI), the consulting group with which the authors are affiliated, constructed a probabilistic aggregation model to combine the estimates made by four statistical and two agent-based models. The combined estimates outperformed those of any single model. To accomplish this, IDI characterized the estimates of each model, effectively recalibrating them, and developed a Bayesian network, through a data mining process, to compute the combined estimates [34]. Section 3.1 describes ensemble approaches for aggregating the results of multiple models. Section 3.2 describes the descriptive models that we believe should be employed in any plural modeling approach to predicting the decision outcomes of red. Finally, Section 3.3 describes an approach to aggregating across this collection of descriptive models given that no ground truth data are available.

3.1. Ensemble modeling approaches

As previously discussed, plural modeling [35] recommends analyzing a modeling problem by applying several analyses in parallel and aggregating their results. The primary advantage of such an approach is that the results are more accurate than those of a single complex model [36]. An illustrative analog is found in the machine learning literature; in particular, ensemble learning aggregates the results of multiple learners [37]. A learner is simply a model, learned from data, that makes a prediction such as the classification of an observed entity. Ensemble learning has two thrusts: (1) generating a set of base learners, each of which makes its own prediction, and (2) aggregating the output of the base learners, sometimes with another model that is learned from data that include the predictions made by the base learners. This approach works given the following assumptions: (a) the base learners are independent and (b) each base learner is better than chance. The challenge in generating ensembles is to create a large set of base learners. To efficiently generate large sets of learners, researchers may manipulate:

a) Classifier/model type: one approach is to use information about a model's informational and computational requirements to select a subset of available models.
b) Versions of a model type: one may modify a basic model in a systematic way, for example by randomly limiting oneself to a subset of the random variables when learning a model or by modifying the structure of the relationships among the variables.
c) Training sets: a training set is a subset of the data used to generate base learners. By randomly sampling from the available training data, usually with replacement, a different base learner may be generated for each sample.
d) Parameter sets: model parameters, where relevant, may also be manipulated through sampling to produce as many models as samples.
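Manipulation (c) above, training-set resampling, can be sketched as bootstrap sampling plus a simple majority-vote aggregator. The function names and the toy data are our illustrative assumptions.

```python
import random

def bootstrap_training_sets(data, n_learners, seed=0):
    """One bootstrap sample (same size as `data`, drawn with replacement)
    per base learner, as in manipulation (c)."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in data] for _ in range(n_learners)]

def majority_vote(predictions):
    """Simplest aggregation of the base learners' predictions."""
    return max(set(predictions), key=predictions.count)
```

Each bootstrap sample would train one base learner; the ensemble's prediction is then the majority vote (or, for probabilistic outputs, an average) over the base learners.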
We consider the adaptive adversary models to be equivalent to base learners. To generate an ensemble, we propose that a varied set of models be selected from a database. Then we propose to sample over specified model parameters so that any one model produces multiple outcomes that are then combined into a distribution. Next, we combine the probability distributions from the different models. Table 1 summarizes selected aggregation methods used to aggregate probability forecasts by experts or models. Aggregation requirements denote what known data is required while the aggregation considerations denote what issues about forecasts the method explicitly considers. As shown in the table, most probability aggregation tools use past performance to guide aggregation. Many also

Table 1
Aggregation methods, compared on their aggregation requirements (priors; ground-truth history) and on the considerations they explicitly address (calibration; coherence; precision; dependence among experts):

- Majority vote
- Average
- Weighted average
- Variance algorithm [39]
- Coherent approximation principle [41]
- Generative Bayesian [42]
- IDI ICEWS aggregator [34]
- Multi-response linear regression (MLR) [43]

D.M. Buede et al. / Reliability Engineering and System Safety 108 (2012) 77–89

try to model characteristics of the forecasters. Some approaches, such as the average, are simple. Others, such as the ICEWS aggregator [34], actually learn another model to perform the aggregation; in machine learning this is called stacking [38]. In comparing aggregation methods, one needs to consider a number of factors. These include requirements such as prior values for probability distributions and a ground-truth history. Calibration is the capability of a model to make accurate probability statements: for example, predictions assigned a 0.75 probability should occur 3 out of 4 times. Coherence requires that a set of predictions from a single source be logically consistent. Precision speaks to the accuracy of the predictions. Dependence among experts considers the degree to which the experts' predictions are correlated. Dani et al. [39] compare the performance of a number of algorithms on a database of binary predictions; in all cases, performance is based upon past history. For example, Cesa-Bianchi et al. [40] found (for their particular data set) that variations of an aggregation algorithm that assigns weights to individual experts perform no better than a simple average, while an algorithm that weights experts based on their variance does better. The variance algorithm makes the following assumptions about the experts' predictions: they are (1) Gaussian, centered on the true probability, (2) unchanging in quality over time, and (3) independent. Dani et al. [39] surmise that the average works well because experts are not well calibrated, tending to be over-confident in their predictions. They also propose that a weighted linear combination across a set of other aggregation algorithms may perform well. Kahn's generative Bayesian model [42] considers calibration and dependence among experts, as well as accuracy, when fusing the forecasts of experts and models. Its calibration assumes a functional form based upon a single parameter, and it performs well on simulated datasets. Predd et al.
[41] add coherence to the mix; coherence depends upon eliciting multiple related probabilities from an expert. Statistical and agent-based models produce forecasts for the ICEWS program. Following traditional Bayesian formulations [28] and using historical performance data, the ICEWS aggregation algorithm [34] learns the likelihood ratios associated with a discretization of a model's forecasts; the naïve Bayes formulation assumes independence of the forecasters. The multi-response linear regression (MLR) method develops a linear regression for each situation of interest [43]. Ranjan and Gneiting [44] prove that linear combinations of calibrated forecasts are themselves not calibrated, and they propose a transform to adjust the weights of different forecasters so that the aggregated distribution is calibrated. We believe that improving the calibration of aggregated forecasts is a major performance enhancement.

3.2. Descriptive models for predicting human behavior

Our plural modeling approach builds around the concept of the adversary having multiple conflicting objectives. As such, we start with Multiple Objective Decision Analysis (MODA), even though it is better known for its normative power than its descriptive power. Other approaches, such as satisficing, lexicographic analysis, and prospect theory, all use a similar structure with different mathematics. Finally, we have adapted level-k game theory to this structure as well.

MODA is an approach to balancing trade-offs among competing objectives that is consistent with the rationality axioms of decision analysis [20]. There are additional axioms associated with MODA that justify the additive equation used in most MODA applications [20], as shown in the following equation:

v(x) = \sum_{i=1}^{n} w_i v_i(x_i)

where v(x) is the overall value associated with an alternative, based on an analysis that addresses n measures capturing the value of the alternatives on n or fewer objectives; x_i is the value of the ith measure for the alternative in question; v_i is the value function for the ith measure, with defined minimum and maximum values of x_i denoted x_i^0 and x_i^*. This value function can reflect "more is better", "more is worse", etc. w_i is the relative weight for the ith measure; this weight is properly called a swing weight because it reflects the importance of the ith measure relative to the other measures, based on the "swing" in value from the minimum to the maximum values x_i^0 and x_i^*. The weights are commonly normalized to sum to 1.0, and the value functions are normalized to range from 0 to 1, 0 to 10, or 0 to 100.

The objectives of the MODA model for each particular terrorist group will have to be defined during elicitation sessions with experts. A first cut at high-level objectives for the terrorist groups was shown in Fig. 3: change blue policy, recruit more members to further the goals of the group, adhere to religious policies, and cost. These are consistent with the fundamental objectives that Keeney and von Winterfeldt [17] suggested recently.

Satisficing was proposed by Simon [21] as a descriptive theory of human decision making because it embodies a form of bounded rationality. Satisficing starts by having the decision maker think of an alternative and evaluate it on a set of objectives and measures. Red is thus assumed to be rational with respect to the value model, but to have bounded rationality with respect to identifying a complete set of alternatives from which to choose. In our approach a random alternative would be selected from the universal set of alternatives.
If the alternative passes some threshold of overall value, the decision maker selects that alternative and acts. If it falls short of the threshold, the decision maker finds another alternative and repeats the process. That is, red thinks of one alternative at a time and determines (using a value model) whether it is good enough; if not, red thinks of another alternative, continuing until one is judged good enough. We believe this approach has real-world merit, since a typical terrorist group would not have the ability to evaluate a complete set of biological agents (or chemical compounds, radiological devices, etc.) in one sitting, but would be presented with, or think of, potential weapons somewhat randomly, and would have to decide whether to act with that weapon or wait for a better opportunity. In the satisficing case, blue will have uncertainty about the value model, the order in which the WMD alternatives are presented, and the threshold adopted by red, so the Monte Carlo simulation would randomly sample over all of those uncertainties.

Lexicographic reasoning is another descriptive theory of human decision making, based on a form of bounded rationality called non-compensatory heuristics [22]. Here the decision maker is assumed to be capable of developing a list of all possible alternatives, but the value model is much simplified. The decision maker is assumed to have a rank order of the most important measures or objectives; no value functions or weights are needed. The decision maker rank orders the alternatives on the basis of how well they do on the most important measure (or objective). If a single alternative is at the top of the list, that alternative is the winner. If two or more alternatives are tied for best, the tied alternatives are ranked on the second most important measure or objective.
Again, if only one alternative is at the top of this second ranking, the winning alternative is selected; if there is a tie, the process is repeated on the third most important measure (or objective). This approach is called non-compensatory because the second-ranked alternative may outperform the



top-ranked alternative on every other objective, but this does not compensate for even a small difference on the most important objective. For the lexicographic reasoning case, blue will have uncertainty about the rank order of the measures (and objectives), and the Monte Carlo simulation will sample over that rank order.

Prospect theory was developed by Kahneman and Tversky [23] to retain part of decision theory while being more descriptive of actual human decision making. There are now two forms of prospect theory: original [23] and cumulative [45]. Original prospect theory takes the following form for the value function:

v(x) = \begin{cases} x^{\alpha}, & x > 0 \\ -\lambda(-x)^{\alpha}, & x < 0 \end{cases}

where x is measured relative to the status quo. Common values for \alpha and \lambda are 0.88 and 2.25, respectively. This curve demonstrates satiation of value for both negative and positive values of x as x gets further from zero, as well as a stronger influence of negative values of x compared to positive values. Our implementation of original prospect theory substitutes the individual MODA value functions for red (the v_i from the MODA equation) into the equation above. That is, x in the equation above would represent each individual value function for the red MODA model, calibrated such that v_i(status quo) = 0. Since the decisions being addressed have to do with taking new actions, the status quo values are the current levels if no new action is taken. Cumulative prospect theory creates a function to modify the subjective probabilities of the decision maker:

\pi(p) = \frac{\delta p^{\gamma}}{\delta p^{\gamma} + (1-p)^{\gamma}}
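As a sketch, both prospect-theory pieces, the value function and the probability-weighting function, are small enough to write down directly, here with the commonly cited parameter values (\alpha = 0.88, \lambda = 2.25, and \gamma = 0.61 for gains):

```python
def pt_value(x, alpha=0.88, lam=2.25):
    """Original prospect-theory value function, x measured from the status quo."""
    return x ** alpha if x >= 0 else -lam * ((-x) ** alpha)

def pt_weight(p, gamma=0.61, delta=1.0):
    """Cumulative prospect-theory probability weighting pi(p)."""
    num = delta * p ** gamma
    return num / (num + (1 - p) ** gamma)

# Losses loom larger than gains: v(1) = 1.0 but v(-1) = -2.25.
print(pt_value(1.0), pt_value(-1.0))
# Small probabilities are over-weighted, large ones under-weighted (gains):
print(round(pt_weight(0.05), 3), round(pt_weight(0.95), 3))  # prints: 0.142 0.858
```

The second line of output illustrates the "underextremity" pattern discussed in the text: a 0.05 chance is treated like roughly 0.14, while a 0.95 chance is treated like roughly 0.86.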

Tversky and Kahneman [45] assumed \delta = 1 and found \gamma = 0.61 for gains and 0.69 for losses. Wu, Zhang and Gonzalez [46] found \delta and \gamma to be 0.79 and 0.60 for gains, respectively, and 0.88 and 0.67 for losses. These s-shaped curves are shown in Fig. 5; the curvature is much more pronounced for losses. Both curves intersect the 45-degree line where \pi(p) = p, at approximately 0.39. In the calibration literature these curves are termed "underextremity": the probabilities assigned on the y-axis are not extreme enough. Level-k game theory was developed by Stahl and Wilson [47,48] and Nagel [49] and has been found to successfully account

Fig. 5. Comparison of gains and losses in prospect theory.

for behavior in a wide range of experimental settings [24,49–52]. It provides a tractable algorithmic alternative to traditional game-theoretic solution concepts, but relaxes the hyper-rationality assumptions required to justify those concepts. As in traditional game-theoretic models, players form beliefs about how their opponent(s) are likely to play. In traditional equilibrium-based approaches to solving game-theoretic models, these beliefs are found by imposing a "mutual consistency" assumption: each player's belief about her opponents' actions should coincide with the actual actions chosen by her opponent(s). In level-k game theory, these beliefs are instead formed on the basis of an inductive, hierarchical model of the "strategic sophistication" of players. Specifically, each player in a level-k game has a level of strategic sophistication. A level-0 player is non-strategic and is assumed to act randomly. A level-1 player reasons strategically but employs a simplistic model of how his opponent(s) will play: he assumes they are level-0 players and treats their actions as effectively random. A level-2 player is one step more sophisticated: she assumes her opponent(s) are level-1 players, and forms her beliefs about how they will act accordingly. Similarly, a level-3 player forms beliefs about her opponent(s) by assuming they are level-2 players (and by computing what such a player would do), a level-4 player assumes she faces a level-3 player, and so forth. This algorithmic solution approach is easily adaptable to, and tractable in, complex settings, including cases where each player is uncertain about the goals and worldviews of her opponent(s) [53,54]. In summary, we are proposing six different descriptive modeling techniques; some will be analyzed simply by Monte Carlo simulation to generate probabilities, while others will be analyzed via both Monte Carlo simulation and the expanded Luce method to generate two sets of probabilities.
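A toy sketch may show how several of these decision rules can disagree on the same alternatives. All alternative names, scores, weights, and the satisficing threshold below are invented for illustration; they are not elicited values from the framework:

```python
import random

random.seed(1)

# Hypothetical red alternatives scored on three measures, each value in [0, 1],
# listed in descending order of importance (e.g., policy impact, recruiting, cost-as-value).
alts = {
    "attack_A": (0.9, 0.3, 0.2),
    "attack_B": (0.6, 0.7, 0.6),
    "wait":     (0.1, 0.2, 1.0),
}
weights = (0.5, 0.3, 0.2)  # swing weights, normalized to sum to 1.0

def moda(a):
    """Additive MODA value: weighted sum of single-measure values."""
    return sum(w * v for w, v in zip(weights, alts[a]))

def moda_choice():
    return max(alts, key=moda)

def satisficing_choice(threshold=0.55):
    """Evaluate alternatives one at a time, in random order, until one is good enough."""
    for a in random.sample(list(alts), len(alts)):
        if moda(a) >= threshold:
            return a
    return "wait"  # nothing cleared the bar: keep the status quo

def lexicographic_choice():
    """Rank on the most important measure; break ties on the next, and so on."""
    return max(alts, key=lambda a: alts[a])  # tuple comparison is lexicographic

print(moda_choice(), lexicographic_choice())  # prints: attack_B attack_A
```

Note the disagreement: MODA's weighted sum favors the balanced attack_B, while lexicographic reasoning picks attack_A on the strength of the single most important measure, and satisficing returns whichever adequate alternative happens to be considered first. It is exactly this divergence across rules that the plural approach samples over.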
3.3. An aggregation approach for plural modeling

The adversary models we envision produce probability distributions over adversary choices. As illustrated by Table 1, most approaches for combining distributions across multiple models require historical ground-truth data. How does one proceed without a history grounded in experience? Guikema and Aven [13] suggest an integrative approach that triages risk into three classes: tolerable, unacceptable, and subject to further study. Their approach applies risk assessments designed to rank discrete risks. Tolerable risks require no further study; mitigation or prevention of unacceptable risks is of the highest priority. For the third class, they recommend assessing the risk with four different approaches: game-theoretic, probabilistic risk assessment (PRA), semi-quantitative analysis, and protecting high-value targets. Each assessment produces a ranked list of the risks being studied. Risks on which all four assessment approaches agree would be ranked accordingly; risks on which they disagree would be subject to further analysis to determine why the assessments differed. The advantage of such an approach is the insight provided by considering multiple analyses, each with its own strengths and weaknesses.

Using ranked lists, in lieu of probability distributions over events of interest, gets around the problem of calibration when comparing probabilities generated by two models: two models' lists could be very similar in their order, yet have distributions with very different variances. When the ranked lists for all models agree, an average of their probabilities is an appropriate combination method given a lack of ground truth. If the ranked lists disagree, an average may still be presented, but the user should be warned about the disagreement among models; this could encourage further analysis. A statistic such as Kendall's tau could be presented to the user, in conjunction with a distribution, to inform the user of the level of agreement among the lists.

We do have some ground truth. IEDs, for example, are planted by terrorist organizations every day; as a second example, suicide attacks occur on a regular basis. Our suite of adversary models provides a variety of decision-making behaviors that depend upon group characteristics. The challenge is to determine which behavior(s) are most aligned with a specified terrorist group type. One could use attacks for which we have data to evaluate the different decision-making behaviors represented by the models, on the assumption that the decision-making behavior used in planning and carrying out lesser attacks is consistent with that for WMD attacks. By creating a set of adversary models for attacks for which data are available, we would be able to generate some history for each adversary model type; that history could in turn be used to generate an aggregation model.
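The rank-agreement check described above can be sketched in a few lines: average the models' distributions, but compute Kendall's tau between their ranked lists and flag disagreement. The pure-Python tau below keeps the sketch self-contained (in practice `scipy.stats.kendalltau` would serve); the attack options and distributions are hypothetical:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings, each given as item -> rank position."""
    concordant = discordant = 0
    for x, y in combinations(list(rank_a), 2):
        s = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

def ranks(dist):
    """Rank options from most to least probable."""
    ordered = sorted(dist, key=dist.get, reverse=True)
    return {a: i for i, a in enumerate(ordered)}

def aggregate(model_dists, warn_below=0.5):
    """Average the models' distributions; flag disagreement among ranked lists."""
    taus = [kendall_tau(ranks(d1), ranks(d2))
            for d1, d2 in combinations(model_dists, 2)]
    avg = {a: sum(d[a] for d in model_dists) / len(model_dists)
           for a in model_dists[0]}
    return avg, min(taus), min(taus) < warn_below

# Hypothetical distributions over three attack options from two adversary models:
m1 = {"bio": 0.6, "chem": 0.3, "rad": 0.1}
m2 = {"bio": 0.5, "chem": 0.4, "rad": 0.1}
avg, tau, warn = aggregate([m1, m2])
print(tau, warn)  # the ranked lists agree, so tau = 1.0 and no warning
```

The `warn_below` cutoff is an arbitrary placeholder; in the proposed framework the tau value itself would be reported to the user alongside the averaged distribution.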

4. Framework for computing and aggregating probabilities across models

As can be seen from the discussion in Section 3, this plural modeling approach relies upon Monte Carlo simulation of the decision-making models, given the uncertainties in the value trade-offs across objectives, the samples of alternatives being considered, and other probabilistic inputs related to red's perceptions and resources. This approach therefore needs a computational engine. In addition, there may be multiple users with varying needs and threat risk assessments that drive the specific manner in which these plural models will be used, dictating a flexible user interface. There may also be widely varying input data sets, depending upon the question being asked and the threat assessment being done. There will often be a need to save the results of an analysis so that it can be rerun at a later date with some changes, or used to create new variations of runs for a future analysis. Finally, some users may want just the answer, a probability distribution over the threat decisions, while other users may want sophisticated sensitivity analysis charts that describe which variables were most critical in driving the answers and how varied these answers might be given different settings of the input data. Fig. 6 shows a schematic of the computational framework for this plural modeling approach, with queries from the users (long dashes) entering from the top, outputs exiting to the top right and right (long dash and two dots, repeated), and input data and models entering from


the left (long dash and single dot, repeated). This is not an influence diagram but a diagram showing the flow of inputs (data and models) and outputs (data results and reports). The next three figures show the framework details and illustrate two use-case examples of how users might interact with this framework.

At the top of Fig. 7, we see that the framework includes two user interface modules: the problem definition query processor and the visualization query processor. The arrows indicate data flows (dotted lines) and flows of control (dashed lines); double-headed arrows indicate that data or control flows in both directions (e.g., query and response). The problem definition query processor enables a user to compose a query that defines what input data is required, what models are to be used, and what outputs are desired, while resolving conflicts that arise, such as when the selected models require more data than is available; this processor therefore requires some controller capabilities. The visualization query processor enables interactions with the user once the computations are complete and various visualization formats are being explored; here some controller and output-processing capabilities are needed. The supervisor module integrates the problem definition with the Monte Carlo simulations that must be performed. The Monte Carlo simulation module calls the models that are required and supervises the model computations needed to complete the problem definition. Each of the models is enabled to call the data it needs from the database. Finally, the aggregation and results module sorts through the model results to create the outputs desired by the users in the formats of their choice. Different shapes are used for the functional blocks in the computational framework to make the functional differences clear to the reader.

Fig. 8 presents the first of two use cases, which show just the flow of data between the modules of the computational framework; here all of the arrows are solid since they represent one activity, data flow. This use case addresses the definition of a query and the computation of the models needed to complete it. It illustrates a clear interaction of the user with the problem definition query processor, which interacts with the supervisor, which sets up the query and enables the Monte Carlo simulation module to carry out the computations. The results of these computations are then stored in the database. The second use case addresses pulling these results out of the database for visualization to the user. Fig. 9 presents the use case for the visualization query. Here the user interacts with the visualization query processor to define the format of the subset of results of interest. The visualization query processor interacts with the aggregation and results module so that the latter can format the visualizations requested and

Fig. 6. External interactions with plural modeling framework.



Fig. 7. Computational framework for plural modeling.

Fig. 8. Use case for a problem definition query.

cause additional computations to be performed via the previous use case, if needed. In particular, most users are going to want to know which models, and which assumptions associated with those models, are driving the results. The aggregation and results module will contain a range of sensitivity and what-if analysis tools to help the user discover these associations between models, assumptions, and results.

In summary, any modeling approach for predicting the actions of adaptive adversaries must serve as an input to a broader risk analysis engine that estimates probability distributions over the consequences to blue of red's actions and reactions to blue's mitigation actions. The point of this analysis is to aid blue in examining alternative sets of countermeasures and adopting a cost-effective countermeasure approach. The plural modeling approach presented in the previous section requires substantial Monte Carlo

simulation to generate the probabilities that specific red groups or individuals would undertake any of the many different possible attacks. The possible attacks number in the millions, and the context settings (e.g., red group and possible resources, perception states) under which a specific attack might be chosen number at least in the thousands. The magnitude of the combinations and permutations of the problem (e.g., red group, decision-making priorities and process, perception states) calls for analyses that fully account for that complexity and the associated uncertainty.
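The Monte Carlo layer described in this section can be sketched as a driver that samples a context setting (group characteristics, weights, resources), runs a decision model in that context, and accumulates the chosen actions into a probability distribution. The context fields and the decision-model stub below are illustrative placeholders, not the framework's actual models:

```python
import random
from collections import Counter

random.seed(2)
ATTACKS = ["bio", "chem", "rad", "wait"]

def sample_context():
    """Draw one setting of the uncertain inputs (illustrative placeholders)."""
    return {
        "weights": [random.random() for _ in range(3)],  # uncertain value trade-offs
        "resources": random.choice(["low", "high"]),     # uncertain red resources
    }

def decision_model(ctx):
    """Stub standing in for one descriptive red decision model (e.g., MODA)."""
    if ctx["resources"] == "low":
        return "wait"
    return max(ATTACKS[:3], key=lambda a: ctx["weights"][ATTACKS.index(a)])

def simulate(model, n=10_000):
    """Accumulate model choices over sampled contexts into a distribution."""
    counts = Counter(model(sample_context()) for _ in range(n))
    return {a: counts[a] / n for a in ATTACKS}

dist = simulate(decision_model)
print(dist)  # roughly half "wait"; the remainder split evenly among the attacks
```

The same driver would be run once per (model, group) pairing, with the resulting per-model distributions handed to the aggregation step of Section 3.3.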

5. Evaluation and validation

Sound analysis requires that the issues of evaluating the results of modeling, and of validating those results, be taken



Fig. 9. Use case for a visualization query.

seriously. Having a computational platform capable of performing what-if and sensitivity analyses is a critical part of evaluating the model results. It is here that we develop statements about which model parameters have a large (or negligible) impact on the answer and under what conditions those findings hold. Similarly, with what-if analysis one can describe the circumstances (data inputs) that would yield certain types of outputs. The computational framework described above for the plural modeling approach contains these capabilities.

Validation of the models is more complicated. There is no substantive record of ground truth for WMD attacks by terrorist groups, since they have been rare events, which is a good thing. Except for the simplest models, it is incorrect to declare complex models such as those described in the influence diagrams above valid simply by inspection. More appropriately, our validation plan produces a qualified description of validity: one that users can interpret as accurate enough to be useful within the parameters and context (bounds of validity) set forth in the design.

Our validation plan comprises three steps. The first step is cause-effect graphing, which compares cause and effect in the model with cause and effect in the real world; the influence diagrams shown in Figs. 2 through 4 are examples of such cause-effect graphing. The second step is predictive validation, which compares model outcomes to corresponding outcomes in the real world. Of course, the real world here is the quintessential issue, since these models predict events that have not necessarily happened before. Since there is no data source for ground truth on red WMD attacks, our goal for predictive validation of models of red decisions is a finding of model convergence.
Model convergence is achieved when multiple sources are analyzed and data triangulation results in convergence for a given set of inputs. We propose the following process:

1. Establish test cases for one or two dozen input sets. Specific input sets would be created to make various types of attacks as likely as possible; other input sets would be devised to make various types of attacks as unlikely as possible.
2. Execute the plural model computations for those input sets.

3. Compare the output of the plural modeling approach to results from the following venues, as a surrogate for ground truth:
a. Red team assessments based on tabletop exercises using the same scenarios and inputs, with multiple experts from government, academic, and think-tank organizations.
b. Literature-based evidence associated with the writings and statements of specific terrorist groups or individuals.
c. Other analyses, such as previous threat risk assessments.

As part of this triangulation process, the data generated from the test cases would be analyzed for themes and patterns, allowing the analysts to triangulate as described above and build a library of what-if and sensitivity results that could form the basis for future modeling activities.

Step three is an informal process known as face validity. It is often not possible to validate a model by saying it has sufficient face validity, but it is possible to discredit a model by saying it lacks face validity. Given our plural modeling approach, perhaps the most relevant face-validity question is whether there are other models with face validity equivalent to those being used in our framework. If so, and if they can be integrated easily into our framework and harnessed as part of the plural modeling results, then they should be.
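Steps 1 and 2 of the predictive-validation process lend themselves to a simple automated harness: engineer input sets intended to favor a given attack type, run the plural model, and check that the favored attack indeed ranks first. Everything below is a hypothetical stub; `plural_model` stands in for the full plural-model computation:

```python
def plural_model(inputs):
    """Stub for the full plural-model run; a real version would execute the
    model suite and aggregate. Here engineered 'bias' weights are simply
    normalized into a distribution over attack types."""
    total = sum(inputs["bias"].values())
    return {a: w / total for a, w in inputs["bias"].items()}

def convergence_check(test_cases):
    """Each test case pairs engineered inputs with the attack they should favor."""
    results = []
    for inputs, expected_top in test_cases:
        dist = plural_model(inputs)
        top = max(dist, key=dist.get)
        results.append((expected_top, top, top == expected_top))
    return results

# Input sets engineered to make one attack type as likely as possible:
cases = [
    ({"bias": {"bio": 5.0, "chem": 1.0, "rad": 1.0}}, "bio"),
    ({"bias": {"bio": 1.0, "chem": 1.0, "rad": 4.0}}, "rad"),
]
for expected, got, ok in convergence_check(cases):
    print(expected, got, ok)
```

In the actual plan, the `top == expected_top` comparison would be replaced by triangulation against red-team exercises, literature evidence, and prior threat risk assessments, as listed in step 3.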

6. Discussion and summary

Addressing risk mitigation actions to counter terrorist WMD strikes has become a time-consuming, funding-intensive activity for governments around the world. There are many risk analysis frameworks for evaluating such risk mitigation activities, but they all require modeling adaptive adversaries, and particularly the elusive nature of human decision making. Unfortunately, the scope of this problem is large, so the analytics must handle significant computational issues as well as significant predictive modeling issues. Underlying both of these issue sets is a vast degree of uncertainty regarding who the reds are, what reds want and know (including perceptions), what resources and skills reds possess, how reds decide, and what would make any one red act if a WMD weapon were available. Given all of that uncertainty, any defensible analytical method must generate a probabilistic prediction (or other characterization of uncertainty) of potential



red actions that fully accounts for, and communicates, all of those many uncertainties.

The approach described in this paper handles all of these issues. A plural modeling approach is used to compensate for our uncertainty about red decision making. Influence diagrams representing the uncertainties about who red is and what red knows (perceptions) are used to structure the decision problem for each of the descriptive decision models employed: MODA, satisficing, lexicographic reasoning, prospect theory (two versions), and level-k game theory. Monte Carlo simulation, as well as Luce normalization and randomized utility methods, is used to produce probability distributions over the red decision outcomes. A computational framework is then proposed for building these analytic methods into any terrorist risk assessment of red WMD actions. Finally, any modeling effort must address how its results can be validated; this is especially difficult in predictive modeling problem areas for which ground-truth data is scarce. This paper describes a three-step validation process involving cause-effect graphing; the triangulation of a range of data, judgments, and previous modeling results; and face-validity assessments that the approach employs reasonable methods and is not ignoring other reasonable methods. In summary, this paper defines the benchmarks against which any method for modeling adaptive adversary decision behaviors should be judged.

Acknowledgements The authors are most grateful to their colleagues for many suggestions and productive collaboration in this effort: Seth Guikema, Laura McClay, Casey Rothschild, Jerrold Post. The Department of Homeland Security Science and Technology Directorate funded this work under Contract No. HSHQDC-10-C-00105. The authors wish to thank the reviewer for his many insightful comments.

References

[1] U.S. Nuclear Regulatory Commission (USNRC). Reactor safety study: assessment of accident risk in U.S. commercial nuclear plants. WASH-1400 (NUREG-75/014). Washington, DC: U.S. Nuclear Regulatory Commission; 1975.
[2] Vesely WE. Fault tree handbook. Washington, DC: Office of Nuclear Regulatory Research; 1981.
[3] Garrick BJ. Perspectives on the use of risk assessment to address terrorism. Risk Analysis 2002;22(3):421–3.
[4] Ezell B, Bennett S, von Winterfeldt D, Sokolowski J, Collins A. Probabilistic risk analysis and terrorism risk. Risk Analysis 2010;30(4):575–89.
[5] Committee on Methodological Improvements to the Department of Homeland Security's Biological Agent Risk Analysis, National Research Council of the National Academies. Department of Homeland Security's bioterrorism risk assessment: a call for change. Washington, DC: The National Academies Press; 2008.
[6] Cox LA Jr. Improving risk-based decision making for terrorism applications. Risk Analysis 2009;29(3):336–41.
[7] Wein L. Homeland security: from mathematical models to policy implementation. Operations Research 2009;57:801–11.
[8] Parnell GS, Smith CM, Moxley FI. Intelligent adversary risk analysis: a bioterrorism risk management model. Risk Analysis 2010;30(1):32–48.
[9] Brown G, Cox A. How probabilistic risk assessment can mislead terrorism analysts. Risk Analysis 2010;31(2):196–204.
[10] Ezell B, Collins A. Letter to the editor in response to Brown and Cox, how probabilistic risk assessment can mislead terrorism analysts. Risk Analysis 2010;31(2):192.
[11] Brown GG, Carlyle WM, Harney RC, Skroch EM, Wood RK. Interdicting a nuclear-weapons project. Operations Research 2009;57(4):866–77.
[12] Rios J, Rios Insua D. Adversarial risk analysis for counterterrorism modeling. Risk Analysis 2012;32(5):894–915.
[13] Guikema SD, Aven T. Assessing risk from intelligent attacks: a perspective on approaches. Reliability Engineering and System Safety 2010;95:478–83.

[14] Clemen RT. Combining forecasts: a review and annotated bibliography. International Journal of Forecasting 1989;5:559–83.
[15] Howard RA. From influence to relevance to knowledge. In: Oliver RM, Smith JQ, editors. Influence diagrams, belief nets, and decision analysis. Chichester: Wiley; 1990.
[16] Shachter RD. Evaluating influence diagrams. Operations Research 1986;34:871–82.
[17] Keeney RL, von Winterfeldt D. Identifying and structuring the objectives of terrorists. Risk Analysis 2010;30(12):1803–16.
[18] Horlick-Jones T. Meaning and contextualization in risk assessment. Reliability Engineering and System Safety 1998;59:79–89.
[19] Renn O. The role of risk perception for risk management. Reliability Engineering and System Safety 1998;59:49–62.
[20] Kirkwood CW. Strategic decision making. Belmont, CA: Duxbury Press; 1997.
[21] Simon HA. Rational choice and the structure of the environment. Psychological Review 1956;63(2):129–38.
[22] Einhorn HJ. Use of nonlinear, noncompensatory models as a function of task and amount of information. Organizational Behavior and Human Performance 1971;6:1–27.
[23] Kahneman D, Tversky A. Prospect theory: an analysis of decision under risk. Econometrica 1979;47(2):263–92.
[24] Crawford V, Iriberri N. Level-k auctions: can a nonequilibrium model of strategic thinking explain the winner's curse and overbidding in private-value auctions? Econometrica 2007;75(6):1721–70.
[25] Luce RD. Individual choice behavior: a theoretical analysis. Mineola, NY: Dover Publications; 2005.
[26] Baltas G, Doyle P. Random utility models in marketing research: a survey. Journal of Business Research 2001;51:115–25.
[27] Buede DM. Errors associated with simple versus realistic models. Computational and Mathematical Organization Theory 2010;15(4):11–8.
[28] Clemen RT, Winkler RL. Aggregating probability distributions. In: Edwards W, Miles R, von Winterfeldt D, editors. Advances in decision analysis. New York: Cambridge University Press; 2007.
[29] Clemen RT, Winkler RL. Combining probability distributions from experts in risk analysis. Risk Analysis 1999;19(2):187–203.
[30] Collopy F, Armstrong JS. Expert opinions about extrapolation and the mystery of the overlooked discontinuities. International Journal of Forecasting 1992;8:575–82.
[31] Guikema SD, Quiring SM. Hurricane outage prediction model: Phase II report. Baltimore, MD: Johns Hopkins University; 2009.
[32] Guikema SD, Han SR, Quiring SM. Pre-storm estimation of hurricane damage to electric power distribution systems. Risk Analysis 2010;30(12):1744–52.
[33] Dawes R. The robust beauty of improper linear models in decision making. American Psychologist 1979;34(7):571–82.
[34] Mahoney S, Comstock E, deBlois B, Darcy S. Aggregating forecasts using a learned Bayesian network. In: Proceedings of the Florida Artificial Intelligence Research Society Conference; Palm Beach, FL; 2011.
[35] Brown RV, Lindley DV. Plural analysis: multiple approaches to quantitative research. Theory and Decision 1986;20(2):133–54.
[36] Sollich P, Krogh A. Learning with ensembles: how overfitting can be useful. Advances in Neural Information Processing Systems 1996;8:190–6.
[37] Dietterich T. Ensemble learning. In: The handbook of brain theory and neural networks. 2nd ed. Cambridge, MA: MIT Press; 2002. p. 405–8.
[38] Wolpert D. Stacked generalization. Neural Networks 1992;5(2):241–59.
[39] Dani V, Madani O, Pennock D, Sanghai S, Galebach B. An empirical comparison of algorithms for aggregating expert predictions. In: Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence; 2006. AUAI Press.
[40] Cesa-Bianchi N, Freund Y, Haussler D, Helmbold DP, Schapire RE, Warmuth MK. How to use expert advice. Journal of the Association for Computing Machinery 1997;44(3):427–85.
[41] Predd JB, Kulkarni SR, Poor HV, Osherson DN. Scalable algorithms for aggregating disparate forecasts of probability. In: Proceedings of the Ninth International Conference on Information Fusion; 2006.
[42] Kahn JM. A generative Bayesian model for aggregating experts' probabilities. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence; 2004. AUAI Press.
[43] Ting KM, Witten IH. Issues in stacked generalization. Journal of Artificial Intelligence Research 1999;10:271–89.
[44] Ranjan R, Gneiting T. Combining probability forecasts. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2010;72(1):71–91.
[45] Tversky A, Kahneman D. Advances in prospect theory: cumulative representation of uncertainty. Journal of Risk and Uncertainty 1992;5:297–323.
[46] Wu G, Zhang J, Gonzalez R. Decision under risk. In: Koehler DJ, Harvey N, editors. Blackwell handbook of judgment and decision making. Oxford: Blackwell Publishing; 2004.
[47] Stahl D, Wilson P. Experimental evidence on players' models of other players. Journal of Economic Behavior & Organization 1994;25:309–27.
[48] Stahl D, Wilson P. On players' models of other players: theory and experimental evidence. Games and Economic Behavior 1995;10:218–54.
[49] Nagel R. Unraveling in guessing games: an experimental study. American Economic Review 1995;85(5):1313–26.
[50] Crawford V, Gneezy U, Rottenstreich Y. The power of focal points is limited: even minute payoff asymmetry may yield large coordination failures. American Economic Review 2008;98(4):1443–58.


[51] Costa-Gomes MA, Crawford VP. Cognition and behavior in two-person guessing games: an experimental study. American Economic Review 2006;96:1737–68.
[52] Kawagoe T, Takizawa H. Equilibrium refinement vs. level-k analysis: an experimental study of cheap-talk games with private information. Games and Economic Behavior 2009;66:238–55.


[53] Rothschild C, McLay L, Guikema SD. Adversarial risk analysis with incomplete information: a level-k approach. Risk Analysis 2012;32(7):1219–31.
[54] McLay L, Rothschild C, Guikema SD. Robust adversarial risk analysis: a level-k approach. Decision Analysis 2012;9(1):41–54.