Reliability Engineering and System Safety ∎ (∎∎∎∎) ∎∎∎–∎∎∎
A practitioner's experiences operationalizing Resilience Engineering
E. Lay a,*, M. Branlat b, Z. Woods a
a Calpine Corporation, 717 Texas Ave., Houston, TX, USA
b 361 Interactive, Springboro, OH, USA
Keywords: Resilience Engineering; Risk; System safety
Abstract
Resilience Engineering (RE) is a reframed perspective. This raises the question: how does one operationalize a shift in perspective? We share strategies, tactics, experiences, and observations from implementing Resilience Engineering in power generation equipment maintenance. Use of Resilience Engineering principles shifts focus to the future, to systems, and to how people really work (not the idealized version of work). We shape outcomes more effectively as we pay attention to what is coming, looking for signs that we are outside normal work or running out of the margins that enable us to adapt and respond. Use of these principles opens new possibilities grounded in the theoretical fields of biology and the cognitive and system sciences (recognizing that Cartesian views of the world work well for machines but not for people) and underlain by core principles (e.g., people fundamentally want to do a good job, actions taken make sense at the time, and system factors are tremendously influential on outcomes). This paper presents a practitioner's account of a Resilience Engineering approach in the context of power plant maintenance. The paper describes how the introduction of RE principles was made possible through supporting and fostering shifts in perspective and gaining buy-in at various levels of the organization.
© 2015 Elsevier Ltd. All rights reserved.
1. Introduction Reliability and safety are often thought to be achieved through prescriptive programs and systems composed of predefined, structured elements such as Safety Management Systems. The elements may include procedures (thou must and thou shall), processes (what and how to do), tools (forms, checklists), training, etc. There is no corresponding system for Resilience Engineering and this is precisely what makes it challenging for organizations that typically rely on tips and techniques to drive the implementation of operational programs. The Resilience Engineering (RE) movement was founded by innovative academics and practitioners who have decades of experience in investigating and transforming work systems, and have an understanding of how human systems function as individuals and organizations. The philosophies are rooted in biology and grounded in cognitive and systems science. RE celebrates that we are adaptive beings who plan and take action, improvise when plans or actions go awry, and learn from and embrace unexpected challenges as we act in, and are shaped by, our complex, dynamic world. Real-world systems are seen as fundamentally characterized by complexity, ambiguity, uncertainty and resource constraints: not everything can be known, not everything can be predicted, and not everything can be investigated. Emergence of events is not linear or specifically predictable. Within brittle systems,
events are waiting to emerge; it may be a turbine over-speed event or something else. An act of nature, such as an earthquake, may be the trigger, or it could be a person not showing up for work that day. Such a view emphasizes the need for adaptability and context sensitivity (capabilities provided especially by human expertise), as opposed to views in which, with "a little more effort", things can be specified, anticipated, and controlled. RE shines a light on the breakdowns of traditional safety approaches that are based in a Cartesian view of the world. In our experience, this alternative perspective has proven more powerful and effective. This paper presents a practitioner's account of a Resilience Engineering approach in the context of power plant maintenance. The paper describes how the introduction of RE principles was made possible through supporting and fostering shifts in perspective and gaining buy-in at various levels of the organization. The following Resilience Engineering principles guided the work presented here; they will be described at greater length in the next section:
Principle 1: Variability and uncertainty are inherent in complex work.
Principle 2: Expert operators are sources of reliability.
Principle 3: A system view is necessary to understand and manage complex work.
Principle 4: It is necessary to understand "normal work".
Principle 5: Focus on what we want: to create safety.
* Corresponding author. E-mail address: [email protected] (E. Lay).
http://dx.doi.org/10.1016/j.ress.2015.03.015 0951-8320/© 2015 Elsevier Ltd. All rights reserved.
Please cite this article as: Lay E, et al. A practitioner's experiences operationalizing Resilience Engineering. Reliability Engineering and System Safety (2015), http://dx.doi.org/10.1016/j.ress.2015.03.015
In the following section, we will describe how designing to the above principles resulted in shifts in practices (with examples across system levels), as well as shifts in priorities to develop these abilities characteristic of a resilient system [7]:
- Ability to respond to regular and irregular variability, disturbances, and opportunities.
- Ability to monitor that which happens and recognize if something changes so much that it may affect the organization's ability to carry out its current operations.
- Ability to learn the right lessons from the right experiences.
- Ability to anticipate developments that lie further into the future, beyond the range of current operations.
2. Background: Power plant maintenance operations
With 90+ power plants in the US and Canada, Calpine holds the nation's largest fleet of combined cycle and cogeneration plants. Gas-fired turbine-generators require maintenance outages about every 2 years to change major parts, as the extreme operating environment, with inlet temperatures up to 2000 °F/1000 °C, degrades and damages parts. Power prices vary widely in today's open market, and risks of gain and loss are extreme. Two examples illustrate these extremes:
Example 1: In the winter, electricity usage peaks in the mornings before people go to work. Abnormally cold weather is predicted to arrive the next morning in a region where temperatures are typically warm. A call is made to sell all generating capacity into the day-ahead market at $3000/MW h. During the night, a plant unexpectedly trips and will not restart; the plant has the capacity to produce 100 MW h (a relatively small plant). In the meantime, the real-time price of electricity rises to $5000/MW h. Since 100 MW h had been committed (sold to market) the day before, power had to be purchased from the market to make up the already sold power. For each hour the plant is down, the loss is (5000 − 3000 $/MW h): $2000/MW h × 100 MW h = $200,000/h.
Example 2: Rate caps have been raised to $7000/MW h. It costs about $100 to produce 1 MW h of electricity. If electric prices hit this peak, potential earnings for this same 100 MW plant are 100 MW h × $6900/MW h = $690,000/h. Note that the duration of these peaks has historically been very short; power prices reached a peak of $4900/MW h in this region for 15 min in the summer of 2013. As a point of comparison, the average price of electricity in the US is about $120/MW h. Swings in market price create the need to be flexible and responsive, shifting outages and modifying work scope in order to meet market demands and get units back online quickly.
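The hourly figures in the two examples above follow directly from the spread between the committed (or production-cost) price and the market price. A minimal sketch of that arithmetic (our own illustration, with invented function names, not a Calpine tool):

```python
# Illustrative arithmetic for the two market examples above.
# Function names and structure are ours; only the dollar figures come from the text.

def shortfall_loss(real_time_price, day_ahead_price, committed_mwh):
    """Hourly loss when committed power must be bought back at the real-time price."""
    return (real_time_price - day_ahead_price) * committed_mwh

def peak_earnings(price_cap, production_cost, capacity_mwh):
    """Hourly earnings if prices hit the cap while the plant runs at capacity."""
    return (price_cap - production_cost) * capacity_mwh

# Example 1: 100 MW h/h sold day-ahead at $3000, bought back at $5000 after a trip.
loss_per_hour = shortfall_loss(5000, 3000, 100)   # 200,000 ($/h)

# Example 2: $7000/MW h cap, ~$100/MW h production cost, same 100 MW plant.
gain_per_hour = peak_earnings(7000, 100, 100)     # 690,000 ($/h)
```

The asymmetry between the two numbers is what drives the operational pressure described next: both the downside of an unexpected trip and the upside of being online during a peak scale linearly with every hour of availability.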
In other words, this is a system with frequent changes wherein complex projects (outages) are being implemented with high momentum. An ‘outage’ on a combined cycle plant involves crews of 15 to 30 people mobilizing to a power plant site to disassemble, inspect, and reassemble the turbine-generator while the plant staff perform maintenance work on the balance of the plant. It is a complex and demanding operation due to a number of characteristics: - The crew is diverse, including contractors, power plant site personnel, and the power plant owner maintenance group. Furthermore, this group comes together for the first time on day 1 of the outage. - The work requires many specialty tools, often shipped in on several tractor trailers. - The plant parts are large, expensive, complicated, with very tight clearances and close tolerances. Operations require lifting heavy components (a typical turbine rotor can weigh 50 to 80 US tons).
- It is common to be working outdoors in extreme heat or cold for 12-h shifts, 7 days a week, under extreme schedule pressure.
- The work system operates close to the edge during peak season, with demand sometimes exceeding resources.
Such characteristics require thorough planning, but planning by itself cannot address the variability of operations; scope is indeed highly variable and emergent work is common. Decisions need to be made on the fly and issues resolved quickly, as there is significant financial cost associated with inaction. As a result, planning itself needs to be adaptable. The requirement for flexibility is nonetheless highly constrained by high logistical demand in the face of limited resources, by constantly conflicting goals, and by the fact that poor decisions can result in future incidents and losses. It is therefore essential to build adaptability and flexibility, by design, into the planning of operations.
3. A shift in perspective: The principles of Resilience Engineering
3.1. Principle 1: Variability and uncertainty are inherent in complex work
Deming [2] stated that "uncontrolled variability is the enemy of quality." While this may be true for simple, tractable systems such as an assembly line, variability is inevitable in most work situations, in which organizations are complex systems and uncertainty is present in both the processes under control and the operational environment. Building especially on findings from control theory (Ashby's law of requisite variety, 1956), Resilience Engineering emphasizes the necessity to expect and design with variability in mind. A large body of research describes the importance of adaptability and adaptive behavior for systems to be able to face the variability of their environment without losing control or totally collapsing (in addition to Ashby, see for instance [20,21]; Weick and Sutcliffe [18]). Work systems pursue various goals that require them to control various processes in a given environment (e.g., a hospital treats sick patients, nuclear power plants produce energy from a nuclear reaction, aviation crews fly passengers, firefighters respond to fires). In order to do so, systems rely on models of their tasks and of the environment. Any model is, however, a necessary simplification that aims at reducing and managing the uncertainty and complexity of the real world for pragmatic purposes (Hollnagel [5]). As a result, the conditions for operation are underspecified. The predictability associated with the processes under control and the environment is never perfect (even in predominantly closed worlds like manufacturing plants). Any work system therefore needs to be able to adapt to a variety of perturbations and, given the underspecification, these perturbations may be surprising.
Such assumptions and perspective are in sharp contrast with traditional risk management practices: in risk management terms, this view corresponds to a shift from an approach relying on the capacity to make accurate predictions (based on detailed analyses of events made in hindsight) to an approach relying on preparing for the general shape of risk and on being highly attuned to indications that the risk level is changing. Based on studies conducted in different high-risk domains, Woods and Branlat [22] have identified three basic patterns of adaptive failures or traps: (1) decompensation: when a system exhausts its capacity to adapt as disturbances and challenges cascade; (2) working at cross-purposes: when sub-systems or roles exhibit behaviors that are locally adaptive but globally maladaptive; (3) getting stuck in outdated behaviors: when a system over-relies on past successes although conditions of operation change. While the first pattern is directly related to the capacity to manage variability, the second one
relates to issues in managing complexity, i.e., interdependencies and interactions across systems. The argument is that capacities or mechanisms that allow work systems to avoid those patterns represent sources of resilience. The identification of such basic patterns therefore suggests ways in which a work organization needs to behave in order to see and avoid, or recognize and escape, the corresponding failures.
3.2. Principle 2: Expert operators are sources of reliability (vs. human operators as unreliable components)
Traditional views on safety and reliability, such as those expressed in countless investigation reports, typically identify human operators as a main source of failure. This perspective is challenged by decades of empirical data showing operators' critical contribution to the reliability of work systems in the face of variability and uncertainty. To a large extent, the adaptability necessary for systems to handle potentially unforeseen variability is indeed provided by the human elements of the work systems, even when they make ample use of advanced technology. An important reason for this is the context-sensitivity issue [20]. Unlike machines (which act literally according to rules), humans can reflect on the context of operations and identify potential gaps in procedures. Operators fill the gaps by adapting to the real conditions of operations and their dynamics (Cook and Rasmussen, 2005). In our operational context, given the variability of the fleet and vendors, our field personnel are vital assets of resilience. They have the ability to plug the gap and make up for limitations in vendor performance when no other crews are available. They bring added expertise in handling scope expansion or troubleshooting unforeseen events. They are a source of extra experience to assist vendor engineers. They are also scouts in a position to notice system variability.
3.3. Principle 3: A system view is necessary to understand and manage complex work
An important perspective shift comes from embracing a systemic view of the organization and operations. While the word "system" itself is used ubiquitously, work organizations typically rely on oversimplified views of operations:
Through decomposition, operations are seen as more fragmented than they actually are, leading to overlooking cross-level interdependencies and interactions in the complex environment of outages. The management of outage situations requires the recognition of the diversity of stakeholders and goals within organizations, as well as across organizations. As with any system, understanding an issue requires considering what the boundaries are (what constitutes the system, and what constitutes the environment?): drawing boundaries too narrowly (the typical tendency) leads to overlooking interactions, but drawing boundaries too widely is overwhelming and hinders action. The fundamental challenge here is that there is no unique solution: a system view requires the consideration of multiple boundaries. Risks of oversimplification associated with failing to adopt a system view also include considering phenomena as static rather than dynamic (work organizations need to understand trends, not just snapshots in time), and seeing work situations as homogeneous rather than heterogeneous (overlooking operational variability).
Traditional safety organizations sometimes draw boundaries too narrowly. Traditional safety focused only on occupational safety (preventing slips, trips, and falls) misses the big picture. Hopkins [8] recounts how BP and Transocean leadership (who had deep knowledge of rig operations) performed a safety walk-down focused on occupational safety as the well was failing around them. This is contrasted with looking for the process risks that can seriously harm the system. In traditional safety, safety and quality may be treated as separate and different, including separating people into different organizations. In these organizations, whether an incident is categorized as safety or quality depends on the outcome. A systems view recognizes that, in general, the system will produce what it is configured to produce; when triggered, there is an element of randomness in the specific outcome: it may be an injury or it may be equipment damage. Organizations that define themselves narrowly have an incomplete and inaccurate assessment of the real state of their system.
3.4. Principle 4: It is necessary to understand "normal work"
Traditional safety often focuses on incidents, which represent only a small portion of work. A new view of safety looks at improving normal work (most of our time is spent doing normal work!) and at understanding and expanding practices that enable work to go well. In this view, there is an emphasis on leadership staying in close touch with work and closing the gap between "work as it really happens" and "work as imagined". This includes understanding how work is done, why breakdowns occur, and tailoring systems to support effective human actions.
3.5. Principle 5: Focus on what we want: To create safety
According to Erik Hollnagel, traditional safety is like trying to drive a car by looking in the rear-view mirror; it is a corrective and reactive approach. The new view focuses on what we want, which is to create safety, versus what we don't want: adverse events such as incidents, injuries, and losses of performance. We drive the car by looking forward and all around us.
Consider, as a point of comparison, the term "hazard": "hazard" is common language in traditional safety programs and is sometimes used to the exclusion of the word "risk". A hazard is a specific, describable, often already existing situation, while "risk" is linked to actions and potential outcomes. Consider the terms "control the hazard" versus "manage the risk": "control the hazard" brings focus to the here and now, while considering risks triggers future thinking and moves us to an anticipative, planning state of mind. Subtle shifts in language change culture.
4. From principles to essential abilities of resilience, to operational strategies and tactics
When work is complex (and most work is), the principles of Resilience Engineering apply for successful work. "Resilience is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances so it can sustain required operations, even after a major mishap or in the presence of continuous stress." [15]. Such overall ability is supported by the four system abilities described below: the ability to respond, to monitor, to anticipate, and to learn. General strategies that aim at translating those abilities into operations are described, as well as examples of associated practices (specific tools will be described in greater detail in the following section). Resilience Engineering principles are reintroduced here so that the link between these principles and abilities is clear:
Principle 1: Variability and uncertainty are inherent in complex work.
Principle 2: Expert operators are sources of reliability.
Principle 3: A system view is necessary to understand and manage complex work.
Principle 4: It is necessary to understand "normal work".
Principle 5: Focus on what we want: to create safety.
4.1. Ability to respond
"This ability means that the system is ready to move into different actions (i.e., is flexible and can adapt). A system must be able to respond to regular and irregular events (challenges and opportunities) in an effective, flexible manner." [6]. The strategies supporting this ability relate particularly to the first pattern of adaptive failure described in a previous section. They aim at favoring timely response to events. The management of deployed resources, the provision of extra resources, and the management of priorities, as described by Cook and Nemeth [16], are examples of strategies supporting the successful management of mass casualty events by Israeli hospitals. Building buffers and developing flexibility enable adaptation in response to variability, both organizationally and individually. This enables us to be prepared for the general shape of risk, understanding that variability and uncertainty are inherent in our work (Principle 1) and that system and individual adaptability create flexibility (Principles 2 and 3). You don't need to predict exactly what is going to happen to know what type of resources could help. Buffers could take the form of people, tools, supplies, space, time, or other types of resources. These resources need to be planned for and developed before they are needed. One example of developing organizational and individual flexibility is cross-training people to hold different roles and occasionally rotating them into these roles to keep their skills fresh. Such a strategy prepares operators to shift into roles to support disruptive or peak-load situations.
4.2. Ability to monitor
This ability corresponds to noticing critical disruptions and situations before or when they occur. "A system must be able to monitor internal and external developments that may develop into challenges or opportunities. Effective monitoring can lead to increased readiness (early warning) and facilitate early responses, hence improve allocation and use of resources" [6] (Principles 1 and 5). Strategies relate to how organizations assess and understand their situation. They address especially the second and third patterns described previously, through improving coordination (a source of information sharing) across the system and through mechanisms aiming at bridging the gap between a situation as imagined and the actual situation experienced (Principle 4). Examples include support of processes of sensemaking and support of reflective processes (Principle 2). Developing a sensitivity to trigger words or phrases helps people notice and monitor risk (Principle 1). People can be trained to notice phrases that are indicators of increased risk, such as "I've never seen that before", "This is the first time…", "This is harder or worse…", indications of uncertainty ("maybe", "not sure", "possibly"), incomplete information, or assumptions. Using the language of risk, and sharing what we notice, builds noticing in others. This can be done by drawing attention to trigger phrases: "I just heard you say 'this is the first time…'. I hear risk." Hopkins [8] tells of executives performing a safety walk-around on the Deepwater Horizon rig as it was beginning to fail. One operator mentioned, "We're having a little trouble…" Another executive noted that the conversation overheard was confused; he sensed the crew needed help and suggested the on-site rig manager stay behind to assist. Later in the day, the executives asked if the test had gone well and were given a thumbs up. What if other questions had been asked to probe deeper or seek a more authentic response?
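As a toy illustration of the trigger-phrase idea above (our own sketch, not a tool described by the authors; the phrase list is abridged from the text and the function name is invented):

```python
# Toy sketch: scan an utterance for risk "trigger phrases" like those listed above.
# The phrase list and function name are illustrative, not an actual Calpine tool.

TRIGGER_PHRASES = [
    "never seen that before",
    "this is the first time",
    "maybe",
    "not sure",
    "possibly",
]

def flag_risk_language(utterance):
    """Return the trigger phrases present in an utterance (case-insensitive)."""
    text = utterance.lower()
    return [phrase for phrase in TRIGGER_PHRASES if phrase in text]

hits = flag_risk_language("I'm not sure, this is the first time we've set it up this way.")
# hits == ["this is the first time", "not sure"]
```

The point of the practice, of course, is to train people rather than software to do this noticing; the sketch only makes explicit how simple and literal the cue list can be.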
4.3. Ability to anticipate
This ability relates to expecting and planning for critical disruptions, situations, and their consequences (Principle 1). As Hollnagel [6] puts it: "A system must be able to anticipate challenges and opportunities in the near and far future (requisite imagination). An organization cannot be truly proactive (generative) without anticipation." Corresponding strategies aim at supporting the processes of monitoring and response. They correspond to longer term learning processes and are responses to conditions experienced in the past that have hindered resilient operations (Principle 2). Examples include anticipating knowledge gaps and needs, and anticipating resource gaps and needs. Tools designed to support the ability to anticipate aim at bringing diverse perspectives, stimulating creative thinking, and increasing mindfulness to uncover risks (Principle 2). This is done by bringing knowledge to bear when and where it is needed, as opposed to closing down and constraining thinking. For example, pretask briefs sometimes include a checklist of hazards, but solely going down a checklist of hazards closes down thinking, is unlikely to surface risks beyond this list, and is typically not specific enough to trigger actions. Pretask briefs can be one of the most effective tools to anticipate and manage risk if designed to engage people and probe variability through querying (How are we most likely to fail? What's different than you expected? What's different than before?), through building better mental models by performing the pretask brief near where the work will be done, and by including stories of the unexpected, of problems, and of what worked well in the past. Adopting the following core principles will improve pretask briefs:
1. We do pretask briefs for all work.
2. A pretask brief is a conversation wherein we plan work, share experiences/stories, and brainstorm on risks and ways to make work safer.
3. Filling out a checklist is not a pretask brief.
4. Pretask briefs take all the time that is needed and are part of work.
The "premortem" method (prospective hindsight) described by Klein [9] can supplement pretask briefs in order to more thoroughly surface risks (Principles 2 and 5). The "premortem" method works like this: you imagine an event has already occurred, then query how, specifically, it happened. For example: "You just burned your hand. How did this happen?" According to research, use of prospective hindsight increased the ability to correctly identify future outcomes by about 30% (ibid.). Stepping back and looking at the big picture can reveal patterns and interactions that a focus on individual events hides (Principle 3). For example, a simple tool of a timeline of incidents reveals: (1) increasing severity and number of incidents toward the end of an outage; (2) precursors that showed up early; and (3) interrelations, such as a project engineer, distracted by recovering from an incorrect part delivery, failing to weatherproof an enclosure in advance of impending storms (the enclosure leaked, resulting in a costly delay). Building the skills to notice patterns and the initiation of potentially interrelated failures enables us to stop failure cascades or to recognize where a cascade is likely. This recognition in turn allows for the targeted deployment of additional resources to cope with disturbances. A better understanding of interactions and interdependencies enables us to build foresight and improve communications, ultimately reducing incidents and increasing efficiency. Perspective is shifted away from linear causality toward emergence and toward noticing when conditions are ripe for an incident to emerge.
4.4. Ability to learn
"A system must be able to learn from past events by understanding what went right and what went wrong—and why." [6].
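The incident-timeline tool mentioned above can be as simple as bucketing incidents by outage week; a minimal sketch follows (the data and bucketing scheme are invented purely to illustrate how end-of-outage clustering becomes visible):

```python
# Hypothetical sketch of a simple incident timeline: bucket incidents by outage
# week so that clustering toward the end of an outage stands out. Data invented.
from collections import Counter

# (day_of_outage, severity) pairs for one notional 21-day outage.
incidents = [(3, "low"), (12, "low"), (18, "medium"), (19, "low"),
             (20, "medium"), (21, "high"), (21, "medium")]

def incidents_per_week(events):
    """Count incidents per 7-day window of the outage."""
    return Counter((day - 1) // 7 + 1 for day, _ in events)

by_week = incidents_per_week(incidents)
# by_week == {1: 1, 2: 1, 3: 5}: most incidents fall in the final week
```

Even this crude aggregation surfaces the end-of-outage pattern the authors describe; a real timeline would also carry severity and precursor annotations.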
Implementing the shift in perspective to view operators as sources of reliability means prioritizing the development of practices that share knowledge, support learning, and grow diverse skills (for example, After Action Reviews and case studies with thoughtful questions) over practices that control and constrain (for example, procedures, audits, and checklists) (Principle 2). Good practices are surfaced and shared below. Storytelling is used frequently. Near-miss stories are a rich source of lessons, especially when grouped by common themes that reveal an element of surprise. Denning [3] found that memory retention is lower for information presented in a non-narrative, informing style than for information embodied in narrative:
We would have four or five bullet points that we were hoping people would learn. We were spending our time focusing on the precise wording of those bullet points. What we discovered almost by accident was that the wording hardly mattered. The only points people remembered one or two weeks later were the points that had been embodied in a story. …We found that when people would come to a meeting a couple weeks later, they had completely forgotten the bullet points, but they could repeat the story back to us almost verbatim. Following the story, they knew what they were supposed to have learned. That was a powerful discovery.
Compared to messages that inform or tell what to do, stories are more effective in overcoming resistance to persuasion and in influencing beliefs, feelings, values, and behaviors over the longer term.
Research by Moyer-Gusé and Nabi [14] found that transportation into narrative (becoming immersed in a story and experiencing events as they unfold), identification (imagining oneself as a particular character), and perceived similarity (sharing common attributes, characteristics, beliefs, and/or values) influenced three forms of resistance to persuasion: reactance (the perception that someone is trying to constrain us, such as through overtly persuasive messages, which triggers a basic need for independence), counter-arguing (thoughts that dispute or are inconsistent with the persuasive argument), and perceived invulnerability (the belief that one is uniquely immune to the negative consequences of a risky behavior). Investigations need to be prioritized based on the lessons they hold and on potential future impacts, rather than solely on the severity of consequences that have already occurred (Principle 5). In addition, stories should be told from the viewpoint of the operator (second stories) so that the rationale for their behavior is revealed (what they did made sense at the time, given the information available and the constraints of operations). Wildland firefighters, for instance, perform staff rides where leadership hears the story of an incident, told on location and from the perspectives of the people who were involved (Principle 4).
4.5. Adopting a system view
Our current system structure is multi-layered. Our outage services group is responsible for one core task, planning and executing maintenance outages, but we are only one unit within a distributed system. It is vital to understand the relationships between the units within the system. According to Dekker [1], "Accidents result from relationships between components (or the software or the people running them) not from the workings or dysfunction of any component part." As represented in Fig.
A.1, there are three types of stakeholders in a single outage: the outage services group, the maintenance vendor, and the plant personnel. All of these groups are working toward a similar goal, which is a successful outage (success, however, is defined differently by each stakeholder). The responsibilities and tasks around the outage are diverse, requiring coordination and mutual feedback for effective execution. Expanding the lens outwards, there
are even more stakeholders involved in the outage process: support staff for the plant in downtown headquarters and support staff for the outage services group all have additional goals and responsibilities. Stepping back even further, the system expands again, moving from a single outage to multiple outages occurring in unison. Thus there exists a site-outage system and a fleet-wide outage system. Given that emergent maintenance work is common in operating combined cycle power plants, the system supporting outages must continually adapt to changing work scopes while balancing the needs of the fleet-wide system. This challenge is compounded by an additional layer of complexity: technology is not uniform across the fleet's power plants, and work requires specialized workers matched to the technologies. Since the business is cyclic, with high workload peaks in spring and fall when the weather is more temperate, it is not cost effective to staff for the peaks; this applies to internal resources and maintenance vendors alike. The result is a system that can very easily become brittle and begin to produce breakdowns as more and more resources are tapped. The tendency is to focus on the outages that have the most urgent issues (typically linked to the highest market potential), but this misses the interdependencies and indirect costs associated with cascading changes. When one outage or one person is moved, or work expands, there are ripple effects. The losses due to these ripple effects are mostly hidden but can be uncovered with a holistic, systems view. By understanding the interdependencies, the risks (threats and opportunities) associated with changes, and the constraints and boundaries within the system, we can better alter the system to improve performance. By speculating on and planning for the general shape and timing of changes, strategies can be designed to minimize the disturbances.
5. More details on strategies, tactics, and practices to support developing resilience

5.1. Examples of strategies and tactics to build RE abilities

The strategies and tactics shared in the following table have been used to build the abilities to respond, monitor, anticipate, and learn in highly cyclical load situations. They could be applied to other situations as well, such as responding to emergent issues (Table A.1).

5.2. A selection of representative new practices at different scales of the system

The practices shared in this section are at different stages of implementation and acceptance within our organization. Moreover, they vary in scale: some apply at the level of the individual or team, while others apply to outages (the project level), and others apply across the organization. Table A.2 depicts where the practices described in the following sub-sections fall relative to these two dimensions.

5.2.1. Real time risk assessment

A novel, complex, and/or difficult situation arises, and within one hour a geographically separated group, diverse in terms of knowledge, skills, function level, and roles, convenes via telephone conference to solve the problem. This is a Real Time Risk Assessment. Participants describe and diagnose the problem via structured brainstorming (exploring risks and multiple solutions), then agree on and produce a plan that includes actions, decisions, decision authority and accountability, check-in points, iterative solutions, and contingencies. Real Time Risk Assessment is a method developed by the authors to respond to risks by tapping into current, diverse knowledge and shared experiences in an organic, interconnected way, bringing them to bear at the point and time of need to address emerging situations.

Please cite this article as: Lay E, et al. A practitioner’s experiences operationalizing Resilience Engineering. Reliability Engineering and System Safety (2015), http://dx.doi.org/10.1016/j.ress.2015.03.015i

Fig. A.1. Systems view of the organization and structure of people who support outages.

Real Time Risk Assessment is especially effective for highly variable work (Principle 1: Variability and uncertainty are inherent in complex work.) and for enabling expertise to be leveraged (Principle 2: Expert operators are sources of reliability.). According to Woods [19], in resilient organizations, “Adaptive decision-making exists in a co-adaptive web where adaptive behaviour by other systems horizontally or vertically (at different echelons) influences (releases or constrains) the behaviour of the system of interest.” Inherent in these descriptions is flexibility: the ability to configure based on the dilemma being faced. Real Time Risk Assessment creates a culture that is responsive to frontline needs and attuned to risk; it is also a way of responding to the general shape of risk, since, for example, the types and sources of knowledge that would help with certain types of problems are identified ahead of time. Roles are designed for diversity: a risk decision owner (usually a person responsible for profit and loss), a challenger, a design expert (what to do), a repair expert (how to do it), person(s) with related experience, the practitioners needing help, and a risk knowledge broker. Risk knowledge brokers facilitate the creation, sharing, and use of
knowledge (Sverrisson [17]). Meyer [13] emphasizes that knowledge brokers do more than link knowledge; they facilitate the co-creation of knowledge and participate in constructing a common language, in this case the language of risk. Knowledge brokers sometimes use matchmakers to determine where knowledge relevant to the problem at hand resides. “Matchmakers” know what others know: they have wide networks and deep or broad experience, and may have held many roles or held one role for a long time. When approaching an especially novel problem, the knowledge broker’s first step is to reach out to matchmakers and brainstorm with them on who could bring value in solving the problem. Additional details of this process can be found in the proceedings of the 2013 Resilience Engineering Symposium (Lay and Branlat [10]).
5.2.2. 2-Minute drill

A 2-minute drill is a practice for increasing situation awareness by directing attention to the physical workplace and the work itself, with the intention of creating foresight. Variations of 2-minute drills are used
Table A.1. Examples of strategies and tactics to build RE abilities.

Ability to respond (adapt)
- Manage deployed resources: Shift goals; shift roles; have critical resources perform critical tasks only. Use less experienced people for less complex work, providing more oversight if needed; have experts coach and provide oversight rather than “do” the work.
- Provision of extra resources: Add a buffer, such as a logistics person to manage parts, especially for projects with multiple emergent work scopes or issues. “Drop in an expert”: find a person with deep, relevant knowledge (possibly a retiree) and fund them for a short period with the mission to assess the situation and then make offers of help to groups who need it. Form a special team, for the duration the system is stretched close to its limits, to provide a heightened state of coordination and help; consider the decisions that need to be made, the power needed to remove barriers and expedite solutions, and the authority to add or move resources. Such a team strengthens leadership’s connection to the front lines and provides a forum to escalate issues to management’s attention; a cross-organization team improves collaboration and can hold a neutral position to smooth political tensions that arise during periods of high stress.
- Manage priorities: Focus on aggressively addressing issues that have the potential to impact frontline work. Adjust capacity limits by removing stressors, both physical and mental, from people. Shed tasks: don’t do them, postpone them, do them less frequently, or move them to another person. Shed load: move or decline projects. Manage differently, considering how people respond when close to their limits (fatigued, stressed): they are more forgetful, less attentive, and may miss things; increase use of human performance tools such as peer checks.

Ability to monitor
- Support processes of sense making: Have someone step back from (or out of) their usual role to gain a broader perspective. Begin a heightened state of coordination and help, possibly with brief daily conversations with those involved. Assess the global situation; avoid the tendency to handle issues serially rather than holistically.
- Support reflective processes: “Ping” (notice signs the risk level has changed) to monitor the approach of yield points. The yield point, from a stress–strain analogy, is the point beyond which the system begins to have small failures and misses; this typically shows up before more catastrophic failure [21,11]. Query the frontlines on breakdowns, concerns, and current capacity. Ask: Who is at the point they can’t keep up? What help is needed to add capacity, remove stressors, or free up capacity? What is impeding the ability to perform? What is keeping them awake at night? Search for signs of brittleness, such as incomplete or unclear information or statuses, silo situations where workers are not optimally connected with the front lines, inaccurate assumptions, fatigue, and key individuals for whom there is no back-up.

Ability to anticipate
- Anticipate knowledge gaps and needs: Develop multi-skilled workers. Build a support team trained to hold various roles to unload or support the frontlines in addition to regular duties. A strategy for building this team is to recruit people with a variety of backgrounds with the understanding that they will periodically work the frontlines to keep their skills fresh.
- Anticipate resource gaps and needs: Anticipate losing people and their associated capacity. Build buffering capacity and develop reserves before they are needed. Design reconfigurable teams; this can be implemented by having a larger team that can be split into smaller components depending on need, such as the entire team working one shift or splitting to cover two shifts. Pre-assign tactical reserves to planned work to reduce disturbances caused by emergent work. Tactical reserves could be back-office personnel with appropriate experience: assign them to planned work during peak load (giving them time to prepare), leaving active personnel available to respond to unplanned work, their more current skills enabling them to better handle variable situations.

Ability to learn
- Use questions to trigger learning: Hold After Action Reviews. Ask: What was expected to happen? What actually happened? Where did things not go as expected? What went well, and why? What can be improved, and how? Use case studies designed with questions to provoke thinking; examples can be found at 〈http://www.nationalnearmiss.org/〉.
Table A.2. Practices being implemented at different scales of the system (maturity by scale: organization, outage, individual/team).

- Achievement: Real time risk assessment (outage level).
- Used, under improvement: Lightning round (organization level).
- Under development: Margins of maneuver tactics (organization and outage levels); Human performance coach (individual/team level); 2-Minute drill (individual/team level).
widely within industry. The card shared below was developed by the authors; the questions are designed around areas where process safety risks are significant, to direct attention to areas where it isn’t natural to attend (e.g., look up!). The highest fatality risk in power plant maintenance is electric shock due to failed lock-out/tag-out of electrical systems. Other types of stored energy, such as a hydraulic jack slipping and striking a person, have the potential to cause serious injury or death. When moving a heavy load, attention is naturally directed to the load, but there have been multiple failures at the attachment point of the hook to the crane. Spaces are tight, with multiple tasks occurring at once; people work above and below one another with heavy tools or parts that can be dropped. Critical cues are indicators that the situation is not as expected and that risk could be increased: for example, if a pipe is hot or there is noise when the system is supposed to be locked out, this could indicate that there is still fluid or steam in the pipe; steam can cut off a limb. 2-minute drills support improved monitoring and anticipating, focusing on creating safety (Principle 5: Focus on what we want: to create safety.) through better understanding normal work (Principle 4: Most of our time is spent doing “normal work”.). To create such a tool, near misses are a great place to start. Query both experienced and inexperienced people on where they have been surprised or almost missed something that could have caused harm. What did they notice? What were the cues that something was amiss? What would they warn a new person about? What do they monitor or check for now? Where do they think they are at greatest risk? (Fig. B.1).
5.2.3. Lightning round

Lightning round is a meeting designed to support coordinating, prioritizing, and understanding role contributions to the team [12]. As originally designed, “each person should be able to share their three
Fig. B.1. 2-Minute drill card with tips for pretask briefs and for work, worker, and workplace observations, for the purpose of increasing situation awareness.
Fig. C.1. Originally the meeting was perceived by participants as a debrief for leadership, with information flowing vertically and unidirectionally to departmental leadership.
priorities in 30 s or less. For a team of nine, this part of the meeting should be done in four and a half minutes.” In addition, Lencioni recommends the meeting be held with everyone standing and without interruptions; the output of this meeting can set the agenda and priorities for a weekly tactical meeting in which issues are discussed in more depth. Our implementation of Lightning Round differs from the Lencioni design and is described below. Lightning Round supports obtaining a current, system view (Principle 3: A system view is necessary to understand and manage complex work.) when participants from across an organization take part, as well as coordinating and cooperating to support variable work (Principle 1: Variability and uncertainty are inherent in complex work.) (Figs. C.1 and C.2). Lightning Round is a 30-minute meeting, held every Monday, Wednesday, and Friday morning, in which representatives from each team
Fig. C.2. Over time, Lightning Round drifted to communication flowing horizontally as well as vertically, and to being multi-directional.
provide a short update on their current, most important activities. Updates often revolve around ongoing troubleshooting work, where there is inherent ambiguity about the progress or diagnosis of issues. It is common for smaller groups to convene to work on issues after the main meeting. Lightning Round serves as a point of cooperation and coordination, which improves the ability to respond, monitor, anticipate, and learn. These types of communications take place:

- Share information, e.g., passing along a review of vendor performance to the stakeholder in charge of that vendor in order to rectify the situation.
- Make requests for action(s), for instance shifting financial reporting meetings to meet a new deadline from corporate leadership.
- Make requests for additional coordination, e.g., a one-on-one meeting to resolve a specific issue such as inventory decisions.
- Ask questions, such as: can more resources be deployed to a specific outage in order to complete it more quickly? For instance, when a unit underwent unplanned maintenance while a neighboring unit was already in planned maintenance, the overlap could have stressed the local market; individuals asked what kinds of resources could be deployed to finish the planned maintenance event more quickly in order to prevent the potential stress.
- Discuss newly discovered work scopes and issues, and how these affect the current plan for the specific outage.
- Reinforce corporate leadership priorities.
- Relay the physical locations of assets and personnel.
- Discuss difficulties or lack of flex in the system, for instance a repair shop that is at capacity, so turnaround times will be delayed.
- Offer help.
Improvement opportunities and gaps exist. Status updates are shared in the meeting despite the existence of tools designed to provide them; either these tools are not fulfilling their purpose or redundant information is being stated in Lightning Round, so the opportunity exists to increase Lightning Round efficiency by using these or new tools. Over-reliance on meetings rather than artifacts for status updates has led to misunderstandings and inaccurate assessments of ongoing work and plans (in one example, an engineer was en route to a site, having left another job early, when he heard during Lightning Round that the work was cancelled). People comment that the tools contain inaccurate information, since keeping information current is difficult given the high frequency of flux: people move from outage to outage (including during outages in progress), outage start and end dates move, and scope expands and contracts. Making the effort to keep artifacts up to date would, however, provide an accurate view of system state for the distributed team.

There is no prioritizing of information, nor are there criteria for which information should be shared. Importance is determined by implicit cultural norms and varies by individual. For instance, when a major event happened overnight at a facility, one stakeholder closed his update with the event while another began his update with it. In some healthcare organizations, handovers are designed such that anomalous, complex patient conditions are discussed first, the reasoning being that this timing focuses more attention on these cases. There have been times when important issues were mentioned casually and did not stand out from the list of other updates being shared, and thus were not attended to by the group.

Over time, the number of participants has grown to about 20. There are no criteria for who should attend. Participants include both managers and professionals who implement outages, as well as support organizations. Stakeholders who miss Lightning Round may miss information or may be unable to provide updates that affect other stakeholders; this may explain the growth in the number of attendees, as back-ups attend and keep coming.
5.2.4. Human performance coach

The human performance coach program involves developing safety contractors into expanded roles. Power plant owners spend several hundred thousand dollars per year on safety contractors to oversee outages, yet these contractors produce inconsistent results. Safety contractors monitor hazards and compliance with policies and industry standards, and police unsafe behaviors. A human performance coach, in addition to some traditional safety actions, focuses on creating foresight, anticipating risks, and building skills and knowledge, with the goal of changing culture. The responsibilities of the coach, which were designed with the aid of operators, are based on the human performance model we developed for maintenance work, which references Crew Resource Management (CRM) models from aviation and Flin’s [4] work on nontechnical skills in healthcare. Building on another lesson from aviation (using check pilots as CRM instructors), the program is led by a Human Performance Specialist (a full-time employee) who came from a supervisory position within the operator ranks. The purposes of filling this leadership role with a former supervisor include closing the gap between work as imagined and actual work (Principle 4: Most of our time is spent doing “normal work”.), bringing in knowledge of process risks (Principle 2: Expert operators are sources of reliability.), and increasing the power of the role. Frontline supervisors have more direct influence than any other role on outage safety culture. According to Hopkins [8], “too often safety is understood to be a matter of ‘slips, trips, and falls’, rather than the major hazards that can blow the plant or the rig apart”. Development of the coaches will occur through a combination of formal training and frequent coaching conversations with Human Performance Specialists.

5.2.5. Margins of maneuver

There is always limited capacity. Margin of maneuver represents an important dimension of the adaptive capacity of a system, which in turn enables a system to respond; it is the cushion of potential actions and additional resources that allows the system to continue functioning and adapting despite unexpected demands [22] (Principle 3: A system view is necessary to understand and manage complex work.). Margins of maneuver:
- Exist before disrupting events.
- Show up as adequate or inadequate during disruptions, which means resources that support potential actions may look like excess capacity when not in use.
- Tend to be lost when systems are under pressure.
Developing organizational flexibility, such that limited resources can be shared and the organization reconfigured to meet emerging demands, is one method of increasing margin of maneuver. Tactics to accomplish this include rotational programs that develop people’s ability to hold multiple roles, agreements and practices to share resources, joint strategy between groups on which skill sets to recruit and develop, and staffing back-office support groups with former frontline personnel who can temporarily be redeployed to the frontlines. Shifts in perspective can be brought about by using different language: consider the term “margin of maneuver” as contrasted with “fat in the schedule”. This different interpretation opens opportunities for different actions, such as surfacing and actively managing the extra time in schedules that represents uncertainty, as compared to hiding it by artificially inflating the duration of scheduled activities.
6. Discussion

We have described above various strategies and tactics developed to introduce the Resilience Engineering perspective and RE-inspired practices into our organizations. This section discusses challenges that are, in our view, essential to conducting such change.

6.1. Establishing the conditions for the shifts: gaining buy-in

One of the challenges associated with the introduction of a new perspective and the development of new practices is creating the conditions for their acceptance throughout the organization. The principles and tactics described above indeed require the involvement of people at all levels of the organization: from managers adopting
new views on safety or agreeing to fund new projects, to operators on the frontline directly experiencing the transformation of work situations. This section shares the approach adopted in two different organizations in the same industrial sector to facilitate this process by gaining buy-in.

We essentially implemented a middle-out approach facilitated by our function within the organization, standing between the blunt(er) end (higher-level managers) and the sharp(er) end (frontline operators). The process started by developing a core group composed of our team of engineers. The team approached implementing resilience by learning how to notice brittleness in operations and to design resilient tactics. The team studied Resilience Engineering through an internal weekly study group. It began implementing practices such as a near-miss newsletter, and workshops were held to address specific system problems, such as operational conditions in which the system is potentially overextended due to resource constraints. For example, a particular workshop was held twice to diagnose brittleness and design in resilience when the team recognized it was entering “outage season” without sufficient people. While this constituted a small pebble dropped into a large pond, and progress felt limited (not much was changing and the velocity was too slow), such a process is, in our view, indicative of a characteristic approach to bringing Resilience Engineering into an organization: it starts with local initiative, from the inside, and with initial buy-in from the organization’s management. We did not start with a top-down approach.

Larger success in the organization came from shifting the thinking of powerful “sponsors” who were invited to participate in a Resilience Engineering workshop (featuring several members of the international RE research community).
Such a “sponsor” was identified as: (a) serving in an influential managerial position, which created the conditions for a larger impact if the workshop experience was successful; and (b) presenting a profile of being open to new ideas. Similar invitations were made over time, especially to people in lower positions in the hierarchy, such as operations managers or engineers. Involving people in such positions aimed at shifting the perspective of those who are likely to be in more influential positions in the future, or who are already in a position to use the concepts in their current roles. Their participation in events organized by the Resilience Engineering or High Reliability Organization communities helps spread the new principles of safety and reliability within the organization. Gaining buy-in, especially at the managerial level, therefore comes first from the identification or creation of opportunities within and outside the organization.

With operators, the approach has been to design some new practices and to take existing, traditional practices and shift them in accordance with RE philosophies. In the example of pretask briefs presented above, no new procedure was written and no new tools were created. Instead, it was recognized that each individual has their own style; this style contributes to the authenticity of their message, and this authenticity is necessary for gaining trust. General principles were designed: tell stories, listen, and take the time you need, as the pretask brief is as important as the physical work. As the outage season drew on, tactics to manage fatigue, such as shedding tasks and reprioritizing work, were discussed with the operations team, with a mention that these tactics are in accordance with RE philosophies. RE is thus implemented with no grand announcements, but with a mention of, and design alignment with, its philosophies.

6.2. Shift: leveraging or re-designing completely?
While we were trying to bring about a shift in perspective, and therefore a profound change in culture, we found that existing practices provided significant leverage. Our approach mainly consisted of broadening existing, traditional practices to be more in alignment with RE philosophies (e.g., looking for good performance as well as incidents). One example is the Human Performance Coach role. We
already used safety contractors on outages, but a decision was made to hire a person to develop and manage a human performance coach program in which the responsibilities of these same safety contractors change from policing to looking for risk. A shift in practice followed this shift in principles, and a list of responsibilities was designed in alignment with RE principles.

Our experience suggests that edicts are not effective in the distributed, deep cultures that work organizations represent. The successes we had indicate that small changes seem to be more effective and that innovation is incremental rather than abrupt. Change is indeed shaped by already existing structures and by interactions between people and environment over time (e.g., the fact that runways exist shapes the evolution of airplanes). We hypothesize that this issue is one of the major drawbacks of top-down programmatic approaches: the lack of transition from existing practices makes new practices feel disconnected from existing work.

Our experience in conducting change in two different organizations in the same industrial sector also suggests differences between long-established and less established organizations. In the long-established organization, an entrenched culture of quality and safety organizations, together with beliefs and narratives about safety and quality, created conditions for more inertia and greater challenges in shifting principles and practices. Such conditions further stressed the points made above (getting buy-in, leveraging existing practices). Challenges appeared quite different in the less established organization, in which processes are few, and roles, responsibilities, and structure are not as well defined. What we found there was a safety organization focused on establishing standards and working with a traditional definition of safety (related to injury or the possibility of injury), and no group solely responsible for quality or consistency across the organization.
Also, an entrepreneurial legacy of protecting the autonomy of plants and giving plant leadership much freedom in running their business has resulted in great diversity and variability across plants. In this context, the success of applying standard organizational tools throughout the fleet is quite variable. In an effort to identify, understand, and then address this diversity, we recognized that field practitioners could serve a vital role as “scouts” within our organization, reporting back on the variations they experience within the fleet. We are currently working on establishing processes for capturing this information to assist new members of the organization, and to feed it back to the plants, other departments, and stakeholders to support change. An advantage of conducting these changes in a less established organization is that we operate on a blanker slate and face fewer constraints in introducing new principles and practices.
7. Conclusions

While it is based on decades of theoretical and practical work, Resilience Engineering is still a new field. The question of how to translate the principles, values, and concepts described in the literature into concrete instruments in general, and into tools for specific organizations in particular, remains largely open. We have described in this document our own experience implementing strategies and tactics to introduce Resilience Engineering into two different organizations in the energy sector. We have tried to share our perspective on this work in progress and to reflect on the successes and drawbacks encountered: this is the approach that generated successes for us, and these are the limitations of its impact. Through these lessons learned, and through detailed descriptions of specific tools implemented or in development, our objective is to participate in the critical discussion occurring within the RE community about the practice of resilience. In our experience, the successes came from several key factors:

- The transformations were conducted from the inside, not through an external top-down process. However, progress
was regularly fueled by interactions with the larger research and practice RE community. - As insiders, we followed a middle-out approach that relied on gaining buy-in from all levels of the organization. This was facilitated by our particular position with the organization, since we are standing between the high-level managers (blunt end) and the front line operators (sharp end) and since contacts with both ends constitute our daily activity. - In order to shift organizations’ practices toward ones that are more consistent with RE, it was necessary to purposefully drive a shift in perspective. RE values and principles represent a fundamental shift in how organizations typically think of various dimensions of operations such as variability, complexity, error/accidents, hazard/risk, etc. - Such shift in perspective might be in sharp conflict with other approaches to plan for and conduct operations, or to improve performance and safety in the organization. However, the decades of experience the organization has in using those other approaches can also provide valuable leverage for our purposes. We found that progressively and partially transforming the practices and culture in place was more successful than forcing brand new views. Different traditions can cohabitate and complement each other, and the organization needs to recognize the values and limitations of each for its various purposes. Areas of future work could include bringing together practitioners of resilience engineering from different domains to diagnose, compare, contrast, and share practices and case studies. Healthcare deals with the diagnosis and treatment of human patients while industrial maintenance deals with diagnosis and maintenance of machines. If you look beyond biological vs. metal or physiological vs. physics, you will notice similar pressures, constraints and issues. 
Maintainers even speak of turbines as having a "health," or assign them a gender ("she"), as they troubleshoot issues and prescribe maintenance procedures. Both domains face the diagnosis of complex cases and the urgency of emergent load: in healthcare, emergency room overload; in maintenance, unplanned power plant outages. There is learning both in what works across domains and in what does not transfer from one domain to another. We can learn from implementing specific practices across domains (What needs to change? What are the design parameters? What is fundamental?). What would Real Time Risk Assessments look like when implemented in healthcare? What characterizes resilient organizations in such domains is their ability to prepare for surprises, to adapt in time, and to manage interdependencies between their components. If some of the core issues are analogous, we have an opportunity for mutually beneficial learning.