getting the whole picture
Managing digital incidents – a background
Applying risk management
Peter Stephenson
We took a brief look at the notion of risk in our last column. This time we’ll conduct a similar examination of incident management. An important concept in managing incidents is the term itself: Managing Incidents. Note that the emphasis is not on incident response. The idea of managing digital incidents is built around the notion that incidents pose risk to the organization. Thus, they must be managed proactively and treated as part of the risk profile. We can see, then, that managing risk and managing incidents are very closely aligned. In this column we will begin to address the issue of incident management by setting some background in place. Future columns will expand upon this background.
Current dichotomy in incident management
Today there are several theories of incident response. Every one addresses the process as an after-the-fact issue. For example, the two leading books on incident response (by Schultz & Shumway, and van Wyk & Forno) give two different incident response processes. The confusion is compounded by such players as SANS and CERT at Carnegie Mellon University, each claiming to have the official way to “do” incident response. The bottom line is twofold: (1) everyone in the incident response space is jockeying to be “the expert” on incident response, and (2) everyone in that space treats the problem after the fact – as a clean-up procedure. While clean-up is occasionally required, the real objective needs to be prevention of the problem in the first place. As in the problem of building security into applications, we find that it is many times more expensive to respond to the problem than it is to prevent it.
Being proactive helps address that need. The objective, then, is to make incident management part of the risk management process, simplify management of and response to incidents, and reduce the severity and number of incidents. This allows more time to respond when it becomes necessary and turns down the heat on an incident should one occur.
The dichotomy, of course, is that incidents do happen and one needs a way to respond should it become necessary. So that puts us back at the beginning of the race. The answer, it turns out, is twofold, just as the problem is twofold. First, we need to treat incident management as a part of the risk management process. This is preventative. Second, we need a way to respond should it become necessary.
I teach a two-day seminar on incident management. Every time I teach one (all of them are over-booked, by the way, attesting to the high importance that organizations put on the process) I ask my class why they are there. About 90% of the time one of the reasons given is simplification. Organizations are looking for better ways to manage incidents but they don’t want a complicated process.
We have spent quite a bit of time over the past year discussing risk management. We made the point several times that one manages risk by managing the components of risk. Those components are:
• Threats.
• Vulnerabilities.
• Impacts.
• Countermeasures.
• Inter-domain communications.
We manage incidents the same way. We begin by performing a thorough risk analysis and applying the results to incident management as follows.
Threats
What threats are there against the enterprise, internal and external, that could result in an incident? How do we manage those threats to minimize their ability to cause harm? If a threat agent is successful at delivering a threat against a vulnerability and causing an impact, how should we respond?
Vulnerabilities
What are our exploitable vulnerabilities? How can we manage those vulnerabilities? What happens if a threat agent delivers a threat against one successfully? How should we respond?
These first two components help dictate both our protective and responsive postures. They help us pinpoint where we should place our training and remediation efforts in advance of an incident.
Impacts
What are the probable impacts of attacks against our enterprise (refer to the definitions in last month’s column)? What can we do in advance of an attack to reduce the impact and, in the event of a successful attack, what must we do to contain damage?
Countermeasures
What additional countermeasures are necessary to prevent an impact? If an incident does occur, are there additional countermeasures we should put in place to contain and recover? For example, are there emergency rule sets for routers, switches and firewalls that we should invoke, much like closing watertight doors on a ship that is in danger of sinking?
Inter-domain communications
What communications may be present that could enable an incident? How do we control those channels? What domains are most vulnerable? If we have an incident, how do we contain it?
If we remediate risks with incidents in mind we will reduce our dependence on reactive measures significantly. Viewing risks in isolation only addresses half the problem. Our true objective would be non-existent if there were no successful attacks possible. So attacks, and the incidents they can cause, are the whole reason we perform risk analysis. With that in mind, we need to apply the knowledge we get from risk management to the problem of incident management. Returning to last month, we revisit three definitions:
Definition 2 – Computer Security Incident
A computer security incident is a change of state in a bounded computer system from the desired state to an undesired state, where the state change is caused by the application of a stimulus external to the system.
Definition 7 – Cyber Exploit
A cyber exploit is an attempt to cause a state change in a bounded computer system from the desired state to an undesired state, where the state change is caused by the application of a threat against a vulnerability.
Definition 8 – Cyber Attack
A cyber attack is a collection of related exploits leveled against one target or multiple related targets for the purpose of causing a state change in the target. An attack may contain one or many related exploits. If the attack succeeds, it results in an incident (Definition 2).
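One way to see how the three definitions fit together is to model them as data. The sketch below is illustrative only: the class and function names are hypothetical, and the rule that an attack succeeds when at least one of its exploits succeeds is an assumption made for the example, not something the definitions state.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class State(Enum):
    DESIRED = "desired"
    UNDESIRED = "undesired"


@dataclass
class Exploit:
    """Definition 7: an attempt to apply a threat against a vulnerability."""
    threat: str
    vulnerability: str
    succeeded: bool = False


@dataclass
class Attack:
    """Definition 8: a collection of related exploits against a target."""
    target: str
    exploits: List[Exploit] = field(default_factory=list)

    def resulting_state(self) -> State:
        # Assumption for this sketch: one successful exploit is enough
        # to move the target into the undesired state.
        if any(e.succeeded for e in self.exploits):
            return State.UNDESIRED
        return State.DESIRED


def is_incident(before: State, after: State) -> bool:
    """Definition 2: a change from the desired to an undesired state."""
    return before is State.DESIRED and after is State.UNDESIRED


# Hypothetical example: two related exploits, one of which succeeds.
attack = Attack(
    target="web server",
    exploits=[
        Exploit("password guessing", "weak passwords"),
        Exploit("buffer overflow", "unpatched service", succeeded=True),
    ],
)
print(is_incident(State.DESIRED, attack.resulting_state()))  # prints True
```

Blocking every exploit in the list leaves the target in the desired state, so no incident is recorded.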
Each of these definitions refers to a “state change”. Preventing that state change is the answer to all of our problems. Also, we can see a hierarchy here: an exploit leads to an attack which, if successful, results in an incident. The earlier in that cycle we are able to intervene, the better. Knowing about that cycle is a product of risk management. Addressing it is a product of incident management. Obviously, addressing state changes at the exploit level will prevent or reduce incidents.
We cannot realistically plan for, prevent or proactively address all possible incidents because we cannot know all possible threats and vulnerabilities in advance. Here we take another page from our discussions of risk management. We operate at a higher level of abstraction. This is an application of stratification theory and it says, basically, that since we cannot know all possible threats and vulnerabilities, we must abstract (i.e., stratify) the problem to a level that we can know and control. We have discussed this in the past with regard to risk management and we’ll revisit it in the future. For now, the point is that it is easier to understand threat and vulnerability families than it is to understand and respond to individual threats and vulnerabilities.
When an incident does occur
Given that we have done our best to prepare for incidents, how do we handle the inevitable ones? First, if we have done our proactive work well, we will have fewer incidents and the ones we have generally will proceed at a more leisurely pace. That allows us a bit of extra time to put the fire out. However, how we address the fire makes a very big difference in how bad the damage is. Looking at virtually all of the incident response methodologies, we see that they break down into a very simple (remember the goal of simplification?) four-step, or four-stage, process:
1 Interdiction: Stopping or interrupting the incident
2 Containment: Isolating damage and preventing it from spreading
3 Recovery: Returning the business to the pre-incident state
4 Analysis: Post-incident root cause analysis (post-mortem)
We can break incidents into four basic types:
• Penetration.
• Fraud.
• Denial-of-service.
• Virus/worm infection.
We deal with each of these in specific ways addressing the four stages of an incident.
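The four stages and four incident types can be sketched as a small data model that tracks how far a given incident has progressed. This is a minimal illustration under stated assumptions: the names `Stage`, `IncidentType` and `IncidentRecord` are hypothetical, and treating the stages as an ordered checklist is a simplification, since in practice analysis often overlaps containment and recovery.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    # Values are the stage descriptions from the four-stage list.
    INTERDICTION = "Stopping or interrupting the incident"
    CONTAINMENT = "Isolating damage and preventing it from spreading"
    RECOVERY = "Returning the business to the pre-incident state"
    ANALYSIS = "Post-incident root cause analysis (post-mortem)"


class IncidentType(Enum):
    PENETRATION = "penetration"
    FRAUD = "fraud"
    DENIAL_OF_SERVICE = "denial-of-service"
    VIRUS_WORM = "virus/worm infection"


@dataclass
class IncidentRecord:
    kind: IncidentType
    completed: List[Stage] = field(default_factory=list)

    def complete(self, stage: Stage) -> None:
        # Stages may be closed out of order; each is recorded once.
        if stage not in self.completed:
            self.completed.append(stage)

    def next_stage(self) -> Optional[Stage]:
        # First stage, in listed order, that has not been completed.
        # None means all four stages are done.
        for stage in Stage:
            if stage not in self.completed:
                return stage
        return None


record = IncidentRecord(IncidentType.DENIAL_OF_SERVICE)
record.complete(Stage.INTERDICTION)
print(record.next_stage().name)  # prints CONTAINMENT
```

Keeping the record this small is deliberate: it fits the goal of simplification, and the type-specific actions can be attached to each stage separately.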
Penetration
• Purposes
  • Data theft.
  • Extortion.
  • Joy riding.
  • Web defacement.
• Incident management preparation
  • Preventative and detective controls.
• Managing penetration incidents
  • Interdiction – terminate connection if on-line, notify law enforcement as appropriate, launch internal investigation.
  • Containment – locate root kits, compromised accounts, etc. and correct vulnerability that allowed initial penetration.
  • Recovery – analyze damage and respond.
  • Analysis – generally part of the containment and recovery stages, formal incident post mortem at completion of incident.
Fraud
• Incident management preparation
  • Preventative and detective controls.
• Managing fraud incidents
  • Interdiction – terminate connection if on-line, notify law enforcement as appropriate, launch internal investigation.
  • Containment – locate root kits, compromised accounts, etc. and correct vulnerability that allowed initial penetration.
  • Recovery – analyze damage and respond.
  • Analysis – generally part of the containment and recovery stages, formal incident post mortem at completion of incident.
Denial of service
• Incident management preparation
  • Harden and test perimeter.
  • Relationship with ISP.
• Managing denial-of-service incidents
  • Interdiction – terminate connection via ISP backbone routers.
  • Containment – if necessary, terminate Internet connection until attack subsides, isolate vulnerable/critical assets temporarily.
  • Recovery – analyze damage and respond.
  • Analysis – formal incident post mortem at completion of incident.
Virus or worm attack
• Incident management preparation
  • Harden and apply defense in depth.
  • Use security policy domains.
  • Relationship with ISP.
• Managing virus and worm incidents
  • Interdiction – terminate connection via ISP backbone routers to stop incoming worm or virus attack.
  • Containment – isolate infected security policy domains temporarily.
  • Recovery – analyze damage and respond, don’t re-open infected domains until all domains have been cleared.
  • Analysis – formal incident post mortem at completion of incident.
Remember that these outlines are just that: outlines. There is no one-size-fits-all in incident management. Each incident is unique and each organization is unique. You must tailor your incident management plan to your organization.
When we must face an incident, we need to do several things. Preparation for these things is critical to the successful outcome of any incident response. The first step, of course, is to have a plan. All of the incident response experts say this, but there is a bit more to it than that. One important point that is generally missed is that everyone who could be touched by a digital incident is, de facto, a part of the incident management team. That means that, although the core team may be a small group of experts, the whole team contains everyone who could possibly affect or be affected by an incident. For example, having an early warning system in place buys time. Who is most likely to be the first to recognize that something is wrong? Note that we didn’t say “recognize an incident”. That is not necessary. It is only necessary to note that things aren’t going as they usually do and raise the alarm. When that happens, what is the reporting process?
Second, we in the IT world always seem to believe that we have to reinvent the wheel. In this case that means that we have created our own approach to incident management when there are plenty of tried and true approaches in other disciplines. The notion of emergency response teams is well established. The second reason my classes always give for being there is that they want to merge their IT response plan with their general response plan. And why not? Does it not make sense simply to have a response plan that covers all sorts of contingencies? That means that we need to take a few pages from the emergency management book. One of those is command and control.
Building a command and control architecture lets you put the experts in a central location to manage the incident while specialists are in the field gathering and isolating data, making command-directed repairs and reporting to a central location. That way, we manage the incident instead of the incident managing us. Breaking the response team into command and control, first responders and field responders is important.
Command and control manages the incident, analyzes global data and directs the activities in the field. It should be located in a central location that has immediate and direct visibility of the enterprise. The network operations center is a good location.
First responders have interdiction as their primary goal. They are first on the scene with the technical ability to respond rapidly and “stop the bleeding”. They usually are system administrators and other first-line support.
Field responders are the experts in the field who work with the command center to manage the incident. They are the command center’s eyes, ears and hands. First responders may also, by virtue of their expertise, be field responders.
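The three-part team structure described above can be written down as a simple roster in which one person may hold more than one role. The sketch below is only an illustration: the `Role`, `Member` and `members_with` names and the example roster entries are hypothetical, and a real plan would also carry contact details and a reporting process.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import FrozenSet, List


class Role(Enum):
    COMMAND_AND_CONTROL = auto()  # manages the incident, directs the field
    FIRST_RESPONDER = auto()      # primary goal is interdiction
    FIELD_RESPONDER = auto()      # eyes, ears and hands of the command center


@dataclass(frozen=True)
class Member:
    name: str
    roles: FrozenSet[Role]  # a set, because roles can overlap


def members_with(team: List[Member], role: Role) -> List[str]:
    """Return the names of everyone on the team who holds the given role."""
    return [m.name for m in team if role in m.roles]


# Hypothetical roster for illustration only.
team = [
    Member("incident manager", frozenset({Role.COMMAND_AND_CONTROL})),
    Member("system administrator",
           frozenset({Role.FIRST_RESPONDER, Role.FIELD_RESPONDER})),
    Member("forensic specialist", frozenset({Role.FIELD_RESPONDER})),
]

print(members_with(team, Role.FIRST_RESPONDER))
print(members_with(team, Role.FIELD_RESPONDER))
```

Modeling roles as a set reflects the point that first responders may also serve as field responders.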
Conclusion
In this column we have addressed the skeleton of incident management. Next month we’ll put a bit of meat on the bones and work through a mock incident to demonstrate the process.