Multi-objective optimization of IT service availability and costs


Author’s Accepted Manuscript Multi-Objective Optimization of IT Service Availability and Costs Sascha Bosse, Matthias Splieth, Klaus Turowski

www.elsevier.com/locate/ress

PII: S0951-8320(15)00331-2
DOI: http://dx.doi.org/10.1016/j.ress.2015.11.004
Reference: RESS5442

To appear in: Reliability Engineering and System Safety

Received date: 30 April 2015
Revised date: 15 October 2015
Accepted date: 7 November 2015

Cite this article as: Sascha Bosse, Matthias Splieth and Klaus Turowski, Multi-Objective Optimization of IT Service Availability and Costs, Reliability Engineering and System Safety, http://dx.doi.org/10.1016/j.ress.2015.11.004

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Multi-Objective Optimization of IT Service Availability and Costs

Sascha Bosse (a,∗), Matthias Splieth (a), Klaus Turowski (a)

(a) Magdeburg Research and Competence Cluster for Very Large Business Applications, Faculty of Computer Science, Otto von Guericke University Magdeburg, P.O. Box 4120, 39016 Magdeburg, Germany

Abstract

The continuous provision of highly available IT services is a crucial task for IT service providers in order to fulfill service level agreements with customers. Although the introduction of redundant components increases availability, the associated cost may be very high. Therefore, decision makers in the IT service design stage face a trade-off between cost and availability in order to define suitable service level objectives. Although this task can be seen as a redundancy allocation problem, the existing definitions in this area are not transferable to IT service design due to the assumption of independent component failures, which has been identified as unrealistic in IT systems. In this paper, a multi-objective redundancy allocation problem for IT service design is defined. To this end, a Petri net Monte Carlo simulation is developed that estimates the availability and costs of a specific design. In order to provide (sub)optimal solutions to an IT service redundancy allocation problem, two meta-heuristics, namely a genetic algorithm and tabu search, are adapted. The approach is utilized to optimize the IT service design of an application service provider in terms of availability and cost to demonstrate its feasibility and suitability.

Keywords: IT Service Management, Availability, Reliability, Redundancy Allocation Problem, Cost Optimization

1. Introduction

The importance of IT services is ever increasing. On the one hand, trends such as Cloud Computing bring millions of consumers in contact with IT services. On the other hand, even internal IT organizations are commonly understood as IT service providers in order to effectively manage costs and the business value of IT [1]. Service or Operational Level Agreements (SLAs/OLAs)¹ document the quality of service that is to be expected by an IT service consumer, for instance, for the service availability [2]. Availability is one of the most crucial quality aspects for customers [3], and can be defined as the likelihood that a service is able to provide its function at a certain point in time [4]. Although service availability is seen “at the core of customer satisfaction and business success” [5, p. 127], even the big IT companies are suffering severe service disruptions that last hours or even days (e.g. Amazon [6], Apple [7] and Microsoft [8]). In August 2013, a five-minute inaccessibility of the Google services led to a 40 % decrease in internet traffic and an estimated revenue loss of over US-$ 500,000 for Google alone [9]. However, smaller enterprises are also affected by unavailability: 134 companies that have been studied by the Aberdeen Group each suffered on average a revenue loss of more than one million US-$ in 2012 due to IT downtime [10].

∗ Corresponding author. Email addresses: [email protected] (Sascha Bosse), [email protected] (Matthias Splieth), [email protected] (Klaus Turowski)
¹ In the following, the term SLA(s) is used for both SLAs and OLAs.

Preprint submitted to Reliability Engineering & System Safety, November 14, 2015

The two basic approaches to increase system availability are the introduction of more reliable components and the implementation of redundancy mechanisms [11]. However, the associated cost of these approaches may not be justified by the availability improvements. In addition to this, the special characteristics of software components limit their reliability [12]. The balancing of availability and cost with respect to desired or existing service level objectives is one of the core activities in IT Service Management, and is described in well-known frameworks such as the ISO 20000 (Service Continuity and Availability Management) [13], CoBIT 5 (Managing Availability and Capacity) [14] and the IT Infrastructure Library (ITIL) (Availability Management).

In the 2011 version of ITIL, service availability is characterized as an essential service quality attribute immediately influencing customer satisfaction [5]. Additionally, SLA violations in the operation phase can lead to penalty costs and loss of reputation for the IT service provider [15]. Since design changes due to insufficient service quality in the operation phase (reactive measures) can be very costly, measures to achieve sufficient IT service availability should be considered in the service design stage (proactive measures) [5, 16]. Nevertheless, the lack of feasible supporting tools for high availability design is also noticed [5]. This is mainly caused by the fact that classical analytical availability/reliability models that have been successfully applied in other domains assume independent component failures [17]. This assumption is unrealistic in modern IT systems due to the presence of inter-component dependencies, thus rendering results obtained from these models useless for decision support [18].

Examples of these inter-component dependencies are common cause failures and imperfect switching. In the former case, even heterogeneous components can be subject to the same fault under certain conditions [12]. Imperfect switching describes the phenomenon that a redundant component may not cover the failure of an active component due to problems in the switching process [19].

In addition, operator interaction may be required for component recovery or switching, leading to the fact that operator errors are a major cause for IT service unavailability [20, 21]. However, these errors are not represented in classical availability models.

In order to provide a general modeling approach that overcomes the independent failure assumption, several approaches that are applicable for IT service availability estimation from design information were recently developed. The majority of these approaches are based on the modeling of the availability state-space that allows for the introduction of dependencies [22]. However, the capability of these approaches for decision support in availability management is questionable since these approaches require a high modeling effort and were never integrated with optimization procedures in order to suggest (sub)optimal design configurations.

Such approaches have been developed in the context of redundancy allocation problems (RAPs). Under this term, several reliability/availability models and optimization procedures are subsumed that can be utilized to optimize system design in terms of availability, cost and other constraints. To this end, required subsystems and possible component choices for each subsystem are modeled so that (sub)optimal combinations of component choices can be identified. Since the combinatorial computation of availability in these approaches assumes independent component failures, they are not applicable for IT service design. Nevertheless, the developed optimization procedures are mainly based on flexible meta-heuristics such as evolutionary algorithms, which can be applied to a wide range of problems. Therefore, the question of whether or not these procedures can be integrated with IT service availability estimation methods for IT service design optimization arises.

The goal of this work is to provide decision support for IT service designers. To this end, a redundancy allocation problem is defined that models the relevant aspects of IT
service availability and costs depending on possible IT service designs. In the course of the paper, this problem is referred to as the ITRAP. In combination with a suitable IT service availability estimation method and solution algorithm, the ITRAP can be utilized to optimize availability and costs of an IT service based on design information.

A constructivist approach is followed in order to achieve this goal (cf. e.g. [23, 24]). In Section 2, the related literature is presented to outline the relevance of the investigated problem. In this section, suitable approaches in the topics of IT service availability estimation and redundancy allocation optimization in order to develop a RAP for IT service design are identified as well. Based on the literature analysis, requirements for a RAP for IT service design are derived.

These requirements as well as the ITRAP artifact are presented in Section 3. This artifact is a framework designed in order to reach the goal of this work and consists of the problem definition, an availability and costs estimation method based on Petri net Monte Carlo simulation, and two adapted solution algorithms (a genetic algorithm and a tabu search). The artifact is evaluated by applying a prototypical implementation of the ITRAP to an IT service design optimization in a real-world use-case, an international application service provider, which is presented in Section 4. Section 5 concludes the article by discussing the contribution of the paper as well as by providing an outlook to further research activities.

2. Related Work

2.1. The Redundancy Allocation Problem

In reliability/availability optimization, a redundancy allocation problem (RAP) can be utilized to determine suitable redundancy configurations for systems design. To this end, it is instantiated with system design information, e.g. about reliability characteristics of possible component choices for required functional units of a system. On that basis, optimization algorithms identify (sub)optimal solutions in terms of availability, cost and other constraints.

The first RAPs were defined during the 1960s and were mostly solved by exact solution methods such as linear or dynamic programming. In the last 25 years, more complex definitions were established in order to provide a more realistic model of the investigated systems. Due to the increased complexity of the optimization problem, more efficient solution algorithms, especially meta-heuristics, were applied. In the following, the development of RAP definitions and solution algorithms is briefly sketched. For more information on the RAP topic, one may refer to the literature reviews in [25–27].

Definitions. In 1962, Kettelle was one of the first researchers to describe and solve an optimization problem in which cost is to be minimized subject to an availability constraint [28]. The term redundancy allocation problem was first used by Fyffe et al. in 1968 [29]. Researchers utilized RAPs to maximize availability/reliability (e.g. in [29, 30]), to minimize cost (e.g. in [28, 31, 32]), as well as for the multi-objective optimization of availability and cost (e.g. in [33–38]). In 1992, Chern et al. defined a RAP, which they proved to be an NP-hard optimization problem [39]:

(i) A system consists of s required subsystems (series-parallel system).

(ii) In a subsystem, a number of functionally equal components can be used in active redundancy.

(iii) A subsystem component can either be working or failed (binary-state); failures are independent and identically distributed in a subsystem (homogeneous redundancy).

(iv) The reliability of the system is to be maximized subject to linear constraints such as cost, weight, or volume.
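Under assumptions (i) to (iii), system reliability has the familiar closed form R = ∏ᵢ (1 − (1 − rᵢ)^nᵢ), where rᵢ is the reliability of one component in subsystem i and nᵢ the number of parallel components. The following Python sketch illustrates this classical, independence-based RAP with a single linear cost constraint as in (iv). It is only an illustration: the function names, the exhaustive search and all numbers are assumptions made here and are not taken from [39].

```python
from itertools import product
from math import prod


def series_parallel_reliability(component_reliability, redundancy_level):
    """Reliability of a series system of parallel subsystems.

    Assumes independent, identically distributed component failures per
    subsystem, i.e. assumptions (i) to (iii) of the classical definition.
    """
    if len(component_reliability) != len(redundancy_level):
        raise ValueError("one redundancy level per subsystem is required")
    # A subsystem fails only if all of its n parallel components fail.
    return prod(1.0 - (1.0 - r) ** n
                for r, n in zip(component_reliability, redundancy_level))


def total_cost(unit_cost, redundancy_level):
    """Linear cost constraint: sum of component unit costs."""
    return sum(c * n for c, n in zip(unit_cost, redundancy_level))


def best_allocation(component_reliability, unit_cost, budget, max_redundancy=4):
    """Exhaustive search; only feasible for very small problem instances."""
    best = None
    levels_per_subsystem = range(1, max_redundancy + 1)
    for levels in product(levels_per_subsystem, repeat=len(unit_cost)):
        if total_cost(unit_cost, levels) > budget:
            continue  # violates the linear cost constraint
        reliability = series_parallel_reliability(component_reliability, levels)
        if best is None or reliability > best[0]:
            best = (reliability, levels)
    return best


# Hypothetical instance: two subsystems, budget of 6 cost units.
result = best_allocation([0.9, 0.8], unit_cost=[2, 1], budget=6)
```

The exhaustive search grows exponentially with the number of subsystems, which is in line with the NP-hardness result cited above and motivates the heuristic solution algorithms discussed in the remainder of this section.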

In the subsequent years, this definition was further extended in order to provide a more realistic problem. Some literature examples for introduced characteristics are presented in Table 1.

| Characteristic | Description | References |
|---|---|---|
| Heterogeneous redundancy | Subsystem components may have different failure distributions | [33, 34, 40–42] |
| Passive redundancy | Decreased failure rate for passive components | [38, 43–45] |
| Complex design | Subsystems can be arranged hierarchically/arbitrarily | [46–48] |
| Multi-state components | Modeling of performance degradation | [31, 49, 50] |
| Uncertainty | Stochastic [44, 51], fuzzy [11, 37, 52], fuzzy-random [53] and interval [43, 47, 54] input parameters | (inline) |

Table 1: Included characteristics in RAP definitions with exemplary references.

Nevertheless, only a few works could be identified in the RAP literature that deal with failure dependencies and, therefore, are not using combinatorial approaches to estimate system availability. In [12], Chi and Kuo modeled common cause failures in software systems by introducing an additional subsystem-critical component, the common cause component. Other dependencies such as operator interaction or imperfect switching were not considered. Lins and Droguett modeled repair policies and failure-repair-cycles in a RAP, and used alternating renewal processes in combination with discrete-event simulation to compute system availability [34]. Limited operator resources as well as other tasks, for instance the activation of standby-redundant components (takeover), were not modeled.

Solution Algorithms. In the first decades since their introduction, the defined RAPs were mostly solved by mathematical methods such as linear, dynamic, or non-linear programming (e.g. [28, 29]). However, with increasing problem complexity, these approaches are very inefficient unless the search space is massively restricted [40]. Therefore, mathematical methods are only applicable to small-sized problems [55]. Besides the mathematical programming approaches, Soltani identified heuristics and meta-heuristics [27] as alternate solution methods.

On the one hand, heuristics are developed for a specific problem and are hardly transferable [27]. Meta-heuristics, on the other hand, are heuristics that can be applied to arbitrary optimization problems, which makes them very popular. Since these approaches are mostly evolutionarily inspired, they are based on artificial reasoning instead of classical mathematics [25]. In general, such an algorithm performs a directed search through the search space while considering several solutions. In doing so, both the exploitation of promising areas in the search space and the exploration of the whole search space are conducted. A solution’s degree of feasibility is represented by its fitness value, which is the major input for an evolutionary algorithm’s operations that adapt solutions.

Soltani identified the following classes of meta-heuristics in RAP research [27] that are presented with exemplary papers in the following:

– Genetic algorithm (GA) [34, 38, 40, 43, 45, 47, 56, 57],
– Tabu search (TS) [35, 49, 55],
– Particle swarm optimization (PSO) [37, 58],
– Simulated annealing (SA) [46, 59],
– Ant colony optimization (ACO) [41],
– Honey bee mating algorithm (HBMO) [30],
– Artificial bee colony (ABC) [36, 52],
– Harmony search (HS) [32, 60],
– Immune-based algorithm (IA) [42, 61] and
– Cuckoo search (CS) [62, 63].

Although the proposed solution algorithms are all evolutionary algorithms based on the same principles, the performance of the search processes can vary widely. Most approaches are evaluated by using standard examples from literature, for instance Nakagawa’s and Miyazaki’s 33 problems presented in [64], and comparing an algorithm’s performance to other results from literature (e.g. in terms of solution quality or computational complexity). However, the question under which conditions an approach is superior to other ones can only be answered if the approaches are compared in a wide range of numerical problems [25].

2.2. Estimating IT Service Availability

Availability estimation methods for IT services can be classified as qualitative and quantitative approaches as well as black-box and white-box approaches [20]. While qualitative approaches such as expert interviews are rather subjective and hardly transferable [19], quantitative black-box or data-based methods utilize availability data, e.g. of monitoring tools, in order to estimate future service availability quantitatively. However, these approaches require suitable data sources that may not be accessible in the service design stage. Therefore, the internal structure and composition of a (software) system should be used as the input to a quantitative estimation method in this phase, leading to white-box or analytical approaches [65].

These analytical approaches can be further distinguished by the underlying computation model in combinatorial and state-space-based methods [22]. In combinatorial models, the component availability A(c) is computed from the mean time to failure (MTTF) as well as the mean time to recovery (MTTR). System availability can be computed from components’ availability using probability theory (cf. e.g. [66]).

Combinatorial approaches are very fast and easy to apply and are, therefore, mostly used for redundancy allocation problems. However, their applicability for IT service availability estimation is limited due to the assumption of independent components. Therefore, complex dependencies of modern IT systems such as imperfect coverage or standby systems cannot be modeled [17, 22].

Those dependencies can be mapped by using a state-space approach in which all possible system states and the transition probabilities/rates between them are modeled. Markov chains model the state-space directly (e.g. in [19, 67–69]), but suffer from the problem of state-space explosion, which leads to problems in construction, storage and solution, especially in large-scale applications [70].

The combination of system-level combinatorial models and component-level state-space models (hierarchical approach) can reduce this problem significantly (e.g. in [22]). Another alternative is the encoding of the state-space in a Petri net [17], which can be solved by Monte Carlo simulation in order to avoid problems regarding state-space explosion (e.g. in [17, 70–72]). Simulation approaches also have the advantage that they allow for dynamic analysis [73], which can support IT service management more effectively [74]. The increased time-consumption of simulation techniques can be reduced by the parallelization of the independent replications, making Monte Carlo simulation feasible for real-world problems [70].

In [20], the author develops a hierarchical Petri net Monte Carlo simulation approach for predicting IT service availability. The model includes limited operator capacities, operator errors, series-parallel systems, arbitrary time to failure and recovery distributions as well as standby redundancy mechanisms. This approach is extended in [75], in which an interface for generic inter-component dependencies is defined. Since this approach is very flexible and scalable to real-world problems, the availability estimation method that is used to evaluate designs for the ITRAP is based on this work.
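To make the contrast between the combinatorial and the simulation-based view concrete, the following Python sketch estimates the availability of a single repairable component in both ways. The combinatorial value uses the standard steady-state relation A = MTTF / (MTTF + MTTR). The simulation averages independent replications and accepts arbitrary time-to-failure and time-to-recovery distributions. This is a deliberately simplified illustration and not the Petri net simulation of [20, 75]; all function names, distributions and parameter values are assumptions chosen for the example.

```python
import random


def steady_state_availability(mttf, mttr):
    """Combinatorial component availability from mean times."""
    return mttf / (mttf + mttr)


def simulate_availability(draw_ttf, draw_ttr, horizon, replications, seed=0):
    """Monte Carlo estimate of the average availability over [0, horizon].

    draw_ttf and draw_ttr are callables that take a random.Random instance
    and return one time to failure or one time to recovery, respectively.
    """
    rng = random.Random(seed)
    availability_sum = 0.0
    for _ in range(replications):  # replications are independent
        clock, uptime = 0.0, 0.0
        while clock < horizon:
            ttf = draw_ttf(rng)
            uptime += min(ttf, horizon - clock)  # count uptime within horizon only
            clock += ttf
            if clock >= horizon:
                break
            clock += draw_ttr(rng)  # component is down while being recovered
        availability_sum += uptime / horizon
    return availability_sum / replications


# Hypothetical component: exponential failures (mean 100 h) and
# uniformly distributed recovery times between 1 h and 3 h (mean 2 h).
analytic = steady_state_availability(100.0, 2.0)
estimate = simulate_availability(
    draw_ttf=lambda rng: rng.expovariate(1 / 100.0),
    draw_ttr=lambda rng: rng.uniform(1.0, 3.0),
    horizon=10_000.0,
    replications=200,
)
```

For a single independent component both values should agree closely. The simulation only becomes necessary once dependencies, limited operators or standby mechanisms are introduced, because these have no simple closed-form counterpart. Since the replications do not share state, they could be distributed over several processes, which corresponds to the parallelization argument made above.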

3. A Multi-Objective RAP for IT Service Design (ITRAP)

3.1. Requirements Analysis

A RAP for IT service design can be described by the optimization problem in Equation 1: for a given time t, designs X have to be found in which the costs of the service C(X, t) are minimized while the service availability A(X, t) is maximized. A multi-objective problem is chosen since it is more flexible with respect to resource constraints [25], and thus can support the definition of feasible and cost-efficient service-level objectives for availability in the design stage.

$$\min C(X, t) \;\wedge\; \max A(X, t) \qquad (1)$$

With respect to scientific literature in RAP and IT service availability estimation research, functional requirements for a RAP for IT service design could be defined. These are presented in Table 2. The requirements are structured in four groups according to their scope, namely requirements for IT service component models, for the relation of components to the service (dependencies), for the human operators as well as for the associated cost.

| Group | ID | Requirement |
|---|---|---|
| R1 IT service components | R1.1 | Repairable components |
| | R1.2 | Different fault types |
| | R1.3 | Performance degradation |
| | R1.4 | Arbitrary time distributions |
| R2 IT service dependencies | R2.1 | Parallel-series system |
| | R2.2 | Heterogeneous redundancy |
| | R2.3 | Standby dependencies |
| | R2.4 | Inter-component dependencies |
| R3 IT service operators | R3.1 | Operator tasks |
| | R3.2 | Limited operators |
| | R3.3 | Human errors |
| R4 IT service costs | R4.1 | Capital costs |
| | R4.2 | Power costs |
| | R4.3 | Operator costs |
| | R4.4 | Recovery costs |

Table 2: Functional requirements of a RAP for IT service design.

While the first three groups of requirements refer mainly to IT service availability (R1-R3), R4 refers to IT service costs. In order to estimate the service availability, first the component availability has to be modeled (R1). IT service components can be hardware, software, network and infrastructure components [19] as well as other supporting IT services [76]. Since availability depends on the failure as well as the recovery time, a component must be repairable (R1.1) [34].

In addition, a component may be affected by more than one type of fault (R1.2). In general systems, one can distinguish between transient (short-time), intermittent (frequent after first occurrence) and permanent faults (need for replacement) [77]. The complexity of software systems leads to specific software fault types such as bohrbugs (easily reproduced and fixed), mandelbugs (complex cause, seemingly chaotic behavior), heisenbugs (non-reproducible when isolated) and age-related bugs [78]. This last bug type is characterized by the phenomenon of performance degradation (R1.3), which can be mapped by modeling components as multi-state machines [31]. Since poor IT service performance can lead to unavailability, the concept of performability has to be considered in an availability estimation [75].

The assumption of exponentially distributed transition times that is made in order to apply pure Markov approaches is unrealistic, especially for recovery times [34, 67, 68]. This means that arbitrarily distributed transition times are a requirement for a RAP for IT service design (R1.4).

In order to compute IT service availability from component availabilities, modeling of the IT service dependencies is required (R2). This includes the classical series-parallel system (R2.1) in which different functionally equivalent components can be used that differ in their availability or cost characteristics (R2.2). The introduction of heterogeneous redundancy increases the realism of the RAP [40], and is possibly even required to reach high availability [50].

Standby redundancy mechanisms are also required to be included in the RAP (R2.3) due to their influence on IT service availability and costs: a standby redundant component can have a much lower failure rate [44, 45] and reduced energy costs [79]. These effects as well as the time until a standby component may take over for a failed or degraded active component depend on the redundancy type, which can be e.g. hot, warm or cold standby [38, 45]. These types differ in the standby component’s initial state, which can be a full operation without load (hot), a near-operational state (warm) or a completely deactivated component (cold).

However, the takeover process of a standby component can be error-prone, which is called imperfect switching [19, 59]. In order to map such dependencies as well as complex system designs as described e.g. in [47], generic inter-component dependencies are another requirement of the RAP for IT service design (R2.4).

The modeling of operators (R3) is essential for IT service availability estimation since they may be responsible for maintenance, recovery as well as takeover tasks (R3.1) [20]. However, operators are limited resources (R3.2) that process tasks according to their priority [19]. Due to the complexity of operator interaction, operator errors are one of the major sources of IT service unavailability, as indicated in [80, 81]. Therefore, operator errors (R3.3) should be incorporated in service availability models [21].

The costs of an IT service (R4) can be classified as capital and operational expenses [82]. Capital costs (R4.1) are initial investments for the IT service and are traditionally modeled in RAPs as the sum of components’ initial costs. However, costs for data center site construction may also arise [82]. The operational expenses are mainly the sum of power (R4.2), operator (R4.3) [82] and recovery costs (R4.4) [34]. The power costs of components and cooling infrastructure are great cost drivers for data centers [83]. However, studies such as [84] show that the cooling costs are directly proportional to component power costs.

In the following, the ITRAP definition is presented that matches these requirements.

3.2. ITRAP Definition

Let $ITRAP_{\Delta t} = (S, demand, wage_{\Delta t}, p_{err})$ be a RAP for IT service design defining subsystems $S$, demand level as needed service performance ($demand \in \mathbb{R}_{>0}$), the operator wage for a timestep $wage_{\Delta t}$, and the probability of operator errors $p_{err} \in [0, 1]$. $S = \{s_1, \ldots, s_n\}$ is the set of subsystems with $s_i = \{x_{i1}, \ldots, x_{im}\}$ denoting a set of different components. A component $x_{ij} = (Z, per, pow_{\Delta t}, c_{init}, R_{ij}, T_{ij})$ is characterized by its states $Z$, the component performance function depending on the current state $per : Z \to \mathbb{R}_{\geq 0}$, the power costs function $pow_{\Delta t} : Z \to \mathbb{R}$, the initial costs $c_{init}$ as well as sets of redundancy types $R_{ij}$ and state transitions $T_{ij}$.

A redundancy type $r = (z_0, per_{min}, TTA, \omega, \rho)$ defines the component’s initial state $z_0$, the minimal subsystem performance $per_{min}$ before the component is activated to full operation, a random variable $TTA$ for the time to activation and the Boolean value $\omega$ determining if operator interaction is needed for activation as well as the task priority $\rho$. A transition $tr = (z_{start}, z_{end}, TT, \omega, \rho, c_{tr}, D_{tr})$ describes the process of a state change from $z_{start}$ to $z_{end}$ with the random variable $TT$ representing the transition time, possibly also requiring operator interaction with associated task priority. When a transition takes place, transition costs $c_{tr}$ are created (e.g. recovery costs). $D_{tr} = (d, x, p)$ represents a set of generic dependencies with a function $d : Z_x \to Z_x$ that changes a state of a component $x$ with probability $p \in [0, 1]$.
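The tuples of the ITRAP definition can be written down as plain data structures, which may help when instantiating the problem for a concrete service. The following Python sketch is one possible representation and not the authors' implementation. Class and field names, the `name` label on components, the use of string-valued states, the dictionary encoding of the per and pow functions and all example values are assumptions introduced here for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A sampler draws one random duration, e.g. a transition time TT.
Sampler = Callable[[random.Random], float]


@dataclass
class RedundancyType:
    """r = (z_0, per_min, TTA, omega, rho)."""
    initial_state: str            # z_0
    min_performance: float        # per_min
    time_to_activation: Sampler   # TTA
    needs_operator: bool          # omega
    priority: int                 # rho


@dataclass
class Dependency:
    """One generic dependency (d, x, p)."""
    effect: Callable[[str], str]  # d: maps a state of x to a state of x
    target: str                   # label of the affected component x
    probability: float            # p


@dataclass
class Transition:
    """tr = (z_start, z_end, TT, omega, rho, c_tr, D_tr)."""
    start: str
    end: str
    transition_time: Sampler
    needs_operator: bool
    priority: int
    cost: float
    dependencies: List[Dependency] = field(default_factory=list)


@dataclass
class Component:
    """x_ij = (Z, per, pow, c_init, R_ij, T_ij); `name` is an added label."""
    name: str
    states: List[str]
    performance: Dict[str, float]  # per: state -> performance >= 0
    power_cost: Dict[str, float]   # pow: state -> power cost per timestep
    initial_cost: float
    redundancy_types: List[RedundancyType]
    transitions: List[Transition]


@dataclass
class ITRAP:
    """ITRAP = (S, demand, wage, p_err)."""
    subsystems: List[List[Component]]
    demand: float
    wage: float
    p_err: float

    def __post_init__(self) -> None:
        if self.demand <= 0:
            raise ValueError("demand must be positive")
        if not 0.0 <= self.p_err <= 1.0:
            raise ValueError("p_err must lie in [0, 1]")


# Hypothetical example: one subsystem with a single two-state server.
active = RedundancyType("up", 0.0, lambda rng: 0.0, False, 0)
failure = Transition("up", "down", lambda rng: rng.expovariate(1 / 1000.0), False, 0, 0.0)
repair = Transition("down", "up", lambda rng: rng.uniform(1.0, 4.0), True, 1, 50.0)
server = Component("server", ["up", "down"], {"up": 1.0, "down": 0.0},
                   {"up": 0.2, "down": 0.0}, 3000.0, [active], [failure, repair])
problem = ITRAP(subsystems=[[server]], demand=1.0, wage=30.0, p_err=0.05)
```

Representing the random variables TTA and TT as sampling functions keeps the structure open for arbitrary distributions, in line with requirement R1.4. Only the two range conditions stated in the definition are validated here.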

In a subsystem $s_i$, $\Psi_i = \{(x_{ij}, r) \mid x_{ij} \in s_i, r \in R_{ij}\}$ is the set of all possible component choices. A sequence $\psi_i = ((x, r)_k)_{k \in \mathbb{N}^+}$ with $(x, r)_k \in \Psi_i$ defines the component choices for a subsystem with at least one and possibly multiple equal choices. A solution candidate $X = (\psi_1, \ldots, \psi_n, o)$ is the sequence of the subsystems’ component choices and the number of operators $o$.

The state of a component choice $\chi = (x, r)$ at time $t$, $z_\chi(t)$, depends not only on its redundancy type and state transitions, but also on the number of available operators for transitions as well as on defined dependencies to other component choices. In addition to that, the introduction of concurring arbitrarily distributed transitions complicates the state-space. Therefore, the state distribution of the whole IT service system cannot be easily decomposed to component state distributions, which means it cannot be modeled as a Markov or a renewal process. Thus, there is no analytical expression of the system state distribution [34].

Given an $ITRAP_{\Delta t}$, the costs of an IT service design $X$ for a time period $t$, $C(X, t)$, can be calculated according to Equation 2 with

• $C_{init}(X) = \sum_{\chi \in X} c_{init,\chi}$ the initial costs,
• $C_{pow}(X, t) = \frac{1}{\Delta t} \sum_{\chi \in X} \int_0^t pow_\chi(z(\tau))\, d\tau$ the power costs,
• $C_{op}(X, t) = o \cdot wage_{\Delta t} \cdot \frac{t}{\Delta t}$ the operator costs and
• $C_{tr}(X, t) = \sum_{\chi \in X} \sum_{tr \in T_\chi} c_{tr} \cdot \#(tr, t)$ the transition costs.

$$C(X, t) = C_{init}(X) + C_{pow}(X, t) + C_{op}(X, t) + C_{tr}(X, t) \qquad (2)$$

The availability objective to be maximized is the average IT service availability, defined as $A(X, t) = \frac{1}{t} \int_0^t p(X, \tau)\, d\tau$, which can be interpreted as the mean performability of the service. The performability $p$ at time $t$ is determined according to Equation 3 and depends on the demand level as well as the actual service performance $per(X, t)$. Therefore, the availability objective function satisfies the requirement of modeling a relation between availability and performance.

$$p(X, t) = \begin{cases} 1, & \text{if } per(X, t) \geq demand \\ \dfrac{per(X, t)}{demand}, & \text{else} \end{cases} \qquad (3)$$

The performance of an IT service design $X$ at time $t$, $per(X, t)$, is defined as the minimum subsystem performance, which is the sum of its components’ performances as shown in Equation 4 [50].

$$per(X, t) = \min_{1 \leq i \leq n} \sum_{\chi \in \psi_i} per(z_\chi(t)) \qquad (4)$$

Using the definitions given above, the optimization problem in Equation 1 can be addressed by estimating the system state distribution and, thus, the IT service costs and availability.

3.3. Availability and Cost Estimation

Due to the complex state-space of an ITRAP even for small-sized problems, the explicit modeling of the state-space would be problematic. Therefore, the state-space of a solution candidate X for an ITRAP is encoded in a generalized stochastic Petri net (GSPN) (cf. [85]).

Generalized Stochastic Petri Nets. A GSPN is a bipartite graph consisting of places and transitions. Its basic model elements are presented in Figure 1. Places (displayed as blank circles) represent states and conditions that can be marked with tokens (smaller black filled circles) to indicate that a state/condition is active. Thus, the marking of all places of a GSPN represents the system state. Transitions change this system state by destroying and creating tokens in certain places (firing). In a GSPN, transitions are activated in certain markings and can fire immediately (immediate transitions, displayed as black filled rectangles) or after a deterministic or random time (timed transitions, displayed as blank rectangles) after the marking is reached. The edges between
470

the nodes of a GSPN determine which places have to be

variables immediately in the case of transition firing (im-

marked for a transition to be activated (input arcs) and

pulse reward) or continuously based on the marking of

in which places new tokens are created after firing (output

places (rate reward) [85].

arcs).

Assumptions. In order to create a GSPN from an ITRAP 500

definition, the following assumptions are made: • A component choice is in its initial state at t = 0, • Components can always be recovered from fail-

Place with Marking

Timed Transition

Immediate Transition

ures/performance degradations (unlimited supply of recovery measures),

2 505

Weighted Input/ Output Arc

always available for operator tasks (no operator sched-

Inhibitor Arc

ule) and • Operator errors happen with a constant probability

Figure 1: Basic elements of a GSPN.

and lead to the fact that a task must be repeated.

For timed transitions, the marking can change after ac-

475

480

485

• An operator in the model represents a position that is

tivation, but before the associated firing time has passed.510

ITRAP GSPN Model. Depending on the ITRAP defini-

If the firing of a transition may lead to a new marking in

tion and the solution candidate X, a GSPN is automati-

which another transition is no longer activated, a conflict

cally created. First, for each component choice (x, r), |Zx |

exists between these transitions. In a GSPN, the tran-

places are generated representing the component’s states.

sition with the shorter firing time is executed, which is

Timed transitions between these states are established ac-

called race or concurrency. The firing time is computed515

cording to the defined state transitions in Tx . For manual

by a deterministic or randomly distributed value minus

transitions, an operator place is created as an additional

the transition’s age value. In this value, the time between

precondition for this transition. In Figure 2, an example

a transition’s activation and deactivation due to the fir-

component model is displayed with three states as well as

ing of another transition can be stored (race age) [86]. If

failure and recovery transitions. The recovery from state

the transition fires eventually, the value is reset to zero.520

0 to state 2 requires operator interaction.

A conflict of more than one immediate transition can be

A component choice’s performance and power costs are

resolved if a probability is assigned for these transitions

modeled as rate rewards according to the performance

that determines their firing probability.

function perc and the power function powc . Depending

In addition to input and output arcs, inhibitor arcs from

on the redundancy type of a component choice, a token is

places to transitions can be defined (represented by edges525 created in the state z0 . Each time a transition fires, an 490

with a small circle at the head). If such a place is marked,

impulse reward according to the assigned transition costs

the connected transition will be deactivated at all events.

is generated.

This concept can be generalized by so-called enabling func-

If z0 is not the state of full performance, the redundancy

tions in which Boolean state conditions can be defined un-

is considered as standby, leading to creation of a standby

der which a transition is activated. 495

530

Reward functions can be introduced to change global

place that is marked in the initial system state. This token deactivates recovery transitions to states with higher

performance. A takeover transition with possible operator interaction is generated that destroys the standby token and sets the component state to full operation. An immediate standby transition creates the standby token again and sets the component to the standby state. An enabling function is associated with the takeover and standby transitions, which ensures that these transitions are only activated if subsystem performance is under (respectively above) the defined performance minimum $per_{min}$. For this purpose, a Boolean function is built from all possible subsystem states in which this happens. In Figure 3, such a standby component is illustrated for a two-state component with manual takeover.

Figure 2: An example component GSPN model.

After GSPN models for the component choices are instantiated, the dependencies $d \in D_{tr}$ are processed. For this purpose, a transition tr generates a token in a dependency place each time it fires. An immediate transition behind this place changes the state of each component choice that is of the affected component type $x_d$ with probability $p_d$. Another immediate transition with probability $1 - p_d$ simply deletes the token. An example of a dependency for which the recovery of one component may lead to the failure of another is presented in Figure 4, for instance, for a hard disk that may fail if a data center recovers from a power outage [87].

For the defined number of operators, tokens are created in the operator pool place. Each time an operator token is required to activate a state transition (recovery or takeover), this is called an operator task. An immediate transition between the operator pool and operator places is defined for each task. The enabling function of this immediate transition is true if the corresponding operator task exists and no other task with higher priority is due. Thus, operator tokens remain in the operator pool until tasks are created. If more tasks are due than operator tokens are in the operator pool, the task with the higher priority is executed.

Another immediate transition is created between the operator place and the operator pool that is enabled if the operator task is finished and leads to the operator's return to the pool with probability $1 - p_{err}$. A conflicting immediate transition with probability $p_{err}$ represents an operator error, leading to a rollback of the task. That means that the component state that was leading to the task is restored as well as the token in the operator place (cf. Figure 5).

When the model for a solution candidate X is created, its behavior can be simulated by a discrete-event simulation. To this end, the availability in the beginning A(X, 0) can be computed from the component choices' initial states. Each time t a state change happens (or at the end of the simulation), the new average availability can be computed by Equation 5 using the time of the last state change (or the beginning of the simulation) $t_0$, the old average availability $A(X, t_0)$ and performability $p(X, t_0)$ (cf. Equation 3). The power costs for a component can be computed accordingly and be added to the initial costs, operator costs and the transition cost rewards to estimate IT service costs.

$$A(X, t) = \frac{t_0 \cdot A(X, t_0) + (t - t_0) \cdot p(X, t_0)}{t} \tag{5}$$

In order to achieve statistically significant results, the

discrete-event simulation is carried out a number of times for different random seeds (Monte Carlo simulation). By using the mean values $\mu$ and empirical variances $s^2$ of availability and costs after n independent replications, a $(1 - \alpha)$ confidence interval for the expected value $\hat{\mu}$ of a random variable with unknown distribution and variance can be computed as shown in Equation 6, for which $z_q$ is the q-quantile of the standard normal distribution.

$$P\left(\hat{\mu} \in \left[\mu - z_{1-\frac{\alpha}{2}} \frac{s}{\sqrt{n}};\ \mu + z_{1-\frac{\alpha}{2}} \frac{s}{\sqrt{n}}\right]\right) = 1 - \alpha \tag{6}$$

Figure 3: GSPN model of a standby redundant component.

Figure 4: Example dependency GSPN model.

Nevertheless, the time consumption of a Monte Carlo simulation run is dramatically increased in comparison to a combinatorial evaluation even for small-sized problems. Tests on an Intel(R) Core(TM) i5-2500 CPU @ 3.30 GHz with 1000 replications indicated a weak quadratic dependency between the time consumption of the simulation and the number of events x with $R^2 = 0.9992$ (cf. Equation 7).

$$seconds(x) = 2 \cdot 10^{-14} x^2 + 2 \cdot 10^{-7} x + 0.2298 \tag{7}$$

In this setup, the simulation has provided stable results for more than $10^9$ events generated (e.g. by $6.67 \cdot 10^8$ ERP components in a solution candidate as described later in Table 3 simulated for one year). Parallelization of the independent replications could further reduce the time consumption while providing stable accuracy [70]. Therefore, the evaluation method is scalable to large-sized problems.

Figure 5: Operator task and error mechanism.

3.4. Solution Algorithms

In order to find (sub)optimal solutions for an ITRAP, an optimization procedure has to be defined that performs a guided search through the solution-space of different component choices and numbers of operators. Since the ITRAP is a very complex RAP and mathematical programming methods are only suitable for small-sized problems, heuristics and meta-heuristics could be used for solving the ITRAP [27]. Although more parameters have to be set for meta-heuristics, in general they produce better solutions than heuristics [33].

Due to the fact that these meta-heuristics normally require a single scalar goal function, the single objectives are often aggregated. This, however, requires a priori knowledge about the desired solutions. If that knowledge is not available, Pareto-based optimization techniques can be applied, which produce several solutions that are non-dominating so that decision makers can choose the best solution for their needs [88]. For the ITRAP, a solution with $(A_1, C_1)$ is dominating another solution $(A_2, C_2)$ if and only if $(A_1 > A_2 \wedge C_1 \leq C_2) \vee (C_1 < C_2 \wedge A_1 \geq A_2)$. A Pareto(-optimal) front is a set of mutually non-dominating solutions [89].

In this paper, the non-dominated sorting genetic algorithm II (NSGA-II), proposed by Deb et al. in [90], and tabu search, adapted by Kulturel-Konak in [89] for a RAP, are used. On the one hand, the NSGA-II has been chosen since it is one of the latest multi-objective GA methods and has been efficiently and successfully applied to RAPs (e.g. in [38, 57, 91]). Tabu search, on the other hand, is also reported to provide a stable and efficient solution method for multi-objective RAPs while resulting in especially wide Pareto fronts in comparison to other meta-heuristics [89].

Genetic Algorithm. Based on the plus-selection genetic algorithm (cf. e.g. [92]) presented in Algorithm 1, the genetic operators proposed in [40] have been adapted to the ITRAP by using the following functions.

Encoding. An individual is encoded by n + 1 genes for the n subsystems and the number of operators. A subsystem gene is a sequence of component choices (x, r) with $x, r \in \mathbb{N}$ as indices for components and redundancy types.
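The dominance relation stated above for the ITRAP, and the resulting Pareto front, can be illustrated with a short Python sketch. The function names `dominates` and `pareto_front` and the sample fitness tuples are hypothetical and serve only as an illustration; availability is maximized and cost is minimized, as in the definition.

```python
from typing import List, Tuple

Fitness = Tuple[float, float]  # (availability A, cost C); A is maximized, C minimized


def dominates(a: Fitness, b: Fitness) -> bool:
    """True if a dominates b: (A1 > A2 and C1 <= C2) or (C1 < C2 and A1 >= A2)."""
    a1, c1 = a
    a2, c2 = b
    return (a1 > a2 and c1 <= c2) or (c1 < c2 and a1 >= a2)


def pareto_front(solutions: List[Fitness]) -> List[Fitness]:
    """Keep only the solutions that are not dominated by any other solution."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]


# Made-up fitness tuples, for illustration only
candidates = [
    (0.990, 500_000.0),
    (0.995, 520_000.0),
    (0.990, 510_000.0),  # same availability as the first, but more expensive
    (0.980, 505_000.0),  # less available and more expensive than the first
]
front = pareto_front(candidates)
```

The quadratic comparison is sufficient for fronts of the size discussed here; larger populations would typically use a sorted sweep instead.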

Algorithm 1 Genetic algorithm: (μ + λ) plus-selection

procedure GA(μ, λ)
    gen ← 0
    pop ← initialize(μ)
    evaluate(pop)
    while not terminationCriteria(pop, gen) do
        pop ← pop ∪ recombine(pop, λ)
        pop ← mutate(pop)
        evaluate(pop)
        pop ← select(pop, μ)
        gen ← gen + 1
    end while
    return pop
end procedure
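As a rough illustration, the control flow of Algorithm 1 could be written in Python as follows. This is a minimal sketch, not the prototype's implementation: the operators are passed in as callables, fitness evaluation is assumed to happen inside `select`, termination is reduced to a fixed number of generations, and the toy operators on integers are hypothetical stand-ins for the ITRAP-specific ones.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def plus_selection_ga(
    mu: int,
    lam: int,
    initialize: Callable[[int], List[T]],
    recombine: Callable[[List[T], int], List[T]],
    mutate: Callable[[List[T]], List[T]],
    select: Callable[[List[T], int], List[T]],
    max_gen: int,
) -> List[T]:
    """(mu + lambda) loop in the order of Algorithm 1, stopped after max_gen generations."""
    pop = initialize(mu)
    for _ in range(max_gen):
        pop = pop + recombine(pop, lam)  # parents and lam children form one pool
        pop = mutate(pop)
        pop = select(pop, mu)            # evaluation is assumed to happen in select
    return pop


# Toy operators (hypothetical): individuals are integers, larger is better.
rng = random.Random(42)


def init(mu: int) -> List[int]:
    return [rng.randint(0, 10) for _ in range(mu)]


def toy_recombine(pop: List[int], lam: int) -> List[int]:
    return [(rng.choice(pop) + rng.choice(pop)) // 2 for _ in range(lam)]


def toy_mutate(pop: List[int]) -> List[int]:
    return [x + 1 if rng.random() < 0.1 else x for x in pop]


def toy_select(pop: List[int], mu: int) -> List[int]:
    return sorted(pop, reverse=True)[:mu]


final = plus_selection_ga(5, 3, init, toy_recombine, toy_mutate, toy_select, max_gen=10)
```

Passing the operators as parameters keeps the loop independent of the encoding, so the same skeleton would accept ITRAP-specific operators such as those described in the text.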

Initialization. In order to initialize the population, μ individuals are generated. To this end, the subsystem sizes are determined according to the ceiling of an exponentially distributed random variable with mean value subSize. The number of operators is determined correspondingly with mean value opNum. In a subsystem $s_i$, random component choices from $\Psi_i$ are added until the respective subsystem size is reached. The first component choice in a subsystem is always set to an active redundancy.

Fitness Evaluation. The solution candidate $X_i$ corresponding to an individual i is simulated in a number of independent runs given by the parameter replications. The mean values of availability and costs form the fitness tuple $(A_i, C_i)$.

Termination Criteria. The procedure is terminated after maxGen generations.

Recombination. Until λ child individuals are generated, two parent individuals are selected randomly according to an inverse cost-proportional distribution to produce two child individuals. For this production, uniform crossover is chosen since it is superior to other techniques for combinatorial problems [40]. To this end, genes (subsystem configurations as well as the number of operators) are exchanged randomly between the two parent individuals to create two child individuals.

Mutation. An individual is mutated with probability $p_{mut}$ by selecting a gene randomly. In this gene, a random component choice or an operator is added or removed. Removing is disabled if there is only one component choice/operator in the gene.

Selection. The NSGA-II algorithm, introduced in [90], is used as the selection procedure. First, a rank is assigned to each individual. To this end, non-dominating individuals in the population are identified and removed from the population with rank 0. For the remaining population, this is repeated with the successive rank until the population is empty. Additionally, a crowding distance is assigned to each individual as a measure for an individual's uniqueness in terms of its fitness values for a certain front rank. The population is sorted ascending by rank and, for equal ranks, descending by crowding distance. From that sorting, the first μ individuals are selected for the next generation.

Tabu Search. Kulturel-Konak et al. developed in [35] a tabu search algorithm for the solution of multi-objective RAPs (cf. Algorithm 2). In contrast to a genetic or other evolutionary algorithm, the tabu search considers only one solution candidate as the current solution, but stores non-dominating solutions over all iterations. After a candidate is initialized randomly, its neighbor solutions can be generated by applying problem-dependent moves on the current solution. Based on the separate treatment of the objectives, the best solution in the neighborhood is identified. If this solution is subject to a so-called tabu, it will only be selected if it dominates every other solution in the non-dominating solutions found so far. Otherwise, the best solution not subject to a tabu is chosen as the next solution candidate. All solutions dominated by the new solution are deleted from the non-dominating list.

A tabu can be seen as the difference between the old and the new solution candidate in each iteration, and should ensure that a previously conducted move is not repeated in the following iterations. Tabus are collected in a tabu list in which newer tabus replace older tabus. If the non-dominating list is not updated for a quarter of the defined maximal iterations, a new solution candidate will be randomly selected from the non-dominating list and the procedure starts again. The whole search process ends if changes in the current solution candidate do not lead to a change in the non-dominating list for the maximum number of iterations.

Algorithm 2 Tabu search for multi-objective optimization

procedure MTS(maxIterations, size, tabuSize)
    tabu ← ∅
    count, lastRestart, iteration ← 0
    candidate ← initialize
    nonDom ← {candidate}
    while count < maxIterations do
        objective ← selectObjective
        newCandidate ← searchNeighborhood(candidate, tabu, nonDom, objective, size)
        if isNotDominated(newCandidate, nonDom) then
            nonDom ← nonDom ∪ {newCandidate}
            count ← 0
        else
            count ← count + 1
        end if
        nonDom ← deleteDominated(nonDom, newCandidate)
        tabu ← updateTabuList(candidate, newCandidate, tabuSize)
        candidate ← newCandidate
        if count − lastRestart > maxIterations / 4 then
            lastRestart ← count
            candidate ← randomElement(nonDom)
            tabu ← ∅
        end if
    end while
    return nonDom
end procedure

function searchNeighborhood(candidate, tabu, nonDom, objective, size)
    newCandidate ← ∅
    while newCandidate = ∅ do
        neighborhood ← move(candidate, size)
        newCandidate ← selectBest(neighborhood, objective)
        if isTabu(newCandidate) ∧ ¬dominatesAll(newCandidate, nonDom) then
            newCandidate ← selectBestNonTabu(neighborhood, objective)
        end if
    end while
    return newCandidate
end function

For the ITRAP, the operations of the tabu search procedure are instantiated as follows:

Candidate Encoding and Initialization. A candidate is encoded and initialized in the same way as in the genetic algorithm presented above.

Objective Selection. For the two objectives of the ITRAP, the cost or availability objective is selected randomly in each iteration with the same probability. As shown in [89], this method is suitable for RAPs.

Moves. Analogous to mutation in the genetic algorithm presented above, a move is performed by adding/removing component choices/an operator in a random subsystem.

Tabu. When a new solution candidate replaces an old one, a tabu is created that represents the changed subsystem/number of operators of the old solution candidate. A new candidate is subject to a tabu if it contains a subsystem/number of operators that exists in the tabu list. If the tabu list would exceed the maximum list size tabuSize, the oldest tabu is deleted from the list.
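The tabu mechanism described above can be sketched in Python as a bounded first-in, first-out list. This is only one possible reading of the description: the class `TabuList`, the representation of a tabu as a pair of gene index and old gene value, and the example candidates are assumptions made for illustration and are not taken from the prototype.

```python
from collections import deque
from typing import Deque, Hashable, Sequence, Tuple

Tabu = Tuple[int, Hashable]  # (gene index, gene value of the old candidate)


class TabuList:
    def __init__(self, tabu_size: int) -> None:
        # A deque with maxlen silently drops its oldest entry when it is full.
        self._entries: Deque[Tabu] = deque(maxlen=tabu_size)

    def update(self, old: Sequence[Hashable], new: Sequence[Hashable]) -> None:
        """Store every gene of the old candidate that the move has changed."""
        for index, (old_gene, new_gene) in enumerate(zip(old, new)):
            if old_gene != new_gene:
                self._entries.append((index, old_gene))

    def is_tabu(self, candidate: Sequence[Hashable]) -> bool:
        """A candidate is tabu if one of its genes matches a stored tabu."""
        return any((i, gene) in self._entries for i, gene in enumerate(candidate))


# Hypothetical candidates: two subsystem genes followed by the number of operators.
old = (("x1",), ("y1", "y1"), 1)
new = (("x1", "x1"), ("y1", "y1"), 1)  # move: one component choice added to subsystem 0

tabus = TabuList(tabu_size=2)
tabus.update(old, new)
old_is_tabu = tabus.is_tabu(old)
new_is_tabu = tabus.is_tabu(new)
```

Here the old configuration of subsystem 0 becomes tabu, so a later move that restores it would be rejected unless the aspiration rule (dominating all stored non-dominating solutions) applies.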

On the basis of the defined ITRAP, the GSPN model and the presented solution algorithms, a prototype of these concepts has been implemented in the Eclipse-based Java simulation framework AnyLogic 6.8.1.

4. Evaluation

In this section, the concept of the ITRAP is evaluated by applying its prototypical implementation to the optimization of the IT service design in a real-world use-case. This design is required by an IT service provider that hosts SAP ERP systems (application service provider, ASP) for hundreds of customers all over the world and is described in the following.

4.1. Use-Case: An International Application Service Provider

For the ERP service of the investigated ASP, suitable availability Service Level Objectives should be identified on the basis of costs and availability of optimal designs. In Figure 6, the required subsystems for providing the ERP service are illustrated with dashed rectangles. The components that can be used in each subsystem are displayed as rectangles. Hardware and software subsystems are presented on the left while infrastructure subsystems are located on the right side.

Figure 6: Subsystems and components of the application example. Subsystems shown: ERP Blade, ERP System, Database Blade, Database System, Proxy, Load Balancer, Storage Management, Storage, Internal Network, External Network, Power. Components shown: HP ProLiant BL620c G7, HP ProLiant BL980 G7, HP ProLiant BL460c G7, HP ProLiant DL360 G6, SAP ERP 6.0, IBM DB2, SAP Router, SAP Adaptive Computing Controller, HP Storage Essentials, HP Storage Works EVA 6400, Local Area Network, Wide Area Network, Power Supply, Uninterruptible Power Supply.

In order to provide the customers access to the service, user requests have to be routed to the ERP subsystem by a proxy subsystem, formed by so-called SAP Routers. Since the ASP utilizes multiple ERP systems to serve various customer demands, a load balancing subsystem (consisting of SAP Adaptive Computing Controllers, ACC) forwards requests to the corresponding ERP instance. The ACC is also used to observe and manage ERP and database instances. The SAP ERP 6.0 application servers forming the ERP subsystem are hosted on a blade subsystem (ERP blades) which is connected to a database subsystem (IBM DB2 instances) running on a database blade server subsystem. The data is physically stored in a storage area network (SAN) subsystem which is managed by a storage management subsystem (HP Storage Essentials). A local area network (LAN) connects these systems with the wide area network (WAN) while the power supply provides the landscape with energy.

For the ERP blade as well as for the database blade subsystem, two different components can be selected as shown in Figure 6. For the power supply, an uninterruptible power supply can be used as a (standby) alternative to the supply of an electricity provider.

The problem of the application service provider is to find suitable combinations of component choices and a number of operators in order to optimize IT service availability and costs. From the identified solutions, a service designer can define an availability Service Level Objective and decide which design is used for the service.

According to the given ITRAP definition, the components were constructed and parametrized. Due to space limitations, the complete parametrization of all components cannot be presented in this context. Therefore, only the parametrization of the SAP ERP 6.0 component is illustrated as an example in Table 3.

| Parameter | Value |
|---|---|
| Name | SAP ERP 6.0 |
| States (perf. level) | Failed (0.0), reduced performance (0.5), full performance (1.0) |
| Power costs | Full & reduced performance & failed: 0 € per h |
| Initial costs | 2,500 € |
| Transitions | "Failure": start & end state: full performance → failed; TTF ~ exp(8,760 h); TTR ~ exp(1.73 h); operator interaction: true; recovery costs: 0 € |
| | "Performance degradation": start & end state: full performance → reduced performance; TTF ~ exp(4,380 h); TTR ~ exp(0.865 h); operator interaction: true; recovery costs: 0 € |
| Standby redundancy | Hot standby; activation time ~ triang(0.083 h; 0.292 h; 0.5 h) |
| Dependencies | Common cause failure: failure → (p = 0.3) failure of SAP ERP 6.0 |

Table 3: Example of the parametrization of a component type.

Since the investigated ASP was not able to provide all the required long-term data for the ITRAP's parametrization, other sources were used as well. The distributions and parameters for the failure and recovery times were taken from literature if available (e.g. [19, 93]) and from the analysis of monitoring data of the Los Alamos National Lab (footnote 2). For cost aspects, manufacturer information as well as power consumption data such as from the HP Power Advisor (footnote 3) and the SPECpower benchmark (footnote 4) were used.

Footnote 2: http://institute.lanl.gov/data/fdata/
Footnote 3: http://www8.hp.com/de/de/products/servers/solutions.html?compURI=1439951#.VTDZnFxdJpw
Footnote 4: http://www.spec.org/power_ssj2008/results/power_ssj2008.html

For each component, two to four states with different performance levels (from zero to one) were modeled. Transitions (failure-recovery cycles) were defined that induce state changes, as shown by way of example in Table 3. Additionally, inter-component dependencies were defined on the basis of expert interviews. For instance, an additional WAN access point can reduce the probability that cable failures lead to unavailability. Nevertheless, if the internet service provider has an outage, all access points will suffer a common cause failure. A common cause failure is also modeled for the ERP system, as shown in Table 3. This means that if one ERP system fails, other ERP systems can also be affected with a 30 % probability.

The demand level of the service is set to 1, the operator wage to 40 € per hour. The probability of an operator error is set to 5 % for all tasks.

In addition to that, a combinatorial fitness computation from [40] was implemented as a contrast to the simulation-based fitness computation. In the combinatorial formula, modeled aspects such as passive redundancy, operator interaction, multiple states and dependencies are ignored. Thus, these results represent a classical RAP and can be compared to the results of the ITRAP.

4.2. Results

Using the application scenario and its parametrization, the solution algorithms in combination with the Monte Carlo simulation have been applied to obtain non-dominating solutions. To this end, the genetic algorithm (GA) and the tabu search (TS) have each been carried out ten times (as e.g. in [40, 55]) due to the stochastic nature of the algorithms. The results of each algorithm have

been aggregated, and the fronts of non-dominating solutions have been extracted. For fitness evaluation, a time frame of one year was considered. The parameters of GA and TS are presented in Table 4, and have been chosen so that the total time consumption of both procedures has been of the same magnitude for the ITRAP. By obtaining fronts for the ITRAP as well as for the RAP, four Pareto fronts have been generated.

| Algorithm | Parameter | Value |
|---|---|---|
| Both | Runs | 10 |
| | replications | 25 |
| | t | 8760 h |
| | subSize | 1 |
| | opNum | 1/3 |
| GA | μ | 100 |
| | λ | 50 |
| | maxGen | 25 |
| | $p_{mut}$ | 0.1 |
| TS | tabuSize | 15 |
| | maxIterations | 100 |
| | size | 15 |

Table 4: Parameter setting for both solution algorithms in the case-study example.

In [38], the authors present some metrics to compare different Pareto fronts that have been used for the comparison of the obtained four fronts:

– The number of Pareto solutions NPS,
– Spacing $S = \sqrt{\frac{1}{NPS} \sum_{i=1}^{NPS} (d_i - \bar{d})^2}$ with $d_i$ the Manhattan distance to the nearest neighbor solution (in terms of the fitness tuple) and $\bar{d}$ the mean value of $d_i$ for all solutions,
– Maximum spread $MS = \sqrt{(\max A_i - \min A_i)^2 + (\max C_i - \min C_i)^2}$ as the Euclidean distance between the maximum availability and costs and the respective minimums as well as
– Non-uniformity $NPF = \sqrt{\frac{\sum_i (d_i / \bar{d} - 1)^2}{NPS - 1}}$ to measure the non-uniformity of the Pareto distribution curve.

The results for these values are presented in Table 5 along with the maximum availability and the minimum costs that were found by the solution algorithms. The best value compared to the other algorithm is printed in bold on a gray background (greater value for NPS, MS and max A; smaller value for S, NPF and min C). The members of the four Pareto fronts are illustrated in Figure 7.

4.3. Discussion

One of the first things to mention when analyzing the Pareto fronts displayed in Figure 7 is the fact that the curves of the classical RAP algorithms are significantly different from those of the ITRAP. In particular, the classical RAP algorithms achieve very high levels of availability, which are not achieved in the ITRAP. Due to the lack of dependencies and operator interaction in the RAP, availability can be increased to nearly one hundred percent by using a sufficient number of parallel components. However, this is not a behavior that can be observed in reality: first, inter-component dependencies such as common cause failures mean that downtime cannot be minimized to an arbitrary extent by simply using more components. Second, more components mean a greater need for operator interaction. If the number of tasks exceeds the number of operators for a moment in time, the time to recover increases. This may only be avoided by using more operators, which, however, will produce significantly increased cost. Therefore, the upper bound for availability, which is 1 in the RAP, will be lower in reality, which is reflected in the ITRAP solution.

On the other hand, it can be stated that in the low-cost area, the ITRAP is able to provide slightly better results than the RAP. This can be explained by the fact that the use of standby redundancy enables power cost savings without affecting availability considerably. Therefore, the design suggestions of a RAP are significantly different from those of the ITRAP. Especially in the high availability area, the ITRAP results make it clear that the availability of an IT service cannot be increased nearly to 100 % only

by component-level redundancy, and even slight improvements in this area will produce much higher costs. This behavior has also been observed in IT service studies such as [80], which indicates that the ITRAP results are more realistic than those of classical RAPs.

Figure 7: Pareto fronts of both solution algorithms for ITRAP and RAP (availability, from 0.975 to 1, plotted over costs, from €475,000 to €1,075,000; series: ITRAP TS, ITRAP GA, RAP TS, RAP GA).

As shown in Table 5, the RAP GA (155 solutions) and TS (43) find many more non-dominating solutions than the ITRAP GA (21) and TS (25). This may also be explained by dependencies, which mean that the introduction of an additional component will not always increase availability and will sometimes even decrease it due to limited operator capacities for recovery tasks. These limited capacities can only be raised by hiring more operators, which is very costly. That can also be seen in the maximum spread metric, which is higher in both ITRAP solution algorithms. Since the ITRAP produces fewer solutions in a wider range, spacing is also higher than for the RAP algorithms.

Regarding the non-uniformity of the Pareto front distribution, there is no significant difference between the RAP and the ITRAP solutions. However, a massive difference can be observed for the time-consumption, which is 5,000 to 50,000 times higher in the ITRAP solution algorithms. This is solely caused by the higher time-consumption of the Monte Carlo simulation in comparison to a combinatorial calculation. However, the independent simulation replications were not parallelized in the performed experiments. By using massive parallelization, the time difference could be reduced to a factor of 500.

Looking at the Pareto fronts with respect to the corresponding solution algorithm as well as at the quality metrics of the Pareto fronts for both scenarios, it

Metric NPS S MS NPF max A min C Time in s

RAP GA

RAP TS

ITRAP GA

ITRAP TS

155 6,389 243,602 4.1 0.999986 502,328 0.0034

43 7,877 416,496 1.44 > 1 − 1 · 10−14 507,328 0.0016

21 22,487 552,354 1.74 0.99827 498,004 18,642

25 45,038 595,800 2.79 0.998923 500,406 18,370

Table 5: Quality metrics for the four Pareto fronts.

920

925

930

935

becomes obvious that the genetic algorithm and the tabu

perform inversely and in a smaller range (TS 2.79 and GA

search come to different results. For both the RAP and950

1.74).

the ITRAP, the GA identifies superior solutions for low

Since the solution algorithm tuning parameters were

costs while the TS performs better in the high availability

chosen so that the time-consumption of TS and GA are

area.

comparable in the ITRAP scenario, it seems surprising

This can be explained by the solution algorithms’ me-

that the TS needs only half the time in the RAP sce-

chanics: due to the parameters subSize = 1 and opN um =955

nario. This can be explained by the fact that the algo-

1 3,

the initial solutions consist of only a few components per

rithms were configured to simulate equal solution candi-

subsystem and normally one operator. Therefore, these so-

dates only once in order to save time for the ITRAP. Due

lutions are located in the low cost area. In the GA, new

to the plus-selection of the GA, the population may con-

solutions are created mainly by recombining solutions with

sist of several individuals that have been generated many

a small probability for mutation, which leads to a better960

generations ago so that less solution candidates have to be

exploitation of the initial zone, but not a good exploration

simulated. In the TS, the search process is concentrated

for higher costs. The TS, on the other hand, starts from a

on a single path through the solution space, and in each

single solution and performs the search solely by using the

iteration several solution candidates are considered from

mutation operator. This produces a higher spread regard-

which only one is selected for the next generation. There-

ing the number of components and operators and, thus,965

fore, the GA considers a significantly greater number of

better solutions in the high cost area compared to the GA.

solution candidates than the TS for the chosen tuning pa-

This conclusion is also supported by the maximum

rameters, which is reflected in the runtime in the RAP scenario.

spread metric, which is higher for the TS than for the

In the end, it can be stated that the results produced

GA as well as the higher minimal costs and maximum 940

945

availability in both scenarios (cf. Table 5). The feature970

by the RAP scenario are significantly different from those

of TS to identify wide Pareto fronts was also observed in

of the ITRAP, which justifies the heavily increased time-

[89]. The spacing metric, i.e. the standard deviation of the

consumption in such a real-world scenario with inter-

nearest neighbor distance, is smaller in the GAs, which is

component dependencies and limited operator capacities.

also majorly influenced by the high cost solutions of the

For the case of the analyzed application service provider,

TS that have a great distance between them. Regarding975

the two solution algorithms perform differently. However,

the non-uniformity of the four curves, the results are in-

the decision as to which algorithm is preferable for this

conclusive. For the RAP, the N P F metric ranges from

problem cannot be made. Therefore, a detailed perfor-

1.44 (TS) to 4.1 (GA) while in the ITRAP the algorithms

mance analysis of both algorithms is carried out. 19
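To illustrate the spread-related quality metrics discussed above, the following minimal Python sketch computes the spacing metric as the standard deviation of the nearest-neighbor distances, as described in the text, together with a maximum spread value. The function names, the use of the population standard deviation, and the definition of maximum spread as the diagonal of the bounding box of the front are assumptions made for this sketch; they are not taken from the implementation used in this work.

```python
import math
from statistics import pstdev


def nearest_neighbor_distances(front):
    """For each point of a Pareto front, return the Euclidean distance
    to its closest other point in the objective space."""
    return [
        min(math.dist(p, q) for j, q in enumerate(front) if j != i)
        for i, p in enumerate(front)
    ]


def spacing(front):
    """Spacing metric: standard deviation of the nearest-neighbor distances.
    Assumption: the population standard deviation is used."""
    return pstdev(nearest_neighbor_distances(front))


def maximum_spread(front):
    """Maximum spread. Assumption: length of the diagonal of the
    bounding box spanned by the front in the objective space."""
    dims = range(len(front[0]))
    return math.sqrt(
        sum((max(p[d] for p in front) - min(p[d] for p in front)) ** 2 for d in dims)
    )


if __name__ == "__main__":
    # Hypothetical toy front; each tuple is a point in the objective space.
    toy_front = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
    print(nearest_neighbor_distances(toy_front))
    print(spacing(toy_front))
    print(maximum_spread(toy_front))
```

Under these assumptions, a front with a few widely separated high cost solutions yields both a larger maximum spread and a larger spacing value, which matches the pattern reported for the TS in Table 5. Because costs and availability live on very different scales, a practical implementation would probably normalize the objectives first; the sketch leaves this out.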

4.4. Performance Analysis of the Genetic Algorithm and Tabu Search

In order to compare both solution algorithms, the difference between the resulting and an optimal Pareto front should be analyzed. Obtaining this front can be done by exhaustive search. However, the solution space of the ITRAP is unlimited due to the lack of a maximum component number per subsystem as used in other works. Therefore, a scenario is defined on the basis of the analyzed use-case for which exhaustive search is feasible.

In this scenario, only the ERP, the ERP blade, the proxy and load balancer subsystem have been modeled (cf. Figure 6). The possible component number per subsystem has been restricted to three. Thus, the solution space consists of 16,200 distinct solutions. After all solutions have been evaluated by simulation, the non-dominated solutions have been extracted to form the optimal Pareto front.

After that, the genetic algorithm as well as the tabu search were executed for different parameter settings so that the processing time of the meta-heuristics could be compared to the processing time of selecting the optimal Pareto front. To assess the performance of the solution algorithms, two metrics have been used.

The F1 value, as shown in Equation 8, is computed from the ratio of Pareto-optimal solutions to the obtained solutions (precision) as well as the ratio of identified optimal solutions to all optimal solutions (recall). If the obtained front equals the optimal result, the value is one. If both fronts have no common solutions, the value becomes zero.

$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{8}$$

In addition to this value, the mean ideal distance (MID) from [38] is used, which is defined as the mean Euclidean distance in the objective space between an obtained solution and the nearest Pareto-optimal solution.

In Figure 8, those metrics are displayed for different parameter settings, differentiated by the normalized time consumption (1 equals the time consumption of the exact solution). It can be seen that both algorithms achieve maximal accuracy faster than the exhaustive search. Additionally, both algorithms obtain results with 90 % accuracy in under 20 % of the time for obtaining the optimal solution. The fact that tabu search is a very effective solution method for RAPs can be confirmed in this scenario: while the F1 value is only slightly higher than for the GA, the mean ideal distance of the TS results is, even for very short execution times, very low in comparison to the GA. Therefore, it can be stated that the tabu search should be preferred for this problem.

Figure 8: Performance metrics of both solution algorithms for different parameter settings. [Two panels: F1 value and mean ideal distance, each plotted against the normalized time consumption for GA and TS.]

5. Conclusion

Unavailability of IT services is inconvenient for both providers and consumers. Besides a loss of reputation and possible opportunity costs for the IT service provider, violations of the service-level agreements may lead to penalty costs. Although ensuring sufficient IT service availability is a crucial task that is mainly affected by IT service design, there is a lack of suitable decision support systems that help IT service designers build highly available systems at low costs.

As a prominent research area in availability optimization, the redundancy allocation problem (RAP) addresses the issue of achieving an optimal trade-off between availability and resource consumption. However, the combinatorial computation of system availability from component availabilities in proposed RAPs cannot map all the factors that affect IT service availability, such as dependent component failures and limited operator capacities. On the other hand, state-space-based approaches that are capable of introducing these factors in analytical models have been developed in recent years in order to estimate IT service availability.

Therefore, the goal of this paper has been to combine the recent research in RAP and IT service availability estimation in order to provide multi-objective decision support for IT service design in terms of availability and cost. On the basis of the related work that has been identified, requirements for a RAP for IT service design have been derived. The ITRAP that matches these requirements has been defined. Two meta-heuristic algorithms have been adapted that, in combination with a Petri net Monte Carlo simulation for availability and costs estimation, can provide Pareto-optimal solutions to an ITRAP. These are based on genetic algorithms as well as tabu search.

A prototypical implementation of the ITRAP concept has been applied to the optimization of an international application service provider’s landscape excerpt in order to demonstrate the feasibility of the approach. The comparison of the ITRAP simulation-based fitness estimation to a classical combinatorial evaluation has revealed that the new approach comes to different design suggestions. In addition to this, it could be shown that the genetic algorithm and the tabu search are both suitable to solve the problem, although the tabu search provides better results, especially for small execution times.

The results obtained in the use-case example indicate that the ITRAP is able to provide more realistic design suggestions than classical RAP approaches. While some work was conducted in the area of standby redundancy mechanisms, their effect on the energy consumption has never been introduced into the costs estimation, although this allows for lower costs while achieving equal availability levels. Another important point implicitly modeled in the ITRAP approach is that component-level redundancy cannot increase availability to the upper bound of 100 percent due to operator capacities and inter-component dependencies. These values could only be achieved by using system-level redundancy, e.g. with mirrored data centers, which increases costs massively.

The disadvantages of the ITRAP approach are the increased modeling effort and time consumption. The former is justified if the ITRAP results differ from those of classical RAPs, which is the case for IT service design problems with dependencies. Additionally, the difference in time consumption in comparison to combinatorial estimation models is not a crucial drawback since the ITRAP should be applied in the service design stage. In this lifecycle phase, the time effort to optimize a planned IT service is negligible in comparison to the cost of correcting wrong decisions in later lifecycle stages.

Therefore, the ITRAP concept that has been presented in this work can be seen as a first step to establish a multi-objective RAP for IT service design in order to support

decision makers in the service design stage. Nevertheless, there is also some potential for improvement:

At first, an important research area in RAP literature was not considered here, which is parameter uncertainty. It is often not feasible or even possible to conduct long-term analysis on component availability characteristics, especially for software systems in development. Therefore, uncertainty should be incorporated in a RAP for IT services. This could be possible by using simple interval arithmetic or more complex fuzzy arithmetic approaches.

In addition to this, no resource constraints for e.g. weight or volume have been considered for the optimization problem. However, linear constraints depending on component choices as used in other RAPs can be easily introduced in the estimation model. With regard to the solution algorithms, this would require a penalty function for the fitness tuple of a solution candidate that scales with the degree of infeasibility (cf. e.g. [40, 89]).

While a multi-objective approach provides a greater range of optimal service designs, it may be overwhelming for a decision maker to be confronted with a lot of designs that are not superior to each other. If the simulation contained a model for service-level agreements, the cost of violations due to poor IT service availability could be integrated into the costs estimation. That would allow for the transformation of the multi-objective problem with availability and costs to a single-objective problem with the goal of minimizing costs and, thus, for providing a single (sub)optimal solution. However, if the cost of unavailability cannot be determined a priori or if Service Level Objectives are not yet defined, the multi-objective problem should be preferred.

In the ITRAP, availability was modeled depending on the ratio of actual to desired performance. This representation is rather simple, and does not consider changing demands that could be modeled by a load curve as well as component utilization. Power costs of IT systems, however, depend not only on the component state, but also on the current utilization [83]. Therefore, an introduction of these aspects would increase the quality of the costs estimation.

While operator interaction and limited operator capacities are considered in the ITRAP definition, assigning an equal error probability to each operator is questionable. Operators have very different levels of expertise, which greatly influences the probability of operator errors. In addition, it is assumed that the number of operators is constant throughout the whole operation, which may not be the case due to operator schedules. Since operator errors are a major cause for IT service downtime, more and more organizations introduce mechanisms to lower operator error probability, e.g. validation systems in which interactions are tested before they are transferred to a productive system. Such mechanisms should also be integrated in the availability and costs estimation.

References

[1] J. Ward, J. Peppard, Strategic Planning for Information Systems, 3rd ed., John Wiley & Sons, New York, NY, USA, 2002.

[2] A. Keller, H. Ludwig, The WSLA Framework: Specifying and Monitoring Service Level Agreements for Web Services, Journal of Network and Systems Management 11 (2003) 57–81.

[3] E. Zambon, S. Etalle, R. Wieringa, A2thOS: availability analysis and optimisation in SLAs, International Journal of Network Management 22 (2012) 104–130.

[4] D. Siewiorek, R. Swarz, The Theory and Practice of Reliable System Design, Digital Press, 1982.

[5] L. Hunnebeck, ITIL Service Design 2011 Edition, The Stationery Office, Norwich, UK, 2011.

[6] D. Henschen, Amazon Outage Scrooges Netflix, Heroku, 2012. URL: http://www.informationweek.com/cloud-computing/infrastructure/amazon-outage-scrooges-netflix-heroku/240145338, accessed: 2015-04-30.

[7] G. Keizer, Apple service outage stretches into hours, 2015. URL: http://www.computerworld.com/article/2895394/outage-hits-apple-services-including-icloud-and-app-store.html, accessed: 2015-04-30.

[8] L. Whitney, Microsoft pins Hotmail, Outlook outage on hot data center, 2013. URL: http://news.cnet.com/8301-10805_3-57574270-75/microsoft-pins-hotmail-outlook-outage-on-hot-data-center/, accessed: 2015-04-30.

[9] D. Tweney, 5-minute outage costs Google $545,000 in revenue, 2013. URL: http://venturebeat.com/2013/08/16/3-minute-outage-costs-google-545000-in-revenue/, accessed: 2015-04-30.

[10] D. Csaplar, The cost of downtime is rising, 2012. URL: http://blogs.aberdeen.com/it-infrastructure/the-cost-of-downtime-is-rising/, accessed: 2015-04-30.

[11] H. Garg, M. Rani, S. P. Sharma, Y. Vishwakarma, Bi-objective optimization of the reliability-redundancy allocation problem for series-parallel system, Journal of Manufacturing Systems 33 (2014) 335–347.

[12] D.-H. Chi, W. Kuo, Optimal design for software reliability and development cost, IEEE Journal on Selected Areas in Communications 8 (1990) 276–282.

[13] International Organization for Standardization, ISO/IEC 20000-1, 2011.

[14] Information Systems Audit and Control Association, COBIT 5, ISACA, 2012.

[15] V. C. Emeakaroha, M. A. S. Netto, R. N. Calheiros, I. Brandic, R. Buyya, C. A. F. De Rose, Towards autonomic detection of SLA violations in Cloud infrastructures, Future Generation Computer Systems 28 (2012) 1017–1029.

[16] D. Terlit, H. Krcmar, Generic Performance Prediction for ERP and SOA Applications, in: Proceedings of the 18th European Conference on Information Systems (ECIS), 2011.

[17] G. Callou, P. Maciel, D. Tutsch, J. Araújo, J. Ferreira, R. Souza, A Petri Net-Based Approach to the Quantification of Data Center Dependability, in: P. Pawlewski (Ed.), Petri Nets - Manufacturing and Computer Science, InTech, 2012, pp. 313–336.

[18] B. Littlewood, Comments on Reliability and performance analysis for fault-tolerant programs consisting of versions with different characteristics by Gregory Levitin, Reliability Engineering & System Safety 91 (2006) 119–120.

[19] N. Milanovic, B. Milic, Automatic generation of service availability models, IEEE Transactions on Service Computing 4 (2011) 56–69.

[20] S. Bosse, Predicting an IT Services Availability with Respect to Operator Errors, in: Proceedings of the 19th Americas Conference on Information Systems (AMCIS), Chicago, IL, USA, 2013.

[21] U. Franke, P. Johnson, J. König, An architecture framework for enterprise IT service availability analysis, Software and Systems Modeling 13 (2014) 1417–1445.

[22] K. Trivedi, G. Ciardo, B. Dasarathy, M. Grottke, R. Matias, A. Rindos, B. Vashaw, Achieving and assuring high availability, in: T. Nanya, F. Maruyama, A. Pataricza, M. Malek (Eds.), 5th International Service Availability Symposium (ISAS), volume 5017 of Lecture Notes in Computer Science, Springer Verlag Berlin Heidelberg, Tokyo, Japan, 2008, pp. 20–25.

[23] A. R. Hevner, S. T. March, J. Park, S. Ram, Design science in information systems research, MIS Quarterly 28 (2004) 75–105.

[24] K. Peffers, T. Tuunanen, M. A. Rothenberger, S. Chatterjee, A design science research methodology for information systems research, Journal of Management Information Systems 24 (2008) 45–78.

[25] W. Kuo, V. R. Prasad, An annotated overview of system-reliability optimization, IEEE Transactions on Reliability 49 (2000) 176–187.

[26] M. Gen, Y. Yun, Soft computing approach for reliability optimization: State-of-the-art survey, Reliability Engineering & System Safety 91 (2006) 1008–1026.

[27] R. Soltani, Reliability optimization of binary state non-repairable systems: A state of the art survey, International Journal of Industrial Engineering Computations 5 (2014) 339–364.

[28] J. D. Kettelle Jr., Least-cost allocations of reliability investment, Operations Research 10 (1962) 249–265.

[29] D. Fyffe, W. Hines, N. Lee, System reliability allocation and a computational algorithm, IEEE Transactions on Reliability 17 (1968) 64–69.

[30] S. J. Sadjadi, R. Soltani, Alternative design redundancy allocation using an efficient heuristic and a honey bee mating algorithm, Expert Systems with Applications 39 (2012) 990–999.

[31] J. E. Ramirez-Marquez, D. W. Coit, A heuristic for solving the redundancy allocation problem for multi-state series-parallel systems, Reliability Engineering & System Safety 83 (2004) 341–349.

[32] D. Zou, L. Gao, S. Li, J. Wu, An effective global harmony search algorithm for reliability problems, Expert Systems with Applications 38 (2011) 4642–4648.

[33] D. Coit, A. Konak, Multiple weighted objectives heuristic for the redundancy allocation problem, IEEE Transactions on Reliability 55 (2006) 551–558.

[34] I. D. Lins, E. L. Droguett, Multiobjective optimization of availability and cost in repairable systems design via genetic algorithms and discrete event simulation, Pesquisa Operacional 29 (2009) 43–66.

[35] S. Kulturel-Konak, D. W. Coit, F. Baheranwala, Pruned pareto-optimal sets for the system redundancy allocation problem based on multiple prioritized objectives, Journal of Heuristics 14 (2008) 335–357.

[36] W.-C. Yeh, T.-J. Hsieh, Solving reliability redundancy allocation problems using an artificial bee colony algorithm, Computers & Operations Research 38 (2011) 1465–1473.

[37] H. Garg, S. Sharma, Multi-objective reliability-redundancy allocation problem using particle swarm optimization, Computers & Industrial Engineering 64 (2013) 247–255.

[38] A. Chambari, S. Rahmati, A. Najafi, A. Karimi, A bi-objective model to optimize reliability and cost of system with a choice of redundancy strategies, Computers & Industrial Engineering 63 (2012) 109–119.

[39] M.-S. Chern, On the computational complexity of reliability redundancy allocation in a series system, Operations Research Letters 11 (1992) 309–315.

[40] D. Coit, A. Smith, Solving the redundancy allocation problem using a combined neural network/genetic algorithm approach, Computers & Operations Research 23 (1996) 515–526.

[41] Y.-C. Liang, A. Smith, An Ant Colony Optimization Algorithm for the Redundancy Allocation Problem (RAP), IEEE Transactions on Reliability 53 (2004) 417–423.

[42] T.-C. Chen, P.-S. You, Immune algorithms-based approach for redundant reliability problems with multiple component choices, Computers in Industry 56 (2005) 195–205.

[43] T. Taguchi, T. Yokota, Optimal design problem of system reliability with interval coefficient using improved genetic algorithms, Computers & Industrial Engineering 37 (1999) 145–149.

[44] S. J. Sadjadi, R. Soltani, Minimum–Maximum regret redundancy allocation with the choice of redundancy strategy and multiple choice of component type under uncertainty, Computers & Industrial Engineering (2015).

[45] M. A. Ardakan, A. Z. Hamadani, Reliability–redundancy allocation problem with cold-standby redundancy strategy, Simulation Modelling Practice and Theory 42 (2014) 107–118.

[46] V. Ravi, B. Murty, P. Reddy, Nonequilibrium simulated annealing-algorithm applied to reliability optimization of complex system, IEEE Transactions on Reliability 46 (1997) 233–239.

[47] L. Sahoo, A. Bhunia, D. Roy, A genetic algorithm based reliability redundancy optimization for interval valued reliabilities of components, Journal of Applied Quantitative Methods 5 (2010) 270–287.

[48] M. Ziaee, Optimal Redundancy Allocation in Hierarchical Series–Parallel Systems Using Mixed Integer Programming, Applied Mathematics 4 (2013) 79–83.

[49] M. Ouzineb, M. Nourelfath, M. Gendreau, Tabu search for the redundancy allocation problem of homogenous series–parallel multi-state systems, Reliability Engineering & System Safety 93 (2008) 1257–1272.

[50] R. Abdelkader, Z. Abdelkader, R. Mustapha, M. Yamani, Search Algorithms for Engineering Optimization, InTech, 2013, pp. 241–258.

[51] L. Painton, J. Campbell, Genetic algorithms in optimization of system reliability, IEEE Transactions on Reliability 44 (1995) 172–178.

[52] G. Jiansheng, W. Zutong, Z. Mingfa, W. Ying, Uncertain multi-objective redundancy allocation problem of repairable systems based on artificial bee colony algorithm, Chinese Journal of Aeronautics 27 (2014) 1477–1487.

[53] S. Wang, J. Watada, Modelling redundancy allocation for a fuzzy random parallel-series system, Journal of Computational and Applied Mathematics 232 (2009) 539–557.

[54] M. J. Feizollahi, S. Ahmed, M. Modarres, The robust redundancy allocation problem in series-parallel systems with budgeted uncertainty, IEEE Transactions on Reliability 63 (2014) 239–250.

[55] S. Kulturel-Konak, A. Smith, D. Coit, Efficiently solving the redundancy allocation problem using tabu search, IIE Transactions 35 (2003) 515–526.

[56] D. Coit, A. Smith, Reliability optimization of series-parallel systems using a genetic algorithm, IEEE Transactions on Reliability 45 (1996) 254–266.

[57] J. Safari, Multi-objective reliability optimization of series-parallel systems with a choice of redundancy strategies, Reliability Engineering & System Safety 108 (2012) 10–20.

[58] A. Dolatshahi-Zand, K. Khalili-Damghani, Design of SCADA water resource management control center by a bi-objective redundancy allocation problem and particle swarm optimization, Reliability Engineering & System Safety 133 (2015) 11–21.

[59] A. Chambari, A. Najafi, S. Rahmati, A. Karimi, An efficient simulated annealing algorithm for the redundancy allocation problem with a choice of redundancy strategies, Reliability Engineering & System Safety 119 (2013) 158–164.

[60] L. Wang, L.-P. Li, A coevolutionary differential evolution with harmony search for reliability–redundancy optimization, Expert Systems with Applications 39 (2012) 5271–5278.

[61] Y.-C. Hsieh, P.-S. You, An effective immune based two-phase approach for the optimal reliability–redundancy allocation problem, Applied Mathematics and Computation 218 (2011) 1297–1307.

[62] E. Valian, E. Valian, A cuckoo search algorithm by Lévy flights for solving reliability redundancy allocation problems, Engineering Optimization 45 (2013) 1273–1286.

[63] G. Kanagaraj, S. Ponnambalam, N. Jawahar, A hybrid cuckoo search and genetic algorithm for reliability–redundancy allocation problems, Computers & Industrial Engineering 66 (2013) 1115–1124.

[64] Y. Nakagawa, S. Miyazaki, Surrogate constraints algorithm for reliability optimization problems with two constraints, IEEE Transactions on Reliability R-30 (1981) 175–180.

[65] A. Immonen, E. Niemelä, Survey of reliability and availability prediction methods from the viewpoint of software architecture, Software and Systems Modeling 7 (2008) 49–65.

[66] M. Shooman, Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design, John Wiley & Sons, New York, NY, USA, 2002.

[67] K. Tokuno, S. Yamada, Markovian availability modeling for software-intensive systems, International Journal of Quality & Reliability Management 17 (2000) 200–212.

[68] C. Chellappan, G. Vijayalakshmi, Dependability modeling and analysis of hybrid redundancy systems, International Journal of Quality & Reliability Management 26 (2009) 76–96.

[69] R. Chinnaiyan, S. Somasundaram, Evaluating the reliability of component-based software systems, International Journal of Quality & Reliability Management 27 (2010) 78–88.

[70] A. Sachdeva, D. Kumar, P. Kumar, Reliability analysis of pulping system using Petri nets, International Journal of Quality & Reliability Management 25 (2008) 860–877.

[71] V. Zille, C. Bérenguer, A. Grall, A. Despujols, Simulation of maintained multicomponent systems for dependability assessment, in: P. Faulin, A. Juan, S. Martorell, J. Ramírez-Márquez (Eds.), Simulation Methods for Reliability and Availability of Complex Systems, Springer, Berlin, Heidelberg, 2010, pp. 253–272.

[72] C. Leangsuksun, L. Shen, T. Liu, H. Song, S. Scott, Availability Prediction and Modeling of High Availability OSCAR Cluster, in: 5th IEEE International Conference on Cluster Computing, IEEE Computer Society, Hong Kong, China, 2003, pp. 380–386.

[73] D. Jewell, Performance Modeling and Engineering, Springer US, 2008, pp. 29–55.

[74] U. Franke, Optimal IT Service Availability: Shorter Outages, or Fewer?, IEEE Transactions on Network and Service Management 9 (2012) 22–33.

[75] S. Bosse, C. Schulz, K. Turowski, Predicting Availability and Response Times of IT Services, in: Proceedings of the 22nd European Conference on Information Systems (ECIS), Tel Aviv, Israel, 2014. URL: http://aisel.aisnet.org/ecis2014/proceedings/track20/5/.

[76] D. Cannon, ITIL Service Strategy 2011 Edition, The Stationery Office, 2011.

[77] A. Bondavalli, S. Chiaradonna, F. Di Giandomenico, F. Grandoni, Threshold-based mechanisms to discriminate transient from intermittent faults, IEEE Transactions on Computers 49 (2000) 230–245.

[78] M. Grottke, K. Trivedi, A classification of software faults, Journal of Reliability Engineering Association of Japan 27 (2005) 425–438.

[79] A.-C. Orgerie, M. D. De Assuncao, L. Lefevre, A survey on techniques for improving the energy efficiency of large scale distributed systems, ACM Computing Surveys 46 (2014) 1–35.

[80] D. Oppenheimer, A. Ganapathi, D. Patterson, Why do internet services fail, and what can be done about it?, in: 4th Usenix Symposium on Internet Technologies and Systems (USITS), 2003.

[81] Z. Yin, X. Ma, J. Zheng, Y. Zhou, L. Bairavasundaram, S. Pasupathy, An empirical study on configuration errors in commercial and open source systems, in: 23rd ACM Symposium on Operating Systems Principles (SOSP), Cascais, Portugal, 2011.

[82] L. Barroso, J. Clidaras, U. Hölzle, The Datacenter as a Computer, Synthesis Lectures on Computer Architecture, 2nd ed., Morgan & Claypool Publishers, 2013.

[83] M. Splieth, S. Bosse, C. Schulz, K. Turowski, Analyzing the effects of load distribution algorithms on energy consumption of servers in cloud data centers, in: Proceedings of the 12th International Conference on Wirtschaftsinformatik (WI), 2015.

[84] C. Patel, A. Shah, Cost Model for Planning, Development and Operation of a Data Center, Technical Report, Hewlett-Packard Laboratories Palo Alto, 2005.

[85] G. Ciardo, J. Muppala, K. Trivedi, SPNP: Stochastic Petri Net Package, in: Proceedings of the 3rd International Workshop PNPM, IEEE Computer Society, 1989, pp. 142–151.

[86] M. A. Marsan, G. Balbo, A. Bobbio, G. Chiola, G. Conte, A. Cumani, The Effect of Execution Policies on the Semantics and Analysis of Stochastic Petri Nets, IEEE Transactions on Software Engineering 15 (1989) 832–846.

[87] E. Pinheiro, W.-D. Weber, L. A. Barroso, Failure trends in a large disk drive population, in: Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST), 2007.

[88] C. M. Fonseca, P. J. Fleming, An overview of evolutionary algorithms in multiobjective optimization, Evolutionary Computation 3 (1995) 1–16.

[89] S. Kulturel-Konak, A. E. Smith, B. A. Normal, Multi-objective tabu search using a multinomial probability mass function, European Journal of Operational Research 169 (2006) 918–931.

[90] K. Deb, S. Agrawal, A. Pratap, T. Meyarivan, A Fast Elitist Non-dominated Sorting Genetic Algorithm for Multi-objective Optimization: NSGA-II, in: Proceedings of the 6th International Conference on Parallel Problem Solving from Nature, volume 1917 of Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2000.

[91] M. A. Ardakan, A. Z. Hamadani, M. Alinaghian, Optimizing bi-objective redundancy allocation problem with a mixed redundancy strategy, ISA Transactions 55 (2015) 116–128.

[92] T. Bäck, Evolutionary Algorithms in Theory and Practice, Oxford University Press, 1996.

[93] B. Schroeder, E. Pinheiro, W.-D. Weber, DRAM Errors in the Wild: A Large-Scale Field Study, Communications of the ACM 54 (2011) 100–107.