Author’s Accepted Manuscript

Multi-Objective Optimization of IT Service Availability and Costs
Sascha Bosse, Matthias Splieth, Klaus Turowski

www.elsevier.com/locate/ress

PII: S0951-8320(15)00331-2
DOI: http://dx.doi.org/10.1016/j.ress.2015.11.004
Reference: RESS5442

To appear in: Reliability Engineering and System Safety
Received date: 30 April 2015
Revised date: 15 October 2015
Accepted date: 7 November 2015

Cite this article as: Sascha Bosse, Matthias Splieth and Klaus Turowski, Multi-Objective Optimization of IT Service Availability and Costs, Reliability Engineering and System Safety, http://dx.doi.org/10.1016/j.ress.2015.11.004

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Multi-Objective Optimization of IT Service Availability and Costs

Sascha Bosse (a,∗), Matthias Splieth (a), Klaus Turowski (a)

(a) Magdeburg Research and Competence Cluster for Very Large Business Applications, Faculty of Computer Science, Otto von Guericke University Magdeburg, P.O. Box 4120, 39016 Magdeburg, Germany
Abstract

The continuous provision of highly available IT services is a crucial task for IT service providers in order to fulfill service level agreements with customers. Although the introduction of redundant components increases availability, the associated cost may be very high. Therefore, decision makers in the IT service design stage face a trade-off between cost and availability when defining suitable service level objectives. Although this task can be seen as a redundancy allocation problem, the existing definitions in this area are not transferable to IT service design because they assume independent component failures, an assumption that has been identified as unrealistic in IT systems. In this paper, a multi-objective redundancy allocation problem for IT service design is defined. To evaluate it, a Petri net Monte Carlo simulation is developed that estimates the availability and costs of a specific design. In order to provide (sub)optimal solutions to an IT service redundancy allocation problem, two meta-heuristics, namely a genetic algorithm and tabu search, are adapted. The approach is applied to optimize the IT service design of an application service provider in terms of availability and cost in order to demonstrate its feasibility and suitability.

Keywords: IT Service Management, Availability, Reliability, Redundancy Allocation Problem, Cost Optimization
∗ Corresponding author. Email addresses: [email protected] (Sascha Bosse), [email protected] (Matthias Splieth), [email protected] (Klaus Turowski)
Preprint submitted to Reliability Engineering & System Safety, November 14, 2015

1. Introduction

The importance of IT services is ever increasing. On the one hand, trends such as Cloud Computing bring millions of consumers in contact with IT services. On the other hand, even internal IT organizations are commonly understood as IT service providers in order to effectively manage costs and the business value of IT [1]. Service or Operational Level Agreements (SLAs/OLAs)^1 document the quality of service that is to be expected by an IT service consumer, for instance, for the service availability [2].

^1 In the following, the term SLA(s) is used for both SLAs and OLAs.

Availability is one of the most crucial quality aspects for customers [3], and can be defined as the likelihood that a service is able to provide its function at a certain point in time [4]. Although service availability is seen “at the core of customer satisfaction and business success” [5, p. 127], even the big IT companies are suffering severe service disruptions that last hours or even days (e.g. Amazon [6], Apple [7] and Microsoft [8]). In August 2013, a five-minute inaccessibility of the Google services led to a 40 % decrease in internet traffic and an estimated revenue loss of over US-$ 500,000 for Google alone [9]. However, smaller enterprises are also affected by unavailability: 134 companies that have been studied by the Aberdeen Group each suffered on average a revenue loss of more than one million US-$ in 2012 due to IT downtime [10].

The two basic approaches to increase system availability are the introduction of more reliable components and the implementation of redundancy mechanisms [11]. However, the associated cost of these approaches may not be justified by the availability improvements. In addition to this, the special characteristics of software components limit their reliability [12]. The balancing of availability and cost with respect to desired or existing service level objectives is one of the core activities in IT Service Management, and is described in well-known frameworks such as the ISO 20000 (Service Continuity and Availability Management) [13], CoBIT 5 (Managing Availability and Capacity) [14] and the IT Infrastructure Library (ITIL) (Availability Management).

In the 2011 version of ITIL, service availability is characterized as an essential service quality attribute immediately influencing customer satisfaction [5]. Additionally, SLA violations in the operation phase can lead to penalty costs and loss of reputation for the IT service provider [15]. Since design changes due to insufficient service quality in the operation phase (reactive measures) can be very costly, measures to achieve sufficient IT service availability should be considered in the service design stage (proactive measures) [5, 16]. Nevertheless, a lack of feasible supporting tools for high availability design has also been noted [5]. This is mainly caused by the fact that classical analytical availability/reliability models that have been successfully applied in other domains assume independent component failures [17]. This assumption is unrealistic in modern IT systems due to the presence of inter-component dependencies, thus rendering results obtained from these models useless for decision support [18].

Examples of these inter-component dependencies are common cause failures and imperfect switching. In the former case, even heterogeneous components can be subject to the same fault under certain conditions [12]. Imperfect switching describes the phenomenon that a redundant component may not cover the failure of an active component due to problems in the switching process [19]. In addition, operator interaction may be required for component recovery or switching, so that operator errors are a major cause of IT service unavailability [20, 21]. However, these errors are not represented in classical availability models.

In order to provide a general modeling approach that overcomes the independent failure assumption, several approaches that are applicable for IT service availability estimation from design information were recently developed. The majority of these approaches are based on the modeling of the availability state-space, which allows for the introduction of dependencies [22]. However, the capability of these approaches for decision support in availability management is questionable since they require a high modeling effort and were never integrated with optimization procedures in order to suggest (sub)optimal design configurations.

Such optimization approaches have been developed in the context of redundancy allocation problems (RAPs). Under this term, several reliability/availability models and optimization procedures are subsumed that can be utilized to optimize system design in terms of availability, cost and other constraints. For this purpose, required subsystems and possible component choices for each subsystem are modeled so that (sub)optimal combinations of component choices can be identified. Since the combinatorial computation of availability in these approaches assumes independent component failures, they are not applicable for IT service design. Nevertheless, the developed optimization procedures are mainly based on flexible meta-heuristics such as evolutionary algorithms, which can be applied to a wide range of problems. Therefore, the question arises whether these procedures can be integrated with IT service availability estimation methods for IT service design optimization.

The goal of this work is to provide decision support for IT service designers. To this end, a redundancy allocation problem is defined that models the relevant aspects of IT
service availability and costs depending on possible IT service designs. In the course of the paper, this problem is referred to as the ITRAP. In combination with a suitable IT service availability estimation method and solution algorithm, the ITRAP can be utilized to optimize availability and costs of an IT service based on design information.

A constructivist approach is followed in order to achieve this goal (cf. e.g. [23, 24]). In Section 2, the related literature is presented to outline the relevance of the investigated problem. In this section, suitable approaches in the topics of IT service availability estimation and redundancy allocation optimization are also identified in order to develop a RAP for IT service design. Based on the literature analysis, requirements for a RAP for IT service design are derived.

These requirements as well as the ITRAP artifact are presented in Section 3. This artifact is a framework designed in order to reach the goal of this work and consists of the problem definition, an availability and costs estimation method based on Petri net Monte Carlo simulation, and two adapted solution algorithms (a genetic algorithm and a tabu search). The artifact is evaluated by applying a prototypical implementation of the ITRAP to an IT service design optimization in a real-world use-case, an international application service provider, which is presented in Section 4. Section 5 concludes the article by discussing the contribution of the paper as well as by providing an outlook to further research activities.

2. Related Work

2.1. The Redundancy Allocation Problem

In reliability/availability optimization, a redundancy allocation problem (RAP) can be utilized to determine suitable redundancy configurations for systems design. For this purpose, it is instantiated with system design information, e.g. about reliability characteristics of possible component choices for required functional units of a system. On that basis, optimization algorithms identify (sub)optimal solutions in terms of availability, cost and other constraints.

The first RAPs were defined during the 1960s and were mostly solved by exact solution methods such as linear or dynamic programming. In the last 25 years, more complex definitions were established in order to provide a more realistic model of the investigated systems. Due to the increased complexity of the optimization problem, more efficient solution algorithms, especially meta-heuristics, were applied. In the following, the development of RAP definitions and solution algorithms is briefly sketched. For more information on the RAP topic, one may refer to the literature reviews in [25–27].

Definitions. In 1962, Kettelle was one of the first researchers to describe and solve an optimization problem in which cost is to be minimized subject to an availability constraint [28]. The term redundancy allocation problem was first used by Fyffe et al. in 1968 [29]. Researchers utilized RAPs to maximize availability/reliability (e.g. in [29, 30]), to minimize cost (e.g. in [28, 31, 32]), as well as for the multi-objective optimization of availability and cost (e.g. in [33–38]). In 1992, Chern et al. defined a RAP, which they proved to be an NP-hard optimization problem [39]:

(i) A system consists of s required subsystems (series-parallel system).
(ii) In a subsystem, a number of functionally equal components can be used in active redundancy.
(iii) A subsystem component can either be working or failed (binary-state); failures are independent and identically distributed in a subsystem (homogeneous redundancy).
(iv) The reliability of the system is to be maximized subject to linear constraints such as cost, weight, or volume.
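Under assumptions (i) to (iv), system reliability follows from elementary probability: a subsystem with n identical, independent components of reliability r works with probability 1 - (1 - r)^n, and the series system works only if every subsystem works. The following Python sketch enumerates redundancy levels for a small instance under a single linear cost constraint. The function names, the component data and the budget are hypothetical and chosen only for illustration; exhaustive enumeration is used for brevity and is not the solution method of [39].

```python
from itertools import product
from math import prod


def system_reliability(reliabilities, counts):
    """Series-parallel reliability for independent, identical components per subsystem."""
    return prod(1.0 - (1.0 - r) ** n for r, n in zip(reliabilities, counts))


def best_allocation(reliabilities, costs, budget, max_redundancy=4):
    """Enumerate all redundancy levels and keep the most reliable feasible one."""
    best = None
    levels = range(1, max_redundancy + 1)
    for counts in product(levels, repeat=len(reliabilities)):
        cost = sum(c * n for c, n in zip(costs, counts))
        if cost > budget:
            continue  # violates the linear cost constraint
        rel = system_reliability(reliabilities, counts)
        if best is None or rel > best[1]:
            best = (counts, rel, cost)
    return best


# Hypothetical instance with three subsystems (illustrative numbers only).
reliabilities = [0.90, 0.95, 0.80]
costs = [3.0, 4.0, 2.0]
counts, rel, cost = best_allocation(reliabilities, costs, budget=20.0)
print(counts, round(rel, 4), cost)
```

The search space grows exponentially with the number of subsystems, which is consistent with the NP-hardness result cited above and motivates the meta-heuristics discussed next.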
In the subsequent years, this definition was further extended in order to provide a more realistic problem. Some literature examples for introduced characteristics are presented in Table 1.

Table 1: Included characteristics in RAP definitions with exemplary references.

Characteristic | Description | References
--- | --- | ---
Heterogeneous redundancy | Subsystem components may have different failure distributions | [33, 34, 40–42]
Passive redundancy | Decreased failure rate for passive components | [38, 43–45]
Complex design | Subsystems can be arranged hierarchically/arbitrarily | [46–48]
Multi-state components | Modeling of performance degradation | [31, 49, 50]
Uncertainty | Stochastic, fuzzy, fuzzy-random and interval input parameters | stochastic [44, 51], fuzzy [11, 37, 52], fuzzy-random [53], interval [43, 47, 54]

Nevertheless, only a few works could be identified in the RAP literature that deal with failure dependencies and, therefore, do not use combinatorial approaches to estimate system availability. In [12], Chi and Kuo modeled common cause failures in software systems by introducing an additional subsystem-critical component, the common cause component. Other dependencies such as operator interaction or imperfect switching were not considered. Lins and Droguett modeled repair policies and failure-repair cycles in a RAP, and used alternating renewal processes in combination with discrete-event simulation to compute system availability [34]. Limited operator resources as well as other tasks, for instance the activation of standby-redundant components (takeover), were not modeled.

Solution Algorithms. In the first decades since their introduction, the defined RAPs were mostly solved by mathematical methods such as linear, dynamic, or non-linear programming (e.g. [28, 29]). However, with increasing problem complexity, these approaches are very inefficient unless the search space is massively restricted [40]. Therefore, mathematical methods are only applicable to small-sized problems [55]. Besides the mathematical programming approaches, Soltani identified heuristics and meta-heuristics [27] as alternative solution methods.

On the one hand, heuristics are developed for a specific problem and are hardly transferable [27]. Meta-heuristics, on the other hand, are heuristics that can be applied to arbitrary optimization problems, which makes them very popular. Since these approaches are mostly evolutionarily inspired, they are based on artificial reasoning instead of classical mathematics [25]. In general, such an algorithm performs a directed search through the search space while considering several solutions. In doing so, it combines the exploitation of promising areas in the search space with the exploration of the whole search space. A solution’s degree of feasibility is represented by its fitness value, which is the major input for an evolutionary algorithm’s operations that adapt solutions.

Soltani identified the following classes of meta-heuristics in RAP research [27], which are listed here with exemplary papers:

– Genetic algorithm (GA) [34, 38, 40, 43, 45, 47, 56, 57],
– Tabu search (TS) [35, 49, 55],
– Particle swarm optimization (PSO) [37, 58],
– Simulated annealing (SA) [46, 59],
– Ant colony optimization (ACO) [41],
– Honey bee mating algorithm (HBMO) [30],
– Artificial bee colony (ABC) [36, 52],
– Harmony search (HS) [32, 60],
– Immune-based algorithm (IA) [42, 61] and
– Cuckoo search (CS) [62, 63].

Although the proposed solution algorithms are all evolutionary algorithms based on the same principles, the performance of the search processes can vary considerably. Most approaches are evaluated by using standard examples from literature, for instance Nakagawa’s and Miyazaki’s 33 problems presented in [64], and comparing an algorithm’s performance to other results from literature (e.g. in terms of solution quality or computational complexity). However, the question under which conditions an approach is superior to other ones can only be answered if the approaches were compared on a wide range of numerical problems [25].

2.2. Estimating IT Service Availability

Availability estimation methods for IT services can be classified as qualitative and quantitative approaches as well as black-box and white-box approaches [20]. While
qualitative approaches such as expert interviews are rather subjective and hardly transferable [19], quantitative black-box or data-based methods utilize availability data, e.g. of monitoring tools, in order to estimate future service availability quantitatively. However, these approaches require suitable data sources that may not be accessible in the service design stage. Therefore, the internal structure and composition of a (software) system should be used as the input to a quantitative estimation method in this phase, leading to white-box or analytical approaches [65].

These analytical approaches can be further distinguished by the underlying computation model into combinatorial and state-space-based methods [22]. In combinatorial models, the component availability A(c) is computed from the mean time to failure (MTTF) as well as the mean time to recovery (MTTR). System availability can be computed from components’ availability using probability theory (cf. e.g. [66]).

Combinatorial approaches are very fast and easy to apply and are, therefore, mostly used for redundancy allocation problems. However, their applicability for IT service availability estimation is limited due to the assumption of independent components. As a consequence, complex dependencies of modern IT systems such as imperfect coverage or standby systems cannot be modeled [17, 22].

Those dependencies can be mapped by using a state-space approach in which all possible system states and the transition probabilities/rates between them are modeled. Markov chains model the state-space directly (e.g. in [19, 67–69]), but suffer from the problem of state-space explosion, which leads to problems in construction, storage and solution, especially in large-scale applications [70]. The combination of system-level combinatorial models and component-level state-space models (hierarchical approach) can reduce this problem significantly (e.g. in [22]). Another alternative is the encoding of the state-space in a Petri net [17], which can be solved by Monte Carlo simulation in order to avoid problems regarding state-space explosion (e.g. in [17, 70–72]). Simulation approaches also have the advantage that they allow for dynamic analysis [73], which can support IT service management more effectively [74]. The increased time consumption of simulation techniques can be reduced by the parallelization of the independent replications, making Monte Carlo simulation feasible for real-world problems [70].

In [20], the author develops a hierarchical Petri net Monte Carlo simulation approach for predicting IT service availability. The model includes limited operator capacities, operator errors, series-parallel systems, arbitrary time to failure and recovery distributions as well as standby redundancy mechanisms. This approach is extended in [75], in which an interface for generic inter-component dependencies is defined. Since this approach is very flexible and scalable to real-world problems, the availability estimation method that is used to evaluate designs for the ITRAP is based on this work.
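As a minimal illustration of the two computation styles discussed in this section, the sketch below compares the combinatorial steady-state availability of a single repairable component, MTTF / (MTTF + MTTR), with a Monte Carlo estimate obtained from independent replications. It is not the Petri net simulation of [20, 75]; the exponential distributions, the parameter values and all function names are assumptions made for illustration. Each replication uses its own random number generator, so the replications are independent and could be distributed across processes.

```python
import random
from statistics import mean


def analytic_availability(mttf, mttr):
    """Steady-state availability of a single repairable component."""
    return mttf / (mttf + mttr)


def simulate_replication(sample_ttf, sample_ttr, horizon, rng):
    """Simulate alternating up/down periods and return the fraction of time up."""
    t, up_time = 0.0, 0.0
    while t < horizon:
        ttf = sample_ttf(rng)
        up_time += min(ttf, horizon - t)  # count only up time inside the horizon
        t += ttf
        if t >= horizon:
            break
        t += sample_ttr(rng)  # component is down while being recovered
    return up_time / horizon


def estimate_availability(sample_ttf, sample_ttr, horizon, replications, seed=0):
    """Average the results of independent replications (one generator each)."""
    results = []
    for i in range(replications):
        rng = random.Random(seed + i)
        results.append(simulate_replication(sample_ttf, sample_ttr, horizon, rng))
    return mean(results)


# Assumed example values in hours; any sampling function could be substituted.
MTTF, MTTR = 100.0, 5.0
estimate = estimate_availability(
    lambda rng: rng.expovariate(1.0 / MTTF),
    lambda rng: rng.expovariate(1.0 / MTTR),
    horizon=10_000.0,
    replications=50,
)
exact = analytic_availability(MTTF, MTTR)
print(round(exact, 4), round(estimate, 4))
```

Because the samplers are passed in as functions, non-exponential time-to-failure or recovery distributions can be used without changing the simulation loop, which is the flexibility that motivates simulation over purely Markovian models.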
3. A Multi-Objective RAP for IT Service Design (ITRAP)

3.1. Requirements Analysis

A RAP for IT service design can be described by the optimization problem in Equation 1: for a given time t, designs X have to be found in which the costs of the service C(X, t) are minimized while the service availability A(X, t) is maximized. A multi-objective problem is chosen since it is more flexible with respect to resource constraints [25], and thus can support the definition of feasible and cost-efficient service-level objectives for availability in the design stage.

$$\min C(X, t) \;\wedge\; \max A(X, t) \qquad (1)$$

With respect to scientific literature in RAP and IT service availability estimation research, functional requirements for a RAP for IT service design could be defined. These are presented in Table 2. The requirements are structured in four groups according to their scope, namely requirements for IT service component models, for the relation of components to the service (dependencies), for the human operators as well as for the associated cost.

Table 2: Functional requirements of a RAP for IT service design.

Group | Requirements
--- | ---
R1 IT service components | R1.1 Repairable components; R1.2 Different fault types; R1.3 Performance degradation; R1.4 Arbitrary time distributions
R2 IT service dependencies | R2.1 Parallel-series system; R2.2 Heterogeneous redundancy; R2.3 Standby dependencies; R2.4 Inter-component dependencies
R3 IT service operators | R3.1 Operator tasks; R3.2 Limited operators; R3.3 Human errors
R4 IT service costs | R4.1 Capital costs; R4.2 Power costs; R4.3 Operator costs; R4.4 Recovery costs

While the first three groups of requirements refer mainly to IT service availability (R1-R3), R4 refers to IT service costs. In order to estimate the service availability, first the component availability has to be modeled (R1). IT service components can be hardware, software, network and infrastructure components [19] as well as other supporting IT services [76]. Since availability depends on the failure as well as the recovery time, a component must be repairable (R1.1) [34].

In addition, a component may be affected by more than one type of fault (R1.2). In general systems, one can distinguish between transient (short-time), intermittent (frequent after first occurrence) and permanent faults (need for replacement) [77]. The complexity of software systems leads to specific software fault types such as bohrbugs (easily reproduced and fixed), mandelbugs (complex cause, seemingly chaotic behavior), heisenbugs (non-reproducible when isolated) and age-related bugs [78]. This last bug type is characterized by the phenomenon of performance degradation (R1.3), which can be mapped by modeling components as multi-state machines [31]. Since poor IT service performance can lead to unavailability, the concept of performability has to be considered in an availability estimation [75].

The assumption of exponentially distributed transition times that is made in order to apply pure Markov approaches is unrealistic, especially for recovery times [34, 67, 68]. This means that arbitrarily distributed transition times are a requirement for a RAP for IT service design (R1.4).

In order to compute IT service availability from component availabilities, modeling of the IT service dependencies is required (R2). This includes the classical series-parallel system (R2.1) in which different functionally equivalent components can be used that differ in their availability or cost characteristics (R2.2). The introduction of heterogeneous redundancy increases the realism of the RAP [40], and is possibly even required to reach high availability [50].

Standby redundancy mechanisms are also required to be included in the RAP (R2.3) due to their influence on IT service availability and costs: a standby redundant component can have a much lower failure rate [44, 45] and reduced energy costs [79]. These effects as well as the time until a standby component may take over for a failed or degraded active component depend on the redundancy type, which can be e.g. hot, warm or cold standby [38, 45]. These types differ in the standby component’s initial state, which can be full operation without load (hot), a near-operational state (warm) or a completely deactivated component (cold).

However, the takeover process of a standby component can be error-prone, which is called imperfect switching [19, 59]. In order to map such dependencies as well as
complex system designs as described e.g. in [47], generic inter-component dependencies are another requirement of the RAP for IT service design (R2.4).

The modeling of operators (R3) is essential for IT service availability estimation since they may be responsible for maintenance, recovery as well as takeover tasks (R3.1) [20]. However, operators are limited resources (R3.2) that process tasks according to their priority [19]. Due to the complexity of operator interaction, operator errors are one of the major sources of IT service unavailability, as indicated in [80, 81]. Therefore, operator errors (R3.3) should be incorporated in service availability models [21].

The costs of an IT service (R4) can be classified as capital and operational expenses [82]. Capital costs (R4.1) are initial investments for the IT service and are traditionally modeled in RAPs as the sum of components’ initial costs. However, costs for data center site construction may also arise [82]. The operational expenses are mainly the sum of power (R4.2), operator (R4.3) [82] and recovery costs (R4.4) [34]. The power costs of components and cooling infrastructure are great cost drivers for data centers [83]. However, studies such as [84] indicate that the cooling costs are directly proportional to component power costs.

In the following, the ITRAP definition is presented that matches these requirements.

3.2. ITRAP Definition

Let $ITRAP_{\Delta t} = (S, demand, wage_{\Delta t}, p_{err})$ be a RAP for IT service design defining subsystems $S$, the demand level as needed service performance ($demand \in \mathbb{R}_{>0}$), the operator wage for a timestep $wage_{\Delta t}$, and the probability of operator errors $p_{err} \in [0, 1]$. $S = \{s_1, \dots, s_n\}$ is the set of subsystems with $s_i = \{x_{i1}, \dots, x_{im}\}$ denoting a set of different components. A component $x_{ij} = (Z, per, pow_{\Delta t}, c_{init}, R_{ij}, T_{ij})$ is characterized by its states $Z$, the component performance function depending on the current state $per : Z \to \mathbb{R}_{\geq 0}$, the power costs function $pow_{\Delta t} : Z \to \mathbb{R}$, the initial costs $c_{init}$ as well as sets of redundancy types $R_{ij}$ and state transitions $T_{ij}$.

A redundancy type $r = (z_0, per_{min}, TTA, \omega, \rho)$ defines the component’s initial state $z_0$, the minimal subsystem performance $per_{min}$ before the component is activated to full operation, a random variable $TTA$ for the time to activation, the Boolean value $\omega$ determining whether operator interaction is needed for activation, and the task priority $\rho$. A transition $tr = (z_{start}, z_{end}, TT, \omega, \rho, c_{tr}, D_{tr})$ describes the process of a state change from $z_{start}$ to $z_{end}$ with the random variable $TT$ representing the transition time, possibly also requiring operator interaction with an associated task priority. When a transition takes place, transition costs $c_{tr}$ are incurred (e.g. recovery costs). $D_{tr}$ is a set of generic dependencies $(d, x, p)$, each with a function $d : Z_x \to Z_x$ that changes the state of a component $x$ with probability $p \in [0, 1]$.
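The tuples of the ITRAP definition map naturally onto record types. The following Python sketch shows one possible encoding; all class and field names are hypothetical, random variables are represented as sampling functions, and the example component with its numbers is invented purely for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumed encoding of a random variable: a function drawing one sample.
Sampler = Callable[[random.Random], float]


@dataclass
class Dependency:  # one element (d, x, p) of D_tr
    effect: Dict[str, str]  # d: Z_x -> Z_x, given as a state mapping
    target: str  # name of the affected component x
    probability: float  # p in [0, 1]


@dataclass
class Transition:  # tr = (z_start, z_end, TT, omega, rho, c_tr, D_tr)
    start: str
    end: str
    time: Sampler
    needs_operator: bool
    priority: int
    cost: float
    dependencies: List[Dependency] = field(default_factory=list)


@dataclass
class RedundancyType:  # r = (z_0, per_min, TTA, omega, rho)
    initial_state: str
    min_performance: float
    time_to_activation: Sampler
    needs_operator: bool
    priority: int


@dataclass
class Component:  # x_ij = (Z, per, pow, c_init, R_ij, T_ij)
    name: str
    performance: Dict[str, float]  # per: Z -> R>=0; the keys form Z
    power_cost: Dict[str, float]  # pow: Z -> R
    initial_cost: float
    redundancy_types: List[RedundancyType]
    transitions: List[Transition]


@dataclass
class ITRAP:  # (S, demand, wage, p_err)
    subsystems: List[List[Component]]
    demand: float
    wage: float
    p_err: float


def validate(problem: ITRAP) -> None:
    """Check the domain restrictions stated in the definition."""
    if problem.demand <= 0:
        raise ValueError("demand must be positive")
    if not 0.0 <= problem.p_err <= 1.0:
        raise ValueError("p_err must lie in [0, 1]")
    for subsystem in problem.subsystems:
        for comp in subsystem:
            states = set(comp.performance)
            for tr in comp.transitions:
                if tr.start not in states or tr.end not in states:
                    raise ValueError(f"unknown state in transition of {comp.name}")


# Invented example: a two-state server with failure and operator-driven recovery.
server = Component(
    name="server",
    performance={"up": 100.0, "down": 0.0},
    power_cost={"up": 0.02, "down": 0.0},
    initial_cost=5000.0,
    redundancy_types=[RedundancyType("up", 0.0, lambda rng: 0.0, False, 0)],
    transitions=[
        Transition("up", "down", lambda rng: rng.expovariate(1 / 1000), False, 0, 0.0),
        Transition("down", "up", lambda rng: rng.expovariate(1 / 4), True, 1, 200.0),
    ],
)
problem = ITRAP(subsystems=[[server]], demand=100.0, wage=30.0, p_err=0.05)
validate(problem)
```

Representing the state set implicitly through the keys of the performance mapping keeps the sketch short; an explicit state enumeration would be an equally valid choice.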
In a subsystem $s_i$, $\Psi_i = \{(x_{ij}, r) \mid x_{ij} \in s_i, r \in R_{ij}\}$ is the set of all possible component choices. A sequence $\psi_i = ((x, r)_k)_{k \in \mathbb{N}^+}$ with $(x, r)_k \in \Psi_i$ defines the component choices for a subsystem with at least one and possibly multiple equal choices. A solution candidate $X = (\psi_1, \dots, \psi_n, o)$ is the sequence of the subsystems’ component choices and the number of operators $o$.

The state of a component choice $\chi = (x, r)$ at time $t$, $z_\chi(t)$, depends not only on its redundancy type and state transitions, but also on the number of available operators for transitions as well as on defined dependencies to other component choices. In addition to that, the introduction of concurrent, arbitrarily distributed transitions complicates the state-space. Therefore, the state distribution of the whole IT service system cannot be easily decomposed into component state distributions, which means it cannot be modeled as a Markov or a renewal process. Thus, there is no analytical expression of the system state distribution [34].

Given an $ITRAP_{\Delta t}$, the costs of an IT service design $X$ for a time period $t$, $C(X, t)$, can be calculated according to Equation 2 with

• $C_{init}(X) = \sum_{\chi \in X} c_{init,\chi}$ the initial costs,
• $C_{pow}(X, t) = \frac{1}{\Delta t} \sum_{\chi \in X} \int_0^t pow_\chi(z(\tau)) \, d\tau$ the power costs,
• $C_{op}(X, t) = o \cdot wage_{\Delta t} \cdot \frac{t}{\Delta t}$ the operator costs and
• $C_{tr}(X, t) = \sum_{\chi \in X} \sum_{tr \in T_\chi} c_{tr} \cdot \#(tr, t)$ the transition costs.

$$C(X, t) = C_{init}(X) + C_{pow}(X, t) + C_{op}(X, t) + C_{tr}(X, t) \qquad (2)$$

The availability objective to be maximized is the average IT service availability, defined as $A(X, t) = \frac{1}{t} \int_0^t p(X, \tau) \, d\tau$, which can be interpreted as the mean performability of the service. The performability $p$ at time $t$ is determined according to Equation 3 and depends on the demand level as well as the actual service performance $per(X, t)$. Therefore, the availability objective function satisfies the requirement of modeling a relation between availability and performance.

$$p(X, t) = \begin{cases} 1, & \text{if } per(X, t) \geq demand \\ \dfrac{per(X, t)}{demand}, & \text{else} \end{cases} \qquad (3)$$

The performance of an IT service design $X$ at time $t$, $per(X, t)$, is defined as the minimum subsystem performance, which is the sum of its components’ performances as shown in Equation 4 [50].

$$per(X, t) = \min_{1 \leq i \leq n} \sum_{\chi \in \psi_i} per(z_\chi(t)) \qquad (4)$$

Using the definitions given above, the optimization problem in Equation 1 can be addressed by estimating the system state distribution and, thus, the IT service costs and availability.

3.3. Availability and Cost Estimation

Due to the complex state-space of an ITRAP even for small-sized problems, the explicit modeling of the state-space would be problematic. Therefore, the state-space of a solution candidate $X$ for an ITRAP is encoded in a generalized stochastic Petri net (GSPN) (cf. [85]).

Generalized Stochastic Petri Nets. A GSPN is a bipartite graph consisting of places and transitions. Its basic model elements are presented in Figure 1. Places (displayed as blank circles) represent states and conditions that can be marked with tokens (smaller black filled circles) to indicate that a state/condition is active. Thus, the marking of all places of a GSPN represents the system state. Transitions change this system state by destroying and creating tokens in certain places (firing). In a GSPN, transitions are activated in certain markings and can fire either immediately (immediate transitions, displayed as black filled rectangles) or after a deterministic or random time (timed transitions, displayed as blank rectangles) once the marking is reached. The edges between
the nodes of a GSPN determine which places have to be marked for a transition to be activated (input arcs) and in which places new tokens are created after firing (output arcs).

Figure 1: Basic elements of a GSPN (place with marking, timed transition, immediate transition, weighted input/output arc, inhibitor arc).

For timed transitions, the marking can change after activation, but before the associated firing time has passed. If the firing of a transition may lead to a new marking in which another transition is no longer activated, a conflict exists between these transitions. In a GSPN, the transition with the shorter firing time is executed, which is called race or concurrency. The firing time is computed as a deterministic or randomly distributed value minus the transition’s age value. This value can store the time between a transition’s activation and its deactivation due to the firing of another transition (race age) [86]. If the transition eventually fires, the value is reset to zero. A conflict between more than one immediate transition can be resolved if probabilities are assigned to these transitions that determine their firing probability.

In addition to input and output arcs, inhibitor arcs from places to transitions can be defined (represented by edges

variables immediately in the case of transition firing (impulse reward) or continuously based on the marking of places (rate reward) [85].

Assumptions. In order to create a GSPN from an ITRAP definition, the following assumptions are made:

• A component choice is in its initial state at t = 0,
• Components can always be recovered from failures/performance degradations (unlimited supply of recovery measures),
• An operator in the model represents a position that is always available for operator tasks (no operator schedule) and
• Operator errors happen with a constant probability and require the affected task to be repeated.

ITRAP GSPN Model. Depending on the ITRAP definition and the solution candidate $X$, a GSPN is automatically created. First, for each component choice $(x, r)$, $|Z_x|$ places are generated representing the component’s states. Timed transitions between these states are established according to the defined state transitions in $T_x$. For manual transitions, an operator place is created as an additional precondition for this transition. In Figure 2, an example component model is displayed with three states as well as failure and recovery transitions. The recovery from state 0 to state 2 requires operator interaction.

A component choice’s performance and power costs are modeled as rate rewards according to the performance function $per_c$ and the power function $pow_c$. Depending on the redundancy type of a component choice, a token is created in the state $z_0$. Each time a transition fires, an
with a small circle at the head). If such a place is marked,
impulse reward according to the assigned transition costs
the connected transition will be deactivated at all events.
is generated.
This concept can be generalized by so-called enabling func-
If z0 is not the state of full performance, the redundancy
tions in which Boolean state conditions can be defined un-
is considered as standby, leading to creation of a standby
der which a transition is activated. 495
530
Reward functions can be introduced to change global
place that is marked in the initial system state. This token deactivates recovery transitions to states with higher
9
performance. A takeover transition with possible operator interaction is generated that destroys the standby token and sets the component state to full operation. An immediate standby transition creates the standby token again and sets the component to the standby state. An enabling function is associated with the takeover and standby transitions, which ensures that these transitions are only activated if subsystem performance is under (respectively above) the defined performance minimum $per_{min}$. For this purpose, a Boolean function is built from all possible subsystem states in which this happens. In Figure 3, such a standby component is illustrated for a two-state component with manual takeover.

Figure 2: An example component GSPN model.

Figure 3: GSPN model of a standby redundant component.

After GSPN models for the component choices are instantiated, the dependencies $d \in D_{tr}$ are processed. For this purpose, a transition tr generates a token in a dependency place each time it fires. An immediate transition behind this place changes the state of each component choice that is of the affected component type $x_d$ with probability $p_d$. Another immediate transition with probability $1 - p_d$ simply deletes the token. An example of a dependency for which the recovery of one component may lead to the failure of another is presented in Figure 4, for instance, for a hard disk that may fail if a data center recovers from a power outage [87].

Figure 4: Example dependency GSPN model.

For the defined number of operators, tokens are created in the operator pool place. Each time an operator token is required to activate a state transition (recovery or takeover), this is called an operator task. An immediate transition between the operator pool and operator places for each task is defined. The enabling function of this immediate transition is true if the corresponding operator task exists and no other task with higher priority is due. Thus, operator tokens remain in the operator pool until tasks are created. If more tasks are due than operator tokens are in the operator pool, the task with the higher priority is executed.

Another immediate transition is created between the operator place and the operator pool that is enabled if the operator task is finished and leads to the operator's return to the pool with probability $1 - p_{err}$. A conflicting immediate transition with probability $p_{err}$ represents an operator error, leading to a rollback of the task. That means that the component state that was leading to the task is restored as well as the token in the operator place (cf. Figure 5).

When the model for a solution candidate X is created, its behavior can be simulated by a discrete-event simulation. For this purpose, the availability in the beginning A(X, 0) can be computed from the component choices' initial states. Each time t a state change happens (or at the end of the simulation), the new average availability can be computed by Equation 5 using the time of the last state change (or the beginning of the simulation) $t_0$, the old average availability $A(X, t_0)$ and performability $p(X, t_0)$ (cf. Equation 3). The power costs for a component can be computed accordingly and be added to the initial costs, operator costs and the transition cost rewards to estimate IT service costs.

$$A(X, t) = \frac{t_0 \cdot A(X, t_0) + (t - t_0) \cdot p(X, t_0)}{t} \qquad (5)$$

In order to achieve statistically significant results, the
discrete-event simulation is carried out a number of times for different random seeds (Monte Carlo simulation). By using the mean values $\mu$ and empirical variances $s^2$ of availability and costs after n independent replications, a $(1 - \alpha)$-confidence interval for the expected value $\hat{\mu}$ of a random variable with unknown distribution and variance can be computed as shown in Equation 6, for which $z_q$ is the q-quantile of the standard normal distribution.

$$P\left(\hat{\mu} \in \left[\mu - z_{1-\frac{\alpha}{2}} \frac{s}{\sqrt{n}};\ \mu + z_{1-\frac{\alpha}{2}} \frac{s}{\sqrt{n}}\right]\right) = 1 - \alpha \qquad (6)$$

Nevertheless, the time consumption of a Monte Carlo simulation run is dramatically increased in comparison to a combinatorial evaluation even for small-sized problems. Tests on an Intel(R) Core(TM) i5-2500 CPU @ 3.30 GHz with 1000 replications indicated a weak quadratic dependency between the time consumption of the simulation and the number of events x with $R^2 = 0.9992$ (cf. Equation 7).

$$seconds(x) = 2 \cdot 10^{-14} x^2 + 2 \cdot 10^{-7} x + 0.2298 \qquad (7)$$

In this setup, the simulation has provided stable results for more than $10^9$ events generated (e.g. by $6.67 \cdot 10^8$ ERP components in a solution candidate as described later in Table 3 simulated for one year). Parallelization of the independent replications could further reduce the time consumption while providing stable accuracy [70]. Therefore, the evaluation method is scalable to large-sized problems.

Figure 5: Operator task and error mechanism.

3.4. Solution Algorithms

In order to find (sub)optimal solutions for an ITRAP, an optimization procedure has to be defined that performs a guided search through the solution space of different component choices and numbers of operators. Since the ITRAP is a very complex RAP and mathematical programming methods are only suitable for small-sized problems, heuristics and meta-heuristics could be used for solving the ITRAP [27]. Although more parameters have to be set for meta-heuristics, in general they produce better solutions than heuristics [33].

Due to the fact that these meta-heuristics normally require a single scalar goal function, the single objectives are often aggregated. This, however, requires a priori knowledge about the desired solutions. If that knowledge is not available, Pareto-based optimization techniques can be applied, which produce several solutions that are non-dominating so that decision makers can choose the best solution for their needs [88]. For the ITRAP, a solution with $(A_1, C_1)$ is dominating another solution $(A_2, C_2)$ if and only if $(A_1 > A_2 \wedge C_1 \leq C_2) \vee (C_1 < C_2 \wedge A_1 \geq A_2)$. A Pareto(-optimal) front is a set of mutually non-dominating solutions [89].

In this paper, the non-dominated sorting genetic algorithm II (NSGA-II), proposed by Deb et al. in [90], and tabu search, adapted by Kulturel-Konak in [89] for a RAP, are used. On the one hand, the NSGA-II has been chosen since it is one of the latest multi-objective GA methods and has been efficiently and successfully applied to RAPs (e.g. in [38, 57, 91]). Tabu search, on the other hand, is also reported to provide a stable and efficient solution method for multi-objective RAPs while resulting in especially wide Pareto fronts in comparison to other meta-heuristics [89].

Genetic Algorithm. Based on the plus-selection genetic algorithm (cf. e.g. [92]) presented in Algorithm 1, the genetic operators proposed in [40] have been adapted to the ITRAP by using the following functions.

Encoding. An individual is encoded by n + 1 genes for the n subsystems and the number of operators. A subsystem gene is a sequence of component choices (x, r) with $x, r \in \mathbb{N}$ as indices for components and redundancy types.
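As an illustration of the encoding and of the dominance relation defined in this section, a minimal Python sketch is given below. The class name `Individual`, the helper `pareto_front` and the example values are hypothetical and not part of the prototype; only the dominance condition itself is taken from the definition above.

```python
from typing import List, Tuple

ComponentChoice = Tuple[int, int]   # (x, r): component index, redundancy-type index
Fitness = Tuple[float, float]       # (availability A, cost C)


class Individual:
    """Hypothetical container: n subsystem genes plus one operator gene."""

    def __init__(self, subsystems: List[List[ComponentChoice]], operators: int):
        self.subsystems = subsystems   # one sequence of component choices per subsystem
        self.operators = operators     # gene n + 1: number of operators


def dominates(f1: Fitness, f2: Fitness) -> bool:
    """(A1, C1) dominates (A2, C2): availability is maximized, cost is minimized."""
    a1, c1 = f1
    a2, c2 = f2
    return (a1 > a2 and c1 <= c2) or (c1 < c2 and a1 >= a2)


def pareto_front(fits: List[Fitness]) -> List[Fitness]:
    """Keep only the fitness tuples that no other tuple dominates."""
    return [f for f in fits if not any(dominates(g, f) for g in fits)]


# Illustrative individual: two subsystems (two and one component choices), one operator.
example = Individual(subsystems=[[(0, 0), (1, 1)], [(2, 0)]], operators=1)
```

Note that two solutions with identical fitness tuples do not dominate each other under this definition, so both would remain in the front.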
Algorithm 1 Genetic algorithm: (μ + λ) plus-selection
procedure GA(μ, λ)
    gen ← 0
    pop ← initialize(μ)
    evaluate(pop)
    while not terminationCriteria(pop, gen) do
        pop ← pop ∪ recombine(pop, λ)
        pop ← mutate(pop)
        evaluate(pop)
        pop ← select(pop, μ)
        gen ← gen + 1
    end while
    return pop
end procedure
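The loop of Algorithm 1 with a rank and crowding-distance based selection could be sketched in Python as follows. This is a simplified sketch under stated assumptions, not the prototype: an individual is reduced to a tuple of component counts per subsystem, `evaluate` is a toy stand-in for the Monte Carlo simulation (independent parallel components), parents are drawn uniformly instead of inversely cost-proportionally, and all function names are chosen for illustration only.

```python
import math
import random


def dominates(a, b):
    # a, b are (availability, cost) tuples; availability is maximized, cost minimized
    return (a[0] > b[0] and a[1] <= b[1]) or (a[1] < b[1] and a[0] >= b[0])


def evaluate(ind):
    # Toy stand-in for the simulation: parallel redundancy with independent failures
    return (math.prod(1 - 0.1 ** n for n in ind), 100.0 * sum(ind))


def rank_fronts(fits):
    """Rank 0 for non-dominated individuals, then peel off successive fronts."""
    remaining, ranks, rank = set(range(len(fits))), {}, 0
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(fits[j], fits[i]) for j in remaining)}
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks


def crowding(fits, idxs):
    """Crowding distance within one front; boundary solutions get infinity."""
    dist = {i: 0.0 for i in idxs}
    for k in (0, 1):
        order = sorted(idxs, key=lambda i: fits[i][k])
        lo, hi = fits[order[0]][k], fits[order[-1]][k]
        dist[order[0]] = dist[order[-1]] = math.inf
        if hi == lo:
            continue
        for prev, cur, nxt in zip(order, order[1:], order[2:]):
            dist[cur] += (fits[nxt][k] - fits[prev][k]) / (hi - lo)
    return dist


def select(pop, fits, mu):
    # Sort ascending by rank and, for equal ranks, descending by crowding distance
    ranks = rank_fronts(fits)
    dist = {}
    for r in set(ranks.values()):
        dist.update(crowding(fits, [i for i in ranks if ranks[i] == r]))
    order = sorted(range(len(pop)), key=lambda i: (ranks[i], -dist[i]))
    return [pop[i] for i in order[:mu]]


def recombine(pop, lam, rng):
    # Uniform crossover; parents are drawn uniformly here (a simplification)
    children = []
    while len(children) < lam:
        p1, p2 = rng.sample(pop, 2)
        mask = [rng.random() < 0.5 for _ in p1]
        children.append(tuple(a if m else b for a, b, m in zip(p1, p2, mask)))
        children.append(tuple(b if m else a for a, b, m in zip(p1, p2, mask)))
    return children[:lam]


def mutate(pop, p_mut, rng):
    # Add or remove one component in a random gene; never drop below one component
    out = []
    for ind in pop:
        if rng.random() < p_mut:
            g = rng.randrange(len(ind))
            ind = ind[:g] + (max(1, ind[g] + rng.choice((-1, 1))),) + ind[g + 1:]
        out.append(ind)
    return out


def ga(mu, lam, max_gen, p_mut=0.1, genes=3, seed=0):
    rng = random.Random(seed)
    pop = [tuple(rng.randint(1, 2) for _ in range(genes)) for _ in range(mu)]
    for _ in range(max_gen):
        pop = pop + recombine(pop, lam, rng)
        pop = mutate(pop, p_mut, rng)
        pop = select(pop, [evaluate(ind) for ind in pop], mu)
    return pop
```

A call such as `ga(mu=100, lam=50, max_gen=25)` would mirror the structure of the procedure, although the toy fitness function says nothing about the results of the actual simulation.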
Initialization. In order to initialize the population, μ individuals are generated. For this purpose, the subsystem sizes are determined according to the ceiling of an exponentially distributed random variable with mean value subSize. The number of operators is determined correspondingly with mean value opNum. In a subsystem $s_i$, random component choices from $\Psi_i$ are added until the respective subsystem size is reached. The first component choice in a subsystem is always set to an active redundancy.

Fitness Evaluation. The solution candidate $X_i$ corresponding to an individual i is simulated in replications independent runs (replications being a tuning parameter). The mean values for availability and costs form the fitness tuple $(A_i, C_i)$.

Termination Criteria. The procedure is terminated after maxGen generations.

Recombination. Until λ child individuals are generated, two parent individuals are selected randomly according to an inverse cost-proportional distribution to produce two child individuals. For this production, uniform crossover is chosen since it is superior to other techniques for combinatorial problems [40]. In uniform crossover, genes (subsystem configurations as well as the number of operators) are exchanged randomly between the two parent individuals to create two child individuals.

Mutation. An individual is mutated with probability $p_{mut}$ by selecting a gene randomly. In this gene, a random component choice or an operator is added or removed. Removing is disabled if there is only one component choice/operator in the gene.

Selection. The NSGA-II algorithm, introduced in [90], is used as the selection procedure. First, a rank is assigned to each individual. For this purpose, non-dominating individuals in the population are identified and removed from the population with rank 0. For the remaining population, this is repeated with the successive rank until the population is empty. Additionally, a crowding distance is assigned to each individual as a measure for an individual's uniqueness in terms of its fitness values for a certain front rank. The population is sorted ascending by rank and for equal ranks descending by crowding distance. From that sorting, the first μ individuals are selected for the next generation.

Tabu Search. Kulturel-Konak et al. developed in [35] a tabu search algorithm for the solution of multi-objective RAPs (cf. Algorithm 2). In contrast to a genetic or other evolutionary algorithms, the tabu search considers only one solution candidate as a current solution, but stores non-dominating solutions over all iterations. After a candidate is initialized randomly, its neighbor solutions can be generated by applying problem-dependent moves on the current solution. Based on the separate treatment of the objectives, the best solution in the neighborhood is identified. If this solution is subject to a so-called tabu, it will only be selected if it dominates every other solution in the list of non-dominating solutions found so far. Otherwise, the best solution not subject to a tabu is chosen as the next solution candidate. All solutions dominated by the new solution are deleted from the non-dominating list.

A tabu can be seen as the difference between the old and the new solution candidate in each iteration, and should assure that a previously conducted move is not repeated in the following iterations. Tabus are collected in a tabu list in which newer tabus replace older tabus. If the non-dominating list is not updated for a quarter of the defined maximal iterations, a new solution candidate will be randomly selected from the non-dominating list and the procedure starts again. The whole search process ends if even the changes in the current solution candidate will not lead to a change in the non-dominating list for the maximum iterations.

Algorithm 2 Tabu search for multi-objective optimization
procedure MTS(maxIterations, size, tabuSize)
    tabu ← ∅
    count, lastRestart, iteration ← 0
    candidate ← initialize
    nonDom ← {candidate}
    while count < maxIterations do
        objective ← selectObjective
        newCandidate ← searchNeighborhood(candidate, tabu, nonDom, objective, size)
        if isNotDominated(newCandidate, nonDom) then
            nonDom ← nonDom ∪ {newCandidate}
            count ← 0
        else
            count ← count + 1
        end if
        nonDom ← deleteDominated(nonDom, newCandidate)
        tabu ← updateTabuList(candidate, newCandidate, tabuSize)
        candidate ← newCandidate
        if count − lastRestart > maxIterations / 4 then
            lastRestart ← count
            candidate ← randomElement(nonDom)
            tabu ← ∅
        end if
    end while
    return nonDom
end procedure

function searchNeighborhood(candidate, tabu, nonDom, objective, size)
    newCandidate ← ∅
    while newCandidate = ∅ do
        neighborhood ← move(candidate, size)
        newCandidate ← selectBest(neighborhood, objective)
        if isTabu(newCandidate) ∧ ¬dominatesAll(newCandidate, nonDom) then
            newCandidate ← selectBestNonTabu(neighborhood, objective)
        end if
    end while
    return newCandidate
end function

For the ITRAP, the operations of the tabu search procedure are instantiated as follows:

Candidate Encoding and Initialization. A candidate is encoded and initialized equally to the genetic algorithm presented above.

Objective Selection. For the two objectives of the ITRAP, the cost or availability objective is selected randomly in each iteration with the same probability. As shown in [89], this method is suitable for RAPs.

Moves. Analogous to mutation in the genetic algorithm presented above, a move is performed by adding/removing component choices/an operator in a random subsystem.

Tabu. When a new solution candidate replaces an old one, a tabu is created that represents the changed subsystem/number of operators of the old solution candidate. A new candidate is subject to a tabu if it contains a subsystem/number of operators that exists in the tabu list. If the tabu list would exceed the maximum list size tabuSize, the oldest tabu is deleted from the list.
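The interplay of neighborhood search, tabu list and non-dominating list described for Algorithm 2 could be sketched in Python as shown below. The sketch makes several assumptions that are not taken from the prototype: a candidate is reduced to a tuple of component counts per subsystem, `evaluate` is a toy stand-in for the simulation, the search space is capped by the hypothetical constant `MAX_N`, and a tabu move is kept if the neighborhood contains no non-tabu alternative instead of generating a new neighborhood.

```python
import math
import random
from collections import deque

MAX_N = 4  # toy cap on components per subsystem, keeps the search space finite


def evaluate(c):
    # Toy stand-in for the simulation: (availability, cost) of independent parallel components
    return (math.prod(1 - 0.1 ** n for n in c), 100.0 * sum(c))


def dominates(a, b):
    return (a[0] > b[0] and a[1] <= b[1]) or (a[1] < b[1] and a[0] >= b[0])


def move(c, size, rng):
    """Neighborhood of `size` candidates, each adding/removing one component in a random subsystem."""
    hood = []
    for _ in range(size):
        g = rng.randrange(len(c))
        step = 1 if c[g] == 1 else -1 if c[g] == MAX_N else rng.choice((-1, 1))
        hood.append(c[:g] + (c[g] + step,) + c[g + 1:])
    return hood


def is_tabu(c, tabu):
    # A candidate is tabu if it contains a (subsystem, configuration) pair stored in the list
    return any((g, n) in tabu for g, n in enumerate(c))


def search_neighborhood(cand, tabu, non_dom, objective, size, rng):
    # objective 0: maximize availability, objective 1: minimize cost
    sign = -1.0 if objective == 0 else 1.0
    ranked = sorted(move(cand, size, rng), key=lambda c: sign * evaluate(c)[objective])
    best = ranked[0]
    if is_tabu(best, tabu) and not all(dominates(evaluate(best), f) for f in non_dom.values()):
        non_tabu = [c for c in ranked if not is_tabu(c, tabu)]
        if non_tabu:  # simplification: keep the tabu move if nothing else is left
            best = non_tabu[0]
    return best


def tabu_search(max_iterations, size, tabu_size, genes=3, seed=0):
    rng = random.Random(seed)
    cand = tuple(1 for _ in range(genes))
    tabu = deque(maxlen=tabu_size)          # the oldest tabu is dropped automatically
    non_dom = {cand: evaluate(cand)}
    count = last_restart = 0
    while count < max_iterations:
        objective = rng.randrange(2)        # both objectives with the same probability
        new = search_neighborhood(cand, tabu, non_dom, objective, size, rng)
        f_new = evaluate(new)
        if not any(dominates(f, f_new) or f == f_new for f in non_dom.values()):
            non_dom[new] = f_new
            count = 0
        else:
            count += 1
        non_dom = {c: f for c, f in non_dom.items() if not dominates(f_new, f)}
        for g, (old, cur) in enumerate(zip(cand, new)):
            if old != cur:
                tabu.append((g, old))       # the changed gene of the old candidate becomes tabu
        cand = new
        if count - last_restart > max_iterations / 4:
            last_restart = count
            cand = rng.choice(list(non_dom))
            tabu.clear()
    return non_dom
```

Using a `deque` with `maxlen` is one convenient way to obtain the first-in-first-out behavior of the tabu list; the list size and the neighborhood size correspond to the parameters tabuSize and size of the procedure.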
On the basis of the defined ITRAP, the GSPN model and the presented solution algorithms, a prototype of these concepts has been implemented in the Eclipse-based Java simulation framework AnyLogic 6.8.1.

4. Evaluation

In this section, the concept of the ITRAP is evaluated by applying its prototypical implementation to the optimization of the IT service design in a real-world use-case. This design is required by an IT service provider that hosts SAP ERP systems (application service provider – ASP) for hundreds of customers all over the world and is described in the following.

4.1. Use-Case: An International Application Service Provider

For the ERP service of the investigated ASP, suitable availability Service Level Objectives should be identified on the basis of costs and availability of optimal designs. In Figure 6, the required subsystems for providing the ERP service are illustrated with dashed rectangles. The components that can be used in each subsystem are displayed as rectangles. Hardware and software subsystems are presented on the left while infrastructure subsystems are located on the right side.

Figure 6: Subsystems and components of the application example. Subsystems: ERP blade, ERP system, database blade, database system, storage, storage management, proxy, load balancer, internal network, external network, power. Components: HP ProLiant BL620c G7, HP ProLiant BL980 G7, HP ProLiant BL460c G7, HP ProLiant DL360 G6, SAP ERP 6.0, IBM DB2, HP Storage Works EVA 6400, HP Storage Essentials, SAP Router, SAP Adaptive Computing Controller, Local Area Network, Wide Area Network, Power Supply, Uninterruptible Power Supply.

In order to provide the customers access to the service, user requests have to be routed to the ERP subsystem by a proxy subsystem, formed by so-called SAP Routers. Since the ASP utilizes multiple ERP systems to serve various customer demands, a load balancing subsystem (consisting of SAP Adaptive Computing Controllers - ACC) forwards requests to the corresponding ERP instance. The ACC is also used to observe and manage ERP and database instances. The SAP ERP 6.0 application servers forming the ERP subsystem are hosted on a blade subsystem (ERP blades) which is connected to a database subsystem (IBM DB2 instances) running on a database blade server subsystem. The data is physically stored in a storage area network (SAN) subsystem which is managed by a storage management subsystem (HP Storage Essentials). A local area network (LAN) connects these systems with the wide area network (WAN) while the power supply supports the landscape with energy.

For the ERP blade as well as for the database blade subsystem, two different components can be selected as shown in Figure 6. For the power supply, an uninterruptible power supply can be used as a (standby) alternative to the supply of an electricity provider.

The problem of the application service provider is to find suitable combinations of component choices and a number of operators in order to optimize IT service availability and costs. From the identified solutions, a service designer can define an availability Service Level Objective and decide which design is used for the service.

According to the given ITRAP definition, the components were constructed and parametrized. Due to space limitations, the complete parametrization of all components cannot be presented in this context. Therefore, only the parametrization of the SAP ERP 6.0 component is illustrated as an example in Table 3.

Parameter                   Value
Name                        SAP ERP 6.0
States (perf. level)        Failed (0.0), reduced performance (0.5), full performance (1.0)
Power costs                 Full & reduced performance & failed: 0 € per h
Initial costs               2,500 €
Transitions                 "Failure"
    Start & end state       Full performance → failed
    TTF                     ~ exp(8,760 h)
    TTR                     ~ exp(1.73 h)
    Operator interaction    True
    Recovery costs          0 €
                            "Performance degradation"
    Start & end state       Full performance → reduced performance
    TTF                     ~ exp(4,380 h)
    TTR                     ~ exp(0.865 h)
    Operator interaction    True
    Recovery costs          0 €
Standby redundancy          Hot Standby
    Activation time         ~ triang(0.083 h; 0.292 h; 0.5 h)
Dependencies                Common cause failure: failure → (p = 0.3) failure of SAP ERP 6.0

Table 3: Example of the parametrization of a component type.

Since the investigated ASP was not able to provide all the required long-term data for the ITRAP's parametrization, other sources were used as well. The distributions
and parameters for the failure and recovery times were taken from literature if available (e.g. [19, 93]) and from the analysis of monitoring data of the Los Alamos National Lab (footnote 2). For cost aspects, manufacturer information as well as power consumption data such as from the HP Power Advisor (footnote 3) and the SPECpower benchmark (footnote 4) were used.

2 http://institute.lanl.gov/data/fdata/
3 http://www8.hp.com/de/de/products/servers/solutions.html?compURI=1439951#.VTDZnFxdJpw
4 http://www.spec.org/power_ssj2008/results/power_ssj2008.html

For each component, two to four states with different performance levels (from zero to one) were modeled. Transitions (failure-recovery cycles) were defined that induce state changes, as illustrated in Table 3. Additionally, inter-component dependencies were defined on the basis of expert interviews. For instance, an additional WAN access point can reduce the probability that cable failures lead to unavailability. Nevertheless, if the internet service provider has an outage, all access points will suffer a common cause failure. A common cause failure is also modeled for the ERP system, as shown in Table 3. This means that if one ERP system fails, other ERP systems can also be affected with a 30 % probability.

The demand level of the service is set to 1, the operator wage to 40 € per hour. The probability of an operator error is set to 5 % for all tasks.

In addition to that, a combinatorial fitness computation from [40] was implemented in contrast to the simulation-based fitness computation. In the combinatorial formula, modeled aspects such as passive redundancy, operator interaction, multiple states and dependencies are ignored. Thus, these results represent a classical RAP and can be compared to the results of the ITRAP.

4.2. Results

Using the application scenario and its parametrization, the solution algorithms in combination with the Monte Carlo simulation have been applied to obtain non-dominating solutions. For this purpose, the genetic algorithm (GA) and the tabu search (TS) have each been carried out ten times (as e.g. in [40, 55]) due to the stochastic nature of the algorithms. The results of each algorithm have
been aggregated, and the fronts of non-dominating solutions have been extracted. For fitness evaluation, a time frame of one year was considered. The parameters of GA and TS are presented in Table 4, and have been chosen so that the total time consumption of both procedures has been in the same magnitude for the ITRAP. By obtaining fronts for the ITRAP as well as for the RAP, four Pareto fronts have been generated.

Algorithm    Parameter        Value
Both         Runs             10
             replications     25
             t                8760 h
             subSize          1
             opNum            1/3
GA           μ                100
             λ                50
             maxGen           25
             p_mut            0.1
TS           tabuSize         15
             maxIterations    100
             size             15

Table 4: Parameter setting for both solution algorithms in the case-study example.

In [38], the authors present some metrics to compare different Pareto fronts that have been used for the comparison of the obtained four fronts:

– The number of Pareto solutions NPS,
– Spacing $S = \sqrt{\frac{1}{NPS} \sum_{i=1}^{NPS} (d_i - \bar{d})^2}$ with $d_i$ the Manhattan distance to the nearest neighbor solution (in terms of the fitness tuple) and $\bar{d}$ the mean value of $d_i$ for all solutions,
– Maximum spread $MS = \sqrt{(\max A_i - \min A_i)^2 + (\max C_i - \min C_i)^2}$ as the Euclidean distance between the maximum availability and costs and the respective minimums as well as
– Non-uniformity $NPF = \sqrt{\frac{\sum_i (d_i / \bar{d} - 1)^2}{NPS - 1}}$ to measure the non-uniformity of the Pareto distribution curve.

The results for these values are presented in Table 5 along with the maximum availability and the minimum costs that were found by the solution algorithms. The best value compared to the other algorithm is printed in bold on a gray background (greater value for NPS, MS and max A; smaller value for S, NPF and min C). The members of the four Pareto fronts are illustrated in Figure 7.

4.3. Discussion

One of the first things to mention when analyzing the Pareto fronts displayed in Figure 7 is the fact that the curves of the classical RAP algorithms are significantly different from those of the ITRAP. In particular, the classical RAP algorithms achieve very high levels of availability, which are not achieved in the ITRAP. Due to the lack of dependencies and operator interaction in the RAP, availability can be increased to nearly one hundred percent by using a sufficient number of parallel components. However, this is not a behavior that can be observed in reality: first, inter-component dependencies such as common cause failures lead to the fact that downtime cannot be minimized to an arbitrary extent by simply using more components. Second, more components mean a greater need for operator interaction. If the number of tasks exceeds the number of operators for a moment in time, the time to recover increases. This may only be avoided by using more operators, which, however, will produce significantly increased cost. Therefore, the upper bound for availability, which is 1 in the RAP, will be lower in reality, which is reflected in the ITRAP solution.

On the other hand, it can be stated that in the low-cost area, the ITRAP is able to provide slightly better results than the RAP. This can be explained by the fact that the use of standby redundancy enables power cost savings without affecting availability considerably. Therefore, the design suggestions of a RAP are significantly different from those of the ITRAP. Especially in the high availability area, the ITRAP results make it clear that the availability of an IT service cannot be increased nearly to 100 % only by component-level redundancy, and even slight improvements in this area will produce much higher costs. This behavior has also been observed in IT service studies such as [80], which indicates that the ITRAP results are more realistic than those of classical RAPs.

Figure 7: Pareto fronts of both solution algorithms for ITRAP and RAP (x-axis: costs, €475,000 to €1,075,000; y-axis: availability, 0.975 to 1; series: ITRAP TS, ITRAP GA, RAP TS, RAP GA).

As shown in Table 5, the RAP GA (155 solutions) and TS (43) find many more non-dominating solutions than the ITRAP GA (21) and TS (25). This may also be explained by dependencies, which lead to the fact that the introduction of an additional component will not always increase availability and will even sometimes decrease it due to limited operator capacities for recovery tasks. These limited capacities can only be raised by hiring more operators, which is very costly. That can also be seen in the maximum spread metric, which is higher in both ITRAP solution algorithms. Since the ITRAP produces fewer solutions in a wider range, spacing is also higher than for the RAP algorithms.

Regarding the non-uniformity of the Pareto front distribution, there is no significant difference between the RAP and the ITRAP solutions. However, a massive difference can be observed for the time consumption, which is 5,000 to 50,000 times higher in the ITRAP solution algorithms. This is solely caused by the higher time consumption of the Monte Carlo simulation in comparison to a combinatorial calculation. However, the independent simulation replications were not parallelized in the performed experiments. By using massive parallelization, the time difference could be reduced to a factor of 500.

Looking at the Pareto fronts with respect to the corresponding solution algorithm as well as at the quality metrics of the Pareto fronts for both scenarios, it
Metric NPS S MS NPF max A min C Time in s
RAP GA
RAP TS
ITRAP GA
ITRAP TS
155 6,389 243,602 4.1 0.999986 502,328 0.0034
43 7,877 416,496 1.44 > 1 − 1 · 10−14 507,328 0.0016
21 22,487 552,354 1.74 0.99827 498,004 18,642
25 45,038 595,800 2.79 0.998923 500,406 18,370
Table 5: Quality metrics for the four Pareto fronts.
920
925
930
935
becomes obvious that the genetic algorithm and the tabu
perform inversely and in a smaller range (TS 2.79 and GA
search come to different results. For both the RAP and950
1.74).
the ITRAP, the GA identifies superior solutions for low
Since the solution algorithm tuning parameters were
costs while the TS performs better in the high availability
chosen so that the time-consumption of TS and GA are
area.
comparable in the ITRAP scenario, it seems surprising
This can be explained by the solution algorithms’ me-
that the TS needs only half the time in the RAP sce-
chanics: due to the parameters subSize = 1 and opN um =955
nario. This can be explained by the fact that the algo-
1 3,
the initial solutions consist of only a few components per
rithms were configured to simulate equal solution candi-
subsystem and normally one operator. Therefore, these so-
dates only once in order to save time for the ITRAP. Due
lutions are located in the low cost area. In the GA, new
to the plus-selection of the GA, the population may con-
solutions are created mainly by recombining solutions with a small probability for mutation, which leads to a better exploitation of the initial zone, but not a good exploration for higher costs. The TS, on the other hand, starts from a single solution and performs the search solely by using the mutation operator. This produces a higher spread regarding the number of components and operators and, thus, better solutions in the high cost area compared to the GA.

This conclusion is also supported by the maximum spread metric, which is higher for the TS than for the GA, as well as by the higher minimal costs and maximum availability in both scenarios (cf. Table 5). The feature of TS to identify wide Pareto fronts was also observed in [89]. The spacing metric, i.e. the standard deviation of the nearest neighbor distance, is smaller in the GAs, which is also strongly influenced by the high cost solutions of the TS that have a great distance between them. Regarding the non-uniformity of the four curves, the results are inconclusive. For the RAP, the NPF metric ranges from 1.44 (TS) to 4.1 (GA) while in the ITRAP the algorithms

sist of several individuals that have been generated many generations ago so that fewer solution candidates have to be simulated. In the TS, the search process is concentrated on a single path through the solution space, and in each iteration several solution candidates are considered from which only one is selected for the next generation. Therefore, the GA considers a significantly greater number of solution candidates than the TS for the chosen tuning parameters, which is reflected in the runtime in the RAP scenario.

In the end, it can be stated that the results produced by the RAP scenario are significantly different from those of the ITRAP, which justifies the heavily increased time-consumption in such a real-world scenario with inter-component dependencies and limited operator capacities. For the case of the analyzed application service provider, the two solution algorithms perform differently. However, the decision as to which algorithm is preferable for this problem cannot be made from these results alone. Therefore, a detailed performance analysis of both algorithms is carried out.
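The spacing and maximum spread metrics used in the comparison above can be illustrated with a minimal sketch. The text defines spacing as the standard deviation of the nearest neighbor distance; the population form of the standard deviation, the bounding-box definition of the maximum spread, and the example fronts are assumptions made here for illustration and are not taken from the paper.

```python
import math


def nearest_neighbor_distances(front):
    """Distance from each solution to its closest other solution on the front."""
    return [
        min(math.dist(p, q) for j, q in enumerate(front) if j != i)
        for i, p in enumerate(front)
    ]


def spacing(front):
    """Standard deviation of the nearest neighbor distances (population form, assumed)."""
    distances = nearest_neighbor_distances(front)
    mean = sum(distances) / len(distances)
    return math.sqrt(sum((d - mean) ** 2 for d in distances) / len(distances))


def maximum_spread(front):
    """Diagonal of the bounding box spanned by the front (assumed definition)."""
    extents = [max(axis) - min(axis) for axis in zip(*front)]
    return math.sqrt(sum(e ** 2 for e in extents))


# Hypothetical fronts of (normalized cost, unavailability) pairs.
even_front = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
gappy_front = [(0.0, 1.0), (0.1, 0.9), (1.0, 0.0)]

print(spacing(even_front), spacing(gappy_front))
print(maximum_spread(even_front))
```

A front with one isolated high-cost solution, as in `gappy_front`, yields a larger spacing value than an evenly distributed front. This matches the observation that widely separated high cost solutions increase the spacing of the TS results.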
4.4. Performance Analysis of the Genetic Algorithm and Tabu Search

In order to compare both solution algorithms, the difference between the resulting and an optimal Pareto front should be analyzed. Obtaining this front can be done by exhaustive search. However, the solution space of the ITRAP is unlimited due to the lack of a maximum component number per subsystem as used in other works. Therefore, a scenario is defined on the basis of the analyzed use-case for which exhaustive search is feasible.

In this scenario, only the ERP, the ERP blade, the proxy and load balancer subsystem have been modeled (cf. Figure 6). The possible component number per subsystem has been restricted to three. Thus, the solution space consists of 16,200 distinct solutions. After all solutions have been evaluated by simulation, non-dominated solutions have been extracted for the optimal Pareto front.

After that, the genetic algorithm as well as the tabu search were executed for different parameter settings so that the processing time of the meta-heuristics could be compared to the processing time of selecting the optimal Pareto front. To assess the performance of the solution algorithms, two metrics have been used.

The F1 value, as shown in Equation 8, is computed from the ratio of Pareto-optimal solutions to the obtained solutions (precision) as well as the ratio of identified optimal solutions to all optimal solutions (recall). If the obtained front equals the optimal result, the value is one. If both fronts have no common solutions, the value becomes zero.

\[ F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{8} \]

In addition to this value, the mean ideal distance (MID) from [38] is used, which is defined as the mean Euclidean distance in the objective space between an obtained solution and the nearest Pareto-optimal solution.

In Figure 8, those metrics are displayed for different parameter settings, differentiated by the normalized time consumption (1 equals the time consumption of the exact solution). It can be seen that both algorithms achieve maximal accuracy faster than the exhaustive search. Additionally, both algorithms obtain results with 90 % accuracy in under 20 % of the time for obtaining the optimal solution. The fact that tabu search is a very effective solution method for RAPs can be confirmed in this scenario: while the F1 value is only slightly higher than for the GA, the mean ideal distance of the TS results is, even for very short execution times, very low in comparison to the GA. Therefore, it can be stated that the tabu search should be preferred for this problem.

5. Conclusion

Unavailability of IT services is inconvenient for both providers and consumers. Besides a loss of reputation and possible opportunity costs for the IT service provider, violations of the service-level agreements may lead to penalty costs. Although ensuring sufficient IT service availability is a crucial task that is mainly affected by IT service design, there is a lack of suitable decision support systems that help IT service designers build highly available systems at low costs.

As a prominent research area in availability optimization, the redundancy allocation problem (RAP) addresses the issue of achieving an optimal trade-off between availability and resource-consumption. However, the combinatorial computation of system availability from component availabilities in proposed RAPs cannot map all the facts that are affecting IT service availability such as dependent component failures and limited operator capacities. On the other hand, state-space-based approaches that are capable of introducing these facts in analytical models have been developed in recent years in order to estimate IT service availability.

Therefore, the goal of this paper has been to combine the recent research in RAP and IT service availability estimation in order to provide multi-objective decision sup-
Figure 8: Performance metrics of both solution algorithms for different parameter settings. Two panels plot the F1 value and the mean ideal distance against the normalized time consumption for the GA and the TS.
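The two metrics plotted in Figure 8 can be sketched as follows. This is a minimal illustration of Equation 8 and of the mean ideal distance as described in the text, not the authors' implementation. The function names, the representation of a front as a mapping from a design identifier to an objective tuple, and the example values are hypothetical, and the objectives are assumed to be normalized to comparable scales.

```python
import math


def f1_value(obtained_ids, optimal_ids):
    """F1 value from Equation 8 for two sets of design identifiers."""
    obtained_ids, optimal_ids = set(obtained_ids), set(optimal_ids)
    common = obtained_ids & optimal_ids
    if not common:
        return 0.0  # no shared solutions: precision and recall are both zero
    precision = len(common) / len(obtained_ids)
    recall = len(common) / len(optimal_ids)
    return 2 * precision * recall / (precision + recall)


def mean_ideal_distance(obtained_points, optimal_points):
    """Mean Euclidean distance from each obtained point to the nearest optimal point."""
    return sum(
        min(math.dist(p, q) for q in optimal_points) for p in obtained_points
    ) / len(obtained_points)


# Hypothetical fronts: design id -> (normalized cost, unavailability).
optimal = {"A": (0.2, 0.010), "B": (0.5, 0.005), "C": (0.9, 0.001)}
obtained = {"A": (0.2, 0.010), "D": (0.6, 0.007)}

f1 = f1_value(obtained, optimal)
mid = mean_ideal_distance(list(obtained.values()), list(optimal.values()))
print(f1, mid)
```

The F1 value only rewards exact matches with Pareto-optimal designs, whereas the mean ideal distance also credits near-optimal designs. This is why an algorithm can show a similar F1 value but a much lower mean ideal distance, as reported for the TS.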
port for IT service design in terms of availability and cost. On the basis of the related work that has been identified, requirements for a RAP for IT service design have been derived. The ITRAP that matches these requirements has been defined. Two meta-heuristic algorithms have been adapted that, in combination with a Petri net Monte Carlo simulation for availability and costs estimation, can provide Pareto-optimal solutions to an ITRAP. These are based on genetic algorithms as well as tabu search.

A prototypical implementation of the ITRAP concept has been applied to the optimization of an international application service provider’s landscape excerpt in order to demonstrate the feasibility of the approach. The comparison of the ITRAP simulation-based fitness estimation to a classical combinatorial evaluation has revealed that the new approach comes to different design suggestions. In addition to this, it could be shown that the genetic algorithm and the tabu search are both suitable to solve the problem, although the tabu search provides better results, especially for short execution times.

The results obtained in the use-case example indicate that the ITRAP is able to provide more realistic design suggestions than classical RAP approaches. While some work was conducted in the area of standby redundancy mechanisms, their effect on the energy consumption has never been introduced in the costs estimation, although that allows for lower costs while achieving equal availability levels. Another important point implicitly modeled in the ITRAP approach is that component-level redundancy cannot increase availability to the upper bound of 100 percent due to operator capacities and inter-component dependencies. These values could only be achieved by using system-level redundancy, e.g. with mirrored data centers, which increases costs massively.

The disadvantages of the ITRAP approach are the increased modeling effort and time-consumption. The former is justified if the ITRAP results differ from those of classical RAPs, which is the case for IT service design problems with dependencies. Additionally, the difference in time-consumption in comparison to combinatorial estimation models is not a crucial drawback since the ITRAP should be applied in the service design stage. In this lifecycle phase, the time effort to optimize a planned IT service is negligible in comparison to the cost of correcting wrong decisions in later lifecycle stages.

Therefore, the ITRAP concept that has been presented in this work can be seen as a first step to establish a multi-objective RAP for IT service design in order to support
decision makers in the service design stage. Nevertheless, there is also some potential for improvement:

First, an important research area in RAP literature was not considered here, which is parameter uncertainty. It is often not feasible or even possible to conduct long-term analysis on component availability characteristics, especially for software systems in development. Therefore, uncertainty should be incorporated in a RAP for IT services. This could be possible by using simple interval arithmetic or more complex fuzzy arithmetic approaches.

In addition to this, no resource constraints for e.g. weight or volume have been considered for the optimization problem. However, linear constraints depending on component choices as used in other RAPs can be easily introduced in the estimation model. With regard to the solution algorithms, this would require a penalty function for the fitness tuple of a solution candidate that scales with the degree of infeasibility (cf. e.g. [40, 89]).

While a multi-objective approach provides a greater range of optimal service designs, it may be overwhelming for a decision maker to be confronted with a lot of designs that are not superior to each other. If the simulation contained a model for service-level agreements, the cost of violations due to poor IT service availability could be integrated into the costs estimation. That would allow for the transformation of the multi-objective problem with availability and costs to a single-objective problem with the goal of minimizing costs and, thus, for providing a single (sub)optimal solution. However, if the cost of unavailability cannot be determined a priori or if Service Level Objectives are not yet defined, the multi-objective problem should be preferred.

In the ITRAP, availability was modeled depending on the ratio of actual to desired performance. This representation is rather simple, and does not consider changing demands that could be modeled by a load curve as well as component utilization. Power costs of IT systems, however, depend not only on the component state, but also on the current utilization [83]. Therefore, introducing these aspects would increase the quality of the costs estimation.

While operator interaction and limited operator capacities are considered in the ITRAP definition, assigning an equal error probability to each operator is questionable. Operators have very different levels of expertise, which greatly influences the probability of operator errors. In addition, it is assumed that the number of operators is constant throughout the whole operation, which may not be the case due to operator schedules. Since operator errors are a major cause for IT service downtime, more and more organizations introduce mechanisms to lower operator error probability, e.g. validation systems in which interactions are tested before they are transferred to a productive system. Such mechanisms should also be integrated in the availability and costs estimation.

References

[1] J. Ward, J. Peppard, Strategic Planning for Information Systems, 3rd ed., John Wiley & Sons, New York, NY, USA, 2002.
[2] A. Keller, H. Ludwig, The WSLA Framework: Specifying and Monitoring Service Level Agreements for Web Services, Journal of Network and Systems Management 11 (2003) 57–81.
[3] E. Zambon, S. Etalle, R. Wieringa, A2thOS: availability analysis and optimisation in SLAs, International Journal of Network Management 22 (2012) 104–130.
[4] D. Siewiorek, R. Swarz, The Theory and Practice of Reliable System Design, Digital Press, 1982.
[5] L. Hunnebeck, ITIL Service Design 2011 Edition, The Stationery Office, Norwich, UK, 2011.
[6] D. Henschen, Amazon Outage Scrooges Netflix, Heroku, 2012. URL: http://www.informationweek.com/cloud-computing/infrastructure/amazon-outage-scrooges-netflix-heroku/240145338, accessed: 2015-04-30.
[7] G. Keizer, Apple service outage stretches into hours, 2015. URL: http://www.computerworld.com/article/2895394/outage-hits-apple-services-including-icloud-and-app-store.html, accessed: 2015-04-30.
[8] L. Whitney, Microsoft pins Hotmail, Outlook outage on hot data center, 2013. URL: http://news.cnet.com/8301-10805_3-57574270-75/microsoft-pins-hotmail-outlook-outage-on-hot-data-center/, accessed: 2015-04-30.
[9] D. Tweney, 5-minute outage costs Google $545,000 in revenue, 2013. URL: http://venturebeat.com/2013/08/16/3-minute-outage-costs-google-545000-in-revenue/, accessed: 2015-04-30.
[10] D. Csaplar, The cost of downtime is rising, 2012. URL: http://blogs.aberdeen.com/it-infrastructure/the-cost-of-downtime-is-rising/, accessed: 2015-04-30.
[11] H. Garg, M. Rani, S. P. Sharma, Y. Vishwakarma, Bi-objective optimization of the reliability-redundancy allocation problem for series-parallel system, Journal of Manufacturing Systems 33 (2014) 335–347.
[12] D.-H. Chi, W. Kuo, Optimal design for software reliability and development cost, IEEE Journal on Selected Areas in Communications 8 (1990) 276–282.
[13] International Organization for Standardization, ISO/IEC 20000-1, 2011.
[14] Information Systems Audit and Control Association, COBIT 5, ISACA, 2012.
[15] V. C. Emeakaroha, M. A. S. Netto, R. N. Calheiros, I. Brandic, R. Buyya, C. A. F. De Rose, Towards autonomic detection of SLA violations in Cloud infrastructures, Future Generation Computer Systems 28 (2012) 1017–1029.
[16] D. Terlit, H. Krcmar, Generic Performance Prediction for ERP and SOA Applications, in: Proceedings of the 18th European Conference on Information Systems (ECIS), 2011.
[17] G. Callou, P. Maciel, D. Tutsch, J. Araújo, J. Ferreira, R. Souza, A Petri Net-Based Approach to the Quantification of Data Center Dependability, in: P. Pawlewski (Ed.), Petri Nets - Manufacturing and Computer Science, InTech, 2012, pp. 313–336.
[18] B. Littlewood, Comments on Reliability and performance analysis for fault-tolerant programs consisting of versions with different characteristics by Gregory Levitin, Reliability Engineering and System Safety 91 (2006) 119–120.
[19] N. Milanovic, B. Milic, Automatic generation of service availability models, IEEE Transactions on Service Computing 4 (2011) 56–69.
[20] S. Bosse, Predicting an IT Services Availability with Respect to Operator Errors, in: Proceedings of the 19th Americas Conference on Information Systems (AMCIS), Chicago, IL, USA, 2013.
[21] U. Franke, P. Johnson, J. König, An architecture framework for enterprise IT service availability analysis, Software and Systems Modeling 13 (2014) 1417–1445.
[22] K. Trivedi, G. Ciardo, B. Dasarathy, M. Grottke, R. Matias, A. Rindos, B. Vashaw, Achieving and assuring high availability, in: T. Nanya, F. Maruyama, A. Pataricza, M. Malek (Eds.), 5th International Service Availability Symposium (ISAS), volume 5017 of Lecture Notes in Computer Science, Springer Verlag Berlin Heidelberg, Tokyo, Japan, 2008, pp. 20–25.
[23] A. R. Hevner, S. T. March, J. Park, S. Ram, Design science in information systems research, MIS Quarterly 28 (2004) 75–105.
[24] K. Peffers, T. Tuunanen, M. A. Rothenberger, S. Chatterjee, A design science research methodology for information systems research, Journal of Management Information Systems 24 (2008) 45–78.
[25] W. Kuo, V. R. Prasad, An annotated overview of system-reliability optimization, IEEE Transactions on Reliability 49 (2000) 176–187.
[26] M. Gen, Y. Yun, Soft computing approach for reliability optimization: State-of-the-art survey, Reliability Engineering & System Safety 91 (2006) 1008–1026.
[27] R. Soltani, Reliability optimization of binary state non-repairable systems: A state of the art survey, International Journal of Industrial Engineering Computations 5 (2014) 339–364.
[28] J. D. Kettelle Jr., Least-cost allocations of reliability investment, Operations Research 10 (1962) 249–265.
[29] D. Fyffe, W. Hines, N. Lee, System reliability allocation and a computational algorithm, IEEE Transactions on Reliability 17 (1968) 64–69.
[30] S. J. Sadjadi, R. Soltani, Alternative design redundancy allocation using an efficient heuristic and a honey bee mating algorithm, Expert Systems with Applications 39 (2012) 990–999.
[31] J. E. Ramirez-Marquez, D. W. Coit, A heuristic for solving the redundancy allocation problem for multi-state series-parallel systems, Reliability Engineering and System Safety 83 (2004) 341–349.
[32] D. Zou, L. Gao, S. Li, J. Wu, An effective global harmony search algorithm for reliability problems, Expert Systems with Applications 38 (2011) 4642–4648.
[33] D. Coit, A. Konak, Multiple weighted objectives heuristic for the redundancy allocation problem, IEEE Transactions on Reliability 55 (2006) 551–558.
[34] I. D. Lins, E. L. Droguett, Multiobjective optimization of availability and cost in repairable systems design via genetic algorithms and discrete event simulation, Pesquisa Operacional 29 (2009) 43–66.
[35] S. Kulturel-Konak, D. W. Coit, F. Baheranwala, Pruned pareto-optimal sets for the system redundancy allocation problem based on multiple prioritized objectives, Journal of Heuristics 14 (2008) 335–357.
[36] W.-C. Yeh, T.-J. Hsieh, Solving reliability redundancy allocation problems using an artificial bee colony algorithm, Computers & Operations Research 38 (2011) 1465–1473.
[37] H. Garg, S. Sharma, Multi-objective reliability-redundancy allocation problem using particle swarm optimization, Computers & Industrial Engineering 64 (2013) 247–255.
[38] A. Chambari, S. Rahmati, A. Najafi, A. Karimi, A bi-objective model to optimize reliability and cost of system with a choice of redundancy strategies, Computers & Industrial Engineering 63 (2012) 109–119.
[39] M.-S. Chern, On the computational complexity of reliability redundancy allocation in a series system, Operations Research Letters 11 (1992) 309–315.
[40] D. Coit, A. Smith, Solving the redundancy allocation problem using a combined neural network/genetic algorithm approach, Computers & Operations Research 23 (1996) 515–526.
[41] Y.-C. Liang, A. Smith, An Ant Colony Optimization Algorithm for the Redundancy Allocation Problem (RAP), IEEE Transactions on Reliability 53 (2004) 417–423.
[42] T.-C. Chen, P.-S. You, Immune algorithms-based approach for redundant reliability problems with multiple component choices, Computers in Industry 56 (2005) 195–205.
[43] T. Taguchi, T. Yokota, Optimal design problem of system reliability with interval coefficient using improved genetic algorithms, Computers & Industrial Engineering 37 (1999) 145–149.
[44] S. J. Sadjadi, R. Soltani, Minimum–Maximum regret redundancy allocation with the choice of redundancy strategy and multiple choice of component type under uncertainty, Computers & Industrial Engineering (2015).
[45] M. A. Ardakan, A. Z. Hamadani, Reliability–redundancy allocation problem with cold-standby redundancy strategy, Simulation Modelling Practice and Theory 42 (2014) 107–118.
[46] V. Ravi, B. Murty, P. Reddy, Nonequilibrium simulated annealing-algorithm applied to reliability optimization of complex system, IEEE Transactions on Reliability 46 (1997) 233–239.
[47] L. Sahoo, A. Bhunia, D. Roy, A genetic algorithm based reliability redundancy optimization for interval valued reliabilities of components, Journal of Applied Quantitative Methods 5 (2010) 270–287.
[48] M. Ziaee, Optimal Redundancy Allocation in Hierarchical Series–Parallel Systems Using Mixed Integer Programming, Applied Mathematics 4 (2013) 79–83.
[49] M. Ouzineb, M. Nourelfath, M. Gendreau, Tabu search for the redundancy allocation problem of homogenous series–parallel multi-state systems, Reliability Engineering & System Safety 93 (2008) 1257–1272.
[50] R. Abdelkader, Z. Abdelkader, R. Mustapha, M. Yamani, Search Algorithms for Engineering Optimization, InTech, 2013, pp. 241–258.
[51] L. Painton, J. Campbell, Genetic algorithms in optimization of system reliability, IEEE Transactions on Reliability 44 (1995) 172–178.
[52] G. Jiansheng, W. Zutong, Z. Mingfa, W. Ying, Uncertain multi-objective redundancy allocation problem of repairable systems based on artificial bee colony algorithm, Chinese Journal of Aeronautics 27 (2014) 1477–1487.
[53] S. Wang, J. Watada, Modelling redundancy allocation for a fuzzy random parallel-series system, Journal of Computational and Applied Mathematics 232 (2009) 539–557.
[54] M. J. Feizollahi, S. Ahmed, M. Modarres, The robust redundancy allocation problem in series-parallel systems with budgeted uncertainty, IEEE Transactions on Reliability 63 (2014) 239–250.
[55] S. Kulturel-Konak, A. Smith, D. Coit, Efficiently solving the redundancy allocation problem using tabu search, IIE Transactions 35 (2003) 515–526.
[56] D. Coit, A. Smith, Reliability optimization of series-parallel systems using a genetic algorithm, IEEE Transactions on Reliability 45 (1996) 254–266.
[57] J. Safari, Multi-objective reliability optimization of series-parallel systems with a choice of redundancy strategies, Reliability Engineering & System Safety 108 (2012) 10–20.
[58] A. Dolatshahi-Zand, K. Khalili-Damghani, Design of SCADA water resource management control center by a bi-objective redundancy allocation problem and particle swarm optimization, Reliability Engineering & System Safety 133 (2015) 11–21.
[59] A. Chambari, A. Najafi, S. Rahmati, A. Karimi, An efficient simulated annealing algorithm for the redundancy allocation problem with a choice of redundancy strategies, Reliability Engineering & System Safety 119 (2013) 158–164.
[60] L. Wang, L.-P. Li, A coevolutionary differential evolution with harmony search for reliability–redundancy optimization, Expert Systems with Applications 39 (2012) 5271–5278.
[61] Y.-C. Hsieh, P.-S. You, An effective immune based two-phase approach for the optimal reliability–redundancy allocation problem, Applied Mathematics and Computation 218 (2011) 1297–1307.
[62] E. Valian, E. Valian, A cuckoo search algorithm by Lévy flights for solving reliability redundancy allocation problems, Engineering Optimization 45 (2013) 1273–1286.
[63] G. Kanagaraj, S. Ponnambalam, N. Jawahar, A hybrid cuckoo search and genetic algorithm for reliability–redundancy allocation problems, Computers & Industrial Engineering 66 (2013) 1115–1124.
[64] Y. Nakagawa, S. Miyazaki, Surrogate constraints algorithm for reliability optimization problems with two constraints, IEEE Transactions on Reliability R-30 (1981) 175–180.
[65] A. Immonen, E. Niemelä, Survey of reliability and availability prediction methods from the viewpoint of software architecture, Software and Systems Modeling 7 (2008) 49–65.
[66] M. Shooman, Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design, John Wiley & Sons New York, New York, NY, USA, 2002.
[67] K. Tokuno, S. Yamada, Markovian availability modeling for software-intensive systems, International Journal of Quality & Reliability Management 17 (2000) 200–212.
[68] C. Chellappan, G. Vijayalakshmi, Dependability modeling and analysis of hybrid redundancy systems, International Journal of Quality & Reliability Management 26 (2009) 76–96.
[69] R. Chinnaiyan, S. Somasundaram, Evaluating the reliability of component-based software systems, International Journal of Quality & Reliability Management 27 (2010) 78–88.
[70] A. Sachdeva, D. Kumar, P. Kumar, Reliability analysis of pulping system using Petri nets, International Journal of Quality & Reliability Management 25 (2008) 860–877.
[71] V. Zille, C. Bérenguer, A. Grall, A. Despujols, Simulation of maintained multicomponent systems for dependability assessment, in: P. Faulin, A. Juan, S. Martorell, J. Ramírez-Márquez (Eds.), Simulation Methods for Reliability and Availability of Complex Systems, Springer, Berlin, Heidelberg, 2010, pp. 253–272.
[72] C. Leangsuksun, L. Shen, T. Liu, H. Song, S. Scott, Availability Prediction and Modeling of High Availability OSCAR Cluster, in: 5th IEEE International Conference on Cluster Computing, IEEE Computer Society, Hong Kong, China, 2003, pp. 380–386.
[73] D. Jewell, Performance Modeling and Engineering, Springer US, 2008, pp. 29–55.
[74] U. Franke, Optimal IT Service Availability: Shorter Outages, or Fewer?, IEEE Transactions on Network and Service Management 9 (2012) 22–33.
[75] S. Bosse, C. Schulz, K. Turowski, Predicting Availability and Response Times of IT Services, in: Proceedings of the 22nd European Conference on Information Systems (ECIS), Tel Aviv, Israel, 2014. URL: http://aisel.aisnet.org/ecis2014/proceedings/track20/5/.
[76] D. Cannon, ITIL Service Strategy 2011 Edition, The Stationery Office, 2011.
[77] A. Bondavalli, S. Chiaradonna, F. Di Giandomenico, F. Grandoni, Threshold-based mechanisms to discriminate transient from intermittent faults, IEEE Transactions on Computers 49 (2000) 230–245.
[78] M. Grottke, K. Trivedi, A classification of software faults, Journal of Reliability Engineering Association of Japan 27 (2005) 425–438.
[79] A.-C. Orgerie, M. D. De Assuncao, L. Lefevre, A survey on techniques for improving the energy efficiency of large scale distributed systems, ACM Computing Surveys 46 (2014) 1–35.
[80] D. Oppenheimer, A. Ganapathi, D. Patterson, Why do internet services fail, and what can be done about it?, in: 4th Usenix Symposium on Internet Technologies and Systems (USITS), 2003.
[81] Z. Yin, X. Ma, J. Zheng, Y. Zhou, L. Bairavasundaram, S. Pasupathy, An empirical study on configuration errors in commercial and open source systems, in: 23rd ACM Symposium on Operating Systems Principles (SOSP), Cascais, Portugal, 2011.
[82] L. Barroso, J. Clidaras, U. Hölzle, The Datacenter as a Computer, Synthesis Lectures on Computer Architecture, 2 ed., Morgan & Claypool Publishers, 2013.
[83] M. Splieth, S. Bosse, C. Schulz, K. Turowski, Analyzing the effects of load distribution algorithms on energy consumption of servers in cloud data centers, in: Proceedings of the 12th International Conference on Wirtschaftsinformatik (WI), 2015.
[84] C. Patel, A. Shah, Cost Model for Planning, Development and Operation of a Data Center, Technical Report, Hewlett-Packard Laboratories Palo Alto, 2005.
[85] G. Ciardo, J. Muppala, K. Trivedi, SPNP: Stochastic Petri Net Package, in: Proceedings of the 3rd International Workshop PNPM, IEEE Computer Society, 1989, pp. 142–151.
[86] M. A. Marsan, G. Balbo, A. Bobbio, G. Chiola, G. Conte, A. Cumani, The Effect of Execution Policies on the Semantics and Analysis of Stochastic Petri Nets, IEEE Transactions on Software Engineering 15 (1989) 832–846.
[87] E. Pinheiro, W.-D. Weber, L. A. Barroso, Failure trends in a large disk drive population, in: Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST), 2007.
[88] C. M. Fonseca, P. J. Fleming, An overview of evolutionary algorithms in multiobjective optimization, Evolutionary Computation 3 (1995) 1–16.
[89] S. Kulturel-Konak, A. E. Smith, B. A. Norman, Multi-objective tabu search using a multinomial probability mass function, European Journal of Operational Research 169 (2006) 918–931.
[90] K. Deb, S. Agrawal, A. Pratap, T. Meyarivan, A Fast Elitist Non-dominated Sorting Genetic Algorithm for Multi-objective Optimization: NSGA-II, in: Proceedings of the 6th International Conference on Parallel Problem Solving from Nature, volume 1917 of Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2000.
[91] M. A. Ardakan, A. Z. Hamadani, M. Alinaghian, Optimizing bi-objective redundancy allocation problem with a mixed redundancy strategy, ISA Transactions 55 (2015) 116–128.
[92] T. Bäck, Evolutionary Algorithms in Theory and Practice, Oxford University Press, 1996.
[93] B. Schroeder, E. Pinheiro, W.-D. Weber, DRAM Errors in the Wild: A Large-Scale Field Study, Communications of the ACM 54 (2011) 100–107.