The Journal of Systems and Software 85 (2012) 2048–2062
Dynamic service placement and replication framework to enhance service availability using team formation algorithm

Boon-Yaik Ooi ∗, Huah-Yong Chan, Yu-N Cheah

School of Computer Sciences, Universiti Sains Malaysia, 11800 Penang, Malaysia
Article history: Received 25 July 2011; received in revised form 16 December 2011; accepted 6 February 2012; available online 13 March 2012.

Keywords: Autonomic computing; Resource selection; Resource management; Service placement and replication
Abstract

The motivation of this work is to reduce the complexity of managing and administering services in the ever-growing distributed environment via automated service placement and replication using a team formation algorithm. The team formation algorithm is designed to continuously search for resources with better performance and to pool resources together to achieve better availability. The main intention of this work is not to replace human administrators but to provide a better alternative for managing services in a dynamic distributed environment. Instructions from the administrators are still required, but at a different level: administrators are freed from making low-level decisions, such as deciding the actual placement of the services and designing the failover capabilities for each of the services. The evaluation results show that the framework is capable of managing resources according to the requirements given by the administrator, even in the event of multiple consecutive resource failures. With the proposed solution, 72.1% of services were still available after 83.3% of the available resources were shut down, whereas a conventional failover solution using three redundant units kept only 40.8% of services available.
1. Motivation

The need for a self-management framework has begun to emerge (Ganek and Corbi, 2003; Kephart and Chess, 2003; Kounev et al., 2007; Folino and Spezzano, 2007; Liu et al., 2005; Parashar et al., 2006; Vlassov et al., 2006; Agarwal et al., 2003; Hariri et al., 2006), and such a framework will be essential when administrators can no longer handle the scale and heterogeneity of the ever-growing distributed computing environment. One of the most notable movements towards self-management automation was initiated by IBM in 2001 and is known as the Autonomic Computing Initiative (ACI) (Horn, 2001). The research direction of ACI is to design distributed computing systems that can autonomously configure, heal, optimize and protect themselves according to the environment, with minimal intervention from administrators. This self-management vision is inspired by the human body's autonomic nervous system, which controls functions such as heartbeat, blood sugar and respiration without requiring conscious human action. Although the goal of autonomic computing has been set, it is yet to be realized completely. This is because the concept of autonomic computing extends beyond the traditional concept of automation: it requires each component in a distributed environment to orchestrate as one super-organism that exhibits the capabilities of self-healing, self-configuring, self-protecting and self-optimizing (Ganek, 2007).
∗ Corresponding author.
E-mail addresses: [email protected], [email protected] (B.-Y. Ooi), [email protected] (H.-Y. Chan), [email protected] (Y.-N. Cheah).
doi:10.1016/j.jss.2012.02.010
Besides that, a commonly accepted definition of an autonomic system's characteristics has yet to be standardized (Herrmann et al., 2005). For instance, it is difficult to distinguish between self-protecting, self-healing and self-optimizing mechanisms, as all of these mechanisms serve in an interwoven manner to improve the state of a system.

Long before ACI, plenty of attempts were made to automate administrative tasks and reduce the complexity of distributed resource management; the increasing complexity of distributed system management has long been noted. As evidence, in order to assist administrators in managing their distributed systems more effectively, organizations share ICT best practices (ITIL, 2010), and distributed system management tools such as network monitoring, performance monitoring, load balancing and fault tolerance tools are continuously being developed. Unfortunately, all of these attempts have one thing in common: they still require a great deal of human intervention, down to the finest detail.

The advancement of virtual machines (VMs) has enabled administrators to decouple the dependency between software and hardware at the expense of a slight degradation in hardware performance (Zhu et al., 2008), and as VM and hardware technologies improve, the performance difference between a VM and a physical machine will soon become insignificant (Barham et al., 2003).
The abilities of VMs to reduce setup time, migrate services live and consolidate multiple underutilized servers onto a smaller number of physical machines have made VM technology seem a viable way to reduce the complexity of resource management (MSCV, 2008; VMware, 2006). Unfortunately, these VM abilities have not yet reduced resource management complexity, because the administrator is still required to plan and map the VMs to physical hosts manually. Although automated VM migration tools exist, they still require the administrator to provide detailed migration rules in advance (DRS, 2008).

The emergence of cloud computing appears to be a viable way of reducing the total cost of ownership (TCO) of many distributed systems by sharing infrastructure hosted in third-party data centres (Armbrust et al., 2010; Armbrust et al., 2009). However, reducing the TCO does not reduce the administrative complexity, and it will continue to attract larger-scale distributed systems to be developed and deployed (Sailer et al., 2010). Cloud computing does not provide a solution for managing a large number of services (Goscinski and Brock, 2010), and it indiscriminately allows these services to be deployed across multiple networks. Services running in a cloud computing environment still require the administrator to plan the architecture of the distributed system carefully in order to leverage the scalability offered by cloud computing (Velte et al., 2010a,b). Running a service on cloud computing does not mean that the service is automatically converted into a high-availability mode (Joseph, 2009).

The introduction of distributed computing paradigms such as cloud computing and ubiquitous computing, along with the advancement of distributed computing technologies, namely server virtualization, service-oriented architecture (SOA) and distributed agent technologies, has greatly increased the flexibility and scalability of distributed systems. However, this flexibility and scalability are not achieved without problems of their own, notably the difficulty of managing and administering resources in a distributed environment (Kephart, 2005; Salehie and Tahvildari, 2005; Parashar and Hariri, 2005), in view of the complexity of maintaining the availability, cost, security and performance of a large number of services running on a huge number of heterogeneous machines across different networks.

The conventional resource management method, in which one or more human administrators manage dedicated and specialized infrastructures such as computing clusters to host a fixed set of services, is no longer suitable for the rapidly growing size of networks and the Internet as a whole (Herrmann et al., 2005). In general, administrators are required to constantly monitor the utilization of their services and physical resources, define high-level utilization policies and perform low-level implementation to ensure that the performance of all the services stays within their respective acceptable ranges. As the network size increases, the complexity of the distributed system affects the productivity of administrators (Ranganathan and Campbell, 2007), and it often results in high operation costs and non-optimal use of resources.
Thus, a self-management framework that acts as middleware between the services and the physical resources in a distributed computing environment is clearly required if human administrators are to cope with the dynamic and fast-growing distributed computing environment. In our previous work (Ooi et al., 2010), we proposed a decentralized dynamic service placement and replication (DSPR) framework that enhances service availability in distributed computing environments by continuously searching for better machines to host a service and by autonomously recruiting additional machines to replicate the service, in order to achieve the service requirements specified by the administrator. This work provides the details of the proposed framework together with experimental results.
1.1. Problem statements

As a result of the ever-growing distributed computing environment, many service placement algorithms (Andrzejak et al., 2002; Nogueira and Pinho, 2009; Famaey et al., 2009; Karve et al., 2006; Adam and Stadler, 2007; Adam et al., 2007; Ardagnaa et al., 2007; Famaey et al., 2008) have been introduced to manage services automatically. Besides differing in their centralized or decentralized architectures, each of them is built on a different resource evaluation scheme with different objectives, and the decisions made are based on different criteria. Most service placement algorithms are application-specific and aim to improve service performance or to reduce operation cost; most of them do not consider the availability of resources, which directly affects service availability.

Besides the differences in architecture and objectives, most of the solutions proposed to improve service availability (Abd-El-Barr, 2007; Siewiorek and Swarz, 1992; Sun, 2002) are based on having redundant servers to mask failures. The effectiveness of these solutions is often very dependent on the experience and knowledge of the system administrators, and although the additional servers increase service availability, the complexity of managing the distributed system increases as well. In addressing these problems, this research focuses on the following questions:

• How can administrators be freed from making low-level decisions, such as deciding the actual placement of services and designing the failover capabilities for each service?
• How can the availability of services be enhanced, even in the event of multiple consecutive resource failures?
• How can the effectiveness of the proposed self-management solution be evaluated?

1.2. Organization of this paper

This paper is organized as follows. In Section 2, we review the existing techniques that are commonly employed to improve the availability of services in distributed computing systems, followed by a review of existing service placement algorithms. In Section 3, we present the details of the DSPR framework and the team formation algorithm. In Section 4, we present the details of the simulations and their results. Finally, we conclude this work in Section 5 with its limitations and potential future work.

2. Literature review

2.1. Existing approaches to enhance service availability

In most high-availability distributed systems, redundancy is used to increase availability and mask failures (Abd-El-Barr, 2007; Siewiorek and Swarz, 1992; Shooman, 2002). In case of failure, a redundant server takes over the responsibility of the actual server; this switching process is known as failover. Although the failed server is not repaired, the redundant server makes the system appear available and operating as usual to the users. Once the failed server is repaired, a failback procedure is initiated to restore the original configuration before another failover occurs. This failback procedure usually requires human intervention.

Although redundancy is able to increase the availability of a system, it is not without problems of its own, notably additional cost and underutilized resources.
For instance, database mirroring and server replication techniques (HAS, 2010) require additional hardware that does not contribute to the performance of the system. There are also techniques that reduce the number of underutilized resources by using the additional resources to improve service performance; for example, server content caching (Barish and Obraczka, 2000) and load balancing techniques (Bryhni et al., 2000) were introduced for this purpose. However, the design and management of these existing methods for improving service availability in a distributed system are by themselves complicated for the system administrators (IVC, 2008), because the administrators need to decide on the architecture, roles and relationships of the servers while matching applications to servers. These tasks become even more complicated in a large distributed environment.

In order to reduce this complexity, technologies such as clustering (Sun, 2002; Nagaraja et al., 2003; Roehm et al., 2005) and reliable server pooling (RSerPool) (Dreibholz et al., 2007; Bozinovski et al., 2007) were devised. For example, in the clustering failover technique used in the Sun Grid Engine (SGE), the master node has all its child compute nodes arranged serially to form a series of redundant nodes. In the event of a master node failure, the first compute node in the serial arrangement takes over and continues its operation. This process can be repeated until all the nodes in the SGE fail (Sun, 2002).

At a glance, RSerPool appears similar to clustering. However, RSerPool differs in that it dynamically selects the redundant node; the redundant node is not predefined and arranged in a serial manner. Administrators still need to manage RSerPool by defining the redundant-server selection policies. There are two types of server selection policies: static and dynamic (Bozinovski et al., 2007). Static policies use predefined schemes such as round robin (RR), where servers are selected sequentially in a cyclic manner; there is also weighted RR, where weights are used to indicate each server's capacity. Dynamic policies, on the other hand, make decisions based on the current state of the system; for example, the least-used selection policy selects the server with the lowest load. Besides using the current state of the system, Dreibholz and Rathgeb (2007) proposed a distance-aware least-used policy that also uses the distance between servers to make server selection decisions. The purpose of having such server selection policies is to allow servers to be distributed over a large geographical area to ensure survivability in the event of disasters such as earthquakes, volcanic eruptions or tsunamis.
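To make the two policy families concrete, the sketch below expresses round robin, weighted round robin and least-used selection in Python. It is a minimal illustration of the policy ideas only, not RSerPool's actual implementation; the server fields and values are our own assumptions.

```python
import itertools

# Illustrative server pool: 'weight' models declared capacity (for weighted
# RR) and 'load' models the current state (for the dynamic least-used policy).
servers = [
    {"name": "s1", "weight": 1, "load": 0.7},
    {"name": "s2", "weight": 3, "load": 0.2},
    {"name": "s3", "weight": 2, "load": 0.5},
]

# Static policy: round robin selects servers sequentially in a cyclic manner.
_rr = itertools.cycle(servers)
def round_robin():
    return next(_rr)

# Static policy: weighted round robin repeats each server in proportion
# to its capacity weight.
_wrr = itertools.cycle([s for s in servers for _ in range(s["weight"])])
def weighted_round_robin():
    return next(_wrr)

# Dynamic policy: least used selects the server with the lowest current load.
def least_used():
    return min(servers, key=lambda s: s["load"])

print(round_robin()["name"], weighted_round_robin()["name"], least_used()["name"])
```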
Each offers better performance in some circumstances and they proposed that a combination of techniques is necessary to solve the self-organization and service placement problem. Service placement algorithms are required to discover and select appropriate resources for all the services based on the preferences defined by the administrator. Different service placement algorithm employs different resource discovery method and different resource evaluation function. Resource discovery methods can be classified into centralized or decentralized and either complete or
In order to distinguish preferable resources from non-preferable ones, these service placement algorithms require a resource evaluation function. Resources are usually distinguished using criteria such as performance, dependability, security and cost (Rosenberg et al., 2006). Therefore, besides differing in their centralized or decentralized architectures, these service placement algorithms are each built on different dynamic allocation schemes with different objectives, and the decisions made are based on different criteria. Table 2.1 summarizes their service placement objectives and resource evaluation techniques.
2.3. Discussion

Existing techniques are capable of improving service availability. However, most of them do not consider the difficulty and complexity of managing the distributed system as it grows larger. Although the dynamic server selection policies of RSerPool appear very similar to this work, RSerPool focuses on availability only and does not yet have the ability to make tradeoffs between availability, operation cost and performance; an administrator is still required to manually manage the balance between the availability, operation cost and performance of the services. Although all of the mentioned service placement algorithms have been successfully applied in many existing works, these techniques focus on a very specific set of objectives, such as minimizing the number of migrations, maximizing performance or reducing the overall power consumption. Unlike existing solutions, we are interested in combining service placement and service replication into a framework that manages not only the performance but also the cost and availability of a distributed system, even in the event of resource failures. The framework has to be able to decide whether to host a service on a resource with better performance or on a resource with better availability when the costs of the resources are equal.
3. Overview of the dynamic service placement and replication (DSPR) framework

The role of the administrator in the DSPR framework can be simplified to specifying the requirements for each service and managing the resources in the resource pool, while the DSPR framework manages the placement of the services according to the given requirements. In this work, the service requirements involve three criteria: performance, availability and cost. From an end user's perspective, the services managed by the DSPR framework appear to be running on a high-availability infrastructure. Fig. 3.1 illustrates the DSPR framework and the relationships between the administrator, resources and users.

In the DSPR framework, each service is managed by a team. The term team in this work refers to a group of managed resources that are working together; each team manages only one service. The term managed resource refers to a machine that is installed with a DSPR agent. Each machine has one and only one DSPR agent, and this agent is a stationary agent. Any managed resource can simultaneously participate in multiple teams. On the other hand, any resource without a DSPR agent is considered an unmanaged resource; unmanaged resources cannot be utilized until DSPR agents are assigned to them. Fig. 3.2 illustrates the components of a DSPR managed resource: a managed resource consists of the applications that are managed by the DSPR, their respective service hosting platforms, and a DSPR agent. There are 4 modules in a DSPR agent.
Table 2.1
List of existing service placement algorithms with respective service placement objectives and resource evaluation techniques.

No. | Related work | Service placement objectives
1 | Van et al. (2009) | Optimize global utilization, which consists of SLA fulfillment and operating cost.
2 | Bozinovski et al. (2007) | Enhance service availability by automatically selecting a backup candidate using a predefined scheme.
3 | Karve et al. (2006) | Maximize service performance with a minimal number of placement changes.
4 | Ardagnaa et al. (2007) | Maximize service revenue while balancing the cost of using the resources; the cost includes the energy, software and hardware required.
5 | Adam and Stadler (2007), Adam et al. (2007) | Improve service performance through automated service placement, mapping CPU demands to CPU supplies.
6 | Famaey et al. (2008) | Improve service performance by not only mapping CPU demands to CPU supplies but also taking network latencies into consideration.
7 | Verma et al. (2008) | Reduce power consumption via service consolidation using virtual machines.
8 | Nogueira and Pinho (2009) | Keep service QoS within an acceptable level by automatically trading off between performance and QoS.
9 | Oikonomou et al. (2008), Oikonomou and Stavrakakis (2010) | Determine the optimal location of services without using global information.
10 | Menasce et al. (2010), Menasce et al. (2011) | Improve the QoS of a system by automatically selecting an appropriate group of services.
11 | Hermann (2010) | An adaptive service placement algorithm that finds a stable and low-cost replica placement.
Fig. 3.1. Overview of the relationships between administrators, resources, and users in the DSPR framework.
• Team formation coordinator module: coordinates the inner workings of all the modules; it is the core of the team formation algorithm.
• Execution module: executes the team formation decisions made by the team formation coordinator module.
• Resource evaluation module: evaluates the suitability of the managed resources for the service, based on the policies and preferences defined by the administrators.
• Monitoring and discovery system (MDS): discovers and monitors available resources in the DSPR resource pool. A structural sketch of these modules follows this list.
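The listing below pictures, in Python, how the four modules could hang together; the class layout, method names and the placeholder scoring rule are our own illustrative assumptions, not the actual DSPR implementation.

```python
from itertools import combinations

class DSPRAgent:
    """One stationary agent per managed machine (structural sketch only)."""

    def __init__(self, profile):
        self.profile = profile   # e.g. {"availability": 0.99, "cost": 0.5}
        self.team = []           # members recruited for the managed service

    # Monitoring and discovery system (MDS): find available resources.
    def discover(self, pool):
        return [m for m in pool if m.get("up", True)]

    # Resource evaluation module: score a candidate team against the
    # administrator's policy (placeholder metric for illustration).
    def evaluate(self, team):
        return (sum(m["availability"] for m in team)
                - 0.01 * sum(m["cost"] for m in team))

    # Execution module: apply the recommended team.
    def execute(self, team):
        self.team = list(team)

    # Team formation coordinator: one pass of the closed control loop.
    def coordinate(self, pool, max_team_size=2):
        members = self.discover(pool)
        candidates = [c for n in range(1, max_team_size + 1)
                      for c in combinations(members, n)]
        self.execute(max(candidates, key=self.evaluate))
```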
As services are dynamically shifted and replicated to different servers, the users of the services must be informed of the new locations of the services. This can be achieved via manipulation of the domain name system (DNS) database.

3.1. The team formation algorithm

The team formation algorithm is the core component of the DSPR framework, responsible for making decisions on service placement and service replication. The entire team formation algorithm is a closed control loop in which the DSPR continuously searches for the best team from all the available machines in the resource pool until the best working group that satisfies the requirements specified by the administrator is attained. The team formation algorithm can be represented using 5 phases, illustrated in Fig. 3.3.
Fig. 3.2. Anatomy of a DSPR managed resource.
Fig. 3.3. The proposed team formation cycle.
When the administrator introduces a new service to any member of the DSPR framework, that member automatically appoints itself as the team leader. This marks the beginning of the team formation algorithm (Phase 1). The team leader analyzes the service requirements given by the administrator and attempts to fulfill them by itself. When the team leader realizes that it cannot fulfill the requirements, it proceeds to Phase 2, initiating the member discovery process. In this process, all potential members submit their profiles to the team leader. Profiles contain information such as the availability, performance and cost of each member, depending on the criteria needed by the service. In Phase 3, once the information of all potential members is gathered, the team leader begins to search for the best combination of members. This is the most important part of the entire team formation algorithm, and it is carried out by the resource evaluation module. The resource evaluation module returns scores for all the candidate teams, and the team with the highest score is recommended for execution. There are two possible outcomes at the end of Phase 3:

• A better team is recommended for execution, and the process proceeds to Phase 4.
• A better team could not be formed, and the process returns to Phase 2.

In Phase 4, the recommended team is executed. Phase 4 is considered complete when all members in the team are ready to function as a team. In Phase 5, the team leader begins to monitor the performance of each member periodically. If a member outperforms the existing team leader, that member is appointed as the new team leader and the process returns to Phase 1. If no member is better than the existing team leader, the process returns to Phase 2 to recruit new members. This cycle continues until a service termination signal is received from the administrator, and it can be stopped at any phase. Once the service is terminated, the team is dissolved and all the members return to the resource pool.

3.1.1. Stabilizing the team formation process

A team is considered stable when no better team leader could be appointed and no better team could be formed from the resource pool. Although all the phases are clearly distinguished, no two phases should be executed at the same time. Besides that, there is no turn-taking between Phase 3 and Phase 5: it is not necessary to execute Phase 3 before Phase 5 or vice versa. A stabilized team can become unstable again when either a better team leader could be appointed or a better team could be formed. The team could continuously change team members without changing the team leader, or continuously change the team leader without forming a new team. An unstable team will repeat the cycle until it is stabilized.

In order for an unstable team to be stabilized, three conditions must be fulfilled. The first condition is that the number of potential members must be finite. For the evaluation function f(X_n) to return the best score of n options, it must compare each and every option in the list. Let the list X_n = {x_1, x_2, x_3, ..., x_n}, where each x is a valid combination of a team and x ⊆ r. Let r = {m_1, m_2, m_3, ..., m_n}, where each m is a potential member in the resource pool r. The total number of team combinations can be computed as follows:

TC = Σ_{i=1}^{N} n! / (i!(n − i)!)

where TC is the total number of combinations of n potential members and N is the maximum desired team size. As n approaches infinity (n → ∞) and N = n, TC will also approach infinity (TC → ∞). An infinite TC means that the evaluation function would never finish evaluating all the candidate teams and would be unable to return the team with the best score. Therefore, a mechanism is required to limit the number of possible team combinations.

The second condition is that the evaluation function must be deterministic. A deterministic evaluation function always returns the same score whenever it is called with the same set of input values. If the function is non-deterministic, the team formation cannot be stabilized: if every execution of f(X_n) returns a different score even with the exact same set of inputs, the team formation algorithm would have to construct a new team every time, and the team would never stabilize.

The third condition is that the attributes of the members must not be overly volatile. The evaluation function scores teams according to the criteria set by the administrator; in this work, availability, performance and cost are used. If the values of these attributes change drastically every time the evaluation function is called, the evaluation function will return a different score each time, which in turn will cause the team formation algorithm to construct a new team frequently.

Unfortunately, in a dynamic distributed computing environment, it is difficult to fix the number of potential members in the resource pool, as resources can be added and removed at any time. Besides that, resources are usually shared by multiple services, so it is very difficult for the attributes of all the resources to remain constant. Therefore, in order to help the team formation algorithm achieve a stable team, a snapshot technique is required. A snapshot records the status of all the potential members at a particular time, and the team formation algorithm uses this information to construct a new team; all changes in the environment beyond the time of the snapshot are ignored. The delay between two snapshots affects the sensitivity of the DSPR towards changes in the environment: a higher snapshot frequency allows the DSPR to make more decisions and moves but, at the same time, introduces additional workload to the distributed system.
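The combinatorial bound above can be checked numerically; the short sketch below (a toy illustration, not part of the framework) computes TC and shows why an unbounded pool makes exhaustive evaluation infeasible.

```python
from math import comb

def total_team_combinations(n, max_team_size):
    """TC = sum over i = 1..N of n! / (i! (n - i)!), i.e. the number of
    candidate teams the evaluation function would have to score."""
    return sum(comb(n, i) for i in range(1, max_team_size + 1))

# With N = n the sum equals 2**n - 1, so the search space explodes as the
# resource pool grows -- hence the need for a finite pool and snapshots.
for n in (10, 20, 30):
    print(n, total_team_combinations(n, n))
```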
3.1.2. Solving resource contention between teams

Besides forming teams, several other issues need to be considered:

• Two teams competing for the same resource: This is a common issue in the DSPR framework, especially when the available resources are limited. Fortunately, such resource contention is not a new problem; it is very similar to the classical networking problem where multiple devices attempt to use the same channel to transmit data. Thus, techniques such as carrier sense multiple access with collision detection (CSMA/CD) and carrier sense multiple access with collision avoidance (CSMA/CA), which were used to resolve contention in networking environments, can be adopted to solve the problem of multiple teams competing for the same resource (a sketch follows this list).
• Two members of a team competing to become the team leader: There is also a chance that two equally qualified members in a team attempt to take over the team leader position. There are two ways to resolve this. First, a change of team leader requires the candidate to be strictly better; being equally good is not sufficient to take over the team leader position. Second, this issue can be handled at the implementation level by preventing the managed service from being interrupted during the process of changing the team leader.
• A network-disjointed team causing a service to be illegally replicated: Network failures can cause a team to be disjointed, i.e., divided into two sub-teams. The team members that are separated from the team leader would perceive that the team leader is down and attempt to replace it, while the team leader that is separated from some of its members would recruit new members to form a new team. To solve this issue, a team member that wants to take over the team leader position must be able to communicate with the DNS service and confirm that the connection to the existing team leader is down. Members that cannot communicate with the DNS are not allowed to become the team leader.
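As a rough illustration of the CSMA-style resolution named in the first issue, the sketch below claims a resource and backs off randomly on collision. The shared claim table standing in for the transmission medium, and all names, are our own assumptions.

```python
import random
import time

claims = {}   # resource id -> team id; stands in for the shared medium

def try_claim(resource, team, max_attempts=5):
    """CSMA/CA-flavoured acquisition: sense, claim, re-check, back off."""
    for attempt in range(max_attempts):
        if claims.get(resource) in (None, team):        # carrier sense
            claims[resource] = team
            time.sleep(random.uniform(0.0, 0.01))       # vulnerable window
            if claims.get(resource) == team:            # collision check
                return True
        # Randomized exponential backoff before the next attempt.
        time.sleep(random.uniform(0.0, 0.01 * (2 ** attempt)))
    return False

print(try_claim("machine-07", "team-A"))   # True: resource was free
print(try_claim("machine-07", "team-B"))   # False: already claimed by team-A
```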
3.2. Identifying the appropriate resource

The DSPR framework is required to select appropriate resources for all the services based on the requirements specified by the administrators. Therefore, the resource evaluation module is a function made to return a metric that distinguishes the suitability of a given candidate team for hosting a service. Fig. 3.4 illustrates the workflow of configuring the resource evaluation module. The configuration process begins with configuring the fuzzy logic (FL) component of the resource evaluation module, which includes the membership functions and production rules of the fuzzy inference engine.

Fig. 3.4. Workflow of configuring the resource evaluation module.

The proposed resource evaluation function uses FL to represent the administrator's resource management policies using high-level rules such as: if availability is "good", performance is "moderate" and cost is "reasonable", then the acceptance is "moderate". The proposed solution also incorporates an Adaptive-Network-Based Fuzzy Inference System (ANFIS) into the evaluation function. The purpose of having ANFIS in the evaluation function is to allow the DSPR to learn from the feedback given by the administrators and to make the decisions preferred by the administrators in the future.

The reason for choosing FL over other resource evaluation methods is that FL has the ability to perform approximate rather than exact reasoning. Thus, it can tolerate partially qualified resources when no better options are available. On top of that, fuzzy logic employs a tradeoff mechanism that is carried out via the fuzzy membership functions and the fuzzy inference engine. For the tradeoff mechanism, the administrator specifies rules and ranges of values that are considered good, average and poor in terms of availability, cost and performance. These values are more intuitive for the administrator to set than the specific weights required by utility-function-based approaches.

Although FL has the ability to reason with imprecision and uncertainty, it is not without problems of its own. For instance, in Fig. 3.5, the two teams might be given a similar score because the membership functions involved for both teams are the same. This phenomenon is due to FL's approximating nature. Therefore, we introduced ANFIS into the proposed resource evaluation technique.

Fig. 3.5. Limitation of the fuzzy inference system, producing the same score for Team A and B.

ANFIS allows the administrator to make adjustments in the resource evaluation process without having to completely reconfigure the FL. For the purpose of this work, this adjustment process is known as feedback. Feedback is requested by the resource evaluation process when more than one option has the same highest score and the resource evaluation process cannot determine which option is better on behalf of the administrator. Fig. 3.6 illustrates this scenario: the administrator is only required to choose the preferred option. The feedback process then automatically adjusts the scores of all the non-preferred options down to the second-highest option's score; in this case, the score of the second-highest option is 21. The reduced scores are then used by ANFIS to perform adaptation. Thus, if a similar case is encountered in the future, the resource evaluation process will know which option the administrator prefers.

The reason for combining FL and ANFIS instead of using only ANFIS is that it is not possible for the administrator to provide a huge number of input–output data pairs in advance for ANFIS training. Therefore, FL is retained in the resource evaluation function: FL is used to represent the administrator's resource management policies and to generate input–output data pairs for ANFIS training.
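The rule quoted above can be compressed into a few lines of Python; the triangular membership functions, their breakpoints and the single production rule below are toy assumptions for illustration, not the administrator-configured rule base.

```python
def triangular(x, a, b, c):
    """Degree of membership of x in a triangular fuzzy set (a, b, c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Toy membership functions (breakpoints chosen arbitrarily).
def good_availability(v):    return triangular(v, 0.95, 0.99, 1.001)
def moderate_performance(v): return triangular(v, 100, 300, 500)
def reasonable_cost(v):      return triangular(v, 0, 100, 300)

def acceptance(availability, performance, cost):
    """Rule: IF availability is good AND performance is moderate AND cost is
    reasonable THEN acceptance is moderate (min implements the fuzzy AND)."""
    return min(good_availability(availability),
               moderate_performance(performance),
               reasonable_cost(cost))

print(acceptance(0.99, 300, 100))   # 1.0  -> rule fully fires
print(acceptance(0.96, 200, 250))   # 0.25 -> partially qualified resource
```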
Fig. 3.8. Average training time taken for 2500 training samples.
Fig. 3.6. An example of the DSPR evaluation function feedback.
Subsequently, ANFIS is used to learn the feedback given by the administrators, so that the FL does not need to be reconfigured all over again. Once the training process is completed, the ANFIS is ready to perform resource evaluation. Fig. 3.7 depicts the workflow of the resource evaluation process using ANFIS. During resource evaluation, the team score computed by the ANFIS is used to determine which team combination is more suitable, without needing the FL. The feedback given by the administrator is also used by ANFIS to perform adaptation, which means that when a similar case arises in the future, the ANFIS will know which of the options to choose.
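The feedback rule just described (demote every non-preferred option that is tied at the top to the runner-up score, then retrain) is small enough to state directly; the data layout below is our own assumption.

```python
def apply_feedback(scored_options, preferred):
    """scored_options: list of (option, score) pairs; preferred: the option
    the administrator picked among those tied at the highest score. Tied
    non-preferred options are demoted to the second-highest score; the
    adjusted pairs then serve as new training data for ANFIS adaptation."""
    distinct = sorted({score for _, score in scored_options}, reverse=True)
    top = distinct[0]
    runner_up = distinct[1] if len(distinct) > 1 else top
    return [(opt, score if score != top else
                  top if opt == preferred else runner_up)
            for opt, score in scored_options]

# Mirrors the example in the text, where the runner-up score is 21.
options = [("team_a", 25), ("team_b", 25), ("team_c", 21)]
print(apply_feedback(options, "team_a"))
# [('team_a', 25), ('team_b', 21), ('team_c', 21)]
```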
Fig. 3.9. Average RMSE after the training.
3.2.1. Justifications for adopting ANFIS: ANFIS vs. neural network

Prior to proposing ANFIS to enable the resource evaluation function to learn from the administrator's feedback, we intended to design the adaptive resource evaluation using a multilayer perceptron neural network (NN). However, Negnevitsky (2002) and Kulkarni (2001) claimed that ANFIS has a better ability to generalize and converges rapidly, a feature that is particularly important for applications requiring online learning. Therefore, we compared ANFIS and NN based on the time required for training, the root mean square error (RMSE) and the number of epochs required. For the comparison, 5 different services with different requirements were used. Fuzzy logic was used to capture the administrator's resource management policies and to generate input–output data pairs for the NN and ANFIS training.
A total of 2500 training samples was used. The training algorithm used in the NNs was the Levenberg–Marquardt back-propagation algorithm, while the training algorithm used in ANFIS was the hybrid learning algorithm proposed by Jang (1993). Figs. 3.8–3.10 show the results of our experiments. The results show that ANFIS did not produce the shortest training time: neural networks with fewer hidden nodes completed the training process faster but produced a higher RMSE. The experiment also showed that a lower number of epochs does not imply a lower training time.
Fig. 3.7. Workflow of the resource evaluation module.
Fig. 3.10. Average number of epochs required for training.
Fig. 3.11. Scatterplot charts that show the correlation between FL and ANFIS using different training sizes.
Although the training time for ANFIS was 33.317 s, which seems long, this is the time taken to train on 2500 samples. The adaptation process will involve a smaller number of cases and a shorter training time, unless the administrator initially configures a set of unreasonable membership functions that leaves the resource evaluation function unable to make decisions and causes it to request feedback from the administrator frequently. In this work, ANFIS was chosen for the resource evaluation function because ANFIS has a lower RMSE than NN, which indicates that ANFIS has a better ability to learn, and because it does not require the user to determine the optimal structure of the learning algorithm, as is the case for NN.
3.2.2. Ability of ANFIS to model fuzzy logic

Since ANFIS is introduced into the proposed resource evaluation function, an immediate concern was the ability of ANFIS to model the reasoning of FL. The FL is configured by the administrator and used to generate input–output data pairs for ANFIS training. To address this concern, experiments were carried out to measure the ability of ANFIS to model the reasoning of FL using scatterplot charts and Pearson's correlation coefficient. Fig. 3.11 shows the results of training the ANFIS using various sample sizes, from 500 to 3000. Table 3.1 shows Pearson's correlation coefficient, with corresponding p-values, between the outputs of FL and ANFIS when given the same inputs.

The results showed that as the number of training samples increases, Pearson's correlation coefficient improves and approaches 1, which implies that ANFIS can model FL well given a sufficient number of input–output data pairs. Table 3.1 also shows that the improvement in the Pearson correlation coefficient becomes insignificant when more than 2000 training samples are used. Therefore, for the DSPR's resource evaluation implementation, we propose using 2500 input–output data pairs for the entire training process (configuring the resource evaluation function).
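The agreement measurement can be reproduced with a few lines; the sketch below computes Pearson's r between paired FL and ANFIS outputs (the sample values are made up for illustration).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# fl[i] and anfis[i] are the two functions' scores for the same input;
# a coefficient near 1 means ANFIS reproduces the FL reasoning faithfully.
fl    = [0.10, 0.35, 0.60, 0.80, 0.95]
anfis = [0.12, 0.33, 0.62, 0.79, 0.96]
print(pearson_r(fl, anfis))
```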
4. Experimental results

The focus of these experiments is to show the ability of the proposed framework to enhance service availability and to relieve the administrator of low-level decision making. Table 4.1 presents a summary of the experiments conducted in this work.
4.1. Simulation settings for Test 1 and Test 2

For Test 1 and Test 2, 5 different services were used and each service was managed by a team of machines. The environment consisted of 30 machines that were evenly distributed between two buildings. Each machine was given different availability, cost and performance values. Fig. 4.1 illustrates the environment of the simulation. In this simulation, 5 different services, each with different requirements, were deployed; the details of the 5 services are given in Table 4.2.
Table 3.1
Pearson's correlation coefficient between the FL and ANFIS, with corresponding p-values, for different training sizes.

ANFIS training size | 500 | 1000 | 1500 | 2000 | 2500 | 3000
Pearson correlation coefficient | 0.997543 | 0.998574 | 0.999039 | 0.999103 | 0.999347 | 0.999351
p-Value | 7.0351E−34 | 3.5057E−37 | 1.4034E−39 | 5.3505E−40 | 6.2471E−42 | 5.733E−42
Table 4.1
Summary of experiments.

Experiment | Description
Test 1 — Evaluation of the ability of DSPR to increase service availability in environments where resources are dynamically removed and restored | Compares the ability of DSPR to increase service availability with conventional failover systems in an environment where resources are dynamically removed and restored.
Test 2 — Evaluation of the ability of DSPR to increase service availability in environments where resources are dynamically removed and NOT restored | Tests the ability of DSPR to increase the availability of services in the event of multiple resource failures without restoration.
Test 3 — DSPR framework from a user perspective | Observes the actual performance of a service managed by the DSPR framework from an end user's perspective.
In these tests, we measured the percentage of services that were still available after multiple resource failures. The method used to evaluate DSPR is similar to the autonomic computing benchmark methodology proposed by Coleman et al. (2008), which consists of three phases: the baseline phase, the test phase and the check phase. Figs. 4.2 and 4.3 illustrate these phases.

• Baseline phase: the initial stage of the system, in which all services are working under normal conditions.
• Test phase: the working services are subjected to a disturbance, namely resources being dynamically removed.
• Check phase: the results are collected and analyzed.

Fig. 4.1. Configuration of the simulation environment for testing the DSPR's ability to improve service availability.

Fig. 4.2. Autonomic computing benchmark phases.

Fig. 4.3. Fault injection subintervals.

We compared the number of services that were still available in the check phase with that in the baseline phase to obtain the probability of the services still being available after all the fault injections. The experiments were repeated 50 times to obtain the average probability of a service being available after the resources were turned off and on randomly. Services that were not hosted by any server or workstation were considered unavailable. The average of these probabilities serves as a measurement of the effectiveness of the DSPR framework and was computed using the following equation:

P_avg = (1/E) Σ_{i=1}^{E} (S_i / B_i)

where E = 50 is the number of experiment repetitions, S_i is the number of services still available in the check phase of repetition i, and B_i is the number of services in the baseline phase of repetition i.
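In code, the measurement reduces to the small harness below; the run counts are illustrative stand-ins, not the actual experimental data.

```python
def average_availability(runs):
    """runs: one (services_available_in_check, services_in_baseline) pair
    per repetition of the baseline/test/check cycle."""
    return sum(available / baseline for available, baseline in runs) / len(runs)

# Three illustrative repetitions of a 5-service simulation.
print(average_availability([(4, 5), (5, 5), (3, 5)]))   # 0.8
```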
4.2. Test 1: evaluation of the ability of DSPR to increase service availability in environments where resources are dynamically removed and restored

The purpose of this simulation was to test the ability of the DSPR framework to enhance service availability in an environment where servers and workstations could be randomly turned off and on. The service availability of the DSPR framework was benchmarked against the availability of services on a system without failover capability and on systems with failover capability using 2 and 3 redundant units in the simulated distributed environment. Fig. 4.4 shows the result of the simulation. The result shows that DSPR is more capable of enhancing service availability than the system without failover capability (Static01) and the system with two machines providing failover capability (Static02). Although the DSPR did not increase service availability to the level of a failover capability with three machines (Static03), the main difference between DSPR and conventional failover systems is that DSPR does not require the administrator to specify the placement of all the services in advance.

To verify the results of the simulation, we compared our simulation results with results computed using the hypergeometric distribution (McClave and Sincich, 2003; Haggard et al., 2006). The hypergeometric distribution can be used to compute the probability that a service running on r servers is still available when n servers are removed (taken out without replacement) from a population of N servers.
Table 4.2
Service requirements of the 5 simulated services.

Service name | Service description (example) | Availability (uptime %) | Cost (cents/h) | Performance (requests/s)
Service 1 | Application server for deployment | 99 | 70 | 300
Service 2 | Application server for development | 90 | 30 | 300
Service 3 | Budgeted service that requires high-availability resources | 99.99 | 100 | 300
Service 4 | Service that requires high-availability resources | 99.999 | 300 | 350
Service 5 | Service that requires high-performance resources | 99.9 | 250 | 1200
For instance, the probability that a service running on 2 servers is removed when 15 servers are removed simultaneously from a population of 30 servers can be calculated using the hypergeometric distribution. Mathematically:

P(x) = [C(r, x) × C(N − r, n − x)] / C(N, n)

where N = total number of servers, r = number of servers running the service, n = number of servers removed, and x = number of servers running the service that are among the n removed servers. For the example just mentioned, N = 30, r = 2, n = 15 and x = 2:

P(x) = [C(2, 2) × C(30 − 2, 15 − 2)] / C(30, 15) = [C(2, 2) × C(28, 13)] / C(30, 15) = (1 × 37,442,160) / 155,117,520 = 0.241379

Therefore, the probability that the service is still available after 15 machines are removed simultaneously is 1 − P(x), because P(x) is the probability that the service is removed. Thus, the probability that the service is still available in this example is 1 − P(x) = 1 − 0.241379 = 0.758621. The hypergeometric distribution was used to plot the probability of a service being available for different numbers of machines being turned off simultaneously. The graph shown in Fig. 4.5 is generated from the hypergeometric distribution, and Table 4.3 shows the mean square error between the simulation and the values computed using the hypergeometric distribution.
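The worked example can be verified with a few lines of Python using the same combinatorial terms:

```python
from math import comb

def p_service_removed(N, r, n, x):
    """Hypergeometric probability that exactly x of the r servers hosting a
    service are among the n servers removed from a population of N."""
    return comb(r, x) * comb(N - r, n - x) / comb(N, n)

# Worked example from the text: N = 30, r = 2, n = 15, x = 2.
p = p_service_removed(30, 2, 15, 2)
print(round(p, 6))        # 0.241379 -> both replicas removed
print(round(1 - p, 6))    # 0.758621 -> service still available
```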
Fig. 4.5. Probability of a service being available with different numbers of machines being simultaneously turned off and on, calculated using the hypergeometric distribution.
The hypergeometric distribution is not suitable for modeling DSPR because the number of service replicas is not static. Therefore, the shaded area in Fig. 4.5 shows the approximate result of the DSPR based on the previous empirical experiment. Similarly, the MSE for DSPR could not be theoretically computed; thus, we omitted the MSE of DSPR from Table 4.3. The small MSE between the simulation and the probabilities computed using the hypergeometric distribution indicates that the simulated results are close to the theoretical calculation, which verifies the accuracy of the simulation. However, it is important to note that the DSPR's performance is very dependent on the available resources and the requirements given by the administrator. Nevertheless, the advantage of the DSPR highlighted by this experiment is that the administrator does not need to specify placement details even in a dynamic environment.

4.3. Test 2: evaluation of the ability of DSPR to increase service availability in environments where resources are dynamically removed and NOT restored

The purpose of this simulation was to test the ability of the DSPR framework to enhance service availability in an environment where servers and workstations could be randomly removed and never restored.
Fig. 4.4. Probability of a service being available with different numbers of machines being simultaneously turned off and on.
Table 4.3
Mean square error (MSE) between the simulation and the values computed using the hypergeometric distribution.

 | Static01 | Static02 | Static03
MSE | 0.00021 | 0.00056 | 0.00043
Fig. 4.6. Probability of a service being available at different stages of the simulation.
Machines from both buildings were removed randomly in batches of 5 machines until all machines from both buildings were removed. Sufficient time was given for the formation of a new team before the subsequent batch of machines was removed. There were 6 stages before all the machines were removed, and this experiment observed the number of services that remained at each stage. For benchmarking, systems with and without failover capabilities were used. For a system with failover capability, each service was given two servers from different buildings, while for a system without failover capability, each service was given one server located in one of the buildings. Services that did not have any machine were considered unavailable. Besides that, the results for DSPR were also compared with conventional methods such as RSerPool (Dreibholz et al., 2007).

Fig. 4.6 shows the results of the experiment. The line Static01 shows the probability that the service is available without failover capability; Static02 and Static03 show the probability that the service is available with failover capabilities using 2 and 3 redundant units. The results show that DSPR had probabilities of 71.2% and 70.4%, respectively, that its services were still available when 25 out of 30 machines (approximately 83.3%) were removed from the resource pool. Services running on 3 redundant servers (Static03) only had a probability of 40.8% of still being available.

4.4. Test 3: DSPR framework from a user perspective

The purpose of this experiment was to observe, from a user perspective, the performance of a service as it is managed by the framework. In this experiment, a web service was deployed on actual servers connected via a wide area network (WAN). The performance of the service was measured from a remote client computer, and we observed how the performance of the service changed as the proposed framework performed service placement and replication. Fig. 4.7 shows an overview of the experimental setting. The following are the requirements of the service:

(1) Availability ≥ 0.99
(2) Cost ≤ $0.60/h
(3) Performance ≥ 60 requests per second.
Table 4.4
The availability, cost and performance of the servers used by DSPR.

Machine | Availability (%) | Cost ($/h) | Performance (requests/s)
Aurora01 | 85 | 0.10 | 31
Aurora02 | 85 | 0.10 | 28
Exabytes | 99 | 0.50 | 51
SunSPARC | 99.95 | 1.05 | 108
In total, 5 machines were involved in this experiment: 4 servers were used by DSPR to perform service placement and replication, while 1 machine acted as both the DNS and the client computer. The client computer was installed with Apache's JMeter server benchmarking tool, which can simulate multiple concurrent HTTP requests to the service hosted on the DSPR framework. The DSPR constantly updates the client machine regarding the location of the service by updating the DNS. For the purpose of this experiment, the deployed service is a web service that computes the total sum of all the values from 1 to a given value; for each service request, this value is a random number. Before the experiment began, the DSPR performed a performance evaluation on all the given servers, and the administrator decided on the availability and cost of the servers. The requests/s value for each machine was obtained via a local machine test, with the JMeter benchmark tool installed and executed locally on the respective machines. This information was used by the DSPR to manage the given service. Table 4.4 shows the availability, cost and performance of the machines in the distributed environment. The client machine is not shown because the service would never be hosted on it.

Fig. 4.8 shows the performance of the service managed by DSPR from the perspective of a user. The data was collected on the client machine using the installed Apache JMeter benchmarking tool. The experiment began with the service deployed on Aurora01 with DSPR turned off. The number of concurrent HTTP requests to the service started at 80. The number of concurrent HTTP requests is expressed in threads/10 s, where 10 s is the ramp-up period of all the threads; 120 threads/10 s therefore means that JMeter takes 10 s to start all 120 threads. This avoids a large number of threads congesting the server at the beginning of the test. The error rate is the percentage of requests that fail to get a response from the service. The team leader handles all the requests, and a team member is only used when the team leader starts to fail in handling subsequent incoming requests.

At 120 threads/10 s, Aurora01 only managed to handle 7.49% of the requests, leaving 92.51% of the requests with errors. The DSPR was only turned on when the number of requests increased to 120; it was not turned on earlier to prevent the service placement and replication process from being executed before the readings were collected. When DSPR was turned on, it teamed up the Exabytes server with Aurora01 to attempt to fulfill the requirements of the service. The error rate was successfully reduced to zero even when the number of concurrent HTTP requests increased to 180. As the number of concurrent HTTP requests reached 200, the Exabytes server was removed to test the ability of the DSPR to perform service placement and replication. All the requests made to Exabytes were diverted to Aurora01, but due to the high number of concurrent HTTP requests, the service running on Aurora01 stalled and 100% of the service requests failed. Aurora02 was then teamed up with Aurora01 by the DSPR to form a new team to fulfill the given service's requirements. The new team managed to handle up to 240 concurrent HTTP requests with about a 1.5% error rate.
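For reference, a test service of the kind described above can be reproduced with the standard library alone; the handler below is our own minimal stand-in for the deployed web service, not the original code.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

class SumHandler(BaseHTTPRequestHandler):
    """Returns 1 + 2 + ... + n for requests like GET /?n=123456."""

    def do_GET(self):
        n = int(parse_qs(urlparse(self.path).query).get("n", ["0"])[0])
        total = sum(range(1, n + 1))   # deliberately CPU-bound, as in the test
        body = str(total).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), SumHandler).serve_forever()
```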
Fig. 4.7. The environment settings for observing actual service performance from the end user's perspective.
At 260 threads/10 s, the error rate increased to 96.42%, as the team consisting of the Aurora01 and Aurora02 servers could no longer handle the number of simultaneous requests. The Exabytes server was returned to the pool when the number of concurrent HTTP requests reached 280, and the DSPR changed the team back to the Aurora01 and Exabytes servers. As shown in Fig. 4.8, this team was capable of handling up to 320 concurrent requests with a 2.32% error rate. The experiment then continued; this time, instead of removing and adding a server, we modified the service requirements. The following were the new requirements of the service:
(1) Availability ≥ 0.99999
(2) Cost ≤ $1.90/h
(3) Performance ≥ 160 requests per second.
As the service requirements were modified, the DSPR changed the team consisting of the Exabytes and Aurora01 servers to a team consisting of the Exabytes and SunSPARC servers when the number of concurrent HTTP requests was at 340. This new team was capable of handling up to 600 threads/10 s of HTTP requests without error. Finally, when the number of concurrent HTTP requests was at 620, the SunSPARC machine was removed and the Exabytes server took over. Unfortunately, the number of concurrent requests was too overwhelming for the Exabytes server to handle: it failed to manage the requests and the error rate reached 99.11%.
Fig. 4.8. The service performance observed by the user as the number of concurrent accesses increases.
To conclude, this experiment showed that DSPR was capable of teaming up resources to improve the performance of a service. However, DSPR was not able to completely mask the errors caused by the removal of servers if the performance required by the service could not be satisfied by the remaining team members. Besides that, the DSPR framework needs some time to detect the failure or removal of a server.

5. Conclusion

This work presented a framework that reduces the complexity of service management in the ever-growing distributed computing environment. The framework is designed to relieve administrators of the issues related to service placement and service availability. Most of the existing solutions focus on service placement and operate on a very specific set of objectives, such as minimizing the number of changes, maximizing performance or reducing the overall power consumption. Unlike existing solutions, our framework incorporates service placement with service replication to enhance not only the performance but also the availability of services. We also developed a simulator to show the effectiveness and limitations of the proposed solution. Our experiments showed that it is possible to free administrators from making low-level service placement decisions and from designing the failover capabilities for all the managed services. The two simulated scenarios, removing resources without restoration and removing resources with restoration, are very common in distributed system environments. We showed that the proposed framework is able to manage the services with minimal administrator intervention and that the availability of the managed services is even better than that of conventional server failover solutions.

5.1. Research limitations

Although the experimental results indicate that the proposed framework is feasible, there are issues that need to be considered for this framework to be effectively applied in the real world. For instance:

• This research did not take data synchronization into consideration. A mechanism is required to ensure that the data shared among the services is accurate and synchronized in a timely manner. As services are continuously shifted and replicated, it is very difficult to keep track of the latest changes to their data. Although a centralized storage could solve the data race issue, it would become a single point of failure for this self-management framework. An efficient and effective distributed storage must be developed, with the constraint that the synchronization process must not overload the network infrastructure.
• This research assumed that services can be migrated and replicated across all platforms instantaneously. Although the advancement of virtual machines allows applications to be executed across different platforms, problems remain. For instance, not all hardware supports virtualization. Besides that, replicating and migrating virtual machines involves huge files, especially the transfer of virtual hard disk files; transferring such large files over the network consumes a lot of bandwidth, incurs long waiting times and is prone to error.
• The proposed framework allows services to be migrated and replicated across networks, and it did not take computer security and information confidentiality into consideration. This work has yet to address security issues such as server impersonation.
5.2. Future work

We would like to conclude this work by presenting possible future work for this research.

• The type of service referred to in this research is standalone, with no dependencies on other web services. This allows the proposed framework to shift services to any connected server without adverse effects. Future work should take multi-tier web service architectures into consideration. A multi-tier architecture often requires a service to be logically separated into presentation, application processing, and data management processes, each of which could run as an individual service. These services are required to collaborate in order to function as a whole. Thus, the framework would need to consider the overall service performance before making a shifting decision.

• Future work could also explore the opportunity to team up resources to ensure the correctness of the outputs before replying to the service requestor. For instance, two similar services running on two different servers can be deployed as a dual modular redundant (DMR) system, and three services can be deployed as a triple modular redundant (TMR) system in which errors can be detected and masked (a voting sketch follows this list).

• The proposed resource evaluation function is relatively more complex than those of currently available solutions and is more compute-intensive when evaluating resources. A faster mechanism will be needed as the distributed environment continues to grow and more resources need to be evaluated. Techniques to reduce the search space of the team formation algorithm should therefore be explored further. For instance, resource clustering can be employed to group resources with similar specifications and availability, so that instead of evaluating each resource individually, the DSPR framework evaluates the specifications and availability of the clusters (see the second sketch following this list).

• Resource capacity and utilization prediction would be a good extension to this framework. If the framework could predict the maximum number of services or the utilization that a resource can handle at a particular time, it would definitely help the framework to provide better team solutions.
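As an illustration of the second item above, the following minimal Python sketch shows how a TMR voter could mask a single faulty reply by majority voting over three replica responses. The function name and the assumption that replies are directly comparable are our own; the actual comparison logic would depend on the service interface.

    from collections import Counter

    def tmr_vote(responses):
        """Majority-vote over the replies of three replicated services.
        Returns the reply produced by at least two replicas, masking a
        single faulty one; raises an error if all three disagree."""
        value, votes = Counter(responses).most_common(1)[0]
        if votes >= 2:
            return value
        raise RuntimeError("no majority: replicas disagree")

    # One replica returns a corrupted reply; the voter masks it.
    print(tmr_vote(["OK:42", "OK:42", "ERR"]))   # -> OK:42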
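Similarly, the resource clustering idea in the third item could be prototyped as below: resources whose specification and availability fall into the same bucket are grouped, and the team formation algorithm then evaluates one representative per cluster instead of every resource. The attribute names and bucket widths here are illustrative assumptions, not part of the DSPR framework as implemented.

    from collections import defaultdict

    def cluster_resources(resources, cpu_step_ghz=2.0, avail_step_pct=5):
        """Group resources with similar CPU capacity and availability so
        that only one representative per cluster needs to be evaluated."""
        clusters = defaultdict(list)
        for r in resources:
            key = (int(r["cpu_ghz"] / cpu_step_ghz),
                   int(r["availability"] * 100) // avail_step_pct)
            clusters[key].append(r)
        return list(clusters.values())

    # Four servers collapse into two clusters, halving the search space.
    servers = [
        {"name": "s1", "cpu_ghz": 2.4, "availability": 0.98},
        {"name": "s2", "cpu_ghz": 2.6, "availability": 0.97},
        {"name": "s3", "cpu_ghz": 8.0, "availability": 0.90},
        {"name": "s4", "cpu_ghz": 8.4, "availability": 0.91},
    ]
    print(len(cluster_resources(servers)))       # -> 2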
In conclusion, although this work showed that the proposed solution is feasible and the results are promising, more work needs to be done before this self-management framework can be deployed on actual distributed environments.

References

Ganek, A.G., Corbi, T.A., 2003. The dawning of the autonomic computing era. IBM Systems Journal 42, 5–18.
Kephart, J.O., Chess, D.M., 2003. The vision of autonomic computing. Computer 36, 41–50.
Kounev, S., Nou, R., Torres, J., 2007. Autonomic QoS-aware resource management in grid computing using online performance models. In: Proceedings of the 2nd International Conference on Performance Evaluation Methodologies and Tools, ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), Nantes, France.
Folino, G., Spezzano, G., 2007. An autonomic tool for building self-organizing grid-enabled applications. Future Generation Computer Systems 23, 671–679.
Liu, H., Bhat, V., Parashar, M., Klasky, S., 2005. An autonomic service architecture for self-managing grid applications. In: Proceedings of the 6th IEEE/ACM International Workshop on Grid Computing, IEEE Computer Society.
Parashar, M., Liu, H., Li, Z., Matossian, V., Schmidt, C., Zhang, G., Hariri, S., 2006. AutoMate: enabling autonomic applications on the grid. Cluster Computing 9, 161–174.
Vlassov, V., Li, D., Popov, K., Haridi, S., 2006. A scalable autonomous replica management framework for grids. In: Proceedings of the IEEE John Vincent Atanasoff 2006 International Symposium on Modern Computing, IEEE Computer Society.
Agarwal, M., Bhat, V., Liu, H., Matossian, V., Putty, V., Schmidt, C., Zhang, G., Zhen, L., Parashar, M., Khargharia, B., Hariri, S., 2003. AutoMate: enabling autonomic applications on the grid. In: Autonomic Computing Workshop, pp. 48–57.
Hariri, S., Khargharia, B., Chen, H., Yang, J., Zhang, Y., Parashar, M., Liu, H., 2006. The autonomic computing paradigm. Cluster Computing 9, 5–17.
Horn, P., 2001, October 15. Autonomic Computing: IBM's Perspective on the State of Information Technology. Technical Report, IBM Corporation.
Ganek, A.G., 2007. Overview of autonomic computing: origins, evolution, direction. In: Parashar, M., Hariri, S. (Eds.), Autonomic Computing: Concepts, Infrastructure, and Applications. CRC Press, pp. 3–18.
Herrmann, K., Muhl, G., Geihs, K., 2005. Self management: the solution to complexity or just another problem? IEEE Distributed Systems Online 6.
ITIL® Official Website. http://www.itil-officialsite.com/ (accessed 27.10.2010).
Zhu, X., Young, D., Watson, B.J., Wang, Z., Rolia, J., Singhal, S., McKee, B., Hyser, C., Gmach, D., Gardner, R., Christian, T., Cherkasova, L., 2008. 1000 Islands: integrated capacity and workload management for the next generation data center. In: Proceedings of the 2008 International Conference on Autonomic Computing, IEEE Computer Society.
Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., Warfield, A., 2003. Xen and the art of virtualization. In: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, Bolton Landing, NY, USA, ACM.
Microsoft, 2008, April. Microsoft System Center Virtual Machine Manager 2008. White paper.
VMware, 2006. VMware Infrastructure: Resource Management with VMware DRS. White paper.
VMware, 2008. DRS Performance and Best Practices, VMware® Infrastructure 3. White paper.
Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., Zaharia, M., 2010. A view of cloud computing. Communications of the ACM 53, 50–58.
Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R.H., Konwinski, A., Lee, G., Patterson, D.A., Rabkin, A., Stoica, I., Zaharia, M., 2009, February. Above the clouds: a Berkeley view of cloud computing. Technical Report, Electrical Engineering and Computer Sciences, University of California at Berkeley.
Sailer, A., Head, M.R., Kochut, A., Shaikh, H., 2010. Graph-based cloud service placement. In: IEEE International Conference on Services Computing (SCC), pp. 89–96.
Goscinski, A., Brock, M., 2010. Toward dynamic and attribute based publication, discovery and selection for cloud computing. Future Generation Computer Systems 26, 947–970.
Velte, A.T., Velte, T.J., Elsenpeter, R., 2010a. Chapter 13: migrating to the cloud. In: Cloud Computing: A Practical Approach. McGraw-Hill, pp. 277–295.
Velte, A.T., Velte, T.J., Elsenpeter, R., 2010b. Chapter 14: best practices and the future of cloud computing. In: Cloud Computing: A Practical Approach. McGraw-Hill, pp. 297–314.
Joseph, J., 2009, May. Patterns for high availability, scalability and computing power with Windows Azure. MSDN Magazine.
Kephart, J.O., 2005. Research challenges of autonomic computing. In: Proceedings of the 27th International Conference on Software Engineering, ACM, St. Louis, MO, USA.
Salehie, M., Tahvildari, L., 2005. Autonomic computing: emerging trends and open problems. SIGSOFT Software Engineering Notes 30, 1–7.
Parashar, M., Hariri, S., 2005. Autonomic computing: an overview. In: Unconventional Programming Paradigms. Springer Verlag, pp. 247–259.
Ranganathan, A., Campbell, R.H., 2007. What is the complexity of a distributed computing system? Complexity 12, 37–45.
Ooi, B.Y., Chan, H.Y., Cheah, Y.-N., 2010. Dynamic service placement and redundancy to ensure service availability during resource failures. In: International Symposium in Information Technology (ITSim), pp. 715–720.
Andrzejak, A., Graupner, S., Kotov, V., Trinks, H., 2002, September. Algorithms for Self-Organization and Adaptive Service Placement in Dynamic Distributed Systems. HP Labs Technical Report.
Nogueira, L., Pinho, L.M., 2009. Time-bounded distributed QoS-aware service configuration in heterogeneous cooperative environments. Journal of Parallel and Distributed Computing 69, 491–507.
Famaey, J., Cock, W.D., Wauters, T., Turck, F.D., Dhoedt, B., Demeester, P., 2009. A latency-aware algorithm for dynamic service placement in large-scale overlays. In: IFIP/IEEE International Symposium on Integrated Network Management (IM 2009), New York, pp. 414–421.
Karve, A., Kimbrel, T., Pacifici, G., Spreitzer, M., Steinder, M., Sviridenko, M., Tantawi, A., 2006. Dynamic placement for clustered web applications. In: Proceedings of the 15th International Conference on World Wide Web, ACM, Edinburgh, Scotland.
Adam, C., Stadler, R., 2007. Service middleware for self-managing large-scale systems. IEEE Transactions on Network and Service Management 4, 50–64.
Adam, C., Stadler, R., Tang, C., Steinder, M., Spreitzer, M., 2007. A service middleware that scales in system size and applications. In: 10th IFIP/IEEE International Symposium on Integrated Network Management, IM '07, Munich, pp. 70–79.
Ardagna, D., Trubian, M., Zhang, L., 2007. SLA based resource allocation policies in autonomic environments. Journal of Parallel and Distributed Computing, 259–270.
Famaey, J., Wauters, T., Turck, F.D., Dhoedt, B., Demeester, P., 2008. Towards efficient service placement and server selection for large-scale deployments. In: Proceedings of the 2008 Fourth Advanced International Conference on Telecommunications, IEEE Computer Society.
Abd-El-Barr, M., 2007. Design and Analysis of Reliable and Fault-tolerant Computer Systems. Imperial College Press, London.
Siewiorek, D.P., Swarz, R.S., 1992. Reliable Computer Systems: Design and Evaluation, 2nd ed. Digital Press.
Sun Microsystems, Inc., 2002, October. Sun™ ONE Grid Engine Administration and User's Guide.
Shooman, M.L., 2002. Reliability of Computer Systems and Networks: Fault Tolerance, Analysis and Design. John Wiley & Sons, Inc.
Microsoft MSDN, 2010. High Availability Solutions Overview. http://msdn.microsoft.com/en-us/library/ms190202.aspx (accessed 11.11.2010).
Barish, G., Obraczke, K., 2000. World Wide Web caching: trends and techniques. IEEE Communications Magazine 38, 178–184.
Bryhni, H., Klovning, E., Kure, O., 2000, August. A comparison of load balancing techniques for scalable web servers. IEEE Network 14, 58–64.
VMware, 2008. Improving Business Continuity with VMware Virtualization. White paper.
Nagaraja, K., Krishnan, N., Bianchini, R., Martin, R.P., Nguyen, T.D., 2003. Quantifying and improving the availability of high-performance cluster-based internet services. In: Supercomputing, 2003 ACM/IEEE Conference, p. 27.
Roehm, B., Chen, G., Fernandes, A.d.O., Ferreira, C., Krick, R., Ley, D., Peterson, R.R., Poul, G., Wong, J., Zhou, R., 2005. High Availability and Caching in WebSphere Application Server V6: Scalability and Performance Handbook. IBM Redbooks.
Dreibholz, T., Zhou, X., Rathgeb, E.P., 2007. A performance evaluation of RSerPool server selection policies in varying heterogeneous capacity scenarios. In: 33rd EUROMICRO Conference on Software Engineering and Advanced Applications, pp. 157–166.
Bozinovski, M., Schwefel, H.P., Prasad, R., 2007. Maximum availability server selection policy for efficient and reliable session control systems. IEEE/ACM Transactions on Networking 15, 387–399.
Dreibholz, T., Rathgeb, E.P., 2007. On improving the performance of reliable server pooling systems for distance-sensitive distributed applications. In: Brauer, W. (Ed.), Kommunikation in Verteilten Systemen (KiVS). Springer Berlin Heidelberg, pp. 39–50.
Rosenberg, F., Platzer, C., Dustdar, S., 2006. Bootstrapping performance and dependability attributes of web services. In: International Conference on Web Services, ICWS '06, pp. 205–212.
Van, H.N., Tran, F.D., Menaud, J.-M., 2009. Autonomic virtual resource management for service hosting platforms. In: Proceedings of the 2009 ICSE Workshop on Software Engineering Challenges of Cloud Computing, IEEE Computer Society.
Verma, A., Ahuja, P., Neogi, A., 2008. Power-aware dynamic placement of HPC applications. In: Proceedings of the 22nd Annual International Conference on Supercomputing, Island of Kos, Greece, ACM.
Oikonomou, K., Stavrakakis, I., Xydias, A., 2008. Scalable service migration in general topologies. In: International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM 2008), pp. 1–6.
Oikonomou, K., Stavrakakis, I., 2010. Scalable service migration in autonomic network environments. IEEE Journal on Selected Areas in Communications 28, 84–94.
Menasce, D.A., Ewing, J.M., Gomaa, H., Malek, S., Sousa, J.P., 2010. A framework for utility-based service oriented design in SASSY. In: Proceedings of the First Joint WOSP/SIPEW International Conference on Performance Engineering, ACM, San Jose, California, USA.
Menasce, D.A., Gomaa, H., Malek, S., Sousa, J.P., 2011. SASSY: a framework for self-architecting service-oriented systems. IEEE Software, 1.
Herrmann, K., 2010. Self-organized service placement in ambient intelligence environments. ACM Transactions on Autonomous and Adaptive Systems 5, 1–39.
Negnevitsky, M., 2002. Artificial Intelligence: A Guide to Intelligent Systems, 2nd ed. Addison Wesley.
Kulkarni, A.D., 2001. Fuzzy neural network models. In: Computer Vision and Fuzzy Neural Systems. Prentice Hall, pp. 313–345.
Jang, J.S.R., 1993, June. ANFIS: adaptive-network-based fuzzy inference system. IEEE Transactions on Systems, Man and Cybernetics 23, 665–685.
Coleman, J., Lau, T., Lokande, B., Shum, P., Wisniewski, R., Yost, M.P., 2008. The autonomic computing benchmark. In: Kanoun, K., Spainhower, L. (Eds.), Dependability Benchmarking for Computer Systems. John Wiley & Sons, Inc.
McClave, J.T., Sincich, T., 2003. The hypergeometric random variable. In: Statistics. Prentice Hall, pp. 208–210.
Haggard, G., Schlipf, J., Whitesides, S., 2006. The hypergeometric distribution. In: Discrete Mathematics for Computer Science. Thomson Brooks/Cole.
Boon-Yaik Ooi is a PhD candidate at the School of Computer Sciences, Universiti Sains Malaysia (USM). He is currently working as a lecturer at the Faculty of Information and Communication Technology (FICT), Universiti Tunku Abdul Rahman (UTAR), Malaysia. He received his B.Sc. and M.Sc. degrees from Universiti Sains Malaysia in 2004 and 2007, respectively. His research interests include grid computing and intelligent systems.
Huah-Yong Chan is an associate professor in the School of Computer Sciences, Universiti Sains Malaysia (USM), Malaysia, and the chief of the grid computing research group at USM. He received his Ph.D. degree from the Université de Franche-Comté, France, in 1999. He is actively involved in grid computing research activities at both the national and international levels. His research spans from grid computing to more specific issues such as resource allocation, load balancing, software agents, and middleware engineering.
Yu-N Cheah received his Ph.D. degree from Universiti Sains Malaysia in 2002. He is currently an associate professor at the School of Computer Sciences, Universiti Sains Malaysia and his research interests include knowledge management, intelligent systems, semantic technologies, and health informatics.