An evaluation of multi-model self-managing control schemes for adaptive performance management of software systems


The Journal of Systems and Software 85 (2012) 2678–2696

Contents lists available at SciVerse ScienceDirect

The Journal of Systems and Software journal homepage: www.elsevier.com/locate/jss

Tharindu Patikirikorala a,∗, Alan Colman a, Jun Han a, Liuping Wang b

a Swinburne University of Technology, Victoria, Australia
b RMIT University, Melbourne, Australia

Article info

Article history:
Received 30 April 2011
Received in revised form 16 May 2012
Accepted 16 May 2012
Available online 3 June 2012

Keywords: Feedback control; Adaptive control; Reconfiguring control; Self-managing systems; Quality of service; Multi-model

Abstract

Due to the increasing complexity of software systems and the dynamic unpredictable environments they operate in, methodologies to incorporate self-adaptation into these systems have been investigated in recent years. The feedback control loop has been one of the key concepts used in building self-adaptive software systems to manage their performance among other quality aspects. In order to design an effective feedback control loop for a software system, modeling the behavior of the software system with sufficient accuracy is paramount. In general, there are many environmental conditions and system states that impact on the performance of a software system. As a consequence, it is impractical to characterize the diverse behavior of such a software system using a single system model. To represent such highly nonlinear behavior and to provide effective runtime control, the design, integration and self-management (automatic switching) of multiple system models and controllers are required. In this paper, we investigate a control engineering approach, called Multi-Model Switching and Tuning (MMST) adaptive control, to assess its effectiveness for the adaptive performance management of software systems. We have conducted a range of experiments with two of the most promising MMST adaptive control schemes under different operating conditions of a representative software system. The experiment results have shown that the MMST control schemes are superior in managing the performance of the software system, compared with a number of other control schemes based on a single model. We have also investigated the impact of the configuration parameters for the MMST schemes to provide design guidance. A library of MMST schemes has been implemented to aid the software engineer in developing MMST-based self-managing control schemes for software systems. © 2012 Elsevier Inc. All rights reserved.

1. Introduction The growing complexity and criticality of software applications in business operations and the dynamic unpredictable nature of their operating environments demand effective runtime management in order to achieve the required performance objectives and reduce administration costs. The traditional methods that are widely used to manage software systems at runtime, such as manual tuning, have proven to be costly and error-prone (Diao et al., 2003; Salehie and Tahvildari, 2009). The fixed and ad hoc threshold based management policies designed to cope with peak demands are difficult to design due to the lack of a formal or systematic design process (Diao et al., 2003; Dutreilh et al., 2010). To achieve effective runtime performance management for software systems, autonomous or self-adaptive control is required. Such self-adaptive control can be categorized into parameter tuning and architectural reconfiguration (Salehie and Tahvildari, 2009; McKinley et al., 2004).

∗ Corresponding author. 0164-1212/$ – see front matter © 2012 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.jss.2012.05.077

In parameter tuning, a set of predefined system parameters (tuning knobs) are adjusted to maintain the required performance metrics, under dynamic and unpredictable environmental changes (e.g., Hellerstein et al., 2004; Lu et al., 2006; Kandasamy et al., 2004). Architectural reconfigurations, on the other hand, change the system structure and its elements in order to achieve the performance objectives under environmental changes (e.g., Garlan et al., 2004; Sykes et al., 2008). Performing such reconfigurations at runtime can be more costly than parameter tuning (Salehie and Tahvildari, 2009). Control engineering techniques which utilize the feedback control loop have been applied successfully for decades to achieve control objectives of physical plants. While the feedback control loop has also been identified as one of the enabling techniques to incorporate parameter tuning and architectural reconfiguration decision making capabilities into self-adaptive software systems (Salehie and Tahvildari, 2009; Brun et al., 2009), control engineering techniques have not been widely adopted in software engineering practice (Zhu et al., 2009; Kokar et al., 1999). To effectively apply control engineering techniques to software performance

T. Patikirikorala et al. / The Journal of Systems and Software 85 (2012) 2678–2696

management via parameter tuning in particular, a number of challenges need to be addressed (Brun et al., 2009; Zhu et al., 2009; Hellerstein, 2004). They include the inherent nonlinear behavior of software systems, the need for rigorous mathematical analysis in control system design, and the consequent difficulties in developing accurate models to characterize the system behavior in order to design a controller to manage the system performance at runtime. Building an accurate model to capture the behavior of a software system has proved to be problematic. First, a complex software system is typically composed of a number of layers, including hardware, operating systems, execution platforms and the application itself. The system performance at the application level is affected by the dynamic and unpredictable behavior of these underlying layers (e.g., garbage collection, compiler optimizations and memory management). Second, the workload conditions at the application level, which are also highly dynamic and unpredictable, affect the system performance as well. For instance, an e-commerce system may face sudden intensive workloads when promotional offers are run or when referenced by a high-traffic site (the so-called ‘slash-dot’ effect). The workloads may also vary dramatically depending on the time of day (e.g., stock market applications) or the time of year (e.g., tax office sites). Third, if the application evolves due to new feature additions, bug fixes or system configuration changes, the constructed model has to be changed as well for accurate representation. The resultant behavior of the software application therefore depends on a combination of the operating states or conditions of the underlying software layers, the current environmental workloads, and the current configuration of the application architecture.
Depending on what combination of the above factors occurs, the behavior of the software application can be characterized by a number of distinct operating regions. However, the complexity of interactions between these factors means that creating a single model of the system is very difficult if not virtually impossible. Existing research efforts that investigate the use of control engineering approaches for software performance management tend to model the system behavior by focusing on a particular operating region under certain operating conditions that can be characterized using a single linear model. Such linearization is inherently problematic because a software system has to work in a spectrum of operating regions with changing conditions and unmodeled system dynamics. In these different operating regions, different control regimes may be needed to achieve the performance objectives effectively. This observation is further confirmed by various studies in the existing literature. For instance, two linear models were implemented in Karlsson et al. (2005b) for cases where data is retrieved from, respectively, cache or disk. The authors have pointed out that these two models were significantly different, and consequently designing a controller to satisfy both conditions was difficult. They further mentioned that a single fixed model cannot handle dynamically changing concurrent workloads, different operating conditions or application configuration changes. Similar arguments are made in Lu et al. (2002) for the case of a cache server system. Lu et al. (2006) implemented a control theoretic approach to achieve absolute and relative delay guarantees for Apache web servers. In their experiments, three different models for three different workload conditions were constructed. However, the control system was designed using a model generated from only one workload condition. Such design decisions may lead to performance issues when the other two workload conditions exist.
Similarly, the authors of Zhu et al. (2006) and Wang et al. (2005) discussed the bimodal (under-loaded and overloaded) characteristic in time-varying workloads of data centers, and concluded that an intelligent switching control approach is needed for these conditions. We refer to the nature of a software system needing to operate across multiple regions and therefore needing multiple models to


characterize the system behavior as the multi-region or multi-model characteristic of the system. To capture the multi-model characteristic of a software system, the possible solutions could be: (i) design multiple static models that can capture and cope with the system behavior in different operating regions; (ii) develop models and algorithms that can adapt and learn at runtime; or (iii) some combination of (i) and (ii). After capturing the behavior, multiple controllers have to be designed and integrated into the control loop or system, including the ability to detect the change of the operating regions and to switch between the appropriate controllers at runtime. These solutions demand control (decision making) systems that can dynamically reconfigure themselves and select the suitable model and controller. Furthermore, such runtime reconfigurations should achieve the desired performance objectives while maintaining the stability of the system (Salehie and Tahvildari, 2009; Brun et al., 2009). In this paper we call such control systems self-managing control systems. A self-managing control system provides a high level of adaptive capability, because it reconfigures the structure of the control system (or the decision making component) at runtime, according to the changes in operating regions. In this paper, we propose to use a control engineering approach, called Multi-Model Switching and Tuning (MMST) adaptive control, as the basis to implement and integrate self-managing control systems into software systems. This approach uses multiple system models and controllers, and automates the required parameter tuning and model/controller switching tasks to achieve the desired performance management in a way that takes account of the different operating regions of a software system.
The designed self-managing control system autonomously detects the change of operating regions and then selects the most suitable controller to provide control decisions in that particular operating region at runtime. With the use of multiple system models and controllers, it also offers greater design flexibility than single model approaches. To assess its effectiveness in achieving performance management for software systems, we have conducted a range of experiments by applying two MMST adaptive control schemes to a representative software system with the multi-region characteristic. The performance of these two MMST adaptive control schemes is compared to that of other single model control regimes under different operating conditions. The impact of configuration parameters for the MMST self-managing control schemes on the overall performance of the system is also investigated, providing guidance to the control system design. A reference model and a class library based on it are also introduced to support the implementation of MMST adaptive control schemes, which can be used by software engineers in developing self-managing control systems. The remainder of the paper is organized as follows. Section 2 provides a background of control engineering techniques and the related work. In Section 3, we describe a motivating scenario and prototype software system which will be used in the rest of this paper. After giving an overview of MMST adaptive control schemes in Section 4, we introduce a reference model and tool in Section 5 to aid the implementation of such MMST adaptive control schemes for a software system. Using the tool support, the design and implementation details of two self-managing control systems are covered in Section 6. The experiment results are presented in Section 7. Finally, a discussion, threats to validity and conclusions with an outline of challenges and future work are provided in Sections 8, 9 and 10, respectively.

2. Background and related work

In Section 2.1, we provide a background to different control system configurations that have been proposed in existing literature for performance management in software systems. Given that background, Section 2.2 presents the related work.

Fig. 1. Block diagrams of different feedback control schemes: (a) control system; (b) self-tuning adaptive control system; (c) gain scheduling control system; (d) reconfiguring control system.

2.1. Background

Feedback control system: Fig. 1(a) shows a block diagram of a feedback control system. The software system controlled by the controller is referred to as the target system. The target system provides a set of performance metrics as properties of interest (e.g., response time) referred to as measured output or simply output. Sensors monitor the output of the target system, while the control input (e.g., resource allocation) can be adjusted through an actuator to change the behavior of the system. The controller is the decision-making unit of the control system. The main objective of the controller is to maintain the output of the system sufficiently close to the desired value, by adjusting the input. This desired value is translated in control system terms as the set point signal, which gives the option for the control system designer to specify the goal or value of the output that has to be maintained at runtime. Apart from the main objective, several other performance objectives such as the stability, settling time, overshooting and steady state behavior have to be achieved as well (interested readers are referred to Hellerstein et al., 2004; Ogata, 2001). These performance objectives have to be considered when tuning the controller.

Fixed-gain control: The structure of a fixed-gain control scheme is shown in Fig. 1(a). The fixed gain controller design generally consists of two main steps. Firstly, a formal relationship between the input and output has to be constructed. In control theory this relationship is referred to as the behavioral model of the target software system. System identification (SID) is a widely used method to construct the model of the system. A SID experiment is conducted offline by feeding a specially designed input signal to the system and measuring output data for a sufficient period of time. The measurements of input and output data are then used to estimate the model (Lennart, 1997). Typically, autoregressive exogenous input (ARX) models are used to describe the behavioral model of the system (Hellerstein et al., 2004; Lennart, 1997). The standard form of the ARX model is as follows:

\[ y(k) = \sum_{i=0}^{n} a_i \, y(k-i) + \sum_{j=0}^{m} b_j \, u(k-d-j) \qquad (1) \]

where n and m are the orders of the model, a_i and b_j are the parameters of the model, d is the delay (time intervals taken to observe a change of input in the output) and k stands for the current sample instance. The order and other parameters of this model are derived using linear regression techniques (Lennart, 1997). Secondly, using the model created in the model identification phase, the controller is designed with the objective to reduce the variation between the desired set point and the actual output signal. A sufficiently accurate model is required to design a controller that achieves the control objectives. Although there will be model inaccuracies and uncertainties, a good feedback control design can manage the performance effectively (Hellerstein et al., 2004). The different variations of the proportional integral (PI) controller are widely used as fixed gain controllers due to their robustness against modeling errors, disturbance rejection capabilities and simplicity (Dorf and Bishop, 2000). The control algorithm of the PI controller is shown in Eqs. (8) and (9). Kp and Ki are the parameters of the controller called gains. These gains are derived offline to achieve the desired performance specifications, but remain fixed at runtime (hence the name fixed gain controller). Fixed gain control design has a number of advantages. The fixed gain controllers are useful to automate parameter tuning tasks of the software system and to achieve performance goals with less human intervention. In addition, such controllers are relatively easy to design and formal techniques exist for stability analysis and tuning of the controller. They can deliver the desired performance to a limited extent under varying workloads and changing operating conditions. However, fixed-gain controllers have some limitations. The model of the system often captures the dynamics in a particular operating region under certain operating conditions that can be


characterized using a single linear model (Hellerstein et al., 2004). This region is selected around the desired value of the output and called the nominal operating region. However, a target system will usually exhibit complex behavior and behave differently under different operating conditions. If there are a number of dimensions across which operating conditions can vary, constructing a single model and selecting gains to satisfy all operating conditions can be difficult (Karlsson et al., 2005b). Consequently, under dynamic and unpredictable variations, the performance of a fixed-gain controller can degrade because the control algorithm and the gains remain unchanged at runtime. Thus, the single fixed gain controller alone cannot provide an effective solution to a software system operating in multiple regions.

Self-tuning adaptive control: Self-tuning adaptive control addresses some of the limitations of fixed gain controllers by dynamically estimating the model parameters and gains of the controller to achieve the high-level design objectives. As shown in Fig. 1(b), adaptive controllers have a parameter adjustment loop, which derives these required parameters at runtime (Åström and Wittenmark, 1995). The parameters of the target system’s behavioral model are estimated by the Estimation component, while the Controller design component uses these estimated model parameters and high-level control objectives provided by the designer to compute the gains of the controller. The self-tuning regulators (STR) (Hellerstein et al., 2004; Åström and Wittenmark, 1995) have been applied as an adaptive control scheme in software systems. There are two types of STR designs. The indirect-STR uses the estimations of the system model to derive the controller parameters. In contrast, direct-STR reformulates the model estimation algorithm to compute controller parameters directly (Åström and Wittenmark, 1995; Karlsson et al., 2005b).
In this sense adaptive control captures the behavior in multiple operating regions of the target system. However, a basic assumption of adaptive control is that the model parameters remain constant or vary slowly over time (Åström and Wittenmark, 1995; Narendra and Driollet, 2000). This means adaptive control does not cope well with rapid or large changes in operating conditions (Zhu et al., 2009; Narendra and Balakrishnan, 1993). Fast changing conditions can be seen in software systems, such as sudden workload spikes, ‘slash-dot’ effects, component failures and the garbage collection process. Adaptive control also has other limitations, such as computational cost due to online estimation and design. The startup performance may not be satisfactory since it takes time to come up with the estimations. Furthermore, these methods require the input signal to contain a sufficient range of frequencies to excite the system (a so-called persistently exciting condition) for fast and accurate model estimation (Karlsson et al., 2005b; Åström and Wittenmark, 1995).

Gain scheduling: Gain scheduling is also regarded as an adaptive control mechanism in Åström and Wittenmark (1995). Fig. 1(c) shows the block diagram of a gain scheduling control system. The operating regions or states are decided based on the scheduling variables exposed by the target system. For instance, the request arrival rate can be used as the scheduling variable and rules can be formed to describe the operating regions and the controller gains for a particular operating region (Hellerstein et al., 2004). Then, these rules are implemented in the gain scheduling component. At runtime when the rules are satisfied the relevant controller gains are updated in the controller by the gain scheduling component. In contrast to adaptive control, gain scheduling does not have a model estimation component. Instead, it uses a predefined logic/rule based evaluation to change the controller online.
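The rule-based gain update described above can be sketched as follows. This is a minimal illustration only: the `PIController` class, the `RULES` table, the arrival-rate thresholds and the gain values are all hypothetical, and the PI law is the textbook positional form, which is not necessarily the form given in Eqs. (8) and (9).

```python
from dataclasses import dataclass


@dataclass
class PIController:
    """Positional-form PI controller: u(k) = Kp*e(k) + Ki*sum(e(0..k))."""
    kp: float
    ki: float
    integral: float = 0.0

    def step(self, set_point: float, measured: float) -> float:
        error = set_point - measured
        self.integral += error
        return self.kp * error + self.ki * self.integral


# Hypothetical scheduling rules: (upper bound on arrival rate in req/s, Kp, Ki).
# The thresholds and gains are made up for illustration only.
RULES = [
    (50.0, 0.40, 0.20),          # light load
    (150.0, 0.25, 0.10),         # moderate load
    (float("inf"), 0.10, 0.05),  # heavy load
]


def schedule_gains(controller: PIController, arrival_rate: float) -> None:
    """Overwrite the controller gains with those of the first matching rule."""
    for upper_bound, kp, ki in RULES:
        if arrival_rate <= upper_bound:
            controller.kp, controller.ki = kp, ki
            return


controller = PIController(kp=0.40, ki=0.20)
schedule_gains(controller, arrival_rate=200.0)    # heavy-load rule fires
u = controller.step(set_point=1.0, measured=0.5)  # control input with new gains
```

In this sketch the scheduler only looks up gains and never estimates a model. Note that overwriting the gains of a positional-form controller while its integral state is non-zero can cause a jump in the control input, so a practical design would likely need some form of bumpless transfer.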
Because gain scheduling performs no online model estimation, its computational load may be lower than that of adaptive control. Furthermore, compared to fixed gain control, some design flexibility can be achieved by changing the gains of the controller depending on the operating regions of the target system. However, an issue with this technique is that the target


system has to provide the required useful scheduling variables. The performance of a software system is influenced by complex interactions between different factors (e.g., workload arrival rates and the CPU usage), and their relationship to different operating regions makes it difficult to establish reliable heuristics, rules and thresholds that can determine which gains are appropriate at any given time. In addition, there are no systematic or well defined ways to implement the scheduling logic or rules (Hellerstein et al., 2004). Moreover, the stability of the implemented system is hard to guarantee.

Reconfiguring control: The adaptive control schemes provide more flexibility compared to the non-adaptive scheme by adjusting the controller parameters online. However, the controller algorithm and the organization of the components in the loop stay fixed over time (Kokar et al., 1999). For different operating conditions and disturbances different control algorithms or loop organizations may provide better control (Solomon et al., 2007). A reconfiguring-control scheme is a conceptual approach with the main idea to change the control algorithms, models and architecture of the control system to deal with the changing operating regions of the target system. Fig. 1(d) illustrates the layered architecture of reconfiguration control. The control layer consists of the control system (including the controller) which provides the control in the current time instance. The responsibility of the reconfiguration layer is to reconfigure the architecture of the control layer so that the control objectives of the target system can be achieved under requirement or environmental changes. This approach is useful to provide control under multiple operating regions of the software system by selecting appropriate control loop configurations at runtime.
For instance, multiple fixed gain controllers can be integrated into this scheme with a mechanism for selecting the most suitable one at runtime. However, there are tradeoffs between the design complexity and runtime overhead on the system due to the additional reconfiguration layer (Hellerstein et al., 2004). In addition, to come up with different control schemes prior information about the system and environmental conditions may be required. Chattering is another issue that can occur in reconfiguring control, i.e., the system frequently changes between controllers or different loop configurations without providing desired control. This could lead to drastic performance degradations (Kokar et al., 1999). Any reconfiguring-control approach needs to have mechanisms to address these challenges. 2.2. Related work There have been a number of research efforts that have used the control engineering techniques presented in Section 2.1 to manage performance in software systems. They include the use of feedback control techniques to manage web server systems (Hellerstein et al., 2004; Lu et al., 2006; Wang et al., 2005; Zhu et al., 2006; Gandhi et al., 2002; Diao et al., 2004; Abdelzaher and Bhatti, 1999; Liu et al., 2005), cache/storage systems (Karlsson et al., 2005b,a; Lu et al., 2002), data centers/server clusters (Dutreilh et al., 2010; Kusic and Kandasamy, 2006; Padala et al., 2009) and multi-client class systems (Lu et al., 2006; Kusic and Kandasamy, 2006; Padala et al., 2009). Many of these existing approaches (e.g., Hellerstein et al., 2004; Lu et al., 2006; Wang et al., 2005; Zhu et al., 2006; Gandhi et al., 2002; Diao et al., 2004; Abdelzaher and Bhatti, 1999) have used fixed gain PI controllers in their design to manage different QoS management issues. There are a number of approaches that have used adaptive control. Lu et al. have used indirect-STR in the differential caching problem (Lu et al., 2002). 
They point out that adaptive controllers provide design flexibility, and due to online model estimation these controllers can be ported to different technical environments without re-modeling overhead. Karlsson et al. (2005b) have built a direct-STR to regulate throughput of


different workloads in storage systems. Other uses of adaptive control in software system design include Liu et al. (2005), Padala et al. (2009) and Karlsson et al. (2005a). Gain scheduling has been adopted for the performance management of software systems in Hellerstein et al. (2004). There are only a few approaches that realize reconfiguration control techniques. The self-controlling software approach proposed by Kokar et al. (1999) discusses the concept of a reconfiguration loop. Some other architectural approaches were proposed in Solomon et al. (2007) and Goel et al. (1998) to realize the concept of reconfiguration control for software systems. However, these approaches concentrate on state management when the components of the loop are removed or replaced. In addition, the reconfiguration is triggered using ad hoc rules, which could cause design issues while formulating the rules when there are a large number of interacting operating regions. Furthermore, they do not provide stability guarantees or offer a systematic or generalized design process, both of which are vital to the control system design. Apart from the above control engineering techniques which primarily support the parameter tuning tasks, architectural reconfiguration techniques are also proposed in Garlan et al. (2004), Cheng (2008), Sykes et al. (2008), Magee and Kramer (1996) and the references therein. Cheng (2008) implemented an architectural approach to trade off multiple conflicting objectives and to select strategies to reconfigure the architecture of a software system at runtime. A three layer architectural reconfiguration approach was proposed in Sykes et al. (2008) to achieve adaptation goals of software systems. These approaches change the architecture of the software system to achieve the objectives under changing conditions rather than supporting parameter tuning objectives.
In addition, the architecture of the runtime decision making unit (control system) is predefined and does not change at runtime. In summary, the above single model approaches that use fixed gain or adaptive control engineering schemes do not sufficiently capture the behavior of multiple operating regions of software systems, and consequently cannot provide an adequate solution to their performance management due to the many limitations discussed in Section 2.1. There is a clear need for effective ways to integrate multiple models and controllers into a software system and achieve the desired system objectives through their selection and self-management depending on its operating region at runtime. Such integration should provide accurate, fast and robust decisions under varying conditions, without adversely affecting the stability of the system. In this work, we investigate the use of a reconfiguring non-linear feedback control approach, called the Multi-Model Switching and Tuning (MMST) adaptive control, in the performance management of software systems. The MMST adaptive control is a control scheme that has capabilities to integrate multiple models and controllers into the control system. It has been applied in domains such as robotic manipulators, process control and flight control systems (Narendra and Driollet, 2000; Karimi and Landau, 2000; Solmaz et al., 2006). It is an extension of the adaptive control and reconfiguration control schemes, and has the potential to improve the performance management of the systems showing multi-region characteristics.
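To give a feel for what selecting among multiple models at runtime can look like, the following is a minimal sketch under stated assumptions. It is not the MMST schemes evaluated in this paper: the model bank `MODELS`, its parameter values, the `ModelSelector` class, the error window and the dwell time are all hypothetical. The sketch scores each fixed first-order model by its recent squared prediction error and switches to the best one only after a minimum dwell time, as one simple guard against chattering.

```python
from collections import deque

# Hypothetical bank of first-order models y(k) = a*y(k-1) + b*u(k-1),
# one per operating region; the parameter values are made up.
MODELS = {
    "under_loaded": (0.30, 0.80),
    "overloaded": (0.85, 0.20),
}


class ModelSelector:
    """Pick the model with the smallest recent squared prediction error."""

    def __init__(self, window: int = 5, min_dwell: int = 3):
        self.errors = {name: deque(maxlen=window) for name in MODELS}
        self.min_dwell = min_dwell          # samples to wait between switches
        self.active = next(iter(MODELS))    # start with the first model
        self.since_switch = 0

    def update(self, y_prev: float, u_prev: float, y_now: float) -> str:
        # Score every model on how well it predicted the newest sample.
        for name, (a, b) in MODELS.items():
            prediction = a * y_prev + b * u_prev
            self.errors[name].append((y_now - prediction) ** 2)
        self.since_switch += 1
        best = min(MODELS, key=lambda name: sum(self.errors[name]))
        # Switch only after the dwell time has elapsed, to limit chattering.
        if best != self.active and self.since_switch >= self.min_dwell:
            self.active = best
            self.since_switch = 0
        return self.active


# Data generated with the "overloaded" parameters, so that model should win.
selector = ModelSelector()
a_true, b_true = MODELS["overloaded"]
y, u = 1.0, 1.0
for _ in range(10):
    y_next = a_true * y + b_true * u
    chosen = selector.update(y, u, y_next)
    y = y_next
```

In a full design each model would be paired with its own controller, and the selected name would decide which controller produces the next control input. The window length and dwell time trade responsiveness against the risk of frequent switching.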

3. Motivating scenario To highlight the existence of multiple operating regions in software systems, let us consider a travel (flight) reservation system or server that serves different travel agents (see Fig. 2). The travel agent websites provide functionalities to their clients to check or book flights online. The travel reservation system implements the business logic needed by agents depending on the functional and QoS requirements. The travel reservation system also depends on

a 3rd party supplier who holds the up-to-date travel reservation information/data. When a request is generated using the website by a client of a travel agent, it will reach the travel reservation system. After executing the required business logic, the travel reservation system will invoke the necessary services of the 3rd party supplier to respond to that request. However, the 3rd party supplier has limited resources and can only provide a limited number of invocation sessions to the travel reservation system. For simplicity and without compromising its generality for our purpose, we assume that two client classes A and B (corresponding to travel agents A and B) are interested in the services provided by this server and the number of sessions provided by the 3rd party supplier is 20. The objective of the server is to allocate the 20 sessions between agents A and B, and achieve the response times for the requests relative to the priority levels (or values) of the agent, under varying workloads. These relative importance levels are determined according to the response time requirements specified in the service level agreements. Additionally, there is also a constraint or policy (declared in the service level agreement) regarding the minimum number of sessions that have to be maintained for a specific client class to avoid starvation of resource (a minimum of 4 sessions in this case). A prototype of such a reservation system was developed implementing the architecture shown in Fig. 2. The client can access the services of the reservation system by connecting to the server socket. After a connection is made, the clients can send different messages invoking different service methods. When a message is received at the message queue, a time stamp (t1 ) is applied and then the request is classified according to the client class and put to the relevant client queue. 
The scheduler accesses these queues in a first-in first-out (FIFO) fashion, and assigns these messages to a virtual application instance with client-specific method pointers and virtually partitioned resources (i.e., session handlers) to be sent to the 3rd party supplier. When the server receives the response from the 3rd party supplier, it is sent back to the client through the socket. Another time stamp (t2) is applied before the response is written to the client socket.

In this work we design a relative guarantee feedback control system to allocate sessions depending on the varying workloads of A and B, so that the response time ratio can be maintained within an acceptable range depending on the relative importance of the clients. The response time of a single request is the time difference between t2 and t1, measured in seconds. The average response time of each workload over a 2 s sampling window is used as the measured output, as a trade-off between monitoring overhead and reactivity to sudden disturbances. Workload intensity is measured in requests per second (req/s). Let us denote the response times of workloads A and B as Ra and Rb, respectively, and the session allocations of workloads A and B as Sa and Sb, where Sa + Sb = 20. To incorporate the relative importance between client classes in a relative guarantee scheme, the control input is the ratio of session allocations, Sa/Sb, and the output variable is the ratio of the average response times of the workloads, Rb/Ra. For notational simplicity let us denote Sa/Sb and Rb/Ra as u and y, respectively. The control objective of this control system is to maintain a constant response time ratio between the two workloads (A and B) depending on the relative importance of these workloads. The costs, penalties and response time requirements specified in SLAs can be used to decide this relative importance.
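As an illustration of how the measured output could be formed, the following minimal Python sketch computes y = Rb/Ra from the (t1, t2) time stamps collected in one sampling window. The function name and the sample time stamps are hypothetical; the sketch also assumes that each class completed at least one request in the window, which the text does not discuss.

```python
from statistics import mean


def response_time_ratio(stamps_a, stamps_b):
    """Return y = Rb / Ra for one sampling window.

    Each argument is a list of (t1, t2) time stamps in seconds for the
    requests of one client class that completed inside the window.
    """
    ra = mean(t2 - t1 for t1, t2 in stamps_a)  # average response time of class A
    rb = mean(t2 - t1 for t1, t2 in stamps_b)  # average response time of class B
    return rb / ra


# Hypothetical time stamps gathered during one 2 s window.
window_a = [(0.00, 0.40), (0.50, 1.10)]
window_b = [(0.10, 1.10), (0.90, 1.90)]

y = response_time_ratio(window_a, window_b)  # measured output Rb/Ra
u = 12 / 8                                   # control input Sa/Sb currently applied
```

With these numbers class B waits about twice as long as class A, so a set point on y expresses how much slower B is allowed to be relative to A.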
In addition to the main control objective, the scheduler is constrained by an ad hoc policy/constraint, namely, the minimum number of sessions Sa, Sb ≥ 4. (For more details on the relative guarantee control scheme, see Lu et al., 2006, 2001; Patikirikorala et al., 2011b.) When the above requirements and policies are embedded in the design, the control input Sa/Sb can only take certain discrete values. With Sa = 20 − Sb and Sa, Sb ≥ 4, the possible operating points


Fig. 2. Conceptual structure of the target software system. [Diagram labels: A, B workload generators (typically, customized web sites of different travel agents); request/reply; socket; message queue (FIFO); classifier; queues A and B; scheduler; virtual instances with session handlers; SOAP messages; 3rd party supplier; Tomcat + Axis 2.]

of the system belong to (Sa/Sb = 4/16, 5/15, ..., 15/5, 16/4). In a situation where the workloads of both client classes are similar and they are equally important, both client classes get an equal number of sessions (Sa/Sb = 10/10). We will consider this as the nominal operating point. Furthermore, we will call the operating region where class A gets more resources region A (Sa/Sb = 11/9, 12/8, ..., 15/5, 16/4). Similarly, the operating region where class B gets more resources will be called region B (Sa/Sb = 9/11, 8/12, ..., 5/15, 4/16). Fig. 3 shows the operating points (Sa/Sb). Note that the operating points are unequally spaced. In region A, spacing decreases towards the nominal operating point whereas in region B spacing increases towards the nominal operating point (for example, the step from 15/5 to 16/4 is 1.0, whereas the step from 4/16 to 5/15 is only about 0.08). The objective of maintaining the importance levels according to the relative guarantee scheme leads to such highly discontinuous operating points, which in turn affects the linearity of the system. As a consequence, the performance of a linear controller may degrade significantly when it is operating away from the nominal operating point. The main problem is that with the discontinuous operating points, it is hard to decide the gains for a single fixed gain controller. The gains have to be selected so that the desired performance is achieved when the controller is operating in both regions A and B (see Section 7). When the controller is optimized to operate in region A, performance in region B drastically degrades and vice versa. In reality, the underlying behavior of the software platform also changes due to large workload fluctuations or other factors (e.g., communication delays and garbage collection processes), making the system under consideration hard to model with a single linear model. Consequently, designing a single model and controller will lead to performance issues and loss of design flexibility.
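The unequal spacing can be checked numerically. The short Python sketch below enumerates the feasible allocations and the gaps between neighbouring values of Sa/Sb. The helper `snap_to_allocation`, which maps a requested ratio to the closest feasible allocation, is an assumption made for illustration only; the text does not state how the controller output is mapped to a session allocation.

```python
from fractions import Fraction

TOTAL_SESSIONS = 20  # sessions offered by the 3rd party supplier
MIN_SESSIONS = 4     # minimum allocation per client class


def operating_points(total=TOTAL_SESSIONS, minimum=MIN_SESSIONS):
    """List (Sa, Sb, Sa/Sb) for every allocation with Sa + Sb = total and Sa, Sb >= minimum."""
    return [(sa, total - sa, Fraction(sa, total - sa))
            for sa in range(minimum, total - minimum + 1)]


def snap_to_allocation(u, points):
    """Hypothetical actuator step: pick the feasible (Sa, Sb) whose ratio is closest to u."""
    return min(points, key=lambda p: abs(float(p[2]) - u))[:2]


points = operating_points()
ratios = [float(p[2]) for p in points]
# Gap between each pair of neighbouring operating points, from 4/16 up to 16/4.
gaps = [round(hi - lo, 3) for lo, hi in zip(ratios, ratios[1:])]
print(gaps)
```

The gaps grow monotonically from region B to region A, so a fixed step in u moves the allocation by several sessions in region B but may not change it at all in region A.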
As such, this prototype system exhibits multiple operating regions, demands multiple models to capture its behavior, and therefore calls for adaptive or multi-model based self-managing control schemes. In the following sections, we investigate the effectiveness of the self-managing control systems based on MMST adaptive control. We use the prototype system with different control schemes for resource (session) allocation to compare the performance management of two MMST adaptive control schemes against other approaches under various operating conditions.


4. Adaptive control with Multi-Model Switching and Tuning

In this section, we provide an overview of the adaptive control approach with Multi-Model Switching and Tuning (MMST) (Narendra and Balakrishnan, 1997), which supports various multi-model schemes. MMST adaptive control was proposed by Narendra and Balakrishnan (1993) to improve the transient response of adaptive control systems in the presence of model uncertainties. It is a concept inspired by biological systems (Narendra and Driollet, 2000). Biological systems have the ability to select an appropriate action for a specific situation from a collection of behaviors. MMST uses the same concept by selecting the most suitable controller for the current environment that the system is in. Fig. 4 shows the main components of MMST. The target system has control input u and measured output y. There are n models (M1, M2, ..., Mn) describing the relationship between u and y for different operating conditions, which simultaneously provide estimates of the system output. The estimates from these n models are denoted by ŷ1, ŷ2, ..., ŷn. Similarly, there may be n controllers, each corresponding to a model. Although there are multiple controllers, only a single controller can be

Fig. 3. The operating points and regions. [Diagram: Sa/Sb axis running from 4/16 (region B) through the nominal point 10/10 to 16/4 (region A).]

Fig. 4. Block diagram of MMST adaptive control system.


connected in the control loop to make the control decisions at a given time instance. Thus, the most appropriate model and controller for the system and environment conditions have to be selected to make the control decisions at runtime. The responsibility of the switching algorithm is to select the appropriate model and corresponding controller based on some criteria that will improve the performance of the controlled system. There are multiple switching algorithms discussed in Narendra et al. (2003). All of these algorithms are based on the prediction errors of the models (e = y − ŷ). The prediction error indicates which model best fits the current operating conditions of the system at a given instant. Hence, the integration of this model and the corresponding controller in the control loop should improve the performance of the control system (Narendra et al., 2003). The model evaluation and selection steps of the switching algorithm are summarized as follows:

Model evaluation:

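$$J_i(k) = \underbrace{\alpha e^2(k)}_{\text{Instantaneous component}} + \underbrace{\beta \sum_{r=0}^{k} e^2(r)}_{\text{Long-term component}}, \quad \forall i = 1, 2, \ldots, n \qquad (2)$$

Model selection:

$$J_{\min}(k) = \min\{J_i(k)\}, \quad i = 1, 2, \ldots, n \qquad (3)$$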

k is the time instance. α, β ≥ 0 are parameters that should be carefully decided by the designer. If α > 0, β = 0, only the instantaneous component is utilized. In this case switching may be frequent, leading to performance degradation (Narendra and Balakrishnan, 1993). If α = 0, β > 0, only the long-term component is active, hence the switching may be infrequent, again possibly leading to performance degradation (Narendra and Balakrishnan, 1993). In the model evaluation step, prediction errors are used to calculate the J index of all the models with Eq. (2). In the model selection step, the model that produces the minimum J index (Jmin) is selected and the corresponding controller is integrated into the control loop. Tmin is another parameter, called the waiting time period, which specifies the time that the control system has to wait before selecting the next controller to control the system (Narendra and Balakrishnan, 1993, 1997). Depending on the application characteristics, the designer can decide on the above parameters to achieve the desired performance. The index in Eq. (2) is most suitable in time-invariant environments (i.e., systems that do not depend on absolute time; Lennart, 1997). For time-varying environments, long-term error accumulation will affect the J indexes. In such conditions the performance of the MMST adaptive control systems can be improved by calculating the performance index in a finite window (T ≥ 0) as illustrated in Eq. (4).

$$J_i(k) = \alpha e^2(k) + \beta \sum_{r=k-T+1}^{k} e^2(r), \quad \forall i = 1, 2, \ldots, n \qquad (4)$$
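The evaluation and selection steps can be sketched in a few lines of Python. The class below computes the windowed index of Eq. (4) for each model and selects the model with the minimum index. The class and parameter names are hypothetical, and the handling of Tmin (a switch is allowed only after at least `t_min` samples since the previous switch) is one possible reading of the waiting time period rather than a definitive implementation.

```python
from collections import deque


class SwitchingIndex:
    """Tracks J_i(k) of Eq. (4) for n models and applies a waiting time T_min."""

    def __init__(self, n_models, alpha=0.0, beta=1.0, window=3, t_min=1, start=0):
        self.alpha, self.beta = alpha, beta
        # One buffer of the last `window` squared prediction errors per model.
        self.errors = [deque(maxlen=window) for _ in range(n_models)]
        self.t_min = t_min
        self.active = start       # index of the model/controller currently in the loop
        self.since_switch = 0     # samples elapsed since the last switch

    def update(self, y, predictions):
        """Feed the measured y and every model's prediction; return the active index."""
        indices = []
        for buffer, y_hat in zip(self.errors, predictions):
            buffer.append((y - y_hat) ** 2)
            indices.append(self.alpha * buffer[-1] + self.beta * sum(buffer))
        self.since_switch += 1
        best = min(range(len(indices)), key=indices.__getitem__)
        if best != self.active and self.since_switch >= self.t_min:
            self.active, self.since_switch = best, 0
        return self.active


# Two hypothetical models that always predict 1.0 and 2.0, respectively.
switch = SwitchingIndex(n_models=2, window=3, t_min=2)
measurements = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
trace = [switch.update(y, predictions=[1.0, 2.0]) for y in measurements]
```

In this run the second model is selected only at the fifth sample: with α = 0 the window still contains old errors at the fourth sample, which illustrates how a larger T delays switching.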

The above discussion presents the general concepts of MMST adaptive control. Going further, different types of multi-model schemes have been evaluated and formal stability proofs are provided in Narendra and Balakrishnan (1997) and Narendra and Xiang (2000). These multi-model schemes are as follows.

Type 1: All adaptive models. In this scheme all the system models (Mi, i = 1, 2, ..., n) are estimated online by adaptive identification (estimation) algorithms (Lennart, 1997; Åström and Wittenmark, 1995). The corresponding controllers use the parameter estimates to compute the control input u. This scheme is computationally inefficient. In addition, if the environment remains unchanged for a long time, all the adaptive models will converge to the same parameter neighborhood. Consequently, when a sudden disturbance occurs, all models may not react to it rapidly enough, which reduces the advantage of having multiple models (Narendra and Driollet, 2000).

Type 2: All fixed models. This scheme addresses some of the limitations of the type 1 scheme by integrating fixed models and fixed gain controllers. Fixed gain controllers are generally not regarded as an adaptive technique, but because of the switching capabilities this scheme can be considered an adaptive reconfiguring control technique. However, fixed models can only represent a finite number of operating regions or conditions. As such, this scheme assumes that there is always one model that closely approximates the system behavior. Therefore, to satisfy this assumption and the stability requirements we may have to build a large number of fixed models.

Type 3: One adaptive model and one fixed model. In this scheme, initially the fixed model may be selected since the adaptive model takes time to converge at startup. However, when the adaptive model converges, it will often outperform the fixed model. This scheme is simple and addresses some of the limitations of the above two schemes. However, it still suffers from the limitations of adaptive control that exist under fast changing conditions. Consequently, in a volatile environment there may not be much improvement in performance compared to control based on a single adaptive model.

Type 4: Adaptive models and fixed models. Different types of schemes can be formulated by combining adaptive and fixed models. Two main configurations are discussed in Narendra and Driollet (2000) and Narendra and Balakrishnan (1997). The first configuration includes n − 1 fixed models and one adaptive model.
From prior knowledge of the system's operating and environment conditions, n − 1 fixed models can be designed. Then, the adaptive model is run free of interference to capture the system dynamics that are not captured by the fixed models. On the other hand, the second configuration includes another adaptive model, i.e., it involves n − 2 fixed models and 2 adaptive models. The second adaptive model is re-initialized with the parameters of the best model at the current time instance. The main purpose of this adaptive model is that after re-initialization it may converge quickly to the new model parameters, so that the transient responses may be improved under sudden disturbances. To achieve effective performance under these two configurations, the design of the fixed models has to be done after carefully analyzing any available prior knowledge about the system and its environment.

The above discussion provides the objectives, features and some limitations of different MMST adaptive control schemes. To characterize MMST adaptive control in terms of the discussion in Section 2.1, it is an indirect adaptive control scheme since it depends on the model selection or estimation before providing the control decisions. Furthermore, it is a form of reconfiguring control because MMST adaptive control changes the models, controllers, components and the architecture of the control system at runtime, depending on the changing conditions. In contrast to the logic or rule based reconfiguration control techniques, where the appropriate variable selection and rule formation are open problems, MMST adaptive control provides a more rigorous design technique as well as switching algorithms to aid the implementation. More importantly, it provides the ability to represent the software system with multiple models while providing formal stability proofs. Furthermore, it can self-manage the control system without any human intervention, reducing the system administration effort.
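To make the notion of an adaptive model concrete, the following Python sketch estimates a first-order model y(k+1) = a·y(k) + b·u(k) online with the standard recursive least squares update and a forgetting factor. It also offers a `reinitialize` hook of the kind the second type-4 configuration relies on. The class name, the default forgetting factor and the initial covariance `p0` are illustrative assumptions, not values prescribed by the MMST literature cited above.

```python
class AdaptiveFirstOrderModel:
    """Recursive least squares estimate of y(k+1) = a*y(k) + b*u(k)."""

    def __init__(self, a=0.0, b=0.0, forgetting=0.94, p0=1000.0):
        self.theta = [a, b]                  # current estimates of (a, b)
        self.lam = forgetting                # forgetting factor, 0 < lam <= 1
        self.P = [[p0, 0.0], [0.0, p0]]      # covariance of the estimates

    def predict(self, y, u):
        """One-step-ahead prediction of the next output."""
        return self.theta[0] * y + self.theta[1] * u

    def update(self, y_prev, u_prev, y_now):
        """Refine (a, b) with one new sample; return the prediction error."""
        phi = [y_prev, u_prev]
        P = self.P
        p_phi = [P[0][0] * phi[0] + P[0][1] * phi[1],
                 P[1][0] * phi[0] + P[1][1] * phi[1]]
        denom = self.lam + phi[0] * p_phi[0] + phi[1] * p_phi[1]
        gain = [p_phi[0] / denom, p_phi[1] / denom]
        error = y_now - self.predict(y_prev, u_prev)
        self.theta = [self.theta[0] + gain[0] * error,
                      self.theta[1] + gain[1] * error]
        # P <- (P - K * phi^T * P) / lam; phi^T * P equals p_phi because P is symmetric.
        self.P = [[(P[i][j] - gain[i] * p_phi[j]) / self.lam for j in range(2)]
                  for i in range(2)]
        return error

    def reinitialize(self, a, b, p0=1000.0):
        """Restart the estimation from the parameters of another (e.g. the best) model."""
        self.theta = [a, b]
        self.P = [[p0, 0.0], [0.0, p0]]
```

A free running adaptive model would only ever call `update`, whereas a re-initialized adaptive model would additionally call `reinitialize` whenever a different model becomes the best one.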
Interested readers are referred to Narendra and Balakrishnan (1997) and Narendra et al. (2003) for the details of simulation studies and the stability proofs of these multi-model schemes in


Fig. 5. The reference model. [Class diagram: Controller (+CalculateControlInput(), +Reintialize()); Model (+PredictOneStepAhead(), +GetCurrentPerdictionError()); TuningParameterContainer (+Reset(), +Change()); SwitchingBox (+EvaluateModelPredictions(), +ReconfigureControlLoop()); ModelControlPair (+Model, +Controller); MMSTSwitchingBoxT1, MMSTSwitchingBoxT2, MMSTSwitchingBoxT3 and MMSTSwitchingBoxT4; associations are drawn with multiplicity 1.]

guaranteeing that the system will not be unstable due to the switching and tuning behavior.

Analysis: The main questions arising from the above schemes are: (i) which scheme to use, (ii) how many fixed/adaptive models to use and how to identify them and (iii) how to configure the switching algorithm. Answers to these questions are application or requirement dependent. At design time, suitable models and controllers have to be formulated, depending on the available knowledge about the operating conditions, system behavior and physical analysis (Narendra and Driollet, 2000). However, precise knowledge is not required, because the models are only approximations of the system dynamics and are never 100% accurate. Considering the characteristics of software systems, type-2 and type-4 schemes may be most suitable. For systems where some prior knowledge is available about the operating conditions, different fixed models can be approximated and integrated utilizing the MMST type-2 scheme. For instance, two models can be designed to represent regions A and B in the scenario of Section 3. A two-model design may also be suitable, for example, in an e-commerce system where one model captures the system behavior during normal operating conditions while another model captures the behavior during high workloads due to promotional offers. The MMST type-2 scheme may also be useful in cases where cache and disk-based models need to be integrated (Karlsson et al., 2005b), or when integrating models representing three workload conditions as in Lu et al. (2006). If there is little or no prior knowledge about the system, or the system frequently changes behavior, adaptive models need to be integrated with fixed models using the MMST type-4 scheme. These could be systems that frequently change due to new feature additions, bug fixes and other unknown environmental conditions.
The number of fixed models depends on the knowledge of the conditions that can be used and simulated to design the system identification experiments. However, if there is little prior knowledge, fixed models can be uniformly spread through the model parameter space as proposed in Narendra et al. (1995). It is also recommended to start with a small number of models and include more models depending on the performance observed. Since software systems are typically nonlinear and time-varying (Hellerstein et al., 2004), the switching scheme in Eq. (4) may be most suitable, with α = 0 and β = 1 to achieve predictable and consistent switching. It is recommended to set T to a small value if the conditions are fast varying, in order to avoid instabilities.

5. Reference model

In this section a reference model to design and develop MMST based self-managing control systems for software systems is provided. Fig. 5 shows the reference model at a high level of abstraction. This reference model can be used in a number of ways


to design self-managing control systems. Below, we first discuss the structure of the reference model. Then we show how this reference model can be used to integrate different control algorithms and switching schemes, before presenting the realization of this reference model as an extendable class library. The SwitchingBox class is an abstract class which is extended to implement the switching and selection algorithms. As shown in Fig. 5, among its many methods, EvaluateModelPredictions and ReconfigureControlLoop implement the various switching algorithms and reconfiguration techniques. The SwitchingBox class requires a collection of model-controller pairs representing the different operating conditions. The ModelControlPair serves as a container for a model and the corresponding controller. The Model class is the super class of the different models that can be integrated into this reference model. For instance, first order, high-order and adaptive models are specializations of this class. It provides facilities to calculate the look-ahead predictions and prediction errors, which are vital for the switching schemes. The Controller class is the super class of all the controllers that need to be integrated into this reference model. For instance, fixed-gain controllers and STRs are child classes of this class. Each child class must implement the CalculateControlInput method according to its control law. Finally, the TuningParameterContainer class abstracts away the different tuning parameters that need to be provided to the controller. Controllers that need different tuning parameters need to extend this class. As shown at the bottom of Fig. 5, the MMSTSwitchingBoxT2 class extends the SwitchingBox class and implements general functionalities which are either extended or overridden by the T1, T3 and T4 classes to achieve the required objectives of the four MMST schemes presented in Section 4.
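The structure described above can be mirrored in a compact Python skeleton. This is only a sketch of the relationships in Fig. 5: the method names are Python-style renderings of those in the figure, the signatures and the `FirstOrderModel` example class are assumptions, and the code is not the API of the Java/C# class library.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class Model(ABC):
    """Super class of all models; predicts y(k+1) from y(k) and u(k)."""

    @abstractmethod
    def predict_one_step_ahead(self, y, u):
        ...

    def prediction_error(self, y_measured, y_predicted):
        return y_measured - y_predicted


class FirstOrderModel(Model):
    """Fixed first-order model y(k+1) = a*y(k) + b*u(k)."""

    def __init__(self, a, b):
        self.a, self.b = a, b

    def predict_one_step_ahead(self, y, u):
        return self.a * y + self.b * u


class Controller(ABC):
    """Super class of all controllers; child classes supply the control law."""

    @abstractmethod
    def calculate_control_input(self, error):
        ...


@dataclass
class ModelControlPair:
    """Container for a model and its corresponding controller."""
    model: Model
    controller: Controller


class SwitchingBox(ABC):
    """Holds the model-controller pairs and reconfigures the loop on request."""

    def __init__(self, pairs, start=0):
        self.pairs = list(pairs)
        self.active = start

    @abstractmethod
    def evaluate_model_predictions(self, y_prev, u_prev, y_measured):
        """Return the index of the pair whose model currently fits best."""

    def reconfigure_control_loop(self, index):
        """Place the selected pair in the loop and return its controller."""
        self.active = index
        return self.pairs[index].controller
```

A concrete switching box for one of the MMST schemes would subclass `SwitchingBox` and implement `evaluate_model_predictions` with the performance index of Section 4.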
Tool support: After the design of the MMST control scheme, which includes the design of the models, controllers and parameters of the MMST scheme, the next step is to implement the multi-model self-managing control scheme as a software component and then integrate it into the target software system. However, implementing such schemes from scratch takes a large engineering effort in coding and testing. In addition, such implementations currently require the practitioner to have a sound knowledge of control engineering and its mathematical foundations. Consequently, tool support is essential to reduce the knowledge barriers and costs of developing such multi-model self-managing control schemes. We have implemented a class library conforming to the reference model, to support the implementation and integration of multi-model self-managing control schemes. The class library provides off-the-shelf implementations to build MMST adaptive control schemes and other types of self-managing control systems (e.g., gain scheduling). In addition, it provides a rich library of standard control algorithms including

• PID controller, which can be configured as 6 different types of controllers (algorithm: Ogata, 2001; Hellerstein et al., 2004),
• Model predictive controller (MPC) (algorithm: Wang, 2009),
• Indirect self-tuning regulator (algorithm: Åström and Wittenmark, 1995),
• Self-tuning PID controller (algorithm: Hellerstein et al., 2004).

First order, higher order and adaptive models (implementing the recursive least squares algorithm; Lennart, 1997; Åström and Wittenmark, 1995) are also available. In particular, the library also includes all four MMST schemes with configurable switching algorithms. Consequently, an engineer can use this class library to select and implement a self-managing control system into the


software system under consideration, without implementing such schemes from scratch. Although the selection of appropriate control schemes and configuration parameters depends on the system concerned, a substantial engineering cost is saved because this library consists of control scheme implementations which are configurable, extensible and well tested. This was further validated by an empirical study conducted with a group of software engineers, which showed a reduction of 146 lines of code on average when an MMST type 2 control system was implemented with the use of tool support compared to implementing it from scratch. More details about the empirical study can be found in Patikirikorala et al. (2011a). This class library can be extended to integrate different types of controllers. For example, if a linear quadratic controller (Hellerstein et al., 2004) needs to be integrated into a software system, the controller can be implemented by extending the class library. Similarly, if nonlinear models (e.g., neural networks) are needed, the same approach can be taken to enrich the class library. The advantage of this class library implementation is that the extensions can be easily integrated, without affecting the rest of the implementation provided. The class library is currently implemented using Java and C#.Net (available from http://www.ict.swin.edu.au/personal/tpatikirikorala/Downloads). It can readily be integrated into a software system implemented in Java or any of the programming languages supported by the .Net framework.

6. Implementation of MMST adaptive control schemes

To quantify and assess the applicability and effectiveness of applying the MMST adaptive control schemes (types 2 and 4 in particular) to software systems, in this section we provide a systematic process to design, implement and integrate the MMST schemes using the prototype system introduced in Section 3 as a specific case study system.
In Section 6.1, the model identification process to capture the different operating regions will be presented. Utilizing the class library introduced in Section 5, Sections 6.2 and 6.3 provide implementation details of the MMST type 2 (MMST-T2 from here on) and MMST type 4 (MMST-T4 from here on) schemes.

6.1. Model identification process

As pointed out at the end of Section 4, a decision needs to be made on how many models of the system are required. The system has many layers (e.g., hardware, operating system and the 3rd party services) and may face dynamic unpredictable workloads from different client classes. The behavior of the underlying layers and components is hard to approximate because of the lack of knowledge about their regions. However, after the analysis in Section 3, the prototype system has two regions because of the resource demands of the client classes and discontinuous control inputs. We demonstrate that improvements in the performance of the system can be gained with the use of multiple models, in particular using the prototype system where two models are approximated to represent the behavior and dynamics of the two regions A and B of the system. In the model identification process we treat the application layer (including the hardware, operating system and 3rd party services) as a black-box, and consequently the models will approximate the behavior of the overall system with respect to the aforementioned regions.

The general design of a system identification (SID) experiment requires identification of the input–output variables of the system and the operating regions to build the model. Then, suitable workload settings have to be determined that will maintain the output of the system in the particular region of interest after the

application of a selected range of input. The selection of the input signal is important to excite the system sufficiently to capture the behavior in that region. The pseudo binary, sinusoidal and pseudo random signals are well known input signals in SID. Depending on the selected signal, a range of inputs and switching periods have to be determined. The SID experiment is then conducted to gather a sufficient number of input–output data samples. Then, the gathered data is divided into two sets called the estimation set and the test set. The data in the estimation set is fitted to a model such as ARX (Eq. (1)), while the test set is utilized to validate the model (for more on SID experiment design, see Lennart, 1997).

Firstly, for the prototype system we have designed a SID experiment, simulating the workloads that are suitable to capture the dynamics of region A. The control inputs under region A (Sa/Sb = 10/10, 11/9, 12/8, ..., 15/5, 16/4) are used to design a pseudo random signal for the SID experiment. The nominal point (10/10) was also added to this region without loss of generality, to reduce the modeling effort (i.e., to avoid representing a single operating point as a region or model). 30 req/s and 10 req/s were selected to simulate the A and B workloads, respectively, indicating high workload and resource demands for A. Then the system output was monitored by applying a randomly selected control input from the aforementioned set for 4 consecutive sample periods. This experiment was carried out for 600 sample periods and the gathered input–output (u–y) data was used to create the model for region A. The data samples up to the 400th sample period were included in the estimation set and the rest of the data samples formed the test set. A first order ARX model was used to fit the data with sufficient accuracy (which we call model-A).
Although higher order models may provide a better fit we opted to use this first order model for simplicity, to impose less computational demand and to avoid over-fitting (Hellerstein et al., 2004). A similar experiment was carried out in the B region as well. In this case the input signal was constructed by the control inputs of B region, Sa/Sb = 9/11, 8/12, ..., 5/15, 4/16. The workloads of 30 req/s and 10 req/s were used to simulate B and A workloads, respectively. Again, a first order ARX model was used to fit the data with sufficient accuracy (which we call model-B). Finally, another experiment was conducted to capture the dynamics of the system in the entire operating spectrum with the set (Sa/Sb = 4/16, 5/15, ..., 15/5, 16/4). For this experiment we used workloads of 20 req/s to simulate both A and B and fitted another first order ARX model (which we call model-AB). This model will be used to design the controllers and for performance comparisons later on. The model parameters and structure of model-A, model-B and model-AB are shown in Eqs. (5), (6) and (7), respectively.

$$y(k+1) = 0.67\,y(k) + 0.13\,u(k) \qquad (5)$$

$$y(k+1) = 0.65\,y(k) + 0.79\,u(k) \qquad (6)$$

$$y(k+1) = 0.84\,y(k) + 0.59\,u(k) \qquad (7)$$
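The identification procedure can be rehearsed offline. In the Python sketch below, a simulated first-order system with the parameters of Eq. (5) plus a small amount of noise stands in for the real prototype, which is an assumption made purely for illustration. The experiment design follows the description above: a randomly selected region A input is held for 4 sample periods, 600 periods are recorded, the first 400 form the estimation set and the rest form the test set. The function names and the noise level are hypothetical.

```python
import random


def fit_first_order_arx(u, y):
    """Least-squares fit of y(k+1) = a*y(k) + b*u(k); expects len(y) == len(u) + 1."""
    n = len(y) - 1
    syy = sum(y[k] * y[k] for k in range(n))
    syu = sum(y[k] * u[k] for k in range(n))
    suu = sum(u[k] * u[k] for k in range(n))
    sy1y = sum(y[k + 1] * y[k] for k in range(n))
    sy1u = sum(y[k + 1] * u[k] for k in range(n))
    det = syy * suu - syu * syu          # determinant of the 2x2 normal equations
    a = (sy1y * suu - sy1u * syu) / det
    b = (sy1u * syy - sy1y * syu) / det
    return a, b


def one_step_rmse(a, b, u, y, start, stop):
    """Root mean square one-step prediction error over samples start..stop-1."""
    errors = [(y[k + 1] - (a * y[k] + b * u[k])) ** 2 for k in range(start, stop)]
    return (sum(errors) / len(errors)) ** 0.5


rng = random.Random(0)
levels = [sa / (20 - sa) for sa in range(10, 17)]   # region A inputs incl. 10/10
u, y = [], [1.0]
for k in range(600):
    if k % 4 == 0:                                   # hold each input for 4 periods
        level = rng.choice(levels)
    u.append(level)
    # Stand-in for the measured response of the prototype (assumed dynamics + noise).
    y.append(0.67 * y[-1] + 0.13 * level + rng.gauss(0.0, 0.01))

a_hat, b_hat = fit_first_order_arx(u[:400], y[:401])   # estimation set
test_rmse = one_step_rmse(a_hat, b_hat, u, y, 400, 600)  # validation on the test set
```

On real measurements the same two functions would be applied to the logged u–y data, and the test-set error would indicate whether a first order structure is sufficient.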

6.2. Implementation of MMST-T2 scheme

As shown in Fig. 6(a), we require integration of fixed models representing different operating regions and the respective controllers to implement the MMST-T2 scheme. After the models are identified (see Section 6.1), the next step is to construct the suitable controllers. The PI control algorithm was selected for our prototype system. Two types of discrete PI control laws are available in the control literature; they are shown in Eqs. (8) and (9), and both are provided by our class library. e(k) is the control error, i.e., the difference between the set point and the measured output at the current sample instance. Kp (proportional gain) and Ki (integral gain) are the tuning parameters of the PI controller, which have to be selected to achieve the desired performance metrics such as stability, settling time and overshooting (Hellerstein et al., 2004). For


instance, when the system faces a disturbance, if we use large values for Kp and Ki, the control error is weighted heavily. Consequently, the controller step/effort will be large, and hence the controller is aggressive. That is, the controller will respond fast, reducing the settling time. However, the downside is that the controller may over-react to noisy disturbances and induce oscillations and overshooting in the system output. In contrast, if we use small gain values the error is weighted less, and the controller step/effort towards reacting to a disturbance will be low. Hence, the controller will be less aggressive, showing a longer settling time, but its vulnerability to noisy disturbances may be lower. Eq. (8) illustrates the position/full-value form of the PI control law, which is widely used to manage software systems (e.g., see Hellerstein et al., 2004; Gandhi et al., 2002; Parekh et al., 2002). However, in switching control systems this control law may cause performance issues. For instance, when the control is switched from one controller to another due to a large disturbance, the integral term (the second term) in Eq. (8) may have a different value in each controller, which may lead to a different or unsuitable control input being used by the new controller. As a consequence, there may be large transient output responses, called bumpy transfers (Youbin et al., 1996). In contrast, the velocity/incremental form of the PI law illustrated in Eq. (9) is well known for implementing bumpless transfers in mode and controller switching systems (Youbin et al., 1996). This is because the control input of the previous sample (u(k − 1)) is used as a reference point and the incremental part (computed by summation of the second and third terms) is added to it. Consequently, u(k) cannot have large deviations from u(k − 1) even after the controllers are switched.
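The difference between the two forms at a switching instant can be reproduced with a small Python experiment. The sketch below implements Eqs. (8) and (9) and hands control from one controller to another after a few samples. The error sequence is hypothetical, and the scenario assumes that the incoming position-form controller starts with an empty integral sum, which is one way the integral terms of two controllers can differ.

```python
class PositionPI:
    """Position/full-value form, Eq. (8): u(k) = Kp*e(k) + Ki*sum(e)."""

    def __init__(self, kp, ki):
        self.kp, self.ki, self.error_sum = kp, ki, 0.0

    def step(self, e):
        self.error_sum += e
        return self.kp * e + self.ki * self.error_sum


class VelocityPI:
    """Velocity/incremental form, Eq. (9): u(k) = u(k-1) + (Kp+Ki)*e(k) - Kp*e(k-1)."""

    def __init__(self, kp, ki):
        self.kp, self.ki = kp, ki

    def step(self, e, e_prev, u_prev):
        return u_prev + (self.kp + self.ki) * e - self.kp * e_prev


errors = [0.5, 0.3, 0.1, 0.0, 0.0]   # hypothetical control errors before the switch
e_switch = 0.05                      # error at the sample where the second controller takes over

# Position form: each controller owns its integral state.
pos_first, pos_second = PositionPI(1.15, 0.61), PositionPI(0.47, 0.11)
for e in errors:
    u_pos = pos_first.step(e)
jump_position = abs(pos_second.step(e_switch) - u_pos)

# Velocity form: the last applied input u(k-1) is shared by both controllers.
vel_first, vel_second = VelocityPI(1.15, 0.61), VelocityPI(0.47, 0.11)
u_vel, e_prev = 0.0, 0.0
for e in errors:
    u_vel, e_prev = vel_first.step(e, e_prev, u_vel), e
jump_velocity = abs(vel_second.step(e_switch, e_prev, u_vel) - u_vel)
```

Before the switch both forms produce the same input, since they are algebraically equivalent for a single controller; at the switch the position form jumps by roughly the whole accumulated input, whereas the velocity form moves only by the new increment.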
In this work, the velocity form of PI control is implemented to achieve smooth or bumpless transitions when controllers switch in the control system. In addition, a formal design methodology called the pole-placement design (Hellerstein et al., 2004) was used to derive the gains of the controllers given the design specifications.
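For a first order model y(k+1) = a·y(k) + b·u(k) under the velocity-form PI law, the closed-loop characteristic polynomial is z² + (b(Kp + Ki) − 1 − a)z + (a − b·Kp), so matching it to (z − p1)(z − p2) gives the gains directly. The Python sketch below carries out this textbook calculation and checks it by simulation. The function names are hypothetical, and the pole locations in the example are chosen only for illustration, so the resulting gains are not claimed to be those used in the experiments.

```python
def pole_placement_pi(a, b, p1, p2):
    """Gains (Kp, Ki) that place the closed-loop poles of a first-order model at p1, p2."""
    kp = (a - p1 * p2) / b
    ki = (1.0 - p1) * (1.0 - p2) / b
    return kp, ki


def closed_loop_polynomial(a, b, kp, ki):
    """Coefficients (c1, c0) of the characteristic polynomial z^2 + c1*z + c0."""
    return b * (kp + ki) - 1.0 - a, a - b * kp


def simulate_step(a, b, kp, ki, set_point=1.0, steps=60):
    """Closed-loop response of the model under the velocity-form PI law."""
    y, u, e_prev = 0.0, 0.0, 0.0
    outputs = []
    for _ in range(steps):
        e = set_point - y
        u = u + (kp + ki) * e - kp * e_prev   # velocity-form PI update
        y = a * y + b * u                      # first-order model response
        e_prev = e
        outputs.append(y)
    return outputs


# Example: model parameters of Eq. (5) with both poles placed at 0.65 (illustrative choice).
kp, ki = pole_placement_pi(a=0.67, b=0.13, p1=0.65, p2=0.65)
response = simulate_step(0.67, 0.13, kp, ki)
```

Moving the desired poles closer to zero yields larger gains and a more aggressive controller, which is the trade-off discussed above in terms of settling time and sensitivity to noise.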

$$u(k) = \underbrace{K_p\,e(k)}_{\text{Proportional term}} + \underbrace{K_i \sum_{j=1}^{k} e(j)}_{\text{Integral term}} \qquad (8)$$

$$u(k) = u(k-1) + \underbrace{(K_p + K_i)\,e(k) - K_p\,e(k-1)}_{\text{Incremental term } (\Delta u)} \qquad (9)$$

Typically, the control system designer would use model-AB and the high-level performance specifications (e.g., settling time and overshooting) to implement a single fixed gain controller. However, as seen from the experimental results and observations in Section 7, it is hard to satisfy the stability and performance requirements of the system under many different conditions. An aggressive controller (with Kp = 1.15, Ki = 0.61, called controller-A) is needed to achieve the performance objectives in region A, whereas a comparatively less aggressive controller (with Kp = 0.47, Ki = 0.11, called controller-B) is needed to provide control in region B. The aggressive controller will take large steps in region A and reach the desired control inputs in region A. However, in region B, because the spacing between the control inputs is small (see Fig. 3), the large steps taken by the aggressive controller will create instabilities. In contrast, the less aggressive controller, taking small steps, settles to appropriate operating points in region B, avoiding such instabilities (see Section 7.2). {controller-A, model-A} and {controller-B, model-B} were grouped and given as the input model and controller set to the MMSTSwitchingBoxT2 class available from the class library. The model evaluation algorithm with a finite window (T = 3) illustrated in Eq. (4) was used, because typically the operating conditions of software systems are time-varying. α and β were set to 0 and 1, respectively. We set controller-A as the startup controller without loss of generality. Then the sensor and actuator of the prototype system were connected to the instance created out of the MMSTSwitchingBoxT2 class before the class library was compiled as a component in the prototype system. The final architecture of the self-managing control system is shown in Fig. 6(a).

6.3. Implementation of MMST-T4 scheme

In contrast to the MMST-T2 scheme, the MMST-T4 implementation requires integration of adaptive models, fixed models and adaptive controllers. For our prototype system, a free running adaptive model is integrated together with the two fixed models estimated in Section 6.1. Hence, this is an MMST-T4 configuration

Fig. 6. The block diagrams of implemented self-managed control system architectures.

2688

T. Patikirikorala et al. / The Journal of Systems and Software 85 (2012) 2678–2696

with n − 1 fixed models and an adaptive model. The controller is a single self-tuning PI controller (with a structure similar to Fig. 1(b)), which computes the controller gains from the parameters of the model and the design specifications. Fig. 6(b) shows the block diagram of the designed scheme, which is similar to the structure found in Narendra et al. (2003), Gregorcic and Lightbody (2000) and Filev and Larsson (2001). The two fixed models and the adaptive model are evaluated by the switching algorithm, and a model is selected at each switching period. Then, the parameters of the selected model are given to the controller design component to compute the gains of the self-tuning PI controller.

We used the MMSTSwitchingBoxT4 class provided by the class library to implement this self-managing control system (Fig. 6(b)). The velocity form algorithm was used in the self-tuning PI controller, which takes the first-order model parameters and computes the Kp and Ki gains online (see Hellerstein et al., 2004 for more details). Model-A, model-B and a first-order adaptive model (implementing the recursive least squares algorithm; Lennart, 1997; Åström and Wittenmark, 1995) were given as the input models to the MMSTSwitchingBoxT4 class, with the self-tuning PI controller set as the controller in operation. The switching algorithm configuration parameters were set to the same values as specified in Section 6.2, apart from T, which was set to 4 to achieve consistent switching performance. The forgetting factor of the adaptive model was set to 0.94 (around the standard value recommended by Åström and Wittenmark, 1995). The specification given to the control design component is the desired pole locations for the self-tuning PI controller, which were both placed at 0.65 (see Hellerstein et al., 2004; Åström and Wittenmark, 1995 for details).¹

7. Experimental results

In this section, the performance of the MMST-T2 and MMST-T4 schemes designed in Sections 6.2 and 6.3 is investigated in different operating conditions and business scenarios, using the prototype travel reservation system with relative guarantee control. For a comparative performance analysis, we also use two fixed-gain controllers (controller-A and controller-B) and a self-tuning PI regulator (with the same settings as the MMST-T4 adaptive model and controller specification).

The prototype system was deployed on a machine with an Intel Core™ 2 Duo E8400 (2.99 GHz) processor and 2 GB of memory. To simulate the two client workloads we used tailor-made workload generators, which were deployed on a separate machine with a Core™ 2 Duo E6550 (2.33 GHz) processor and 3 GB of memory. The target software system and the controllers were built in C#.Net, and the third-party supplier component was implemented in Java as a web service deployed in Tomcat 5.5 with the Axis 2 web service engine. The two machines were connected via 1 Gbps Ethernet. With these settings the prototype system can handle a 40 req/s workload without being overloaded.

In Section 7.1, we present the performance of the controllers in the nominal operating conditions specified in Section 3. Then, in Section 7.2, the performance of the controllers is evaluated by forcing the control systems to operate away from the nominal operating conditions, applying high workloads of A and B alternately. In addition, in Section 7.3 we compare the performance differentiation capabilities of the controllers when the software system is persistently overloaded. In these experiments we use step input signals or workloads, which are typically used in

control engineering to validate the implemented control system. Many existing works (e.g., Padala et al., 2009; Lu et al., 2006; Karlsson et al., 2005b) have taken the same approach. However, in order to investigate the performance under highly stochastic real world workloads, Section 7.4 presents the results of an experiment conducted with the workload traces of a production web server. Finally, the impact of the MMST configuration parameters on performance is examined in Section 7.5.

¹ The specifications are decided based on performance metrics such as overshoot, settling time and steady-state error. Because of the multiple regions of operation, we placed the desired poles at the locations mentioned to achieve the performance metrics in both regions.

7.1. The nominal region

In this section, we investigate the performance of the control systems when the target system operates in the nominal region, with the workloads typical of normal business hours. The set points of the control systems are fixed at 1, indicating that both client classes are equally important. In the nominal region, both classes get the same amount of resources and have similar workload conditions. We used workload settings where A and B start off by sending 10 req/s each until the 50th sample, after which both classes increase their workloads to 20 req/s simultaneously.

The performance of all the control systems is satisfactory and similar. Even at the disturbance after the 50th sample, no overshoot was observed. This is because the operating point selected by all control systems is around Sa/Sb = 10/10 for the entire experiment. As such, when the workloads increase simultaneously there is no need to change the operating point. In such nominal operating conditions, the performance of the single linear model based controllers is much better because their corresponding models were designed under such conditions. Similar performance can be seen in the adaptive controller and both MMST control systems as well. Fig. 7 shows the switching behavior of the MMST schemes. For both MMST schemes there is no model that represents the nominal conditions explicitly, since model-A and model-B are biased towards the A and B regions, respectively.
However, the model switching in Fig. 7(a) shows that MMST-T2 used model-A throughout the experiment, indicating that model-A is the closest model for the nominal operating conditions. Consequently, controller-A provided satisfactory performance without any instability. The MMST-T4 switching behavior in Fig. 7(b) is different. Although the MMST-T4 scheme starts off with model-A like the MMST-T2 scheme, the adaptive model takes over the control of the system at the 8th sample. This reflects one of the limitations of adaptive control, i.e., the time taken to converge at start-up. As a consequence, model-A, which is close to the operating condition, provides control until the 8th sample. After convergence, however, the adaptive model outperforms the two fixed models in the model evaluation step. This case yields an interesting conclusion: the adaptive model captures dynamics not captured by the fixed models.

Fig. 7. The model switching of MMST schemes 2 and 4 in the nominal conditions. Panels (selected model vs. Sample Id): (a) MMST-T2 model switching; (b) MMST-T4 model switching.

7.2. Outside the nominal region

The experiment starts off in the nominal region with A and B each sending 10 req/s. To force the control system to operate away from the nominal region, the A workload increases from 10 to 30 req/s at the 30th sample. This could represent a scenario where travel agent A has advertised special travel plans or fares for a limited period of time, causing a sudden increase in system workload. At the 80th sample the A workload returns to the nominal request rate of 10 req/s. Then, at the 100th sample, the B workload increases from 10 to 30 req/s. The set point for this experiment is also fixed at 1, indicating that both clients are equally important.

Fig. 8(a) and (b) shows the performance of controller-B and controller-A when they are deployed separately. Both controllers reject the high workload disturbances applied at the 30th and 100th samples. After the disturbance at the 30th sample, the aggressive controller-A settles down quickly with less overshoot than the less aggressive controller-B. The performance of controller-A is significantly better in region A. However, the steady-state performance of controller-A after the 100th sample is oscillatory and unstable compared to controller-B. This is because of the discontinuous control inputs in region B. The aggressive controller-A does not settle to a suitable operating point, leading to controller-induced oscillations. Controller-B, in contrast, settles down to a desired operating point in 32 samples after the 100th sample and achieves better steady-state behavior. These results indicate that neither of the controllers can individually provide effective control across the entire operating region. Hence, using a single controller to operate across both regions sacrifices the performance in one or both of those regions.

Fig. 8. The response of the control systems and the model switching of MMST schemes 2 and 4 outside the nominal region. Panels (response time ratio y and set point, or selected model, vs. Sample Id): (a) controller-B; (b) controller-A; (c) Adaptive PI controller; (d) MMST-T2; (e) MMST-T4; (f) MMST-T2 model switching; (g) MMST-T4 model switching.

Fig. 8(c) shows the performance of the self-tuning PI adaptive controller. Its disturbance rejection capability is significantly poorer and less consistent than that of the fixed-gain controllers. This is because, with the large sudden disturbance at the 30th sample, the estimated model parameters vary drastically, leading to the computation of inappropriate gains for the controller. Consequently, the resource allocation decisions are completely unstable and unsuitable for the workload conditions, leading to high transient responses and deviations from the set point. Similar behavior is observed after the 100th sample as well, although the controller eventually achieves the set point at the 220th sample. Such poor performance is observed because the adaptive controller's assumption of slowly varying conditions was not satisfied by the disturbances at the 30th and 100th samples.

Fig. 8(d) shows the performance of the MMST-T2 scheme, which combines controller-A and controller-B together with their corresponding models representing regions A and B, respectively. It rejects the disturbances at the 30th and 100th samples and settles down with low steady-state error in both regions. The MMST-T2 settling time after the disturbance at the 30th sample is approximately 12 sample periods, which is similar to controller-A's (12 sample periods) but better than controller-B's (46 sample periods). At the 100th sample, controller-B and MMST-T2 show low steady-state error, whereas controller-A shows significantly high steady-state error. The overshoot of controller-B at the 100th sample is also significantly higher than that of MMST-T2 (4.1 and 3.3, respectively). The results demonstrate that the MMST-T2 scheme, which combines controller-A and controller-B, provides effective performance in both regions.

This performance improvement can be explained by the model switching behavior shown in Fig. 8(f). The MMST-T2 switching algorithm selects the appropriate model and controller depending on the operating conditions. Although there is no model designed to represent the startup workload conditions (until the 30th sample), model-A is selected at the 6th sample, identifying it as the closest model to the operating conditions.
After the 30th sample the control system operates in region A (represented by model-A), so no switching is required. Similarly, after the 100th sample, when the control system is operating in region B, model-B is selected at the 108th sample. No chattering is observed during the switching because the velocity form of the PI controller provides bumpless transfers.

Although the self-tuning PI adaptive controller failed to provide the required performance, the MMST-T4 scheme, which combines the adaptive model with the fixed models, provided significantly better performance (see Fig. 8(e)). This is because the disturbances at the 30th and 100th samples cause the MMST-T4 control system to switch to the fixed models representing the particular operating conditions, thus avoiding the large transient responses that would otherwise be caused by the delay in estimating the new model parameters. Fig. 8(g) shows the model switching behavior of MMST-T4. Under the nominal workload conditions, until the 30th sample, the adaptive model is used. However, because of the large disturbance at the 30th sample, model-A is selected at the 32nd sample, since model-A estimates the system dynamics under this condition. The adaptive model did not outperform model-A until the 80th sample. From the 92nd to the 104th sample some chattering was observed. Then, at the 108th sample, model-B was selected, leading to significantly better performance of MMST-T4 compared to a self-tuning PI controller based on a single adaptive model. However, MMST-T2 shows less chattering and overshoot than the MMST-T4 scheme. Providing fast convergence to the model parameters after a sudden disturbance is one of the main design objectives of MMST adaptive control.
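To illustrate why a free-running adaptive model lags behind an abrupt change, the sketch below implements recursive least squares with a forgetting factor for a first-order model y(k+1) = a y(k) + b u(k). It is a minimal illustration under stated assumptions: the plant parameters, the random excitation signal, the initial covariance and all names are hypothetical, and only the forgetting factor of 0.94 comes from Section 6.3.

```python
import random


class RLSFirstOrder:
    """Recursive least squares for y(k+1) = a*y(k) + b*u(k) with forgetting."""

    def __init__(self, forgetting=0.94, p0=1000.0):
        self.lam = forgetting
        self.theta = [0.0, 0.0]                # current estimates of [a, b]
        self.P = [[p0, 0.0], [0.0, p0]]        # covariance matrix (symmetric)

    def update(self, y_prev, u_prev, y_now):
        phi = [y_prev, u_prev]
        P = self.P
        p_phi = [P[0][0] * phi[0] + P[0][1] * phi[1],
                 P[1][0] * phi[0] + P[1][1] * phi[1]]
        denom = self.lam + phi[0] * p_phi[0] + phi[1] * p_phi[1]
        gain = [p_phi[0] / denom, p_phi[1] / denom]
        err = y_now - (self.theta[0] * phi[0] + self.theta[1] * phi[1])
        self.theta = [self.theta[0] + gain[0] * err,
                      self.theta[1] + gain[1] * err]
        # P <- (P - gain * phi^T P) / lambda; P is symmetric, so phi^T P = p_phi^T
        self.P = [[(P[i][j] - gain[i] * p_phi[j]) / self.lam for j in range(2)]
                  for i in range(2)]
        return err


def run_demo(steps=250, change_at=100, lam=0.94):
    """Track a plant whose parameters jump at sample `change_at` (assumed values)."""
    rng = random.Random(0)
    est = RLSFirstOrder(forgetting=lam)
    a, b = 0.5, 0.8
    y = 0.0
    history = []
    for k in range(steps):
        if k == change_at:
            a, b = 0.7, 0.3                    # abrupt change in the dynamics
        u = rng.uniform(-1.0, 1.0)             # exciting input signal
        y_next = a * y + b * u
        est.update(y, u, y_next)
        history.append(tuple(est.theta))
        y = y_next
    return history
```

In this setting the estimates just after the change are still a blend of old and new data, because past samples are discounted only geometrically by the forgetting factor. A fixed model that already represents the new region does not need to wait for this convergence.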
The MMST schemes perform significantly better than the self-tuning PI adaptive controller because they converge quickly to the neighborhood of the model parameters. For instance, the MMST-T2 and MMST-T4 schemes reach the desired model parameter neighborhood in approximately 8 sample periods, whereas the adaptive controller takes over 100 samples to converge after the disturbance at the 100th sample.

7.3. Persistently overloaded condition

In the above sections we compared the performance when the shared resource system is running at full workload capacity or lower. In this section we compare the differentiation capabilities in a severely overloaded case, where the queues of one or more agents start to grow at a rapid rate, affecting the performance of the shared resource environment. We specify a queue length limit of 20 for each queue. This avoids the unbounded growth of queues under extreme overload lasting for a long time, which could in turn lead to an uncontrolled increase in the response time and instabilities in the system. When a queue has reached the specified limit, further requests are rejected. This is the practice in current web systems (e.g., the acceptCount attribute in Apache Tomcat). The queue limit has to be carefully selected based on acceptable response time levels and control objectives, because such an implementation introduces nonlinearities into the response time signals.

In this case we assume that the workload of agent A is more important than that of agent B by fixing the set point at 3 (i.e., Rb ≈ 3 × Ra). We then apply a 20 req/s workload for agent A and a 50 req/s workload for agent B until the 50th sample. With these settings the workload of agent B significantly overloads the system. Therefore, the desirable behavior according to the control objectives is to see few or no request losses for agent A and more request losses for agent B. At the 50th sample, we increase the workload of agent A to 60 req/s while maintaining the workload of B at the same level, in order to investigate the performance under changing conditions.

Fig. 9 shows the outputs of the control systems. The two linear control systems show significantly different responses in this condition (see Fig. 9(a) and (b)). Controller-B shows satisfactory performance until the 50th sample.
However, it takes around 30 samples to reach the set point after the disturbance at the 50th sample. In contrast, controller-A shows highly unstable and poor performance until the 50th sample, but recovers faster than controller-B after the disturbance at the 50th sample. The poor performance of controller-A arises because it has to operate in region B, owing to the overloading workload of agent B. When the workload of A overloads the system at the 50th sample, controller-A reaches the steady state in less than 5 samples without showing any instability. This is because, after the high workload of agent A is applied, the control systems have to operate in region A, where controller-A provides better performance. The same disturbance affects the settling time of controller-B because it is less aggressive in that region. In addition, controller-A shows more request losses for the important agent A than controller-B from startup to the 50th sample. However, controller-B rejects more of agent A's workload than controller-A after the disturbance at the 50th sample because of the time it takes to settle down.

The performance of the adaptive PI controller was variable in this condition as well. Fig. 9(c) shows the common behavior observed in multiple runs of the same experiment. The main issue is that, owing to the sudden disturbance at the 50th sample, the performance of the adaptive controller degrades drastically, leading to a large deviation from the set point. This is because of the delay in model estimation. Such a delay in reaction under severe overload can adversely affect the integrity of the system.

The MMST-T2 control system (see Fig. 9(d)) shows better disturbance rejection capabilities without any instability. However, at startup it takes around 15 samples to reach the set point. This is because the MMST-T2 control system starts up with model-A and controller-A, which shows significant performance issues at startup. By the time the switching algorithm makes its first decision at the 3rd sample (see Fig. 9(f)), the operation of controller-A has already affected the performance of the MMST-T2 control system, leading to a large settling time at startup. Afterwards, the MMST-T2 control system shows satisfactory performance, similar to that of controller-B in Fig. 9(a), without any instability. Then, at the 50th sample, the MMST-T2 control system shows significantly better disturbance rejection capabilities, reaching the steady state in less than 5 samples. In addition, the MMST-T2 control system shows the lowest workload rejection rate for agent A (i.e., the most important agent) compared to both linear control systems, indicating better management of resources according to the specified control objectives. Consequently, the MMST-T2 control system provides much better performance than controller-A, and much better disturbance rejection characteristics than controller-B.

Fig. 9(e) shows the performance of the MMST-T4 scheme. Variability in performance across multiple runs was observed in this case because of the inclusion of the adaptive model. Compared to the adaptive controller, the MMST-T4 control system does not show large deviations from the set point, because the switching algorithm selects model-B at the 8th sample and model-A at the 52nd sample (see Fig. 9(g)) until the adaptive model estimates the new model parameters. As a consequence, the MMST-T4 control system performs better than the adaptive controller. However, its performance is poor compared with controller-B and the MMST-T2 scheme.

7.4. Real world workloads

In the previous sections we used workloads with instantaneous changes, which are typically used to validate a standard control system in control engineering. In this case we apply a workload trace from a real world web system. Many workload traces from real world applications can be accessed from http://ita.ee.lbl.gov/html/traces.html. These workload traces are also used in the existing literature to evaluate the performance management capabilities of software systems. For this experiment we use the workload traces of the EPA web server (http://ita.ee.lbl.gov/html/contrib/EPA-HTTP.html). These traces were selected because their workload intensities can be applied directly to the system under study, without any modification or scaling. After decoding the workload rates from the log files, the finalized workload conditions for 2250 samples for agents A and B are illustrated in Fig. 10. These workload patterns contain multiple operating conditions, including the nominal, outside nominal and overloaded operating conditions. The workload scripts were input to the workload generator to simulate the workloads of the two classes. Again, we set the set point at 1, which enables us to investigate the issues in performance management and chattering in the control systems under highly variable workload conditions. Fig. 11 shows the performance of each control system over the 1.2 h of operation. We have not shown the performance of the adaptive controller and the MMST-T4 scheme because of their poor and variable performance under these workload conditions.
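Section 7.4 decodes request rates from web server log files before replaying them through the workload generator. A minimal sketch of such a decoding step is given below. It assumes that each log line carries a timestamp of the form [DD:HH:MM:SS] and that rates are averaged over a fixed sampling period. The function name, the regular expression, the sample log lines and the 2 s period are illustrative assumptions, not details from the paper.

```python
import re
from collections import Counter

# Assumed line shape: host [DD:HH:MM:SS] "request" status bytes
_STAMP = re.compile(r"\[(\d+):(\d+):(\d+):(\d+)\]")


def decode_rates(lines, period_s):
    """Return request rates (req/s), one value per sampling period."""
    seconds = []
    for line in lines:
        match = _STAMP.search(line)
        if match is None:
            continue  # skip lines without a recognisable timestamp
        day, hour, minute, sec = (int(g) for g in match.groups())
        seconds.append(((day * 24 + hour) * 60 + minute) * 60 + sec)
    if not seconds:
        return []
    start = min(seconds)
    # Count the requests that fall into each sampling period.
    counts = Counter((t - start) // period_s for t in seconds)
    n_bins = max(counts) + 1
    return [counts.get(i, 0) / period_s for i in range(n_bins)]


# Hypothetical log excerpt used only to exercise the function.
SAMPLE_LOG = [
    'host1 [29:23:53:25] "GET /a.html HTTP/1.0" 200 1497',
    'host2 [29:23:53:25] "GET /b.gif HTTP/1.0" 200 1204',
    'host1 [29:23:53:26] "GET /c.html HTTP/1.0" 200 4889',
    'malformed line without a timestamp',
    'host3 [29:23:53:29] "GET /d.html HTTP/1.0" 304 0',
]

rates = decode_rates(SAMPLE_LOG, period_s=2)
```

The resulting list can then be split per client class and written out as a workload script. How the trace is divided between agents A and B is not specified in this section, so that step is left out here.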

Fig. 9. The response of the control systems and the model switching of MMST schemes 2 and 4 in the overloaded condition. Panels (response time ratio y and set point, or selected model, vs. Sample Id): (a) controller-B; (b) controller-A; (c) Adaptive PI controller; (d) MMST-T2; (e) MMST-T4; (f) MMST-T2 model switching; (g) MMST-T4 model switching.

Fig. 10. Real world workloads for A and B (workload in requests/sec vs. Sample Id).

Fig. 11. The response of the control systems and the model switching of MMST-T2 under real world workload. Panels (response time ratio y and set point, or selected model, vs. Sample Id): (a) controller-B; (b) controller-A; (c) MMST-T2; (d) MMST-T2 model switching.

From Fig. 11 it is evident that controller-B and the MMST-T2 scheme provide better performance over the entire experiment. The average performance of the controller-B and MMST-T2 control systems is similar. This is because the selected workload conditions did not contain sharp workload increments (e.g., slash-dot effects). Controller-A provides performance similar to the two other controllers until the 1200th sample, but its output becomes oscillatory when the workload of agent B is high. This is due to the discontinuous operating points in region B, which lead to performance degradation in controller-A. The switching behavior of the MMST-T2 scheme (see Fig. 11(d)) also corresponds to the workload pattern: between the 500th and 1200th samples, when the workloads of agent A are high, model-A, which represents region A, is selected. Similarly, when the workload of agent B is high, the switching algorithm selects model-B most of the time between the 1500th and 2000th samples. Although some chattering was observed because of the highly varying workloads, the MMST-T2 control system did not show any instability, unlike controller-A. The short-term chattering did not cause significant issues, owing to the selected parameters of the switching algorithm and the bumpless transfers implemented by the PI control algorithm used in the implementations of the MMST schemes.

7.5. Impact of different MMST tuning parameters on performance

In this section, we evaluate the impact of the MMST configuration parameters on the response of the control system and on the model switching behavior. For this purpose, we use the MMST-T2 scheme because the results in Sections 7.1–7.3 indicated consistent and predictable switching behavior in all the experiments. In addition, the experimental conditions of Section 7.2 are used in this evaluation because they force the control system to operate in the nominal, A and B regions. Firstly, the effect of a short finite time window (low T value) is examined; secondly, the effect of the startup controller is presented; and thirdly, we show the performance of the control system with α = 1 and β = 0.

7.5.1. Effect of short finite time window (low T value)

For this experiment, the time window T was set to 1, which means that the J indexes are evaluated at every sample instance, followed by the selection of a model and controller. Fig. 12 shows the performance and switching behavior. Compared to the results in Section 7.2, which used T = 3, the low T causes frequent chattering. Surprisingly, the performance in region A (after the 30th sample) is much better than in region B (after the 100th sample). This is mainly because less chattering occurred between the 30th and 100th samples. In region B, however, the highly discontinuous and constrained operating points cause frequent controller switching, leading to high steady-state error and oscillatory behavior. It is evident that model-B was used most of the time from the 100th to the 150th sample, although model-A was also selected in between. The longer switching period used in Section 7.2 leads to the use of model-B after the 108th sample because it accumulates the error for 3 samples according to the J index settings of Eq. (4). The general recommendation is that a low T is preferable if the system has fast varying dynamics, whereas a relatively high T can be used if the dynamics do not vary that fast. Simulation studies may provide a means to evaluate such performance characteristics before setting these parameters in a real system.

Fig. 12. Performance and model switching of MMST adaptive control when Tmin = 1.

7.5.2. Effect of the startup controller

For the experiments in Section 7.2, controller-A is used as the startup controller. In this section we set controller-B as the startup controller to check whether it affects the switching behavior, initially or afterwards. The performance and switching behavior shown in Fig. 13 are similar to the results of Section 7.2. At the 6th sample model-A is selected and maintained, with no difference from the switching behavior observed in Section 7.2. Consequently, the startup controller has no effect in this case, apart from the initial switching behavior.
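The role of α, β and T in the switching logic can be sketched as follows. This section does not restate Eq. (4), so the sketch assumes a common finite-window index, J_i(k) = α e_i²(k) + β Σ e_i²(j) taken over the last T samples, where e_i is the prediction error of model i, and it assumes that a selection is made once every T samples. The class, method and model names are hypothetical and do not correspond to the API of the MMSTSwichingBoxT2 class.

```python
class FiniteWindowSwitch:
    """Select the model with the smallest index J over a window of T samples."""

    def __init__(self, models, alpha=0.0, beta=1.0, window=3, startup=0):
        self.models = models          # callables: (y_prev, u_prev) -> predicted y
        self.alpha, self.beta = alpha, beta
        self.window = window          # T, the switching period in samples
        self.active = startup         # index of the startup model/controller
        self.sq_errors = [[] for _ in models]

    def observe(self, y_prev, u_prev, y_now):
        """Feed one sample and return the index of the selected model."""
        for i, model in enumerate(self.models):
            err = y_now - model(y_prev, u_prev)
            self.sq_errors[i].append(err * err)
        if len(self.sq_errors[0]) == self.window:
            # alpha weights the instantaneous error, beta the windowed sum.
            J = [self.alpha * errs[-1] + self.beta * sum(errs)
                 for errs in self.sq_errors]
            self.active = min(range(len(J)), key=J.__getitem__)
            self.sq_errors = [[] for _ in self.models]   # start a new window
        return self.active


def model_a(y, u):
    return 0.5 * y + 0.8 * u          # assumed parameters, illustration only


def model_b(y, u):
    return 0.7 * y + 0.3 * u          # assumed parameters, illustration only


def demo(window):
    """Start on model A while the data actually follow model B."""
    switch = FiniteWindowSwitch([model_a, model_b], alpha=0.0, beta=1.0,
                                window=window, startup=0)
    y, picks = 1.0, []
    for _ in range(3):
        y_next = model_b(y, 1.0)
        picks.append(switch.observe(y, 1.0, y_next))
        y = y_next
    return picks
```

With `window=3` the startup model stays active until three errors have been accumulated, whereas `window=1` re-evaluates the models at every sample. On noisy data the latter is what makes chattering more likely.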

The experiment results under the different operating conditions in Section 7 indicate that the MMST schemes (in particular, the type 2 scheme) are a promising approach to capture the behavior and to provide control in multiple operating regions of the software system by implementing self-managing (reconfiguring) control at runtime (for the case of relative guarantee over different client classes). Depending on the operating conditions, a suitable controller can be selected from the available set without any human intervention. In addition, the control system designer can combine the performance of the different controllers without considering the performance tradeoffs involved in designing a single controller, leading to flexibility in designing a control system. The MMST-T2 scheme integrates the fixed controllers into the control system and overcomes the limitations of adaptive control. It also overcomes the inability of single model-based fixed-gain controllers in providing control in multiple operating regions. Although short-term chattering was observed, it did not cause any instability. As such, the MMST-T2 scheme has provided significantly better performance (for the relative guarantee scheme). The MMST-T4 scheme, which integrates an adaptive model with fixed models, provided performance similar to the MMST-T2 scheme under two conditions. In the nominal region, we observed that the system behavior which was not captured by the fixed models was estimated online by the adaptive model. Outside the normal region (Section 7.2), the utilization of fixed models led to significantly better performance compared to the adaptive (self-tuning PI) controller. Under persistently overloaded conditions, the performance was variable due to chattering and convergence delays of the adaptive model. Although the simulation studies in Narendra et al. 
(1995, 2003), and Narendra and Xiang (2000) suggested that MMST-T4 was the best scheme, it did not hold for our case of a relative guarantee scheme based performance management system. However, MMST-T4 may be more effective when there is a lack of prior knowledge about the operating regions/conditions,

4 y Set point

3

A

Model

Response time ratio (y)

7.5.3. Effect of using only the instantaneous component of the switching algorithm (when ˛ = 1 and ˇ = 0) In this experiment we only use the instantaneous component of the switching algorithm compared to those in Section 7.2, which used the long-term component. The output of the system (see Fig. 14(a)) does not show any noticeable difference compared to the performance in Section 7.2. However, the switching behavior in Fig. 14(b) is different from that of Section 7.2. Especially, at the 100th sample, chattering could be observed causing slight deviations in the output. In addition, there is an inconsistent model switch from the 18th to 24th samples. Such inconsistent and unpredictable switching behavior is caused because of the utilization of only the instantaneous component. However, when the long-term component is added to the switching logic, this effect can be effectively removed. Experiments that have ˇ = 1 and small values of ˛ (=0.1–0.3) show model switching and performance similar to Section 7.2.

8. Discussion

2 1 0

B

0

50

100 Sample Id

(a) controller-B

150

50

100 Sample Id

150

(b) controller-A

Fig. 13. Performance and model switching of MMST adaptive control with controller-B as the startup controller.

T. Patikirikorala et al. / The Journal of Systems and Software 85 (2012) 2678–2696

4 y Set point

3

A

Model

Response time ratio (y)

2694

2 1

B

0

50

100

150

50

100

Sample Id

Sample Id

(a) controller-B

(b) controller-A

150

Fig. 14. Performance and model switching of MMST adaptive control when ˛ = 1 and ˇ = 0.

and the systems evolve due to changes (such as component additions and replacements, bug fixes and modifications due to new requirements). Furthermore, the configuration parameters α, β and T of MMST have to be selected carefully. In our experiments, a small α value with β = 1 provided consistent switching. Higher α values may improve performance, but the switching behavior becomes more variable. Setting α > 0, β = 0 may cause chattering, leading to performance degradation. T should be set as low as possible to avoid instabilities caused by using an unsuitable controller for a long time. However, a very small T (= 1) may cause significant chattering. The startup controller, in contrast, may not affect the performance apart from the initial switching behavior.

Our extendable class library, available at http://www.ict.swin.edu.au/personal/tpatikirikorala/Downloads, provides a good basis for designing MMST schemes for real systems or research purposes, saving the engineering cost of developing them from scratch. In addition, it provides a rich set of standard control algorithms which can be used to design and test various reconfiguring control systems with less engineering effort. Furthermore, the off-the-shelf implementations of all the MMST adaptive control schemes enable the designer to explore suitable schemes and variations.

Based on our experiments and the existing literature, Table 1 summarizes the capabilities and requirements of the control schemes presented in Section 2.1 against a list of criteria. (Reconfiguring control is excluded because it is a rather conceptual approach which depends on the application requirements.) We have used three flags in the evaluation. Yes is used when a criterion is strongly satisfied by the approach. No is used when the control mechanism does not satisfy the criterion. Partial denotes that there is no clear agreement or disagreement with the criterion. This could be because the control mechanism may or may not satisfy the criterion for all systems, environmental conditions or design requirements.
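The guidance on α, β and T can be illustrated with a small simulation. The sketch below is not taken from the authors' class library. It assumes a switching index of the form J_i(k) = α e_i(k)² + β Σ_j e_i(j)² (instantaneous plus accumulated squared prediction error of model i, with no forgetting factor), and it treats T as a minimum number of samples between two switches. The class name, field names and the synthetic error trace are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SwitchingSupervisor:
    """Hypothetical supervisor, not the API of the authors' class library."""
    n_models: int
    alpha: float      # weight on the instantaneous squared error
    beta: float       # weight on the accumulated squared error
    dwell: int        # T: assumed minimum number of samples between switches
    active: int = 0   # index of the start-up model/controller pair
    acc: List[float] = field(default_factory=list)
    since_switch: int = 0

    def __post_init__(self) -> None:
        self.acc = [0.0] * self.n_models

    def step(self, errors: List[float]) -> int:
        """Update with one prediction error per model; return the active index."""
        for i, e in enumerate(errors):
            self.acc[i] += e * e
        self.since_switch += 1
        if self.since_switch >= self.dwell:
            index = [self.alpha * e * e + self.beta * a
                     for e, a in zip(errors, self.acc)]
            best = min(range(self.n_models), key=index.__getitem__)
            if best != self.active:
                self.active = best
                self.since_switch = 0
        return self.active


def count_switches(sup: SwitchingSupervisor, trace: List[List[float]]) -> int:
    """Run the supervisor over a trace and count how often it changes model."""
    switches, previous = 0, sup.active
    for errors in trace:
        current = sup.step(errors)
        switches += current != previous
        previous = current
    return switches


# Synthetic errors: model 0 is noisy (0.1 / 0.5), model 1 is steady (0.3).
trace = [[0.1 if k % 2 == 0 else 0.5, 0.3] for k in range(20)]

instant = SwitchingSupervisor(n_models=2, alpha=1.0, beta=0.0, dwell=1)
history = SwitchingSupervisor(n_models=2, alpha=0.1, beta=1.0, dwell=1)
chatter = count_switches(instant, trace)
smooth = count_switches(history, trace)
print(chatter, smooth, history.active)
```

On this trace the α = 1, β = 0 setting follows the instantaneous error and switches on almost every sample, while the small-α, β = 1 setting switches a few times and then stays with the model whose accumulated error is lower. Raising `dwell` bounds the switching rate, at the cost of keeping a possibly unsuitable controller active for longer.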

9. Threats to validity and challenges

The experimental evaluations presented in this paper are limited to a single case study system, even though it is based on a typical real world scenario deployed in a physical environment. In addition, the management problem investigated is based on the relative performance management scheme with two client classes. Therefore, the results presented in this paper are only strictly valid under these settings. In order to further confirm and generalize the validity, further experiments for different case study systems under different control schemes are required. In this regard, we have also applied the MMST adaptive control schemes described in this paper to another domain, namely a case study with a

production business process server (footnote 2). This study further confirms the effectiveness of MMST for software systems compared to existing approaches. The details of this case study have not been included here because of space constraints, but can be found in our technical report (Patikirikorala et al., 2012). Additional case studies are planned for future investigation.

A challenge in using the MMST control approach is that it requires the identification and division of the entire operating region into multiple subregions. In the case of the relative performance management system described in this paper, the regions could be analytically determined. This may not be the case where other types of control objectives need to be achieved. Due to the diversity of operating environments and control requirements in different software systems, it is difficult to provide definitive general guidelines for determining a system's operating regions. However, several factors can be considered in deciding on the subregions. For example, regions can be identified based on the workload conditions (e.g., low, high or Slashdot-like workloads) or on where the set points are likely to be placed in the output region. In addition, the application-level operating conditions (e.g., workloads served from cache or from database/disk) can also be used as heuristics to characterize the operating regions.

Depending on the special conditions, events or modes of the system, different control schemes or strategies may be required. For the scenario in Section 3, for instance, the relative performance management scheme becomes uncontrollable or undefined if the workload from one client class becomes zero. Capturing these special situations and implementing the desired reconfiguration strategies is important. Combining features of discrete event systems with reconfiguring control systems may provide solutions for such requirements.
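As a rough illustration of the workload-based heuristics and the zero-workload special case mentioned above, the sketch below maps the arrival rates of two client classes to an operating region. The region names, the function name and the thresholds (in requests per second) are hypothetical placeholders; in practice the boundaries would have to come from analysis or system identification experiments on the target system.

```python
from enum import Enum


class Region(Enum):
    UNDEFINED = "undefined"  # relative performance is not defined
    LOW = "low"
    HIGH = "high"
    BURST = "burst"          # Slashdot-like overload


def classify_region(rate_class0: float, rate_class1: float,
                    high: float = 50.0, burst: float = 150.0) -> Region:
    """Map per-class arrival rates (requests/s) to an operating region.

    The thresholds are placeholders, not values from the paper.
    """
    # Special mode: with no workload from one class, the ratio between the
    # two classes cannot be controlled, so a different strategy is needed.
    if rate_class0 <= 0 or rate_class1 <= 0:
        return Region.UNDEFINED
    total = rate_class0 + rate_class1
    if total >= burst:
        return Region.BURST
    if total >= high:
        return Region.HIGH
    return Region.LOW


print(classify_region(10, 15))
print(classify_region(40, 0))
```

A supervisor could use such a classification to choose which model and controller pair to start from, or to suspend relative performance control while the region is `UNDEFINED`.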
Since MMST is a reconfiguring control scheme, the limitations discussed in Section 2.1 apply, such as computational overhead and chattering. The computational overhead is imposed by the addition of the switching scheme. As a consequence, the number and types of models (adaptive versus fixed) used in the MMST switching scheme affect the computational overhead. To mitigate chattering, the models and the parameters of the switching algorithm have to be carefully selected after simulation studies.

Reducing the design cost of conducting multiple SID experiments to capture the behavior in different regions is another challenge. One possible solution is to investigate online model learning and retention techniques that could manage a database of models used in the switching algorithm (for example, see the work in Narendra et al., 1995; Zhai et al., 2006).

Finally, a practical limitation of the MMST control schemes is their complexity relative to simple linear control. They should only be chosen

2 http://wso2.com/products/business-process-server/


Table 1
Comparison between fixed, adaptive, gain scheduling and MMST control ((√) yes; (∼) partial; (⊗) no).
Criterion
Ability to handle multi input multi output systems
Availability of formal stability proofs
Can provide fast response to dynamic disturbances
Can handle multiple operating regions
Low design time effort
Portability (deploy in new environments without any modifications)
No prior information needed
Can provide acceptable start-up performance
No possibility of chattering
Low computational overhead

Fixed control √ √

Adaptive control √ √

⊗ ⊗ ∼ ⊗ ∼ ∼ √

∼ ∼ ∼ √





when a single fixed or adaptive controller cannot provide effective performance across the entire spectrum of the system's operating conditions.

10. Conclusions

It is vital to incorporate self-adaptive capabilities into software systems in order to achieve performance objectives in dynamic and unpredictable environments. Control engineering techniques have been explored in the past few years to automate performance management tasks in software systems. However, existing control techniques based on single fixed or adaptive models are unable to provide satisfactory control in multiple operating regions of a software system under changing operating conditions.

In this paper, we have evaluated a control engineering technique, called Multi-Model Switching and Tuning (MMST) adaptive control, in order to assess its effectiveness in capturing the system dynamics and providing control in multiple operating regions under changing conditions. We have designed two types of MMST control systems for a real world scenario and a prototype software system that exhibited multiple operating regions. Then, the performance of the MMST control systems was compared to single fixed and adaptive control schemes under different experimental conditions. Furthermore, the impact of MMST adaptive control configuration parameter settings on the response of the control system was evaluated. The experimental results indicate that MMST adaptive control can provide superior performance and effective self-management of the control system without any human intervention by selecting the appropriate model and controller for particular operating regions and conditions. In addition, the MMST schemes also provide flexibility in control system design. To reduce the engineering cost of designing self-managing control systems, we have also implemented a library of MMST control schemes following standard control algorithms.

It should be noted that the experimental results presented in this paper have been obtained from one case study system. Even though the results are further confirmed by another case study in a different domain, other case studies are required to confirm and generalize these results; these are planned as part of future work. Furthermore, we plan to investigate runtime model learning mechanisms using the MMST-type 4 scheme, which would reduce the design effort and prior knowledge required in system modeling.

References

Åström, K.J., Wittenmark, B., 1995. Adaptive Control. Addison-Wesley Publishing Company.
Abdelzaher, T.F., Bhatti, N., 1999. Web server QoS management by adaptive content delivery. In: International Workshop on Quality of Service, pp. 216–225.
Brun, Y., Marzo Serugendo, G., Gacek, C., Giese, H., Kienle, H., Litoiu, M., Müller, H., Pezzè, M., Shaw, M., 2009. Engineering self-adaptive systems through feedback loops. Software Engineering for Self-Adaptive Systems, 48–70.

∼ ⊗ √

Gain scheduling √ ⊗ ∼ ∼ ⊗ ⊗ ⊗ ∼ ⊗ ∼

MMST-T2 √ √

MMST-T4 √ √

∼ ∼ ⊗ ⊗ ⊗ ∼ ⊗ ∼

∼ √ ⊗ ∼ ⊗ ∼ ⊗ ⊗

Cheng, S.-W., 2008. Rainbow: cost-effective software architecture-based self-adaptation. Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA, USA.
Diao, Y., Hellerstein, J.L., Parekh, S., Bigus, J.P., 2003. Managing web server performance with autotune agents. IBM Systems Journal, 136–149.
Diao, Y., Hellerstein, J.L., Storm, A.J., Surendra, M., Lightstone, S., Parekh, S., Garcia-Arellano, C., 2004. Using mimo linear control for load balancing in computing systems. In: American Control Conference, Vol. 3, pp. 2045–2050.
Dorf, R.C., Bishop, R.H., 2000. Modern Control Systems. Prentice-Hall, Inc.
Dutreilh, X., Rivierre, N., Moreau, A., Malenfant, J., Truck, I., 2010. From data center resource allocation to control theory and back. In: International Conference on Cloud Computing, CLOUD, pp. 410–417.
Filev Sr., D., Larsson, T., 2001. Intelligent adaptive control using multiple models. In: Proceedings of the 2001 IEEE International Symposium on Intelligent Control (ISIC'01), pp. 314–319.
Gandhi, N., Tilbury, D., Diao, Y., Hellerstein, J., Parekh, S., 2002. Mimo control of an apache web server: modeling and controller design. In: American Control Conference, 2002, Vol. 6, pp. 4922–4927.
Garlan, D., Cheng, S.-W., Huang, A.-C., Schmerl, B., Steenkiste, P., 2004. Rainbow: Architecture-based self-adaptation with reusable infrastructure. Computer 37, 46–54.
Goel, A., Steere, D., Pu, C., Walpole, J., 1998. Swift: a feedback control and dynamic reconfiguration toolkit. Tech. rep., Oregon Graduate Institute School of Science and Engineering.
Gregorcic, G., Lightbody, G., 2000. A comparison of multiple model and pole-placement self-tuning for the control of highly nonlinear processes. In: Proceedings of the Irish Signals and Systems Conference, pp. 303–311.
Hellerstein, J.L., Diao, Y., Parekh, S., Tilbury, D.M., 2004. Feedback Control of Computing Systems. John Wiley and Sons.
Hellerstein, J.L., 2004. Self-managing systems: A control theory foundation. In: Annual IEEE Conference on Local Computer Networks, p. 708.
Kandasamy, N., Abdelwahed, S., Hayes, J.P., 2004. Self-optimization in computer systems via on-line control: Application to power management. In: Proceedings of the First International Conference on Autonomic Computing. IEEE Computer Society, pp. 54–61.
Karimi, A., Landau, I., 2000. Robust adaptive control of a flexible transmission system using multiple models. IEEE Transactions on Control Systems Technology 8, 321–331.
Karlsson, M., Zhu, X., Karamanolis, C., 2005. An adaptive optimal controller for nonintrusive performance differentiation in computing services. In: IEEE Conference on Control and Automation. ICCA.
Karlsson, M., Karamanolis, C., Zhu, X., 2005. Triage: Performance differentiation for storage systems using adaptive control. Transactions on Storage, 457–480.
Kokar, M.M., Baclawski, K., Eracar, Y.A., 1999. Control theory-based foundations of self-controlling software. IEEE Intelligent Systems 14 (3), 37–45.
Kusic, D., Kandasamy, N., 2006. Risk-aware limited lookahead control for dynamic resource provisioning in enterprise computing systems. In: IEEE International Conference on Autonomic Computing. ICAC, pp. 74–83.
Lennart, L., 1997. System identification: theory for the user. Prentice-Hall, Inc.
Liu, X., Zhu, X., Singhal, S., Arlitt, M., 2005. Adaptive entitlement control of resource containers on shared servers. In: Proceedings of the IFIP/IEEE International Symposium on Integrated Network Management.
Lu, C., Abdelzaher, T.F., Stankovic, J.A., Son, S.H., 2001. A feedback control approach for guaranteeing relative delays in web servers. In: IEEE Real-Time Technology and Applications Symposium, pp. 51–62.
Lu, Y., Abdelzaher, T., Lu, C., Tao, G., 2002. An Adaptive Control Framework for QoS Guarantees and its Application to Differentiated Caching Services.
Lu, C., Lu, Y., Abdelzaher, T.F., Stankovic, J.A., Son, S.H., 2006. Feedback control architecture and design methodology for service delay guarantees in web servers. IEEE Transactions on Parallel and Distributed Systems, 1014–1027.
Magee, J., Kramer, J., 1996. Dynamic structure in software architectures. SIGSOFT Software Engineering Notes 21, 3–14.
McKinley, P.K., Sadjadi, S.M., Kasten, E.P., Cheng, B.H.C., 2004. Composing adaptive software. Computer 37, 56–64.
Narendra, K.S., Balakrishnan, J., 1993. Improving transient response of adaptive control systems using multiple models and switching. In: Conference on Decision and Control, vol. 2, pp. 1067–1072.


Narendra, K., Balakrishnan, J., 1997. Adaptive control using multiple models. IEEE Transactions on Automatic Control 42, 171–187.
Narendra, K.S., Driollet, O.A., 2000. Adaptive control using multiple models, switching, and tuning. In: Adaptive Systems for Signal Processing, Communications, and Control Symposium, pp. 159–164.
Narendra, K., Xiang, C., 2000. Adaptive control of discrete-time systems using multiple models. IEEE Transactions on Automatic Control 45, 1669–1686.
Narendra, K.S., Balakrishnan, J., Ciliz, M.K., 1995. Adaptation and learning using multiple models, switching, and tuning. IEEE Control Systems Magazine 15 (3), 37–51.
Narendra, K.S., Driollet, O.A., Feiler, M., George, K., 2003. Adaptive control using multiple models, switching, and tuning. International Journal of Adaptive Control and Signal Processing, 1–16.
Ogata, K., 2001. Modern Control Engineering, 4th ed. Prentice Hall PTR, Upper Saddle River, NJ, USA.
Padala, P., Hou, K.-Y., Shin, K.G., Zhu, X., Uysal, M., Wang, Z., Singhal, S., Merchant, A., 2009. Automated Control of Multiple Virtualized Resources.
Parekh, S., Gandhi, N., Hellerstein, J., Tilbury, D., Jayram, T., Bigus, J., 2002. Using control theory to achieve service level objectives in performance management. Real-Time Systems 23, 127–141.
Patikirikorala, T., Colman, A., Han, J., 2011. An off-the-shelf class library to implement self-managing control systems for adaptive software. Tech. rep., Swinburne University, http://www.ict.swin.edu.au/personal/tpatikirikorala/docs/TharinduCOTS.pdf.
Patikirikorala, T., Wang, L., Colman, A., Han, J., 2011. Hammerstein-wiener nonlinear model based predictive control for relative QoS performance and resource management of software systems. Control Engineering Practice 20 (1), 49–61.
Patikirikorala, T., Colman, A., Han, J., Wang, L., 2012. A study on managing the performance and resources of a business process engine using nonlinear and switching control system. Tech. rep., Swinburne University, http://www.ict.swin.edu.au/personal/tpatikirikorala/docs/TharinduBPSTechRep.pdf.
Salehie, M., Tahvildari, L., 2009. Self-adaptive software: landscape and research challenges. ACM Transactions on Autonomous and Adaptive Systems 4, 1–42.
Solmaz, S., Akar, M., Shorten, R., 2006. Online center of gravity estimation in automotive vehicles using multiple models and switching. In: Control, Automation, Robotics and Vision, ICARCV'06, pp. 1–7.

Solomon, B., Ionescu, D., Litoiu, M., Mihaescu, M., 2007. A real-time adaptive control of autonomic computing environments. In: Proceedings of the 2007 Conference of the Center for Advanced Studies on Collaborative Research, CASCON'07, ACM, New York, NY, USA, pp. 124–136.
Sykes, D., Heaven, W., Magee, J., Kramer, J., 2008. From goals to components: a combined approach to self-management. In: Proceedings of the 2008 International Workshop on Software Engineering for Adaptive and Self-Managing Systems, ACM, pp. 1–8.
Wang, Z., Zhu, X., Singhal, S., 2005. Utilization and SLO-based Control for Dynamic Sizing of Resource Partitions.
Wang, L., 2009. Model Predictive Control System Design and Implementation Using MATLAB. Springer Publishing Company, Incorporated.
Youbin, P., Vrancic, D., Hanus, R., 1996. Anti-windup, bumpless, and conditioned transfer techniques for pid controllers. IEEE Control Systems Magazine 16 (4), 48–57.
Zhai, J., Fei, S., Da, F., 2006. Intelligent control using multiple models based on on-line learning. Journal of Control Theory and Applications 4, 397–401.
Zhu, X., Wang, Z., Singhal, S., 2006. Utility-driven workload management using nested control design. In: American Control Conference, p. 6, http://dx.doi.org/10.1109/ACC.2006.1657688.
Zhu, X., Uysal, M., Wang, Z., Singhal, S., Merchant, A., Padala, P., Shin, K., 2009. What does control theory bring to systems research? ACM SIGOPS Operating Systems Review 43, 62–69.

Tharindu Patikirikorala is with the Faculty of Information and Communication Technology, Swinburne University of Technology, Australia.

Alan Colman is with the Faculty of Information and Communication Technology, Swinburne University of Technology, Australia.

Jun Han is with the Faculty of Information and Communication Technology, Swinburne University of Technology, Australia.

Liuping Wang is with the Faculty of Electrical and Computer Engineering, Royal Melbourne Institute of Technology, Australia.