Journal of Computer and System Sciences 81 (2015) 258–287
Locks: Picking key methods for a scalable quantitative analysis ✩

Christel Baier a, Marcus Daum a, Benjamin Engel b, Hermann Härtig b, Joachim Klein a, Sascha Klüppelholz a,∗, Steffen Märcker a, Hendrik Tews b, Marcus Völp b

a Institute for Theoretical Computer Science, Technische Universität Dresden, 01062 Dresden, Germany
b Operating-Systems Group, Technische Universität Dresden, 01062 Dresden, Germany
Article history: Received 24 May 2013; Received in revised form 13 January 2014; Accepted 18 May 2014; Available online 23 July 2014

Keywords: Probabilistic model checking; Measure-based quantitative analysis; Low-level operating system code; Test-and-test-and-set spinlock; Conditional long-run probabilities; Quantile-based queries; Symmetry reduction
Abstract

Functional correctness of low-level operating-system (OS) code is an indispensable requirement. However, many applications also rely on quantitative aspects such as speed, energy efficiency, resilience with regard to errors and other cost factors. We report on our experiences of applying probabilistic model-checking techniques for analysing the quantitative long-run behaviour of low-level OS code. Our approach, illustrated in a case study analysing a simple test-and-test-and-set (TTS) spinlock protocol, combines measure-based simulation with probabilistic model checking to obtain high-level models of the performance of realistic systems and to tune the models to predict future system behaviour. We report how we obtained a nearly perfect match of analytic results and measurements and how we tackled the state-explosion problem to obtain model-checking results for a large number of processes where measurements are no longer feasible. These results gave us valuable insights into the delicate interplay between lock load, average spinning times and other performance measures.

© 2014 Elsevier Inc. All rights reserved.

✩ This work was in part funded by the German Research Council (DFG) through the project QuaOS, the collaborative research centre 912 Highly Adaptive Energy-Efficient Computing (HAEC) and the cluster of excellence Centre for Advancing Electronics Dresden (cfAED), through the DFG/NWO project ROCKS, the EU-FP7 project MEALS (295261) and by the EU and the State of Saxony through the ESF young researcher groups IMData (100098198) and SREX (100111037).
∗ Corresponding author.
1. Introduction

The profile of requirements for operating-system code is manifold and ranges from functional correctness providing guarantees on the reliability of OS-primitives to a variety of non-functional, quantitative properties. Indeed, many application programs rely on good performance of the OS-code with respect to quantitative aspects such as speed, energy consumption, and other cost factors. Worst-case execution-time analysis (see e.g. [1–3]) is able to provide guarantees for hard timing constraints, but only in the form of upper bounds on the execution times of all involved components, which hold even in the most extreme situations. Results on the worst-case timing behaviour are crucial for time-critical applications, but often they are too pessimistic as they do not provide any information on the average performance. Many computer systems are not time-critical, or at least contain components that are not time-critical.
Their usefulness in practice also relies on a series of soft constraints that typically focus on good average performance over time, while hazarding expensive computations in rare cases. Indeed, complex computer architectures such as x86 are optimised according to their average performance. Some systems, for example, integrate imprecise real-time computing techniques to deal with transient overload [4,5]. A major challenge is to find a reasonable balance between different cost factors that takes application-specific demands into account, for example average or maximum energy consumption or performance metrics such as latency. For instance, depending on the application and the expected frequency and impact of hard- or software failures, the integration of expensive, sophisticated fail-safe mechanisms might be required or might be considered unnecessary. For another example, depending on the type of competing processes, the use of a simple test-and-test-and-set (TTS) lock implementation might yield better performance characteristics than more advanced and complex ticket [6] or queue-based locks [7].

Within a joint research project, the formal-methods group and the operating-systems group at the Technische Universität Dresden cooperate to establish quantitative properties of low-level operating-system code using probabilistic model checking. By low-level OS code, we mean drivers, the kernel of monolithic operating systems, microkernels or microhypervisors and similar code that directly interacts with hardware devices and that is therefore often optimised to fully exploit the intrinsic behaviour of modern processor architectures. The ultimate aim of our joint project is to predict quantitative properties for hardware which is not yet on the market, for instance for x86 CPUs with 50–100 cores. While measurement-based techniques are obviously impossible for such future hardware, model-based quantitative analysis, such as probabilistic model checking, is potentially applicable and might provide evidence on how well certain pieces of OS code work on future hardware or determine the influence of certain hardware features (such as the interconnect of the CPU) on various performance measures. Such results on formal models may guide the design of adaptive OS-primitives that switch between implementation variants with the same functionality but different performance characteristics.

1.1. Contribution

Many researchers have performed case studies with probabilistic model checkers for mutual exclusion protocols and other coordination algorithms for distributed systems (see, for example, the list of case studies performed with the Markov chain engine of the probabilistic model checkers PRISM [8–12], MRMC [13], or the CADP toolbox for performance evaluation [14,15]). While some of these case studies address the analysis of randomised protocols, we deal with non-randomised operating-system primitives. The quantitative analysis of OS-code by measurements is standard, and so is the algorithmic formal analysis of Markovian models for mutual exclusion protocols, resource management algorithms and the like, with stochastic distributions modelling the workload. Our approach is based on probabilistic model checking as well as measure-based techniques.
The novelty of our approach is the methodology of using measurement-based simulation techniques for the fine-tuning of the abstract stochastic model to be analysed by probabilistic model checking (see Section 1.4). We are not aware of experiments that have been carried out where the modelling and evaluation process was accompanied by measure-based techniques in the same way. In this sense, our approach complements related work that incorporates reasonable but idealised stochastic assumptions, e.g., [15], where a comparative study of mutual exclusion protocols is presented.

In this paper, we illustrate our approach and the scalability of our analysis for low-level operating-system code by means of a case study with a simple test-and-test-and-set (TTS) spinlock protocol. For this we increased the number of processes using the TTS spinlock beyond the point where the TTS spinlock saturates and would hence no longer be used in practice on current hardware systems. In this sense, we see the TTS spinlock here as an example to illustrate our approach, investigate its scalability, and to report on the type of problems (see Section 1.3) that we had to address and that should be expected when studying more interesting low-level OS-protocols. We would also like to stress that our case study makes no claim to be an exhaustive investigation of state-of-the-art probabilistic model checking techniques to tackle the state-explosion problem. Instead, we used selected state-of-the-art techniques and tools. In doing so, we identified some deficits in the type of quantitative queries supported by prominent probabilistic model checkers (see below).

Another contribution of this paper relates to the type of properties that are analysed. Formal quantitative analyses often consider worst-case execution times only, but for a TTS lock typical performance measures of interest are the average spinning time or the probability that a waiting process will acquire the lock within a fixed time interval in the long run. In this context the long-run probabilities play an important role, as these probabilities refer to situations where the system has already reached its steady state, rather than to the initial phase. Another important class of properties are quantile properties, where we are interested in the minimum (or maximum, respectively) amount of time such that some (long-run) probability reaches a certain threshold. Both conditional long-run and quantile properties are of general interest for quantitative system analysis regarding energy consumption and other cost functions. To the best of our knowledge, there are no case studies that consider conditional steady-state probabilities or quantile-based queries. As there is hardly any tool support for either conditional steady-state probabilities or quantile-based queries, in this paper we report on the necessary extensions of the prominent probabilistic model checkers PRISM [8] and MRMC [13].
1.2. Summary of the case study with the TTS spinlock

Before addressing the challenges of the case study, we provide a brief summary of the main steps. We first identified a list of performance measures that are most significant from a practical point of view. The quantitative analysis was first carried out for small numbers of competing processes with both model-checking techniques and measurements. The measurements supported the model generation by stepwise refinement to represent cache effects in an adequate way, without zooming into the details of the complicated behaviour of caches, which have been studied extensively, e.g., in [16,17]. With this refinement-based approach we obtained a stochastic model (finite-state Markov chain) for the spinlock protocol such that for the selected quantitative properties, the results obtained by the probabilistic model checker almost perfectly match the results obtained by the measurements. We interpreted this as evidence in favour of the model and then addressed the challenge of applying probabilistic model checking for the spinlock protocol when increasing the number of processes into the hundreds and thousands. To combat the state-explosion problem, we could exploit the symmetric structure of the spinlock protocol, which treats all processes identically. This allowed us to apply the well-known concept of symmetry reduction for Markov chains [18–20], which turned out to be very efficient for the spinlock example. Among other things, the model-checking results for the symmetry-reduced model provide insights into the interplay between the lock load and the average spinning times (and other performance measures).

1.3. Challenges

Despite its simplicity, the spinlock protocol is well suited to illustrate the challenges that have to be addressed for the prediction of quantitative properties using probabilistic model checking. Given the wide range of application areas in which probabilistic model checking has been applied successfully, we expected probabilistic model-checking techniques to be applicable in principle. Nevertheless, we had to deal with several non-trivial problems that will be explained in the following paragraphs.

Modelling
The first challenge is to find a reasonable abstraction level for the formal model, given the probabilistic analysis that will be carried out. For example, there are several details of the realisation of hardware primitives (such as caches, busses and controllers of the memory subsystem) that have an impact on the timing behaviour of low-level OS code. The model must cover all features that dominate the quantitative behaviour, while still being compact enough for the algorithmic analysis. The latter requires abstracting away from details that have negligible impact on the quantitative behaviour or that would render the model unmanageable. The abstraction of many of these details is indispensable because only little information on the hardware realisation is available, and even if these details are known, too fine-grained hardware models aggravate the state-explosion problem and lead to quantitative results that are too hardware-specific. Instead, we incorporate hardware timing effects in the distributions for the execution times and use measurement-based simulation techniques to obtain empirical evidence for the models and the model-checking results. The methodology we applied to adjust the model and to align the model-checking results and measurements will be sketched in Section 1.4.
Another crucial question is the type of stochastic model to be used for the formal analysis. Exponential distributions are widely used as a stochastic estimate for timing behaviour (execution times, delays, and so on) when the mean value is known (see e.g. [21–23]). However, the memoryless property of exponential distributions is often too simplistic an assumption and unrealistic. This, for instance, applies if the execution times of certain activities are known to lie in a compact interval, as is the case for the spinlock protocol considered in this paper. The use of continuous-time distributions with compact support, such as uniform distributions over a given time interval, might yield more realistic stochastic models. However, the algorithmic analysis for them is known to be much harder and tool support for complex system properties is rare. Statistical model checking [24,25] is an option, but it is best suited for reasoning about finite-horizon or time-bounded properties, while we are mostly interested in the long-run behaviour. For these reasons, we use discrete clock variables and custom distributions with finitely many sampling points obtained by measurements. This leads to a finite-state discrete-time Markov chain as operational model for the spinlock protocol.

Scalability
A drawback of the discretisation approach is that the state-explosion problem is not only caused by the number of processes that compete for the lock, but also by the range of the discrete clock variables. Indeed, the size of the model directly depends on the distributions that model the timing behaviour, especially on the maximal value of the discrete clock variables. Several model checkers for Markov chains have integrated sophisticated methods to tackle the state-explosion problem. However, the application of these advanced techniques might not be straightforward, e.g., due to syntactic restrictions on the models imposed by the implementations. We will report on such difficulties and our approach to symmetry reduction in more detail in Section 6.

Measurement-based simulation
Providing measurements for the fine-tuning of the stochastic model is difficult, because the quantities of interest are in a range where measurements significantly disturb the normal system behaviour and where instrumentation-induced noise blurs the results. For instance, the update rate and resolution of CPU-internal energy sensors necessitate a statistical analysis over a multitude of measurements to extract an energy profile for a single system call [26]. To counteract these effects and to obtain empirical evidence for the models and model-checking results, we construct microbenchmarks that place the to-be-measured code into a manageable environment and that mimic the formalisation as closely as possible.
Quantitative properties
A third major step is the identification of the types of quantitative properties that are relevant for low-level OS code. At first glance, it seems that quantitative queries such as “What is the probability for threads to find the requested resource locked for longer than 1 microsecond?” can be expressed as probabilistic queries of the form P=?(ϕ) using comparably simple patterns of path formulas ϕ in standard temporal logics, such as probabilistic computation tree logic (PCTL) [27]. (The notation P=?(ϕ) refers to the probability for the event specified by ϕ.) However, the main interest is in deducing probabilities of this kind for the long run rather than for the (distribution specifying the) initial system configuration. Typically, the long-run behaviour of programs shows fundamentally different characteristics when compared to the behaviour in the initialisation phase. For instance, resource management protocols might show a different behaviour when shared resources are requested for the first time than in the long run, i.e., when the shared resources have been used by all processes several times. These differences are caused by the fact that the system has had time to learn and adjust to the program characteristics, e.g., by warming up the disk or processor caches or by adjusting the scheduling parameters of the program to meet its responsiveness and interactivity demands.

For finite-state Markov chains, methods to compute the steady-state (or long-run) distribution are well known (see e.g. [21,28,23]) and implemented in many probabilistic model checkers. These provide for each state s the portion of time that the model spends in state s when time tends to infinity. However, queries for questions of the above type ask for long-run probabilities of path formulas under the condition that the system is in a certain set of states. The above question is indeed only of interest for those states in the long run where some thread has just requested the shared resource. That is, instead of standard steady-state distributions, we need conditional steady-state distributions that we take as the basis for answering the above query. We refer to this type of query as conditional long-run queries. From a mathematical point of view, the algorithmic treatment of conditional long-run queries in finite-state discrete-time Markov chains is obvious, as long as the path properties are specified using standard logics (LTL or PCTL/PCTL* [27,29–31]). However, we are not aware of any probabilistic model checker that provides direct support for conditional long-run queries.

A second class of important queries for low-level OS code asks for the value of a quantity that is not exceeded in the majority of all cases: the quantile. Two examples of important quantile-based queries are “How long does a thread wait for a resource in 99.9% of all cases?” and “What is the energy that must remain in the battery to guarantee the complete playback of a certain video with a probability greater than 95%?” To our surprise, quantile-based properties have not been addressed so far in the context of probabilistic logics or probabilistic model checkers. Only recently, and motivated by our project, algorithmic and complexity-theoretic aspects of quantiles in discrete Markovian models have been studied in [32,33].
In the context of the case study in this article, we extended the model checker PRISM [8] by implementing algorithms for the calculation of conditional long-run queries and quantile queries of the form arising in the case study. Our approach to achieving better scalability via symmetry reduction relies on a custom generator for the symmetry-reduced model and utilises the MRMC model-checking engine [13] for carrying out the actual computations. We have likewise extended MRMC to perform the calculation of conditional long-run queries of the form considered in this article.

1.4. Aligning the model-checking results and measurements

Clearly, the results obtained by a probabilistic model checker can only be as good as the probabilistic model itself. If the probabilistic model checker reports that a certain path property holds with probability 0.9, then up to the tolerance guaranteed by the model checker for numerical computations, this is an absolute truth for the model and the path property (assuming the correctness of the probabilistic model checker). However, measurement-based methods performed with the concrete system might show a different behaviour. According to the measurements, the path property might hold, e.g., only in 80% or even 95% of the cases. To avoid such discrepancies between the model and the system, we use the following approach:

1. The targeted operating-system code is formalised at a suitable level of abstraction in the input language of a probabilistic model checker.
2. The parameters of the model, especially the timing behaviour and the probability distributions for state transitions, are determined with the help of measurements in the targeted setting. If this is not possible due to unduly high interference with the instrumentation code, we extract the relevant code and measure it in the form of a microbenchmark in a controlled environment.
3. The queries of interest are evaluated both by the probabilistic model checker and with a microbenchmark that again executes the code in a manageable environment.

The model is viewed as satisfactory when the model-checking results and the measurements agree up to the required precision. If there are significant differences, their cause must be found and corrected. As we report later in this paper, such differences might stem from an incorrect measurement or from small differences between the behaviour of the model and the real system.
Fig. 1. Time-abstract control flow graphs for the spinlock protocol.
An important point in our approach is the measurements in steps 2 and 3. As known in the operating-systems community [34,35], it is a challenging and time-consuming task to do these measurements right. The key points are to measure a sufficiently general case (instead of an abnormal special case) and to preserve the original timing behaviour when inserting the instructions for the measurement. We devoted much effort to ensuring that our measurement data is reliable and correct, see Section 3.

1.5. Outline

Section 2 presents the spinlock and its relevant quantitative properties. Section 3 discusses the challenges for measuring these properties on real hardware. Section 4 introduces the necessary notions for discrete-time Markov chains and defines the operators needed for our spinlock properties. Section 5 presents our spinlock model and the formalisation of the properties, and discusses several refinements of the model and the measurement technique that finally ensure the applicability of the spinlock model. Section 5 also presents the results that we obtained with the model checker PRISM (in Subsection 5.6). In Section 6, we describe our application of symmetry reduction to our model and discuss the results and the improved scalability that we achieved using the model checker MRMC for the reduced model. Section 7 concludes this paper. Preliminary versions of the material presented in this paper were published at the workshops FMICS’12 and SSV’12 [36,37]. At http://wwwtcs.inf.tu-dresden.de/ALGI/PUB/JCSS-Locks/, we provide the models, as well as the enhancements to PRISM and MRMC for the calculations in this article.

2. A test-and-test-and-set lock

We evaluate our novel measurement-assisted approach for the formal quantitative analysis of low-level operating system code using the example of a simple test-and-test-and-set (TTS) lock. Although TTS locks are quite fundamental, they are still used in low-contention scenarios where they offer unique properties that are difficult to reproduce in more complicated variants. For example, allowing contenders for the lock to retract their request is trivial in a TTS lock, whereas such operations are difficult if not impossible in fair variants such as ticket locks or queue-based locks. We expect to see many of the aspects that we discuss in this paper also for other locks and are confident that our approach extends to these protocols and to protocols from other domains.

2.1. Abstract behaviour of spinlocks

Throughout the paper we consider n parallel processes that compete for the same resource and synchronise themselves with a test-and-test-and-set lock. For brevity, we will often refer to this lock simply as spinlock. Fig. 1 shows the abstract control flow graphs of the n processes (Fig. 1(a)) and the lock (Fig. 1(b)). Each of the processes has a critical and a noncritical section (locations criti and ncriti, respectively). Before entering the critical section a process might have to wait before it can acquire the lock (location waiti). When leaving the critical section the lock is released. We model the spinlock by n + 1 locations: location unlock indicating that the lock is free, and one location locki for each process P i indicating that either process P i is in its critical section and has acquired the lock or process P i will acquire the lock and enter its critical section in the next step.
The time-abstract control flow graphs in Fig. 1 will be refined in Section 5 by adding discrete clock variables and distributions that model timing assumptions for the processes. In the Markov chain model for the system composed of the spinlock and the processes P 1, . . . , P n, the location switches and time passage in the locations will be modelled by discrete steps synchronised over all processes and the spinlock. Uniform probabilistic choices will be used to resolve the competition between several waiting processes when the lock is free or released. In the systems we consider, uniform probabilistic choices adequately characterise the behaviour of the interconnect for single-socket systems, which controls the order in which the write operations of the individual cores happen. Even multi-socket systems try to maintain a balanced access scheme to ensure progress of the individual cores on the different sockets.
Fig. 2. Simple TTS spinlock.
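The code of Fig. 2 is not reproduced in this text-only version; the following C sketch is merely consistent with the description in Section 2.2 (atomic swap in Line 3, read-only spinning in Line 4, release in Line 7) and may differ in detail from the original figure.

```c
/* Sketch of a simple TTS spinlock, consistent with the description in
 * Section 2.2; not necessarily identical to the code shown in Fig. 2. */
volatile int occupied = 0;                 /* shared lock word */

void lock(void)
{
    /* Line 3: atomic swap (test-and-set); retry while the previous value was true. */
    while (__atomic_exchange_n(&occupied, 1, __ATOMIC_ACQUIRE)) {
        /* Line 4: spin with plain reads only, to avoid unnecessary
         * contention on the core-to-core interconnect. */
        while (occupied)
            ;
    }
}

void unlock(void)
{
    /* Line 7: release the lock by resetting occupied to false. */
    __atomic_store_n(&occupied, 0, __ATOMIC_RELEASE);
}
```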
2.2. Test-and-test-and-set lock implementation

Fig. 2 shows the C/C++ code of a TTS lock. To acquire the lock, the requesting process executes the atomic swap operation in Line 3 to atomically read the value of the shared variable occupied and to then set it to true. The loop exits if the process was the first to perform this swap after another process has released the lock by setting occupied to false. As long as occupied is true, the process only reads this variable in Line 4 to avoid unnecessary contention on the core-to-core interconnect.

Let us take a closer look at the timing behaviour of one of the processes. The time a process spends in its non-critical section varies, depending on the state of the application. It will therefore be modelled with a suitable distribution. When the process tries to acquire the lock, it first executes the atomic swap operation (Line 3 in Fig. 2). Its execution time depends on the location of the cacheline that contains the variable occupied and how far it has to travel over the internal interconnect of the processor. If this first swap was unsuccessful, the process spins in the nested loop in Lines 3–4 until it can acquire the lock. The time needed for releasing the lock depends on whether other processes are spinning. If no other process is spinning, the variable occupied is still in the core-local cache and the assignment in Line 7 is very fast. Otherwise, the cacheline must first be transferred into the local cache, which takes considerably longer.

For modelling the spinlock we will simplify the timing behaviour and distinguish only between the spinning time, the critical section and the interim time. The spinning time starts with the first atomic swap operation when the process tries to acquire the lock and ends with the last (successful) atomic swap when the process has obtained the lock. For the model, the critical section starts after the successful atomic swap and ends when the lock has been released. The remaining time until the next lock acquisition is the interim time. In accordance with typical operating-systems code, we assume here that the length of the critical section is more or less constant, while the interim time varies [38]. Further, critical sections are typically very short in comparison to the times when no lock is held. For the distribution of the length of the interim section we draw inspiration from a video decoding example. For video decoding, the different frame types (I- and P-frames) lead to clusters of the interim time in certain small intervals.

We model time with discrete clock variables that are decreased with each tick, where tick is a global action that is performed synchronously by all processes in the model. The clock variables are initialised at suitable places from the chosen distributions. We therefore use discretisations of these distributions with finitely many sampling points. For the (constant) critical section we use a Dirac distribution with a single sampling point. For simplicity, the distribution for the interim time has between 2 and 4 sampling points. To keep the size of the model small, we introduce a scaling factor that describes the relationship between time units in the model and processor cycles in reality. A scaling factor of 1000 means that 1 time unit corresponds to 1000 processor cycles. A smaller scaling factor increases both the precision and the size of the model. In this article we use two scaling factors: 1000 and 200.
The modelled lock-acquisition pattern allows for the derivation of results about the common-case behaviour of applications when they use a certain operating-system functionality, but also gives rise to extreme-case analyses where one assumes that malicious or erroneous applications attack the operating system. For example, by setting the interim time to the execution time of a system call minus its critical sections, it is possible to deduce the contention of locks that protect these sections under the assumption that malicious applications invoke this system call as fast as they can.

2.3. Selected performance measures

For our case study, we identified the following four representative performance measures that are most relevant to obtain useful insights into the quantitative behaviour of the spinlock protocol:

(A1) probability that a process finds a free lock when it seeks to acquire this lock;
(A2) probability of acquiring a lock without spinning after releasing the lock, without other processes having acquired the lock in the meantime;
(A3) expected amount of time a process waits for a lock;
(A4) the 95% quantile of the time processes wait for a lock.
In addition, we consider the following variants of (A2) and (A3):

(A2′) probability of acquiring a lock twice without spinning, permitting other processes to hold the lock in the meantime;
(A3′) expected amount of time a process waits for a lock under the condition that it has to wait for the lock.

As stated in the introduction, the formalisation of all performance measures in Section 5.2 will rely on conditional long-run probabilities. Performance measures such as (A1)–(A4) are highly relevant to guide design decisions and optimisation of low-level OS code. For instance, high probabilities in (A1) and (A2) justify the use of less complex lock implementations and of simpler execution-time analyses, respectively. An analysis which assumes low fixed costs for acquiring and releasing a lock is justified by a high probability of (A2) because cache eviction of the occupied variable is unlikely. In case of contention, the probability of (A2) will quickly drop to a value close to 0 because the likelihood that another process could acquire the lock in the meantime increases. In this situation, (A2′) reveals the probability that a process has to wait for other processes. In case of contention, an execution-time analysis can no longer assume low fixed costs for acquiring a lock, but a high value of (A2′) still justifies optimistic assumptions about how other processes influence the execution time of the lock-acquiring process. The expected waiting time (A3) is important to judge whether a lock implementation is suitable for the common cases of a given scenario. In the case of busy-waiting, such as for the spinlock in this article, the waiting time also correlates with a high level of energy consumption. (A3′) gives further insights into the expected waiting time if, at the time of acquisition, the lock is still held by another process. In the case of low contention on the lock, this fact is concealed in (A3) by the waiting time of the large number of processes that obtain the lock immediately, that is, without waiting. The quantile in (A4) replaces the worst-case lock acquisition time in imprecise real-time systems [4,5] and in systems with a fail-safe override in case of late results. It returns an upper bound t for the lock-acquisition time that will be met with the specified probability (here 95%).

3. Measurements

Spinlocks similar to the one we consider here are used, for instance, in Linux. They usually protect very short critical sections with a low likelihood of contention. For such cases, other locking or synchronisation schemes have a much bigger overhead. It is, however, cumbersome to measure spinlocks in complex operating systems. Depending on the state of the caches, the necessary instrumentation code can run several times longer than the complete critical section. This makes precise measurements very difficult. We therefore decided to implement and measure an isolated spinlock, because we can then arrange the instrumentation code such that it has almost no influence on the spinlock itself. The precise measurement of the execution time of such small code fragments on today’s highly optimised CPU architectures is a challenging task [34,35], posing various problems. The most important problem is that the instructions necessary for measuring (i.e., reading the time stamp counter and writing it into main memory eventually) can significantly interfere with the code to be measured.
For measurements on multiple cores, the second problem is that there is no hardware support for starting a program at exactly the same time on several cores. Any operating system has a small but unpredictable influence on the timing behaviour of programs. The measurement must therefore be performed without an operating system. But even without an operating system, the BIOS on x86 platforms performs sporadic management tasks (system management mode). As a last point, we note that there is no guarantee that the time stamp counters increase at the same rate on different cores. Therefore, clocks may run at different speeds on different cores. Luckily, in our experiments, the time stamp counters were running at apparently the same speed.

Current x86 CPUs also have properties that simplify measurements, at least in our case. Minimal speed differences in the execution of different cores and the arbitration implemented in hardware in the cache coherence protocol create uncertainty about which core gets the cacheline next when two or more cores regularly compete for the same cacheline. We exploit this behaviour for the choice of the winner when several CPU cores compete for the lock and model it as a uniform probabilistic choice.

For the measurement we run a loop that acquires and releases the lock in parallel on n cores of a multicore CPU. The critical and uncritical sections mainly consist of a delay loop rep; dec %eax, which has a very regular timing behaviour. Besides the delay loop there is also the instrumentation code for storing time stamp counters in memory and a simple pseudo-random number generator. The latter permits the pseudo-random choice of different delays according to the chosen parameter distributions for the length of the critical and uncritical sections. We insert a flag into the spinning loop (see Line 4 in Fig. 2) in order to check whether or not the process was able to obtain the lock without spinning. We pass this flag in a processor register so as not to disturb the timing of the TTS code. The actual instrumentation consists of three reads of the per-core time stamp counter: before and after the invocation of the lock() function in Line 2 and after the unlock() function returns.
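As an illustration of the instrumentation just described, the following C sketch shows the shape of one core's acquire-release loop. The helpers lock(), unlock(), delay() and next_duration() as well as the buffer layout are placeholders; the actual measure program runs on bare hardware and differs in detail.

```c
#include <stdint.h>
#include <x86intrin.h>                 /* __rdtsc() */

extern void lock(void);
extern void unlock(void);
extern void delay(unsigned cycles);    /* calibrated rep; dec %eax delay loop */
extern unsigned next_duration(void);   /* pseudo-random pick from the chosen distribution */

#define ROUNDS 1000
static uint64_t before[ROUNDS], acquired[ROUNDS], released[ROUNDS];

/* One core's acquire-release loop with three time-stamp-counter reads per
 * round: before lock(), after lock() returns and after unlock() returns. */
void measure_loop(void)
{
    for (unsigned i = 0; i < ROUNDS; i++) {
        before[i]   = __rdtsc();
        lock();
        acquired[i] = __rdtsc();
        delay(1000);                   /* critical section, calibrated to 1000 cycles */
        unlock();
        released[i] = __rdtsc();
        delay(next_duration());        /* interim (non-critical) section */
    }
}
```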
We realised early variants of our measure program as a Linux user-level application. Even after disabling all obvious sources of interference and raising the priority of this application into the otherwise empty real-time priority band, the obtained data contained strange and inexplicable fluctuations. We therefore decided to run the measure program without an operating system. We linked the measure program to the bootstrap code of one of the microkernel operating systems developed at Dresden. This bootstrap code just initialises virtual memory and the CPU cores. Together with the measure program it functions as a degenerated kernel and is started by a standard boot loader, such as Grub.

We perform 10 measurements in a row, where in each measurement each core performs the acquire-release loop 1000 times. The first core that finishes this loop terminates the whole measurement. The measure data, which cannot be written to disk for lack of an operating system, is then copied via a serial console to a different computer. Such a row of measurements takes about 15 minutes, where most of the time is spent on the transmission of the measurement results over the serial console. The length of the critical section is calibrated to take 1000 CPU cycles (which are about 376 ns on the measurement system) when the lock is free. For unknown reasons, the necessary delay loop counter to achieve 1000 cycles depends on the number of initialised CPU cores. 1000 cycles correspond to 1 time unit for scaling factor 1000 and to 5 time units for scaling factor 200 in the model. The length of the interim time is calibrated in the same way via its delay loop to match the chosen distribution.

To start the measure program simultaneously on all cores we use a simple barrier, which works as follows. The first core writes the number of active cores into the barrier variable. Then, each core decrements the barrier once (with an atomic decrement) and subsequently polls the barrier variable. The measure program starts as soon as the barrier reaches zero. With such a barrier, the last core that decrements the barrier starts about 100 cycles earlier, because the cacheline with the barrier variable has to travel to the other cores before they can see the value zero. This head start is unproblematic, because the advance is smaller than the length of the critical section.

We performed the measurements on an Intel i7 920 quadcore machine at 2.67 GHz. The data we obtained is very regular in the sense that independent measurements show the same distribution of spinning times and the same likelihood of obtaining the lock without spinning. The only exception is that a few measurements contain one data point that shows an absurdly long critical section or interim time. We attribute these spikes to a system management interrupt that blocked one core during the measurement. After discarding the spikes, the measurement is stable and reproduces the same values. The standard deviation for the performance measure (A1) is, for instance, below 0.7%. The effort for precise measurements is easily underestimated. It took a skilled operating-systems programmer from our group about 5 person-months to produce the measurements that are contained in this paper.

4. Preliminaries: discrete-time Markov chains

We model the spinlock protocol by a discrete-time Markov chain (DTMC) with a reward function that serves to reason about the spinning time, i.e., the sojourn time in the location waiti after the first trial to enter the critical section. The reasons why we have chosen a DTMC model are manifold. The clear demand for probabilistic guarantees about the long-run behaviour requires a model where steady-state probabilities are mathematically well defined and supported by model-checking tools. This, for instance, rules out probabilistic timed automata and other stochastic models with nondeterminism (e.g., Markov decision processes). Continuous-time Markov chains are not adequate given that the distributions specifying the durations of the critical and interim sections are not exponential.
Approximations with phase-type distributions lead to large and unmanageable state spaces. From a mathematical point of view, we could use semi-Markovian models with continuous-time uniform distributions, but we are not aware of tools that provide engines for the queries to compute the performance measures (A1)–(A4) of Section 2.3. For these reasons, we used discrete probability distributions with a finite number of sampling points to specify the execution times of the critical sections and the interim times.

We now briefly summarise the relevant concepts and explain our notations for DTMCs. For further details, we refer to standard textbooks, see e.g. [23,21]. A (probabilistic) distribution on a countable set X is a function μ : X → [0, 1] such that ∑_{x∈X} μ(x) = 1. The value μ(x) is called the probability for x under μ. The support supp(μ) of μ consists of all elements x ∈ X with positive probability under μ, i.e.,

supp(μ) = { x ∈ X : μ(x) > 0 }.

μ is called a Dirac distribution if its support is a singleton. If C ⊆ X then μ(C) = ∑_{x∈C} μ(x). We write μ = [x1, . . . , xn] for the uniform distribution μ(x1) = . . . = μ(xn) = 1/n and μ = x as shorthand for the Dirac distribution μ = [x]. In our approach, a DTMC is a tuple M = (S, P, μinit, rew) where

• S is a finite state space,
• μinit : S → [0, 1] is a distribution, called the initial distribution,
• P : S × S → [0, 1] is a function, called the transition probability matrix, and
• rew : S → N is the reward function.

For the transition probability matrix it is required that for each fixed state s ∈ S, P(s, ·) is a distribution on S, i.e., ∑_{u∈S} P(s, u) = 1 for all s ∈ S. The states in the support of P(s, ·) are called successor states of state s. Intuitively, μinit(s) is the probability for state s being the first state of a sample run, while P(s, u) is the probability to move from s to u within one (time) step. The states in supp(μinit) are called initial states. The intended meaning of the reward function is that whenever state s is entered, reward rew(s) will be earned.
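As a purely illustrative instance of these definitions (the numbers are hypothetical and not taken from the case study), consider a two-state DTMC:

\[
S=\{s_0,s_1\},\qquad \mu_{\mathit{init}}(s_0)=1,\qquad
\mathbf{P}=\begin{pmatrix}0.9 & 0.1\\ 0.4 & 0.6\end{pmatrix},\qquad
\mathit{rew}(s_0)=0,\ \mathit{rew}(s_1)=1,
\]

with the rows and columns of P indexed by s0, s1. Each row of P sums to 1, so P(s, ·) is indeed a distribution for every state s; s0 is the unique initial state, and reward 1 is earned whenever s1 is entered.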
A path in M is a finite or infinite sequence π = s0 s1 s2 . . . of states with P(si, si+1) > 0 for all i. Let π(k) = sk be the (k+1)-st state, π↓k = s0 s1 . . . sk the prefix consisting of the first k+1 states of π and π↑k = sk sk+1 sk+2 . . . the suffix starting after the k-th state. We write Paths for the set of all infinite paths (i.e., Paths = S^ω) and Paths(s) for the set of infinite paths starting in state s. The accumulated reward for a finite path π = s0 s1 . . . sk is defined as

Rew(π) = rew(s0) + rew(s1) + . . . + rew(sk−1).

If π is a finite path then the cylinder set Cyl(π) spanned by π consists of all infinite paths π′ such that π is a prefix of π′. Using well-known concepts of measure theory, there exists a unique probability measure Pr^M on the σ-algebra generated by the cylinder sets of finite paths such that

Pr^M( Cyl(s0 s1 . . . sk) ) = μinit(s0) · ∏_{0≤i<k} P(si, si+1).
Recall that μinit is the initial distribution of M. We write Pr_s or Pr^M_s for Pr^{M_s}, where M_s arises from M by declaring s as the unique initial state, i.e., by replacing the initial distribution μinit with the Dirac distribution that assigns probability 1 to state s. For specifying measurable path events (i.e., sets of infinite paths that belong to the σ-algebra generated by the cylinder sets of finite paths), we use the standard notations of linear temporal logic (LTL) with the symbols ◯ (next), U (until) and ◇ (eventually) and time-bounded variants thereof. Let X and Y be sets of states, i.e., X, Y ⊆ S, and k ∈ N. Then:

◯X = { π ∈ Paths : π(1) ∈ X }
X U^{=k} Y = { π ∈ Paths : π(n) ∈ X \ Y for 0 ≤ n < k and π(k) ∈ Y }
X U^{≤h} Y = ⋃_{0≤k≤h} X U^{=k} Y,   X U Y = ⋃_{k∈N} X U^{=k} Y
To deal with query (A2), we will also use LTL-like formulas with cascades of until-operators and suppose here the standard LTL-semantics for paths. The notations ◇Y, ◇^{=k} Y and ◇^{≤h} Y are short forms of S U Y, S U^{=k} Y and S U^{≤h} Y, respectively. Sets of paths that are specified using such LTL-like notations with sets of states as atoms are indeed measurable. For further details see [39,29,30]. For h ∈ N, θ_h : S → [0, 1] denotes the state distribution for M after h steps. Formally, θ_0 = μinit is the initial distribution and θ_{h+1} = P · θ_h for h ≥ 0. Here, P is regarded as an S × S-matrix and θ_h as a vector. The function θ : S → [0, 1],

θ(s) = lim_{k→∞} 1/(k+1) · ∑_{h=0}^{k} θ_h(s),
is called the steady-state distribution for M. It is well known that almost all paths in M will eventually enter a bottom strongly connected component (BSCC) and visit each of its states infinitely often. (The formulation “almost all paths” means “with probability 1”.) Thus, the BSCCs determine the long-run behaviour and θ(s) > 0 iff s belongs to a BSCC that is accessible from some initial state. If C is a set of states with θ(C) > 0 and Π a measurable set of paths then the conditional long-run probability for Π (under condition C) is defined by:

P^M(Π | C) = ∑_{s∈C} θ(s)/θ(C) · Pr^M_s(Π)

Here, θ(s)/θ(C) is the conditional long-run probability for state s, again under condition C. The intuitive meaning of θ(s)/θ(C) is the portion of time spent in state s on long runs relative to the total time spent in states of C. With the factor Pr^M_s(Π), the above weighted sum represents the long-run probability for the event specified by Π under condition C. Note that P^M(Π | C) equals P^{M_C}(Π), where M_C is the DTMC (S, P, θ_C, rew) obtained from M by replacing the initial distribution μinit with θ_C given by θ_C(s) = θ(s)/θ(C) (the conditional steady-state distribution under condition C). Analogously, conditional long-run average values of random variables can be defined as weighted sums. (A3) will be formalised as an instance of the conditional long-run accumulated reward for reaching a goal set Y defined by:
R^M(◇Y | C) = ∑_{s∈C} θ(s)/θ(C) · ExpAccRew^M_s(◇Y)

where ExpAccRew^M_s(◇Y) denotes the expected accumulated reward for reaching Y from state s. It is defined by:

∑_{r=0}^{∞} r · Pr^M_s( { π ∈ Paths : ∃k ∈ N s.t. π ∈ ◇^{=k} Y ∧ Rew(π↓k) = r } )

Recall that π↓k denotes the prefix of π consisting of the first k + 1 states.
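To illustrate the conditional long-run probability defined above with purely hypothetical numbers: suppose C = {s1, s2} with steady-state probabilities θ(s1) = 0.02 and θ(s2) = 0.01, and let Π be a path event with Pr_{s1}(Π) = 0.9 and Pr_{s2}(Π) = 0.6. Then θ(C) = 0.03 and

\[
P^{M}(\Pi \mid C) \;=\; \tfrac{0.02}{0.03}\cdot 0.9 \;+\; \tfrac{0.01}{0.03}\cdot 0.6 \;=\; 0.6 + 0.2 \;=\; 0.8,
\]

i.e., in the long run the event Π holds in 80% of the time spent in C, even though the unconditioned steady-state weight θ(C) of C is only 3%.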
Fig. 3. Control flow graph of process P i .
5. Markov chain model for the spinlock protocol

In this section, we develop our DTMC model for the spinlock protocol and provide formalisations of the performance measures (A1)–(A4) in Section 5.2. We start with a preliminary version of the model that will be refined in several steps below. To model n processes P 1, . . . , P n that compete for a spinlock we use a DTMC that results as the synchronous parallel composition of one module for each of the processes P i from Fig. 3 (or one of its refinements) and one module representing the spinlock from Fig. 4. In the following, we present the control flow graphs of these modules in a graphical way. Each module consists of finitely many locations (local control states) and edges between the locations modelling the control flow. Edges are labelled by guards, i.e., Boolean conditions on locations and the values of clock variables, names of synchronisation actions and random or deterministic assignments for the clock variables. The syntax of guards follows the input language of the probabilistic model checker PRISM [8]. Edge labels with non-trivial guards are written as if-statements. Tautological guards and empty assignments are omitted.

Control flow graph for the processes
The control flow graph for process P i is shown in Fig. 3. Recall from Section 2 and Fig. 1 that location ncriti models the non-critical section, location waiti the spinning or waiting time and location criti the critical section, where P i holds the lock. We introduced an additional location starti to facilitate the initialisation of the discrete clock variable ti for the non-critical section. All processes P i synchronise on the action initialise to ensure that they all start in location ncriti at the same time. tick is the global action on which all processes and the lock synchronise. We use two distributions ν, γ on the positive integers with finite support:
• distribution ν for the interim time and
• distribution γ for the duration of the critical section.

As pointed out in Sections 2.2 and 3, we simplified the timing behaviour to distinguish only between spinning time, critical section time and interim time. Thus, the time of the critical section includes the time to acquire the lock (via an atomic swap). In our model, each transition corresponds to one time step and a process P i requests the lock by entering the location waiti. To model a timeless lock acquisition, we consider the transitions waiti → criti and criti → ncriti to belong to the duration of the critical section. Similarly, ncriti → waiti as well as the first loop (waiti ∧ ti = 0) → (waiti ∧ ti = 1) belong to the interim time, to simulate a timeless lock release. Therefore, we have to subtract two time units from the counters of the interim loop and critical loop, too. For example, if a process has an interim time of five time units, it has to take the interim loop three times. To achieve this, the operator random chooses a value according to a distribution and subtracts two time units. As a result, a process spins only if it stays for at least two time units in location waiti. Condition locki guarding the transition from the waiting location into the critical section refers to the lock process being in location locki. The other guards and assignments are the obvious ones for a process that is controlled by a discrete clock.

Control flow graph for the lock
Fig. 4 shows the control flow graph of the spinlock. It has one location locki for each process P i, modelling that P i holds the lock. Conditions waiti and criti in the guards of transitions in Fig. 4 refer to process P i’s current location. With the direct transitions between locki and lockk, lock ownership may be transferred without delay.
Fig. 4. Control flow graph of the spinlock.
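For illustration, the following C sketch shows how a discrete distribution with finitely many sampling points (such as ν or γ) and the random operator of the process modules, which subtracts the two time units covered by the timeless acquisition and release transitions, could be realised in a simulator. All names and the example sizes are hypothetical; this is not part of the PRISM model itself.

```c
#include <stdlib.h>

/* A discrete distribution with finitely many sampling points, e.g. the
 * interim-time distribution ν or the critical-section distribution γ.
 * Values and probabilities are illustrative placeholders. */
typedef struct {
    int    num_points;
    int    value[4];      /* sampling points, in model time units */
    double prob[4];       /* probabilities, must sum to 1         */
} distribution;

/* Inverse-transform sampling: pick one sampling point at random. */
static int sample(const distribution *d)
{
    double u = (double)rand() / RAND_MAX, acc = 0.0;
    for (int i = 0; i < d->num_points; i++) {
        acc += d->prob[i];
        if (u <= acc)
            return d->value[i];
    }
    return d->value[d->num_points - 1];   /* guard against rounding */
}

/* The 'random' operator of the process modules: choose a duration and
 * subtract the two time units accounted for by the timeless acquisition
 * and release transitions. */
static int random_minus_two(const distribution *d)
{
    return sample(d) - 2;
}
```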
DTMC for the composite model
The states of the DTMC M modelling the spinlock protocol have the form

s = ⟨ℓ1, . . . , ℓn, m, t1=b1, . . . , tn=bn⟩

where ℓi is the current location of process P i, m the current location of the lock and bi the current value of clock variable ti. Thus, if ℓi = ncriti then bi ∈ {0, 1, . . . , N} where N = max supp(ν), and similarly, if ℓi = criti then bi ∈ {0, 1, . . . , C} where C = max supp(γ). (Recall that the supports of ν and γ are supposed to be finite subsets of the positive integers.) For the control graphs of Fig. 3, bi = 0 if ℓi = waiti. Later, we will present refined control graphs for the processes where the range of the clock variables ti is {0, 1, 2} when P i is in location waiti. The initial distribution of M is a Dirac distribution for the state

sinit = ⟨start1, . . . , startn, unlock, t1=0, . . . , tn=0⟩.

The initialisation step requires synchronisation of the processes P 1, . . . , P n over action initialise, while the lock does not change its location. Thus, the transition probabilities from sinit are given by:

P( sinit, ⟨ncrit1, . . . , ncritn, unlock, t1=b1, . . . , tn=bn⟩ ) = ∏_{1≤i≤n} ν(bi)

In the sequel, we will use dot notation to refer to the components of states. Let s.P i denote the location of P i in state s, s.L the location of the spinlock in state s and s.ti the current value of the discrete clock variable ti in state s. We use propositional formulas over the locations and conditions on the values of t1, . . . , tn to characterise sets of states. For instance, criti is identified with the set {s ∈ S : s.P i = criti}, where S denotes the state space of the DTMC M for the composite system.

For those states s where some process P j has reached the end of its critical section and two or more processes are waiting, the switch of the lock location relies on a fair probabilistic choice. That is, if s.P j = crit j, s.L = lock j, s.t j = 0 and k = |{i ∈ {1, . . . , n} : s.P i = waiti}|, then for each process P i where s.P i = waiti we have Pr^M_s(◯locki) = 1/k. The precise transition probabilities from state s are given by P(s, u) = 1/k · ν(b), where u is one of the successor states of s when all processes and the lock synchronise on the action tick, with the lock switching from location lock j to locki and P j’s interim time being set to b when taking the edge from location crit j to ncrit j. That is, s.P i = u.P i = waiti, u.L = locki, u.P j = ncrit j and u.t j = b ∈ supp(ν). For each such state u, we have Pr^M_u(◯criti) = 1.

The spinlock model that we have presented so far will be refined in several details in the following subsection. The first refinement in Subsection 5.1 will be necessary in order to formalise the performance measures (A1)–(A4) that we are interested in. Later, we will explain further refinements that turned out to be necessary to model cache effects stochastically and to match the measure-based values for (A1)–(A4) with the values calculated by the probabilistic model checker.

5.1. First refinement: cache-agnostic model

Fig. 5 displays the refined control flow graph of the processes P i. For reasons that will become clear in Subsection 5.4 below, we call this refined model the cache-agnostic model. The clock variable ti, which has not been used in location waiti before, now serves as a spinning indicator in this location. The lock has just been requested in state s if s.P i = waiti and s.ti = 0. Process P i is said to be spinning if s.P i = waiti, s.ti ≥ 1 and s.L ≠ locki.
Fig. 5. Refinement of process P i for the cache-agnostic model (changes in bold).
Let us illustrate the behaviour of M when some waiting processes might enter its critical section with or without spinning by means of an example. Suppose n = 3 and that
s = crit1 , wait2 , wait3 , lock1 , t 1 =0, t 2 =0, t 3 =1 is the current state. In state s, processes P 1 releases the lock by firing the edge from crit1 to ncrit1 and the lock switches from location lock1 to lock2 and lock3 with equal probability. For processes P 2 and P 3 , action tick models time passage in their waiting locations. Hence, the successor states of s are:
s2 [b1 ] = ncrit1 , wait2 , wait3 , lock2 , t 1 =b1 , t 2 =1, t 3 =2 s3 [b1 ] = ncrit1 , wait2 , wait3 , lock3 , t 1 =b1 , t 2 =1, t 3 =2 where b1 ∈ supp(ν ) is the randomly chosen interim time for the noncritical activities of process P 1 starting in state s2 [b1 ] or s3 [b1 ]. The probabilities of the transitions from state s in M are given by:
P s, s2 [b1 ] = P s, s3 [b1 ] =
1 2
· ν (b1 )
From state s2 [b1 ], process P 2 will acquire the lock and enter its critical section in the next step without spinning. Thus, the behaviour of M in state s2 [b1 ] is deterministic, except for the probabilistic choice of the duration of P 2 ’s critical section starting in the successor states of s2 [b1 ]:
P(s2[b1], ⟨ncrit1, crit2, wait3, lock2, t1=b1−1, t2=b2, t3=2⟩) = γ(b2)

From state s3[b1], however, process P3 gets the lock and process P2's spinning phase starts in the next step. The location switch in state s3[b1] is deterministic. The transition probabilities are determined by the probabilistic choice of the length of P3's critical section:
P(s3[b1], ⟨ncrit1, wait2, crit3, lock3, t1=b1−1, t2=2, t3=b3⟩) = γ(b3)

5.2. Formalisation of performance measures (A1)–(A4)

The formal representation of performance measures (A1), (A2) and (A4) and the reward function used for the formalisation of (A3) will rely on the following Boolean conditions:
requesti =def waiti ∧ ti = 0
releasei =def criti ∧ ti = 0
spini =def waiti ∧ ti ≥ 1 ∧ ¬locki

Condition requesti characterises the set of states s in M where process Pi has just requested the lock, while releasei indicates that process Pi is just performing its last critical actions and the lock is to be released next. To compute the long-run average spinning time (query (A3)), we use a reward function rew_spini such that rew_spini(s) = 1 for each state s of the model where process Pi is spinning, i.e., spini holds. For all other states s, we have rew_spini(s) = 0. The relevant quantitative measures (A1)–(A4) of Section 2 now correspond to the following values.
(A1)  PM(ϕ1 | requesti)  where ϕ1 = ◯ locki
(A2)  PM(ϕ2 | releasei)  where ϕ2 = ◯(unlock U locki)
(A3)  RM(♦ locki | requesti)
(A4)  min{ t ∈ N : PM(♦^{≤t+1} locki | requesti) ≥ 0.95 }
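For illustration (ours, not the PRISM encoding used for the analysis), the Boolean conditions above and the reward function rew_spini can be written as simple predicates over the dictionary-based state encoding sketched earlier:

```python
def request(state, i):
    """P_i has just requested the lock: wait_i and t_i = 0."""
    return state["P"][i] == "wait" and state["t"][i] == 0

def release(state, i):
    """P_i performs its last critical actions; the lock is released next."""
    return state["P"][i] == "crit" and state["t"][i] == 0

def spin(state, i):
    """P_i is spinning: wait_i, t_i >= 1 and the lock is not lock_i."""
    return state["P"][i] == "wait" and state["t"][i] >= 1 and state["L"] != i

def rew_spin(state, i):
    """Reward 1 in every state where P_i spins, 0 otherwise (used for (A3))."""
    return 1 if spin(state, i) else 0

# Example: P_2 is spinning here, while P_3 has just requested the lock.
s = {"P": {1: "crit", 2: "wait", 3: "wait"}, "L": 1, "t": {1: 2, 2: 3, 3: 0}}
print(spin(s, 2), request(s, 3), rew_spin(s, 2))   # True True 1
```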
(A1), (A3) and (A4) refer to the conditional steady-state distribution for the condition that process Pi has just performed its first request operation. (A1) corresponds to the long-run probability under the condition requesti for the path event ϕ1 = ◯ locki, stating that in the next time step process Pi will win the race between the waiting processes, i.e., it will enter its critical section without spinning. (A3) stands for the long-run average spinning time from states where Pi has just requested its critical section. Replacing the condition requesti in (A3) with waiti ∧ ti = 1 ∧ ¬locki, we obtain the long-run average spinning time under the condition that the first attempt to acquire the lock was not successful. The quantile in (A4) corresponds to the minimal number t such that the long-run probability from the requesti-states for the path event ♦^{≤t+1} locki, stating that the lock will be granted to process Pi within t+1 steps, is at least 0.95. For the long-run probability of acquiring the lock again without interference by other processes, there are several reasonable formalisations. We can express the constraint that the lock released by Pi was not held by another process until Pi acquires it again as ϕ2 = ◯(unlock U locki) under the condition releasei, i.e., that Pi is about to release the lock in the next step. The variant (A2′), where Pi does not have to spin either and hence experiences low overhead, but where, in contrast to (A2), other processes may have held the lock between the critical sections of Pi, can be formalised as the event specified by the path formula
ϕ2′ = ◯(locki U (ncriti ∧ (¬spini U criti)))
given that Pi requests the lock, i.e., requesti holds. Note that both ϕ2 and ϕ2′ comprise nested temporal operators and thus are LTL formulas. Hence, the treatment of (A2) and (A2′) using ϕ2 and ϕ2′, respectively, is rather complex, since the standard treatment of LTL queries relies on a probabilistic reachability analysis of a product construction of a deterministic ω-automaton and the DTMC (see e.g. [31]). However, we can avoid the complexity blow-up implied by the product construction if we refine the control flow graph of Pi by duplicating the control loop ncriti → waiti → criti in order to differentiate between successive lock acquisitions. This modification allows us to replace ϕ2 and ϕ2′ with complexity-wise simpler reachability conditions. The duplication can be realised by introducing a Boolean variable b that flips its value after Pi leaves its critical section (see Section 5.7 below). This modification is justified by using an appropriate notion of bisimulation. The control flow graphs for the lock and the other processes remain unchanged. Instead of ϕ2 and ϕ2′ we can then deal with
ψ2 = (locki ∨ unlock) U (criti ∧ b)
ψ2′ = ¬spini U (criti ∧ b)

under the conditions releasei ∧ ¬b and requesti ∧ ¬b, respectively.

5.3. Avoiding cascade effects

After designing the spinlock model and setting up the measurement, we started to compare the results obtained with both methods in order to check the applicability of our model. In an attempt to avoid all possible complications, we performed some experiments with fixed execution times, namely 1000 CPU cycles for the critical sections and 5000 CPU cycles for the interim time. In our DTMC model, this corresponds to the Dirac distributions given by γ(1) = 1 and ν(5) = 1 when using scaling factor 1000. As expected, for up to 6 processes, the model checker computes probability 1 for obtaining the lock without the need to spin in the long run (property (A1)). After the first round, the processes always take the lock in a fixed order, one after the other. This observation has a simple mathematical explanation: without any variability in the durations of the critical sections and the interim times, the DTMC degenerates to an almost deterministic transition system. It has some probabilistic choices in the initialisation phase that resolve the competition between several waiting processes when the lock is or becomes free, but its bottom strongly connected components are simple cycles, one for each order of the processes. Thus, along each infinite path of the DTMC, the processes will enter their critical sections in some fixed order. In contrast, this cohort effect is not observable in the measurement-based simulations. Instead, the measurement yields that the relative frequency for waiting processes to obtain the lock directly is only about 0.95. In 5% of the cases, waiting processes do not immediately get the lock and have to spin. This measurement-based observation can be explained by minimal variations in the running or signal travelling time. These variations cause a process to try to acquire the lock a few nanoseconds before the lock is released. The process then has to spin for one or two rounds.
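As a side remark (our own back-of-the-envelope check, not part of the original analysis), the threshold of 6 processes can be related to the lock capacity: each process occupies the lock for 1000 out of every 1000 + 5000 cycles of its round trip, so at most six processes fit without overlap.

```python
crit_cycles, interim_cycles = 1000, 5000   # fixed durations of the Dirac experiment
round_trip = crit_cycles + interim_cycles  # full cycle of one process: 6000 cycles
# The lock can serve at most round_trip / crit_cycles holders per round trip,
# so with more than 6 processes some process necessarily has to wait (and spin).
print(round_trip // crit_cycles)           # 6
```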
Fig. 6. Probability to grab the lock without spinning (A1).
Fig. 7. Probability to grab the lock twice without spinning (A2′).
Fig. 8. Average waiting time for spinning processes (A3′).
5.4. Accounting for cacheline evictions

To avoid the cohort effects in the model described in Section 5.3, we did a series of experiments with variable interim times (distribution ν) and fixed length of the critical section (Dirac distribution γ). Figs. 6–9 display the measurements and model-checking results for our performance measures for the cache-agnostic model (from Figs. 4 and 5). We prefer to display the variants (A2′) and (A3′) here, because, for the selected distributions,
Fig. 9. 95% quantile of the waiting time (A4).
Fig. 10. Cache coherency traffic for releasing the lock while no process is spinning.
both (A2) and (A3) are very close to zero. The x-axis always displays the number of processes and the distributions in our custom notation. Here, 2[1][10, 12] stands for 2 processes, 1 time unit for the critical section (i.e., γ = 1) and 10 or 12 time units for the interim time (ν(10) = ν(12) = 1/2). Similarly, 4[1][8, 10, 12] stands for 4 processes, 1 time unit for the critical section and 8, 10 or 12 time units for the interim time, with probability 1/3 each. Figs. 6 and 7 display the probability for (A1) and (A2′), while Figs. 8 and 9 display the time for (A3′) and (A4) in multiples of the length of the critical section (which corresponds to time units here, because γ = 1). The plotted data points are independent of each other; they have been connected only for better visibility. One clearly sees that the results from the cache-agnostic model are somewhat similar to the measurements. However, there are also clear discrepancies. For instance, for 4 processes and an interim time of 8, 10, 12 or 14 time units (data point 4[1][8, 10, 12, 14]), the model predicts that the lock is free with probability 0.9, while in our measurement the relative frequency is only 0.77. For the same distribution, the predicted average waiting time for the case of a contended lock is 1.15 time units versus 0.57 time units in the measurement. The search for the cause of the discrepancy between the predictions and the measurements was fairly difficult. Finally, the reason turned out to be relatively simple: for the case where a process had to spin before getting the lock, a peculiar cache effect caused the critical section in the measurement to appear slightly longer, namely approximately 1200 instead of 1000 CPU cycles. We illustrate this cache effect in Figs. 10 and 11. Both figures show two processes, P1 and P2, and their respective control locations over a period of time, which increases from left to right. The lock variable is shared between P1 and P2. The coloured bottom line in both figures shows its value. Both P1 and P2 may contain a copy of the cacheline with the lock variable in their respective core-local caches. We therefore display the state of this cacheline for both processes. At the start of both figures, P1 is in its critical section, holding the lock. In Fig. 10 no process is spinning while P1 holds the lock. The lock variable therefore remains in state modified in the cache of P1, meaning that P1 can directly modify the lock variable in its core-local cache. Here, P2 only tries to acquire the lock after P1 has released it. The atomic swap (Line 3 in Fig. 2) causes a request for ownership (RfO) to be sent, which triggers the transfer of the cacheline to the core of P2. After the transfer, P2 obtains the lock by changing the lock variable. In Fig. 10 there is only one cache coherency message in the time where the lock is free. In Fig. 11, P2 starts spinning while P1 is still in its critical section. The cacheline is therefore transferred to P2 while P1 is in its critical section. When P1 wants to release the lock, it must transfer the cacheline back, invalidating it in the cache of P2. At this point, P2 is in the spinning loop (Line 4 in Fig. 2) and continuously reads the lock variable. It therefore sends a snoop message to the cache of P1, in order to transfer the cacheline into the shared state, where both cores may read
Fig. 11. Cache coherency traffic for releasing the lock while another process is spinning.
Fig. 12. Refinement into the cache-aware model (changes in bold).
it. When P2 finds the lock free, it must again request ownership of the cacheline in order to modify the lock variable. In Fig. 11 there are two cache coherency messages in the time where the lock is free, delaying the start and the end of the critical section of P2. This makes it appear as if the critical section of P2 were longer whenever P2 starts spinning while P1 is still holding the lock.

5.5. Second refinement: cache-aware model

To reflect the different lengths of the critical section, we refined the control flow graph of the processes again, see Fig. 12. Instead of a single distribution γ for the critical-section length, we differentiate between the two cases and use
• γ0 for the critical-section length when the lock was obtained without spinning, and
• γ1 for the critical-section length when the lock was obtained after some spinning period.

The resulting model is called cache-aware. As in the experiments with the cache-agnostic model, we dealt with fixed values for γ0 and γ1, formalised by Dirac distributions. Unfortunately, it is not possible to extract good estimates for γ0 and γ1 directly from the measurements. Inserting instructions such as rdtsc into the spinning loop to read an additional time stamp from the local clock would have changed the timing behaviour of the spin loop significantly. Access to the local clock, that is, the running time of rdtsc, takes on the order of 30–60 cycles and is as such well within the range of the effect we would like to measure. We therefore calculate the average costs for acquiring a lock under the condition that a process had to spin, called acquire-spin costs hereafter, by comparing the time stamps of the releasing cores with the time stamps of the lock-acquiring cores. However, although the time stamp counters of cores on the same die are derived from the same clock, there is an offset caused by the head start of the process which first passes the synchronisation barrier at the start of the measurement. Fig. 13 displays the collected data. There, the lines plotted with squares (□) and dots (•) show the acquire-spin costs per core. The figure also shows the acquire-spin costs after normalising the offset between the two clocks (plotted with triangles), with a big spike between 200 and 250 cycles and an average of roughly 200 cycles. To express the time difference between γ0 and γ1, we were forced to switch to a smaller scaling factor, thereby also increasing the values in the distribution ν. As explained before, the absolute values in ν have a direct impact on the size of the state space of the model and thus on its scalability. Facing a trade-off between precision and scalability, we decided
Fig. 13. Frequency of acquire free and spin.
Fig. 14. Probability to grab the lock without spinning (A1).
to let γ1 be 20% longer than γ0 and to use 200 as the scaling factor. The cache-aware model therefore uses 5 and 6 time units for the critical section and fivefold greater values for ν (i.e., ν(50) = ν(60) = 1/2 instead of 10 and 12 time units for the interim time). Note that switching to a smaller scaling factor alone does not provide more accurate results, as the DTMC with scaling factor 200 is (apart from stutter steps in which the system state does not change except for the variables modelling time) isomorphic to the DTMC with scaling factor 1000, as long as the length of the critical section is a multiple of 1000 cycles. The better match between the model-checking results and the measurements is due to the fact that cache effects are now considered.

5.6. Results for the cache-aware model

In this subsection we describe our results for the cache-aware model, see Figs. 14–17. The detailed values obtained from the model checker are provided in Tables 1, 2 and 3 in the next subsection. The distributions for the interim time used here correspond to the distributions in Figs. 6–9. Because of the reduced scaling factor (200 vs. 1000), one time unit now has 1/5 of the previous length and therefore the values in the distribution ν are 5 times larger. Similarly, using 5 time units for γ0 corresponds to the value of γ from the cache-agnostic model. As discussed in the preceding subsection, we set γ1 to 6 time units. The average waiting time (in Fig. 16) and the 95% quantile of the waiting time (in Fig. 17) are measured, as before, in multiples of the length of the critical section. For the measures (A1) and (A2′) there is an almost perfect match between the measurement and the model-checker result. We see the biggest difference for the distribution ν = [40, 50, 60, 70] and 4 processes, with 76.74% versus 77.48% for (A1) and 58.74% versus 59.97% for (A2′). For (A3′) the plots lie slightly further apart, but even for 2 processes and the distribution ν = [40, 60] the absolute difference of 0.041 time units is very small (0.251 versus 0.292 time units). For query (A4) the model checker reproduces the measurement exactly within the achievable accuracy. Because of the scaling factor, PRISM computes a multiple of 0.2 for query (A4), while the measurement can result in arbitrary fractional values. The differences between the results obtained for the models with parameters n[5][6][40, 50, 60] and n[5][6][40, 60] (where n ∈ {2, 3, 4}) illustrate that not only the mean value, but also the variance of the distribution ν for the interim time has a non-negligible impact on (A1)–(A4).
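To make the relation between measured cycles, scaling factor and model time units explicit, here is a small illustrative helper (ours; the concrete numbers are the ones discussed above):

```python
def to_time_units(cycles, scaling_factor):
    """Convert a duration in CPU cycles into discrete model time units."""
    units, remainder = divmod(cycles, scaling_factor)
    assert remainder == 0, "durations are chosen as multiples of the scaling factor"
    return units

# Cache-agnostic model, scaling factor 1000: critical section 1000 cycles -> 1 unit.
print(to_time_units(1000, 1000))                                      # 1
# Cache-aware model, scaling factor 200: 1000 and ~1200 cycles -> 5 and 6 units,
# and interim times of 10000 or 12000 cycles -> 50 or 60 units (nu(50)=nu(60)=1/2).
print([to_time_units(c, 200) for c in (1000, 1200, 10000, 12000)])    # [5, 6, 50, 60]
```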
Table 1
Detailed results and performance for (A1), (A3) and (A3′) using PRISM (with γ0 = 5 and γ1 = 6).

# proc. | Distribution ν | States | MTBDD | Time build | Time steady | (A1) value | (A1) time | (A3) value | (A3) time | (A3′) value | (A3′) time
2 | [40, 50] | 1342 | 1445 | 0.12 s | 0.29 s | 0.944 | 0.00 s | 0.086 | 0.02 s | 1.545 | 0.01 s
2 | [40, 60] | 2202 | 1687 | 0.20 s | 0.51 s | 0.983 | 0.00 s | 0.052 | 0.03 s | 3.118 | 0.02 s
2 | [50, 60] | 1852 | 1628 | 0.18 s | 0.43 s | 0.955 | 0.00 s | 0.066 | 0.02 s | 1.462 | 0.02 s
2 | [40, 50, 60] | 2506 | 1029 | 0.12 s | 0.25 s | 0.926 | 0.00 s | 0.129 | 0.02 s | 1.745 | 0.02 s
2 | [40, 50, 60, 70] | 3350 | 1227 | 0.15 s | 0.31 s | 0.925 | 0.00 s | 0.145 | 0.03 s | 1.926 | 0.02 s
3 | [40, 50] | 67 001 | 32 772 | 4.05 s | 3.45 s | 0.872 | 0.03 s | 0.277 | 0.57 s | 2.166 | 0.45 s
3 | [40, 60] | 142 575 | 40 645 | 11.57 s | 9.24 s | 0.955 | 0.05 s | 0.147 | 0.99 s | 3.229 | 0.72 s
3 | [50, 60] | 110 013 | 42 545 | 7.77 s | 5.94 s | 0.898 | 0.05 s | 0.209 | 0.98 s | 2.047 | 0.87 s
3 | [40, 50, 60] | 195 849 | 10 892 | 3.56 s | 6.38 s | 0.850 | 0.04 s | 0.362 | 0.67 s | 2.410 | 0.54 s
3 | [40, 50, 60, 70] | 306 697 | 12 344 | 4.06 s | 8.77 s | 0.854 | 0.06 s | 0.371 | 1.06 s | 2.547 | 0.85 s
4 | [40, 50] | 4 082 808 | 569 046 | 316.31 s | 149.10 s | 0.797 | 1.12 s | 0.531 | 25.34 s | 2.612 | 22.13 s
4 | [40, 60] | 8 776 938 | 822 186 | 819.24 s | 385.80 s | 0.889 | 2.11 s | 0.376 | 48.96 s | 3.395 | 41.39 s
4 | [50, 60] | 8 325 516 | 878 109 | 759.38 s | 313.11 s | 0.838 | 2.06 s | 0.397 | 48.57 s | 2.446 | 38.44 s
4 | [40, 50, 60] | 13 580 130 | 117 898 | 127.21 s | 150.41 s | 0.767 | 2.06 s | 0.669 | 35.78 s | 2.865 | 33.06 s
4 | [40, 50, 60, 70] | 25 406 006 | 113 008 | 144.22 s | 299.64 s | 0.775 | 3.71 s | 0.668 | 70.44 s | 2.967 | 65.21 s
Table 2
Detailed results and performance for (A2) and (A2′) using PRISM (with γ0 = 5 and γ1 = 6).

# proc. | Distribution ν | States | MTBDD | Time build | Time steady | (A2) prob. | (A2) time | (A2′) prob. | (A2′) time
2 | [40, 50] | 2674 | 1554 | 0.17 s | 0.38 s | 0.056 | 0.06 s | 0.889 | 0.14 s
2 | [40, 60] | 4384 | 1799 | 0.28 s | 0.77 s | 0.100 | 0.09 s | 0.967 | 0.20 s
2 | [50, 60] | 3694 | 1739 | 0.26 s | 0.59 s | 0.045 | 0.08 s | 0.909 | 0.20 s
2 | [40, 50, 60] | 5002 | 1036 | 0.19 s | 0.38 s | 0.085 | 0.08 s | 0.852 | 0.23 s
2 | [40, 50, 60, 70] | 6690 | 1220 | 0.21 s | 0.46 s | 0.107 | 0.11 s | 0.849 | 0.31 s
3 | [40, 50] | 133 604 | 33 875 | 5.80 s | 5.11 s | 0.000 | 1.05 s | 0.760 | 6.59 s
3 | [40, 60] | 284 682 | 42 058 | 15.53 s | 14.11 s | 0.020 | 2.12 s | 0.910 | 12.11 s
3 | [50, 60] | 219 558 | 45 011 | 11.32 s | 9.05 s | 0.000 | 1.85 s | 0.806 | 10.88 s
3 | [40, 50, 60] | 391 588 | 10 523 | 6.50 s | 10.84 s | 0.008 | 1.25 s | 0.720 | 12.66 s
3 | [40, 50, 60, 70] | 613 254 | 11 610 | 7.84 s | 14.96 s | 0.017 | 1.99 s | 0.726 | 21.21 s
4 | [40, 50] | 8 164 362 | 652 002 | 394.71 s | 254.47 s | 0.000 | 33.16 s | 0.639 | 247.55 s
4 | [40, 60] | 17 552 628 | 939 739 | 1063.92 s | 669.62 s | 0.003 | 68.19 s | 0.785 | 513.10 s
4 | [50, 60] | 16 649 628 | 1 037 676 | 1134.80 s | 529.04 s | 0.000 | 68.11 s | 0.706 | 487.28 s
4 | [40, 50, 60] | 27 155 404 | 126 006 | 218.60 s | 263.96 s | 0.001 | 36.76 s | 0.590 | 453.80 s
4 | [40, 50, 60, 70] | 50 810 662 | 115 527 | 278.36 s | 528.41 s | 0.003 | 74.33 s | 0.600 | 878.63 s
Table 3
Detailed results and performance for (A4) using PRISM (with γ0 = 5, γ1 = 6 and tmax = 5).

# proc. | Distribution ν | States | MTBDD | Time build | Time steady | (A4) min t | Prob. | Time | Time script
2 | [40, 50] | 1342 | 1445 | 0.12 s | 0.29 s | 1 | 0.990 | 0.02 s | 0.04 s
2 | [40, 60] | 2202 | 1687 | 0.21 s | 0.52 s | 0 | 0.983 | 0.03 s | 0.06 s
2 | [50, 60] | 1852 | 1628 | 0.18 s | 0.45 s | 0 | 0.955 | 0.02 s | 0.05 s
2 | [40, 50, 60] | 2506 | 1029 | 0.12 s | 0.25 s | 1 | 0.982 | 0.02 s | 0.07 s
2 | [40, 50, 60, 70] | 3350 | 1227 | 0.15 s | 0.32 s | 1 | 0.977 | 0.03 s | 0.09 s
3 | [40, 50] | 67 001 | 32 772 | 4.06 s | 3.45 s | 2 | 0.955 | 0.59 s | 2.40 s
3 | [40, 60] | 142 575 | 40 645 | 15.25 s | 9.73 s | 0 | 0.955 | 1.02 s | 4.63 s
3 | [50, 60] | 110 013 | 42 545 | 7.72 s | 5.85 s | 1 | 0.961 | 0.95 s | 3.54 s
3 | [40, 50, 60] | 195 849 | 10 892 | 3.59 s | 6.38 s | 3 | 0.960 | 0.72 s | 6.08 s
3 | [40, 50, 60, 70] | 306 697 | 12 344 | 4.04 s | 8.72 s | 3 | 0.955 | 1.05 s | 8.72 s
4 | [40, 50] | 4 082 808 | 569 046 | 305.10 s | 145.49 s | 4 | 0.954 | 35.48 s | 141.65 s
4 | [40, 60] | 8 776 938 | 822 186 | 825.78 s | 394.35 s | 3 | 0.950 | 60.60 s | 290.65 s
4 | [50, 60] | 8 325 516 | 878 109 | 754.58 s | 309.86 s | 3 | 0.957 | 65.04 s | 259.34 s
4 | [40, 50, 60] | 13 580 130 | 117 898 | 131.74 s | 158.47 s | 5 | 0.986 | 69.82 s | 422.28 s
4 | [40, 50, 60, 70] | 25 406 006 | 113 008 | 149.04 s | 313.46 s | 4 | 0.953 | 111.51 s | 729.37 s
Fig. 15. Probability to grab the lock twice without spinning (A2′).
Fig. 16. Average waiting time for spinning processes (A3′).
Fig. 17. 95% quantile of the waiting time (A4).
5.7. Quantitative analysis with PRISM

We will now discuss the quantitative analysis of the cache-aware model using the model checker PRISM [8] in more detail. Because the processes Pi are completely identical, we pick process P1 as a representative and determine the performance measures (A1)–(A4) for process P1 only, i.e., we calculate (A1)–(A4) of Section 5.2 for i = 1. As PRISM has no direct support for computing conditional long-run probabilities, we extended PRISM version 4.0.3 with operators that compute conditional long-run probabilities PM(ϕ | C) and conditional long-run accumulated rewards
Fig. 18. Unfolded P1 for queries (A2) and (A2′) (changes in bold).
RM(♦Y | C), where ϕ is a PCTL path formula and Y, C are sets of states. Although there is also no direct support for computing quantiles in PRISM, some quantile queries that refer to the amount of time until some event occurs can be calculated with the help of the iterative bottom-up computation scheme used for bounded reachability properties [32,33]. For this we made PRISM print intermediate results gained during the computation of bounded reachability properties and used a postprocessing script to extract the quantile values. For the experimental results presented in this section we followed this approach, while in the meantime we have investigated a linear-programming approach for computing quantiles with lower and upper reward bounds [32] and carried out first experiments with our implementation of this alternative approach in PRISM [33]. In both of our approaches the concrete runtime depends on the result of the computation, i.e., the quantile value. The precise complexity of computing quantitative quantiles is unknown: although NP-hardness was shown in [40], more efficient algorithms might still exist. In (A4), we are interested in the minimal t ∈ N such that the conditional long-run probability is at least some fixed probability bound. We thus modified the implementation of the bounded-until operator in PRISM to provide the probabilities PM(♦^{≤t+1} lock1) for all intermediate values 0 ≤ t ≤ tmax up to some reasonable maximal time bound tmax. The intermediate results are then weighted with the steady-state probabilities θ(s), which are also provided by PRISM. For the weighting with the steady-state probabilities and for finding the minimum time bound t we used a script external to the PRISM model checker. As mentioned in Section 5.2, we compute (A2) and its variant (A2′) with an until property in an unfolded process P1 to avoid the possibly costly LTL query. Fig. 18 shows the unfolded process P1, where the Boolean b duplicates the control loop ncrit1 → wait1 → crit1. The processes Pj with j > 1 are of course not unfolded. Tables 1, 2 and 3 provide detailed results and statistics of the PRISM performance for the queries we consider here. The first six columns of each table show the number of processes, the distribution ν, the number of states of the model, the MTBDD size, the time necessary for constructing the internal representation of the model and determining reachability and, finally, the time for calculating the steady-state distribution. The steady-state distribution, used to determine the conditional long-run probabilities and rewards, is calculated once and cached to allow re-use for subsequent queries carried out in the same run. The calculations were performed using the "sparse" engine of PRISM and applying the backward Gauß–Seidel iteration method for solving linear equation systems, with a relative termination criterion of ε = 10^−6, the PRISM default. Our preliminary investigation has shown that, for our model and queries, the chosen iteration method performs better than the other iteration methods offered by PRISM (Power, Jacobi, SOR, JOR, etc.). Likewise, in our setting the sparse engine outperforms both the hybrid and the MTBDD engine offered as alternatives by PRISM. We performed all PRISM calculations on a dual-socket Intel Xeon L5630 (quad-core) system at 2.13 GHz with 32 GB of RAM. Table 1 and Table 2 provide in their additional columns the values and the required computation time for the performance measures (A1), (A2) and (A3) and their variants.
For the quantile measure (A4), Table 3 shows in column "min t" the minimal t such that PM(♦^{≤t+1} locki | requesti) ≥ 0.95, and additionally the concrete probability of the conditional long-run query for this t. (For the plot in Fig. 17 and to allow a comparison with the measured data, these values are divided by the length of the critical section, i.e., by 5.) Besides the time PRISM spent computing the probabilities, Table 3 also shows the time that the postprocessing script required for actually computing the quantile. Due to the unfolding of P1 for (A2) and (A2′), the system size as well as the computation times in Table 2 are increased in comparison with Table 1 and Table 3. For (A1), (A3), (A3′) and (A4) we use the same model, and the system sizes in Tables 1 and 3 are indeed identical. The times for building the system and calculating the steady-state distribution in these two tables show minor variations, because the running time of PRISM on our system is not completely stable. We observed similar small variations on identical PRISM runs despite our best efforts to eliminate possible sources of interference.
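The postprocessing step can be pictured with the following sketch (our simplified illustration of what the external script does; the array names are hypothetical): given the steady-state distribution θ, the set of request_1-states and, for each t, the per-state probabilities of reaching lock_1 within t+1 steps, it forms the conditional long-run probability by steady-state weighting and returns the smallest t that crosses the 0.95 bound.

```python
import numpy as np

def conditional_longrun(theta, sat_C, values):
    """Conditional long-run value: sum_{s in C} theta(s)*values(s) / theta(C),
    where sat_C is a Boolean mask of the condition C (e.g. the request_1-states)."""
    weight = theta[sat_C].sum()
    return float((theta[sat_C] * values[sat_C]).sum() / weight)

def quantile_A4(theta, sat_request, bounded_probs, p=0.95):
    """Minimal t such that the conditional long-run probability of reaching
    lock_1 within t+1 steps (bounded_probs[t], one value per state) is >= p."""
    for t, probs_t in enumerate(bounded_probs):
        if conditional_longrun(theta, sat_request, probs_t) >= p:
            return t
    return None  # bound not reached within the considered t_max

# Tiny artificial example with 4 states, two of which satisfy request_1.
theta = np.array([0.4, 0.1, 0.3, 0.2])            # steady-state distribution
sat_request = np.array([False, True, False, True])
bounded = [np.array([0.2, 0.5, 0.1, 0.8]),        # Pr_s(reach lock_1 within 1 step), t = 0
           np.array([0.6, 0.9, 0.4, 1.0])]        # ... within 2 steps, t = 1
print(quantile_A4(theta, sat_request, bounded, p=0.9))   # 1
```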
Fig. 19. Symmetry reduction.
In this setting and with the considered distributions, we were able to build the spinlock model and compute queries (A1)–(A4) for up to 4 processes using PRISM. As can be seen in Tables 1 and 2, the number of states in the model increases very rapidly when additional processes are considered. For 5 processes and more, it turned out to be infeasible to compute the queries for most of our distributions; the calculation had to be aborted either due to running out of memory or due to excessive run time (more than 10 hours). To improve the scalability we considered the application of symmetry-reduction techniques, which are detailed in the next section.

6. Symmetry reduction

In the previous section we investigated several performance measures for our spinlock model with the model checker PRISM. We were able to fine-tune the model such that it precisely matches the behaviour of the spinlock in our measurement environment. However, the scalability of the model was relatively poor, with a rapid increase in the number of states. The use of discrete clock variables and the relatively big absolute values in the support of distribution ν are in part responsible for the limited scalability. It is evident that our spinlock exhibits a high degree of symmetry. The control flow graphs of the processes Pi are identical except for the indices i of the processes (see Fig. 12 on p. 273). Further, the lock process treats all processes Pi in an identical fashion (see Fig. 4 on p. 268). Symmetry is a common phenomenon in many parametrised models. Different techniques for exploiting symmetry for state-space reduction have been extensively studied for non-probabilistic as well as for probabilistic model checking (see, e.g., [18,19,41–45]). The basic idea is to perform the analysis on a smaller quotient model that arises from the identification of symmetrical states in such a way that the relevant behaviour for the considered properties remains unchanged. For the properties considered in this article, the particular identity of the processes other than P1 is not relevant, and states that can be obtained by permuting the local states of these processes, i.e., by switching the process identities, can be considered equivalent. The induced equivalence relation is a bisimulation relation, but not necessarily the largest bisimulation relation. However, it has the benefit that the state space of the quotient system can be generated directly without first generating the state space of the unreduced system. Unfortunately, the symmetry reduction built into PRISM for component symmetry [18] is not directly applicable to our model, as it requires the symmetric components to be completely identical. However, in our model the processes differ by their process index i, which is required so that the lock process is able to select which of the waiting processes will obtain the lock next and to ensure that only a single process enters its critical section. For similar reasons, we were not able to directly employ language-level symmetry reduction such as provided by the GRIP tool [19,20]. In this section, we thus report on our experience in trying to exploit the inherent symmetry nevertheless in order to improve the scalability of the spinlock model. Fig. 19 shows the basic idea for a symmetry-reduced spinlock model using the technique of generic representatives represented by counters, e.g. as in [44,46,19,20]. As noted before, we single out one process, say P1, for the computation of the performance measures.
This choice is arbitrary, as all processes behave identically and are treated identically by the lock. For the purpose of symmetry reduction, we then abstract from the particular identities of the remaining processes, as there is no need to distinguish between them for the computation of the queries considered in our case study. To represent the aggregate behaviour of the remaining processes, a counter is maintained for each possible local state of the processes, counting how many of the remaining processes are currently in the corresponding local state. For example, for the location ncrit there is one counter for each possible value of the clock variable t, while there are 3 counters for location wait (for t = 0, t = 1 and t = 2). A transition then updates the counters to reflect the changes in the local states. In the lock process, the locki locations for i > 1 are combined into a single lockx location, representing the situation where some process different from P1 holds the lock. In the non-reduced model, the lock process uniformly selects among the
waiting processes. Accordingly, the probabilities for choosing between P1 and one of the remaining processes now depend on the number of remaining processes that are in the wait location. For instance, if the lock is in location unlock and P1 as well as 4 other processes are in location wait, then the lock moves with probability 1/5 to lock1 and with probability 4/5 to lockx. Overall, the symmetry-reduced spinlock model consists of the modified lock, the unchanged process P1 and the counters for the Pi with i > 1. We refrain from displaying control flow graphs for the symmetry-reduced model here, as they arise straightforwardly from the non-reduced model. In a first attempt, we specified such a symmetry-reduced model in the PRISM input language. However, the scalability did not significantly improve, as the time spent by PRISM for constructing the internal symbolic MTBDD representation again renders the computation of our queries of interest beyond 5 processes infeasible.1 We therefore decided to implement a special-purpose generator for the reachable state space of the symmetry-reduced model, which allows us to directly generate its (sparse) transition matrix. For the calculations we then used the model checker MRMC [13], which takes such a transition matrix (together with the other necessary information such as labels and rewards) as input. To allow the handling of conditional long-run queries, we extended MRMC in a fashion similar to our implementation in PRISM. We consider here only (A1), (A2) and (A3) and their variants. Quantile queries such as (A4) could be calculated using strategies similar to those described in Section 5.7, but we did not consider them in our experimental studies with the symmetry reduction. For (A3), we extended MRMC version 1.5 to allow the calculation of reward queries with an unbounded Until operator. As MRMC does not support queries with nested path operators as in (A2) and (A2′), we first considered a modification of the model similar to the one detailed in Section 5.2, p. 270, with a trade-off between doubling the state space and a simpler structure of the query. To avoid this blow-up of the state space, we then implemented support in MRMC for a specific variant of the Until path operator prefixed by a finite number i of Next operators, ◯^i(X U Y), which can be computed without changing the state space of the model. Let X and Y be sets of states, i.e., X, Y ⊆ S, and i ∈ N. Then:
◯^i(X U Y) =def { π ∈ Paths : π↑i ∈ X U Y }.

Recall that π↑i denotes the suffix of π starting after the i-th state. Given PM_s(X U Y) for all states s ∈ S, the values PM_s(◯^i(X U Y)) can be computed easily in a recursive fashion by

PM_s(◯^i(X U Y)) = Σ_{u ∈ S} P(s, u) · PM_u(◯^{i−1}(X U Y))   if i > 0,
PM_s(◯^i(X U Y)) = PM_s(X U Y)                                  if i = 0.
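In matrix–vector terms, the recursion simply pre-multiplies the vector of until-probabilities once per Next operator with the transition matrix. A compact sketch (our illustration, not MRMC code):

```python
import numpy as np

def next_pow_until(P, until_probs, i):
    """Pr_s(Next^i (X U Y)) for all states s, given the transition matrix P and
    the vector until_probs with until_probs[s] = Pr_s(X U Y)."""
    v = np.asarray(until_probs, dtype=float)
    for _ in range(i):                  # one application of P per Next operator
        v = P @ v                       # Pr_s = sum_u P(s, u) * Pr_u(Next^{i-1}(X U Y))
    return v

# Small 3-state example; the until-probabilities are assumed to be precomputed.
P = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])
until = np.array([0.0, 1.0, 1.0])       # Pr_s(X U Y) for s = 0, 1, 2
print(next_pow_until(P, until, 1))      # [1. 1. 1.]
```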
As the path formula considered by query (A2), ϕ2 = ◯(unlock U locki), is of this form, (A2) can be computed directly. For our model, and using the knowledge that γ1 > γ0, query (A2′) can be reformulated into this form as well. Recall that (A2′) is given by PM(ϕ2′ | requesti) for the event specified by the LTL formula

ϕ2′ = ◯(locki U (ncriti ∧ (¬spini U criti))),

i.e., starting from the state where process Pi has just requested the lock, we are interested in the paths where process Pi obtains the lock in the next step and, once the lock has been relinquished (ncriti), will obtain the lock again (criti) without spinning (¬spini). For the calculation in MRMC, we reformulate query (A2′) to PM(ϕ2″ | requesti), with ϕ2″ given by

ϕ2″ = ◯^3((criti^{t<γ0} ∨ (¬criti ∧ ¬spini)) U criti^{t=γ0}),

where criti^{t<γ0} and criti^{t=γ0} correspond to the following Boolean conditions:

criti^{t<γ0} =def criti ∧ ti < γ0
criti^{t=γ0} =def criti ∧ ti = γ0

In case Pi immediately obtains the lock without spinning, criti^{t=γ0} is true 2 steps after requesti, and thus criti^{t<γ0} is true 3 steps after requesti, remaining true until Pi leaves the critical section. Then, (¬criti ∧ ¬spini) ensures that the next attempt to acquire the lock is likewise successful without spinning. In the other case, where Pi has to spin before obtaining the lock for the first time, Pi is either still spinning 3 steps after requesti or it has just entered its critical section, but with ti = γ1. As γ1 > γ0, neither the left nor the right side of the until formula is satisfied in the latter case. For a given number of processes, critical-section lengths γ0 and γ1 and interim-duration distribution ν, the matrix generator explicitly explores the reachable state space of the symmetry-reduced model and writes the transition matrix as well as the labels and rewards required for the calculation of the relevant queries. As parameters for the subsequent MRMC computations, we chose the iterative Gauß–Seidel method for solving the linear equation systems of the steady-state computations, Gauß–Jacobi for the linear equation systems of the reward calculations and a termination threshold
1 A recently released new version of PRISM incorporates an engine for explicitly generating the state space of the model, which avoids the internal MTBDD representation. It should therefore be possible in the future to use this general-purpose engine for the construction of the transition matrix of the symmetry-reduced model instead of using our custom generator, with an approximately similar performance.
Fig. 20. (A1) and (A3) for ν = [40, 50] ((A3) values for 500–5000 processes not plotted).
Fig. 21. Steady-state probabilities for spinning and ν = [40, 50].
Fig. 22. Matrix generation and MRMC calculations for ν = [40, 50].
of ε = 10^−15. This threshold is chosen smaller than the one used for PRISM because MRMC determines convergence by comparing the absolute difference between values against the threshold, while our PRISM calculations compare the relative difference between values. We have furthermore activated formula-independent lumping [47], where MRMC calculates the bisimulation quotient of the system as a first step and subsequently performs the calculations on this potentially much smaller system. We used a memory limit of 180 GB of RAM. Comparing the values for (A1), (A2), (A2′), (A3) and (A3′) as calculated by MRMC with those calculated by PRISM for 2–4 processes, the values agreed to at least 6 decimal places, some up to 9 decimal places. The calculations using MRMC were performed on a dual-socket Intel Xeon L5630 (quad-core) system at 2.13 GHz equipped with 192 GB RAM. The application of symmetry reduction, the direct generation of the transition matrix and the calculation with MRMC afforded us much better scalability. For example, for the distribution ν = [40, 50] (and keeping γ0 = 5 and γ1 = 6 as usual for the cache-aware model), we can generate the transition matrix, calculate the steady-state distribution and compute the performance measures (A1)–(A3) for up to 5000 processes. Figs. 20–22 show plots of selected values for the distribution ν(40) = ν(50) = 1/2, i.e., ν = [40, 50], as well as the runtime of the tool chain for generating the matrix and calculating (A1)–(A3′). Detailed results are subsequently provided in Table 4. Fig. 20 depicts the effect of more and more processes saturating the lock, with the probability of acquiring the lock immediately dropping to 0 (A1) and an increase in the expected time spent spinning before acquiring the lock (A3). In fact, as can be seen in Fig. 21, at 10 processes the lock becomes fully saturated, with the long-run fraction of time in which at least one process is spinning reaching 1. The fraction of time that process P1 spends spinning increases with the
Table 4
Detailed results and performance of MRMC for the distributions γ0 = 5, γ1 = 6 and ν = [40, 50].

Number proc. | (A1) prob. | (A2) prob. | (A2′) value | (A3) value | (A3′) value | States (model) | States (bisim. quot.) | Time bisim. | Total time (with bisim.) | Total time (w/o bisim.)
2 | 0.944 | 0.056 | 0.889 | 0.086 | 1.545 | 1343 | 1339 | 0.00 s | 0.04 s | 0.04 s
3 | 0.872 | 0.000 | 0.760 | 0.277 | 2.166 | 33 768 | 31 649 | 0.03 s | 0.52 s | 0.49 s
4 | 0.797 | 0.000 | 0.639 | 0.531 | 2.612 | 694 907 | 588 786 | 1.72 s | 13.26 s | 12.41 s
5 | 0.714 | 0.000 | 0.518 | 0.859 | 3.005 | 8 606 544 | 6 294 253 | 27.78 s | 166.93 s | 159.54 s
6 | 0.630 | 0.000 | 0.410 | 1.266 | 3.422 | 58 911 750 | 36 006 147 | 213.18 s | 1149.74 s | 1130.24 s
7 | 0.536 | 0.000 | 0.302 | 1.795 | 3.869 | 213 497 440 | 105 122 131 | 803.67 s | 3997.22 s | 4565.48 s
8 | 0.413 | 0.000 | 0.179 | 2.502 | 4.260 | 387 320 107 | 146 479 606 | 1354.72 s | 6747.67 s | 11 104.58 s
9 | 0.250 | 0.000 | 0.062 | 3.942 | 5.256 | 1 211 760 | 601 804 | 1.88 s | 21.78 s | 29.26 s
10 | 0.000 | 0.000 | 0.000 | 9.000 | 9.000 | 189 311 | 22 359 | 0.08 s | 1.87 s | 5.76 s
20 | 0.000 | 0.000 | 0.000 | 68.999 | 68.999 | 194 197 | 13 819 | 0.08 s | 2.97 s | 31.04 s
30 | 0.000 | 0.000 | 0.000 | 129.000 | 129.000 | 195 627 | 14 269 | 0.07 s | 4.06 s | 53.01 s
40 | 0.000 | 0.000 | 0.000 | 188.999 | 188.999 | 197 057 | 14 719 | 0.08 s | 5.41 s | 77.76 s
50 | 0.000 | 0.000 | 0.000 | 248.998 | 248.998 | 198 487 | 15 169 | 0.07 s | 6.88 s | 104.06 s
60 | 0.000 | 0.000 | 0.000 | 309.005 | 309.005 | 199 917 | 15 619 | 0.08 s | 8.36 s | 131.49 s
70 | 0.000 | 0.000 | 0.000 | 368.980 | 368.980 | 201 347 | 16 069 | 0.08 s | 9.65 s | 152.88 s
80 | 0.000 | 0.000 | 0.000 | 429.023 | 429.023 | 202 777 | 16 519 | 0.08 s | 11.02 s | 176.37 s
90 | 0.000 | 0.000 | 0.000 | 488.981 | 488.981 | 204 207 | 16 969 | 0.08 s | 12.71 s | 203.61 s
100 | 0.000 | 0.000 | 0.000 | 548.985 | 548.985 | 205 637 | 17 419 | 0.07 s | 14.04 s | 223.37 s
500 | 0.000 | 0.000 | 0.000 | 2949.762 | 2949.762 | 262 837 | 35 419 | 0.09 s | 130.25 s | 1 451.21 s
1000 | 0.000 | 0.000 | 0.000 | 5946.433 | 5946.433 | 334 337 | 57 919 | 0.15 s | 465.77 s | 3 500.59 s
5000 | 0.000 | 0.000 | 0.000 | 29 997.000 | 29 997.000 | 906 337 | 237 919 | 0.51 s | 12 236.01 s | 44 079.90 s
10 000 | 0.000 | 0.000 | 0.000 | – | – | 1 621 337 | 462 919 | 1.01 s | 27 951.51 s | 93 388.56 s

Fig. 23. (A1) and (A3) for ν = [40, 50, 60] ((A3) value for 500 not plotted).
number of processes. Fig. 22 presents the number of states generated for the model, the number of states after bisimulation quotienting by MRMC and the overall run time for both matrix generation and the calculations by MRMC. The effect of the lock saturation on the number of states as well as on the runtime is pronounced. As can be seen, for small numbers of processes there is a rapid increase both in the running time and in the number of states of the symmetry-reduced model, peaking just before there are enough processes to fully saturate the lock. Once the lock is saturated, the state space for 10 processes is only a small fraction (22 359 states in the quotient) of the state space for 8 processes (146 479 606 states in the quotient). As a further example, Figs. 23–25 depict results and statistics for another of the distributions considered in Section 5.7, ν = [40, 50, 60]. Detailed results are provided in Table 5. Again, the lock becomes fully saturated, with the probability that there is at least one process waiting for the lock reaching 1, at 12 processes. In contrast to ν = [40, 50], however, the state space before lock saturation is so large that our matrix generator aborts because of the memory limit in the case of 7–10 processes. The number of states explored by the generator before the memory limit was reached was about 800 000 000. For the symmetry-reduced model we see the interesting effect that building the model for model checking becomes more difficult for a relatively small number of processes, but becomes more tractable again (in terms of memory and time consumption) for an increased number of processes. As can be seen, this effect is quite sensitive to the concrete distributions used in the model. The collapse of the state space and the resulting improvement in scalability can be explained by the effect of the saturation of the lock on the combinatorial structure of the state space. When the lock is saturated, there is always one process holding the lock while most of the other processes are spinning. Therefore, every 5 (γ0) or 6 (γ1) time units, the currently critical process releases the lock, enters location ncrit and chooses an interim time from ν. Because the maximal interim time is limited and relatively small, there can only be a certain, small number of processes in location ncrit at any point in time. All other processes are spinning in location wait. Adding a further process to such a system has, in the long
Fig. 24. Steady-state probabilities for spinning and ν = [40, 50, 60].
Fig. 25. Matrix generation and MRMC calculations for ν = [40, 50, 60] (time for 500 processes not plotted).
run, only the effect of increasing the number of waiting processes, thereby reducing the chance of P1 to obtain the lock during each spinning cycle. The observed decrease in the number of states after the lock is saturated is mainly due to the limited number of processes that can be simultaneously in their non-critical section, limiting the combinatorial possibilities. Fig. 26 visualises aspects of the described saturation effect in so-called heat maps. The figure depicts heat maps for distribution ν = [40, 50] in the left column and for ν = [40, 50, 60] in the right column. The first row shows the long-run probability that a given number y of processes is in location ncrit. The probability is encoded as colour, with darker grey standing for a higher probability. For instance, for distribution ν = [40, 50] and a total number of 50 processes, in the long run there are always either 6, 7, 8 or 9 processes in location ncrit, with a probability of about 46% for exactly 7 processes being non-critical. Again, the effect of the lock saturation is directly apparent in Figs. 26(a) and 26(b), as the maximum number of non-critical processes does not increase any more despite adding more processes. Apart from the limited number of non-critical processes, there is another effect that limits the size of the state space when the lock is saturated. Because processes release the lock at regular intervals and because there are only certain possible values for the interim time, the difference between the clock values of two non-critical processes can only take certain values, see Fig. 27, leading to more structure in the state space. To provide some intuitive visualisation of this effect, Figs. 26(c) and 26(d) show heat maps for the steady-state probability that there are two neighbouring non-critical processes with a certain distance y, i.e., with their clock variables differing by y, with no other process "in between". It is apparent that there is a difference in the structure of these distances before and after the lock is saturated. Before the lock is saturated, the distances are much more varied, while once the lock is saturated, a stronger pattern of possible distances emerges. This more regular structure is then reflected in a more limited number of states in the symmetry-reduced model. We also see that the greater variability of the distribution ν = [40, 50, 60] gives rise to more irregularity in the model, which explains the steeper growth of the state space and the model-checking times for this distribution. Tables 4 and 5 provide details of the calculations for the distributions ν = [40, 50] and ν = [40, 50, 60], listing first the calculated values for our queries (A1)–(A3′). Interestingly, once the number of processes guarantees lock saturation, the value of (A1) for ν = [40, 50], i.e., the probability of immediately obtaining the lock after requesting it, becomes 0. For ν = [40, 50, 60], however, the probability for (A1) decreases with an increasing number of processes but remains above 0. For (A1) to be non-zero for a completely saturated lock, it has to be possible for P1 to request the lock at exactly the right moment, i.e., when another process is just ready to leave its critical section. Then there is a small probability that P1 is chosen for the lock. Due to the regular structure of the processes leaving the critical section, this probability depends strongly on the concrete distribution ν.
As stated above, we additionally used the bisimulation quotienting provided by MRMC to reduce the model. In Tables 4 and 5, columns 7 and 8 show the number of states of the symmetry-reduced model as generated by our custom matrix generator, as well as the number of states of the model after bisimulation quotienting. As can be seen, the bisimulation
Table 5
Detailed results and performance of MRMC for the distributions γ0 = 5, γ1 = 6 and ν = [40, 50, 60].

Number proc. | (A1) prob. | (A2) prob. | (A2′) value | (A3) value | (A3′) value | States (model) | States (bisim. quot.) | Time bisim. | Total time (with bisim.) | Total time (w/o bisim.)
2 | 0.926 | 0.09 | 0.85 | 0.13 | 1.75 | 2507 | 2503 | 0.00 s | 0.06 s | 0.06 s
3 | 0.850 | 0.01 | 0.72 | 0.36 | 2.41 | 98 883 | 92 220 | 0.13 s | 1.75 s | 1.68 s
4 | 0.767 | 0.00 | 0.59 | 0.67 | 2.87 | 2 329 670 | 1 901 722 | 6.41 s | 47.54 s | 45.58 s
5 | 0.680 | 0.00 | 0.47 | 1.06 | 3.30 | 37 172 256 | 24 758 779 | 130.75 s | 777.25 s | 808.56 s
6 | 0.593 | 0.00 | 0.36 | 1.54 | 3.78 | 420 921 293 | 214 830 508 | 1764.80 s | 8 968.26 s | 11 674.53 s
7–10 | Matrix generation aborted due to memory limit (180 GB of RAM)
11 | 0.093 | 0.00 | 0.01 | 9.44 | 10.40 | 40 266 022 | 4 952 430 | 46.27 s | 600.25 s | 1 652.87 s
20 | 0.023 | 0.00 | 0.00 | 63.57 | 65.05 | 41 711 026 | 2 910 131 | 36.57 s | 982.88 s | 7 291.13 s
30 | 0.014 | 0.00 | 0.00 | 123.61 | 125.30 | 41 779 996 | 2 923 476 | 36.95 s | 1458.76 s | 14 204.82 s
40 | 0.010 | 0.00 | 0.00 | 183.63 | 185.41 | 41 871 966 | 2 941 321 | 36.51 s | 1876.30 s | –
50 | 0.007 | 0.00 | 0.00 | 243.63 | 245.46 | 41 986 936 | 2 963 666 | 36.74 s | 2351.94 s | –
60 | 0.006 | 0.00 | 0.00 | 303.63 | 305.50 | 42 124 906 | 2 990 511 | 36.90 s | 2833.90 s | –
70 | 0.005 | 0.00 | 0.00 | 363.65 | 365.53 | 42 285 876 | 3 021 856 | 37.21 s | 3275.16 s | –
80 | 0.004 | 0.00 | 0.00 | 423.63 | 425.53 | 42 469 846 | 3 057 701 | 36.97 s | 3725.28 s | –
90 | 0.004 | 0.00 | 0.00 | 483.64 | 485.55 | 42 676 816 | 3 098 046 | 37.34 s | 4271.08 s | –
100 | 0.004 | 0.00 | 0.00 | 543.63 | 545.55 | 42 906 786 | 3 142 891 | 37.51 s | 4638.18 s | –
500 | 0.001 | 0.00 | 0.00 | 2942.84 | 2944.82 | 70 965 586 | 8 626 691 | 101.03 s | 58 141.91 s | –
Fig. 26. Regular behaviour of the saturated spinlock.
Fig. 27. Ladder effect in the non-critical region, displayed here for the distributions ν(23) = ν(30) = 1/2 and γ0 = γ1 = 4. Filled dots (•) represent processes that chose ti = 30 when releasing the lock and empty dots (◦) represent processes that chose ti = 23. For these distributions, the difference between the ti's of two processes which follow each other on the stairs can only take the values 2, 3, 5, 10 or 12. Moreover, only certain differences can occur next to each other. For instance, because t2 − t3 = 12, the process that follows P2 has a distance of either 3 or 5 time units.
quotienting provided significant additional reductions on top of the already symmetry-reduced model at a moderate cost, with the running time for the bisimulation quotienting given in the next column. The last two columns compare the combined running time for both generating the matrix and the calculations of MRMC, once with bisimulation enabled and once with bisimulation disabled. For ν = [40, 50] and 10 000 processes, it was possible to generate the matrix, perform bisimulation quotienting and calculate (A1), (A2) and (A2′), but the calculation of (A3) and (A3′) did not converge after 1 000 000 iterations. As the number of processes increases, the values of (A3) and (A3′) increase, and their computation tends to significantly dominate the overall running time. For ν = [40, 50, 60] we did not perform calculations without bisimulation for more than 30 processes due to the significant running time. To illustrate the effort for the different steps of the computation, Table 6 provides a breakdown of the running time for 50 and for 100 processes and for ν = [40, 50] and ν = [40, 50, 60]. As can be seen, most of the effort tends to be spent on calculating (A3)/(A3′), with the time spent for generating the matrix, the bisimulation quotienting and the steady-state computations playing a minor role. The running time in the row marked "other" represents tasks of MRMC such as reading the generated matrix, constructing the sparse matrix in memory, etc. Spinlocks have very high costs when there is contention on the lock, because the spinning generates a lot of cache-coherence traffic that slows down the whole system. Spinlocks are therefore only used in cases where high contention cannot occur or is very unlikely. Nevertheless, we believe that the investigation of the symmetry reduction in this section is interesting. The ability to scale the probabilistic model checking from a small number of processes up to 5000 processes and beyond clearly demonstrates the usefulness of the symmetry-reduction approach. The ability to analyse models "beyond"
Table 6
Running times for 50 and 100 processes.

Processes | ν = [40, 50], 50 proc. | ν = [40, 50], 100 proc. | ν = [40, 50, 60], 50 proc. | ν = [40, 50, 60], 100 proc.
Generation | 1.15 s | 1.17 s | 277.29 s | 280.31 s
Bisimulation | 0.07 s | 0.07 s | 36.74 s | 37.51 s
Steady state | 0.01 s | 0.01 s | 4.26 s | 4.31 s
(A1) | 0.00 s | 0.00 s | 0.07 s | 0.07 s
(A2) | 0.00 s | 0.01 s | 0.84 s | 1.11 s
(A2′) | 0.00 s | 0.01 s | 2.28 s | 2.56 s
(A3) | 2.55 s | 6.11 s | 945.53 s | 2083.65 s
(A3′) | 2.54 s | 6.11 s | 943.52 s | 2084.58 s
Other | 0.55 s | 0.56 s | 141.41 s | 180.08 s
Total time | 6.88 s | 14.04 s | 2351.94 s | 4638.18 s
the lock saturation allowed us to gain further insights into the effects of lock saturation, i.e., the regularity of the long-run behaviour and the dramatic drop in the number of states. But the investigation here is also valuable for operating systems. Highly contended systems and applications do exist, and they need means for locking or synchronisation in order to cope with the workload. We expect that the effect of the symmetry reduction that we saw in this section is present in those systems as well.

7. Conclusions

This article presents the first steps towards the application of probabilistic model-checking techniques for the quantitative analysis of low-level operating-system code. We modelled a test-and-test-and-set spinlock in the model checkers PRISM and MRMC and compared 4 typical (probabilistic) performance measures with measurements from an Intel i7 920 quad-core machine. On such CPUs, the behaviour of the spinlock depends significantly on the caches. However, these caches are far too complex to include them in the spinlock model. Nevertheless, we were able to develop a model that precisely matches the measurements, thus demonstrating that a probabilistic model checker can be used to make precise predictions about the behaviour of the spinlock on recent, highly optimised CPUs. Modelling the aggregated effect of caches by stochastic distributions required some expert knowledge, but allowed for a rather simple model that abstracts from the heavily data-dependent behaviour inherent to caches. We discovered this aspect after a systematic but intensive manual search for all possible causes of the gap between the measurements and the findings in our formal model. The discovery took a substantial amount of time and could only be explained by an in-depth knowledge of cache architectures. It would be interesting to investigate whether this process can be automated to some extent. For example, it might be possible to apply learning techniques that have been developed for probabilistic automata and hidden Markov models [48] to derive stochastic parameters from architectural models or a meaningful set of measured data. One of our performance measures is the 95% quantile of the waiting time for the lock, i.e., the shortest time t such that the probability of acquiring the lock with a waiting time of t or shorter is 95% or greater. Such quantile queries and also conditional steady-state queries are very important to judge the performance of low-level code. However, PRISM and MRMC do not directly support such queries. To obtain our results, we used adapted versions of PRISM and MRMC and postprocessing scripts. Although our spinlock model has at first glance a very simple structure, the model scales only up to a few processes contending for the lock, due to the state-space explosion. As a secondary result, we show that the scalability can be drastically improved by the application of symmetry-reduction techniques. Using MRMC, we were able to compute some of our queries for up to 5000 processes. The main reason for this impressive number of processes is that the lock becomes saturated around 10 processes and that symmetry reduction works particularly well in the saturated case. Spinlocks are of course only practical in situations where the probability of contention is very low. In practice, this is ensured by applying spinlocks only for objects that are rarely used and only for short critical sections.
Nevertheless, our results show the applicability and scalability of our approach. In particular, we expect that our methodology will continue to scale in analyses of locks that protect more contended services and of other low-level operating-system code with similar behaviour. Our optimism results from the observation that locks for highly contended resources typically establish an order between the lock-acquiring processes that is similar to the order we have seen for the saturated case.

Future work. It would be interesting to analyse other lock implementations that are used inside operating-system kernels, such as ticket locks. Probabilistic model checking could then help to choose the right lock implementation for a given number of CPU cores, depending on the expected workload. More generally, it is possible to utilise models and stochastic properties obtained with our methodology for optimisation and synthesis problems. This approach might be highly relevant for constructing systems optimised with respect to properties like energy-utility or failure recovery. Also, the abstract models could prove useful for the synthesis of synchronisation for concurrent programs [49]. For example, given a parallel program that is not properly synchronised, performance
models obtained from the formal analysis of synchronisation primitives could be used to construct a program that is correct and performs optimally with respect to the implied measure, as proposed in [50].

References

[1] G. Bernat, A. Colin, S. Petters, WCET analysis of probabilistic hard real-time systems, in: Proceedings of the 23rd Real-Time Systems Symposium (RTSS), IEEE Computer Society, 2002, pp. 279–288.
[2] S. Knapp, W. Paul, Realistic worst-case execution time analysis in the context of pervasive system verification, in: Program Analysis and Compilation, Theory and Practice, Essays Dedicated to Reinhard Wilhelm on the Occasion of his 60th Birthday, in: Lecture Notes in Computer Science, vol. 4444, Springer, 2007, pp. 53–81.
[3] R. Wilhelm, J. Engblom, A. Ermedahl, N. Holsti, S. Thesing, D. Whalley, G. Bernat, C. Ferdinand, R. Heckmann, T. Mitra, F. Mueller, I. Puaut, P. Puschner, J. Staschulat, P. Stenström, The worst-case execution-time problem — overview of methods and survey of tools, ACM Trans. Embed. Comput. Syst. 7 (3) (2008) 1–53.
[4] W.K. Shih, J.W.-S. Liu, J.-Y. Chung, Algorithms for scheduling imprecise computations with timing constraints, SIAM J. Comput. 20 (3) (1991) 537–552.
[5] C.-J. Hamann, J. Löser, L. Reuther, S. Schönberg, J. Wolter, H. Härtig, Quality-assuring scheduling — using stochastic behavior to improve resource utilization, in: 22nd Real-Time Systems Symposium (RTSS'01), IEEE Computer Society, 2001, pp. 119–128.
[6] T.E. Anderson, The performance of spin lock alternatives for shared-memory multiprocessors, IEEE Trans. Parallel Distrib. Syst. 1 (1) (1990) 6–16.
[7] J. Mellor-Crummey, M. Scott, Scalable reader-writer synchronization for shared-memory multiprocessors, in: Proceedings of the 3rd Symposium on Principles and Practice of Parallel Programming (PPOPP), ACM, 1991, pp. 106–113.
[8] M.Z. Kwiatkowska, G. Norman, D. Parker, Probabilistic symbolic model checking with PRISM: a hybrid approach, Int. J. Softw. Tools Technol. Transf. 6 (2) (2004) 128–142.
[9] S. Irani, G. Singh, S.K. Shukla, R. Gupta, An overview of the competitive and adversarial approaches to designing dynamic power management strategies, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 13 (12) (2005) 1349–1361.
[10] G. Norman, D. Parker, M.Z. Kwiatkowska, S. Shukla, R. Gupta, Using probabilistic model checking for dynamic power management, Form. Asp. Comput. 17 (2) (2005) 160–176.
[11] M.Z. Kwiatkowska, G. Norman, D. Parker, Stochastic model checking, in: Proceedings of the 7th International School on Formal Methods for the Design of Computer, Communication, and Software Systems, Formal Methods for Performance Evaluation (SFM), in: Lecture Notes in Computer Science, vol. 4486, Springer, 2007, pp. 220–270.
[12] M.Z. Kwiatkowska, G. Norman, D. Parker, The PRISM benchmark suite, in: Proceedings of the 9th International Conference on Quantitative Evaluation of Systems (QEST), IEEE Computer Society, 2012, pp. 203–204.
[13] J.-P. Katoen, I. Zapreev, E. Hahn, H. Hermanns, D. Jansen, The ins and outs of the probabilistic model checker MRMC, Perform. Eval. 68 (2) (2011) 90–104.
[14] N. Coste, H. Garavel, H. Hermanns, F. Lang, R. Mateescu, W. Serwe, Ten years of performance evaluation for concurrent systems using CADP, in: Proceedings of the 4th International Symposium on Leveraging Applications of Formal Methods, Verification, and Validation (ISoLA), Part II, in: Lecture Notes in Computer Science, vol. 6416, Springer, 2010, pp. 128–142.
[15] R. Mateescu, W. Serwe, A study of shared-memory mutual exclusion protocols using CADP, in: Proceedings of the 15th International Workshop on Formal Methods for Industrial Critical Systems (FMICS), in: Lecture Notes in Computer Science, vol. 6371, Springer, 2010, pp. 180–197.
[16] J. Reineke, D. Grund, C. Berg, R. Wilhelm, Timing predictability of cache replacement policies, Real-Time Syst. 37 (2) (2007) 99–122.
[17] J. Reineke, Caches in WCET Analysis — Predictability, Competitiveness, Sensitivity, epubli GmbH, 2008.
[18] M.Z. Kwiatkowska, G. Norman, D. Parker, Symmetry reduction for probabilistic model checking, in: Proceedings of the 18th International Conference on Computer-Aided Verification (CAV), in: Lecture Notes in Computer Science, vol. 4144, Springer, 2008, pp. 238–248.
[19] A.F. Donaldson, A. Miller, Symmetry reduction for probabilistic model checking using generic representatives, in: Proceedings of the 4th International Symposium on Automated Technology for Verification and Analysis (ATVA), in: Lecture Notes in Computer Science, vol. 4218, Springer, 2006, pp. 9–23.
[20] A.F. Donaldson, A. Miller, D. Parker, Language-level symmetry reduction for probabilistic model checking, in: Proceedings of the 6th International Conference on Quantitative Evaluation of Systems (QEST), IEEE Computer Society, 2009, pp. 289–298.
[21] B. Haverkort, Performance of Computer Communication Systems: A Model-Based Approach, Wiley, 1998.
[22] G. Kesidis, An Introduction to Communication Network Analysis, Wiley, 2007.
[23] V. Kulkarni, Modeling and Analysis of Stochastic Systems, Chapman & Hall, 1995.
[24] K. Sen, M. Viswanathan, G. Agha, On statistical model checking of stochastic systems, in: Proceedings of the 17th International Conference on Computer Aided Verification (CAV), in: Lecture Notes in Computer Science, vol. 3576, Springer, 2005, pp. 266–280.
[25] H.L.S. Younes, R.G. Simmons, Statistical probabilistic model checking with a focus on time-bounded properties, Inf. Comput. 204 (9) (2006) 1368–1409.
[26] M. Hähnel, Energy-Utility Functions, Diploma thesis, TU Dresden, Germany, April 2012.
[27] H. Hansson, B. Jonsson, A logic for reasoning about time and reliability, Form. Asp. Comput. 6 (1994) 512–535.
[28] J. Kemeny, J. Snell, Finite Markov Chains, Van Nostrand, 1960.
[29] C. Courcoubetis, M. Yannakakis, The complexity of probabilistic verification, J. ACM 42 (4) (1995) 857–907.
[30] M. Vardi, Probabilistic linear-time model checking: an overview of the automata-theoretic approach, in: Proceedings of the 5th International AMAST Workshop on Formal Methods for Real-Time and Probabilistic Systems (ARTS), in: Lecture Notes in Computer Science, vol. 1601, 1999, pp. 265–276.
[31] C. Baier, J.-P. Katoen, Principles of Model Checking, MIT Press, 2008.
[32] M. Ummels, C. Baier, Computing quantiles in Markov reward models, in: Proceedings of the 16th International Conference on Foundations of Software Science and Computation Structures (FOSSACS), in: Lecture Notes in Computer Science, vol. 7794, Springer, 2013, pp. 353–368.
[33] C. Baier, M. Daum, C. Dubslaff, J. Klein, S. Klüppelholz, Energy-utility quantiles, in: Proceedings of the 6th NASA Formal Methods Symposium (NFM), in: Lecture Notes in Computer Science, vol. 8430, Springer, 2014, pp. 285–299.
[34] G. Paoloni, How to benchmark code execution times on Intel IA-32 and IA-64 instruction set architectures, Intel Corporation, available at http://www.intel.com, Sep. 2010.
[35] A. Traeger, E. Zadok, N. Joukov, C.P. Wright, A nine year study of file system and storage benchmarking, Trans. Storage 4 (2) (2008) 5:1–5:56.
[36] C. Baier, M. Daum, B. Engel, H. Härtig, J. Klein, S. Klüppelholz, S. Märcker, H. Tews, M. Völp, Waiting for locks: How long does it usually take?, in: Proceedings of the 17th International Workshop on Formal Methods for Industrial Critical Systems (FMICS), in: Lecture Notes in Computer Science, vol. 7437, Springer, 2012, pp. 47–62.
[37] C. Baier, M. Daum, B. Engel, H. Härtig, J. Klein, S. Klüppelholz, S. Märcker, H. Tews, M. Völp, Chiefly symmetric: results on the scalability of probabilistic model checking for operating-system code, in: Proceedings of the 7th Conference on Systems Software Verification (SSV), in: EPTCS, vol. 102, 2012, pp. 156–166.
[38] J. Levon, OProfile manual, http://oprofile.sourceforge.net/, 2004.
[39] M. Vardi, Automatic verification of probabilistic concurrent finite-state programs, in: Proceedings of the 26th Symposium on Foundations of Computer Science (FOCS), IEEE Computer Society, 1985, pp. 327–338.
[40] F. Laroussinie, J. Sproston, Model checking durational probabilistic systems, in: Proceedings of the 8th International Conference on Foundations of Software Science and Computational Structures (FOSSACS), in: Lecture Notes in Computer Science, vol. 3441, Springer, 2005, pp. 140–154.
[41] E. Clarke, R. Enders, T. Filkorn, S. Jha, Exploiting symmetry in temporal logic model checking, Form. Methods Syst. Des. 9 (1/2) (1996) 77–104.
[42] E.A. Emerson, P. Sistla, Symmetry and model checking, Form. Methods Syst. Des. 9 (1/2) (1996) 105–131.
[43] C.N. Ip, D.L. Dill, Better verification through symmetry, Form. Methods Syst. Des. 9 (1/2) (1996) 41–75.
[44] E.A. Emerson, R.J. Trefler, From asymmetry to full symmetry: new techniques for symmetry reduction in model checking, in: Proceedings of the 10th Conference on Correct Hardware Design and Verification Methods (CHARME), in: Lecture Notes in Computer Science, vol. 1703, Springer, 1999, pp. 142–156.
[45] A. Miller, A.F. Donaldson, M. Calder, Symmetry in temporal logic model checking, ACM Comput. Surv. 38 (3) (2006) 8.
[46] E.A. Emerson, T. Wahl, On combining symmetry reduction and symbolic representation for efficient model checking, in: Proceedings of the 12th Conference on Correct Hardware Design and Verification Methods (CHARME), in: Lecture Notes in Computer Science, vol. 2860, Springer, 2003, pp. 216–230.
[47] J.-P. Katoen, T. Kemna, I. Zapreev, D. Jansen, Bisimulation minimisation mostly speeds up probabilistic model checking, in: Proceedings of the 13th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), in: Lecture Notes in Computer Science, vol. 4424, Springer, 2007, pp. 87–101.
[48] R. Elliott, L. Aggoun, J. Moore, Hidden Markov Models: Estimation and Control, Springer, 1995.
[49] M.T. Vechev, E. Yahav, G. Yorsh, Abstraction-guided synthesis of synchronization, Int. J. Softw. Tools Technol. Transf. 15 (5–6) (2013) 413–431.
[50] P. Černý, K. Chatterjee, T.A. Henzinger, A. Radhakrishna, R. Singh, Quantitative synthesis for concurrent programs, in: Proceedings of the 23rd International Conference on Computer Aided Verification (CAV), in: Lecture Notes in Computer Science, vol. 6806, Springer, 2011, pp. 243–259.