
Accepted manuscript — Future Generation Computer Systems, https://doi.org/10.1016/j.future.2017.09.003
Received 16 February 2017; revised 1 June 2017; accepted 2 September 2017.

An Empirical Study for Evaluating the Performance of Multi-cloud APIs

Reginaldo Ré a,∗, Rômulo Manciola Meloca a, Douglas Nassif Roma Junior b, Marcelo Alexandre da Cruz Ismael c, Gabriel Costa Silva d

a Universidade Tecnológica Federal do Paraná, Campus Campo Mourão, Department of Computer Science, Via Rosalina Maria dos Santos, 1233, CEP 87301-899, Campo Mourão, PR, Brazil
b Universidade Tecnológica Federal do Paraná, Campus Cornélio Procópio, Department of Computer Science, Av. Alberto Carazzai, 1640, CEP 86300-000, Cornélio Procópio, PR, Brazil
c Instituto Federal de Educação, Ciência e Tecnologia de São Paulo, Campus Presidente Epitácio, Rua José Ramos Junior, 27-50, CEP 19470-000, Presidente Epitácio, SP, Brazil
d Universidade Tecnológica Federal do Paraná, Campus Dois Vizinhos, Department of Software Engineering, Estrada para Boa Esperança, Km 04, CEP 85660-000, Dois Vizinhos, PR, Brazil

Abstract

The massive use of cloud APIs for workload orchestration and the increased adoption of multiple cloud platforms prompted the rise of multi-cloud APIs. Multi-cloud APIs abstract cloud differences and provide a single interface regardless of the target cloud platform. Identifying whether the performance of multi-cloud APIs differs significantly from that of platform-specific APIs is central for driving technological decisions on cloud applications that require maximum performance when using multiple clouds. This study aims to evaluate the performance of multi-cloud APIs when compared to platform-specific APIs. We carried out three rigorous quasi-experiments to measure the performance (dependent variable) of cloud APIs (independent variable) regarding CPU time, memory consumption and response time. jclouds and Libcloud were the two multi-cloud APIs used (experimental treatment). Their performance was compared to that of platform-specific APIs (control treatment) provided by Amazon Web Services and Microsoft Azure. These APIs were used for uploading and downloading (tasks) 39 722 files in five different sizes to/from storage services over five days (trials). Whereas jclouds performed significantly worse than platform-specific APIs for all performance indicators on both cloud platforms and operations for all five file sizes, Libcloud outperformed platform-specific APIs in most tests (p-value not exceeding 0.00125, A-statistic greater than 0.64). Once confirmed by independent replications, our results suggest that jclouds developers should review the API design to ensure minimal overhead, whereas jclouds users should evaluate the extent to which this trade-off affects the performance of their applications. Multi-cloud users should carefully evaluate which quality attribute is more important when selecting a cloud API.

Keywords: Multi-cloud, performance, evaluation, jclouds, Libcloud, experiment

1. Introduction

Performance is a critical quality attribute for several applications in cloud computing, such as Netflix and Dropbox [1, 2].

∗Corresponding author. Email addresses: [email protected] (Reginaldo Ré), [email protected] (Rômulo Manciola Meloca), [email protected] (Douglas Nassif Roma Junior), [email protected] (Marcelo Alexandre da Cruz Ismael), [email protected] (Gabriel Costa Silva). URL: http://www.gabrielcostasilva.com/ (Gabriel Costa Silva)

Indeed, performance features in the list of the top five concerns of cloud users [3] and is the top non-functional requirement for software architects [4]. However, ensuring application performance in highly distributed environments is challenging due to several influences [5]. The middleware used to integrate applications and cloud services may be one such influence [6, 7]. Cloud Application Programming Interfaces (APIs) expose middleware functions and play a key role in integrating SaaS and in orchestrating application workloads in PaaS and IaaS [3].

Recent emphasis on the use of multiple cloud platforms [8] and hybrid deployments [10] prompted the rise of multi-cloud APIs, which abstract cloud differences and provide a single interface for cloud service management regardless of the target cloud platform [11]. The literature usually refers to these APIs as open APIs [12, 13, 14]. In this paper, we use the term multi-cloud APIs instead, as it better reflects their purpose and prevents any confusion with open source software. However, little is known about the performance of multi-cloud APIs when compared to platform-specific APIs [11]. As opposed to multi-cloud APIs, platform-specific APIs target a single cloud platform. Ideally, multi-cloud APIs should enhance the qualities of platform-specific APIs rather than create significant extra overhead. In fact, extra overhead could outweigh the benefits of multi-cloud APIs for high-performance cloud applications.

The research question investigated in this paper asks whether the performance of multi-cloud APIs differs significantly from that of platform-specific APIs. However, investigating the reason for possible differences is beyond the aim of this study. The research question was prompted by a preliminary study [11] with jclouds1, a prominent multi-cloud API for Java applications. To answer our research question, we extended the aforementioned study in four dimensions (Section 3):

• increased the number of performance indicators to investigate CPU time, memory consumption and response time;
• investigated the two most popular multi-cloud APIs currently available: jclouds and Libcloud2;
• considered five workloads: 155KB, 310KB, 620KB, 1240KB and 2480KB; and
• explored two common operations for storage services: download and upload.

Our results show that the performance of jclouds and Libcloud differs significantly from that of platform-specific APIs. However, we do not generalise our results by claiming that multi-cloud APIs perform worse or better than platform-specific APIs, as performance varies according to the multi-cloud API used (Sections 4, 5 and 6). Our findings contribute to (i) cloud API engineers, by identifying performance bottlenecks; (ii) cloud application developers, by giving them an important criterion when selecting cloud APIs; (iii) researchers, by highlighting a little explored issue in cloud portability/multi-cloud management; and (iv) practitioners, by exposing a trade-off of a prominent solution for vendor lock-in (Section 7). As we explain in Section 8, we identified and tackled some threats to the validity of our study. These threats should be addressed in future work, as should the investigation of multi-cloud API code to identify the reasons that led to the performance differences (Section 9). Finally, Section 2 reviews key concepts and terminology needed for understanding this study.

2. Middleware and Cloud APIs

Software systems evolved from monolithic architectures to multi-tier distributed systems [15] and, more recently, to a coalition of systems [16]. The middleware integrates these heterogeneous systems and hides their complexity, enabling systems to work under a common service interface [15, 17]. Middleware provides programming abstractions (e.g., Application Programming Interfaces - APIs) to give programmers access to its functionalities [15]. As a type of distributed system [18], cloud computing relies on middleware for service integration and management. Cloud APIs play an important role in the programmatic management of cloud services by enabling cloud application developers to integrate software systems and cloud services [3]. Cloud providers offer proprietary APIs to enable cloud users to manage their cloud services [19]. However, proprietary cloud APIs create a strong dependency on a particular cloud provider (vendor lock-in) and complicate multi-cloud management by requiring a single application to adopt several cloud APIs [20, 19]. Even though some cloud providers offer APIs based on open standards [20], such as WSDL, they lack a single interface [21]. API incompatibility arises even between supposedly compatible cloud APIs [22, 23]. Hill & Humphrey [24] and Nguyen, Tran & Hluchy [25] observe that these differences in APIs are not only the result of syntax and semantics differences, but also of different service models and services provided. Multi-cloud APIs are a prominent solution for reducing the management heterogeneity among similar services offered by different cloud providers [26, 13]. Multi-cloud APIs abstract cloud differences and provide a single interface for cloud service management regardless of the target cloud platform [11]. jclouds and Libcloud are examples of multi-cloud APIs. Although both APIs aim at the same goal, they have several differences, such as the target programming language, the cloud services supported, and semantics.

1 http://jclouds.apache.org

2 http://libcloud.apache.org
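To make the single-interface idea concrete, the sketch below runs the same upload code against two providers through Libcloud's storage API. It is an illustrative example rather than code from this study; the credentials, container names and file name are hypothetical placeholders.

```python
# Illustrative only: the same Libcloud calls target two providers; only the
# driver selection and the credentials change. Names below are placeholders.
from libcloud.storage.providers import get_driver
from libcloud.storage.types import Provider


def upload_file(provider, key, secret, container_name, local_path, object_name):
    driver = get_driver(provider)(key, secret)          # single interface
    container = driver.get_container(container_name)
    with open(local_path, "rb") as stream:
        return driver.upload_object_via_stream(stream, container, object_name)


# Same function, two cloud platforms.
upload_file(Provider.S3, "AWS_KEY", "AWS_SECRET", "my-bucket",
            "workload_155kb.txt", "workload_155kb.txt")
upload_file(Provider.AZURE_BLOBS, "azure_account", "AZURE_KEY", "my-container",
            "workload_155kb.txt", "workload_155kb.txt")
```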



3. Experiment Planning

The goal of this study is to analyse two multi-cloud APIs for the purpose of evaluation with respect to their performance from the point of view of platform-specific API users in the context of uploading and downloading files to/from cloud blob storage services3. We note that whereas this study limits itself to analysing the performance of cloud APIs, the reason for possible differences is part of future work (Section 9). This study is an internal non-exact replication [27, 28] because it extends a previous study carried out to evaluate the performance of jclouds [11]. Due to the lack of rigorous guidelines for conducting experiments with middleware in cloud computing, the framework for experimentation in Software Engineering proposed by Wohlin et al. [29] and the guidelines for performance analysis proposed by Jain [30] were used extensively to prepare and conduct this study. The following sections detail the study protocol. Any emphasised text throughout this section refers to elements of the experimentation framework.

3 Emphasis in the main text represents elements of the Goal/Question/Metric template of Basili & Rombach.
4 http://www.gabrielcostasilva.com/publications/multicloud_performance/

Table 1: Version of APIs used in this study.

# | API | Version
1 | jclouds | 1.9.2
2 | Libcloud | 1.1.0
3 | Java Azure | 4.0.0
4 | Java AWS | 1.9.0
5 | Python Azure | 1.0.3
6 | Python AWS | 1.3.1

3.1. Tasks

Downloading and uploading files to/from a blob storage service are the two tasks considered in this study. These are essential tasks analysed when investigating the performance of cloud blob storage services [31, 2]. Because these tasks are heavily dependent on the network, we address possible threats to validity in Section 8. A set of instruments was used to perform these tasks, as explained in the next section.

3.2. Instrumentation

APIs, application prototypes, measurement tools, cloud platforms and services, and workloads are the instruments used in this study. The instruments are available online to encourage independent replications4.

APIs. Petcu & Vasilakos list six popular multi-cloud APIs [13]. For this study, we investigated these six APIs and identified that only three are still actively maintained: jclouds, Libcloud and Fog. Of these, only jclouds and Libcloud are used by large companies and open source projects, which drove our selection. Possible threats to the validity of this study raised by this selection are addressed in Section 8. jclouds and Libcloud differ in their target programming language: whereas jclouds is a Java API, Libcloud is a Python API. To make a fair performance comparison with platform-specific APIs, we used both the Java and Python versions of the platform-specific APIs. Table 1 shows the versions of the APIs used in this study. Section 8 discusses possible threats to validity raised by the API versions used in this study.

Application prototype. We developed a prototype application to perform the experimental tasks. The prototype consists of a single compilation unit that instantiates a cloud API, signs in to the cloud platform, prepares the workload, and uploads/downloads the workload to/from a cloud service. Each prototype uses one cloud API only. Although six cloud APIs were used (Table 1), we developed eight versions of the prototype because jclouds and Libcloud were deployed on both the Azure and AWS platforms. For the Java prototypes we used JDK version 1.8.0_91, and for the Python prototypes we used Python version 2.7.12. Developing our own prototype may raise some threats to the validity of this experiment, which are addressed in Section 8.

Cloud platforms and services. Amazon Web Services (AWS) and Microsoft Azure (Azure) provided both the APIs and the cloud services used in this study. Possible threats to the validity of this study raised by selecting these platforms are addressed in Section 8. As performance differences in a single cloud service are all we need for a preliminary demonstration of the existence of performance differences, this study focuses on blob storage services. Blob storage services are often the target of performance analyses [2, 31, 32], which drove our choice of this service. In addition, we also used VM services. Whereas the blob storage service was used as the target of the experimental task, the VM service was used to deploy the prototype.



Deploying the prototype application on a VM in the same cloud platform that provides the blob storage service was a strategy to mitigate external noise in the response time and to provision similar environments for the prototype execution [31]. Thus, for analysing AWS APIs, the application prototype was deployed on the AWS VM service. We analysed the performance of the AWS platform-specific APIs in Java and Python, together with jclouds and Libcloud, when uploading/downloading files from the AWS VM service (where the prototype was deployed) to/from the AWS blob storage service. Similarly, for analysing Azure APIs, the application prototype was deployed on the Azure VM service. We analysed the performance of the Azure platform-specific APIs in Java and Python, together with jclouds and Libcloud, when uploading/downloading files from the Azure VM service to/from the Azure blob storage service. Note that jclouds and Libcloud were used twice – once for each cloud platform considered in this study.

Table 2: Performance indicators used in this study.

Performance Indicator | Definition | Measure
CPU time | The amount of CPU time, within the process and the kernel, that the prototype took for executing the task. | Seconds
Memory | The amount of RAM memory used by the prototype for the task. | KB
Response time | The amount of time that the API took for performing the task, disregarding the cloud authentication. | Milliseconds

Workload. Different file sizes have been used for measuring service performance in cloud computing [2, 31, 32]. However, the literature shows no consensus on the ideal file size when measuring response time. In addition, the heterogeneity of purposes for using storage services makes it hard to define what can be considered a "real" scenario. For instance, a SaaS service such as ShareLatex5 manages small text files and figures, whereas a streaming service such as Netflix manages large media files. Therefore, to define the file sizes, we drew on related work [24] and on our prior experience [11]. As a result, we decided to start with a (small) 155KB text file and incrementally double the file size four times, resulting in the following workloads: 155KB, 310KB, 620KB, 1240KB and 2480KB.
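A minimal sketch of how such workload files could be generated, assuming any filler content of the right size is acceptable; the file names are hypothetical:

```python
# Generate the five workload files (155KB doubled four times). Illustrative
# sketch only; the authors' actual workload files are available online.
KB = 1024
for size_kb in [155 * 2 ** i for i in range(5)]:      # 155, 310, 620, 1240, 2480
    with open(f"workload_{size_kb}kb.txt", "wb") as handle:
        handle.write(b"a" * size_kb * KB)             # filler text of the exact size
```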

Measurement tools. In addition to performing the experimental task, the prototype had another goal: measuring the API response time. To do so, we added a line of code (LoC) just before the service authentication and just after the experimental task was performed, to register the initial and final times (in milliseconds) of the experimental task. The difference between the initial and final times was returned by the prototype, representing the cloud API response time. This is a common strategy for measuring middleware response time [33]. The Linux utility time was used to measure the CPU time and the memory consumed while performing the experimental task. Although several profiling tools are available (e.g., jconsole, CProfile), we could not find an open source/free tool that fit our purposes. The use of the time utility rather than a profiling tool may raise some threats to the validity of this experiment, which are addressed in Section 8. We developed a shell script to automate the calls both to the prototype and to the time utility, and to submit the values of the performance indicators to a web service responsible for storing the data into a database (i.e., the dataset).
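The sketch below illustrates this measurement strategy in Python 3 (the study itself used a shell script, the GNU time utility and Python 2.7); the prototype command, its output format and the web service endpoint are assumptions made for illustration.

```python
# Illustrative measurement wrapper: GNU time captures CPU time and peak memory,
# while the prototype itself prints the response time in milliseconds.
import subprocess
import requests


def run_observation(prototype_cmd, task, filesize_kb):
    cmd = ["/usr/bin/time", "-f", "%U %S %M"] + prototype_cmd
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    user_s, sys_s, max_rss_kb = result.stderr.split()[-3:]   # GNU time writes to stderr
    return {
        "task": task,
        "filesize_kb": filesize_kb,
        "cpu_time_s": float(user_s) + float(sys_s),          # process + kernel time
        "memory_kb": int(max_rss_kb),                        # peak resident set size
        "response_time_ms": float(result.stdout.strip()),    # printed by the prototype
    }


record = run_observation(["python", "prototype.py", "upload", "155"], "upload", 155)
requests.post("https://example.org/observations", json=record)  # hypothetical dataset service
```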

3.3. Variable Selection

Performance is the dependent variable, measured on a ratio scale. The performance variable was broken down into three performance indicators, as detailed in Table 2. It is worth noting that response time is a subset of CPU time [30]. Whereas response time includes only the time necessary for executing a task (e.g., upload), CPU time includes the time for carrying out the whole process of instantiating objects, preparing files, authenticating with the service and executing the task. To choose CPU time, memory consumption and response time as performance indicators, we analysed the results of a systematic review that summarises the performance indicators broadly used for measuring cloud blob storage services [32]. Cloud APIs represent the independent variable (or factor) investigated in this study. Multi-cloud (experimental treatment) and platform-specific (control treatment) APIs are the two possible values (treatments or levels) analysed in each quasi-experiment.


5 http://www.sharelatex.com


The independent variable is measured on a categorical scale. In addition, the experimental tasks and workloads were used as moderators for analysing the performance of cloud APIs. A moderator is a variable that affects the relationship between the independent and dependent variables [34]. For instance, a cloud API's performance may vary with the workload used (moderator).
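Put together, each observation in the dataset can be pictured as a record combining the factor, the moderators, the trial and the three performance indicators; the field names and values below are illustrative, not the study's actual schema.

```python
# Illustrative record of a single observation (values are made up).
observation = {
    "api_type": "multi-cloud",     # factor: multi-cloud vs. platform-specific
    "api": "jclouds",
    "platform": "azure",
    "task": "upload",              # moderator
    "filesize_kb": 155,            # moderator (workload)
    "trial": "2016-06-26",         # day of execution
    "cpu_time_s": 6.5,             # dependent variable: three performance indicators
    "memory_kb": 105000,
    "response_time_ms": 1100.0,
}
```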

3.4. Sample Size & Trials

To define the number of executions (sample size), we used guidelines for assessing randomised algorithms [35], which recommend n = 1 000 for each randomised algorithm. These guidelines were used here due to the lack of established guidelines for assessing middleware in cloud computing. Note that our analysis takes into account the combination of cloud platform, cloud API and programming language, as follows:

• (for Azure) jclouds vs. Java Azure API;
• (for AWS) jclouds vs. Java AWS API;
• (for Azure) Libcloud vs. Python Azure API;
• (for AWS) Libcloud vs. Python AWS API.

Therefore, each combination of cloud platform, cloud API and programming language was executed 1 000 times (observations) each day, during five days (trials), resulting in 40 000 observations. However, it is important to highlight that our final dataset consists of 39 722 observations, since several observations were lost due to network issues.

3.5. Study Design & Hypothesis Definition

Since a quasi-experiment measures a cause-effect relationship [29], we carried out three quasi-experiments – one for each performance indicator in Table 2. These are quasi-experiments because they do not employ randomisation [29]. However, for ease of readability, we use the term experiment from here on. Each experiment adopts a one-factor, two-treatment design, i.e., the independent variable is used for hypothesis testing. Two hypotheses were defined for each experiment. The null hypothesis states that the median performance of a performance indicator is the same regardless of the type of cloud API (i.e., multi-cloud or platform-specific) used, whereas the alternative hypothesis states the opposite. As we explain in Section 3.6, the median was preferred to the mean because it is more robust. The hypotheses were checked for each combination of performance indicator, factor, and moderators, resulting in 120 tests (40 tests for each experiment).
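As a back-of-the-envelope check of the sampling plan in Section 3.4, the sketch below enumerates the eight prototype/platform combinations behind the four comparisons and reproduces the planned total of 40 000 observations; it is illustrative only.

```python
# Eight combinations (four comparisons, each pairing a multi-cloud API with a
# platform-specific API in the same language), run 1 000 times a day for 5 days.
combinations = [
    ("azure", "java", "jclouds"), ("azure", "java", "platform-specific"),
    ("aws", "java", "jclouds"), ("aws", "java", "platform-specific"),
    ("azure", "python", "libcloud"), ("azure", "python", "platform-specific"),
    ("aws", "python", "libcloud"), ("aws", "python", "platform-specific"),
]
OBSERVATIONS_PER_DAY, TRIALS = 1000, 5
print(len(combinations) * OBSERVATIONS_PER_DAY * TRIALS)  # 40000 planned (39722 kept)
```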

3.6. Data Analysis

We adopted only non-parametric statistics and tests in this study, as summarised in Table 3. The rationale is that (i) non-parametric statistics are less sensitive to outliers and (ii) non-parametric tests make no assumption about the data distribution [36, 35]. Andrews et al. [37] underpinned the statistical test selection. Foundations for the statistical tests can be found in [38, 29]. Charts are used to support data visualisation. The Type I error, also known as the significance level (α), is the probability of falsely rejecting the null hypothesis [29]. The α was set to the traditional 0.05, as we had no reason to increase or decrease this value. This α level implies a confidence level of 0.95 when accepting the null hypothesis. As 40 hypothesis tests were carried out for each experiment, we used the Bonferroni correction to control the familywise error rate [38], which resulted in α = 0.00125 for accepting or rejecting the null hypothesis. We also defined an expected effect size (γ). The γ is important because it measures the magnitude of the difference between experimental treatments [39]. In this study, the Vargha & Delaney A measure [40] is used to measure γ. The A-statistic, produced by the A measure, is a value between 0.5 (no effect) and 1 (large effect). A value less than 0.5 should be subtracted from 1 to obtain the effect size. For instance, if the A-statistic is 0.35, then the effect size is 0.65 – which is considered medium, according to [40]. According to Miller et al. [39], the value of γ can be estimated by analysing similar experiments. Taking our previous study [11] into account, we set γ to 0.64, which corresponds to a medium effect size. The R statistical software6 was used for performing the data analysis. The package effsize7 was used to calculate the effect size. In the following three sections, we detail the results of applying the protocol presented in this section. In addition, in Section 7 we examine the impact of these results on hypothesis testing and the research question, and discuss the implications of our findings.
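The study performed this analysis in R with the effsize package; the sketch below shows an equivalent computation in Python with SciPy, assuming two arrays of observations for the two treatments. The Wilcoxon rank-sum test corresponds to SciPy's Mann-Whitney U, and the A-statistic follows from the U statistic.

```python
# Illustrative re-implementation of the inferential analysis (the study used R).
from scipy.stats import mannwhitneyu

ALPHA = 0.05 / 40   # Bonferroni correction for 40 tests per experiment = 0.00125
GAMMA = 0.64        # expected (medium) effect size


def vargha_delaney_a(x, y):
    """Probability that a value drawn from x is larger than one drawn from y."""
    u_statistic, _ = mannwhitneyu(x, y, alternative="two-sided")
    return u_statistic / (len(x) * len(y))


def test_treatments(multi_cloud, platform_specific):
    _, p_value = mannwhitneyu(multi_cloud, platform_specific, alternative="two-sided")
    a = vargha_delaney_a(multi_cloud, platform_specific)
    effect_size = a if a >= 0.5 else 1 - a   # mirror values below 0.5
    return p_value < ALPHA, effect_size >= GAMMA
```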

4. CPU Time Experiment

In order to evaluate the representativeness of each factor in our experiment with respect to CPU time, we performed a linear regression of the log-transformed data (as described in Table 3).

6 http://www.r-project.org

7 http://cran.r-project.org/web/packages/effsize/index.html


Table 3: Summary of statistics and statistical tests adopted in this study.

Statistic Type | Statistic/Test | Definition/Config. | Purpose
Descriptive | Median and quartiles | 25%, 50%, 75%, 100% | To identify the middle value and variance.
Descriptive | Range | Maximum - Minimum | To quantify the data dispersion.
Inferential | A-test | A-statistic | To identify the magnitude of the difference between treatments.
Inferential | Wilcoxon rank-sum | Two-tailed | To test the statistical significance of differences between two medians.
Inferential | Spearman's correlation | Two-tailed | To test the strength of a relationship between two numerical variables.
Inferential | Linear regression | Log transformed data | To identify to what extent variables and moderators explain the performance indicator.

The six factors explain 91.13% of the CPU time variation (Adjusted R2 = 0.9113). This suggests that these are the most representative factors for evaluating the CPU time of cloud APIs. All factors are representative (p-value ≤ 0.001) at α = 0.05, apart from the trials. This suggests that CPU time does not vary significantly over different days.
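A minimal sketch of such a regression using statsmodels (the study used R); the data frame and factor column names are assumptions, chosen to mirror the factors discussed in this paper.

```python
# Illustrative regression of log-transformed CPU time on six categorical factors.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("observations.csv")                 # hypothetical dataset export
df["log_cpu_time"] = np.log(df["cpu_time_s"])        # log transformation

model = smf.ols(
    "log_cpu_time ~ C(api_type) + C(api) + C(platform) + C(task)"
    " + C(filesize_kb) + C(trial)",
    data=df,
).fit()
print(model.rsquared_adj)    # adjusted R^2 (reported as 0.9113 in the text)
print(model.pvalues)         # per-factor significance at alpha = 0.05
```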



4.1. CPU Time Analysis of Java APIs

On Azure, the boxplots in Figure 1 show that the x̃ CPU time of the Java multi-cloud API is more than three times longer than that of the platform-specific API for all five file sizes in both download and upload tasks. The Wilcoxon rank sum test confirms that the CPU time differences between multi-cloud and platform-specific APIs are statistically significant (p-value < 0.00125), whereas the effect size measured by the A-statistic shows that these differences are large (A = 1). The boxplots in Figure 1 show little difference across the medians of the same cloud API along the five file sizes for each task, which may suggest that file size and CPU time are weakly correlated. We tested this correlation using Spearman's correlation. According to reference values, the correlation between file size and CPU time consumption is considered small (0.232 ≤ rho ≤ 0.387) for both multi-cloud and platform-specific APIs, which suggests that increasing the file size has little impact on CPU time. Like on Azure, Figure 2 shows that the x̃ CPU time of the multi-cloud API is longer than that of the platform-specific API for all five file sizes in both download and upload tasks on AWS. However, the difference between multi-cloud and platform-specific APIs on AWS (roughly one and a half times) is smaller than on Azure (more than three times). Nevertheless, the A-statistic shows that the differences between APIs on AWS are large (A = 1) and the Wilcoxon rank sum test confirms that the differences are statistically significant (p-value < 0.00125).


Figure 1: CPU time of jclouds and Azure API. jclouds had longer CPU time than Azure API for all five file sizes in both tasks (p-value < 0.00125; A = 1). Small correlation between file size and CPU time suggests CPU time does not increase significantly with file size.

Note that the x̃ CPU time of the multi-cloud API on Azure (6sec < CPU time < 7sec) is longer than on AWS (3.25sec < CPU time < 3.75sec), which suggests that the multi-cloud API is optimized for AWS. We did not observe a significant difference in the x̃ CPU time of the platform-specific API between Azure and AWS. Unlike on Azure, the correlation between file size and CPU time consumption is considered medium (0.588 ≤ rho < 0.604) for the multi-cloud API and large (rho ≥ 0.846) for the platform-specific APIs. This correlation suggests that file size has more impact on CPU time on AWS than on Azure regardless of the cloud API (i.e., multi-cloud or platform-specific) used.
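The correlations above can be reproduced along the following lines, again assuming the hypothetical data frame columns used in the earlier sketches:

```python
# Illustrative Spearman correlation between file size and CPU time per group.
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("observations.csv")                 # hypothetical dataset export
for (platform, api_type, task), group in df.groupby(["platform", "api_type", "task"]):
    rho, p_value = spearmanr(group["filesize_kb"], group["cpu_time_s"])
    print(f"{platform}/{api_type}/{task}: rho = {rho:.3f} (p = {p_value:.4f})")
```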


Figure 2: CPU time of jclouds and AWS API. As on Azure, jclouds x˜ CPU time is longer than AWS API. Correlation between file size and CPU is greater for AWS than for Azure.


Figure 3: CPU time of Libcloud and Azure API. The result is the opposite of jclouds - Libcloud had shorter CPU time than the Azure API (p-value < 0.00125; 0.91 < A < 1). Correlation suggests that Azure tackles differences in file size more homogeneously than AWS.

4.2. CPU Time Analysis of Python APIs

Unlike Java APIs on Azure, the x̃ CPU time of the multi-cloud API is shorter than that of the platform-specific API for all five file sizes in both download and upload tasks for Python APIs on Azure (Figure 3). The CPU time differences between the Python multi-cloud and platform-specific APIs are statistically significant (p-value < 0.00125) and the effect size measured by the A-statistic is large (0.91 < A < 1) for all file sizes and tasks. Figure 3 also shows that the CPU time of the multi-cloud API varies between tasks. In fact, the effect size calculated for the differences of CPU time between tasks on Azure for the Python multi-cloud API (0.87 < A < 0.96) is greater than for the platform-specific API (0.51 < A < 0.61). This difference in effect size suggests that the CPU time of the platform-specific API is more homogeneous across tasks than that of the multi-cloud API. As for the Java APIs, the correlation between file size and CPU time for Python APIs on Azure is considered small for both tasks and APIs, apart from the platform-specific API for download (rho = 0.514). Along with the correlation calculated for Java APIs on Azure, this correlation suggests that Azure tackles differences in file size more homogeneously than AWS. Unlike all prior boxplots for CPU time, the boxplots in Figure 4 show tiny whiskers and boxes, suggesting that the CPU time varied little for Python APIs on AWS. Indeed, the range of CPU time (max - min) for the Python multi-cloud (0.01sec ≤ range ≤ 0.02sec) and platform-specific (0.01sec ≤ range ≤ 0.03sec) APIs on AWS is the lowest compared with Java APIs on Azure, Java APIs on AWS and Python APIs on Azure.

Similar to the result for Python APIs on Azure, the multi-cloud API consumed less CPU time than the platform-specific API on AWS for all file sizes in both download and upload tasks. The difference in CPU time between multi-cloud and platform-specific APIs is statistically significant (p-value < 0.00125) and the effect size is large (A = 1). Unlike the result for Python APIs on Azure, the difference in CPU time between tasks varies from medium to large on AWS (0.693 < A ≤ 1). The correlation between file size and CPU time is large (rho > 0.801) for all Python APIs and all tasks on AWS, apart from the upload task of the multi-cloud API (rho = 0.724). The correlation calculated for Python APIs on AWS is the largest compared with Java APIs on Azure, Java APIs on AWS and Python APIs on Azure. This suggests that variations in file size have more impact on CPU time for Python APIs on AWS than for the other APIs and cloud platforms.

Figure 4: CPU time of Libcloud and AWS API. The x̃ CPU time is shorter for Libcloud than for the AWS API (p-value < 0.00125; A = 1). Correlations suggest that variations in file size have more impact on CPU time for Python APIs on AWS than for other APIs/cloud platforms.

5. Memory Consumption Experiment

A linear regression of the log-transformed memory consumption shows that the six factors explain 85.4% of the memory consumption variation (Adjusted R2 = 0.854). This value shows that the six factors are representative for evaluating the memory consumption of cloud APIs. All factors are representative (p-value < 0.003) at α = 0.05, apart from the trials. This suggests that memory consumption does not vary significantly over different days.

5.1. Memory Consumption Analysis of Java APIs

Boxplots in Figure 5 show that the Java multi-cloud API consumed nearly double the memory of the Java platform-specific API for all five file sizes in both download and upload tasks on Azure.




Figure 5: Memory consumption of jclouds and Azure API. jclouds consumed nearly double the memory of the Azure API for all five file sizes in both download and upload tasks (p-value < 0.00125; A = 1). Negligible and small correlations between file size and memory consumption were found.

Differences between multi-cloud and platform-specific APIs are large (A = 1) and statistically significant (p-value < 0.00125). Apart from the upload task for the platform-specific API with the 2 480KB file size, the medians of memory consumption are similar across file sizes within the same group of task and API. For instance, the x̃ memory consumption of the multi-cloud API for the download task varied from 104 872KB to 105 014KB. This suggests that different file sizes have little impact on memory consumption. We tested this impact by calculating the Spearman's correlation between file size and memory consumption. Whereas the correlation for the multi-cloud API was found to be negligible (rho (download) = 0.026; rho (upload) = 0.099), the correlation for the platform-specific API was found to be small (rho (download) = 0.255; rho (upload) = 0.468). Figure 5 also shows that the range of memory consumption across the five file sizes in both tasks for the platform-specific API (9 708KB ≤ range ≤ 10 716KB) is shorter than the range for the multi-cloud API (18 732KB ≤ range ≤ 20 796KB), apart from the upload of the 2 480KB file size with the platform-specific API. This suggests that the memory consumption of the platform-specific API varied little across tasks. Like on Azure, the memory consumption of Java APIs on AWS was greater for the multi-cloud API than for the platform-specific API (Figure 6). However, the average distance between medians is higher on AWS than on Azure. This distance suggests that the difference between medians is large. Indeed, the A-statistic shows

that the difference in memory consumption between the multi-cloud and platform-specific Java APIs on AWS is large (A = 1). These differences are also statistically significant (p-value < 0.00125), as confirmed by the Wilcoxon rank sum test. Observing the boxplots in Figure 6, one can notice that the medians do not vary significantly between tasks. For instance, the x̃ memory consumption of the Java multi-cloud API for download was between 210 598KB and 211 334KB across file sizes, whereas for upload it was between 210 726KB and 211 220KB. This suggests that the download and upload tasks consumed similar amounts of memory. However, the Wilcoxon test shows that the differences between tasks are statistically significant (p-value < 0.0001) and the A-statistic shows that these differences between tasks are large (0.74 < A < 0.974). The correlation between file size and memory consumption for Java APIs on AWS was found to be negligible (rho (download) = 0.053; rho (upload) = 0) for the multi-cloud API and small (rho (download) = 0.369; rho (upload) = 0.261) for the platform-specific API. Similar to the correlation calculated on Azure, this correlation suggests that file size has little impact on memory consumption for both multi-cloud and platform-specific Java APIs.

5.2. Memory Consumption Analysis of Python APIs

In contrast with the memory consumption of Java APIs on Azure, the Python multi-cloud API consumed less memory than the platform-specific API for all file sizes in both download and upload tasks on Azure (Figure 7). Also unlike Java APIs on Azure, the memory consumption differences between the Python multi-cloud and platform-specific APIs on Azure increase with file size.


Figure 6: Memory consumption of jclouds and AWS API. jclouds consumed more memory than AWS API (p-value < 0.00125; A = 1). Although boxplots suggest little difference of median between tasks, differences are statistically significant (p-value < 0.0001) and large (0.74 < A < 0.974).

Figure 7: Memory consumption of Libcloud and Azure API. Unlike jclouds, Libcloud consumed less memory than the Azure API (p-value < 0.00125; A = 1). Analysis of correlations suggests that the Java platform-specific API on Azure manages memory better than the Python platform-specific API when file size increases.

Across file sizes and tasks, the memory consumption differences are statistically significant (p-value < 0.00125) and large (A = 1). Whereas the memory consumption of the platform-specific API on Azure increases with file size, the memory consumption of the multi-cloud API varies little across file sizes, according to the boxplots in Figure 7. This suggests that the impact of file size on memory consumption is greater for the platform-specific API than for the multi-cloud API. In fact, the Spearman's correlation between file size and memory consumption is large (rho (download) = 0.98; rho (upload) = 0.938) for the platform-specific API, whereas the correlation is negligible for the multi-cloud API (rho (download) = 0.02; rho (upload) = -0.034). Whereas this result is similar to the correlation calculated for the Java multi-cloud API on Azure (rho (download) = 0.026; rho (upload) = 0.099), it differs significantly from the correlation calculated for the Java platform-specific API on Azure (rho (download) = 0.255; rho (upload) = 0.468). This suggests that the Java platform-specific API on Azure manages memory better than the Python platform-specific API when file size increases. Like for the Java platform-specific API, Figure 7 shows that the interquartile range (IQR) of memory consumption for the Python platform-specific API is short (56KB ≤ IQR ≤ 108KB). This suggests little variation across download and upload requests for the Python platform-specific API.

The IQR of memory consumption for the Python multi-cloud API is also short on Azure (64KB ≤ IQR ≤ 100KB), but it was not for the Java multi-cloud API on Azure (2 281KB ≤ IQR ≤ 3 896KB). This suggests that the Python multi-cloud API suffers less memory variation than the Java multi-cloud API across download and upload requests. Like the Python APIs on Azure, Figure 8 shows flat boxplots, suggesting that the Python APIs varied less than the Java APIs regarding memory consumption. In fact, the memory consumption range of the Python APIs on AWS is the shortest (x̃ = 16KB) compared with Python on Azure (x̃ = 256KB), Java on AWS (x̃ = 13 198KB) and Java on Azure (x̃ = 15 144KB). Similar to the Python APIs on Azure, the Python multi-cloud API consumed less memory on AWS than the platform-specific API regardless of file size and task. The differences between multi-cloud and platform-specific APIs are statistically significant (p-value < 0.00125) and large (A = 1) for all five file sizes in both download and upload tasks. In contrast with the platform-specific API on Azure, the boxplots in Figure 8 show that the memory consumption of the platform-specific API on AWS did not vary with file size. In fact, the correlation between file size and memory consumption is negligible (rho (multi-cloud/download) = 0.03; rho (multi-cloud/upload) = 0.11; rho (platform-specific/download) = 0.09) for both APIs in both tasks, apart from the upload task of the platform-specific API (rho = 0.27).


Figure 8: Memory consumption of Libcloud and AWS API. Libcloud consumed less memory than AWS API (p-value < 0.00125; A = 1). The memory consumption range of Python APIs on AWS is the shortest ( x˜ = 16KB) compared with Python on Azure ( x˜ = 256KB), Java on AWS ( x˜ = 13 198 KB) and Java on Azure ( x˜ = 15 144KB)


Figure 9: Response time of jclouds and Azure API. The response time of jclouds is longer than that of the Azure API (p-value < 0.00125; A = 1). 79 outlier observations are omitted from the boxplots due to scale adjustments to fit the figure size.

6. Response Time Experiment

A linear regression of the log-transformed response time shows that the six factors explain 76.57% of the response time variation (Adjusted R2 = 0.7657). This is the smallest value compared with the other two performance indicators. This result suggests that about one quarter of the variation is explained by other factors not analysed in this experiment. All factors are representative (p-value ≤ 0.0152) at α = 0.05. Unlike the other two performance indicators, two trials were representative (26/06/2016 and 28/06/2016). This suggests that these days do impact response time.

6.1. Response Time Analysis of Java APIs

Boxplots in Figure 9 show that the x̃ response time of the Java multi-cloud API is greater than that of the Java platform-specific API for all five file sizes in both download and upload tasks on Azure. The Wilcoxon rank sum test confirms that the response time differences are statistically significant (p-value < 0.00125), and the A-statistic shows that these differences are large (A = 1). Boxplots in Figure 9 show that the response time of Java APIs on Azure increases with file size in both download and upload tasks, and suggest that the response time of the multi-cloud API is more sensitive to file size than that of the platform-specific API. We tested this assumption by checking the Spearman's correlation. Whereas the correlation between response time and file size of the multi-cloud API is considered large (rho (download) = 0.561; rho (upload) = 0.564), the correlation of the platform-specific API is considered small (rho = 0.353) for download and large (rho = 0.834) for upload.

Boxplots in Figure 9 also show that the x̃ response time of the multi-cloud API varies only slightly between tasks, whereas the x̃ response time of the platform-specific API presents a greater variation between tasks. In fact, the x̃ response time difference between tasks for the same type of API (i.e., multi-cloud download compared with multi-cloud upload, and platform-specific download compared with platform-specific upload) is statistically significant (p-value < 0.001) across file sizes and API types. However, the effect size of the differences is greater for the platform-specific API (A = 1) than for the multi-cloud API (0.632 ≤ A ≤ 0.662). This result, along with the correlation analysed previously, suggests that the Java multi-cloud API is optimized to deal with different file sizes and tasks more homogeneously than the platform-specific API with respect to response time. Like the response time on Azure, the x̃ response time of the multi-cloud API is greater than that of the platform-specific API for all five file sizes in both download and upload tasks on AWS (Figure 10). The differences are statistically significant (p-value < 0.00125) and large (0.988 ≤ A < 1) for all five file sizes in both download and upload tasks. Boxplots in Figure 10 suggest that response time increases with file size on AWS, as on Azure. Indeed, whereas the correlation is small for the Java multi-cloud API regardless of the task (rho (download) = 0.46; rho (upload) = 0.48), the correlation varies from medium (rho = 0.74) for the upload task to large (rho = 0.81) for the download task of the Java platform-specific API on AWS. The effect size of the response time differences between tasks is greater for the platform-specific API (0.90 < A < 1) than for the multi-cloud API (0.66 < A < 0.81) for all five file sizes, as on Azure.


Figure 10: Response time of jclouds and AWS API. Like on Azure, the response time of jclouds is longer than that of the AWS API (p-value < 0.00125; 0.988 ≤ A < 1). 72 outlier observations are omitted from the boxplots due to scale adjustments to fit the figure size.


Figure 11: Response time of Libcloud and Azure API. Two scenarios - for download, Libcloud had longer response time than Azure API whereas for upload, the result was the opposite (p-value < 0.00125; 0.714 < A < 0.951). 197 (outlier) observations are omitted in boxplots due to scale adjusts to fit into the figure size.

This again suggests that the Java multi-cloud API deals with different file sizes and tasks more homogeneously than the platform-specific API with respect to response time.

6.2. Response Time Analysis of Python APIs

Unlike the result of response time for Java APIs on Azure, boxplots in Figure 11 show two different scenarios of response time for Python APIs on Azure. Whereas the multi-cloud API had a longer x̃ response time than the platform-specific API for download, the positions of the APIs were the opposite for upload - the platform-specific API had a longer x̃ response time than the multi-cloud API. The x̃ response time differences are statistically significant (p-value < 0.00125). Boxplots in Figure 11 also show that the differences in x̃ response time between APIs are greater for the download (0.918 < A < 0.951) than for the upload (0.714 < A < 0.799) task for all five file sizes. Although this heterogeneous difference in x̃ response time between APIs across tasks could also be observed for Java APIs on Azure, the scenario was the opposite - the upload task presented a greater difference than the download task. Boxplots in Figure 11 show that the x̃ response time of Python APIs on Azure increases with file size. Indeed, the correlation between response time and file size measured by the Spearman's correlation varies from medium (rho (multi-cloud/download) = 0.53; rho (multi-cloud/upload) = 0.74; rho (platform-specific/download) = 0.74) to large (rho = 0.81) for the platform-specific API in the upload task.

Finally, boxplots in Figure 11 show that the x̃ response time of the Python multi-cloud API varies significantly (0.937 < A < 0.970; p-value < 0.0001) between tasks, across file sizes. On the other hand, the x̃ response time difference between tasks of the Python platform-specific API (0.508 0.375). For the other two file sizes (1240KB and 2480KB), the x̃ response time differences between multi-cloud and platform-specific APIs were found to be significant (p-value < 0.00125). For all five file sizes, the response time differences between multi-cloud and platform-specific APIs are small (0.50 < A < 0.578). Apart from the multi-cloud API for download (rho = 0.425), the correlation between response time and file size for Python APIs on Azure was found to be medium (rho (platform-specific/download) = 0.617; rho (platform-specific/upload) = 0.559; rho (multi-cloud/upload) = 0.507) in both tasks.



Figure 12: Response time of Libcloud and AWS API. Like on Azure, there are two scenarios: for download, Libcloud had a longer response time than the AWS API, whereas for upload the APIs performed alike. 45 outlier observations are omitted from the boxplots due to scale adjustments to fit the figure size.

7. Discussion

7.1. Hypothesis Testing

The null hypotheses investigated in this study (Section 3.3) state that the median CPU time, memory consumption and response time to upload or download a set of files in five different sizes to/from two cloud storage services is the same regardless of the type of cloud API used (i.e., multi-cloud or platform-specific).

The experiment in Section 4 evaluated the CPU time of cloud APIs. Whereas jclouds presented a longer CPU time than the Azure and AWS APIs for all five file sizes in both download and upload tasks, Libcloud presented a shorter CPU time than the Azure and AWS APIs for all secondary factors investigated. As the p-value is significant at α = 0.00125 (i.e., the α set for the experiment corrected with the Bonferroni correction for 40 tests) and the effect size is large for all 40 hypothesis tests undertaken, we reject the null hypotheses for CPU time and accept the alternative hypotheses, which state that the median CPU time varies with the type of cloud API used.

The experiment in Section 5 evaluated the memory consumption of cloud APIs. Like in the CPU time experiment, jclouds consumed more memory than the Azure and AWS APIs for all five file sizes in both download and upload tasks, whereas Libcloud consumed less memory than the Azure and AWS APIs for all secondary factors investigated, with significant p-values and large effect sizes. Hence, we reject the null hypotheses for memory consumption and accept the alternative hypotheses, which state that the median memory consumption varies with the type of cloud API used.

The experiment in Section 6 evaluated the response time of cloud APIs. Like in the previous experiments, the response time of jclouds was longer than that of the Azure and AWS APIs for all five file sizes in both download and upload tasks, with significant p-values and large effect sizes for the 20 hypothesis tests. Therefore, we reject the null hypotheses for response time for this combination of secondary factors and accept the alternative hypotheses, which state that the median response time varies with the type of cloud API used. On the other hand, the response time of Libcloud varied with the tasks. For the download task, Libcloud had a longer median response time than the Azure and AWS APIs for all five file sizes, with significant p-values and large effect sizes for all 10 hypothesis tests. Hence, we reject the null hypotheses and accept the alternative hypotheses for response time for this combination of factors. For the upload task on Azure, Libcloud had a shorter median response time than the Azure API (p-value < 0.00125; 0.714 < A < 0.799) for all five file sizes. This result enables us to reject the null hypotheses for the response time of Libcloud on Azure for the upload task and accept the alternative hypotheses. Finally, for the upload task on AWS, the median response time varied with file size as follows: for three file sizes (155KB, 310KB and 620KB), the median response time differences between Libcloud and the AWS API were found to be neither significant (p-value ≥ 0.375) nor relevant (0.503 ≤ A ≤ 0.518). Therefore, for these three hypothesis tests we accept the null hypotheses, which state that the response time does not vary with the type of cloud API used. For the other two file sizes (1 240KB and 2 480KB), the median response time differences between Libcloud and the AWS API were found to be significant (p-value < 0.00125), but the effect size is small (0.562 ≤ A ≤ 0.578). Hence, for these two hypothesis tests we reject the null hypotheses and accept the alternative hypotheses. Table 4 summarises descriptive and inferential statistics for all 120 hypothesis tests carried out in our three experiments.
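The decision rule applied above can be summarised as follows; this is an illustrative restatement using the thresholds from Section 3.6, not code used in the study.

```python
# Reject H0 only when the Bonferroni-corrected p-value is significant; report
# whether the effect size also reaches the expected (medium) threshold.
ALPHA = 0.00125   # 0.05 / 40 tests per experiment
GAMMA = 0.64      # expected effect size


def decide(p_value, a_statistic):
    effect = a_statistic if a_statistic >= 0.5 else 1 - a_statistic
    if p_value >= ALPHA:
        return "accept H0"
    return "reject H0 (large effect)" if effect >= GAMMA else "reject H0 (small effect)"


print(decide(p_value=1e-165, a_statistic=1.0))   # e.g. jclouds vs. platform-specific APIs
print(decide(p_value=0.375, a_statistic=0.51))   # e.g. Libcloud upload on AWS (small sizes)
```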

7.2. Research Question

Our research question asks whether the performance of multi-cloud APIs differs significantly from that of platform-specific APIs.

Section 7.1 shows that 117 out of 120 hypothesis tests (97.5%) rejected the null hypothesis and accepted the alternative hypothesis, meaning that the performance indicator varies with the type of cloud API used. Moreover, 115 out of 120 hypothesis tests (95.8%) presented a large effect size, meaning that the performance differences between cloud APIs are relevant. Taking only the results and the context of the present study into consideration, the answer to our research question is that the performance of multi-cloud APIs does differ significantly from that of platform-specific APIs. It is important to note that jclouds performed significantly worse than the Azure and AWS APIs for all three performance indicators and five file sizes in both download and upload tasks. On the other hand, Libcloud outperformed the Azure and AWS APIs regarding CPU time and memory consumption in both download and upload tasks for all five file sizes. These results show two opposite scenarios that prevent us from generalising that multi-cloud APIs perform better or worse than platform-specific APIs. Furthermore, our findings should be considered preliminary. Like any other, the experiments reported in this paper constitute a single piece of evidence rather than definitive proof [41]. It would be unrealistic to expect that a single experiment could enable wide generalisations [42]. Therefore, independent replications are necessary to increase the confidence in these results and enable their generalisation [28] to other APIs, cloud services, and cloud platforms. Finally, we should highlight that, as long as the performance differences are confirmed by independent replications, it is essential to investigate the reasons that led to these differences.

7.3. Implications of this Study

Once confirmed by independent replications, our results have four major implications. Firstly, since performance is the overarching non-functional requirement for software architects [4], our findings can support the decision process of selecting cloud APIs. Secondly, cloud users, and jclouds users in particular, should evaluate the extent to which the performance trade-offs identified in this study negatively impact their applications. Although most performance differences between cloud APIs were found to be large in our study, only cloud users can claim whether this is a real threat or just a harmless trade-off. Next, the performance bottlenecks identified in this study should be carefully analysed by cloud API engineers, as API performance is a great responsibility of software engineers [7].

For instance, when studying the jclouds architecture, one of the authors of this study identified some design decisions that might have contributed to the weak performance of jclouds. Finally, evaluating cloud portability/multi-cloud management solutions seems not to be a common target of researchers. For example, papers on cloud portability and multi-cloud management often propose technical solutions with minimal empirical evaluation [26]. This study highlights a little explored issue that might have a great impact on solutions for cloud portability and multi-cloud management.

7.4. Should a Cloud User Replace Their Cloud API?

Our results do not mean that one should avoid a particular cloud API, but they highlight the importance of evaluating which quality attribute is more important when selecting a cloud API. As Vinoski notes in [33], "Middleware has a variety of qualities, such as size, cost, complexity, flexibility, and performance. For each application, different qualities are more important than others. Choosing the right middleware means first determining the qualities that matter most for your application, and then evaluating different types of middleware to see how well they meet your requirements." Another important aspect to take into account is that replacing cloud APIs requires a great re-engineering effort because cloud API and application code are often tangled. In addition, unlike other API replacements, cloud platforms may differ in semantics [21, 22, 23]. For instance, replacing the AWT with the Swing API in Java desktop applications requires mapping similar components, such as buttons, text boxes and forms. On the other hand, cloud platforms may provide a similar service (e.g., computing), but with so many semantic differences that they overcomplicate the mapping. Finally, improving portability (i.e., by enabling the use of multiple clouds) is not just a matter of re-engineering. Like most quality attributes, portability might conflict with other quality attributes [5]. In particular, portability often conflicts with performance [43]. Therefore, it is critical to analyse in advance the benefits and trade-offs of improving portability in order to ensure that its costs pay off.

8. Threats to Validity

This section discusses the main internal and external threats to the validity of this study.


Table 4: Summary of descriptive and inferential statistics. For each cloud platform (Azure and AWS), language (Java and Python), performance indicator (CPU time, memory consumption and response time), task (upload and download) and file size (155, 310, 620, 1240 and 2480 KB), the table reports the platform-specific and multi-cloud values, their difference (Diff %), the p-value and the A-statistic.

8.1. Internal Threats
Internal threats affect the cause-effect relationship and might suggest an alternative cause for the observed effect [29]. We identified three internal threats. Firstly, uploading and downloading files to/from blob storage services is a task heavily dependent on the network bandwidth available between the application and the storage service. Because we used a public cloud provider, we could not isolate all external network noise between the prototype and the storage service. However, we followed procedures reported in the literature [2, 31, 32, 24, 35] to reduce the threats introduced by network noise, such as deploying the prototype in the same cloud platform and region as the storage service, carrying out several trials each day, and collecting data on multiple days. Nevertheless, as suggested by one of our reviewers, future replications should monitor network bandwidth throughout the experiment execution, which would enable correlating API performance with the network measurements.
Secondly, we used the Linux time utility to measure CPU time and memory consumption (Section 3.2). Unlike a profiling tool, the time utility measures the entire application, including the JVM, whereas a profiler could isolate the cloud API from the rest of the application. However, the performance of all six cloud APIs was collected in the same way, so comparisons were always fair.
Finally, we developed the prototype used for exercising the cloud APIs and collecting response time data (Section 3.2). Although developing instruments is a common practice in experiments, cloud APIs have different features, and the way they are used could favour one cloud API over another. To mitigate this threat, we rigorously examined each cloud API's documentation to ensure that only essential features were used, and in a similar fashion. For instance, these essential features consist of (i) creating a client object by using the service credentials; (ii) using the client object to build a service object that accesses the desired resource; (iii) packaging the workload; (iv) opening the stream; (v) sending the workload; and (vi) closing the stream. Details of the configuration used can be checked in the source code available online8.
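As a concrete illustration of how such essential steps look in practice, a minimal Libcloud upload is sketched below. The sketch is ours rather than the study's prototype, and the credentials, container and file names are placeholders.

from libcloud.storage.types import Provider
from libcloud.storage.providers import get_driver

driver = get_driver(Provider.S3)('ACCESS_KEY', 'SECRET_KEY')   # (i) client object from credentials
container = driver.get_container('experiment-container')       # (ii) service object for the resource
with open('workload.bin', 'rb') as stream:                      # (iii)-(iv) package workload, open stream
    driver.upload_object_via_stream(stream, container,          # (v) send the workload
                                    object_name='workload.bin')
# (vi) the context manager closes the stream on exit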

8.2. External Threats
External threats restrict the generalisation of results beyond the scope of the study [29]. We identified three external threats. Firstly, cloud APIs are available in different versions; we used the latest version available at the time the experiments were carried out, as shown in Table 1. Our results are therefore not timeless, because API performance could improve in a new version of a cloud API. Although the results are valid for the versions used in this study, they may not hold for newer versions. In fact, alerting cloud API engineers to performance issues so they can be solved in new versions of their APIs is part of the benefit of our study (Section 7.3).
Secondly, only two cloud platforms and one cloud service were used to analyse the performance of multi-cloud APIs. Although the cloud platforms used are leaders in the cloud market and the cloud service used has been widely used to measure the performance of clouds, this selection restricts the applicability of our findings to this scope. We should emphasise, however, that this is the case for most studies on cloud performance [32]. We have kept the instruments used in this study available online to help other researchers to extend our study and evaluate other cloud platforms and services.
Finally, many multi-cloud APIs have been devised to deal with multi-cloud management [26]. We used only jclouds and Libcloud because they are actively maintained and widely used. As our results show, performance may vary significantly between cloud APIs. Therefore, extending our study to other APIs is necessary to increase the generalisability of our results.

9. Conclusion and Future Directions
This paper presents three quasi-experiments that evaluate the CPU time, memory consumption and response time of jclouds and Libcloud when compared to the Azure and AWS APIs for a blob storage service. Our results show that 117 out of 120 hypothesis tests (97.5%) rejected the null hypothesis and accepted the alternative hypothesis, meaning that the performance indicator varies with the type of cloud API used. Moreover, 115 out of 120 hypothesis tests (95.8%) presented a large effect size, meaning that the performance differences between cloud APIs are relevant. Within the scope of our study, these results suggest that the performance of multi-cloud APIs differs significantly from platform-specific APIs.

8 http://www.gabrielcostasilva.com/publications/multicloud_performance/


However, we should highlight that jclouds performed significantly worse than the Azure and AWS APIs for all three performance indicators, whereas Libcloud outperformed the Azure and AWS APIs regarding CPU time and memory consumption. These two opposite scenarios prevent us from generalising that multi-cloud APIs perform better or worse than platform-specific APIs. Our results do not mean that one should avoid a particular cloud API, but they highlight the importance of evaluating which quality attribute matters most when selecting a cloud API. As future work, we intend to investigate the jclouds and Libcloud code to identify the reasons behind the performance differences with respect to the Azure and AWS APIs. Furthermore, we strongly encourage independent replications of this study to extend, confirm or refute our results on different cloud platforms, services and APIs.

Acknowledgement
The authors are grateful to Dr Rafael Alves Paes de Oliveira and Dr Lucio Agostinho Rocha for their valuable feedback on an earlier version of the manuscript.

References
[1] A. Basiri, N. Behnam, R. de Rooij, L. Hochstein, L. Kosewski, J. Reynolds, C. Rosenthal, Chaos Engineering, IEEE Software 33 (3) (2016) 35–41.
[2] I. Drago, M. Mellia, M. M. Munafo, A. Sperotto, R. Sadre, A. Pras, Inside Dropbox, in: Proceedings of the 2012 ACM conference on Internet measurement conference - IMC ’12, ACM Press, Boston, USA, 2012, pp. 481–494.
[3] J. M. Emison, State of Cloud Survey, Tech. rep., InformationWeek (2014). URL http://createyournextcustomer.com/2014/11/19/2014-state-of-cloud-computing/
[4] D. Ameller, C. Ayala, J. Cabot, X. Franch, How do software architects consider non-functional requirements: An exploratory study, in: 2012 20th IEEE International Requirements Engineering Conference (RE), IEEE, Chicago, IL, 2012, pp. 41–50.
[5] M. Galster, E. Bucherer, A Taxonomy for Identifying and Specifying Non-Functional Requirements in Service-Oriented Development, in: 2008 IEEE Congress on Services - Part I, IEEE, Honolulu, HI, 2008, pp. 345–352.
[6] L. Sales, H. Teofilo, J. D’Orleans, N. C. Mendonca, R. Barbosa, F. Trinta, Performance Impact Analysis of Two Generic Group Communication APIs, in: 2009 33rd Annual IEEE International Computer Software and Applications Conference, IEEE, Seattle, 2009, pp. 148–153.
[7] R. F. Sproull, J. Waldo, The API performance contract, Communications of the ACM 57 (3) (2014) 45–51.
[8] D. Petcu, Multi-Cloud: expectations and current approaches, in: Proceedings of the MultiCloud ’13, ACM Press, Prague, 2013, pp. 1–6.
[9] K. Marko, J. M. Emison, Information Week 2014 Hybrid Cloud Survey, Tech. rep., Information Week (2014). URL http://reports.informationweek.com/abstract/5/12531/Cloud-Computing/2014-Hybrid-Cloud-Survey.html

[10] State of the Cloud Report, Tech. rep., RightScale (2016). URL http://www.rightscale.com/lp/2016-state-of-the-cloud-report
[11] M. A. d. C. Ismael, C. A. da Silva, G. C. Silva, R. Re, An Empirical Study for Evaluating the Performance of jclouds, in: 2015 IEEE 7th CloudCom, IEEE, Vancouver, 2015, pp. 115–122.
[12] D. Petcu, Portability and interoperability between clouds: Challenges and case study, in: W. Abramowicz, I. M. Llorente, M. Surridge, A. Zisman, J. Vayssière (Eds.), Towards a Service-Based Internet, Vol. 6994 LNCS, Springer Berlin Heidelberg, 2011, pp. 62–74. doi:10.1007/978-3-642-24755-2_6.
[13] D. Petcu, A. V. Vasilakos, Portability in clouds: approaches and research opportunities, Scalable Computing: Practice and Experience 15 (3) (2014) 251–270.
[14] J. Opara-Martins, R. Sahandi, F. Tian, Critical analysis of vendor lock-in and its impact on cloud computing migration: a business perspective, Journal of Cloud Computing 5 (1) (2016) 4. doi:10.1186/s13677-016-0054-z.
[15] G. Alonso, F. Casati, H. Kuno, V. Machiraju, Web Services, Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.
[16] I. Sommerville, D. Cliff, R. Calinescu, J. Keen, T. Kelly, M. Kwiatkowska, J. Mcdermid, R. Paige, Large-scale complex IT systems, Communications of the ACM 55 (7) (2012) 71–77.
[17] S. Vinoski, Where is middleware, IEEE Internet Computing 6 (2) (2002) 83–85.
[18] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, I. Brandic, Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility, Future Generation Computer Systems 25 (6) (2009) 599–616.
[19] B. S. Lee, S. Yan, D. Ma, G. Zhao, Aggregating IaaS Service, in: 2011 Annual SRII Global Conference, IEEE, San Jose, CA, 2011, pp. 335–338.
[20] L. Rodero-Merino, L. M. Vaquero, V. Gil, F. Galán, J. Fontán, R. S. Montero, I. M. Llorente, From infrastructure delivery to service management in clouds, Future Generation Computer Systems 26 (8) (2010) 1226–1240.
[21] J. Tao, H. Marten, D. Kramer, W. Karl, An Intuitive Framework for Accessing Computing Clouds, Procedia Computer Science 4 (1) (2011) 2049–2057.
[22] H. Flores, S. N. Srirama, C. Paniagua, A generic middleware framework for handling process intensive hybrid cloud services from mobiles, in: Proceedings of the MoMM ’11, ACM Press, New York, New York, USA, 2011, pp. 87–94.
[23] A. E. C. d. S. Souza, J. A. M. de Lima, R. Gondim, T. Diniz, N. Cacho, F. Lopes, T. Batista, Avaliando o Aprisionamento entre Várias Plataformas de Computação em Nuvem, in: 31th SBRC, Brasília, 2013, pp. 775–788.
[24] Z. Hill, M. Humphrey, CSAL: A Cloud Storage Abstraction Layer to Enable Portable Cloud Applications, in: CloudCom 2010, IEEE, Indianapolis, 2010, pp. 504–511.
[25] B. Nguyen, V. Tran, L. Hluchy, High-Level Abstraction Layers for Development and Deployment of Cloud Services, in: Networked Digital Technologies, Vol. 293 of Communications in Computer and Information Science, Springer Berlin Heidelberg, Dubai, 2012, pp. 208–219.
[26] G. C. Silva, L. M. Rose, R. Calinescu, A Systematic Review of Cloud Lock-In Solutions, in: CloudCom 2013, IEEE, Bristol, UK, 2013, pp. 363–368.
[27] N. Juristo, S. Vegas, The role of non-exact replications in software engineering experiments, Empirical Software Engineering 16 (3) (2011) 295–324.
[28] J. Miller, Replicating software engineering experiments: a poisoned chalice or the Holy Grail, Information and Software Technology 47 (4) (2005) 233–244.
[29] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, A. Wesslén, Experimentation in Software Engineering, Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.


[30] R. Jain, Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurements, Simulation and Modeling, Wiley, 2015.
[31] A. Li, X. Yang, S. Kandula, M. Zhang, CloudCmp: comparing public cloud providers, in: Proceedings of the 10th annual conference on Internet measurement, ACM, Melbourne, 2010, pp. 1–14.
[32] Z. Li, L. O’Brien, H. Zhang, R. Cai, On a Catalogue of Metrics for Evaluating Commercial Cloud Services, in: 2012 ACM/IEEE 13th Int’l Conference on Grid Computing, IEEE, Beijing, 2012, pp. 164–173.
[33] S. Vinoski, The performance presumption, IEEE Internet Computing 7 (2) (2003) 88–90.
[34] T. Dybå, D. I. Sjøberg, D. S. Cruzes, What works for whom, where, when, and why?, in: Proceedings of the ESEM ’12, ACM Press, 2012, p. 19.
[35] A. Arcuri, L. Briand, A practical guide for using statistical tests to assess randomized algorithms in software engineering, in: Proceedings of the ICSE ’11, ACM Press, 2011, p. 1.
[36] B. Kitchenham, Robust statistical methods, in: Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering - EASE ’15, ACM Press, 2015, pp. 1–6. doi:10.1145/2745802.2747956.
[37] F. M. Andrews, L. Klem, T. N. Davidson, P. M. O’Malley, W. L. Rodgers, A Guide for Selecting Statistical Techniques for Analyzing Social Science Data, 2nd Edition, Survey Research Center, Institute for Social Research, University of Michigan, Michigan, 1981.
[38] A. Field, J. Miles, Z. Field, Discovering Statistics Using R, SAGE Publications, London, UK, 2012.
[39] J. Miller, J. Daly, M. Wood, M. Roper, A. Brooks, Statistical power and its subcomponents - missing and misunderstood concepts in empirical software engineering research, Information and Software Technology 39 (4) (1997) 285–295.
[40] A. Vargha, H. D. Delaney, A Critique and Improvement of the CL Common Language Effect Size Statistics of McGraw and Wong, Journal of Educational and Behavioral Statistics 25 (2) (2000) 101–132.
[41] F. Shull, R. L. Feldmann, Building Theories from Multiple Evidence Sources, in: F. Shull, J. Singer, D. I. K. Sjøberg (Eds.), Guide to Advanced Empirical Software Engineering, 1st Edition, Springer London, London, 2008, Ch. 13, pp. 337–364.
[42] V. Basili, F. Shull, F. Lanubile, Building knowledge through families of experiments, IEEE Transactions on Software Engineering 25 (4) (1999) 456–473.
[43] L. Bass, P. Clements, R. Kazman, Software Architecture in Practice, 3rd Edition, Addison-Wesley, Boston, Massachusetts, 2013.


Reginaldo Ré received the MSc degree in 2002 and the PhD degree in 2009 from the State University of São Paulo, both in computer science and computational mathematics. His research interests in recent years are data mining, defect prediction, co-changes prediction and cloud computing. Presently, he is a software engineering professor at the Universidade Tecnológica Federal do Paraná - Campo Mourão, Brazil.

Rômulo Manciola Meloca is a computer science undergraduate student at the Universidade Tecnológica Federal do Paraná - Campo Mourão. His research interest is the management of multiple cloud platforms.

Douglas Nassif Roma Junior is a master's student at the Universidade Tecnológica Federal do Paraná - Cornélio Procópio. He has 10 years of experience working in industry. Currently, he is a systems analyst at AppMoove, Brazil. His research interest is the management of multiple cloud platforms.

Marcelo Alexandre da Cruz Ismael received the professional MSc degree in informatics from the Universidade Tecnológica Federal do Paraná - Cornélio Procópio in 2016. He has 5 years of experience working in industry. Currently, he is a lecturer at the Instituto Federal de Educação, Ciência e Tecnologia de São Paulo - Presidente Epitácio, Brazil. His research interest is the adoption of cloud computing.

Gabriel Costa Silva received the MSc degree in computer science from the State University of Maringá in 2010 and the PhD degree in computer science from the University of York in 2016. He holds several professional certifications and has 11 years of experience working in industry. In academia, he has taught topics related to software engineering for more than 4 years. Currently, he is a software engineering lecturer at the Universidade Tecnológica Federal do Paraná - Dois Vizinhos, Brazil. His major research interest is the adoption of modern technologies to evolve legacy systems.

Highlights
1 - The performance of multi-cloud APIs differs significantly from platform-specific APIs.
2 - jclouds performed significantly worse than platform-specific APIs in all tests.
3 - Libcloud outperformed platform-specific APIs in most tests.
4 - Multi-cloud users should evaluate which quality attribute is more important.