An empirical study on implementing highly reliable stream computing systems with private cloud





Ad Hoc Networks xxx (2015) xxx–xxx



Yaxiao Liu∗, Weidong Liu, Jiaxing Song, Huan He


Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China

Article history: Received 8 December 2014; Revised 5 June 2015; Accepted 13 July 2015; Available online xxx.

Keywords: Stream computing; Cloud; Reliability; Banking system

Abstract: Stream computing systems are designed for high-frequency data; in real cases they can deal with billions of transactions per day. Cloud technology can support distributed stream computing systems through its elastic and fault-tolerant capabilities. In a real deployment environment, such as the pre-treatment systems of top Chinese banks, reliability as experienced by users is a key performance metric. Although many significant works have been proposed in the literature, they have limitations such as a lack of architectural focus or difficulty of implementation in complex projects. This paper describes the reliability issue caused by service downgrade in the cloud. We use novel reliability analysis techniques, queuing theory, and software rejuvenation management techniques to build a framework for supporting stream data with low latency and fault tolerance. A real streaming system from a top bank is studied to provide supporting data. Operational parameters such as the rejuvenation window and the time-out parameter are identified as key parameters for the design of a distributed stream processing system. An algorithm for reliability optimization, monitoring and forecasting is also introduced. The paper also compares the improved results with the original issues; the improvements saved millions in losses and protected the bank's reputation. © 2015 Published by Elsevier B.V.


1. Introduction


Stream systems are designed to support continuous online data processing [1]. Many stream computing systems are designed to deal with the velocity characteristic of big data [2], which often requires stream systems to label, extract or generate events from the original data. Since the data stream is continuous, the reliability of a stream processing system significantly impacts the quality of its data outputs. Cloud computing technology can provide elastic support to stream processing systems by helping them handle data volume fluctuations. The fast rejuvenation of cloud services can also provide fault-tolerance support.




Corresponding author. Tel.: +8613910752612. E-mail addresses: [email protected] (Y. Liu), [email protected] (W. Liu), [email protected] (J. Song), [email protected] (H. He).

In some real cases, we find that operational parameters also play an important role in stream management. For example, if a distributed system uses software rejuvenation for fault tolerance, the system might overflow its queues before rejuvenation in the cloud completes. We also observe cases of reliability downgrade caused by cloud hardware failure. Even if the stream system is distributed across different virtual server zones, unexpected queue overflow can still be encountered. Unexpected reliability in stream systems can cause big losses in business cases. We decided to address this challenge by improving the reliability of cloud-based stream systems from an architectural view. In most cases, cloud service providers offer service-level agreements (SLAs) based on the 'availability' of cloud services [3]. 'Availability' means the 'up' time of a single cloud service regardless of the user's experience. For example, a cloud-based load balancing service may be blocked by a tremendous data flow while all cloud virtual servers are 'up'. From the cloud user's view, the system is 'unreliable', while from the cloud

http://dx.doi.org/10.1016/j.adhoc.2015.07.009 1570-8705/© 2015 Published by Elsevier B.V.

Please cite this article as: Y. Liu et al., An empirical study on implementing highly reliable stream computing systems with private cloud, Ad Hoc Networks (2015), http://dx.doi.org/10.1016/j.adhoc.2015.07.009



provider's point of view, the services are 'available'. Reliability plays an important role here in linking the business case with cloud performance. We have proposed several architecture patterns for banking private clouds [4]. All the patterns have specific needs in a private cloud. Stream systems can be considered a special application architecture pattern in a cloud environment [1]. Our research is part of a general study of managing different banking application architecture patterns in the cloud. Cloud availability/reliability [5–8] has been discussed in many papers. This empirical study comes from the practice of designing stream processing systems on a private cloud for a major bank in China. The stream system helps the bank handle the third largest data processing requirement in China, with more than 200 million daily transactions. In our previous work [4], we provided a methodology to manage the cloud deployment pattern from the business, cloud and physical layers. In this paper, we focus on the stream aspects within the same layers. We analyze the mathematical considerations needed to manage reliability, and describe the importance of implementing and managing the pre-treatment processing stream in top Chinese banks. After careful study, we provided a three-layer improvement solution, which is proved successful by its deployment in the production environment: the deployment shows an 80% improvement under the same issues. We provide both a novel methodology and empirical data to support our contributions. Our contributions include:

1. Define reliability as an architectural metric for cloud-based business services. The definition can be measured through the variance of reliability instead of calculating many uncertain components.
2. Design a methodology that uses the above metric to evaluate cloud-based service components. To limit complexity, the study focuses on stream computing in the cloud.
3. Resolve business challenges in high-volume transactions in top Chinese banks. The solution prevents the ATM processing system from losing thousands of transactions, helping the bank maintain both revenue and reputation.

This paper is organized as follows:

• In Section 2, we review the related work in this area and introduce our contributions on architectural methodology and the variance of reliability records.
• In Section 3, we describe the challenges from a banking system and provide analysis based on the framework in Section 2.
• In Section 4, we provide the optimization solutions and a comparison of results.
• In Section 5, a prospect for our work is provided together with conclusions.

2. Related works and our contributions

2.1. Related works on cloud reliability

Cloud reliability can affect a stream system's fault tolerance significantly. Some research [5,6] has raised concerns about the reliability and availability of cloud computing virtual machines. In a common on-premise IT environment, IT availability is defined as [5,9]:

Availability = (Total Service Time − Downtime) / Total Service Time.   (2.1)

Service reliability is defined as the probability that a service performs its intended function under stated conditions [9]. Reliability concerns whether the system provides acceptable service, meaning a correct or accurate result delivered within an acceptable time. If a system is up but takes a long time to respond, it is considered unreliable. Another formula may express the idea more clearly [5]:

Reliability = Number of Successful Responses / Total Number of Responses.   (2.2)

For a specified checkpoint, a service can obtain a track record of the probability of successful returns from (2.2). A cloud system contains multiple architectural layers, such as bare-metal servers, hypervisors and cloud management software. The work in [8] identified that, in a large cloud-enabled datacenter (i.e., the Microsoft cloud), the most frequent failure events come from storage systems, especially storage systems with multiple disks and RAID controllers. From an application user's view, the reliability of business transactions may or may not be affected by such a failure; it depends on whether a high-availability solution avoids the impact or the application can store its data in different storage systems. The authors in [5] gave several guidelines for cloud deployment, such as deploying application nodes in different virtual machines to share load, and deploying virtual machines in different hypervisor zones to avoid a single point of failure. They also discussed the latencies for different requirements. Other research [3,6,8] gave more practical details on using elastic capabilities to support stream fault tolerance. Notably, 'rapid elastic' capabilities can support both low latency and fault tolerance. In our study, we find that we can use the variance of reliability for a specific architectural pattern to manage reliability, instead of accurate variables that are more difficult to obtain.

2.2. Related works on distributed stream processing

The velocity attribute of big data often requires a stream processing system to provide high-performance capabilities. Since data sources send data packages continuously, stream systems have to process data within a limited time slot. Streams can use parallel sub-streams to scale the required time slots. Each sub-stream handles data labeling, business event identification or word counting tasks simultaneously. A low-latency design ensures that processing does not block future data flow, and a fault-tolerant design helps the stream system not to lose processing capability. Several systems focus on large-scale, low-latency stream computation, such as Apache S4 [10] and Twitter Storm [11]. The S4 system keeps computation in memory and partitions the data; it only restarts a computation when there is a failure. Storm handles faults by two-phase commit: failed data is re-input so that at-least-once tolerance is achieved.
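The availability (2.1) and reliability (2.2) definitions from Section 2.1 can be sketched as two small helpers. This is a minimal illustration: the log values and the empty-log convention below are our own assumptions, not figures from the paper.

```python
def availability(total_service_time: float, downtime: float) -> float:
    """Eq. (2.1): fraction of time the service was 'up'."""
    return (total_service_time - downtime) / total_service_time

def reliability(responses):
    """Eq. (2.2): fraction of responses that succeeded.

    `responses` is an iterable of booleans: True = successful response.
    """
    responses = list(responses)
    if not responses:
        return 1.0  # assumption: no traffic counts as fully reliable
    return sum(responses) / len(responses)

# A service can be 'available' (up) yet 'unreliable' from the user's view:
print(availability(86400, 60))                  # ≈ 0.9993: looks healthy
print(reliability([True] * 70 + [False] * 30))  # 0.7: users see failures
```

This makes the paper's point concrete: the availability figure stays high while the reliability figure, measured at the user-facing checkpoint, reveals the degradation.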






Fig. 1. Logical relationship diagram for cloud services.


Besides fault tolerance, some systems are well suited to being put on a cloud system for elastic options [12–14]. In [14], the authors from Microsoft focused on low-latency continuous processing. They compared Storm, Apache S4, D-Streams and many other distributed systems, using a timestamp and an event tracking system based on .NET technology. The processing system was based on a DAG: when a vertex fails, a new one is initiated to replace it and continue the execution, and timestamps manage the out-of-order packages. Based on this algorithm, sub-DAGs can be inserted into the chains for elastic options. However, the authors ignored the operational parts, such as the rejuvenation cost in the cloud. If a rejuvenation does not complete before a new vertex is born, the system will halt. In [13], the authors implemented Storm on the cloud in a straightforward way. They tried to allocate proper dynamic resources for stream processing and fault tolerance. They noticed that queuing in Processing Elements (PEs) would cause more latency and resource occupation. An algorithm was introduced to achieve higher performance by lowering availability: the end-user specifies a contract for the minimum, maximum and average service levels, and the inputs bring the contract identification into the processing system with k distributed resources. When the k resources in the cloud experience queuing inside PEs, some data is discarded. This may cause transaction failures in case of large data loss. The related works show that:
• Deploying streams on the cloud can improve overall availability.
• The deployment reduces management work.
• The cloud environment may introduce more reliability challenges from an operational view.

2.3. Our contributions

The aim of our work is to build a fault-tolerant, low-latency stream processing architecture over the cloud. In our previous work [4], we identified several patterns for deploying solutions in a cloud; this paper focuses on the stream pattern. From an architectural view, the reliability dependency model can be expressed as a logical AND/OR map. A service can be divided into business services (applications), cloud services (cloud software such as OpenStack and KVM), and physical services (physical machines). The logical relationship of architectural connections can be expressed as a logical map with 'AND'/'OR' gates. A reliability event occurs when at least one component under an 'AND' gate generates an event; components under an 'OR' gate only generate a higher-level reliability issue when all of them generate events. From Fig. 1, we can make the following observations:

1. More 'AND' gates in the cloud cause more reliability events.
2. Reliability depends on an architectural design with proper 'OR' gates.

In other words, cloud reliability represents the user experience. The 'AND'/'OR' diagram can be transferred directly into a Reliability Building Block Diagram (RBD) [15] for further study. Most public cloud vendors publish service availability commitments under certain conditions [3]. Such work describes the availability part of cloud services; cloud users can estimate empirical reliability ranges from the service level agreement. Some papers pointed out that the reliability
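The AND/OR dependency map described above can be evaluated numerically if components are assumed independent with known reliabilities. The sketch below is illustrative only: the gate layout and component values are hypothetical, not the paper's measured figures.

```python
import math

def gate(kind, children):
    """Reliability of an AND/OR gate node.

    Children are either plain reliabilities (floats) or nested gate tuples.
    'AND': every child must work -> product of child reliabilities.
    'OR' : redundant children; the gate fails only if all children fail.
    """
    rs = [c if isinstance(c, float) else gate(*c) for c in children]
    if kind == 'AND':
        return math.prod(rs)
    if kind == 'OR':
        fail = 1.0
        for r in rs:
            fail *= (1.0 - r)
        return 1.0 - fail
    raise ValueError(kind)

# Hypothetical service: two redundant app nodes (OR) in series (AND) with
# a cloud layer and a physical layer.
service = ('AND', [('OR', [0.95, 0.95]), 0.999, 0.999])
print(round(gate(*service), 4))  # 0.9955
```

The example shows the two observations numerically: adding another factor under the 'AND' gate can only lower the result, while adding a redundant branch under the 'OR' gate can only raise it.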




calculation should be based on the avoidance of non-reliability events [5,6,8]. However, most calculation methodologies focus on hardware reliability. We also notice that predicting variations in service reliability is more important than accurate reliability metrics; that is, the study focuses more on the time sensitivity of reliability. We use R(t, x, y) to represent the 'real-time' reliability of a given system x, i.e., the probability of x providing proper service to a given transaction y. To simplify our model, we assume that the cloud application provides a single kind of transaction to the end-user; then R(t, x, y) can be simplified as R(t, x). The stream pattern is a single-kind-transaction pattern. We define Rh(t1, t2, x) as the 'history' reliability record, which comes from the statistics of end-users' transactions (2.2); t1 represents the start time and t2 = t1 + t' represents the end time. In a stream system, there are finitely many discrete transactions between (t1, t2). We use y as the number of transactions and xi as an individual transaction between (t1, t2):

Rh(t1, t2, x) = (Σ_{i=0}^{y} R(xi)) / y, where
R(xi) = 1 if transaction xi succeeds, and R(xi) = 0 if transaction xi fails.   (2.3)

For a business service, the reliability should be kept within a manageable range: higher reliability is better, but stability is more important. We use RH to denote the history reliability from the beginning of the system:

RH(t, x) = Rh(0, t, x).   (2.4)

If we consider a service's lifecycle from its last upgrade to its next upgrade (time T), then

R = RH(T, x) = Rh(0, T, x).   (2.5)

R is the average reliability metric for a specified cloud service. Our goal is to keep R(t, x) stable at R in every time slot. We define the time slot m as an average normal transaction completion period, Rtm(t, m, x) as a real-time reliability metric, and z as the number of transactions between (t, t + m). The variance of Rtm(t, m, x) is measured as:

VAR(Rtm(t, m, x)) = Σ_{i=0}^{z} (R(t, ti + m, x) − R)² / (z − 1),   (2.6)

where ti is the time of the i-th transaction within (t, t + m). VAR(Rtm(t, m, x)) is expected to be stable. In our production system, m is set to 30 s, and the expected shock of the reliability curve is less than 20%. Note that the time slot length m may be longer than the average transaction cost, because the time-out is always set longer than the normal completion range in a production system. The number of transactions z is also very important: it is linked to the throughput and utilization of the system. Based on the phase view of services in [5], which divides services into normal, downgrade, and broken phases, we can see that:

• R(t, t + m, x) remains stable in the normal phase.
• In the downgrade phase, R(t, t + m, x) varies with different reliability events and system throughput.
• R(t, t + m, x) equals 0 when t enters the broken phase.
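The windowed metric Rtm(t, m, x) and its variance (2.6) can be monitored with a short script. The sketch below assumes a transaction log of (timestamp, success) pairs; the log is synthetic, and only the slot length m = 30 s follows the production setting quoted above.

```python
from statistics import variance

def slot_reliability(txns, t0, m):
    """Rtm(t0, m, x): success ratio of transactions arriving in [t0, t0 + m)."""
    window = [ok for (ts, ok) in txns if t0 <= ts < t0 + m]
    return sum(window) / len(window) if window else None

def reliability_variance(txns, m=30):
    """Sample variance of the per-slot reliabilities, as in Eq. (2.6)."""
    start = min(ts for ts, _ in txns)
    end = max(ts for ts, _ in txns)
    slots = []
    t0 = start
    while t0 <= end:
        r = slot_reliability(txns, t0, m)
        if r is not None:
            slots.append(r)
        t0 += m
    return variance(slots) if len(slots) > 1 else 0.0

# Synthetic log: (timestamp_seconds, success). A downgrade begins at t = 60,
# after which only every other transaction succeeds.
log = [(t, True) for t in range(0, 60)] + \
      [(t, t % 2 == 0) for t in range(60, 90)]
print(reliability_variance(log, m=30))
```

In the normal phase all slots read 1.0 and the variance stays near zero; the downgrade window pulls one slot down to 0.5, and the jump in variance is exactly the signal the management framework watches for.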


A stream system is a high-volume system with high throughput and low latency. We focus on the following viewpoints for its management:


1. History reliability since the last upgrade, as a reference.
2. Variance of reliability under specified reliability events.
3. Reliability metrics under different throughputs.


Thus, a practitioner can establish an 'AND/OR' gate framework on the cloud to manage reliability. The framework can be transferred into RBDs for dependency analysis. The expected R can be observed by dividing the architecture into its 'AND/OR' gates. We observe the variance VAR(Rtm(t, m, x)) for each component and for the overall service to establish an optimized solution for reliability management. In this paper, we focus on stream service management under this framework using queuing theory [16] and other techniques. Our work not only gives an overview of the literature, but also presents results from real systems of top Chinese banks, which are among the largest transaction-based stream processing systems in the world.



3. Research background and challenges


3.1. Research background


Five top banks in China are considering implementing streams in their ATM/Kiosk management systems. The systems perform preliminary processing on ATM-generated data, which includes transactions, self-check indications, encryption messages, etc. The systems help the data to be processed properly and efficiently in multiple back-end banking systems. The actions include:

1. Message labeling.
2. Format transformation.
3. Transaction management.
4. Message routing.
5. End point management.
6. Pre-analytics of core banking business performance.

We call these systems ATP or ATMP (ATM/Kiosk Processing Systems) in top Chinese banks. We have studied several systems, which are listed in Table 1. In this paper we mainly refer to the experience and analysis in BANK_C. The End Points service receives connections from ATM/Kiosk devices. The ATMP streaming processing system provides low-latency treatment for incoming messages. The treated messages are sent to back-end systems, such as core banking, credit card management and the inter-bank exchange system. Back-end systems return results to the ATMP streaming processing system. The back-end systems (such as core banking) complete the transaction and send back acknowledgements through ATMP to the End Points. The End Points then close the connection and let the ATM/Kiosk dispense cash or return results to the customer. Since more and more mobile-based technologies are embedded in the process, such as withdrawing cash via an




Table 1. Comparison of ATMP stream deployment model and transactions in Chinese top banks.

Bank name    Daily transactions    Processing                         Log processing
BANK_A       280M+                 Distributed application            Oracle
BANK_B       590M+                 Transaction middleware (Tuxedo)    Oracle
BANK_C       205M+                 Message queue (IBM MQ)             Oracle
BANK_D       N/A                   Transaction middleware (CICS)      Sybase
BANK_E       8M+                   J2EE (Weblogic)                    Oracle

Fig. 2. Logical architecture of ATMP.


authentication code on a mobile phone, the amount of processed data can reach a very high volume. As an important part of banking modernization, top banks are consolidating systems on private clouds to achieve high flexibility and high availability for their infrastructure services. The designed private cloud provides fast-provisioning capabilities for components, and the infrastructure also supports fast growth and complex integration of components. After the deployment of ATMP streams on the cloud, BANK_C found the services were not reliable: a single hardware defect could cause the entire service to enter the 'broken' phase.

We used our framework to set up models and analyze the root cause of this challenge. After solving the issue successfully, we applied the model to other top banks.


3.2. System architecture


In BANK_C, the ATMP system was implemented on a private cloud based on IBM PowerVM technology. The system was deployed in a large-scale production environment that can provide 2000+ AIX-based virtual servers. Fig. 2 shows the logical architecture of the stream.






Fig. 3. Deployment topology of ATMP streams.

Fig. 5. Basic RBD for ATMP input stream.

Fig. 4. Reliability downgrade in BANK_C ATMP (request and time-out counts over time for the SUM, BANCS, ICCD and IST streams).

There were 2 streams in the architecture for bi-directional data processing. In the forward stream, data messages were labeled by their transaction type, and some raw data would be integrated into events. In the backward stream, results and indications for the ATM/Kiosk were transformed into machine or UI messages for dispensing cash or returning transactions. Message processing was tracked by a central database, which connected inputs with outputs or failed data to maintain integrity. Two major back-end systems would complete the transactions with other banking services. An encryption service supported the incoming and outgoing edges of the private cloud. To ensure high availability and low latency for the huge data flow and transaction integrity, 4 streams were deployed on the private cloud, and the cloud could also use its elastic capability to allocate more virtual servers with deployed applications in case of failure or unexpected inputs. Nevertheless, BANK_C still experienced reliability downgrades under this highly reliable, elastic architecture. A typical downgrade: when the frequency of transactions hit around 350 per second, one virtual server went down and transactions were blocked for 15 s. The system was designed to allocate additional resources when cloud services degraded. However, the downgrade led to a downgrade phase shorter than the service provisioning lead time, and the reliability curves fell very fast. The services were migrated to new virtual servers. All servers/services were considered available after the server recovery window, but the business service took a much longer window to return to normal. Fig. 4 shows transaction management curves for the service downgrade. The x-axis is the timestamp in seconds; the y-axis represents the number of transactions (in thousands). For simplicity, we show 4 of the 8 series in the graph. The curves represent requests and time-outs for four types of streams. This specific case was called a 'business reliability' fault. The challenge was that although the private cloud used many highly available solutions, the services still failed. The challenges for this top bank are:

1. In the current architecture model, how can the operator predict and manage the risk of the business reliability service?
2. Is there a better model for deploying a high-volume transaction stream in a private cloud?

3.3. Reliability analysis by RBD


Following our previous methodology, we built Reliability Building Block Diagrams (RBD) [15] according to the logical 'AND/OR' gate architecture. Fig. 5 illustrates an application-level RBD for ATMP. The building blocks in the figure are quite simple: LB represents the load balancer component; Code and Req represent different steps in a stream processing transaction; DB represents the message management database; BS1/2 are the back-end systems. There are 4 application stream nodes inside ATMP, and each application stream contains 13 queues. The total number of streams is 42. Furthermore, we divided each node into 3 levels:


1. the business level, which is the 'black-box' view;
2. the cloud level, which represents the virtual machine in the private cloud; and
3. the physical level, which represents a physical node in the private cloud.


The diagrams were then refined into a detailed model representing the deployment architecture; one branch of Fig. 5 can be expressed as in Fig. 6. We can calculate the best R(t, t + m, x) for the system based on the following expression:


R = R_GW × R_GW_OS × R_GW_CONN × R_REPOS,   (3.1)

where


R_REPOS = 1 − (1 − R_RE1 × R_RE1OS × R_RE1CONN) × (1 − R_RE2 × R_RE2OS × R_RE2CONN).   (3.2)
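Eqs. (3.1) and (3.2) are standard series/parallel RBD compositions and can be checked numerically. In the sketch below the component reliabilities are illustrative placeholders, not measured BANK_C values.

```python
def series(*rs):
    """Series (AND) composition: every component must work."""
    out = 1.0
    for r in rs:
        out *= r
    return out

def parallel(*rs):
    """Parallel (OR) composition: fails only if every branch fails."""
    fail = 1.0
    for r in rs:
        fail *= (1.0 - r)
    return 1.0 - fail

# Eq. (3.2): the repository is two redundant branches, each the series of
# its application, OS and connection reliabilities (values are invented).
R_REPOS = parallel(series(0.99, 0.999, 0.995),
                   series(0.99, 0.999, 0.995))

# Eq. (3.1): overall R = gateway x gateway OS x gateway connection x repository.
R = series(0.998, 0.999, 0.999, R_REPOS)
print(round(R, 6))
```

Note how the redundant repository pair (the 'OR' branch) pushes R_REPOS well above either single branch, while every extra series factor pulls the overall R down.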

These formulas link software services and hardware services together. If ATMP stays in a stable period, R will be a constant value; over an unlimited time slot, the average R(0, T, x) should be R. As R can be calculated from known results [8,17], a cloud-based static streaming system should have a constant prediction of its business results, and reliability is the more accurate business measurement. Thus the service downgrade can be expressed more accurately as a reliability downgrade. We do not need to estimate




Fig. 6. Detailed RBD for the multi-layer framework.


the constant R. We only need to focus on how to manage VAR(R) during a reliability downgrade.


3.4. Analysis of reliability state


From an architectural view, reliability is a non-functional requirement factor. In previous work, we noticed that R is linked to transaction cost and throughput. In real cases, the reliability events include:


1. Connections are lost or cannot be established.
2. Connections are refused.
3. Time-outs.
4. Expected errors occur with predefined error codes.
5. Unexpected errors occur with incorrect responses.


Since this paper focuses on the infrastructure part of cloud reliability, we do not consider Event 5. We found that a crisis always happens in a high-volume scenario; a high-volume status means that the transactions at time t will almost hit a new record. For example, when we recorded 2 significant reliability disasters in BANK_C ATMP, both had a burst in transactions (one 30% higher than average, the other 40%), and we also recorded some software or hardware errors. The 30% event was related to a hardware disaster in the private cloud; the other was related to a software bug, which caused longer waiting times for some specific transactions. There was a hardware event at timestamp 1005204500. Before that, some timed-out transactions already existed, which means that the system was running at a high-volume level. After the disaster in the private cloud, the virtual server was migrated to another physical server. However, more time-outs occurred even after the server migration. This problem caused a 54-min delay in the ATMP streaming system, much longer than the 3-min migration expectation. In another case, the virtual server was automatically rebooted in 15 min, but the business recovered to the normal reliability model only after a longer time.


3.5. Reliability description


The total input transactions to N end-points at time t are expressed as X(t). We assume that t is a discrete variable, i.e., t and t + 1 are 2 adjacent inputs of X(t). This is because each transaction lasts for some time waiting for its return, and a discrete value makes the following calculation simpler and more accurate than a continuous variable. We also assume that each end-point can establish up to M transactions at the same time. If there are more than M



inputs to one end-point, the user will receive a refused-connection error since the resources are exhausted. At time t, the total input to S (which represents an ATMP streaming system) will be S(t):


S(t) = X(t), if X(t) < M × N;
S(t) = M × N, if X(t) ≥ M × N.   (3.3)
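Eq. (3.3) is a simple saturation function; a sketch using the M = 150 and N = 4 settings reported later for ATMP:

```python
M, N = 150, 4  # capacity per end-point, number of end-points (ATMP settings)

def accepted(x_t: int) -> int:
    """S(t): transactions actually admitted at time t, per Eq. (3.3)."""
    return min(x_t, M * N)

print(accepted(350))  # 350: below capacity, all admitted
print(accepted(900))  # 600: above M x N = 600, the excess is refused
```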

In real cases, all transactions will be closed by time t + Tm, where Tm is the threshold variable for time-out. For a given transaction x, we can define the execution result from the end-point's view as:




Ex(x, t) = 1, if x gets an expected result;
Ex(x, t) = 0, if x gets an unexpected result.   (3.4)

An expected result means that the system's return is in the expected functional result set; in other words, the software knows the error's meaning by pre-definition. An unexpected result means that the return value comes from the system (infrastructure) error code sets or from an exception-handling process in the input software. At time t, there will be S(t) transactions arriving at the end-points, and the input software will collect the results between t and t + Tm. We define a set X(t) = {x1, x2, . . . , xS(t)}, where each variable represents an individual transaction that arrives at the end-points at time t. Since there are queues and other algorithms inside ATMP, the returns for transaction xi will be distributed between t and t + Tm. Here, we have


R(t, Tm) = 0,          if all transactions are lost or time out;
0 < R(t, Tm) < 1,      if some transactions are lost or time out;
R(t, Tm) = 1,          if no transactions are lost or time out.    (3.5)

In practice, we found that the goal of reliability management is not to achieve R(t, Tm) = 1 without any loss of data. The key is to keep R(t, Tm) as constant as possible. This implies that the AND and OR gates should have a rejuvenation period before a huge loss of R(t, Tm). In BANK_C, R(t, Tm) was actually required to stay above half of the normal transaction volume. As described before, we analyzed the challenge at the business, cloud service (architectural) and physical levels.
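Eq. (3.5) only gives the boundary cases of R(t, Tm). A natural concrete reading, which is our assumption rather than a formula stated in the paper, is the fraction of the S(t) transactions whose result is expected per Eq. (3.4):

```python
# Assumed concrete form of R(t, Tm): the fraction of transactions returning
# an expected result (Ex = 1). The paper only gives the boundary cases.
def reliability(results: list[int]) -> float:
    """R(t, Tm) for the S(t) transactions arriving at time t, where each
    entry is Ex(x, t) from Eq. (3.4): 1 if expected, 0 otherwise."""
    if not results:
        return 1.0  # no arrivals, so nothing is lost or timed out
    return sum(results) / len(results)

assert reliability([1, 1, 1, 1]) == 1.0   # no transactions lost or timed out
assert reliability([1, 0, 1, 0]) == 0.5   # some lost or timed out
assert reliability([0, 0, 0, 0]) == 0.0   # all lost or timed out
```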

Please cite this article as: Y. Liu et al., An empirical study on implementing highly reliable stream computing systems with private cloud, Ad Hoc Networks (2015), http://dx.doi.org/10.1016/j.adhoc.2015.07.009


Fig. 7. Services downgrade records.


3.6. Business services level analysis


One challenge comes from the amount and rate of the input transactions. If we consider ATMP as a "black box", the transaction viewpoint is the business viewpoint. If S(t) exceeds M × N, the upper stream receives a "refused connection" return. In ATMP, M and N were set to 150 and 4. The time-out variable Tm was often much longer than the average time slot of a successful transaction; a transaction returns "time-out" once Tm seconds have elapsed at the end-points. We found that Tm was always set to a very large number. For example, Tm was set to 60 s in ATMP, although a normal transaction completes in 0.6 s. This was a business decision: a customer would wait at most 60 s, to allow the old ATM process to complete over the old networks. Although the networks are now built on dedicated optical lines, the time-out is still set to 60 s. If a transaction occupied the end-points, more transactions would be queued in the upper stream; for example, 30+ transactions would be queued in the ATM controllers. So the total number of queued transactions S(t, Tm) would be


S(t, Tm) = Σ_{i=t}^{t+Tm} S(i)    (3.6)


The time-out value that the queue capacity could support would be:

Tm = (M × N) / max(X(t))    (3.7)

If X(t) reached 333/s, M was set to 150 and Tm was set to 60 s, N would need to be at least 134 servers to queue all transactions in the worst-case scenario. In the real case, N was set to 16, which could support Tm = 7.21 s. If Tm is set to 60 s and X(t) to 333/s, then whenever one transaction connection times out, 60/7.21, i.e. 8–9 transactions, are blocked. This means X(t) drops to about 324/s, which is a downgraded service (Fig. 7). At the business level, if we want to keep R(t, Tm) stable, a proper time-out constant should be set; otherwise, we should allocate enough edge servers at the stream boundary, which would keep R(t, Tm) in a stable condition and avoid business services getting stuck.

Fig. 8. Server utilization estimation for ATMP queue.
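The capacity arithmetic around Eq. (3.7) can be recomputed directly from the figures quoted in the text (X(t) = 333/s, M = 150, Tm = 60 s, N = 16). This is our own sketch of the calculation, not the authors' tooling:

```python
import math

# Sketch of the sizing arithmetic around Eq. (3.7), using the figures in the
# text: X(t) = 333/s, M = 150 slots per end-point, Tm = 60 s, N = 16.
def servers_needed(tm: float, x_max: float, m: float) -> int:
    """Smallest N with Tm <= (M * N) / max(X(t)), i.e. N >= Tm * X / M."""
    return math.ceil(tm * x_max / m)

def sustainable_timeout(m: float, n: int, x_max: float) -> float:
    """The Tm that M * N end-point slots can absorb (Eq. 3.7 solved for Tm)."""
    return m * n / x_max

assert servers_needed(60, 333, 150) == 134       # worst case quoted in the text
tm_eff = sustainable_timeout(150, 16, 333)
assert abs(tm_eff - 7.21) < 0.01                 # N = 16 supports ~7.21 s
assert 8 < 60 / tm_eff < 9                       # ~8-9 transactions blocked
```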


3.7. Cloud services (architecture) level analysis


From an architectural view, the stream system is a queue that receives inputs from edge servers and outputs results to back-end systems. We can simplify ATMP as an M/M/4 queuing system according to queuing theory. We set up a queuing model to illustrate the transactions in a smaller system:



1. assume that the input X(t) follows a Poisson distribution;
2. set X(t) to 30/s;
3. set the processing time to 0.03 s.


Table 2 and Fig. 8 present the reliability analysis using the queuing model. As the number of queue servers decreases, utilization rises and the probability of congestion increases. The results in Fig. 8 and Table 2 show that if one channel went down, the total utilization as well as the probability of congestion would go up. Fewer queue channels lead to a lower reliability estimate when some components have failed.
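The figures in Table 2 can be reproduced with the standard Erlang-C (M/M/c) formulas. The code below is our own sanity check, not code from the paper, using λ = 30/s and mean service time 1/μ = 0.03 s:

```python
from math import factorial

# Standard Erlang-C (M/M/c) formulas -- our own sanity check reproducing
# Table 2 with lambda = 30/s and mean service time 1/mu = 0.03 s.
def mmc_metrics(lam: float, mu: float, c: int):
    """Return (rho, p0, Lq, W, Wq) for an M/M/c queue."""
    r = lam / mu                      # offered load, row 2.3 in Table 2
    rho = r / c                       # per-server utilization, row 2.4
    p0 = 1.0 / (sum(r ** n / factorial(n) for n in range(c))
                + r ** c / (factorial(c) * (1.0 - rho)))
    lq = p0 * r ** c * rho / (factorial(c) * (1.0 - rho) ** 2)
    wq = lq / lam                     # mean wait in the queue, row 2.8
    w = wq + 1.0 / mu                 # mean total wait, row 2.7
    return rho, p0, lq, w, wq

# M/M/1 column: 90% utilization, Lq = 8.10
rho, p0, lq, w, wq = mmc_metrics(30.0, 1 / 0.03, 1)
assert abs(rho - 0.90) < 1e-3 and abs(lq - 8.10) < 1e-3
# M/M/4 column: 22.5% utilization, p0 ~ 0.41
rho, p0, lq, w, wq = mmc_metrics(30.0, 1 / 0.03, 4)
assert abs(rho - 0.225) < 1e-3 and abs(p0 - 0.41) < 0.005
```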


Table 2
Queue theory based estimation for ATMP.

                                                    M/M/1   M/M/2   M/M/3   M/M/4
1. Input parameters
1.1 Arrival rate (λ)                                 30.0    30.0    30.0    30.0
1.2 Mean service time (1/μ)                          0.03    0.03    0.03    0.03
1.3 Number of servers in the system (c)                 1       2       3       4
2. Results
2.1 Mean interarrival time (1/λ)                     0.03    0.03    0.03    0.03
2.2 Service rate (μ)                                33.33   33.33   33.33   33.33
2.3 Average # arrivals in mean service time (r)      0.90    0.90    0.90    0.90
2.4 Server utilization (ρ) (%)                       90.0    45.0    30.0    22.5
2.5 Fraction of time all servers are idle (p0)       0.10    0.38    0.40    0.41
2.6 Mean number of customers in the queue (Lq)       8.10    0.23    0.03    0.00
2.7 Mean wait time (W)                               0.30    0.04    0.03    0.03
2.8 Mean wait time in the queue (Wq)                 0.27    0.01    0.00    0.00

Fig. 9. Loss of queue result diagram.


If one virtual server failed, another image would take over in the private cloud thanks to the server rejuvenation capabilities. Let Ts denote the rejuvenation time for a new server: if the new virtual server is spawned at time t, it provides services at time t + Ts. For simplification, we assume that ATMP provides 4 channels and that the detection and service-online time is 0. During [t, t + Ts], the M/M/4 system degrades to M/M/3 or even worse. According to Table 2 and Fig. 8, the loss of stream processing channels results in higher utilization on the live servers, and the probability of congestion on the live servers increases, lengthening the waiting time. If M/M/4 turned into M/M/3, 1/4 of the existing transactions would be lost, and the input would wait for Tm to return "time-out". If Ts > Tm, all transactions would end with a result of either fail or success; if Ts < Tm, 1/4 of the inputs would be suspended until time-out, as discussed in the previous section.

We noticed that S(t) = M × N if X(t) ≥ M × N. The failure of channels makes all inputs spend more processing time inside ATMP, according to Little's Law [16]. The longer processing time obviously causes transactions to wait longer in the queues inside ATMP. In the worst case, each transaction waits Tm before ending. Then the reliability R between time t and t + Tm would be:





R = [(M × N) × (N − y)/N] / [Σ_{i=t}^{t+Tm} X(i)]    (3.8)

where y represents the number of failed virtual servers. The cloud-based streaming system would fail to provide services for at most 2Tm. The ATMP system demonstrated the utilization bottleneck shown in Fig. 9: when server ATMPGW1 failed (due to a failure in an I/O drawer), ATMPGW3 recorded a burst in utilization. The high utilization caused many time-outs at the business services level because of the loss of queues.
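Eq. (3.8) can be illustrated with a small sketch. The input trace and the cap at 1.0 below are our assumptions; only the formula's symbols come from the text:

```python
# Illustration of Eq. (3.8): reliability over [t, t + Tm] when y of the N
# channels have failed. The input trace and the cap at 1.0 are assumptions.
def reliability_with_failures(m: int, n: int, y: int, inputs: list[int]) -> float:
    """R = (M*N) * (N - y)/N over the total offered input in the window."""
    served_capacity = (m * n) * (n - y) / n   # capacity left with y failed channels
    offered = sum(inputs)                     # sum of X(i) over [t, t + Tm]
    return min(1.0, served_capacity / offered)

# Four channels, M*N = 600 slots, 600 offered transactions in the window:
assert reliability_with_failures(150, 4, 0, [600]) == 1.0   # all channels up
assert reliability_with_failures(150, 4, 1, [600]) == 0.75  # one channel lost
```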


Fig. 10. Comparison of ATMP average response time.


3.8. Physical level analysis

In the previous section, we used an AND/OR diagram to link physical level services to business and cloud services, and we reviewed related work by other researchers on the reliability management of bare-metal servers and storage systems. We found interesting results in ATMP's private cloud. The DB in ATMP represents another type of physical level service. The DB is allocated on physical servers instead of virtual servers to ensure transaction throughput, and is used to label messages between input and output. The ATMP stream is bi-directional and transaction based: if endpoint A issues message A.1, ATMP forwards it to the proper back-end system B in the proper format; system B issues message B.1 as the transaction return for A.1; and ATMP must return B.1 to endpoint A correctly. The DB in ATMP issues a sequence number to label both message A.1 and endpoint A. The sequence number is passed to system B as overhead. This design provides a loosely coupled service to the endpoints. The DB connections were set to 600, which matched the number of endpoint connections. We observed the DB performance under different stressed traffic levels compared with normal traffic, which was around 350 transactions per second (1× TPS). The average response times are shown in Fig. 10, which illustrates that as the load reached 1.7× TPS (around 600), latency inside ATMP was clearly observable. The latency was caused by waiting in the DB. A detailed investigation showed that the waiting time came from the write-through of the storage, an FC-SAN based storage. Because so many messages came in through the 600 DB connections and the table for sequence numbers was too large, the seek time for the storage to locate the previous message's label took longer, and this seek time exhausted the DB write queue. This phenomenon implies that the reliability of a stream should be linked to the internals of the stream's components. We will propose an optimization solution later.

3.9. Summary

In this section, we used the proposed framework to evaluate the reliability challenge in an ATMP system. A cloud environment can provide stream computing with elastic capabilities; however, cloud reliability and bursts in stream data affect user experience. We used several techniques to analyze the root causes of reliability downgrades. In the next section, we discuss different optimization algorithms for implementing stream computing on a private cloud.


4. Reliability optimization


4.1. Our work


Based on our experience in designing several next-generation banking systems, we find that reliability is more suitable than availability for describing the usage of services. However, we also noticed that reliability was not well defined, especially across different deployment patterns. Stream computing is becoming increasingly popular in core banking systems; it is used to route and manage messages from the Internet, smart phones and kiosks/ATMs. We have put a lot of effort into BANK_C ATMP to establish our reliability model. We built an AND/OR pattern for ATMP, where we found that storage played the AND role in almost every layer. We also built RBDs according to the AND/OR diagram to establish a mathematical framework. During this work, we found it very challenging to capture an accurate failure probability for every component, because we could not get exact reliability numbers from the hardware and software vendors. Fortunately, we found that we could use the internal queues to predict downgrades. In a private cloud, we could track every component's queue and performance, which gave us a global view of related reliability events. The RBDs showed that the elastic capabilities of the cloud could avoid sudden drops in reliability through fast distribution and rejuvenation. However, in a high-volume stream system, the period of error detection and rejuvenation could result in time-out blocks. Another problem came from the sharing of physical resources, especially storage and middleware. At the cloud services level, the components performed very well; at the physical level, contention inside storage or middleware write queues may exhaust resources, which would downgrade the reliability of the entire stream system. We propose several optimization solutions in the following sections. Not all solutions have been tested in the real production environment, due to customer choices; some comparisons are shown to prove the effectiveness of our solutions.

Table 3
Matrix of monitoring indexes in cloud stream system.

Components         Monitoring status                     Root cause
Stream software    Persistent storage full               Queue management error or reaches max capabilities
                   Exceeds stream capabilities
                   Listener reaching max capabilities    Application listeners stop responding
Virtual machines   VM CPU UT > 90%                       VM OS stops responding
                   VM storage I/O > 90%
                   VM network I/O > 90%
                   Hypervisor reporting 'wait'           Hypervisor stops responding
Virtual networks   Networks UT > 70%                     Virtual networks stop responding

4.2. Business level optimization

We have shown that a long time-out constant causes congestion inside the cloud system. If the time-out value Tm were set according to (3.7), only a small number of requests would receive "refused connection" errors. For example, BANK_C ATMP set the time-out to 60 s according to a rule of thumb. The maximum input was recorded as 333 TPS (transactions per second), with N set to 4 and M set to 150. When the stream stopped due to cloud issues, the system would take a while for rejuvenation, and the connections would stay in "wait" status for 1.8 s. Since the time-out value was 60 s, there would in total be 333 × (60 − 1.8) = 19,381 refused transactions, and the reliability R(t) would be 0 in the worst case. When we set the time-out value to 2 s, there would be only 67 refused transactions. If the rejuvenation took less than 2 s, the system would go back to normal. The new reliability could be 99.7%, a significant improvement. Fig. 11 illustrates how the time-out value affects the number of connections. In this figure, the x-axis represents the different preset time-out values, and the y-axis represents the real connections. The curve shows the records of connections during software rejuvenations; the result shows that Tm = 10 was the checkpoint. Another optimization approach was to disable the "hard" time-out value in the whole architecture: a monitoring component inside the stream could terminate the transactions inside ATMP to release resources whenever a connection at the end-points had elapsed more than 10 s.

Fig. 11. Management of time-out value in ATMP.

4.3. Cloud level optimization

The elastic capabilities helped the streams on the cloud deal with high-velocity transactions. However, the spawning of new cloud service instances would cause a downgrade of the reliability level. A straightforward remedy was to set up "physical channels" instead of the "shuffle networks" in Fig. 3. We set up two channels after the load balancer, and transactions were distributed to the two channels as usual. When one channel's reliability level dropped for the reasons mentioned in the previous section, the other channel would not experience service downgrade. Fig. 12 demonstrates the refined architectural model for ATMP: two new physical channels are deployed, the DBs are separated, and snapshots of the DBs are also deployed to maintain bi-directional stream management.

4.4. Physical level optimization


As shown in Table 3, we also identified some indices for system-level monitoring. We noticed that the queues could indicate future failure possibilities: even with no hardware error, a full write queue inside the stream system would block the streams. An arbitration test case system was deployed to collect SNMP data from existing systems for network equipment, load balancers, and cloud management software. The test was deployed in a test environment. A simple mobile application had four servers in the cloud to serve 400 concurrent user requests, each request being served for 5 s. If 800 users were to access the application service simultaneously, and the mobile part set the time-out value to 5 s, then there would be only 400 successful business responses out of 800 requests. When one server detects that it has used up its 100 TCP ports, it can ask the cloud for one more server. If the cloud provisioning time and application start time are less than 5 s, the 800 requests could get 800 responses in the end, and the reliability is improved. The process could be described as:

Step 1: Detect reliability events.
Step 2: Predict the reliability value during or after the events.
Step 3: Take reliability actions by:
    3.1: service rejuvenation,
    3.2: providing an alternative service route, or
    3.3: closing the service for upgrade.
Step 4: Evaluate the actions.
Step 5: Record the reliability results, including:
    5.1: start/end time,
    5.2: the total requests,
    5.3: the total successful transactions,
    5.4: the reliability events, and
    5.5: the reliability downgrade rate of the events.

Fig. 12. Refined topology of ATMP deployment model.

Fig. 13. Improved results in transactions.

5. Result

We have proposed all solutions to BANK_C. Like other major banks in China, the IT department in BANK_C is divided into a software development center, an infrastructure center and a test center. The cloud level solution was considered to have a dedicated infrastructure focal point and less project complexity, so it was adopted immediately. Fig. 13 shows the result of the two-channel deployment. We recorded the same hardware error in the private cloud at time 110112100. The number of transactions fell rapidly. However, unlike in Fig. 4, the stream system kept performing at around 60% of peak capacity, and few time-out records were found in the back-end system. Fig. 13 shows a single back-end system for data privacy; in our statistics, other systems performed even better, as their transaction numbers were lower than the displayed one. The total capability remained around 80% of the total peak performance. The result demonstrated the value of our cloud level solution. Due to project complexity, the other two solutions are still under discussion. The test case in the test environment demonstrated improvement, especially in time-out management.


5.1. Conclusion and future work


Stream systems are designed to deal with high-velocity data. Many users, such as top Chinese banks, deploy streaming systems over a private cloud. Although cloud technology provides a robust way to support such systems through its elastic capabilities, there are still user experience issues. We used a reliability metric to replace the traditional availability metric. Theory, algorithms and architecture were successfully applied to improve reliability in top Chinese banks' streaming systems. We tested the algorithms in a small simulation environment to prove their capabilities, and one of our proposed solutions was deployed in a real system, where it significantly reduced reliability variations. In future work, we will continue to test Internet of Things environments on the cloud to optimize the whole framework, and the study of cloud reliability will be extended from private clouds to hybrid clouds with feeds from the Internet.



Acknowledgments



This paper comes from many real system optimization projects, and many workshops were held to discuss the solutions. Mr. Jiang Zhu, Ms. He Sang, Mr. Yuhang Xia and Mr. Zhan Zhang from the BANK_C project team contributed to the system data collection and analysis. Special thanks go to Dr. Yan Li from facebook.com, who reviewed the paper and improved the writing.


References



[1] H. Andrade, B. Gedik, D. Turaga, Fundamentals of Stream Processing: Application Design, Systems, and Analytics, Cambridge University Press, 2014.
[2] B.R. Prasad, S. Agarwal, Handling big data stream analytics using SAMOA framework - a practical experience, Int. J. Database Theor. Appl. 4 (2014).
[3] P. Patel, A. Ranabahu, A. Sheth, Service Level Agreement in Cloud Computing, http://knoesis.wright.edu/library/download/OOPSLA_cloud_wsla_v3.pdf, 2009.
[4] Y. Liu, W. Liu, L. Liu, F. Wang, An infrastructure framework for deploying enterprise private cloud, in: Proceedings of the IEEE International Conference on Services Computing (SCC), 2013, pp. 502–510.
[5] E. Bauer, R. Adams, Reliability and Availability of Cloud Computing, John Wiley & Sons, 2012.
[6] Y.-S. Dai, B. Yang, J. Dongarra, G. Zhang, Cloud service reliability: modeling and analysis, in: Proceedings of the 15th IEEE Pacific Rim International Symposium on Dependable Computing, 2009.
[7] W. Kim, S.D. Kim, E. Lee, S. Lee, Adoption issues for cloud, in: Proceedings of the 7th International Conference on Advances in Mobile Computing and Multimedia, 2009, pp. 2–5.
[8] K.V. Vishwanath, N. Nagappan, Characterizing cloud computing hardware reliability, in: Proceedings of the 1st ACM Symposium on Cloud Computing, ACM, 2010, pp. 193–204.
[10] L. Neumeyer, B. Robbins, A. Nair, A. Kesari, S4: distributed stream computing platform, in: Proceedings of the IEEE International Conference on Data Mining (ICDM), 2010.
[11] Storm, https://github.com/nathanmarz/storm/wiki (online).
[12] B.R. Prasad, S. Agarwal, Handling big data stream analytics using SAMOA framework - a practical experience, Int. J. Database Theor. Appl. 4 (2014) 7.
[13] P. Bellavista, A. Corradi, S. Kotoulas, A. Reale, Adaptive fault-tolerance for dynamic resource provisioning in distributed stream processing systems, in: Proceedings of the International Conference on Extending Database Technology (EDBT), 2014, pp. 85–96.
[14] Z. Qian, Y. He, Z. Wu, H. Zhu, T. Zhang, L. Zhou, TimeStream: reliable stream computation in the cloud, in: Proceedings of the 8th ACM European Conference on Computer Systems, ACM, 2013.
[15] S. Distefano, A. Puliafito, Dependability evaluation with dynamic reliability block diagrams and dynamic fault trees, IEEE Trans. Dependable Secur. Comput. 6 (1) (2009) 4–17.
[16] J.D. Little, S.C. Graves, Little's Law, Springer, US, 2008, pp. 81–100.
[17] J.D. Musa, A. Iannino, K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill, Inc., 1987.
[18] T. Thein, S.-D. Chi, J.S. Park, Availability modeling and analysis on virtualized clustering with rejuvenation, Int. J. Comput. Sci. Netw. Secur. 8 (9) (2008) 72–80.
[19] J. Che, T. Zhang, W. Lin, H. Xi, A Markov chain-based availability model of virtual cluster nodes, in: Proceedings of the Seventh International Conference on Computational Intelligence and Security, 2011.
[20] D. Bruneo, S. Distefano, F. Longo, A. Puliafito, M. Scarpa, Workload-based software rejuvenation in cloud systems, IEEE Trans. Comput. 62 (6) (2013) 1072–1085.
[21] K. Vaidyanathan, K.S. Trivedi, A comprehensive model for software rejuvenation, IEEE Trans. Dependable Secur. Comput. 2 (2) (2005) 124–137.


Yaxiao Liu is a Ph.D. candidate in the Department of Computer Science & Technology, Tsinghua University, China. Mr. Liu's research focuses on distributed system architecture, cloud computing and big data processing. Yaxiao Liu served in IBM Global Technology Services as chief architect for 15 years; during his time at IBM, Mr. Liu worked with major banks in China, telecommunications companies and government. Yaxiao received his master's degree in computer science in 1999 and his bachelor's degree in computer science in 1997, both from the Department of Computer Science & Technology, Tsinghua University.


Weidong Liu is a professor in the Department of Computer Science & Technology, Tsinghua University, China. Weidong's research interests include the modeling and architecture of distributed information systems, and the theories and applications of wireless sensor networks. In the area of distributed information systems, Weidong has modeled the distributed information system and proposed a performance analysis based on Petri nets. He has also designed its architecture, function modules and algorithms for data exchange, data privacy, etc. Based on these, his research group designed and implemented the National College and University Enrolling System (NACUES) in 2000. NACUES successfully computerized the whole enrollment procedure, so that the staff of admission offices on their own campuses could get all materials via the Internet from their respective Provincial Admission Centers. By 2009, NACUES had been adopted by 30 Provincial Admission Centers and all nation-wide universities and colleges (over 3000) to serve the admission process; nearly 35 million students have benefited from this system and been enrolled by their universities. In the area of wireless sensor networks, Weidong focuses on data transmission protocols and the applications of WSNs under water. Weidong has several publications in ad hoc networks, big data and cloud computing. Mr. Liu received his Ph.D. in computer science from Tsinghua University in 2006. He received his master's degree in 1994 and bachelor's degree in 1990 from the Department of Computer Science & Technology, Tsinghua University.



Jiaxing Song received the B.S., M.S. and Ph.D. degrees in computer science and technology from Tsinghua University, China. His research interests include computer networks, distributed systems and cloud computing. He is now an associate professor in the Network Technology Institute of the Department of Computer Science and Technology, Tsinghua University, China.

Huan He is now with China Auto Rental as an IT architect. Before joining CAR, she worked as an IT Specialist in the Global Technology Services of IBM. She obtained an M.Eng. degree in Computer Science from Tsinghua University in July 2011 and a B.Eng. degree in Computer Science from Beijing University of Posts and Telecommunications in July 2008. Her research interests include network economics, big data, and stream computing.
