SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters


Accepted Manuscript

SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters

Rong Gu, Xiaoliang Yang, Jinshuang Yan, Yuanhao Sun, Bing Wang, Chunfeng Yuan, Yihua Huang

PII: S0743-7315(13)00214-1
DOI: http://dx.doi.org/10.1016/j.jpdc.2013.10.003
Reference: YJPDC 3243

To appear in: J. Parallel Distrib. Comput.

Received date: 9 February 2013
Revised date: 4 October 2013
Accepted date: 26 October 2013

Please cite this article as: R. Gu, X. Yang, J. Yan, Y. Sun, B. Wang, C. Yuan, Y. Huang, SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters, J. Parallel Distrib. Comput. (2013), http://dx.doi.org/10.1016/j.jpdc.2013.10.003

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

HIGHLIGHTS
1. Analyzed and identified two critical limitations of the MapReduce execution mechanism
2. Achieved the first optimization by implementing new job setup/cleanup tasks
3. Replaced heartbeat with an instant messaging mechanism to speed up task scheduling
4. Conducted comprehensive benchmarks to evaluate stable performance improvements
5. Passed a production test and integrated our work into Intel Distributed Hadoop


SHadoop: Improving MapReduce Performance by Optimizing Job Execution Mechanism in Hadoop Clusters

Rong Gu†, Xiaoliang Yang†, Jinshuang Yan†, Yuanhao Sun‡, Bing Wang‡, Chunfeng Yuan†, and Yihua Huang†

†National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Nanjing, China, 210023
‡Intel Asia-Pacific Research and Development Ltd, 880 ZiXing Road, Zizhu Science Park, Shanghai, China, 200241

†{gurongwalker, yangxiaoliang2006}@gmail.com
†{cfyuan, yhuang}@nju.edu.cn
‡{yuanhao.sun, bin.c.wang}@intel.com

†Corresponding Author: Yihua Huang, Phone: 01186-25-8968-6517, Mailing Address: Dept. of Computer Science and Technology, Nanjing University, 163 Xianlin Road, Nanjing, China, 210023

Abstract

As a widely-used parallel computing framework for big data processing today, the Hadoop MapReduce framework puts more emphasis on high throughput of data than on low latency of job execution. However, more and more big data applications developed with MapReduce now require quick response times. As a result, improving the performance of MapReduce jobs, especially short jobs, is of great significance in practice and has attracted more and more attention from both academia and industry. Many efforts have been made to improve the performance of Hadoop at the job scheduling or job parameter optimization level. In this paper, we explore an approach to improve the performance of the Hadoop MapReduce framework by optimizing the job and task execution mechanism. First, by analyzing the job and task execution mechanism in the MapReduce framework, we reveal two critical limitations to job execution performance. Then we propose two major optimizations to the MapReduce job and task execution mechanisms: first, we optimize the setup and cleanup tasks of a MapReduce job to reduce the time cost during the initialization and termination stages of the job; second, instead of adopting the loose heartbeat-based communication mechanism to transmit all messages between the JobTracker and TaskTrackers, we introduce an instant messaging communication mechanism to accelerate performance-sensitive task scheduling and execution. Finally, we implement SHadoop, an optimized and fully compatible version of Hadoop that aims at shortening the execution time of MapReduce jobs, especially short jobs. Experimental results show that, compared with the standard Hadoop, SHadoop achieves a stable performance improvement of around 25% on average for comprehensive benchmarks without losing scalability and speedup. Our optimization work has passed a production-level test at Intel and has been integrated into the Intel Distributed Hadoop (IDH). To the best of our knowledge, this work is the first effort that explores optimizing the execution mechanism inside the map/reduce tasks of a job. The advantage is that it can complement job scheduling optimizations to further improve job execution performance.

Key words Parallel computing; MapReduce; performance optimization; distributed processing; cloud computing.

1 Introduction

The MapReduce parallel computing framework [1], proposed by Google in 2004, has become an effective and attractive solution for big data processing problems. Through simple programming interfaces with two functions, map and reduce, MapReduce significantly simplifies the design and implementation of many data-intensive applications in the real world. Moreover, MapReduce offers other benefits, including load balancing, elastic scalability, and fault tolerance, which make it a widely adopted parallel computing framework. Hadoop [2], an open-source implementation of MapReduce, has been widely used in industry and researched in academia.

Both Google's MapReduce framework and Hadoop have been widely recognized for their high throughput, elastic scalability, and fault tolerance. They focus more on these features than on job execution efficiency. This results in relatively poor performance when using Hadoop MapReduce to execute jobs, especially short jobs. The term 'short job' has already been used in some related work [3, 4]. There is currently no quantitative definition of a short job. Usually the term refers to MapReduce jobs with execution times ranging from seconds to a few minutes, as opposed to long MapReduce jobs that take hours. Facebook calls this type of job a 'small job' in Corona [5], its recently released optimized version of Hadoop. Some studies show that short jobs make up a large portion of MapReduce jobs [1, 6]. For example, the average execution time of MapReduce jobs at Google in September 2007 was 395 seconds [1]. Response time is most important for short jobs in scenarios where users need the answer quickly, such as queries or analysis on log data for debugging, monitoring and business intelligence [3]. In a pay-by-the-time environment like EC2, improving MapReduce performance means saving monetary costs. Optimizing MapReduce's execution time can also prevent jobs from occupying system resources for too long, which is good for a cluster's health [4].
Today there are a number of high-level query and data-analysis systems that provide services on top of MapReduce, such as Google's Sawzall [7], Facebook's Hive [8] and Yahoo!'s Pig [9]. These systems execute users' requests by converting SQL-like queries into a series of MapReduce jobs that are usually short. These high-level declarative languages greatly simplify the task of developing applications in MapReduce without hand-coded MapReduce programs [10]. Thus, in practice, these systems play a more important role than hand-coded MapReduce programs. For example, more than 95% of Hadoop jobs at Facebook are not hand-coded but generated by Hive, and more than 90% of MapReduce jobs at Yahoo! are generated by Pig [11]. These systems are very sensitive to the execution time of the underlying short MapReduce jobs. Therefore, reducing the execution time of MapReduce jobs is very important to these widely-used systems.

For the above reasons, in this paper we concentrate on improving the execution performance of short MapReduce jobs. Having studied the Hadoop MapReduce framework in great detail, we focus on the internal execution mechanisms of an individual job and the tasks inside a job. Through in-depth analysis, we reveal two critical issues that limit the performance of MapReduce jobs. To address these issues, we design and implement SHadoop, an optimized version of Hadoop that is fully compatible with the standard Hadoop. Rather than improving performance at the job scheduling or job parameter optimization level, we optimize the underlying execution mechanism of each task inside a job. In the implementation, first, we optimize the setup and cleanup tasks, two special tasks run when executing a MapReduce job, to reduce the time cost during the initialization and termination stages of the job; second, we add an instant messaging communication mechanism to the standard Hadoop for fast delivery of the performance-sensitive task scheduling and execution messages between the JobTracker and TaskTrackers. This way the tasks of a job can be scheduled and executed instantly without heartbeat delay. As a consequence, the job execution process becomes more compact and the utilization of the slots on the TaskTrackers is much improved. Experimental results show that SHadoop outperforms the standard Hadoop and achieves stable performance improvements of around 25% on average for comprehensive benchmarks. Our optimization work has passed a production-level test at Intel and has been integrated into the Intel Distributed Hadoop [12]. To the best of our knowledge, this work is the first effort that explores optimizing the execution mechanism inside map/reduce tasks. The advantage is that it can complement job scheduling optimization work to further improve job execution performance.

The rest of this paper is organized as follows: Section 2 introduces related work on MapReduce performance optimization and compares it with SHadoop. Section 3 analyzes the job/task execution mechanism in the standard Hadoop MapReduce. Based on this, Section 4 describes our optimization methods for improving job execution efficiency in the standard Hadoop MapReduce. Section 5 discusses experiments and performance evaluations of our optimization work. Finally, we conclude this paper in Section 6.

2 Related Work Analysis

Many studies have been done to improve the performance of the Hadoop MapReduce framework at different levels or from different aspects. They fall into several categories. The first focuses on designing scheduling algorithms to optimize the execution order of jobs or tasks more intelligently [3, 16, 17, 18, 19, 20, 21, 22]. The second explores how to improve the efficiency of MapReduce with the aid of special hardware or supporting software [4, 23, 24]. The third conducts specialized performance optimizations for a particular type of MapReduce application [25, 26, 27]. Some researchers also explore optimizing job configuration settings or parameters to improve execution performance [28].

Many researchers have shown interest in optimizing the scheduling policies in Hadoop. In 2009, Zaharia [3] proposed a task scheduling algorithm called LATE (Longest Approximate Time to End), which executes speculative tasks to improve Hadoop's performance on heterogeneous clusters. To further improve the overall performance of Hadoop clusters, Hsin-Han [16] proposed a new scheduler named Load-Aware scheduler to address the problem resulting from the phenomenon of dynamic loading. A scheduler that is aware of the different types of jobs running on the cluster was designed by Radheshyam [17] for the same goal. Mohammad [18] proposed the Locality-Aware Reduce Task Scheduler (LARTS), another practical strategy for improving MapReduce performance. Similar work can be found in [19, 20, 21, 22]. These studies improve Hadoop MapReduce performance by making job and task scheduling intelligent or adaptive to different running circumstances. In contrast, our optimization work focuses on the underlying job and task execution mechanism to reduce the execution time of each individual job and its tasks. Little work has been done at this level so far, and one advantage of our work is that it can complement the job scheduling optimizations above to further improve the performance of the Hadoop MapReduce framework.

To achieve higher execution efficiency on the Hadoop MapReduce framework, researchers have tried to adopt special hardware accelerators or supporting software for performance enhancement. Yolanda [23] and Miaond [24] explored improving Hadoop MapReduce with Cell BE processors and GPU acceleration respectively. A software method to improve the performance of MapReduce is to use a distributed memory cache [4], in which an existing open-source tool, Memcached [29], is adopted to provide high-performance, distributed memory caching capability. Compared with these studies, SHadoop is easier to put into practical use because we only modify the source code of the standard Hadoop, avoiding any special hardware or supporting software and ensuring full compatibility with existing MapReduce programs and applications, including configurations.

Some other researchers focus on reducing the execution time of a particular type of MapReduce job. For one-pass analysis applications, Boduo [25] proposed a Hadoop-based prototype using a new frequent-key-based technique for performance enhancement. Ideas proposed in [26, 27] can help to improve the execution efficiency of MapReduce jobs with heavy workloads in the shuffle and reduce phases. As each of these specialized optimizations pertains only to a certain type of application, they lack general applicability. Our optimization is a generalized approach to improving the performance of MapReduce jobs.

3 In-depth Analysis of MapReduce Job Execution Process

In this section, we first give a brief introduction to the Hadoop MapReduce framework. Then we perform an in-depth analysis of the underlying execution mechanism and process of a MapReduce job and its tasks in Hadoop.

The Hadoop MapReduce framework, which is deployed on top of HDFS, consists of a JobTracker running on the master node and many TaskTrackers running on slave nodes. “Job” and “Task” are two important concepts in the MapReduce architecture. Usually, a MapReduce job contains a set of independent tasks. As a core component of the MapReduce framework, the JobTracker is responsible for scheduling and monitoring all the tasks of a MapReduce job. Tasks are assigned to the TaskTrackers, on which the map and reduce functions implemented by users are executed. When receiving a job, the MapReduce framework divides the input data of the job into several independent data splits. Then, each data split is assigned to one map task, which is distributed to a TaskTracker for processing according to data locality. Multiple map tasks can run simultaneously on the TaskTrackers, and their outputs are sorted by the framework and then fetched by reduce tasks for further processing. During the whole execution process of a job, the JobTracker monitors the execution of each task, reassigns failed tasks and alters the state of the job in each phase.

[Figure 1 (state diagram): from START, a job passes through the PREPARE phase (PREPARE.INITIALIZING, PREPARE.INITIALIZED, PREPARE.SETUP), the RUNNING phase (RUNNING.WAIT_TO_RUN, RUNNING.RUNNING_TASKS, RUNNING.SUC_WAIT) and the FINISHED phase (CLEANUP), ending in SUCCEEDED, KILLED or FAILED.]

FIGURE 1: The state transition of a job during its execution.
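The transitions in Figure 1 can be sketched as a small lookup table. This is only an illustrative reading of the figure, not code from Hadoop: the state and event names are taken from the figure labels, while the `next_state` helper and the choice to route every failure event directly to FAILED are assumptions made for the sketch.

```python
# Illustrative sketch of the job state transitions in Figure 1.
# State and event names follow the figure labels; the helper itself is hypothetical.
TRANSITIONS = {
    ("START", "initialization"): "PREPARE.INITIALIZING",
    ("PREPARE.INITIALIZING", "initialization done"): "PREPARE.INITIALIZED",
    ("PREPARE.INITIALIZING", "initialization failed"): "FAILED",
    ("PREPARE.INITIALIZED", "launch job setup task"): "PREPARE.SETUP",
    ("PREPARE.SETUP", "job setup task completed"): "RUNNING.WAIT_TO_RUN",
    ("PREPARE.SETUP", "job setup task failed"): "FAILED",
    ("RUNNING.WAIT_TO_RUN", "launch map/reduce tasks"): "RUNNING.RUNNING_TASKS",
    ("RUNNING.RUNNING_TASKS", "map/reduce tasks done"): "RUNNING.SUC_WAIT",
    ("RUNNING.RUNNING_TASKS", "tasks execution failed"): "FAILED",
    ("RUNNING.SUC_WAIT", "launch job cleanup task"): "CLEANUP",
    ("CLEANUP", "job cleanup task completed"): "SUCCEEDED",
    ("CLEANUP", "cleanup task failed"): "FAILED",
}


def next_state(state, event):
    """Return the state reached from `state` on `event`."""
    # Per the text, a job can be killed in any PREPARE or RUNNING state.
    if event == "kill" and state.startswith(("PREPARE.", "RUNNING.")):
        return "KILLED"
    return TRANSITIONS[(state, event)]


def run(events, state="START"):
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = next_state(state, event)
    return state


HAPPY_PATH = [
    "initialization",
    "initialization done",
    "launch job setup task",
    "job setup task completed",
    "launch map/reduce tasks",
    "map/reduce tasks done",
    "launch job cleanup task",
    "job cleanup task completed",
]
```

Feeding `HAPPY_PATH` to `run` walks a job from START to SUCCEEDED; replacing any event after initialization with `"kill"` while the job is still in the PREPARE or RUNNING phase ends it in KILLED.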

To prepare for the optimization work described in the next section, we first present the execution state transition of a job in the MapReduce framework and then analyze the execution process of a task. The execution state transition of a job is illustrated in Figure 1. Generally, the execution process can be divided into three sequential phases: PREPARE, RUNNING and FINISHED. When a job is submitted to a Hadoop MapReduce cluster, the execution process works as follows:

1) PREPARE phase: A job begins its journey from the START state. It enters the PREPARE.INITIALIZING state to initialize itself, which includes reading the input data split information from HDFS and generating the corresponding map and reduce tasks on the JobTracker. After that, a special task called the “setup task” is scheduled to a TaskTracker to set up the job execution environment. At this point, the job reaches the PREPARE.SETUP state. When the setup task finishes successfully, the job enters the RUNNING phase.

2) RUNNING phase: In this phase, the job starts from the RUNNING.WAIT_TO_RUN state. In this state, the job waits to be scheduled for execution by the MapReduce framework. When one of its tasks has been scheduled to a TaskTracker for execution, the job enters the RUNNING.RUNNING_TASKS state to execute all its map/reduce tasks. Once all the map and reduce tasks are completed successfully, the job moves to the RUNNING.SUC_WAIT state.

3) FINISHED phase: In this phase, another special task called the “cleanup task” is scheduled to a TaskTracker to clean up the running environment of the job. After the cleanup task is done, the job finally arrives at the SUCCEEDED state; in other words, the job has finished successfully.

In any state of the PREPARE and RUNNING phases, a job can be killed by the client and end in the KILLED state, or go into the FAILED state due to various failures.

According to Figure 1, when a job is initialized, many map/reduce tasks of the job are created. These tasks wait to be scheduled to the TaskTrackers for execution. Figure 2 shows the timeline of how a task is processed.

[Figure 2 (sequence diagram): time axes for the JobTracker, the TaskTracker and the Child JVM; steps (1) to (8) move the task through the UNASSIGNED, RUNNING, COMMIT_PENDING and SUCCEED states, with each exchange between the TaskTracker and the JobTracker carried by heartbeat communication.]

FIGURE 2: The execution process of a task with respect to time. The vertical lines represent time axes and each arrow line stands for one communication between entities. The child JVM resides on the slave node together with the TaskTracker. Tasks are executed on different child JVMs independently. The “TaskInProgress” in the JobTracker and the “TaskTracker.TaskInProgress” in the TaskTracker are two important runtime instances.

Generally, the processing workflow consists of 8 steps.

1) When the tasks are created, the JobTracker generates a “TaskInProgress” instance for each task. At this time, the tasks are still in the UNASSIGNED state.

2) Each TaskTracker sends a heartbeat to the JobTracker to request tasks. In response, the JobTracker allocates one or several tasks to each TaskTracker. This information exchange is done through the first round of heartbeat communication. The interval between two heartbeat messages is at least 3 seconds by default.

3) After receiving a task, the TaskTracker performs the following work: creating a “TaskTracker.TaskInProgress” instance, launching an independent child JVM to execute the task, and then changing the task state on the TaskTracker to RUNNING.

4) Each TaskTracker reports the information of its task to the JobTracker, and the JobTracker updates the task state to RUNNING. This is done through the second round of heartbeat communication.

5) After a while, the task is completed in the child JVM. Then, the TaskTracker changes the task state to COMMIT_PENDING, a state in which the task waits for the JobTracker's approval to commit.

6) This state change message is forwarded to the JobTracker by the TaskTracker through the next round of heartbeat communication. In response, the JobTracker changes the task state to COMMIT_PENDING and allows the TaskTracker to commit the task results.

7) On receiving the JobTracker's approval, the TaskTracker commits the task execution results and then changes the task state to SUCCEEDED.

8) After that, the TaskTracker reports the SUCCEEDED state to the JobTracker through the next heartbeat. Then the JobTracker changes the task state to SUCCEEDED. At this point, the execution of the task is completed.
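To make the cost of these heartbeat rounds concrete, the eight steps can be approximated with a toy timeline model. This is a minimal sketch under simplifying assumptions that are not from the paper: heartbeats fire at fixed multiples of the interval, network latency and child JVM start-up time are zero, and the RUNNING report in step 4 does not hold up execution. The function names are hypothetical.

```python
import math


def next_heartbeat(t, interval=3.0):
    """Time of the first heartbeat at or after t (heartbeats at 0, interval, 2*interval, ...)."""
    return math.ceil(t / interval) * interval


def task_finish_reported_at(slot_free_at, run_s, commit_s, interval=3.0):
    """Time at which the JobTracker learns that the task has succeeded (toy model)."""
    assigned_at = next_heartbeat(slot_free_at, interval)      # step 2: request and assignment
    done_at = assigned_at + run_s                             # steps 3-5: task runs in the child JVM
    commit_approved_at = next_heartbeat(done_at, interval)    # step 6: COMMIT_PENDING reported, approval returned
    committed_at = commit_approved_at + commit_s              # step 7: TaskTracker commits the results
    return next_heartbeat(committed_at, interval)             # step 8: SUCCEEDED reported


# A slot becomes free at t = 1 s; the task needs 10 s to run and 0.5 s to commit.
reported_at = task_finish_reported_at(slot_free_at=1.0, run_s=10.0, commit_s=0.5)
useful_work_s = 10.0 + 0.5
waiting_s = reported_at - 1.0 - useful_work_s
print(reported_at, waiting_s)  # 18.0 6.5
```

In this example the task does 10.5 seconds of useful work, but 17 seconds pass between the slot becoming free and the JobTracker seeing the SUCCEEDED state; the remaining 6.5 seconds are spent waiting for heartbeats.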

4 Optimization of MapReduce Job and Task Execution Mechanisms

Based on the above analysis of the execution mechanisms of a MapReduce job and its tasks, in this section we reveal two critical limitations to job execution performance in the standard Hadoop MapReduce framework. Then we present our optimization work to address these issues in more detail. The optimizations made in SHadoop aim at reducing the internal execution time of individual MapReduce jobs, especially short jobs, by optimizing the job and task execution mechanisms to improve the utilization of the TaskTracker slots.

4.1 Optimizing Setup/Cleanup Tasks in MapReduce Job

As shown by the state transition process of a job in Figure 1, before the map/reduce tasks of a job are scheduled, a setup task must be scheduled and executed first. In brief, the setup task is processed as follows:

1) Launch job setup task: After the job is initialized, the JobTracker needs to wait for a heartbeat message from a TaskTracker reporting that it has a free map/reduce slot ready to receive and execute a new task. The JobTracker then schedules the setup task to this TaskTracker.

2) Job setup task completed: The TaskTracker processes the task and keeps reporting the state of this task to the JobTracker through periodic heartbeat messages until the task is completed.

The two steps described above usually take two rounds of heartbeat communication (at least 6 seconds, as the default heartbeat interval of Hadoop is 3 seconds). Similarly, after all map/reduce tasks are completed successfully, a cleanup task must be scheduled to run on a TaskTracker before the job really ends. This needs another two rounds of heartbeat communication, which means another 6 seconds. Thus the setup and cleanup tasks take at least 12 seconds in total. For a short job that runs for only a couple of minutes, these two special tasks may take around 10% or even more of its total execution time. Removing the fixed time cost of these 4 rounds of heartbeat communication would therefore be a noticeable performance improvement for a short job.

By taking a closer look at the implementation of the setup and cleanup tasks in the standard Hadoop MapReduce, we observe that the job setup task running on a TaskTracker simply creates a temporary directory for temporary output data during job execution, and the only thing the job cleanup task does is delete this temporary directory. These two operations are very lightweight and their actual time cost is very small. Hence, instead of sending messages to a TaskTracker to launch the job setup/cleanup task based on periodic heartbeats, we execute the job setup/cleanup task immediately on the JobTracker side. That is, when the JobTracker initializes a job, the setup task of the job is executed once, immediately, on the JobTracker. After all map/reduce tasks of the job are completed, the cleanup task of the job is likewise executed once, immediately, on the JobTracker. With this optimization, the job avoids the 4 heartbeat intervals used in the standard Hadoop for processing the setup and cleanup tasks.
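The overhead estimate above is simple arithmetic, reproduced here as a short sketch. The 120-second job duration is an assumed example of "a couple of minutes", and the function names are hypothetical.

```python
def setup_cleanup_overhead_s(heartbeat_interval_s=3, rounds_per_task=2):
    """Minimum heartbeat wait caused by the setup and cleanup tasks, in seconds."""
    special_tasks = 2  # one setup task and one cleanup task per job
    return special_tasks * rounds_per_task * heartbeat_interval_s


def overhead_share(job_duration_s, heartbeat_interval_s=3, rounds_per_task=2):
    """Fraction of the job's execution time spent on that heartbeat wait."""
    return setup_cleanup_overhead_s(heartbeat_interval_s, rounds_per_task) / job_duration_s


print(setup_cleanup_overhead_s())  # 12
print(overhead_share(120))         # 0.1
```

With the default 3-second interval, the four heartbeat rounds cost at least 12 seconds, which is 10% of a 120-second job and a larger share of anything shorter.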
Because the setup and cleanup tasks are simple operations, and each job executes them only once no matter how many map/reduce tasks it has, running many jobs would not overwhelm the master node in a Hadoop cluster, as only a few jobs can be scheduled for execution at a time. Figure 3 shows the optimized state transition of a job in SHadoop. After this optimization, the PREPARE.SETUP and CLEANUP states in Figure 1 are incorporated into the PREPARE.INITIALIZED and RUNNING.SUC_WAIT states respectively.

[Figure 3 (state diagram): from START, a job passes through the PREPARE phase (PREPARE.INITIALIZING, PREPARE.INITIALIZED), the RUNNING phase (RUNNING.WAIT_TO_RUN, RUNNING.RUNNING_TASKS, RUNNING.SUC_WAIT) and the FINISHED phase, ending in SUCCEEDED, KILLED or FAILED; the job setup task completes in PREPARE.INITIALIZED and the job cleanup task completes in RUNNING.SUC_WAIT.]

FIGURE 3: Job execution state transition after optimization

4.2 Optimizing Job/Tasks Execution Event Notification Mechanism

The MapReduce framework adopts a periodic heartbeat-based communication mechanism to exchange information and commands between the master node and slave nodes. From the task execution process described in Figure 2, we can see that the standard Hadoop MapReduce also uses these heartbeat messages between the JobTracker and TaskTrackers to deliver scheduling and execution event information for jobs and tasks. Each TaskTracker periodically sends information to the JobTracker and requests tasks in a pull model if it has free map/reduce slots. The JobTracker responds if it has more tasks to be executed. We refer to this as the pull-model heartbeat communication mechanism. Through this mechanism, the TaskTrackers also report node information to the JobTracker, and the JobTracker issues control commands to the TaskTrackers. For controlling and managing a Hadoop cluster, an appropriate heartbeat period should be set. Currently, for a cluster with fewer than 100 nodes, the default heartbeat interval is 3 seconds in the standard Hadoop, with an additional 1 second added per 100 extra nodes.

To some extent, the pull-model heartbeat communication mechanism can help prevent the JobTracker from being overwhelmed. However, a heartbeat usually carries various messages, such as the load state of a slave node, whether it is ready to execute tasks, a liveness report and so on. The transmission efficiency of some messages, such as readiness to execute tasks, strongly affects the performance of job execution. We refer to this type of performance-sensitive event message for job and task scheduling and execution as a critical event message.
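The interval rule stated above can be written as a one-line function. This is only one possible reading of the sentence (3 seconds up to 100 nodes, plus 1 second for every further full block of 100 nodes); the function name and the rounding choice are assumptions, and the exact computation inside Hadoop may differ.

```python
def heartbeat_interval_s(cluster_nodes, base_s=3, nodes_per_step=100):
    """Heartbeat interval in seconds under one reading of the rule in the text."""
    if cluster_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    # Assumption: one extra second per full block of `nodes_per_step` nodes beyond the first block.
    extra_nodes = max(0, cluster_nodes - nodes_per_step)
    return base_s + extra_nodes // nodes_per_step


for nodes in (36, 100, 200, 350):
    print(nodes, heartbeat_interval_s(nodes))
```

Under this reading, the 36-node test cluster used later in the evaluation would run with the 3-second default.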
Transferring critical event messages through the heartbeat communication mechanism incurs a heavy time cost during job execution for two reasons:

1) The JobTracker has to wait passively for the TaskTrackers to request tasks. As a result, there is a delay between submitting a job and scheduling its tasks, because the TaskTrackers do not contact the JobTracker until the heartbeat interval has passed.

2) Critical event messages (task requesting, task start running, task commit pending, and task finishing) cannot be reported promptly from the TaskTrackers to the JobTracker. This delays task scheduling, further increasing the time cost of job execution and decreasing the utilization of computing resources, even when the map/reduce slots on a TaskTracker are idle and waiting for tasks.

A short job usually has only dozens of tasks and runs for a couple of minutes. If each task is delayed for a few seconds, the total execution time is delayed by a noticeable amount.

The categories of heartbeat messages sent from the TaskTrackers to the JobTracker are summarized in Table 1. There are only four critical events: task requesting, task start running, task commit pending, and task finishing. The messages sent while a task is in the running state are not critical event messages. In other words, we regard the messages that move the task execution workflow to the next state as critical event messages, and all others as non-critical. Accelerating the transmission of these critical event messages shortens the time cost of task scheduling and execution, further shortening the total execution time of a job.

Decreasing the heartbeat interval is not a good solution to this problem. This naive approach could overwhelm the JobTracker and potentially crash the whole Hadoop cluster. It incurs many unnecessary heartbeats and is usually only used in small clusters [31]. To resolve this problem, in SHadoop we separate the critical event messages from the heartbeat messages and add an instant messaging communication mechanism for critical event notifications, as shown in Figure 4. In this new mechanism, when a critical event such as task completion happens, the message is sent to the JobTracker immediately. In this way, critical event messages are synchronized between the JobTracker and TaskTrackers quickly. For all job/task execution event notifications we use the instant messaging communication, but for cluster management events that are less performance-sensitive we still adopt the heartbeat communication mechanism. This way we can improve hardware resource utilization without overwhelming the JobTracker.

Table 1: Type of the Messages Sent From a TaskTracker to the JobTracker in SHadoop.

Message Type             Message Name
Critical messages        task requesting, task start running, task commit pending, task finishing
Non-critical messages    task in running

To make SHadoop fully compatible with the standard Hadoop, we avoid using any third-party libraries in the implementation of our instant messaging communication mechanism. For network communication, SHadoop still uses Hadoop's internal RPC mechanism. In SHadoop, when a critical event happens, the message is immediately transmitted between the JobTracker and TaskTrackers by the Hadoop RPC methods without waiting for a heartbeat period.

[Figure 4 (sequence diagram): time axes for the JobTracker, the TaskTracker and the Child JVM; steps (1) to (8) move the task through the UNASSIGNED, RUNNING, COMMIT_PENDING and SUCCEED states, with the task request and each task state changed event carried by instant messaging communication.]

FIGURE 4: Optimized task execution process after applying the instant messaging communication mechanism
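The split between the two channels can be illustrated with a short sketch. It is not SHadoop's implementation, which is built on Hadoop's Java RPC: the class name, the `send` callback standing in for an RPC call, and the message payloads are all hypothetical; only the four critical message names come from Table 1.

```python
from collections import deque

# The four critical event messages listed in Table 1.
CRITICAL_MESSAGES = {
    "task requesting",
    "task start running",
    "task commit pending",
    "task finishing",
}


class TaskTrackerSketch:
    """Toy TaskTracker that sends critical events at once and batches the rest."""

    def __init__(self, send):
        self.send = send        # hypothetical callable standing in for an RPC to the JobTracker
        self.pending = deque()  # non-critical messages waiting for the next heartbeat

    def emit(self, name, payload=None):
        if name in CRITICAL_MESSAGES:
            # Critical events bypass the heartbeat and go out immediately.
            self.send("instant", [(name, payload)])
        else:
            self.pending.append((name, payload))

    def heartbeat(self):
        # The periodic heartbeat still carries everything that is not performance-sensitive.
        batch = list(self.pending)
        self.pending.clear()
        self.send("heartbeat", batch)


sent = []
tracker = TaskTrackerSketch(lambda channel, messages: sent.append((channel, messages)))
tracker.emit("task in running", {"progress": 0.5})  # queued until the next heartbeat
tracker.emit("task finishing", {"task": "m_000001"})  # delivered immediately
tracker.heartbeat()
```

After these calls, `sent` holds the instant "task finishing" message first and the heartbeat batch containing "task in running" second, which mirrors the ordering the mechanism is meant to achieve.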

5 Evaluation

To verify the effects of our optimizations, we conducted a series of experiments to evaluate and compare the performance of SHadoop with the standard Hadoop. First, we performed a number of experiments to evaluate the effect of each optimization measure separately. Second, to evaluate how much our optimizations benefit MapReduce jobs with different workloads, we adopted several Hadoop MapReduce benchmark suites to further evaluate SHadoop. Third, since widely-used big data ad-hoc query and analysis systems such as Hive and Pig are built on top of MapReduce, we also verified how much SHadoop improves the execution efficiency of Hive with a number of comparative experiments. Fourth, we evaluated the scalability of SHadoop compared to the standard Hadoop. We carried out two experiments by (1) scaling the data while fixing the number of machines, and (2) scaling the number of machines while fixing the data. Finally, we evaluated the impact of the instant messaging optimization in SHadoop on the system workload with both formal analysis and experimental verification.

5.1 Environment Setup

The experiments were performed under Hadoop 1.0.3 (a stable version) and SHadoop. Our test cluster contains one master node and 36 compute nodes. The master node is equipped with two 6-core 2.8 GHz Xeon processors, 36 GB of memory and two 2 TB 7200 RPM SATA disks. Each compute node has two 4-core 2.4 GHz Xeon processors, 24 GB of memory and two 2 TB 7200 RPM SATA disks. The nodes are connected with 1 Gb/s Ethernet. They all run RHEL6 with the 2.6.32 kernel and the ext3 file system. Each compute node acts as a TaskTracker/DataNode, and the master node acts as the JobTracker/NameNode. The Hadoop configuration uses the default settings and 8 map/reduce slots per node. Both the standard Hadoop and SHadoop run on OpenJDK 1.6 with the same JVM heap size of 2 GB.
5.2 Analysis of Optimization Measures and Effects

SHadoop makes two optimizations to the standard Hadoop MapReduce. In this subsection, we performed a set of experiments to demonstrate the effect of each optimization, using the job execution time as the performance metric. First, we ran our experiments with the well-known WordCount benchmark. To keep the job short, the input data size was set to 4.5 GB with around 200 data blocks. We ran the benchmark with 16 reduce tasks on the standard Hadoop 1.0 environment and on SHadoop, using a cluster of 20 slave nodes with 160 slots in total. During the execution of a job, we recorded the load of the slots on each TaskTracker every second into a log file on the JobTracker. The results are shown in Figure 5.

Figure 5(a) shows how the number of running tasks of the WordCount benchmark varies over time on the standard Hadoop. It can be seen clearly that at the beginning of a job, it takes about 7 seconds to execute a setup task before running the users' map/reduce tasks. Similarly, a cleanup task needs to be executed before the job ends. As shown in Figure 5(b), after applying the optimized job setup/cleanup tasks, the setup and cleanup time costs are noticeably reduced. The total job execution time is shortened from 60 seconds to 46 seconds, a 23.3% improvement in performance.

As shown in Figure 5(c), with the instant messaging communication mechanism applied to the standard Hadoop, the number of running tasks stays higher and changes more smoothly. This indicates that during the job execution, the slots on the TaskTrackers are kept scheduled with tasks and rarely stay idle. This makes the execution process more compact and efficient by improving the CPU utilization rate of each slot. For a given MapReduce job, the total computation workload is fixed, so improving the CPU utilization rate of the map/reduce slots leads to a reduction in total execution time.

Figure 5(d) shows the job execution with both the setup/cleanup task optimization and the instant messaging optimization applied together. Compared with only the setup/cleanup task optimization in Figure 5(b), the total execution time of the job is further shortened from 46 seconds to 39 seconds, an extra improvement of about 11.7% relative to the 60-second baseline, by adding the instant messaging optimization.
The two optimizations have an additive effect because they work at different phases during job execution: the setup/cleanup task optimization works at the beginning and end of a job, while the instant messaging optimization takes effect in the middle of a job. To sum it up, both of the optimization measures can make a significant contribution to the performance improvement. Compared with the standard Hadoop, SHadoop can reduce the execution time cost of the short WordCount benchmark job by 35% in total.
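The percentages quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming every improvement is expressed relative to the 60-second run time of the standard Hadoop; the function and constant names are illustrative and do not come from SHadoop.

```python
def improvement(baseline_s: float, before_s: float, after_s: float) -> float:
    """Percentage of the baseline run time saved when going from before_s to after_s."""
    return 100.0 * (before_s - after_s) / baseline_s


# Job execution times reported in the text for the short WordCount job (seconds).
STANDARD = 60.0        # standard Hadoop, Figure 5(a)
SETUP_CLEANUP = 46.0   # setup/cleanup optimization only, Figure 5(b)
BOTH = 39.0            # both optimizations, Figure 5(d)

first = improvement(STANDARD, STANDARD, SETUP_CLEANUP)
extra = improvement(STANDARD, SETUP_CLEANUP, BOTH)
total = improvement(STANDARD, STANDARD, BOTH)

print(f"setup/cleanup optimization: {first:.1f}%")
print(f"instant messaging, extra:   {extra:.1f}%")
print(f"total:                      {total:.1f}%")
```

Because both steps are measured against the same baseline, the two partial improvements sum to the total reduction.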

(a) running wordcount benchmark on the standard Hadoop

(b) running wordcount benchmark with the job setup/cleanup task optimization.

(c) running wordcount benchmark with the instant messaging optimization.

(d) running wordcount benchmark with both job setup/cleanup tasks optimization and instant messaging optimization.

FIGURE 5: Performance evaluation for effects of optimization measures in SHadoop (in seconds, the lower Running Time is better)
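The utilization argument behind Figure 5(c) can be illustrated with a small sketch. The two traces below are hypothetical per-second samples, not measured data, chosen so that both perform the same amount of work; only the slot count (20 slave nodes with 8 slots each) is taken from the experiment.

```python
from typing import Sequence


def slot_utilization(running_tasks: Sequence[int], total_slots: int) -> float:
    """Average fraction of busy slots over a trace sampled once per second."""
    if not running_tasks or total_slots <= 0:
        raise ValueError("need a non-empty trace and a positive slot count")
    return sum(running_tasks) / (len(running_tasks) * total_slots)


TOTAL_SLOTS = 160  # 20 slave nodes x 8 slots

# Hypothetical traces (NOT measured): each performs 1,280 slot-seconds of work.
heartbeat_like = [0, 80, 160, 80, 160, 80, 160, 80, 160, 80, 160, 80]
instant_like = [160] * 8

for name, trace in (("heartbeat-like", heartbeat_like), ("instant-like", instant_like)):
    print(f"{name}: {len(trace)} s, utilization {slot_utilization(trace, TOTAL_SLOTS):.2f}")
```

With the work held constant, the trace that keeps the slots busy finishes sooner, which is the effect the instant messaging optimization aims for.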

Grep and Sort are two other widely-used MapReduce benchmarks that are also used in the original MapReduce paper [1]. Grep is a typical map-side job. For a map-side job, most of the work is done in the map tasks. The output of the map tasks is usually several orders of magnitude smaller than the input, and thus there is little work for the reduce tasks. On the other hand, Sort is a typical reduce-side job. For reduce-side jobs, most of the execution time is spent in the reduce phase, including shuffling the intermediate data and performing the reduce tasks. In these jobs, the output data of the map tasks is the same size as the job input data. To evaluate the effect of the optimizations on different types of MapReduce jobs, we also ran these two benchmarks on the standard Hadoop and SHadoop. Experiments in the first group were performed on the Grep benchmark with 10 GB of input data. For the Sort benchmark, experiments were performed on 3 GB of input data. Both experiments were conducted on 20 slave nodes, as in the WordCount benchmark experiments. The results of these two groups of experiments are shown in Figure 6 and Figure 7 respectively.

(a) running Grep benchmark on the standard Hadoop

(b) running Grep benchmark on SHadoop

FIGURE 6: Performance evaluation of Grep benchmark under standard Hadoop and SHadoop (in seconds, lower Running Time is better)

As shown in Figure 6 and Figure 7, compared with the standard Hadoop, SHadoop shortens the execution time of the Grep benchmark from 47 seconds to 29 seconds and that of the Sort benchmark from 63 seconds to 41 seconds. The total execution time is reduced by about 38% and 35% respectively. This demonstrates that the optimizations can enhance the execution efficiency of both map-side and reduce-side MapReduce jobs.

(a) running Sort benchmark on the standard Hadoop

(b) running Sort benchmark on SHadoop

FIGURE 7: Performance evaluation of Sort benchmark under standard Hadoop and SHadoop (in seconds, lower Running Time is better)

5.3 Impact on Performance of Comprehensive Benchmarks

In this subsection, to better evaluate the general applicability and stable performance improvement of our optimization work, we further test its impact with comprehensive benchmarks: HiBench [13], a widely-used benchmark suite from Intel; MRBench [14], a benchmark shipped with the standard Hadoop distribution; and the Hive benchmarks, a widely-used application benchmark suite.

1) HiBench Evaluation

HiBench is a widely-used benchmark suite for Hadoop [13]. It consists of a set of Hadoop MapReduce program benchmarks, including both synthetic micro-benchmarks and real-world Hadoop applications. The running time of each HiBench benchmark is shown in Figure 8, and the corresponding performance improvement rates are recorded in Table 2. The running times of the standard Hadoop with each individual optimization are also reported. From Table 2, we can see that both optimization measures always take effect and that the performance improvements vary across benchmarks. Some commonly-used benchmarks, such as WordCount, Sort, and Grep, gain more than 30%, while the NutchIndexing and HiveBench-aggrator benchmarks gain about 6%. Therefore, SHadoop improves the execution efficiency of all benchmarks to some degree, and our optimization work achieves general applicability.

FIGURE 8: Execution performance of each HiBench benchmark under Hadoop and SHadoop (in minutes; lower Execution Time is better). Note: the running time of the Bayes benchmark is long, more than 25 minutes. To better illustrate the performance of all the benchmarks, we scale down the running time of the Bayes benchmark by a factor of 10. Its exact execution time is noted on its data tags.

Table 2: Performance evaluation for effects of optimization measures in SHadoop using HiBench. (Hadoop represents the standard Hadoop, the first optimization represents Hadoop with only the setup/cleanup task optimization, the second optimization represents Hadoop with only the instant messaging optimization, and SHadoop represents Hadoop with both optimizations)

| Benchmark case | Hadoop | The First Optimization | The Second Optimization | SHadoop | Improvement In Total |
|---|---|---|---|---|---|
| WordCount | 60 sec | 50 sec | 51 sec | 39 sec | 35.00% |
| Sort | 63 sec | 53 sec | 52 sec | 41 sec | 34.90% |
| Grep | 47 sec | 36 sec | 40 sec | 29 sec | 38.27% |
| PageRank | 201 sec | 179 sec | 167 sec | 146 sec | 27.36% |
| Kmeans | 283 sec | 224 sec | 271 sec | 213 sec | 24.73% |
| NutchIndexing | 159 sec | 153 sec | 156 sec | 151 sec | 6.00% |
| HiveBench-aggrator | 113 sec | 110 sec | 108 sec | 106 sec | 6.00% |
| HiveBench-join | 212 sec | 199 sec | 197 sec | 185 sec | 12.70% |
| Bayes | 1,697 sec | 1,588 sec | 1,640 sec | 1,518 sec | 10.55% |
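One way to read Table 2 is to compare the time saved by each optimization alone with the time saved when both are applied. The sketch below does this for four of the rows; it is an illustrative calculation on the reported numbers, and the names `TABLE2_ROWS` and `savings` are not part of SHadoop.

```python
# (Hadoop, first optimization only, second optimization only, SHadoop), in seconds.
TABLE2_ROWS = {
    "WordCount": (60, 50, 51, 39),
    "Sort": (63, 53, 52, 41),
    "HiveBench-join": (212, 199, 197, 185),
    "Bayes": (1697, 1588, 1640, 1518),
}


def savings(row):
    """Seconds saved by the first, the second, and both optimizations."""
    base, first, second, both = row
    return base - first, base - second, base - both


for name, row in TABLE2_ROWS.items():
    s1, s2, s_both = savings(row)
    print(f"{name}: {s1} s + {s2} s = {s1 + s2} s individually, {s_both} s combined")
```

For these rows the combined saving is close to the sum of the individual savings, which is what one would expect if the two optimizations act on different phases of a job.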

2) Hadoop MRBench Evaluation

MRBench is one of the benchmarks shipped with the standard Hadoop distribution for benchmarking, stress testing, measuring, and comparing the performance of a Hadoop cluster with that of others. MRBench creates a sequence of small MapReduce jobs (the number can be configured), and the job execution efficiency of the underlying Hadoop is evaluated by the total time cost of these jobs. We ran three groups of comparative experiments with different numbers of submitted jobs. The results are shown in Table 3. The running times of the standard Hadoop with each individual optimization are also reported. From Table 3, we can see that both optimization measures take effect and that the performance improvement rate of SHadoop is consistently around 30%.

Table 3: Experiment results of MRBench under the standard Hadoop and SHadoop. (Hadoop represents the standard Hadoop, the first optimization represents Hadoop with only the setup/cleanup task optimization, the second optimization represents Hadoop with only the instant messaging optimization, and SHadoop represents Hadoop with both optimizations)

| # Jobs | Hadoop | The First Optimization | The Second Optimization | SHadoop | Improvement In Total |
|---|---|---|---|---|---|
| 5 | 122 sec | 91 sec | 114 sec | 85 sec | 30.30% |
| 50 | 1,252 sec | 943 sec | 1,178 sec | 876 sec | 30.03% |
| 500 | 12,504 sec | 9,020 sec | 12,117 sec | 8,754 sec | 30.00% |
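Dividing the totals in Table 3 by the number of jobs gives a per-job view of the same result. This is a minimal sketch that assumes the MRBench jobs run one after another, so that the total time splits evenly across jobs; the helper names are illustrative.

```python
# Total MRBench run times from Table 3: jobs -> (standard Hadoop s, SHadoop s).
MRBENCH_TOTALS = {
    5: (122, 85),
    50: (1252, 876),
    500: (12504, 8754),
}


def per_job_seconds(total_seconds: float, jobs: int) -> float:
    """Average wall-clock time attributed to one job in the sequence."""
    return total_seconds / jobs


per_job_saving = {}
for jobs, (hadoop_s, shadoop_s) in MRBENCH_TOTALS.items():
    before = per_job_seconds(hadoop_s, jobs)
    after = per_job_seconds(shadoop_s, jobs)
    per_job_saving[jobs] = before - after
    print(f"{jobs:>3} jobs: {before:.2f} s -> {after:.2f} s per job "
          f"(saved {before - after:.2f} s)")
```

Under this assumption each small job takes roughly 25 seconds on the standard Hadoop and about 17.5 seconds on SHadoop, so the saving per job stays near 7.5 seconds regardless of how many jobs are submitted.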

3) Hive Benchmark Evaluation

In practice, Hive and Pig are used more widely than hand-coded MapReduce programs for big data query and analysis applications. As we have mentioned, more than 95% of the Hadoop jobs at Facebook are not hand-coded but generated by Hive [11]. One motivation of our optimization work is to benefit big data query and analysis systems. Therefore, we also evaluate the performance improvement for the Hive benchmarks.

FIGURE 9: Execution Performance for benchmarks in Hive under Hadoop and SHadoop.

In this experiment, we use Hive 0.9 as the big data query and analysis system, and run a number of the Hive benchmarks on Hive over the standard Hadoop and over SHadoop respectively. The experimental results are shown in Figure 9, and the corresponding performance improvement rates are recorded in Table 4. The running times of the standard Hadoop with each individual optimization are also reported. From Table 4, we can see that our optimized MapReduce framework noticeably accelerates the execution of Hive applications. Both optimization measures take effect, and the performance improvements vary across benchmarks. The average total improvement rate is around 20%, which is significant for many online query and analysis workloads.

Table 4: Performance evaluation for effects of optimization measures in SHadoop using the Hive benchmarks. (In the table, Hadoop represents the standard Hadoop, the first optimization represents Hadoop with only the setup/cleanup task optimization, the second optimization represents Hadoop with only the instant messaging optimization, and SHadoop represents Hadoop with both optimizations. GB_SingleReducer is short for GroupBy_SingleReducer.)

| Benchmark Name | Hadoop | The First Optimization | The Second Optimization | SHadoop | Improvement In Total |
|---|---|---|---|---|---|
| Join | 67 sec | 61 sec | 56 sec | 51 sec | 23.9% |
| Combine | 123 sec | 106 sec | 116 sec | 99 sec | 19.5% |
| GroupBy (GB) | 49 sec | 43 sec | 45 sec | 39 sec | 20.4% |
| GB_SingleReducer | 99 sec | 87 sec | 93 sec | 82 sec | 17.2% |
| Insert_Into | 113 sec | 94 sec | 109 sec | 91 sec | 18.6% |
| Order | 25 sec | 22 sec | 24 sec | 21 sec | 16.0% |
| Sort | 26 sec | 22 sec | 24 sec | 21 sec | 19.2% |
| Union | 26 sec | 20 sec | 24 sec | 19 sec | 23.1% |
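The "around 20%" average quoted for the Hive benchmarks can be checked directly against the last column of Table 4. The snippet below is a small illustrative calculation using the percentages as reported; the variable names are not from the original work.

```python
from statistics import mean

# "Improvement In Total" column of Table 4, in percent, as reported.
HIVE_IMPROVEMENT = {
    "Join": 23.9,
    "Combine": 19.5,
    "GroupBy": 20.4,
    "GB_SingleReducer": 17.2,
    "Insert_Into": 18.6,
    "Order": 16.0,
    "Sort": 19.2,
    "Union": 23.1,
}

average = mean(HIVE_IMPROVEMENT.values())
print(f"average improvement: {average:.1f}%")
print(f"range: {min(HIVE_IMPROVEMENT.values())}% to {max(HIVE_IMPROVEMENT.values())}%")
```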

5.4 Scalability

In this subsection, we experimentally evaluate the scalability of SHadoop compared to the standard Hadoop by scaling the data while fixing the number of machines, and by scaling the number of machines while fixing the data.

1) Data scalability: Table 5 shows the performance of SHadoop and the standard Hadoop with different sizes of input data. The experimental results come from the WordCount benchmark running on 20 nodes. From Table 5, we see that SHadoop outperforms the standard Hadoop for all sizes of input data per node. The improvement ranges from 30.38% with 256 MB of input data per node to 4.47% with 8 GB of input data per node. This indicates that SHadoop improves the performance of MapReduce jobs of various sizes, but the effect is much more significant for short jobs. In addition, the nearly linear relation between the job execution time and the data size indicates that SHadoop scales well as the data size varies.

Table 5: Execution time of the wordcount benchmark in Hadoop and SHadoop with different sizes of input data per node.

| Data per Node | 256MB | 512MB | 1GB | 2GB | 4GB | 8GB |
|---|---|---|---|---|---|---|
| Hadoop | 79 sec | 115 sec | 172 sec | 291 sec | 499 sec | 962 sec |
| SHadoop | 55 sec | 90 sec | 147 sec | 263 sec | 472 sec | 919 sec |
| Improvement rate | 30.38% | 21.74% | 14.53% | 9.62% | 5.41% | 4.47% |
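The shrinking improvement rate in Table 5 is easier to interpret when the absolute time saved is shown next to it. The following sketch recomputes both from the reported run times; it is an illustrative calculation, and the list names are chosen here for convenience.

```python
DATA_PER_NODE = ["256MB", "512MB", "1GB", "2GB", "4GB", "8GB"]
HADOOP_S = [79, 115, 172, 291, 499, 962]    # standard Hadoop, seconds
SHADOOP_S = [55, 90, 147, 263, 472, 919]    # SHadoop, seconds

saved_s = [h - s for h, s in zip(HADOOP_S, SHADOOP_S)]
rate_pct = [100.0 * d / h for d, h in zip(saved_s, HADOOP_S)]

for size, d, r in zip(DATA_PER_NODE, saved_s, rate_pct):
    print(f"{size:>5}: saved {d:>2} s ({r:.2f}%)")
```

The absolute saving stays within a few tens of seconds across a 32-fold increase in data size, while the baseline run time grows more than tenfold, which is why the relative improvement is largest for short jobs.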

2) Machine scalability: We also evaluate the job execution performance of SHadoop and the standard Hadoop with different numbers of nodes. All the experiments in this part are also conducted on the WordCount benchmark, with 10 GB of data in total and around 500 input blocks. The experimental results are shown in Figure 10. In Figure 10, the execution times of the job on SHadoop with 4, 8, 16, and 32 nodes are 287 seconds, 141 seconds, 85 seconds, and 46 seconds respectively. Each time is roughly half of the previous one, which means SHadoop scales well with the number of cluster nodes. Like the standard Hadoop, SHadoop speeds up nearly proportionally as more nodes are added. Furthermore, with the same number of nodes, the job execution efficiency of SHadoop is always higher than that of the standard Hadoop. The improvement percentages are more noticeable for shorter jobs.

FIGURE 10: Execution time of wordcount benchmark in SHadoop and the standard Hadoop with different node numbers.
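The SHadoop run times quoted for Figure 10 can be turned into speedup and parallel efficiency figures. This is a minimal sketch that takes the 4-node run as the reference point, which is an assumption made here for illustration; the function name is hypothetical.

```python
NODES = [4, 8, 16, 32]
SHADOOP_S = [287, 141, 85, 46]  # execution times quoted for Figure 10, seconds


def speedup_and_efficiency(nodes, times):
    """Return (nodes, speedup, efficiency) tuples relative to the first entry."""
    base_nodes, base_time = nodes[0], times[0]
    rows = []
    for n, t in zip(nodes, times):
        speedup = base_time / t
        ideal = n / base_nodes
        rows.append((n, speedup, speedup / ideal))
    return rows


scaling = speedup_and_efficiency(NODES, SHADOOP_S)
for n, speedup, efficiency in scaling:
    print(f"{n:>2} nodes: speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```

Relative to 4 nodes, the speedup at 32 nodes is a little over 6x against an ideal of 8x, so the efficiency remains close to 0.8 over this range.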

In conclusion, SHadoop achieves much better execution efficiency than the standard Hadoop under different node numbers, without losing good scalability in clusters with dozens of nodes.

5.5 Impact on System Workloads

In this subsection, we study the impact of the instant messaging optimization on the system workloads. First, we model this problem with a formal analysis of the impact on system workloads. To validate our model, we then evaluate the real workloads in our cluster with the MapReduce job benchmarks before and after the instant messaging optimization. The workloads we study here include network traffic, CPU usage and memory usage in both the JobTracker and the TaskTrackers.

1) Quantitative Analysis: In our Hadoop cluster, the JobTracker runs on the master node. There are also m slave nodes, each of which runs a TaskTracker. A TaskTracker can run k tasks simultaneously, where k is the number of slots. In the standard Hadoop, the TaskTrackers communicate with the JobTracker periodically through heartbeats. Let the heartbeat interval be T and the size of a heartbeat message be c. All the slots in the same TaskTracker share one heartbeat timer. When the timer counts the interval T down to zero, it triggers the TaskTracker to send a heartbeat message to the JobTracker.

Whenever a heartbeat message is sent out, the TaskTracker resets the timer to count from the start again. Assume the life span of a task on a slot can be covered by a time window t, where t varies for different tasks. Then, as shown in Figure 11(a) and Figure 11(b), the increased number of messages is no more than 4 × m × k. Thus, the increased message size transferred during the time window t is no more than 4 × m × k × c. The increased message size is independent of the task execution time window t. This means that the number of extra messages caused by our instant messaging optimization is a fixed overhead, no matter how long the task execution time window t is.

(a) message transferring model of the TaskTracker in the standard Hadoop (k slots execute k parallel tasks over the time period t; heartbeat interval T; t/T heartbeat messages)

(b) message transferring model of the TaskTracker in SHadoop (k slots execute k parallel tasks over the time period t; heartbeat interval T; (t − t1 − t2)/T heartbeat messages, plus four instant messages per slot)

FIGURE 11: The message transferring model of the TaskTracker in the standard Hadoop and SHadoop. IM is an abbreviation for instant message.
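The upper bound derived above is simple enough to evaluate numerically. The sketch below plugs in illustrative values (20 slave nodes, 12 slots per node, and messages of about 2 KB); these values and the function names are assumptions chosen for the example rather than part of the model itself.

```python
def max_extra_messages(m: int, k: int) -> int:
    """Upper bound on extra messages: four instant messages per slot, m*k slots."""
    return 4 * m * k


def max_extra_bytes(m: int, k: int, c_bytes: int) -> int:
    """Upper bound on extra traffic for one task window, in bytes."""
    return max_extra_messages(m, k) * c_bytes


# Illustrative values: m slave nodes, k slots per node, c bytes per message.
m, k, c = 20, 12, 2 * 1024

extra_msgs = max_extra_messages(m, k)
extra_mib = max_extra_bytes(m, k, c) / (1024 * 1024)
print(f"at most {extra_msgs} extra messages, about {extra_mib:.2f} MiB per task window")
```

Note that the task duration t does not appear anywhere in the calculation, which is the sense in which the extra traffic is a fixed overhead.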

On the other hand, in practice, k is the number of slots in a TaskTracker that run tasks simultaneously. It is usually the number of cores of a machine, for example 4, 8 or 16. m is the number of slaves in the Hadoop cluster. In a moderately-sized cluster, this number is usually in the dozens. m × k can be regarded as the number of cores running tasks in parallel in the cluster. It can reach hundreds in a moderately-sized cluster. c is the size of a message, which is around 2 KB. Therefore, the increased network traffic during the time window t (around 40 seconds for typical MapReduce tasks) is around several megabytes, which will not create a problem for a commonly-used cluster network environment such as Fast Ethernet or Gigabit Ethernet.

2) Experimental Studies: We also evaluate the increased workload caused by the instant messaging optimization through experiments to better verify the impact on system workloads. The workload we evaluate during the experiments includes not only the cluster network traffic but also the CPU and memory usage in the JobTracker and the TaskTrackers. In our experimental cluster, the number of slave machines is 20 and each machine has 8 cores. We configured 8 map and 4 reduce slots per node. Thus the number of tasks running in parallel is 240. We chose a typical MapReduce job, WordCount, as the benchmark case. To study tasks with different running times, we adopted three datasets of different scales, with data block sizes of 16 MB, 32 MB and 64 MB respectively. That means each task needs to process 16 MB, 32 MB or 64 MB of data respectively. Table 6 shows the time cost and the number of transmitted messages of each task in SHadoop and the standard Hadoop. More messages are sent in SHadoop than in the standard Hadoop. This is because each slot of a TaskTracker sends instant messages independently. This reduces the opportunities to coalesce the messages of the other slots in the same TaskTracker at the beginning and end of tasks. However, as shown in Table 6, the increase in the number of messages is only around 30. Also, the physical size of each message is around 2 KB, so the increased data size of a TaskTracker is only dozens of kilobytes, which is not large for a moderately-sized cluster.

Table 6: Execution time and transmitted message numbers of Map/Reduce tasks in SHadoop and the standard Hadoop with different data block sizes. The number of messages is the total sent from a TaskTracker.

| Data Block Size / Task Role | Execution Time (Hadoop) | Num. of Messages (Hadoop) | Execution Time (SHadoop) | Num. of Messages (SHadoop) | Increased Num. of Messages |
|---|---|---|---|---|---|
| 16 MB / map task | 21 sec | 7 | 17 sec | 37 | 30 |
| 16 MB / reduce task | 48 sec | 16 | 41 sec | 45 | 29 |
| 32 MB / map task | 33 sec | 11 | 29 sec | 41 | 30 |
| 32 MB / reduce task | 62 sec | 21 | 71 sec | 55 | 34 |
| 64 MB / map task | 57 sec | 19 | 54 sec | 50 | 31 |
| 64 MB / reduce task | 117 sec | 39 | 113 sec | 69 | 30 |
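The numbers in Table 6 can be cross-checked with a short script. This sketch, which assumes a message size of 2 KB as stated in the text, recomputes the extra messages and extra kilobytes per TaskTracker, and also derives the time between messages on the standard Hadoop; the names used are illustrative.

```python
# (task, Hadoop time s, Hadoop messages, SHadoop time s, SHadoop messages)
TABLE6 = [
    ("16 MB / map", 21, 7, 17, 37),
    ("16 MB / reduce", 48, 16, 41, 45),
    ("32 MB / map", 33, 11, 29, 41),
    ("32 MB / reduce", 62, 21, 71, 55),
    ("64 MB / map", 57, 19, 54, 50),
    ("64 MB / reduce", 117, 39, 113, 69),
]
MESSAGE_KB = 2  # approximate message size given in the text

extra_messages = [s_msgs - h_msgs for _, _, h_msgs, _, s_msgs in TABLE6]
seconds_per_heartbeat = [h_time / h_msgs for _, h_time, h_msgs, _, _ in TABLE6]

for (task, *_), extra, interval in zip(TABLE6, extra_messages, seconds_per_heartbeat):
    print(f"{task:<15} +{extra} messages (~{extra * MESSAGE_KB} KB), "
          f"one Hadoop message every {interval:.2f} s")
```

The increase stays near 30 messages, or roughly 60 KB, for every task length, and the message counts on the standard Hadoop correspond to about one message every 3 seconds, which suggests (as an inference from the table, not a stated setting) a heartbeat interval of that order.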

We also recorded the system workloads of the JobTracker and the TaskTrackers when running the tasks in Table 6. The workload of a typical task, the 32 MB map task, is chosen to illustrate the impact on the system workloads. Figure 12 shows the workload of the JobTracker while running the task under the standard Hadoop and SHadoop. From Figure 12(a), we can see that the increased network traffic in the cluster is only several megabytes. This increment is quite acceptable for a commonly-used cluster network environment such as Fast Ethernet or the Gigabit Ethernet in our cluster. As shown in Figure 12(b) and Figure 12(c), the CPU and memory usage do not increase much. This shows that the optimization does not overwhelm the JobTracker.

(a) network workload

(b) CPU workload

(c) Memory workload

FIGURE 12: Workloads of the JobTracker running the 32 MB/map wordcount task under the standard Hadoop and SHadoop. The solid lines represent the standard Hadoop and the dash lines represent SHadoop.

Similarly, the workload of the TaskTracker during the experiments is shown in Figure 13. We can see that the workload of the TaskTracker does not change much when the optimizations are applied.

(a) network workload

(b) CPU workload

(c) Memory workload

FIGURE 13: Workloads of TaskTracker running the 32 MB/map wordcount task under the standard Hadoop and SHadoop. The solid lines represent the standard Hadoop and the dash lines represent SHadoop.

To sum up, on the one hand, because the number of increased messages is fixed and their data sizes are not large, the optimizations made in SHadoop neither add much overhead to the system nor overwhelm it. On the other hand, they improve the job execution performance by leading the MapReduce jobs to make better use of the hardware resources in a small to moderately-sized cluster. Some studies [4] have already noted that small to moderately-sized MapReduce clusters, with no more than dozens of machines and hundreds of cores in total, are very common in most companies and laboratories. On the official Hadoop powered-by website [15], we can see that among about 600 Hadoop clusters from nearly 160 industry and research institutions, around 95% of the clusters are made up of a few dozen nodes. Only a few clusters from Yahoo!, Facebook and eBay reach the scale of more than 500 nodes. From this point of view, the cluster of 36 nodes and 288 cores in our experiment is close to the scale of most Hadoop clusters used in production environments. Therefore, our optimization work towards efficiently utilizing small to moderately-sized clusters is a contribution to improving the performance of Hadoop in real applications.

6 Conclusion and Future Work

MapReduce is a popular programming model and framework for processing large datasets. It has been widely used and recognized for its simple programming interfaces, fault tolerance and elastic scalability. However, the job execution performance of MapReduce has received relatively little attention. In this paper, we explore an approach to optimize the job and task execution mechanism and present an optimized version of Hadoop, named SHadoop, to improve the execution performance of MapReduce jobs. SHadoop makes two major optimizations: it optimizes the job initialization and termination stages, and it provides an instant messaging communication mechanism for efficient critical event notifications. The first optimization shortens the startup and cleanup time of all jobs and is especially effective for jobs with short running times. The second optimization benefits most short jobs with large deployments or many tasks. One potential side effect of our optimizations is that they may put slightly more burden on the JobTracker, as it needs to create and delete an empty temporary directory for each job. However, the increased burden is very small, as shown in Figure 12 and Figure 13, unless the machine running the JobTracker is configured with relatively low computational resources. Also, if the jobs are always long-running ones, they will benefit little from our optimizations. Compared with the standard Hadoop, SHadoop achieves a 25% performance improvement on average for the various tested Hadoop benchmark jobs and Hive applications. It also achieves excellent scalability when the data size varies and excellent speedup when the number of cluster nodes increases. Moreover, SHadoop preserves all the features of the standard Hadoop MapReduce framework without changing any programming APIs of Hadoop. It is fully compatible with the existing programs and applications built on top of Hadoop.

To the best of our knowledge, SHadoop is the first effort that optimizes the execution mechanism inside map/reduce tasks. Experimental results have shown that SHadoop works well on small to moderately-sized clusters, which are the most common case in practice, and achieves stable performance improvement for comprehensive benchmarks. Our optimization work is a contribution to the Hadoop MapReduce framework and has now been integrated into the Intel Distributed Hadoop [12] after passing a production-level test at Intel. We have also distributed SHadoop as an open source project, which is available on GitHub [30].

In the future, we will explore more possible optimizations to further improve MapReduce performance by better utilizing cluster hardware resources. Currently the slots of a Hadoop cluster can only be statically configured and used, which limits the dynamic scheduling of slots in terms of the actual utilization rate and workload of computing resources. Thus, we are planning to study a resource context-aware optimization model and approach for dynamic slot scheduling in the Hadoop MapReduce execution framework. We will also work on a job cost-aware model and approach for optimizing the MapReduce job scheduler in terms of workloads for different types of applications, such as computation-intensive, I/O-intensive or memory-intensive jobs. Finally, we plan to integrate all these new optimizations with the optimizations proposed in this paper to achieve further performance improvement.

Acknowledgment

This work is funded in part by China NSF Grants (61223003) and the National High Technology Research and Development Program of China (863 Program) (2011AA01A202).

References

[1] J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM 51(1) 107-113.
[2] Apache Hadoop. http://hadoop.apache.org.
[3] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, Improving mapreduce performance in heterogeneous environments, in: Proceedings of the 8th USENIX conference on Operating systems design and implementation (OSDI), 2008, pp. 29-42.
[4] Zhang, S. and Han, J. and Liu, Z. and Wang, K. and Feng, S., Accelerating MapReduce with Distributed Memory Cache, in: 15th International Conference on Parallel and Distributed Systems (ICPADS), 2009, pp. 472-478.
[5] Under the Hood: Scheduling MapReduce jobs more efficiently with Corona. https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920
[6] Y. Chen, A. Ganapathi, R. Griffith, and R. Katz, The case for evaluating mapreduce performance using workload suites, in: 19th International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2011, pp. 390–399.
[7] R. Pike, S. Dorward, R. Griesemer, S. Quinlan, Interpreting the Data: Parallel Analysis with Sawzall, Scientific Programming Journal 13(4) 227-298.
[8] Thusoo, A. et al., Hive-A Warehousing Solution Over a Map-Reduce Framework, in: Proceedings of the VLDB Endowment, vol. 2, no. 2, Aug. 2009, pp. 1626-1629.
[9] Olston, C. and Reed, B. and Srivastava, U. and Kumar, R. and Tomkins, A., Pig latin: a not-so-foreign language for data processing, in: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, 2008, pp. 1099-1110.
[10] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin, MapReduce and parallel DBMSs: friends or foes?, Communications of the ACM 53(1) 64–71.
[11] Lee, R. and Luo, T. and Huai, Y. and Wang, F. and He, Y. and Zhang, X., Ysmart: Yet another sql-to-mapreduce translator, in: 31st International Conference on Distributed Computing Systems (ICDCS), 2011, pp. 25-36.
[12] Intel Distributed Hadoop. http://www.intel.cn/idh
[13] Huang, S. and Huang, J. and Dai, J. and Xie, T. and Huang, B., The HiBench Benchmark Suite: Characterization of The MapReduce-Based Data Analysis, in: 26th International Conference on Data Engineering Workshops (ICDEW), 2010, pp. 41-51.
[14] Benchmarking and Stress Testing an Hadoop Cluster With TeraSort, TestDFSIO & Co. http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
[15] PoweredBy - Hadoop Wiki. http://wiki.apache.org/hadoop/PoweredBy
[16] You, H.H. and Yang, C.C. and Huang, J.L., A load-aware scheduler for MapReduce framework in heterogeneous cloud environments, in: Proceedings of the 2011 ACM Symposium on Applied Computing, 2011, pp. 127-132.
[17] Nanduri, R. and Maheshwari, N. and Reddyraja, A. and Varma, V., Job Aware Scheduling Algorithm for MapReduce Framework, in: 3rd IEEE International Conference on Cloud Computing Technology and Science (CloudCom), 2011, pp. 724-729.
[18] M. Hammoud and M. Sak, Locality-Aware Reduce Task Scheduling for MapReduce, in: 3rd IEEE International Conference on Cloud Computing Technology and Science (CloudCom), 2011, pp. 570-576.
[19] Xie, J. et al., Improving MapReduce performance through data placement in heterogeneous Hadoop clusters, in: 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010, pp. 1-9.
[20] He, C. and Lu, Y. and Swanson, D., Matchmaking: A New MapReduce Scheduling Technique, in: 3rd International Conference on Cloud Computing Technology and Science (CloudCom), 2011, pp. 40-47.
[21] Mao, H. and Hu, S. and Zhang, Z. and Xiao, L. and Ruan, L., A Load-Driven Task Scheduler with Adaptive DSC for MapReduce, in: 2011 IEEE/ACM International Conference on Green Computing and Communications (GreenCom), 2011, pp. 28-33.
[22] R. Vernica, A. Balmin, K. S. Beyer, and V. Ercegovac, Adaptive MapReduce using situation-aware mappers, in: Proceedings of the 15th International Conference on Extending Database Technology, 2012, pp. 420-431.
[23] Becerra Fontal, Y. and Beltran Querol, V. and Carrera P, D. and others, Speeding up distributed MapReduce applications using hardware accelerators, in: International Conference on Parallel Processing (ICPP), 2009, pp. 42-49.
[24] Xin, M. and Li, H., An Implementation of GPU Accelerated MapReduce: Using Hadoop with OpenCL for Data-and Compute-Intensive Jobs, in: 2012 International Joint Conference on Service Sciences (IJCSS), 2012, pp. 6-11.
[25] Li, B. and Mazur, E. and Diao, Y. and McGregor, A. and Shenoy, P., A Platform for Scalable One-Pass Analytics using MapReduce, in: Proceedings of the 2011 ACM SIGMOD international conference on Management of data, 2011, pp. 985-996.
[26] Seo, S. et al., HPMR: Prefetching and Pre-shuffling in Shared MapReduce Computation Environment, in: International Conference on Cluster Computing and Workshops (CLUSTER), 2009, pp. 1-8.
[27] Wang, Y. and Que, X. and Yu, W. and Goldenberg, D. and Sehgal, D., Hadoop Acceleration Through Network Levitated Merge, in: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011, pp. 57-67.
[28] Babu, S., Towards automatic optimization of mapreduce programs, in: Proceedings of the 1st ACM symposium on Cloud computing (SoCC), 2011, pp. 137-142.
[29] Danga Interactive, memcached, http://www.danga.com/memcached/.
[30] RongGu/SHadoop. https://github.com/RongGu/SHadoop
[31] Todd Lipcon/[MAPREDUCE-1906] Lower default minimum heartbeat interval for tasktracker > Jobtracker - ASF JIRA. https://issues.apache.org/jira/browse/MAPREDUCE-1906

Author Biography

Rong Gu received the BS degree in computer science from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2011. He is currently a PhD candidate in computer science at Nanjing University, Nanjing, China. His research interests include parallel and distributed computing, cloud computing, and big data parallel processing.

Xiaoliang Yang received the BS degree in computer science from YanShan University, China, in 2008 and the master's degree in computer science from Nanjing University, Nanjing, China, in 2012. He currently works at Baidu. His research interests include parallel and distributed computing and bioinformatics.

Jinshuang Yan received the BS degree in computer science from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2010. He is currently a master's student in computer science at Nanjing University, Nanjing, China. His research interests include parallel computing and large-scale data analysis.

Yuanhao Sun joined Intel in 2003. He was managing the big data product team in the Datacenter Software Division at Intel Asia-Pacific R&D Ltd., leading the efforts for Intel’s distribution of Hadoop and related solutions and services. Yuanhao received his bachelor's and master's degrees from Nanjing University, both in computer science.

Bing Wang received the BS degree in software engineering from Nanjing University. He is currently a master's degree candidate in software engineering at Nanjing University and an intern at the Intel Asia-Pacific R&D Center. His research interests include distributed computing, large-scale data analysis, and data mining.

Chunfeng Yuan is currently a professor in the computer science department of Nanjing University, China. She received her bachelor's and master's degrees from Nanjing University, both in computer science. Her main research interests include computer system architecture, big data parallel processing, and Web information mining.

Yihua Huang is currently a professor in the computer science department of Nanjing University, China. He received his bachelor's, master's, and Ph.D. degrees from Nanjing University, all in computer science. His main research interests include parallel and distributed computing, big data parallel processing, and Web information mining.