Fusion-based Hadoop MapReduce job for fault tolerance in distributed systems

Ho, Iat-Kei

Date

2013-05

Abstract

The standard recovery solution for a failed task in Hadoop is to execute the task again; after a configured number of retries, the task is marked as failed. With a significant amount of data and complicated Map and Reduce functions, recovering corrupted or unfinished data from a failed job can be more efficient than re-executing the same job. This paper extends [1] by applying the fusion-based technique [7][8] to Hadoop MapReduce task execution to enhance its fault tolerance. Multiple data sets are processed by Hadoop MapReduce with and without fusion under various predefined failure scenarios for comparison. As the complexity of the Map and Reduce functions increases relative to that of the Recover function, fusion becomes more efficient, and users can tolerate faults while incurring less than ten percent extra execution time.
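The idea of recovering a lost result instead of re-executing the task can be illustrated with a minimal sketch. It assumes a simple sum-based fusion over word counts, which is chosen here only for illustration and may differ from the encoding used in [7][8]; the function names `word_count`, `fuse`, and `recover` are hypothetical and are not part of Hadoop.

```python
from collections import Counter


def word_count(chunk):
    """Stand-in for one Map+Reduce pass: count the words in one input split."""
    return Counter(chunk.split())


def fuse(partials):
    """Build one fused backup by summing the per-task counts key by key.

    Assumption: fusion is a plain sum, so a single backup covers one failure.
    """
    fused = Counter()
    for partial in partials:
        fused.update(partial)
    return fused


def recover(fused, survivors):
    """Rebuild the single missing partial result: fused minus all survivors."""
    missing = Counter(fused)
    for partial in survivors:
        missing.subtract(partial)
    # Drop keys whose counts cancelled to zero.
    return Counter({key: n for key, n in missing.items() if n != 0})


splits = ["a b a", "b c", "a c c"]
partials = [word_count(s) for s in splits]
fused = fuse(partials)

# Simulate losing the output of task 1, then rebuild it without re-running it.
survivors = partials[:1] + partials[2:]
recovered = recover(fused, survivors)
assert recovered == partials[1]
```

In this sketch the recovery step is only a subtraction over the surviving outputs, so its cost does not depend on how expensive the original Map and Reduce functions were. The price is the extra work of building the fused backup during normal execution.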