Active replication vs. fusion as fault tolerance mechanisms
Abstract
This report compares two strategies for crash fault tolerance of nodes in distributed systems: active replication and fusion. To tolerate f crash faults, active replication maintains f backup servers for each primary. Fusion, however, maintains a set of f backup servers that contain the replicated data for all primaries in coded form. If n primaries each contain m data to be backed up, then, active replication requires O(nmf) space, while fusion requires only O(mf) space. These savings come at the cost of additional time during the recovery process due to additional messages and computation. For this report, we have implemented an application in which primary nodes maintain increasingly large data structures and periodically crash. Both active replication and fusion are evaluated as recovery mechanisms for the crashed nodes. The mechanisms are evaluated for performance across three metrics: backup size, time spent during updates to the backup, and recovery time. Our results validate theoretical expectations put forward in the literature – that fusion claims significant space savings at the cost of high recovery time. In the most extreme measured case, fusion costs approximately 83% of the space that replication does, while recovery time is three orders of magnitude more expensive in fusion (3.4s) than in replication (0.0037s). However, we also find that the gap between fusion and replication grows as nodes are introduced to the system. We find furthermore that fusion performs more consistently in update time due to the high variability of multicasting within active replication systems.