Replicating multithreaded services



Journal Title

Journal ISSN

Volume Title



For the last 40 years, the systems community has invested a lot of effort in designing techniques for building fault tolerant distributed systems and services. This effort has produced a massive list of results: the literature describes how to design replication protocols that tolerate a wide range of failures (from simple crashes to malicious "Byzantine" failures) in a wide range of settings (e.g. synchronous or asynchronous communication, with or without stable storage), optimizing various metrics (e.g. number of messages, latency, throughput). These techniques have their roots in ideas, such as the abstraction of State Machine Replication and the Paxos protocol, that were conceived when computing was very different than it is today: computers had a single core; all processing was done using a single thread of control, handling requests sequentially; and a collection of 20 nodes was considered a large distributed system. In the last decade, however, computing has gone through some major paradigm shifts, with the advent of multicore architectures and large cloud infrastructures. This dissertation explains how these profound changes impact the practical usefulness of traditional fault tolerant techniques and proposes new ways to architect these solutions to fit the new paradigms.