Browsing by Subject "Fault-tolerant computing"

Now showing 1 - 9 of 9

A unified approach to fault diagnosability of systems
(Texas Tech University, 1984-08) Narasimhan, Jagannathan
Not available
Byzantine fault-tolerance and beyond
(2006) Martin, Jean-Philippe Etienne; Alvisi, Lorenzo
Byzantine fault-tolerance techniques are useful because they tolerate arbitrary faults regardless of cause: bugs, hardware glitches, even hackers. These techniques have recently gained popularity after it was shown that they could be made practical. Most of the dissertation builds on Byzantine fault-tolerance (BFT) and extends it with new results for Byzantine fault-tolerance for both quorum systems and state machine replication. Our contributions include proving new lower bounds, finding new protocols that meet these bounds, and providing new functionality at lower cost through a new architecture for state machine replication. The second part of the dissertation goes beyond Byzantine fault-tolerance. We show that BFT techniques are not sufficient for networks that span multiple administrative domains, propose the new BAR model to describe these environments, and show how to build BAR-Tolerant protocols through our example of a BAR-Tolerant terminating reliable broadcast protocol.
Detecting and tolerating faults in distributed systems
(2008-12) Ogale, Vinit Arun, 1979-; Garg, Vijay K. (Vijay Kumar), 1963-
This dissertation presents techniques for detecting and tolerating faults in distributed systems. Detecting faults in distributed or parallel systems is often very difficult. We look at the problem of determining if a property or assertion was true in the computation. We formally define a logic called BTL that can be used to define such properties. Our logic takes temporal properties in consideration as these are often necessary for expressing conditions like safety violations and deadlocks. We introduce the idea of a basis of a computation with respect to a property. A basis is a compact and exact representation of the states of the computation where the property was true. We exploit the lattice structure of the computation and the structure of different types of properties and avoid brute force approaches. We have shown that it is possible to efficiently detect all properties that can be expressed by using nested negations, disjunctions, conjunctions and the temporal operators possibly and always. Our algorithm is polynomial in the number of processes and events in the system, though it is exponential in the size of the property. After faults are detected, it is necessary to act on them and, whenever possible, continue operation with minimal impact. This dissertation also deals with designing systems that can recover from faults. We look at techniques for tolerating faults in data and the state of the program. Particularly, we look at the problem where multiple servers have different data and program state and all of these need to be backed up to tolerate failures. Most current approaches to this problem involve some sort of replication. Other approaches based on erasure coding have high computational and communication overheads. We introduce the idea of fusible data structures to back up data. This approach relies on the inherent structure of the data to determine techniques for combining multiple such structures on different servers into a single backup data structure. We show that most commonly used data structures like arrays, lists, stacks, queues, and so on are fusible and present algorithms for this. This approach requires less space than replication without increasing the time complexities for any updates. In case of failures, data from the back up and other non-failed servers is required to recover. To maintain program state in case of failures, we assume that programs can be represented by deterministic finite state machines. Though this approach may not yet be practical for large programs it is very useful for small concurrent programs like sensor networks or finite state machines in hardware designs. We present the theory of fusion of state machines. Given a set of such machines, we present a polynomial time algorithm to compute another set of machines which can tolerate the required number of faults in the system.
Efficient fault tolerance for pipelined structures and its application to superscalar and dataflow machines
(2008-12) Mizan, Elias, 1976-; Chiou, Derek
Silicon reliability has reemerged as a very important problem in digital system design. As voltage and device dimensions shrink, combinational logic is becoming sensitive to temporary errors caused by single event upsets, transistor and interconnect aging and circuit variability. In particular, computational functional units are very challenging to protect because current redundant execution techniques have a high power and area overhead, cannot guarantee detection of some errors and cause a substantial performance degradation. As traditional worst-case design rules that guarantee error avoidance become too conservative to be practical, new microarchitectures need to be investigated to address this problem. To this end, this dissertation introduces Self-Imposed Temporal Redundancy (SITR), a speculative microarchitectural temporal redundancy technique suitable for pipelined computational functional units. SITR is able to detect most temporary errors, is area and energy-efficient and can be easily incorporated in an out-of-order microprocessor. SITR can also be used as a throttling mechanism against thermal viruses and, in some cases, allows designers to design very aggressive bypass networks capable of achieving high instruction throughput, by tolerating timing violations. To address the performance degradation caused by redundant execution, this dissertation proposes using a tiled-data ow model of computation because it enables the design of scalable, resource-rich computational substrates. Starting with the WaveScalar tiled-data flow architecture, we enhance the reliability of its datapath, including computational logic, interconnection network and storage structures. Computations are performed speculatively using SITR while traditional information redundancy techniques are used to protect data transmission and storage. Once a value has been verified, confirmation messages are transmitted to consumer instructions. Upon error detection, nullification messages are sent to the instructions affected by the error. Our experimental results demonstrate that the slowdown due to redundant computation and error recovery on the tiled-data flow machine is consistently smaller than on a superscalar von Neumann architecture. However, the number of additional messages required to support SITR execution is substantial, increasing power consumption. To reduce this overhead without significantly affecting performance, we introduce wave-based speculation, a mechanism targeted for data flow architectures that enables speculation only when it is likely to benefit performance.
Low-cost assertion-based fault tolerance in hardware and software
(2008-12) Vemu, Ramtilak, 1981-; Abraham, Jacob A.
In the recent past, there has been an increasing demand for low-cost safety critical applications. Custom-off-the-shelf (COTS) processors are preferred for usage in these applications due to their low cost. The reliability provided by these processors, however, is not sufficient to meet the safety requirements of these applications. Furthermore, due to the trends followed by the processor industry to enhance the performance of processors, the reliability provided by these processors is projected to decrease in the future. Traditional techniques for enhancing the reliability of computer systems are not viable for these applications due to the high overheads (and hence cost) incurred by them. This thesis describes fault tolerance techniques tailored for these applications, adhering to the tight overhead constraints in the area, memory, and performance dimensions. Techniques at both the hardware level (to be used by the processor manufacturers) and the software level (to be used by the application developers) are presented. At the hardware level, this thesis presents a technique for detecting faults in the processor control logic, for which techniques proposed previously incur very high overheads. Rather than detect all modeled faults, the technique protects against a subset of faults such that the best possible overall protection is achieved while incurring only permissible overheads. This subset of faults is selected depending on the probability of each individual fault causing damage to the architectural state of the processor and the overhead incurred in protecting against the fault. The technique is validated on control logic modules of an industrial processor. At the software level, this thesis concentrates on a category of errors called control flow errors. We describe an error detection technique which incurs lower overheads than any of the previously proposed techniques while at the same time detecting more errors than all of them. Even these low overheads may be too restrictive for some applications. For such applications, we present a technique for providing the best error detection capability possible at the overheads allowed. Once an error is detected, error recovery actions need to be performed. In this thesis, we present an error correction technique which automatically performs error recovery with a very low latency. The technique reuses the information available from the error detection technique to perform the recovery and hence, does not incur any additional performance penalty. All the techniques proposed at the software level have been integrated with GCC, a commonly used software compiler. This permits the fault tolerance to be incorporated into the application automatically as part of the compilation process itself. Evaluations are performed on SPEC and MiBench benchmark programs using an in-house software error injection framework.
Reconfiguration strategies in multiprocessor systems
(Texas Tech University, 1984-05) Ratheal, Steve Weldon
Not available
Reliability characteristics of neural networks having faulty interconnections
(Texas Tech University, 1991-08) Chung, Pau-choo
Not available
Robust multithreaded applications
(2008-05) Napper, Jeffrey Michael; Alvisi, Lorenzo
This thesis discusses techniques for improving the fault tolerance of multithreaded applications. We consider the impact on fault tolerance methods of sharing address space and resources. We develop techniques in two broad categories: conservative multithreaded fault-tolerance (C-MTFT), which recovers an entire application on the failure of a single thread, and optimistic multithreaded fault-tolerance (OMTFT), which recovers threads independently as necessary. In the latter category, we provide a novel approach to recover hung threads while improving recovery time by managing access to shared resources so that hung threads can be restarted while other threads continue execution.
xBFT : Byzantine fault tolerance with high performance, low cost, and aggressive fault isolation
(2008-05) Kotla, Ramakrishna Rao, 1976-; Dahlin, Michael
We are increasingly relying on online services to store, access, share, and disseminate critical information from anywhere and at all times. Such services include email, digital storage, photos, video, health and financial services, etc. With increasing evidence of non-fail-stop failures in practical systems, Byzantine fault tolerant state machine replication technique is becoming increasingly attractive for building highlyreliable services in order to tolerate such failures. However, existing Byzantine fault tolerant techniques fall short of providing high availability, high performance, and long-term data durability guarantees with competitive replication cost. In this dissertation, we present BFT replication techniques that facilitate the design and implementation of such highly-reliable services by providing high availability, high performance and high durability with competitive replication cost (hardware, software, network, management). First, we propose CBASE, a BFT state machine replication architecture that leverages application-level parallelism to improve throughput of the replicated system by identifying and executing independent requests concurrently. Traditional state machine replication based Byzantine fault tolerant (BFT) techniques provide high availability and security but fail to provide high throughput. This limitation stems from the fundamental assumption of generalized state machine replication techniques that all replicas execute requests sequentially in the same total order to ensure consistency across replicas. Our architecture thus provides a general way to exploit application parallelism in order to provide high throughput without compromising correctness. Second, we present Zyzzyva, an efficient BFT agreement protocol that uses speculation to significantly reduce the performance overhead and replication cost of BFT state machine replication. In Zyzzyva, replicas respond to a client’s request without first running an expensive three-phase commit protocol to reach agreement on the order in which the request must be processed. Instead, they optimistically adopt the order proposed by the primary and respond immediately to the client. Replicas can thus become temporarily inconsistent with one another, but clients detect inconsistencies, help correct replicas converge on a single total ordering of requests, and only rely on responses that are consistent with this total order. This approach allows Zyzzyva to reduce replication overheads to near their theoretical minima. Third, we design and implement SafeStore, a distributed storage system designed to maintain long-term data durability despite conventional hardware and software faults, environmental disruptions, and administrative failures caused by human error or malice. The architecture of SafeStore is based on fault isolation, which SafeStore applies aggressively along administrative, physical, and temporal dimensions by spreading data across autonomous storage service providers (SSPs). SafeStore also performs an efficient end-to-end audit of SSPs to detect data loss quickly and improve data durability by reducing MTTR. SafeStore offers durable storage with cost, performance, and availability competitive with traditional storage systems. We evaluate these techniques by implementing BFT replication libraries and further demonstrate the practicality of these approaches by implementing an NFS based replicated file system(CBASE-FS) and a durable storage system (SafeStore-FS).

Browsing by Subject "Fault-tolerant computing"

Results Per Page

Sort Options