Fault tolerance is the ability of a system to continue normal operation despite unexpected hardware or software failure, faults or outages. It is the ability to cope with unexpected faults, failures and crashes. In software, errors and faults arise often from untested and extreme situations. You can distinguish between faults, errors and failures: a fault can cause an error, and an error can lead to a failure. Complete fault avoidance and exhaustive testing are often not possible. The goal is therefore "to construct a system in such a way that it can automatically recover from partial and minor failures without seriously affecting the overall performance" [Tan02]. It is an important subject in distributed systems, because partial failures (when one or more components crash) are common to them.
Fault tolerance has been important since the inception of computers, as Gray and Reuter notice. In the early days of computing, fault tolerance was important because the vacuum tubes as fundamental computer parts were very unreliable. Von Neumann's early work was done in this time. Fault-tolerance was always an important topic for telephone companies, for the military, and the space missions. It is interesting that IBM is the company that emphasizes the vision of autonomic computing and self-management most strongly, although the problems of automated operations and fault-tolerance have been examined at IBM itself for many decades. IBM was a pioneer of fault-tolerance (the Apollo Guidance computer, the Shuttle computer, etc.).
Availability, Reliability and Performance
There are many things that could go wrong in a large system. Distributed systems are often quite large, and the larger the system, the bigger is the probability for faults and failures. In very large distributed systems, faults and failures are no longer the exception, they are the norm. Fault-Tolerance determines together with fault prevention and fault avoidance mechanisms the degree of reliability (deals with continuity of service) and availability (deals with readiness for usage), since unreliability and unavailability are a consequence of faults. In order to build reliable systems, one has to prevent, anticipate and avoid as many faults as possible, and tolerate the rest, the faults that still remain in the system.
Even if the system tolerates disturbances and faults, it can suffer a performance loss. A few faults can lead to degraded performance, and many faults can result in unacceptable performance. Through recovery actions the system can be brought back into regions of acceptable performance. In general there are a number of reasons why a system is not fault tolerant: even if is is fault tolerant in principle, it may fail to anticipate a fault, fail to perceive a fault once it has happened, fail to attempt to solve it after it has been perceived, or fail to succeed in attempts to solve it. These four reasons correspond roughly to the four fault removal and recovery phases.
Fault Removal and Recovery Phases
The four phases of fault removal and recovery are: find/detect it, judge/isolate it, remove/fix it, and finally prevent it.
- detection and diagnosis: error detection and diagnostic checks
- confinement and impact analysis: damage confinement and assesment
- recovery and repair: error recovery, fault treatment, and repair
- return and evaluation: resuming normal service, evaluation and analysis
The most difficult faults are certainly distributed and concurrent bugs, which arise for example through race conditions. They are difficult to find, difficult to isolate and difficult to fix.
Detection and Failure Detectors
The first step for all fault tolerance strategies is detection by checking the system state. In the case of distributed systems, faults and failures are often unavoidable, in the processors or nodes and in the communication links between them as well. In this case, fault-tolerance can be increased by detecting faults early, for example by timeout checks, or by making the system less distributed, for instance through increasing synchrony. In agreement/consensus problems for example, two methods are common:
- Synchrony: Provides the systems with a limitation for the delays in message delivery. Arbitrary slow messages must be avoided, because processes sending these messages are indistinguishable from crashed processes.
- Failure Detectors: Provide the process with a list of suspect processes. There is no straightforward intuitive implementation of failure detectors. They are comparable to detectives and oracles, because they must detect a failure like a detective and predict a failure like an orcale.
Chandra and Toueg introduced the concept of a failure detector in 1991 with their paper "Unreliable failure detectors for reliable distributed systems". Failure detectors can be classified according to their accuracy and completeness into different classes. The completeness property is a measure for the number of total detection of failures, while the property accuracy restricts the mistakes a failure detector can make.
Confinement and Loose Coupling
After we have detected the error or fault, it is important to judge it. How severe is the fault ? How far can it propagate through the system or has already affected the system ? How strongly is it correlated to other faults ? Amazon.com for instance distinguishes between two major failure types, Severity 1 and Severity 2. Severity 1 problems are rare, they affect customers directly and need to be resolved immediately. Severity 2 problems are much more frequent. They typically affect only single components, computers or services and do not immediately affect customers or the behavior of the website, but could turn into Severity 1 problems if they are not resolved quickly.
The stronger the coupling between the components and elements in a system, the larger is the effect of a fault in a single component on the whole system. Strong dependencies or a whole web of dependencies between components (as in many object oriented systems) can increase the probabilities for failures in a system. A large distributed system has typically many components, sometimes linked by chains of dependencies. If some component in a distributed system crashes (Failures in a large system are inevitable) then other system components that depend on it may freeze or crash too, and chains of dependencies gradually can cause more and more of the overall system to fail or freeze. Already Lamport noticed "A distributed system is one in which the failure of a machine you have never heard of can cause your own machine to become unusable". In a strongly coupled system a fault can propagate much wider than in a loose coupled systems. Loose coupling can reduces the risk that a change in one component or module will force a change in another component. One way to achieve loose coupling is communication via queued messaging and standardized interfaces, for example through the use of Web Services and message oriented middleware (MOM). Developers can decouple components of their application with messages and queues so that they run independently. Perhaps the highest degree of loose coupling can be achieved by server virtualization in form of virtual machines.
Recovery and repair
A system can recover from a fault if the degree of redundancy is high enough (by reconfiguring the system and substitution of the faulty component by a healthy, redundant component), of if the individual elements can be restored or restarted.
Similar to robustness, the key to achieve fault tolerance is "the ability of a system to develop and carry out alternate ways of achieving its objectives in the face of failures and other unexpected difficulties." Normal software programs have only way of doing something, the way the programmer has precisely specified, and each component has only a single purpose or function in a well-defined context.
As Russell Abbott says (ACM Comp. Surveys 1990), for fault tolerance and robustness "there must be multiple ways of achieving the same end result". If this condition is fulfilled, then it is possible for a fault-tolerant system "to achieve the same end results in a number of different ways". If one component fails, another can take it's place, if one path is blocked, another can be taken.
Design of Fault Tolerant Systems
Jim Gray and Andreas Reuter write in their book "Transaction Processing" (1993): "Don't be fooled by the many books on complexity or by the many complex and arcane algorithms you find in this book and elsewhere. Although there are no textbooks on simplicity, simple systems work and complex one don't."
Rejuvenation and Renewal
Rejuvenation and Renewal are common natural principles against decay. Rejuvenation by continous reboot and restart is one possibility to prevent faults. Recovery-oriented computing emphasizes the importance of microreboots, and even the Apache Web Server restarts processes periodically: "[...] consider the Apache web server. This is likely the most widely used web server software in the world and, despite the fact that many people customize it by adding their own code to it, it is also extremely reliable. The source of this reliability is largely due to a central framework that isolates the work to be done into separate processes and kills off and restarts any process that appears to have problems. In fact, Apache kills off and restarts processes periodically even if they don’t exhibit problems — which deals with creeping bugs, such as memory leaks." ( from a DDJ Review by Ron Burk, see ).
Partitioning and Integration
Modularity and partitioning of a system into subsystems with minimal interconnection requirements and loose coupling can help to isolate faults. It is essential to incorporate fault-tolerance in large distributed systems. Partitioning defines potentially replaceable redundant units and their error containment boundaries. Every subsystem should in turn be characterized by its own error-containment requirements, fault detection methods and recovery mechanisms.
Replication and Redundancy
The general approach to fault tolerance is prevention through suitable exception, error and fault handling (preventive measures through expectation of faults) and redundancy (corrective measures while or after the occurence of faults). Fundamental mechanisms for making a distributed system fault-tolerant are certainly redundancy and replication. Fault-tolerance is achieved through redundancy and failover, and redundancy in turn is achieved through replication of objects. One can distinguish between active and passive replication. Tasks and requests can be directed to one of replicated components (passive recplication), or to all of them at once in parallel (active replication). In the former case, one component is the primary one and the additional redundant components are secondary backups needed only when things go wrong, whereas in the latter case, the correct result is usually the result of a voting, election or quorum
- Replication and Redundancy: Providing multiple instances of the same system, use them all at once (active replication) or switch to one of the remaining instances in case of a failure (passive replication with fall-back or backup)
- Self-stabilization: Building systems to converge towards an error-free state automatically
- Recovery-Oriented Computing: Building systems which expect failures and can prevent and remove them through local or global restart
Fault Tolerance in Organic Systems
Redundancy, robustness and fault-tolerance are a inherent function of many natural and organic systems. Therefore they can serve as an inspiration for the design of artificial fault-tolerant systems. Already John von Neumann (1903-1957) reasoned about fault tolerance in organic systems. He said "It's very likely that on the basis of the philosophy that every error has to be caught, explained, and corrected, a system of the complexity of the living organism would not run for a millisecond."
Wondering how the fault tolerance in living organisms is achieved, he reasoned that "The system is sufficiently flexible and well organized that as soon as an error shows up in any part of it, the system automatically senses whether this error matters or not. If it doesn't matter, the system continues to operate without paying any attention to it. If the error seems to the system to be important, the system blocks that region out, by-passes it, and proceeds along other channels. [...] The duration of operability is determined by the time it takes until so many incurable errors have occurred, so many alterations and permanent by-passes have been made, that finally the operability is really impaired."
Finally he concludes about the architecture of living organisms that "The fact that natural organisms have such a radically different attitude about errors and behave so differently when an error occurs is probably connected with some other traits of natural organisms [...] The ability of a natural organism to survive in spite of a high incidence of error probably requires a very high flexibility and ability of the automaton to watch itself and reorganize itself." (all quotes from John von Neumann, "Theory and Organization of Complicated Automata", 4th Lecture "The Role of High and Extremely High Complication" in John von Neumann on Computing and Computer Theory, Vol. 12 in the Charles Babbage Institute, Reprint Series for the History of Computing Edited by William Aspray and Arthur Burks, The MIT Press, 1987).
The brain is an optimal example of fault tolerance: we have massive inherent redundancy and continuous rejuvenation through sleep. During a whole human life, the brain normally never stops working completely, even after massive failures of components (although it might be impaired). And yet humans are so well-known for making errors, that a proverb exists: "to err is human". Humans make constantly faults in reasoning, but a single fault in operating does not affect the system. Von-Neumann computers are quite the contrary: they never make faults in reasoning, but usually crash after a single fault. Is it possible that an architecture with the highest possible fault tolerance leads inevitable to errors and faults ? This would be quite paradox.
- Algirdas Avizienis, Toward Systematic Design of Fault-Tolerant Systems, IEEE Computer April (1997) 51-58
- Russell Abbott. Resourceful Systems for Fault Tolerance, Reliability, and Safety, ACM Computing Surveys, Vol. 22, No. 1, March 1990
- [Tan02] Andrew S. Tanenbaum and Maarten van Steen, Distributed Systems, Prentice Hall, 2002, ISBN 0-13-088893-1
- T.D. Chandra and S. Toueg, Unreliable failure detectors for reliable distributed systems, Journal of the ACM 43(1996) 225-267
- P.A. Lee and T. Anderson, Fault Tolerance - Principles and Practice, Springer, 1990, ISBN 3-211-82077-9
- Jim Gray and Andreas Reuter, Transaction Processing, Morgan Kauffmann Publishers, Inc., 1993 (Chapter 3: The Basics of Fault Tolerance)
ACM Queue vol. 2, no. 8 November (2004) Special Issue on Error Recovery