- People err. That is a fact of life. People are not precision machinery designed for accuracy.
- In fact, we humans are a different kind of device entirely. Creativity, adaptability, and
- flexibility are our strengths. Continual alertness and precision in action or memory are our
- weaknesses. We are amazingly error tolerant, even when physically damaged. We are extremely
- flexible, robust, creative, and superb at finding explanations and meanings from partial and
- noisy evidence. The same properties that lead to such robustness and creativity also produce
- errors. The natural tendency to interpret partial information -- although often our prime virtue
- -- can cause operators to misinterpret system behavior in such a plausible way that the
- misinterpretation can be difficult to discover. - Donald A. Norman
Robustness in the context of software systems and applications is defined as the degree to which a system or component can still function in the presence of perturbations: faults, failures or adverse conditions. It is associated with the resilience of a system and the ability to maintain function despite adverse (worst-case) conditions and unfavorable changes in internal structure or external environment. A robust system is "perturbation-resistant". A system which can still function in the presence of faults is called fault tolerant. Robustness is the counterpart of fragility and brittleness: it is a measure of how sensitive a particular system is to changes and disturbances. Reliability, robustness and fault tolerance can be achieved by redundancy and replication. In his paper The Ontology of Complex Systems, William Wimsatt gives the following definition of robustness:
- Things are robust if they are accessible (detectable, measurable, derivable, definable, producible, or the like) in a variety of independent ways.
Basically, a system is robust if a given operation or process can be achieved in many independent ways. For example, a node can be reached in different ways if more than one path leads to it, and a certain node type can be accessed in different ways if there are several redundant instances of it. The same idea applies at three levels:
- diversity of links: reach a node on different paths
- diversity of nodes: access different redundant instances of a node
- diversity of design/versions: reach a computational goal in different ways
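As a rough illustration of the first kind of diversity in the list above (diversity of links), the following Python sketch checks whether a target node stays reachable when any single link fails. The function names, the tiny example networks and the representation of links as pairs of node names are assumptions made for this example only; they do not come from a particular library.

```python
from collections import deque

def reachable(links, source, target, failed=frozenset()):
    """Return True if target can be reached from source without using failed links."""
    # Build an undirected adjacency map from the links that are still working.
    neighbours = {}
    for a, b in links:
        if (a, b) in failed or (b, a) in failed:
            continue
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    # Breadth-first search from the source.
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def survives_any_single_link_failure(links, source, target):
    """True if no single link failure can disconnect source from target."""
    return all(reachable(links, source, target, failed={link}) for link in links)

# Hypothetical example networks: a chain has one path, a ring has two.
chain = [("A", "B"), ("B", "C")]
ring = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]

if __name__ == "__main__":
    print(survives_any_single_link_failure(chain, "A", "C"))
    print(survives_any_single_link_failure(ring, "A", "C"))
```

In the chain every link is a single point of failure, while in the ring there is always a second path around the failed link. The same test could be applied to redundant node instances by removing nodes instead of links.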
In any case, diversity and redundancy are the key. There are in general two ways to increase robustness in distributed systems: first, redundancy of components, elements and nodes, and second, redundancy of paths, links and channels between them. The former can be found in the replication of nodes. The latter can be observed in the Internet: if one route is blocked, another is taken. If there is always another way to reach the goal, then the failure of a single node or link does not affect the function of the system. A third method to increase robustness and fault tolerance is 'design diversity', which protects the system against design faults. It uses multiple functionally equivalent but diverse program versions, all based on the same specification, to ensure safety in critical applications and to protect the most vital subsystems of complex systems and networks. This approach is also known as N-Version Programming (NVP) and was first proposed by Algirdas Avizienis in 1977.
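The design-diversity idea can be sketched in a few lines of Python. The example below is a minimal illustration, not a faithful reproduction of any published NVP framework: the voter `n_version_execute`, the strict-majority rule and the three toy versions of "sum the integers from 1 to n" are all assumptions chosen for this sketch. One version deliberately contains a design fault, and the voter masks it.

```python
from collections import Counter

def n_version_execute(versions, *args):
    """Run all versions on the same input and return the strict-majority result.

    Results must be hashable so that they can be counted as votes.
    """
    results = []
    for version in versions:
        try:
            results.append(version(*args))
        except Exception:
            # A version that crashes simply casts no vote.
            continue
    if not results:
        raise RuntimeError("all versions failed")
    value, votes = Counter(results).most_common(1)[0]
    # Require agreement of more than half of all versions, not just of the survivors.
    if votes * 2 <= len(versions):
        raise RuntimeError("no majority among versions")
    return value

# Three hypothetical, independently written versions of the same specification.
def sum_loop(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_formula(n):
    return n * (n + 1) // 2

def sum_faulty(n):
    # Deliberate design fault for the example: off by one, n itself is left out.
    return sum(range(n))

if __name__ == "__main__":
    print(n_version_execute([sum_loop, sum_formula, sum_faulty], 10))
```

With two correct versions out of three, the faulty result is outvoted. If the versions disagree so much that no result has a strict majority, the voter raises an error instead of guessing, which is one possible design choice among several.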