Redundancy

Redundancy is related to duplication and replication and refers in general to the quality or state of being redundant, i.e. having more components than are needed to perform a function. The term can carry a negative connotation (superfluous) but also a positive one: a duplicate that prevents the failure of an entire system. Redundancy is a common way to achieve high levels of fault tolerance in a computer system. If one subsystem, element or component fails, a redundant replacement is ready to begin operating.

TMR and NMR

In engineering, redundancy simply means the duplication of critical components of a system with the intention of increasing reliability. In safety-critical systems, such as fly-by-wire aircraft, some parts of the control system may be triplicated. An error in one component may then be out-voted by the other two. A triply redundant system has three subcomponents; if any one of them suffices to keep the system running, all three must fail before the system fails. Since each one rarely fails, and the subcomponents are expected to fail independently, the probability of all three failing is calculated to be extremely small.
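As a rough illustration of this calculation, the following Python sketch multiplies the individual failure probabilities. The function name `all_fail_probability` and the per-component failure probability of 0.001 are assumptions chosen for illustration, not figures from any real system, and the result only holds if the failures really are independent.

```python
def all_fail_probability(p: float, n: int = 3) -> float:
    """Probability that all n components fail, assuming each fails
    independently with the same probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    return p ** n


# Assumed, purely illustrative failure probability of one component.
single = 1e-3
triple = all_fail_probability(single, n=3)

print(f"one component fails:  {single:.1e}")
print(f"all three fail:       {triple:.1e}")
```

With these assumed numbers the probability of losing all three components is about one in a billion, compared with one in a thousand for a single component. Common-cause failures (shared power supply, shared software bug) violate the independence assumption and make the real figure larger.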

This common form of redundancy is also known as triple modular redundancy (TMR): the threefold replication of a component to compensate for and correct the failure of a single component. The primary flight computer of the Boeing 777, for example, uses triple modular redundancy [1]. Triplex or triple redundancy was originally envisaged by John von Neumann. Having three units offers an essential advantage: three redundant components are enough to reach an unambiguous decision in a majority vote, as long as at least two components still work correctly. Systems with dual redundancy can detect a disagreement but in most cases cannot resolve it, because neither of the two components can out-vote the other; they can end up in a continuous loop of chatter about which one is more correct (as in some marriages).

N modular redundancy (NMR) is a generalization of TMR. A system of N = 2n + 1 redundant elements can mask or tolerate n faulty elements if the elements act as voters and make a majority decision. The basic structure of TMR and NMR is very simple: in TMR (NMR) there are three (N) units whose outputs are fed to a voter.
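The voter can be sketched in a few lines of Python. This is a minimal illustration of majority voting, not the design of any particular flight computer; the function name `majority_vote` and the convention of returning `None` when no strict majority exists are assumptions made for this example.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of the modules,
    or None if no value is reported by more than half of them."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority needs more than half of all votes.
    return value if count > len(outputs) // 2 else None


# TMR: one faulty module is out-voted by the other two.
print(majority_vote([42, 42, 17]))
# Dual redundancy: a disagreement is detected but cannot be resolved.
print(majority_vote([42, 17]))
# NMR with N = 5 (n = 2): two faulty modules are masked.
print(majority_vote([42, 17, 42, 99, 42]))
```

The second call shows why two units are not enough: the voter can see that the outputs differ, but it has no basis for choosing between them.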

In the Space Shuttle computer control system [2], four redundant computers are used to achieve reliability during flight-critical phases of a mission. Flight-critical or mission-critical phases are at the beginning (launch/ascent) and the end (entry/landing), periods where a loss of the system might mean loss of the vehicle. Before launch and during the on-orbit phase, the degree of active replication is reduced and different computers run different applications. During flight-critical phases, the four redundant computers are synchronized at the application level and provide bit-for-bit identical output. The system is designed to cope with two successive failures. If a computer becomes defective, it is overruled by the other three and ignored from then on. If another computer fails, it is overruled by the remaining two. A fifth, independently programmed computer can perform critical functions if all four computers fail. It can only be engaged manually by crew action.
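The way such a system copes with two successive failures can be sketched as a vote followed by the exclusion of any computer that disagrees with the majority. The code below is not the Shuttle's actual software; the function `vote_and_exclude`, the computer names `gpc1` to `gpc4` and the output values are hypothetical and serve only to illustrate the logic described above.

```python
from collections import Counter
from typing import Dict, Optional, Set, Tuple


def vote_and_exclude(
    outputs: Dict[str, int], active: Set[str]
) -> Tuple[Optional[int], Set[str]]:
    """Vote among the active computers and drop those that disagree.

    Returns the majority value and the set of computers that remain
    active. If there is no strict majority, returns (None, active)
    unchanged, leaving the decision to a backup or the crew.
    """
    values = [outputs[name] for name in active]
    value, count = Counter(values).most_common(1)[0]
    if count <= len(values) // 2:
        return None, active
    still_active = {name for name in active if outputs[name] == value}
    return value, still_active


active = {"gpc1", "gpc2", "gpc3", "gpc4"}

# First failure: gpc3 is overruled by the other three and excluded.
value, active = vote_and_exclude(
    {"gpc1": 10, "gpc2": 10, "gpc3": 99, "gpc4": 10}, active
)
print(value, sorted(active))

# Second failure: gpc4 is overruled by the remaining two.
value, active = vote_and_exclude(
    {"gpc1": 20, "gpc2": 20, "gpc3": 0, "gpc4": 55}, active
)
print(value, sorted(active))
```

After the second round only two computers remain. If they disagreed with each other, no majority would exist, which is the situation in which an independent backup becomes relevant.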

Hybrid Redundancy

Hybrid redundancy offers the highest reliability of the schemes described here. It is a combination of NMR (for error masking) and spare switching (for fault prevention and rejuvenation). An NMR system masks permanent and intermittent failures, but for very long operation or mission times its reliability drops below that of a single module. Hybrid redundancy overcomes this by adding spare modules that renew the system by replacing failed active modules. A hybrid NMR system with spares consists of a core of N active processors (the NMR core) and M spares.
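The drop in reliability for long mission times can be made concrete for the TMR case. Assuming a perfect voter and three identical modules that fail independently, each with reliability R, the system works if at least two modules work, which gives R_TMR = 3R^2 - 2R^3. This is lower than R whenever R < 0.5. The Python sketch below additionally assumes a constant failure rate, so that R(t) = exp(-rate * t); the function names and the failure rate of 0.001 per hour are illustrative assumptions.

```python
import math


def module_reliability(t: float, failure_rate: float) -> float:
    """Reliability of one module at time t, assuming a constant
    failure rate (exponential lifetime model)."""
    return math.exp(-failure_rate * t)


def tmr_reliability(r: float) -> float:
    """Reliability of a TMR system with a perfect voter: at least
    two of three independent modules of reliability r must work."""
    return 3 * r**2 - 2 * r**3


def crossover_time(failure_rate: float) -> float:
    """Time at which TMR stops being better than a single module,
    i.e. when the module reliability has dropped to 0.5."""
    return math.log(2) / failure_rate


rate = 1e-3  # assumed failures per hour, purely illustrative
for t in (100, 500, 1000, 2000):
    r = module_reliability(t, rate)
    print(f"t={t:5d} h  single={r:.3f}  TMR={tmr_reliability(r):.3f}")
print(f"crossover at about {crossover_time(rate):.0f} h")
```

Under these assumptions TMR is more reliable than a single module for short missions and less reliable for long ones, which is the motivation for replacing failed modules with spares.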

There are other forms of hybrid redundancy, for example self-purging redundancy: all units actively participate in an NMR system, and each module is able to remove itself from the system if it is faulty.

Related Concepts

In contrast to traditional software programs or machines, self-organizing systems and agent-based systems often have a high degree of redundancy. This can be observed in nature, for example, where the death of a single ant, termite, or honey bee does not affect the existence of the whole colony. Likewise, the failure of a single neuron does not affect the function of the whole brain, although this does not mean that the principle of self-organization explains how the brain works or that a brain functions in the same way as an ant colony. However, it is clear that traditional software programs have very low redundancy:

* a program does not work if arbitrary code lines are removed
* a self-organizing network/system usually still works if arbitrary nodes/agents are removed

Literature

[1] Y.C. (Bob) Yeh, Triple-triple redundant 777 primary flight computer, Proceedings of the 1996 IEEE Aerospace Applications Conference, Vol. 1, New York (1996) 293-307

[2] Alfred Spector and David Gifford, The space shuttle primary computer system, Communications of the ACM Volume 27, Issue 9 (September 1984)

[3] B. J. Flehinger, Reliability Improvement through Redundancy at Various System Levels, IBM J. Res. and Dev., vol. 2, April (1958) 148-158
