Distributed Algorithm
From CasGroup
A '''distributed algorithm''' is a decentralized algorithm that is executed in a [[Distributed System|distributed system]], on more than one machine, node or processor.

== Definition ==

A distributed algorithm for a collection P of processes is, according to Gerard Tel in his book ''Introduction to Distributed Algorithms'', simply a collection of local algorithms, one for each process in P. In his earlier work ''Topics in Distributed Algorithms'' he defined it as follows: "a distributed algorithm executes as a collection of sequential processes, all executing their part of the algorithm independently, but coordinating their activity through communication." Nancy A. Lynch argues: "Distributed algorithms are algorithms designed to run on hardware consisting of many interconnected processors. The algorithms are supposed to work correctly, even if the individual processors and communication channels operate at different speeds and even if some of the components fail."

In sequential algorithms, steps are taken in a strict, well-defined order. In distributed systems, steps are taken in a strict sequence only locally; globally, the sequence of steps depends on the transmission of messages and can be unpredictable. The order of events is not always well-defined, and failures make the situation even worse. Nearly all distributed algorithms are based on more or less sophisticated communication through message passing.
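Tel's definition can be made concrete with a minimal sketch: a collection of local algorithms, one per process, coordinating only through messages. The `Process` class and its `inbox` channel are illustrative names, not part of any source cited here.

```python
# A minimal sketch of a distributed algorithm in Tel's sense: each
# process runs its own local algorithm and coordinates with the others
# only by sending and receiving messages over a channel.
from queue import Queue

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.inbox = Queue()          # models the incoming message channel

    def send(self, other, msg):
        other.inbox.put((self.pid, msg))

    def local_step(self):
        """One step of this process's local algorithm."""
        sender, msg = self.inbox.get()   # coordination happens via messages
        return f"process {self.pid} got {msg!r} from {sender}"

p, q = Process(0), Process(1)
p.send(q, "hello")
print(q.local_step())   # process 1 got 'hello' from 0
```

In a real system each process would run on its own node and the channel would be a network link; the queue here only stands in for that link.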
Since a [[Distributed System|distributed system]] consists of many nodes, processors or processes interconnected by a message passing network, any algorithm which involves more than one node must use some form of message passing, unless there is [[Distributed Shared Memory|distributed shared memory]] or another form of [[Interprocess Communication|interprocess communication]]. The complexity analysis of distributed algorithms therefore usually involves measuring the total number of messages.

Distributed algorithms are to sequential algorithms what Einstein's physics is to Newton's physics: sequential algorithms are a special, simplified case of distributed algorithms. In distributed algorithms there is no global time or common clock, and the observer and speed may influence causality. There are more similarities: for instance, according to F. Mattern, the Lorentz transformation corresponds roughly to the rubber-band transformation, which leaves causality invariant.

== Different Forms and Types ==

=== Asynchronous and Synchronous ===

In general, one can distinguish between asynchronous and synchronous algorithms, as Nancy Lynch does in her book. In asynchronous algorithms, the nodes and processes do not act at the same time, no timing assumptions exist, and messages can have an arbitrary delay; in synchronous algorithms, the nodes act in lockstep at the same time. Asynchronous algorithms are not synchronized: their steps do not occur at predetermined or regular intervals, and messages can be delivered in any order. Totally synchronous algorithms are easy to handle, because all nodes act like a single node at the same time, but they are not practical. They are difficult to justify in real-world situations and difficult to achieve in general distributed systems, because there is no absolute global time in general, unsynchronized processes operate at different speeds, and messages often have a considerable time delay.
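The lockstep behavior of the synchronous model can be illustrated with a small single-machine simulation: in each round, all nodes send first and all nodes receive afterwards. The example, which floods the maximum value around a ring, is a sketch under the assumption of a reliable, fully synchronous network; the function name is illustrative.

```python
# A sketch of the synchronous (lockstep) model: in every round all
# nodes send, then all nodes receive and update their state, in step.
# Here a ring floods the maximum value; after n-1 rounds every node
# knows the global maximum.
def synchronous_max_ring(values):
    n = len(values)
    state = list(values)
    for _ in range(n - 1):                    # n-1 lockstep rounds
        sent = list(state)                    # all sends happen first...
        for i in range(n):                    # ...then all receives
            state[i] = max(state[i], sent[(i - 1) % n])
    return state

print(synchronous_max_ring([3, 9, 1, 7]))  # [9, 9, 9, 9]
```

Note how the simulation enforces the model's timing assumption by construction: no node can observe a round-`k` message before all round-`k` sends are complete. An asynchronous network gives no such guarantee.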
In summary, systems with pure synchrony (perfect timing) or no faults at all would be nice, but they are not realistic: nodes and links fail, and messages may have a time delay. Totally asynchronous algorithms are powerful in theory, because they are meant to work with arbitrary time delay, but they are notoriously difficult to design. They are often very hard or even impossible to construct, because they are too general (for example, a node which sends arbitrarily slow messages is indistinguishable from a node that has really failed). Some problems that are provably impossible or expensive in the fully asynchronous model can indeed be solved in practice. Therefore, elegant theoretical assumptions such as pure asynchrony (no timing assumptions whatsoever) or Byzantine faults (no assumptions limiting faulty behavior) are not practical either; they lead to pessimistic and frustrating results that are not useful for complex real-world systems. Real systems are complex and tend to fall somewhere in between the two extreme classes of full synchrony and full asynchrony, for example asynchronous systems with finite average response times or upper bounds on message delivery times.

=== Topologies and Graphs ===

Besides asynchronous and synchronous forms, one can differentiate further between algorithms for particular topologies (rings, trees, etc.). There are also a number of [[Distributed Graph Algorithm]]s, which is not surprising, because nearly all [[Distributed System|distributed systems]] based on message passing can be described by a graph (except those which use only some form of shared common memory).

=== Fundamental algorithms ===

Fundamental building blocks used in many distributed algorithms are tokens (which circulate in rings) and waves (which spread through arbitrary topologies). A token which moves through a ring can be considered a wave for a ring topology.
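The wave idea can be sketched as an "echo"-style computation: a wave spreads out from an initiator over an arbitrary topology and the echoes return along the spanning tree it builds. The sketch below compresses this into a recursive traversal on one machine; the graph and function names are illustrative.

```python
# A sketch of a wave computation: the wave spreads from an initiator
# over an arbitrary topology, visiting every node exactly once; the
# recursive returns play the role of the echoes travelling back.
def echo_wave(graph, start):
    visited = set()
    order = []

    def visit(node):                 # the wave reaches `node`...
        visited.add(node)
        order.append(node)
        for nb in graph[node]:       # ...and propagates to its neighbours
            if nb not in visited:
                visit(nb)            # the return is the "echo"

    visit(start)
    return order

g = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(echo_wave(g, 0))  # [0, 1, 3, 2]
```

An "inform all" or "query all" action corresponds to attaching the information or the query to the wave as it passes each node.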
The importance of waves is not surprising, since a wave is one of the most basic forms of [[Emergence|emergence]] in a system. If all nodes are visited sequentially, or an action like "inform all" or "query all" is required, a kind of wave must be used. Fundamental algorithms where all nodes of a network are visited are [[Total Algorithm]]s and [[Heart Beat Algorithm]]s. A common problem besides "inform all" and "query all" is "select one", "elect one", "admit one", etc. The resulting algorithms are named [[Election Algorithm|election algorithms]], where a single node or process (for example the leader) that is to play a distinguished role in a subsequent computation must be (s)elected. Further typical algorithms in this area are distributed [[mutual exclusion]] and [[deadlock detection]] algorithms, where the concurrent use of un-shareable resources must be avoided. Finally, it is a problem of its own to determine and change the "global state" or the "global order"; the associated distributed algorithms are named [[Termination Detection|termination detection]], where the end of a distributed computation has to be detected, and distributed [[garbage collection]], where unused memory and references must be released. A basic "toy" algorithm used to explain distributed algorithms in classes is the [[Distributed GCD Algorithm|distributed GCD algorithm]].

== Problems and Difficulties ==

=== Uncertainties and Failures ===

Distributed algorithms are, like [[Distributed System|distributed systems]], hard to understand and hard to design, because of their high complexity. Algorithms are the step-by-step definitions of computations: detailed instructions, rules and recipes for producing a solution to a given problem in a finite number of steps. The problem with many real [[Distributed System|distributed systems]] is that every node and every link can fail at any time, and messages can get lost or arrive with an arbitrary time delay. One cannot say for sure what will happen in the next step.
One can specify the behavior for each node, but the overall global behavior which results from the local interactions is often hard to predict. The accidental or intended [[Emergence|emergence]] of a desirable behavior is more the exception than the rule. Analysis, design, verification and correctness proofs of distributed algorithms are difficult issues. Among the different types of uncertainties and difficulties are, for example (according to Nancy Lynch, 1996):

# processor and link failures (node or message loss)
# uncertain message delivery (arbitrary transmission time)
# unknown message ordering
# unknown network topologies
# unknown number of processors

Some problems which are characteristic of and unique to distributed algorithms are:

# [[Race Condition|race conditions]] (where the result depends on the timing of events)
# deadlock detection, esp. phantom- or pseudo-deadlocks
# termination detection

Detecting whether a centralized, serial or non-distributed algorithm has terminated is trivial, since there is only one processor, one clock and one well-defined state or time. Detecting whether a distributed algorithm has terminated is a problem of its own, since there is no global state or time in a general [[Distributed System|distributed system]]. Another problem which occurs in distributed algorithms but not in serial ones is deadlock: mutual blocking of processes, where each process is waiting for a resource one of the other processes holds. Obviously it does not occur in serial algorithms for one processor, and it was first encountered in implementing operating systems.

=== Problem Fields ===

Termination detection is difficult, because a distributed system has no global state which can be detected instantly, and there is no global time which is valid for all computers, nodes or entities. In order to define a global state, some authors have proposed algorithms for consistent [[Global Snapshots|global snapshots]], for example the Chandy-Lamport algorithm.
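Deadlock detection, listed among the characteristic problems above, reduces at its core to finding a cycle in a wait-for graph: process p "waits for" process q if p needs a resource that q holds. A minimal single-machine sketch (the graphs and names are illustrative; a distributed version would have to assemble this graph from local information, which is where phantom deadlocks arise):

```python
# A sketch of deadlock detection as cycle search on a wait-for graph:
# a cycle of waiting processes means none of them can ever proceed.
def has_deadlock(wait_for):
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on stack / done
    color = {p: WHITE for p in wait_for}

    def dfs(p):
        color[p] = GRAY
        for q in wait_for[p]:
            if color[q] == GRAY:          # back edge: a waiting cycle
                return True
            if color[q] == WHITE and dfs(q):
                return True
        color[p] = BLACK
        return False

    return any(color[p] == WHITE and dfs(p) for p in wait_for)

print(has_deadlock({"A": ["B"], "B": ["C"], "C": ["A"]}))  # True
print(has_deadlock({"A": ["B"], "B": ["C"], "C": []}))     # False
```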
A snapshot is "consistent" if it appears as if it were taken at the same instant everywhere in the system, without any violation of causality. In order to define a global time, some authors have proposed methods for consistent global time (logical clocks by Lamport, which find their extension in vector clocks and vector time, etc.). The concept of a "logical time" or timestamps introduced by Lamport allows an asynchronous system to simulate one in which the nodes have access to synchronized clocks. Both real and logical time are monotonically increasing, but real time is uniformly continuous, whereas logical time can have discontinuous jumps. These methods are problematic, however, because they attempt to make distributed computing follow the model of local, centralized computing. As Waldo et al. noticed in 1994, this approach ignores "the different failure modes and basic indeterminacy inherent in distributed computing" and leads to systems that are neither reliable nor scalable. Thus we have roughly the following problem fields:

# Attempts to imitate local computing (Synchronous Communication or Synchronization, Logical Time)
# Attempts to determine global state (Global Snapshots, Deadlock Detection, Termination)
# Attempts to reach a unified state (Agreement or Consensus)
# Attempts to coordinate access (Contention Problems such as Election and Mutual Exclusion)

=== Impossibility Results ===

Given these difficulties, it is not surprising that the analysis and design of distributed algorithms that work in a general [[Distributed System|distributed system]] (where each node and link can fail at any time, and messages can have an arbitrary time delay) is very hard and sometimes even impossible. Already one faulty process can render any guarantee of achieving a common consensus impossible, as the famous "FLP impossibility result" says.
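Returning to Lamport's logical clocks mentioned above, the rule is small enough to sketch directly: each process keeps a counter, increments it on every local event, stamps outgoing messages, and on receipt jumps past the stamp. Causally related events then receive increasing timestamps (the converse does not hold). The class below is a minimal sketch, not any particular library's API.

```python
# A sketch of Lamport's logical clocks: one counter per process,
# incremented on every event; a receiver jumps ahead of the sender's
# timestamp, so causally later events get larger timestamps.
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send_event(self):
        self.time += 1
        return self.time          # timestamp attached to the message

    def receive_event(self, msg_time):
        self.time = max(self.time, msg_time) + 1   # jump past the sender
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send_event()                 # a is now at 1
print(b.receive_event(t))          # b jumps to max(0, 1) + 1 = 2
```

Vector clocks extend this by keeping one counter per process, which additionally lets two events be recognized as causally unrelated.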
The "FLP impossibility result" or "FLP impossibility argument" from Fischer, Lynch and Paterson says it is impossible to guarantee consensus in a distributed, asynchronous system if even one process is faulty. To be more precise, it says there is no guarantee that a common consensus can be reached if a faulty process exists. This fact is intuitively plausible, since a faulty process that is no longer responding is indistinguishable from a process that answers slowly (if there is an arbitrary time delay in the connections of the asynchronous network). Consensus and agreement problems are a fundamental challenge in [[Distributed System|distributed systems]]. The consensus problem is one of the most thoroughly investigated problems in [[Distributed Computing|distributed computing]]: several processes have to agree on a certain value or decision. Processes in a database system may need to agree whether or not a transaction should be committed or aborted. Processes in a control or monitoring system may need to agree whether or not a particular other process is faulty. Processes in a general distributed system may need to agree whether or not a message has been received. As Nancy Lynch says (in Chapter 12, "Consensus", of her book), "the impossibility result implies that there is no purely asynchronous algorithm that reaches the needed agreement and tolerates any failures at all."

== Proof and Verification ==

The design and verification of distributed programs and algorithms is without doubt a very difficult task. A common way to verify distributed algorithms despite these difficulties is to verify liveness and safety properties.
The traditional definitions of liveness and safety are:

* '''Liveness''' means "something good will eventually occur" or "something good eventually happens"
* '''Safety''' means "something bad will never happen" or "no bad thing ever happens"

Liveness and safety are two complementary properties: one says that the system is changing, the other that the system is not changing. One claims that the program never enters an unacceptable state, the other asserts that the program always enters a desirable state after a finite number of steps.

=== Liveness ===

Liveness implies that the '''system is changing'''. There is a guarantee of progress, which in turn is guaranteed by the absence of deadlocks, the absence of infinite loops and the assurance of termination. It is a property stating that eventually (after a finite number of steps) some requirement holds. The program eventually enters a desirable state, and some assertion will eventually hold. In other words, every computation finally contains a state where a certain assertion is true. A liveness requirement demands that some property, in some reachable configuration, will eventually hold in every execution. Typical liveness properties are:

* Program termination: the algorithm will terminate in a finite amount of time
* Upper/lower bounds: a numerical value or parameter must reach a certain upper (or lower) bound (in this case liveness can be proved if the value is monotonically increasing (or decreasing) and never remains constant for an infinite amount of time)

=== Safety ===

Safety implies that the '''system does not change''': the program does nothing wrong, and there is a guarantee that no bad change takes place, which in turn is often proved by invariants. Invariants are assertions that always hold during the execution of the algorithm and are not affected by any action or operation of the algorithm. An invariant property must hold in every execution and in each reachable configuration.
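The invariant idea can be made concrete by brute force on a small example: enumerate every reachable configuration of a toy protocol and check the assertion in each one. The two-process lock protocol below, including its state encoding, is purely illustrative.

```python
# A sketch of safety verification by invariant checking: enumerate all
# reachable states of a toy two-process lock protocol and check the
# mutual-exclusion invariant in every one.
def reachable_states():
    # state: (pc0, pc1, lock); pc is "idle" or "crit", lock holds the
    # id of the process in its critical section, or None if free
    states, frontier = set(), [("idle", "idle", None)]
    while frontier:
        s = frontier.pop()
        if s in states:
            continue
        states.add(s)
        pc0, pc1, lock = s
        # a process may enter its critical section only if the lock is free
        if pc0 == "idle" and lock is None:
            frontier.append(("crit", pc1, 0))
        if pc1 == "idle" and lock is None:
            frontier.append((pc0, "crit", 1))
        # leaving the critical section releases the lock
        if pc0 == "crit":
            frontier.append(("idle", pc1, None))
        if pc1 == "crit":
            frontier.append((pc0, "idle", None))
    return states

# the invariant "never both in the critical section" holds in every state
assert all(not (pc0 == "crit" and pc1 == "crit")
           for pc0, pc1, _ in reachable_states())
print("mutual exclusion invariant holds in all reachable states")
```

This exhaustive enumeration is exactly what explicit-state model checkers automate; for real distributed algorithms the state space is usually far too large to list by hand.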
Safety properties specify that "something bad never happens": the program never enters an unacceptable state, and some assertion always holds. In other words, a certain assertion is true in every state of every computation of the algorithm. Typical safety properties are (see Owicki and Lamport, [http://research.microsoft.com/users/lamport/pubs/liveness.pdf Proving Liveness Properties of Concurrent Programs]):

* Partial correctness: if the algorithm begins with the precondition true, then it can never terminate with the postcondition false.
* Absence of deadlock: the algorithm never enters a state in which no further progress is possible.
* Absence of infinite loops: the algorithm never enters a state where one or more processes are involved in an infinite loop.
* Mutual exclusion: two different processes are never in their critical sections at the same time.

== Articles and Papers ==

Many classic publications by Leslie Lamport can be found on his [http://research.microsoft.com/users/lamport/pubs/pubs.html website] at Microsoft Research. Papers by Edsger W. Dijkstra can be found in the [http://www.cs.utexas.edu/users/EWD/ Dijkstra archives].

General:

* [http://www.sunlabs.com/techrep/1994/abstract-29.html A Note on Distributed Computing], Jim Waldo et al., Sun Technical Report TR-94-29 (1994)

Impossibility theorems and results:

* [http://portal.acm.org/citation.cfm?id=214121 Impossibility of Distributed Consensus with One Faulty Process], M. Fischer, N. Lynch, and M. Paterson, Journal of the ACM, Vol. 32, No. 2, April 1985, 374-382
* [http://doi.ieeecomputersociety.org/10.1109/MC.1992.10061 The Many Faces of Consensus in Distributed Systems], John Turek and Dennis Shasha, IEEE Computer, Vol. 25, No. 6 (1992), 8-17
* [http://citeseer.ist.psu.edu/572373.html A Hundred Impossibility Proofs for Distributed Computing], Nancy Lynch
* [http://citeseer.ist.psu.edu/623509.html Hundreds of Impossibility Results for Distributed Computing], Faith Fich and Eric Ruppert

== Books ==

* Nancy A. Lynch, ''Distributed Algorithms'', Morgan Kaufmann, 1996, ISBN 1558603484
* Gerard Tel, ''Introduction to Distributed Algorithms'', Cambridge University Press, 2nd edition, 2001, ISBN 0521794838
* Gerard Tel, ''Topics in Distributed Algorithms'', Cambridge University Press, 1991, ISBN 0521403766
* Valmir C. Barbosa, ''An Introduction to Distributed Algorithms'', The MIT Press, 1996, ISBN 0262024128
* Michel Raynal, ''Distributed Algorithms and Protocols'', John Wiley and Sons, 1988, ISBN 0471917540
* Hagit Attiya and Jennifer Welch, ''Distributed Computing'', John Wiley and Sons, 2nd edition, 2004, ISBN 0471453242
* Friedemann Mattern, ''Verteilte Basisalgorithmen'', Springer, 1989, ISBN 3540518355

== Lectures ==

Lecture in German from Hans P. Reiser and Rüdiger Kapitza, University of Erlangen-Nuremberg, 2003 [http://www4.informatik.uni-erlangen.de/Lehre/WS03/V_VA/Skript/]

Lectures in German from Prof. Dr. Friedemann Mattern, ETH Zurich: 1999/2000 [http://www.vs.inf.ethz.ch/edu/WS9900/VA/], 2001/2002 [http://www.vs.inf.ethz.ch/edu/WS0102/VA/], 2002/2003 [http://www.vs.inf.ethz.ch/edu/WS0203/VA/], 2003/2004 [http://www.vs.inf.ethz.ch/edu/WS0304/VA/], 2004/2005 [http://www.vs.inf.ethz.ch/edu/WS0405/VA/]