Chapter 6. Synchronization
In the previous chapters, we
have looked at processes and communication between processes. While
communication is important, it is not the entire story. Closely related is how processes
cooperate and synchronize with one another. Cooperation is partly supported by
means of naming, which allows processes to at least share resources, or
entities in general.
In this chapter, we mainly
concentrate on how processes can synchronize. For example, it is important that
multiple processes do not simultaneously access a shared resource, such as a printer, but instead cooperate in granting each other temporary exclusive
access. Another example is that multiple processes may sometimes need to agree
on the ordering of events, such as whether message m1 from process P was sent
before or after message m2 from process Q.
As it turns out,
synchronization in distributed systems is often much more difficult compared to
synchronization in uniprocessor or multiprocessor systems. The problems and
solutions that are discussed in this chapter are, by their nature, rather
general, and occur in many different situations in distributed systems.
We start with a discussion
of the issue of synchronization based on actual time, followed by
synchronization in which only relative ordering matters rather than ordering in
absolute time.
In many cases, it is
important that a group of processes can appoint one process as a coordinator,
which can be done by means of election algorithms. We discuss various election
algorithms in a separate section.
Distributed algorithms come
in all sorts and flavors and have been developed for very different types of
distributed systems. Many examples (and further references) can be found in
Andrews (2000) and Guerraoui and Rodrigues (2006). More formal approaches to a
wealth of algorithms can be found in textbooks by Attiya and Welch (2004), Lynch (1996), and Tel (2000).
6.1. Clock Synchronization
In a centralized system,
time is unambiguous. When a process wants to know the time, it makes a system
call and the kernel tells it. If process A asks for the time, and then a little
later process B asks for the time, the value that B gets will be higher than
(or possibly equal to) the value A got. It will certainly not be lower. In a
distributed system, achieving agreement on time is not trivial.
Just think, for a moment,
about the implications of the lack of global time on the UNIX make program, as
a single example. Normally, in UNIX, large programs are split up into multiple
source files, so that a change to one source file only requires one file to be
recompiled, not all the files. If a program consists of 100 files, not having
to recompile everything because one file has been changed greatly increases the
speed at which programmers can work.
The way make normally works
is simple. When the programmer has finished changing all the source files, he
runs make, which examines the times at which all the source and object files
were last modified. If the source file input.c has time 2151 and the
corresponding object file input.o has time 2150, make knows that input.c has
been changed since input.o was created, and thus input.c must be recompiled. On
the other hand, if output.c has time 2144 and output.o has time 2145, no
compilation is needed. Thus make goes through all the source files to find out
which ones need to be recompiled and calls the compiler to recompile them.
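To make the decision concrete, here is a minimal Python sketch of the timestamp comparison make performs; the helper name needs_recompile and the file names are illustrative, and real make of course also handles dependency rules and build commands.

import os

def needs_recompile(source, obj):
    # Recompile if the object file is missing or older than its source.
    if not os.path.exists(obj):
        return True
    return os.path.getmtime(source) > os.path.getmtime(obj)

for src in ["input.c", "output.c"]:
    obj = src.replace(".c", ".o")
    if needs_recompile(src, obj):
        print("recompiling", src)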
Now imagine what could
happen in a distributed system in which there were no global agreement on time.
Suppose that output.o has time 2144 as above, and shortly thereafter output.c
is modified but is assigned time 2143 because the clock on its machine is
slightly behind, as shown in Fig. 6-1. Make will not call the compiler. The
resulting executable binary program will then contain a mixture of object files
from the old sources and the new sources. It will probably crash and the
programmer will go crazy trying to understand what is wrong with the code.
Figure 6-1. When each
machine has its own clock, an event that occurred after another event may
nevertheless be assigned an earlier time.
There are many more examples
where an accurate account of time is needed. The example above can easily be
reformulated to file timestamps in general. In addition, think of application
domains such as financial brokerage, security auditing, and collaborative
sensing, and it will become clear that accurate timing is important. Since time
is so basic to the way people think and the effect of not having all the clocks
synchronized can be so dramatic, it is fitting that we begin our study of
synchronization with the simple question: Is it possible to synchronize all the
clocks in a distributed system? The answer is surprisingly complicated.
6.1.1. Physical Clocks
Nearly all computers have a
circuit for keeping track of time. Despite the widespread use of the word
"clock" to refer to these devices, they are not actually clocks in
the usual sense. Timer is perhaps a better word. A computer timer is usually a
precisely machined quartz crystal. When kept under tension, quartz crystals
oscillate at a well-defined frequency that depends on the kind of crystal, how
it is cut, and the amount of tension. Associated with each crystal are two
registers, a counter and a holding register. Each oscillation of the crystal
decrements the counter by one. When the counter gets to zero, an interrupt is
generated and the counter is reloaded from the holding register. In this way,
it is possible to program a timer to generate an interrupt 60 times a second,
or at any other desired frequency. Each interrupt is called one clock tick.
When the system is booted,
it usually asks the user to enter the date and time, which is then converted to
the number of ticks after some known starting date and stored in memory. Most
computers have a special battery-backed up CMOS RAM so that the date and time
need not be entered on subsequent boots. At every clock tick, the interrupt
service procedure adds one to the time stored in memory. In this way, the
(software) clock is kept up to date.
With a single computer and a
single clock, it does not matter much if this clock is off by a small amount.
Since all processes on the machine use the same clock, they will still be
internally consistent. For example, if the file input.c has time 2151 and file
input.o has time 2150, make will recompile the source file, even if the clock
is off by 2 and the true times are 2153 and 2152, respectively. All that really
matters are the relative times.
As soon as multiple CPUs are
introduced, each with its own clock, the situation changes radically. Although
the frequency at which a crystal oscillator runs is usually fairly stable, it
is impossible to guarantee that the crystals in different computers all run at
exactly the same frequency. In practice, when a system has n computers, all n
crystals will run at slightly different rates, causing the (software) clocks
gradually to get out of synch and give different values when read out. This
difference in time values is called clock skew. As a consequence of this clock
skew, programs that expect the time associated with a file, object, process, or
message to be correct and independent of the machine on which it was generated
(i.e., which clock it used) can fail, as we saw in the make example above.
In some systems (e.g.,
real-time systems), the actual clock time is important. Under these
circumstances, external physical clocks are needed. For reasons of efficiency
and redundancy, multiple physical clocks are generally considered desirable,
which yields two problems: (1) How do we synchronize them with real-world
clocks, and (2) How do we synchronize the clocks with each other?
Before answering these
questions, let us digress slightly to see how time is actually measured. It is
not nearly as easy as one might think, especially when high accuracy is
required. Since the invention of mechanical clocks in the 17th century, time
has been measured astronomically. Every day, the sun appears to rise on the
eastern horizon, then climbs to a maximum height in the sky, and finally sinks
in the west. The event of the sun's reaching its highest apparent point in the
sky is called the transit of the sun. This event occurs at about noon each day.
The interval between two consecutive transits of the sun is called the solar
day. Since there are 24 hours in a day, each containing 3600 seconds, the solar
second is defined as exactly 1/86400th of a solar day. The geometry of the mean
solar day calculation is shown in Fig. 6-2.
Figure 6-2. Computation of
the mean solar day.
In the 1940s, it was
established that the period of the earth's rotation is not constant. The earth
is slowing down due to tidal friction and atmospheric drag. Based on studies of
growth patterns in ancient coral, geologists now believe that 300 million years
ago there were about 400 days per year. The length of the year (the time for
one trip around the sun) is not thought to have changed; the day has simply
become longer. In addition to this long-term trend, short-term variations in
the length of the day also occur, probably caused by turbulence deep in the
earth's core of molten iron. These revelations led astronomers to compute the
length of the day by measuring a large number of days and taking the average
before dividing by 86,400. The resulting quantity was called the mean solar
second.
With the invention of the
atomic clock in 1948, it became possible to measure time much more accurately,
and independent of the wiggling and wobbling of the earth, by counting transitions
of the cesium 133 atom. The physicists took over the job of timekeeping from
the astronomers and defined the second to be the time it takes the cesium 133
atom to make exactly 9,192,631,770 transitions. The choice of 9,192,631,770 was
made to make the atomic second equal to the mean solar second in the year of
its introduction. Currently, several laboratories around the world have cesium
133 clocks. Periodically, each laboratory tells the Bureau International de
l'Heure (BIH) in Paris how many times its clock has ticked. The BIH averages
these to produce International Atomic Time, which is abbreviated TAI. Thus TAI
is just the mean number of ticks of the cesium 133 clocks since midnight on
Jan. 1, 1958 (the beginning of time) divided by 9,192,631,770.
Although TAI is highly
stable and available to anyone who wants to go to the trouble of buying a
cesium clock, there is a serious problem with it; 86,400 TAI seconds is now
about 3 msec less than a mean solar day (because the mean solar day is getting
longer all the time). Using TAI for keeping time would mean that over the
course of the years, noon would get earlier and earlier, until it would
eventually occur in the wee hours of the morning. People might notice this and
we could have the same kind of situation as occurred in 1582 when Pope Gregory
XIII decreed that 10 days be omitted from the calendar. This event caused riots
in the streets because landlords demanded a full month's rent and bankers a
full month's interest, while employers refused to pay workers for the 10 days
they did not work, to mention only a few of the conflicts. The Protestant
countries, as a matter of principle, refused to have anything to do with papal
decrees and did not accept the Gregorian calendar for 170 years.
BIH solves the problem by
introducing leap seconds whenever the discrepancy between TAI and solar time
grows to 800 msec. The use of leap seconds is illustrated in Fig. 6-3. This
correction gives rise to a time system based on constant TAI seconds but which
stays in phase with the apparent motion of the sun. It is called Universal
Coordinated Time, but is abbreviated as UTC. UTC is the basis of all modern
civil timekeeping. It has essentially replaced the old standard, Greenwich Mean
Time, which is astronomical time.
Figure 6-3. TAI seconds are
of constant length, unlike solar seconds. Leap seconds are introduced when
necessary to keep in phase with the sun.
Most electric power
companies synchronize the timing of their 60-Hz or 50-Hz clocks to UTC, so when
BIH announces a leap second, the power companies raise their frequency to 61 Hz
or 51 Hz for 60 or 50 sec, to advance all the clocks in their distribution area.
Since 1 sec is a noticeable interval for a computer, an operating system that
needs to keep accurate time over a period of years must have special software
to account for leap seconds as they are announced (unless they use the power
line for time, which is usually too crude). The total number of leap seconds
introduced into UTC so far is about 30.
To provide UTC to people who
need precise time, the National Institute of Standards and Technology (NIST) operates a shortwave radio station with call letters WWV from Fort Collins, Colorado. WWV
broadcasts a short pulse at the start of each UTC second. The accuracy of WWV
itself is about ±1 msec, but due to random atmospheric fluctuations that can
affect the length of the signal path, in practice the accuracy is no better
than ±10 msec. In England, the station MSF, operating from Rugby, Warwickshire,
provides a similar service, as do stations in several other countries.
Several earth satellites
also offer a UTC service. The Geostationary Operational Environmental Satellites (GOES) can provide UTC accurately to 0.5 msec, and some other satellites do even
better.
Using either shortwave radio
or satellite services requires an accurate knowledge of the relative position of the sender and receiver, in order to compensate for the signal propagation delay. Radio receivers for WWV, GOES, and the other UTC sources are
commercially available.
6.1.2. Global Positioning
System
As a step toward actual
clock synchronization problems, we first consider a related problem, namely
determining one's geographical position anywhere on Earth. This positioning
problem is by itself solved through a highly specific, dedicated distributed
system, namely GPS, which is an acronym for global positioning system. GPS is a
satellite-based distributed system that was launched in 1978. Although it has
been used mainly for military applications, in recent years it has found its
way to many civilian applications, notably for traffic navigation. However,
many more application domains exist. For example, GPS-equipped phones now allow callers to track each other's position, a feature which may prove extremely handy when you are lost or in trouble. This principle can easily be applied to
tracking other things as well, including pets, children, cars, boats, and so
on. An excellent overview of GPS is provided by Zogg (2002).
GPS uses 29 satellites each
circulating in an orbit at a height of approximately 20,000 km. Each satellite
has up to four atomic clocks, which are regularly calibrated from special
stations on Earth. A satellite continuously broadcasts its position, and time
stamps each message with its local time. This broadcasting allows every
receiver on Earth to accurately compute its own position using, in principle,
only three satellites. To explain, let us first assume that all clocks,
including the receiver's, are synchronized.
In order to compute a
position, consider first the two-dimensional case, as shown in Fig. 6-4, in
which two satellites are drawn, along with the circles representing points at
the same distance from each respective satellite. The y-axis represents the
height, while the x-axis represents a straight line along the Earth's surface
at sea level. Ignoring the highest point, we see that the intersection of the
two circles is a unique point (in this case, perhaps somewhere up a mountain).
Figure 6-4. Computing a
position in a two-dimensional space.
This principle of
intersecting circles can be expanded to three dimensions, meaning that we need
three satellites to determine the longitude, latitude, and altitude of a
receiver on Earth. This positioning is all fairly straightforward, but matters
become complicated when we can no longer assume that all clocks are perfectly
synchronized.
There are two important real-world facts that we need to take into account:

1. It takes a while before data on a satellite's position reaches the receiver.

2. The receiver's clock is generally not in synch with that of a satellite.
Assume that the timestamp
from a satellite is completely accurate. Let Δr denote the deviation of
the receiver's clock from the actual time. When a message is received from
satellite i with timestamp Ti, then the measured delay Δi by the receiver
consists of two components: the actual delay, along with its own deviation:
Δi = (Tnow - Ti) +
Δr
As signals travel with the
speed of light, c, the measured distance of the satellite is clearly cΔi.
With
di = c(Tnow - Ti)
being the real distance between the receiver and the satellite, the measured distance can be rewritten as di + cΔr. The real distance is simply computed as:

di = √((xi - xr)² + (yi - yr)² + (zi - zr)²)
where xi, yi, and zi denote
the coordinates of satellite i. What we see now is that if we have four
satellites, we get four equations in four unknowns, allowing us to solve the
coordinates xr, yr, and zr for the receiver, but also Δr. In other words,
a GPS measurement will also give an account of the actual time. Later in this
chapter we will return to determining positions following a similar approach.
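To see how the four unknowns can be obtained in practice, the following Python sketch solves the system numerically with a few Gauss-Newton iterations; the data layout, the names, and the iteration scheme are illustrative assumptions, not the method prescribed for GPS receivers.

import numpy as np

C = 299_792_458.0                      # speed of light (m/s)

def solve_position(sats, t_recv):
    """sats: list of (x, y, z, t_sent) for at least four satellites;
    t_recv: the corresponding reception times on the receiver's own clock.
    Returns [xr, yr, zr, delta_r]."""
    est = np.zeros(4)                  # initial guess: origin, zero clock offset
    for _ in range(10):                # a handful of Gauss-Newton iterations
        residuals, jacobian = [], []
        for (x, y, z, t_sent), t in zip(sats, t_recv):
            dx = est[:3] - np.array([x, y, z])
            dist = np.linalg.norm(dx)              # current estimate of di
            measured = C * (t - t_sent)            # c*Delta_i = di + c*delta_r
            residuals.append(measured - (dist + C * est[3]))
            jacobian.append(np.append(dx / dist, C))
        step, *_ = np.linalg.lstsq(np.array(jacobian),
                                   np.array(residuals), rcond=None)
        est += step
    return est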
So far, we have assumed that
measurements are perfectly accurate. Of course, they are not. For one thing,
GPS does not take leap seconds into account. In other words, there is a
systematic deviation from UTC, which by January 1, 2006, was 14 seconds. Such an
error can be easily compensated for in software. However, there are many other
sources of errors, starting with the fact that the atomic clocks in the
satellites are not always in perfect synch, the position of a satellite is not
known precisely, the receiver's clock has a finite accuracy, the signal
propagation speed is not constant (as signals slow down when entering, e.g.,
the ionosphere), and so on. Moreover, we all know that the earth is not a
perfect sphere, leading to further corrections.
By and large, computing an
accurate position is far from a trivial undertaking and requires going down
into many gory details. Nevertheless, even with relatively cheap GPS receivers,
positioning can be precise within a range of 1–5 meters. Moreover, professional
receivers (which can easily be hooked up to a computer network) have a claimed
error of less than 20–35 nanoseconds. Again, we refer to the excellent overview
by Zogg (2002) as a first step toward getting acquainted with the details.
6.1.3. Clock Synchronization
Algorithms
If one machine has a WWV
receiver, the goal becomes keeping all the other machines synchronized to it.
If no machines have WWV receivers, each machine keeps track of its own time,
and the goal is to keep all the machines together as well as possible. Many
algorithms have been proposed for doing this synchronization. A survey is given
in Ramanathan et al. (1990).
All the algorithms have the
same underlying model of the system. Each machine is assumed to have a timer
that causes an interrupt H times a second. When this timer goes off, the
interrupt handler adds 1 to a software clock that keeps track of the number of
ticks (interrupts) since some agreed-upon time in the past. Let us call the
value of this clock C. More specifically, when the UTC time is t, the value of
the clock on machine p is Cp(t). In a perfect world, we would have Cp(t) = t
for all p and all t. In other words, dCp(t)/dt ideally should be 1. The quantity dCp(t)/dt is called the frequency of p's clock at time t. The skew of the clock is defined as dCp(t)/dt - 1 and denotes the extent to which the frequency differs from that of a perfect clock. The offset relative to a specific time t is Cp(t) - t.
Real timers do not interrupt
exactly H times a second. Theoretically, a timer with H = 60 should generate
216,000 ticks per hour. In practice, the relative error obtainable with modern
timer chips is about 10^-5, meaning that a particular machine can get a value in the range 215,998 to 216,002 ticks per hour. More precisely, if there exists some constant ρ such that

1 - ρ ≤ dCp(t)/dt ≤ 1 + ρ

the timer can be said to be
working within its specification. The constant ρ is specified by the
manufacturer and is known as the maximum drift rate. Note that the maximum
drift rate specifies to what extent a clock's skew is allowed to fluctuate.
Slow, perfect, and fast clocks are shown in Fig. 6-5.
Figure 6-5. The relation
between clock time and UTC when clocks tick at different rates.
If two clocks are drifting from UTC in opposite directions, at a time Δt after they were
synchronized, they may be as much as 2ρ Δt apart. If the operating
system designers want to guarantee that no two clocks ever differ by more than δ,
clocks must be resynchronized (in software) at least every δ/2ρ
seconds. The various algorithms differ in precisely how this resynchronization
is done.
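For example, with a maximum drift rate ρ of 10^-5 (the figure quoted above for modern timer chips) and a requirement that no two clocks ever differ by more than δ = 1 second, resynchronization is needed at least every δ/2ρ = 1/(2 · 10^-5) = 50,000 seconds, that is, roughly every 14 hours.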
Network Time Protocol
A common approach in many
protocols and originally proposed by Cristian (1989) is to let clients contact
a time server. The latter can accurately provide the current time, for example,
because it is equipped with a WWV receiver or an accurate clock. The problem,
of course, is that when contacting the server, message delays will have outdated
the reported time. The trick is to find a good estimation for these delays.
Consider the situation sketched in Fig. 6-6.
Figure 6-6. Getting the
current time from a time server.
In this case, A will send a
request to B, timestamped with value T1. B, in turn, will record the time of receipt T2 (taken from its own local clock), and returns a response timestamped with value T3, piggybacking the previously recorded value T2. Finally, A records the time of the response's arrival, T4. Let us assume that the propagation delay from A to B is roughly the same as from B to A, meaning that T2 - T1 ≈ T4 - T3. In that case, A can estimate its offset θ relative to B as

θ = T3 + ((T2 - T1) + (T4 - T3))/2 - T4 = ((T2 - T1) + (T3 - T4))/2
Of course, time is not
allowed to run backward. If A's clock is fast, θ < 0, meaning that A
should, in principle, set its clock backward. This is not allowed as it could
cause serious problems such as an object file compiled just after the clock change
having a time earlier than the source which was modified just before the clock
change.
Such a change must be
introduced gradually. One way is as follows. Suppose that the timer is set to
generate 100 interrupts per second. Normally, each interrupt would add 10 msec
to the time. When slowing down, the interrupt routine adds only 9 msec each
time until the correction has been made. Similarly, the clock can be advanced
gradually by adding 11 msec at each interrupt instead of jumping it forward all
at once.
In the case of the network
time protocol (NTP), this protocol is set up pair-wise between servers. In
other words, B will also probe A for its current time. The offset θ is
computed as given above, along with the estimation δ for the delay:

δ = ((T2 - T1) + (T4 - T3))/2
Eight pairs of
(θ,δ) values are buffered, finally taking the minimal value found for
δ as the best estimation for the delay between the two servers, and
subsequently the associated value θ as the most reliable estimation of the
offset.
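A minimal Python sketch of the estimation just described; the function name and the tuple layout of the samples are illustrative, not part of the NTP specification.

def best_offset(samples):
    """samples: eight (T1, T2, T3, T4) tuples from successive probes.
    Returns the (offset, delay) pair with the smallest estimated delay."""
    best = None
    for T1, T2, T3, T4 in samples:
        theta = ((T2 - T1) + (T3 - T4)) / 2   # offset of A relative to B
        delta = ((T2 - T1) + (T4 - T3)) / 2   # estimated one-way delay
        if best is None or delta < best[1]:
            best = (theta, delta)
    return best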
Applying NTP symmetrically
should, in principle, also let B adjust its clock to that of A. However, if B's
clock is known to be more accurate, then such an adjustment would be foolish.
To solve this problem, NTP divides servers into strata. A server with a
reference clock such as a WWV receiver or an atomic clock, is known to be a
stratum-1 server (the clock itself is said to operate at stratum 0). When A
contacts B, it will only adjust its time if its own stratum level is higher
than that of B. Moreover, after the synchronization, A's stratum level will
become one higher than that of B. In other words, if B is a stratum-k server,
then A will become a stratum-(k+1) server if its original stratum level was
already larger than k. Due to the symmetry of NTP, if A's stratum level was lower
than that of B, B will adjust itself to A.
There are many important
features about NTP, of which many relate to identifying and masking errors, but
also security attacks. NTP is described in Mills (1992) and is known to achieve
(worldwide) accuracy in the range of 1–50 msec. The newest version (NTPv4) was
initially documented only by means of its implementation, but a detailed
description can now be found in Mills (2006).
The Berkeley Algorithm
In many algorithms such as
NTP, the time server is passive. Other machines periodically ask it for the
time. All it does is respond to their queries. In Berkeley UNIX, exactly the
opposite approach is taken (Gusella and Zatti, 1989). Here the time server
(actually, a time daemon) is active, polling every machine from time to time to
ask what time it is there. Based on the answers, it computes an average time
and tells all the other machines to advance their clocks to the new time or
slow their clocks down until some specified reduction has been achieved. This method
is suitable for a system in which no machine has a WWV receiver. The time
daemon's time must be set manually by the operator periodically. The method is
illustrated in Fig. 6-7.
Figure 6-7. (a) The time
daemon asks all the other machines for their clock values. (b) The machines
answer. (c) The time daemon tells everyone how to adjust their clock.
In Fig. 6-7(a), at 3:00, the
time daemon tells the other machines its time and asks for theirs. In Fig.
6-7(b), they respond with how far ahead or behind the time daemon they are.
Armed with these numbers, the time daemon computes the average and tells each
machine how to adjust its clock [see Fig. 6-7(c)].
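The averaging step can be sketched as follows in Python; the data structures and the sample numbers are illustrative (they mimic the kind of offsets shown in Fig. 6-7), not taken from the Berkeley implementation itself.

def compute_adjustments(offsets):
    """offsets: machine -> how far (in minutes) its clock is ahead of the
    daemon's; the daemon itself is included with offset 0."""
    average = sum(offsets.values()) / len(offsets)
    # Each machine must shift its clock by (average - its own offset).
    return {m: average - off for m, off in offsets.items()}

# Illustrative values: one machine 10 minutes behind, one 25 minutes ahead.
print(compute_adjustments({"daemon": 0, "m1": -10, "m2": 25}))
# -> {'daemon': 5.0, 'm1': 15.0, 'm2': -20.0}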
Note that for many purposes,
it is sufficient that all machines agree on the same time. It is not essential
that this time also agrees with the real time as announced on the radio every
hour. If, in our example of Fig. 6-7, the time daemon's clock were never manually calibrated, no harm is done provided none of the other nodes
communicates with external computers. Everyone will just happily agree on a
current time, without that value having any relation with reality.
Clock Synchronization in
Wireless Networks
An important advantage of
more traditional distributed systems is that we can easily and efficiently
deploy time servers. Moreover, most machines can contact each other, allowing
for a relatively simple dissemination of information. These assumptions are no
longer valid in many wireless networks, notably sensor networks. Nodes are
resource constrained, and multihop routing is expensive. In addition, it is
often important to optimize algorithms for energy consumption. These and other
observations have led to the design of very different clock synchronization
algorithms for wireless networks. In the following, we consider one specific
solution. Sivrikaya and Yener (2004) provide a brief overview of other
solutions. An extensive survey can be found in Sundararaman et al. (2005).
Reference broadcast
synchronization (RBS) is a clock synchronization protocol that is quite
different from other proposals (Elson et al., 2002). First, the protocol does
not assume that there is a single node with an accurate account of the actual
time available. Instead of aiming to provide all nodes with UTC time, it aims merely at internally synchronizing the clocks, just as the Berkeley algorithm
does. Second, the solutions we have discussed so far are designed to bring the
sender and receiver into synch, essentially following a two-way protocol. RBS
deviates from this pattern by letting only the receivers synchronize, keeping
the sender out of the loop.
In RBS, a sender broadcasts
a reference message that will allow its receivers to adjust their clocks. A key
observation is that in a sensor network the time to propagate a signal to other
nodes is roughly constant, provided no multi-hop routing is assumed.
Propagation time in this case is measured from the moment that a message leaves
the network interface of the sender. As a consequence, two important sources
for variation in message transfer no longer play a role in estimating delays:
the time spent to construct a message, and the time spent to access the
network. This principle is shown in Fig. 6-8.
Figure 6-8. (a) The usual
critical path in determining network delays. (b) The critical path in the case
of RBS.
Note that in protocols such
as NTP, a timestamp is added to the message before it is passed on the network
interface. Furthermore, as wireless networks are based on a contention
protocol, there is generally no saying how long it will take before a message
can actually be transmitted. These factors of nondeterminism are eliminated in
RBS. What remains is the delivery time at the receiver, but this time varies
considerably less than the network-access time.
The idea underlying RBS is
simple: when a node broadcasts a reference message m, each node p simply
records the time Tp,m that it received m. Note that Tp,m is read from p's local
clock. Ignoring clock skew, two nodes p and q can exchange each other's
delivery times in order to estimate their mutual, relative offset:

Offset [p,q] = Σ k=1..M (Tp,k - Tq,k) / M

where M is the total number
of reference messages sent. This information is important: node p will know the
value of q's clock relative to its own value. Moreover, if it simply stores
these offsets, there is no need to adjust its own clock, which saves energy.
Unfortunately, clocks can
drift apart. The effect is that simply computing the average offset as done
above will not work: the last values sent are simply less accurate than the
first ones. Moreover, as time goes by, the offset will presumably increase. Elson
et al. use a very simple algorithm to compensate for this: instead of computing
an average they apply standard linear regression to compute the offset as a
function:
Offset [p,q](t) =
αt+β
The constants α and
β are computed from the pairs (Tp,k,Tq,k). This new form will allow a much
more accurate computation of q's current clock value by node p, and vice versa.
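A minimal Python sketch of this regression step; the function name and data layout are illustrative. Given the reception times of the same M reference broadcasts at p and q, it fits Offset [p,q](t) = αt + β by ordinary least squares.

def fit_offset(tp, tq):
    """tp[k], tq[k]: reception times of reference message k at p and q."""
    n = len(tp)
    offsets = [a - b for a, b in zip(tp, tq)]          # observed Tp,k - Tq,k
    mean_t = sum(tp) / n
    mean_o = sum(offsets) / n
    var = sum((t - mean_t) ** 2 for t in tp)
    cov = sum((t - mean_t) * (o - mean_o) for t, o in zip(tp, offsets))
    alpha = cov / var
    beta = mean_o - alpha * mean_t
    return alpha, beta           # p estimates q's clock as t - (alpha*t + beta)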
6.2. Logical Clocks
So far, we have assumed that
clock synchronization is naturally related to real time. However, we have also
seen that it may be sufficient that every node agrees on a current time,
without that time necessarily being the same as the real time. We can go one
step further. For running make, for example, it is adequate that two nodes
agree that input.o is outdated by a new version of input.c. In this case,
keeping track of each other's events (such as producing a new version of
input.c) is what matters. For these algorithms, it is conventional to speak of
the clocks as logical clocks.
In a classic paper, Lamport
(1978) showed that although clock synchronization is possible, it need not be
absolute. If two processes do not interact, it is not necessary that their
clocks be synchronized because the lack of synchronization would not be
observable and thus could not cause problems. Furthermore, he pointed out that
what usually matters is not that all processes agree on exactly what time it
is, but rather that they agree on the order in which events occur. In the make
example, what counts is whether input.c is older or newer than input.o, not
their absolute creation times.
In this section we will
discuss Lamport's algorithm, which synchronizes logical clocks. Also, we
discuss an extension to Lamport's approach, called vector timestamps.
6.2.1. Lamport's Logical
Clocks
To synchronize logical
clocks, Lamport defined a relation called happens-before. The expression a → b is read "a happens before b" and
means that all processes agree that first event a occurs, then afterward, event
b occurs. The happens-before relation can be observed directly in two
situations:
1. If a and b are events in the same process, and a occurs before b, then a → b is true.

2. If a is the event of a message being sent by one process, and b is the event of the message being received by another process, then a → b is also true. A message cannot be received before it is sent, or even at the same time it is sent, since it takes a finite, nonzero amount of time to arrive.
Happens-before is a transitive relation, so if a → b and b → c, then a → c. If two events, x and y, happen in different processes that do not exchange messages (not even indirectly via third parties), then x → y is not true, but neither is y → x. These events are said to be concurrent,
which simply means that nothing can be said (or need be said) about when the
events happened or which event happened first.
What we need is a way of
measuring a notion of time such that for every event, a, we can assign it a
time value C (a) on which all processes agree. These time values must have the
property that if a → b, then C (a) < C
(b). To rephrase the conditions we stated earlier, if a and b are two events
within the same process and a occurs before b, then C (a) < C (b).
Similarly, if a is the sending of a message by one process and b is the reception
of that message by another process, then C (a) and C (b) must be assigned in
such a way that everyone agrees on the values of C (a) and C (b) with C (a)
< C (b). In addition, the clock time, C, must always go forward
(increasing), never backward (decreasing). Corrections to time can be made by
adding a positive value, never by subtracting one.
Now let us look at the
algorithm Lamport proposed for assigning times to events. Consider the three
processes depicted in Fig. 6-9(a). The processes run on different machines,
each with its own clock, running at its own speed. As can be seen from the
figure, when the clock has ticked 6 times in process P1, it has ticked 8 times
in process P2 and 10 times in process P3. Each clock runs at a constant rate,
but the rates are different due to differences in the crystals.
Figure 6-9. (a) Three
processes, each with its own clock. The clocks run at different rates. (b)
Lamport's algorithm corrects the clocks.
At time 6, process P1 sends
message m1 to process P2. How long this message takes to arrive depends on
whose clock you believe. In any event, the clock in process P2 reads 16 when it
arrives. If the message carries the starting time, 6, in it, process P2 will
conclude that it took 10 ticks to make the journey. This value is certainly
possible. According to this reasoning, message m2 from P2 to P3 takes 16 ticks,
again a plausible value.
Now consider message m3. It
leaves process P3 at 60 and arrives at P2 at 56. Similarly, message m4 from P2
to P1 leaves at 64 and arrives at 54. These values are clearly impossible. It
is this situation that must be prevented.
Lamport's solution follows
directly from the happens-before relation. Since m3 left at 60, it must arrive
at 61 or later. Therefore, each message carries the sending time according to
the sender's clock. When a message arrives and the receiver's clock shows a
value prior to the time the message was sent, the receiver fast forwards its
clock to be one more than the sending time. In Fig. 6-9(b) we see that m3 now
arrives at 61. Similarly, m4 arrives at 70.
To prepare for our
discussion on vector clocks, let us formulate this procedure more precisely. At
this point, it is important to distinguish three different layers of software
as we already encountered in Chap. 1: the network, a middleware layer, and an
application layer, as shown in Fig. 6-10. What follows is typically part of the
middleware layer.
Figure 6-10. The positioning
of Lamport's logical clocks in distributed systems.
To implement Lamport's
logical clocks, each process Pi maintains a local counter Ci. These counters
are updated according to the following steps (Raynal and Singhal, 1996):
1. Before executing an event (i.e., sending a message over the
network, delivering a message to an application, or some other internal event),
Pi executes Ci ← Ci + 1.
2. When process Pi sends a message m to Pj, it sets m's
timestamp ts (m) equal to Ci after having executed the previous step.
3. Upon the receipt of a message m, process Pj adjusts its own
local counter as Cj ← max{Cj, ts (m)},
after which it then executes the first step and delivers the message to the
application.
In some situations, an
additional requirement is desirable: no two events ever occur at exactly the
same time. To achieve this goal, we can attach the number of the process in
which the event occurs to the low-order end of the time, separated by a decimal
point. For example, an event at time 40 at process Pi will be timestamped with
40.i.
Note that by assigning the event time C (a) ← Ci(a) if a happened at process Pi at time Ci(a), we have a distributed implementation of the global time value we were initially seeking.
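The three steps can be captured in a few lines of Python; the class below is a minimal sketch (the class and method names are illustrative), including the process-number tie-break mentioned above.

class LamportClock:
    def __init__(self, process_id):
        self.c = 0
        self.pid = process_id

    def _tick(self):
        # Step 1: increment the counter before executing any event.
        self.c += 1

    def send(self, payload):
        # Step 2: timestamp the outgoing message with the updated counter.
        self._tick()
        return (self.c, self.pid, payload)

    def receive(self, message):
        # Step 3: adjust to the maximum of own counter and the timestamp,
        # then execute step 1 and deliver the message.
        ts, sender, payload = message
        self.c = max(self.c, ts)
        self._tick()
        return payload

    def timestamp(self):
        # Tie-break: attach the process number so no two events are equal.
        return (self.c, self.pid)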
Example: Totally Ordered
Multicasting
As an application of
Lamport's logical clocks, consider the situation in which a database has been
replicated across several sites. For example, to improve query performance, a
bank may place copies of an account database in two different cities, say New
York and San Francisco. A query is always forwarded to the nearest copy. The
price for a fast response to a query is partly paid in higher update costs,
because each update operation must be carried out at each replica.
In fact, there is a more
stringent requirement with respect to updates. Assume a customer in San
Francisco wants to add $100 to his account, which currently contains $1,000. At
the same time, a bank employee in New York initiates an update by which the
customer's account is to be increased by 1 percent interest. Both updates
should be carried out at both copies of the database. However, due to
communication delays in the underlying network, the updates may arrive in the
order as shown in Fig. 6-11.
Figure 6-11. Updating a
replicated database and leaving it in an inconsistent state.
The customer's update
operation is performed in San Francisco before the interest update. In
contrast, the copy of the account in the New York replica is first updated with
the 1 percent interest, and after that with the $100 deposit. Consequently, the
San Francisco database will record a total amount of $1,111, whereas the New
York database records $1,110.
The problem that we are
faced with is that the two update operations should have been performed in the
same order at each copy. Although it makes a difference whether the deposit is
processed before the interest update or the other way around, which order is
followed is not important from a consistency point of view. The important issue
is that both copies should be exactly the same. In general, situations such as
these require a totally-ordered multicast, that is, a multicast operation by
which all messages are delivered in the same order to each receiver. Lamport's
logical clocks can be used to implement totally-ordered multicasts in a completely
distributed fashion.
Consider a group of
processes multicasting messages to each other. Each message is always
timestamped with the current (logical) time of its sender. When a message is
multicast, it is conceptually also sent to the sender. In addition, we assume
that messages from the same sender are received in the order they were sent,
and that no messages are lost.
When a process receives a
message, it is put into a local queue, ordered according to its timestamp. The
receiver multicasts an acknowledgment to the other processes. Note that if we
follow Lamport's algorithm for adjusting local clocks, the timestamp of the
received message is lower than the timestamp of the acknowledgment. The
interesting aspect of this approach is that all processes will eventually have
the same copy of the local queue (provided no messages are removed).
A process can deliver a
queued message to the application it is running only when that message is at
the head of the queue and has been acknowledged by each other process. At that
point, the message is removed from the queue and handed over to the
application; the associated acknowledgments can simply be removed. Because each
process has the same copy of the queue, all messages are delivered in the same
order everywhere. In other words, we have established totally-ordered
multicasting.
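A minimal Python sketch of this scheme, assuming the communication guarantees stated above (reliable, FIFO-ordered delivery and a broadcast that also reaches the sender); the class, callback, and message names are illustrative.

import heapq

class TOMulticast:
    def __init__(self, pid, all_pids, broadcast, deliver):
        self.pid = pid
        self.all = set(all_pids)       # every group member, including ourselves
        self.clock = 0
        self.queue = []                # heap of (timestamp, sender, payload)
        self.acks = {}                 # (timestamp, sender) -> set of ackers
        self.broadcast = broadcast     # sends to all members, including the sender
        self.deliver = deliver         # hands a message to the application

    def multicast(self, payload):
        self.clock += 1
        self.broadcast(("MSG", self.clock, self.pid, payload))

    def on_message(self, kind, ts, sender, body):
        self.clock = max(self.clock, ts) + 1
        if kind == "MSG":
            heapq.heappush(self.queue, (ts, sender, body))
            self.acks.setdefault((ts, sender), set())
            self.broadcast(("ACK", self.clock, self.pid, (ts, sender)))
        else:                          # "ACK": body identifies the message
            self.acks.setdefault(body, set()).add(sender)
        self._try_deliver()

    def _try_deliver(self):
        # Deliver the head of the queue once every member has acknowledged it.
        while self.queue and self.acks.get(self.queue[0][:2], set()) >= self.all:
            ts, sender, body = heapq.heappop(self.queue)
            del self.acks[(ts, sender)]
            self.deliver(sender, body)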
As we shall see in later
chapters, totally-ordered multicasting is an important vehicle for replicated
services where the replicas are kept consistent by letting them execute the
same operations in the same order everywhere. As the replicas essentially
follow the same transitions in the same finite state machine, it is also known
as state machine replication (Schneider, 1990).
6.2.2. Vector Clocks
Lamport's logical clocks
lead to a situation where all events in a distributed system are totally
ordered with the property that if event a happened before event b, then a will
also be positioned in that ordering before b, that is, C (a) < C (b).
However, with Lamport
clocks, nothing can be said about the relationship between two events a and b
by merely comparing their time values C (a) and C (b), respectively. In other
words, if C (a) < C (b), then this does not necessarily imply that a indeed
happened before b. Something more is needed for that.
To explain, consider the
messages as sent by the three processes shown in Fig. 6-12. Denote by Tsnd (mi)
the logical time at which message mi was sent, and likewise, by Trcv (mi) the
time of its receipt. By construction, we know that for each message Tsnd (mi)
< Trcv (mi). But what can we conclude in general from Trcv (mi) < Tsnd
(mj)?
Figure 6-12. Concurrent
message transmission using logical clocks.
In the case for which mi=m1
and mj=m3, we know that these values correspond to events that took place at
process P2, meaning that m3 was indeed sent after the receipt of message m1.
This may indicate that the sending of message m3 depended on what was received
through message m1. We also know that Trcv (m1) < Tsnd (m2). However, the sending of m2 has nothing to do with the receipt of m1.
The problem is that Lamport
clocks do not capture causality. Causality can be captured by means of vector
clocks. A vector clock VC (a) assigned to an event a has the property that if
VC (a) < VC (b) for some event b, then event a is known to causally precede
event b. Vector clocks are constructed by letting each process Pi maintain a vector VCi with the following two properties:

1. VCi [i ] is the number of events that have occurred so far at Pi. In other words, VCi [i ] is the local logical clock at process Pi.

2. If VCi [j ] = k, then Pi knows that k events have occurred at Pj. It is thus Pi's knowledge of the local time at Pj.

The first property is
maintained by incrementing VCi [i ] at the occurrence of each new event that
happens at process Pi. The second property is maintained by piggybacking
vectors along with messages that are sent. In particular, the following steps
are performed:
1. Before executing an event (i.e., sending a message over the
network, delivering a message to an application, or some other internal event),
Pi executes VCi [i ] ← VCi [i ] + 1.
2. When process Pi sends a message m to Pj, it sets m's (vector)
timestamp ts (m) equal to VCi after having executed the previous step.
3. Upon the receipt of a message m, process Pj adjusts its own
vector by setting VCj [k ] ← max{VCj [k ], ts (m)[k ]} for each k, after which it executes the first step and delivers
the message to the application.
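A minimal Python sketch of these three steps; the class and method names are illustrative.

class VectorClock:
    def __init__(self, pid, n):
        self.pid = pid
        self.vc = [0] * n                     # VCi[k] for the n processes

    def _event(self):
        # Step 1: increment our own entry before executing any event.
        self.vc[self.pid] += 1

    def send(self, payload):
        # Step 2: piggyback a copy of the vector as the message timestamp.
        self._event()
        return (list(self.vc), self.pid, payload)

    def receive(self, message):
        # Step 3: take the entry-wise maximum, then execute step 1 and deliver.
        ts, sender, payload = message
        self.vc = [max(a, b) for a, b in zip(self.vc, ts)]
        self._event()
        return payload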
Note that if an event a has
timestamp ts (a), then ts (a)[i ]-1 denotes the number of events processed at
Pi that causally precede a. As a consequence, when Pj receives a message from
Pi with timestamp ts (m), it knows about the number of events that have
occurred at Pi that causally preceded the sending of m. More important,
however, is that Pj is also told how many events at other processes have taken
place before Pi sent message m. In other words, timestamp ts (m) tells the
receiver how many events in other processes have preceded the sending of m, and
on which m may causally depend.
Enforcing Causal
Communication
Using vector clocks, it is
now possible to ensure that a message is delivered only if all messages that
causally precede it have also been received as well. To enable such a scheme,
we will assume that messages are multicast within a group of processes. Note
that this causally-ordered multicasting is weaker than the totally-ordered
multicasting we discussed earlier. Specifically, if two messages are not in any
way related to each other, we do not care in which order they are delivered to
applications. They may even be delivered in different order at different
locations.
Furthermore, we assume that
clocks are only adjusted when sending and receiving messages. In particular,
upon sending a message, process Pi will only increment VCi[i ] by 1. When it
receives a message m with timestamp ts (m), it only adjusts VCi [k ] to max{VCi [k ], ts (m)[k ]} for each k.
Now suppose that Pj receives
a message m from Pi with (vector) timestamp ts (m). The delivery of the message
to the application layer will then be delayed until the following two
conditions are met:
ts (m)[i] = VCj [i] + 1

ts (m)[k] ≤ VCj [k ] for all k ≠ i
The first condition states
that m is the next message that Pj was expecting from process Pi. The second
condition states that Pj has seen all the messages that have been seen by Pi
when it sent message m. Note that there is no need for process Pj to delay the
delivery of its own messages.
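The delivery test itself takes only a few lines; the Python sketch below (with illustrative names) checks the two conditions and leaves the queueing of delayed messages to the caller.

def can_deliver(ts, i, vc_j):
    """ts: vector timestamp of message m from Pi; vc_j: Pj's current vector."""
    if ts[i] != vc_j[i] + 1:                # m is not the next message from Pi
        return False
    return all(ts[k] <= vc_j[k]             # Pj has seen everything Pi had seen
               for k in range(len(ts)) if k != i)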
As an example, consider
three processes P0, P1, and P2 as shown in Fig. 6-13. At local time (1,0,0), P0
sends message m to the other two processes. After its receipt by P1, the latter
decides to send m*, which arrives at P2 sooner than m. At that point, the
delivery of m* is delayed by P2 until m has been received and delivered to P2's
application layer.
Figure 6-13. Enforcing
causal communication.
A Note on Ordered Message Delivery
Some middleware systems,
notably ISIS and its successor Horus (Birman and van Renesse, 1994), provide
support for totally-ordered and causally-ordered (reliable) multicasting. There
has been some controversy whether such support should be provided as part of
the message-communication layer, or whether applications should handle ordering
(see, e.g., Cheriton and Skeen, 1993; and Birman, 1994). Matters have not been
settled, but what is more important is that the arguments still hold today.
There are two main problems
with letting the middleware deal with message ordering. First, because the
middleware cannot tell what a message actually contains, only potential
causality is captured. For example, two messages from the same sender that are
completely independent will always be marked as causally related by the
middleware layer. This approach is overly restrictive and may lead to
efficiency problems.
A second problem is that not
all causality may be captured. Consider an electronic bulletin board. Suppose
Alice posts an article. If she then phones Bob telling about what she just
wrote, Bob may post another article as a reaction without having seen Alice's
posting on the board. In other words, there is a causality between Bob's
posting and that of Alice due to external communication. This causality is not
captured by the bulletin board system.
In essence, ordering issues,
like many other application-specific communication issues, can be adequately
solved by looking at the application for which communication is taking place.
This is also known as the end-to-end argument in systems design (Saltzer et
al., 1984). A drawback of having only application-level solutions is that a
developer is forced to concentrate on issues that do not immediately relate to the
core functionality of the application. For example, ordering may not be the
most important problem when developing a messaging system such as an electronic
bulletin board. In that case, having an underlying communication layer handle
ordering may turn out to be convenient. We will come across the end-to-end
argument a number of times, notably when dealing with security in distributed
systems.
6.3. Mutual Exclusion
Fundamental to distributed
systems is the concurrency and collaboration among multiple processes. In many
cases, this also means that processes will need to simultaneously access the
same resources. To prevent that such concurrent accesses corrupt the resource,
or make it inconsistent, solutions are needed to grant mutual exclusive access
by processes. In this section, we take a look at some of the more important
distributed algorithms that have been proposed. A recent survey of distributed
algorithms for mutual exclusion is provided by Saxena and Rai (2003). Older,
but still relevant is Velazquez (1993).
6.3.1. Overview
Distributed mutual exclusion
algorithms can be classified into two different categories. In token-based
solutions mutual exclusion is achieved by passing a special message between the
processes, known as a token. There is only one token available and who ever has
that token is allowed to access the shared resource. When finished, the token
is passed on to a next process. If a process having the token is not interested
in accessing the resource, it simply passes it on.
Token-based solutions have a
few important properties. First, depending on how the processes are
organized, they can fairly easily ensure that every process will get a chance
at accessing the resource. In other words, they avoid starvation. Second,
deadlocks by which several processes are waiting for each other to proceed, can
easily be avoided, contributing to their simplicity. Unfortunately, the main
drawback of token-based solutions is a rather serious one: when the token is
lost (e.g., because the process holding it crashed), an intricate distributed
procedure needs to be started to ensure that a new token is created, but above
all, that it is also the only token.
As an alternative, many
distributed mutual exclusion algorithms follow a permission-based approach. In
this case, a process wanting to access the resource first requires the
permission of other processes. There are many different ways toward granting
such a permission and in the sections that follow we will consider a few of
them.
6.3.2. A Centralized
Algorithm
The most straightforward way
to achieve mutual exclusion in a distributed system is to simulate how it is
done in a one-processor system. One process is elected as the coordinator.
Whenever a process wants to access a shared resource, it sends a request message
to the coordinator stating which resource it wants to access and asking for
permission. If no other process is currently accessing that resource, the
coordinator sends back a reply granting permission, as shown in Fig. 6-14(a).
When the reply arrives, the requesting process can go ahead.
Figure 6-14. (a) Process 1
asks the coordinator for permission to access a shared resource. Permission is
granted. (b) Process 2 then asks permission to access the same resource. The
coordinator does not reply. (c) When process 1 releases the resource, it tells
the coordinator, which then replies to 2.
Now suppose that another
process, 2 in Fig. 6-14(b), asks for permission to access the resource. The
coordinator knows that a different process is already at the resource, so it
cannot grant permission. The exact method used to deny permission is system
dependent. In Fig. 6-14(b), the coordinator just refrains from replying, thus
blocking process 2, which is waiting for a reply. Alternatively, it could send
a reply saying "permission denied." Either way, it queues the request
from 2 for the time being and waits for more messages.
When process 1 is finished
with the resource, it sends a message to the coordinator releasing its
exclusive access, as shown in Fig. 6-14(c). The coordinator takes the first
item off the queue of deferred requests and sends that process a grant message.
If the process was still blocked (i.e., this is the first message to it), it
unblocks and accesses the resource. If an explicit message has already been
sent denying permission, the process will have to poll for incoming traffic or
block later. Either way, when it sees the grant, it can go ahead as well.
It is easy to see that the algorithm guarantees mutual exclusion: the coordinator lets only one process at a time access the resource. It is also fair, since requests are granted in the order in which they are received. No process ever waits forever (no starvation). The scheme is easy to implement, too, and requires only three messages per use of a resource (request, grant, release). Its simplicity makes it an attractive solution for many practical situations.
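The coordinator's side of the algorithm can be sketched in a few lines of Python; the send() callback and the message names are illustrative.

from collections import deque

class Coordinator:
    def __init__(self, send):
        self.send = send
        self.holder = None            # process currently granted the resource
        self.waiting = deque()        # deferred requests, in arrival order

    def on_request(self, pid):
        if self.holder is None:
            self.holder = pid
            self.send(pid, "GRANT")
        else:
            self.waiting.append(pid)  # no reply: the requester stays blocked

    def on_release(self, pid):
        self.holder = None
        if self.waiting:
            self.holder = self.waiting.popleft()
            self.send(self.holder, "GRANT")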
The centralized approach
also has shortcomings. The coordinator is a single point of failure, so if it
crashes, the entire system may go down. If processes normally block after
making a request, they cannot distinguish a dead coordinator from
"permission denied" since in both cases no message comes back. In
addition, in a large system, a single coordinator can become a performance
bottleneck. Nevertheless, the benefits coming from its simplicity outweigh in
many cases the potential drawbacks. Moreover, distributed solutions are not
necessarily better, as our next example illustrates.
6.3.3. A Decentralized
Algorithm
Having a single coordinator
is often a poor approach. Let us take a look at a fully decentralized solution.
Lin et al. (2004) propose to use a voting algorithm that can be executed using
a DHT-based system. In essence, their solution extends the central coordinator
in the following way. Each resource is assumed to be replicated n times. Every
replica has its own coordinator for controlling the access by concurrent
processes.
However, whenever a process
wants to access the resource, it will simply need to get a majority vote from m
> n/2 coordinators. Unlike in the centralized scheme discussed before, we
assume that when a coordinator does not give permission to access a resource
(which it will do when it had granted permission to another process), it will
tell the requester.
This scheme essentially
makes the original centralized solution less vulnerable to failures of a single
coordinator. The assumption is that when a coordinator crashes, it recovers
quickly but will have forgotten any vote it gave before it crashed. Another way
of viewing this is that a coordinator resets itself at arbitrary moments. The
risk that we are taking is that a reset will make the coordinator forget that
it had previously granted permission to some process to access the resource. As
a consequence, it may incorrectly grant this permission again to another
process after its recovery.
Let p be the probability
that a coordinator resets during a time interval Δt. The probability P [k
] that k out of m coordinators reset during the same interval is then
Given that at least 2m - n coordinators need to reset in order to violate the correctness of the voting mechanism, the probability that such a violation occurs is then the sum of P [k ] for k = 2m - n up to m. To give
an impression of what this could mean, assume that we are dealing with a
DHT-based system in which each node participates for about 3 hours in a row.
Let Δt be 10 seconds, which is considered to be a conservative value for a
single process to want to access a shared resource. (Different mechanisms are
needed for very long allocations.) With n = 32 and m = 0.75n, the probability
of violating correctness is less than 10^-40. This probability is surely smaller
than the availability of any resource.
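The back-of-the-envelope calculation above is easy to reproduce; the Python sketch below uses the formulas just given, with an illustrative value for p derived from Δt = 10 seconds and roughly 3 hours of participation per node.

from math import comb

def violation_probability(n, m, p):
    # Probability that at least 2m - n of the m granting coordinators reset.
    return sum(comb(m, k) * p ** k * (1 - p) ** (m - k)
               for k in range(2 * m - n, m + 1))

# Delta t = 10 s out of roughly 3 hours of participation.
print(violation_probability(n=32, m=24, p=10 / (3 * 3600)))   # well below 10^-40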
To implement this scheme,
Lin et al. (2004) use a DHT-based system in which a resource is replicated n
times. Assume that the resource is known under its unique name rname. We can
then assume that the i-th replica is named rname-i which is then used to
compute a unique key using a known hash function. As a consequence, every
process can generate the n keys given a resource's name, and subsequently
lookup each node responsible for a replica (and controlling access to that
replica).
If permission to access the
resource is denied (i.e., a process gets less than m votes), it is assumed that
it will back off for a randomly-chosen time, and make a next attempt later. The
problem with this scheme is that if many nodes want to access the same
resource, it turns out that the utilization rapidly drops. In other words, there
are so many nodes competing to get access that eventually no one is able to get enough votes, leaving the resource unused. A solution to this problem can be found in Lin et al. (2004).
6.3.4. A Distributed
Algorithm
To many, having a
probabilistically correct algorithm is just not good enough. So researchers
have looked for deterministic distributed mutual exclusion algorithms.
Lamport's 1978 paper on clock synchronization presented the first one. Ricart
and Agrawala (1981) made it more efficient. In this section we will describe
their method.
Ricart and Agrawala's
algorithm requires that there be a total ordering of all events in the system.
That is, for any pair of events, such as messages, it must be unambiguous which
one actually happened first. Lamport's algorithm presented in Sec. 6.2.1 is one
way to achieve this ordering and can be used to provide timestamps for
distributed mutual exclusion.
The algorithm works as
follows. When a process wants to access a shared resource, it builds a message containing
the name of the resource, its process number, and the current (logical) time.
It then sends the message to all other processes, conceptually including
itself. The sending of messages is assumed to be reliable; that is, no message
is lost.
When a process receives a
request message from another process, the action it takes depends on its own
state with respect to the resource named in the message. Three different cases
have to be clearly distinguished:
If the receiver is not
accessing the resource and does not want to access it, it sends back an OK
message to the sender.
If the receiver already has
access to the resource, it simply does not reply. Instead, it queues the
request.
If the receiver wants to
access the resource as well but has not yet done so, it compares the timestamp
of the incoming message with the one contained in the message that it has sent
everyone. The lowest one wins. If the incoming message has a lower timestamp,
the receiver sends back an OK message. If its own message has a lower
timestamp, the receiver queues the incoming request and sends nothing.
After sending out requests
asking permission, a process sits back and waits until everyone else has given
permission. As soon as all the permissions are in, it may go ahead. When it is
finished, it sends OK messages to all processes on its queue and deletes them
all from the queue.
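As an illustration, a minimal Python sketch of the per-process logic is given below. It is not Ricart and Agrawala's original code: the reliable send callable, the externally supplied Lamport clock, and all names are assumptions. Ties between equal timestamps are broken by process number, which the required total ordering permits.

from dataclasses import dataclass, field

@dataclass
class RicartAgrawala:
    me: int
    peers: list            # process ids of all other processes
    send: callable          # assumed reliable point-to-point send(dest, msg)

    state: str = "RELEASED"            # RELEASED, WANTED, or HELD
    my_timestamp: int = 0
    pending_ok: set = field(default_factory=set)
    deferred: list = field(default_factory=list)   # queued requests

    def request_access(self, clock):
        self.state = "WANTED"
        self.my_timestamp = clock
        self.pending_ok = set(self.peers)
        for q in self.peers:
            self.send(q, ("REQUEST", self.my_timestamp, self.me))
        # the caller now blocks until pending_ok is empty, then uses the resource

    def on_request(self, ts, sender):
        mine = (self.my_timestamp, self.me)
        if self.state == "HELD" or (self.state == "WANTED" and mine < (ts, sender)):
            self.deferred.append(sender)        # cases 2 and 3: keep quiet, queue it
        else:
            self.send(sender, ("OK", self.me))  # case 1, or we lost the comparison

    def on_ok(self, sender):
        self.pending_ok.discard(sender)
        if self.state == "WANTED" and not self.pending_ok:
            self.state = "HELD"

    def release(self):
        self.state = "RELEASED"
        for q in self.deferred:
            self.send(q, ("OK", self.me))
        self.deferred.clear()

Note that the pair (timestamp, process number) used for comparison is exactly the total order obtained from Lamport clocks extended with process identifiers.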
Let us try to understand why
the algorithm works. If there is no conflict, it clearly works. However, suppose
that two processes try to simultaneously access the resource, as shown in Fig.
6-15(a).
Figure 6-15. (a) Two
processes want to access a shared resource at the same moment. (b) Process 0
has the lowest timestamp, so it wins. (c) When process 0 is done, it sends an
OK also, so 2 can now go ahead.
Process 0 sends everyone a
request with timestamp 8, while at the same time, process 2 sends everyone a
request with timestamp 12. Process 1 is not interested in the resource, so it
sends OK to both senders. Processes 0 and 2 both see the conflict and compare
timestamps. Process 2 sees that it has lost, so it grants permission to 0 by
sending OK. Process 0 now queues the request from 2 for later processing and
accesses the resource, as shown in Fig. 6-15(b). When it is finished, it removes
the request from 2 from its queue and sends an OK message to process 2,
allowing the latter to go ahead, as shown in Fig. 6-15(c). The algorithm works
because in the case of a conflict, the lowest timestamp wins and everyone
agrees on the ordering of the timestamps.
Note that the situation in
Fig. 6-15 would have been essentially different if process 2 had sent its
message earlier in time so that process 0 had gotten it and granted permission
before making its own request. In this case, 2 would have noticed that it
itself already had access to the resource at the time of the request, and would
have queued the request instead of sending a reply.
As with the centralized
algorithm discussed above, mutual exclusion is guaranteed without deadlock or
starvation. The number of messages required per entry is now 2(n - 1), where
the total number of processes in the system is n. Best of all, no single point
of failure exists.
Unfortunately, the single
point of failure has been replaced by n points of failure. If any process
crashes, it will fail to respond to requests. This silence will be interpreted
(incorrectly) as denial of permission, thus blocking all subsequent attempts by
all processes to enter all critical regions. Since the probability of one of
the n processes failing is at least n times as large as a single coordinator
failing, we have managed to replace a poor algorithm with one that is more than
n times worse and requires much more network traffic as well.
The algorithm can be patched
up by the same trick that we proposed earlier. When a request comes in, the
receiver always sends a reply, either granting or denying permission. Whenever
either a request or a reply is lost, the sender times out and keeps trying
until either a reply comes back or the sender concludes that the destination is
dead. After a request is denied, the sender should block waiting for a
subsequent OK message.
Another problem with this
algorithm is that either a multicast communication primitive must be used, or
each process must maintain the group membership list itself, including
processes entering the group, leaving the group, and crashing. The method works
best with small groups of processes that never change their group memberships.
Finally, recall that one of
the problems with the centralized algorithm is that making it handle all
requests can lead to a bottleneck. In the distributed algorithm, all processes
are involved in all decisions concerning accessing the shared resource. If one
process is unable to handle the load, it is unlikely that forcing everyone to
do exactly the same thing in parallel is going to help much.
Various minor improvements
are possible to this algorithm. For example, getting permission from everyone
is really overkill. All that is needed is a method to prevent two processes
from accessing the resource at the same time. The algorithm can be modified to
grant permission when it has collected permission from a simple majority of the
other processes, rather than from all of them. Of course, in this variation,
after a process has granted permission to one process, it cannot grant the same
permission to another process until the first one has finished.
Nevertheless, this algorithm
is slower, more complicated, more expensive, and less robust than the original
centralized one. Why bother studying it under these conditions? For one thing,
it shows that a distributed algorithm is at least possible, something that was
not obvious when we started. Also, by pointing out the shortcomings, we may
stimulate future theoreticians to try to produce algorithms that are actually
useful. Finally, like eating spinach and learning Latin in high school, some
things are said to be good for you in some abstract way. It may take some time
to discover exactly what.
6.3.5. A Token Ring
Algorithm
A completely different
approach to deterministically achieving mutual exclusion in a distributed
system is illustrated in Fig. 6-16. Here we have a bus network (e.g., Ethernet),
as shown in Fig. 6-16(a), with no inherent ordering of the processes. In
software, a logical ring is constructed in which each process is assigned a
position in the ring, as shown in Fig. 6-16(b). The ring positions may be
allocated in numerical order of network addresses or some other means. It does
not matter what the ordering is. All that matters is that each process knows
who is next in line after itself.
Figure 6-16. (a) An unordered
group of processes on a network. (b) A logical ring constructed in software.
When the ring is
initialized, process 0 is given a token. The token circulates around the ring.
It is passed from process k to process k+1 (modulo the ring size) in
point-to-point messages. When a process acquires the token from its neighbor,
it checks to see if it needs to access the shared resource. If so, the process
goes ahead, does all the work it needs to, and releases the resource. After it
has finished, it passes the token along the ring. It is not permitted to
access the resource again immediately using the same token.
If a process is handed the
token by its neighbor and is not interested in the resource, it just passes the
token along. As a consequence, when no processes need the resource, the token
just circulates at high speed around the ring.
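In code, each process's behavior boils down to a small loop. The following sketch is schematic; the four callables it receives (blocking token receipt, token forwarding, and the application's interest in and use of the resource) are assumed to be supplied by the surrounding system.

def token_ring_process(receive_token, send_token, wants_resource, use_resource):
    # One process's event loop in the token-ring algorithm.
    # receive_token(): blocks until the token arrives from the predecessor.
    # send_token():    hands the token to the successor on the logical ring.
    while True:
        receive_token()
        if wants_resource():
            use_resource()          # at most one access per token visit
        send_token()                # pass the token on, whether used or not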
The correctness of this
algorithm is easy to see. Only one process has the token at any instant, so
only one process can actually get to the resource. Since the token circulates
among the processes in a well-defined order, starvation cannot occur. Once a
process decides it wants to have access to the resource, at worst it will have
to wait for every other process to use the resource.
As usual, this algorithm has
problems too. If the token is ever lost, it must be regenerated. In fact,
detecting that it is lost is difficult, since the amount of time between
successive appearances of the token on the network is unbounded. The fact that
the token has not been spotted for an hour does not mean that it has been lost;
somebody may still be using it.
The algorithm also runs into
trouble if a process crashes, but recovery is easier than in the other cases.
If we require a process receiving the token to acknowledge receipt, a dead
process will be detected when its neighbor tries to give it the token and fails.
At that point the dead process can be removed from the group, and the token
holder can throw the token over the head of the dead process to the next member
down the line, or the one after that, if necessary. Of course, doing so
requires that everyone maintain the current ring configuration.
6.3.6. A Comparison of the
Four Algorithms
A brief comparison of the
four mutual exclusion algorithms we have looked at is instructive. In Fig. 6-17
we have listed the algorithms and three key properties: the number of messages
required for a process to access and release a shared resource, the delay
before access can occur (assuming messages are passed sequentially over a
network), and some problems associated with each algorithm.
Figure 6-17. A comparison of four mutual exclusion algorithms.

Algorithm     | Messages per entry/exit | Delay before entry (in message times) | Problems
Centralized   | 3                       | 2                                     | Coordinator crash
Decentralized | 3mk, k = 1, 2, ...      | 2m                                    | Starvation, low efficiency
Distributed   | 2(n - 1)                | 2(n - 1)                              | Crash of any process
Token ring    | 1 to ∞                  | 0 to n - 1                            | Lost token, process crash
The centralized algorithm is
simplest and also most efficient. It requires only three messages to enter and
leave a critical region: a request, a grant to enter, and a release to exit. In
the decentralized case, we see that these messages need to be exchanged with
each of the m coordinators, but now it is possible that several attempts need
to be made (for which we introduce the variable k). The distributed algorithm
requires n - 1 request messages, one to each of the other processes, and an
additional n - 1 grant messages, for a total of 2(n - 1). (We assume that only
point-to-point communication channels are used.) With the token ring algorithm,
the number is variable. If every process constantly wants to enter a critical
region, then each token pass will result in one entry and exit, for an average
of one message per critical region entered. At the other extreme, the token may
sometimes circulate for hours without anyone being interested in it. In this
case, the number of messages per entry into a critical region is unbounded.
The delay from the moment a
process needs to enter a critical region until its actual entry also varies for
the four algorithms. When the time using a resource is short, the dominant
factor in the delay is the actual mechanism for accessing a resource. When
resources are used for a long time period, the dominant factor is waiting for
everyone else to take their turn. In Fig. 6-17 we show the former case. It takes
only two message times to enter a critical region in the centralized case, but
3mk times for the decentralized case, where k is the number of attempts that
need to be made. Assuming that messages are sent one after the other, 2(n - 1)
message times are needed in the distributed case. For the token ring, the time
varies from 0 (token just arrived) to n - 1 (token just departed).
Finally, all algorithms
except the decentralized one suffer badly in the event of crashes. Special
measures and additional complexity must be introduced to avoid having a crash
bring down the entire system. It is ironic that the distributed algorithms are
even more sensitive to crashes than the centralized one. In a system that is
designed to be fault tolerant, none of these would be suitable, but if crashes
are very infrequent, they might do. The decentralized algorithm is less
sensitive to crashes, but processes may suffer from starvation and special
measures are needed to guarantee efficiency.
6.4. Global Positioning of
Nodes
When the number of nodes in
a distributed system grows, it becomes increasingly difficult for any node to
keep track of the others. Such knowledge may be important for executing
distributed algorithms such as routing, multicasting, data placement,
searching, and so on. We have already seen different examples in which large
collections of nodes are organized into specific topologies that facilitate the
efficient execution of such algorithms. In this section, we take a look at
another organization that is related to timing issues.
In geometric overlay
networks each node is given a position in an m-dimensional geometric space,
such that the distance between two nodes in that space reflects a real-world
performance metric. The simplest and most widely applied example is where distance
corresponds to internode latency. In other words, given two nodes P and Q,
the distance d(P,Q) reflects how long it would take for a message to travel
from P to Q and vice versa.
There are many applications
of geometric overlay networks. Consider the situation where a Web site at
server O has been replicated to multiple servers S1,..., Sk on the Internet.
When a client C requests a page from O, the latter may decide to redirect that
request to the server closest to C, that is, the one that will give the best
response time. If the geometric location of C is known, as well as those of
each replica server, O can then simply pick that server Si for which d (C,Si)
is minimal. Note that such a selection requires only local processing at O. In
other words, there is, for example, no need to sample all the latencies between
C and each of the replica servers.
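A sketch of that purely local decision is shown below; the positions of the client and the replica servers are assumed to be known already, and the coordinates and server names are made up for illustration.

from math import dist   # Euclidean distance, available since Python 3.8

def closest_replica(client_pos, servers):
    # servers: dict mapping server name -> geometric coordinates
    return min(servers, key=lambda s: dist(client_pos, servers[s]))

servers = {"S1": (2.0, 1.0), "S2": (8.5, 3.0), "S3": (4.0, 7.5)}   # assumed positions
print(closest_replica((3.0, 2.0), servers))    # -> "S1"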
Another example, which we
will work out in detail in the following chapter, is optimal replica placement.
Consider again a Web site that has gathered the positions of its clients. If
the site were to replicate its content to K servers, it can compute the K best
positions where to place replicas such that the average client-to-replica
response time is minimal. Performing such computations is almost trivially
feasible if clients and servers have geometric positions that reflect internode
latencies.
As a last example, consider
position-based routing (Araujo and Rodrigues, 2005; and Stojmenovic, 2002). In
such schemes, a message is forwarded to its destination using only positioning
information. For example, a naive routing algorithm is to let each node forward a
message to the neighbor closest to the destination. Although it can easily be
shown that this specific algorithm need not converge, it illustrates that only
local information is used to take a decision. There is no need to propagate
link information or the like to all nodes in the network, as is the case with
conventional routing algorithms.
Theoretically, positioning a
node in an m-dimensional geometric space requires m+1 distance measures to
nodes with known positions. This can be easily seen by considering the case m =
2, as shown in Fig. 6-18. Assuming that node P wants to compute its own
position, it contacts three other nodes with known positions and measures its
distance to each of them. Contacting only one node would tell P about the
circle it is located on; contacting only two nodes would tell it about the
position of the intersection of two circles (which generally consists of two
points); a third node would subsequently allow P to compute its actual location.
Figure 6-18. Computing a
node's position in a two-dimensional space.
Just as in GPS, node P can
compute its own coordinates (xP, yP) by solving the three equations with the two
unknowns xP and yP:

di = √((xi − xP)² + (yi − yP)²)   for i = 1, 2, 3
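These equations can be solved directly: subtracting the first circle equation from the other two leaves a small linear system in xP and yP. The sketch below does exactly that; the helper name and the example coordinates are ours.

def trilaterate(anchors, dists):
    # anchors: [(x1, y1), (x2, y2), (x3, y3)] with known positions.
    # dists:   measured distances [d1, d2, d3] from P to each anchor.
    (x1, y1), (x2, y2), (x3, y3) = anchors
    d1, d2, d3 = dists
    # Subtracting circle 1 from circles 2 and 3 gives two linear equations.
    a11, a12 = 2 * (x1 - x2), 2 * (y1 - y2)
    a21, a22 = 2 * (x1 - x3), 2 * (y1 - y3)
    b1 = d2**2 - d1**2 + x1**2 - x2**2 + y1**2 - y2**2
    b2 = d3**2 - d1**2 + x1**2 - x3**2 + y1**2 - y3**2
    det = a11 * a22 - a12 * a21          # the anchors must not be collinear
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det)

# Example: anchors at (0,0), (10,0), (0,10); P is actually at (3,4).
print(trilaterate([(0, 0), (10, 0), (0, 10)], [5.0, 65 ** 0.5, 45 ** 0.5]))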
As said, di generally
corresponds to measuring the latency between P and the node at (xi,yi). This
latency can be estimated as being half the round-trip delay, but it should be
clear that its value will differ over time. The effect is that P may end up at a
different position each time it recomputes its coordinates. Moreover, if other
nodes use P's current position to compute their own coordinates, then it
should be clear that the error in positioning P will affect the accuracy of the
positioning of those other nodes.
Moreover, it should also be
clear that measured distances by different nodes will generally not even be
consistent. For example, assume we are computing distances in a one-dimensional
space as shown in Fig. 6-19. In this example, we see that although R measures
its distance to Q as 2.0, and d (P,Q) has been measured to be 1.0, when R measures
d (P,R) it finds 3.2, which is clearly inconsistent with the other two
measurements.
Figure 6-19. Inconsistent
distance measurements in a one-dimensional space.
Fig. 6-19 also suggests how
this situation can be improved. In our simple example, we could solve the
inconsistencies by merely computing positions in a two-dimensional space. This
by itself, however, is not a general solution when dealing with many measurements.
In fact, considering that Internet latency measurements may violate the
triangle inequality, it is generally impossible to resolve inconsistencies
completely. The triangle inequality states that in a geometric space, for any
three nodes P, Q, and R it must always be true that d(P,R) ≤ d(P,Q) + d(Q,R).
There are various ways to
approach these issues. One approach, proposed by Ng and Zhang (2002) is to use
L special nodes b1, . . . , bL, known as landmarks. Landmarks measure their
pair-wise latencies d(bi,bj) and subsequently let a central node compute the
coordinates for each landmark. To this end, the central node seeks to minimize
the following aggregated error function:

Σ_{i = 1 .. L} Σ_{j = i+1 .. L} [ (d̂(bi, bj) − d(bi, bj)) / d(bi, bj) ]²

where d̂(bi, bj) corresponds to the geometric distance, that is,
the distance between bi and bj after the nodes have been positioned.
The hidden parameter in
minimizing the aggregated error function is the dimension m. Obviously, we have
that L > m, but nothing prevents us from choosing a value for m that is much
smaller than L. In that case, a node P measures its distance to each of the L
landmarks and computes its own coordinates by minimizing

Σ_{i = 1 .. L} [ (d̂(bi, P) − d(bi, P)) / d(bi, P) ]²
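A sketch of this second step is shown below: it positions a single node against already-placed landmarks by minimizing the sum of squared relative errors, using SciPy's general-purpose minimizer. The function name, starting point, and the example coordinates and latencies are assumptions; the landmark coordinates themselves would have been computed beforehand by minimizing the pairwise error function in the same fashion.

import numpy as np
from scipy.optimize import minimize

def position_node(landmark_coords, latencies):
    # landmark_coords: (L, m) array of already-computed landmark positions.
    # latencies:       length-L array of measured distances d(b_i, P).
    landmark_coords = np.asarray(landmark_coords, dtype=float)
    latencies = np.asarray(latencies, dtype=float)

    def error(p):                      # sum of squared relative errors
        geo = np.linalg.norm(landmark_coords - p, axis=1)
        return np.sum(((geo - latencies) / latencies) ** 2)

    start = landmark_coords.mean(axis=0)          # any reasonable starting point
    return minimize(error, start, method="Nelder-Mead").x

# Assumed 2-D example: three landmarks and the latencies P measured to them.
coords = [(0.0, 0.0), (60.0, 0.0), (0.0, 80.0)]
print(position_node(coords, latencies=[50.0, 50.0, 50.0]))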
As it turns out, with
well-chosen landmarks, m can be as small as 6 or 7, with d̂(P,Q) being no more
than a factor of 2 different from the actual latency d(P,Q) for arbitrary nodes
P and Q (Szymaniak et al., 2004).
Another way to tackle this
problem is to view the collection of nodes as a huge system in which nodes are
attached to each other through springs. In this case, |d̂(P,Q) − d(P,Q)| indicates
to what extent nodes P and Q are
displaced relative to the situation in which the system of springs would be at
rest. By letting each node (slightly) change its position, it can be shown that
the system will eventually converge to an optimal organization in which the aggregated
error is minimal. This approach is followed in Vivaldi, of which the details
can be found in Dabek et al. (2004a).
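The sketch below illustrates the spring idea in a simplified, centralized form rather than Vivaldi's actual decentralized protocol: every spring is relaxed a little per round, and the step size, iteration count, and toy latencies are all assumed.

import random

def spring_relax(latency, dims=2, rounds=2000, delta=0.01, seed=1):
    # latency: dict mapping a pair of node names (P, Q) to the measured latency d(P, Q).
    # Each round, every "spring" is relaxed a little by moving both endpoints along
    # the line connecting them, proportionally to the positioning error.
    rng = random.Random(seed)
    nodes = {n for pair in latency for n in pair}
    pos = {n: [rng.random() for _ in range(dims)] for n in nodes}
    for _ in range(rounds):
        for (p, q), d in latency.items():
            diff = [a - b for a, b in zip(pos[p], pos[q])]
            dist = max(sum(c * c for c in diff) ** 0.5, 1e-9)
            error = dist - d                    # > 0: stretched, < 0: compressed
            unit = [c / dist for c in diff]     # unit vector pointing from q to p
            pos[p] = [x - delta * error * u for x, u in zip(pos[p], unit)]
            pos[q] = [x + delta * error * u for x, u in zip(pos[q], unit)]
    return pos

# Toy input (assumed): three nodes whose pairwise latencies fit a 2-D space exactly.
print(spring_relax({("A", "B"): 3.0, ("B", "C"): 4.0, ("A", "C"): 5.0}))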
6.5. Election Algorithms
Many distributed algorithms
require one process to act as coordinator, initiator, or otherwise perform some
special role. In general, it does not matter which process takes on this
special responsibility, but one of them has to do it. In this section we will
look at algorithms for electing a coordinator (using this as a generic name for
the special process).
If all processes are exactly
the same, with no distinguishing characteristics, there is no way to select one
of them to be special. Consequently, we will assume that each process has a
unique number, for example, its network address (for simplicity, we will assume
one process per machine). In general, election algorithms attempt to locate the
process with the highest process number and designate it as coordinator. The
algorithms differ in the way they do the location.
Furthermore, we also assume
that every process knows the process number of every other process. What the
processes do not know is which ones are currently up and which ones are
currently down. The goal of an election algorithm is to ensure that when an
election starts, it concludes with all processes agreeing on who the new
coordinator is to be. There are many algorithms and variations, of which several
important ones are discussed in the text books by Lynch (1996) and Tel (2000),
respectively.
6.5.1. Traditional Election
Algorithms
We start by taking a look
at two traditional election algorithms to give an impression of what whole groups
of researchers have been doing in the past decades. In subsequent sections, we
pay attention to new applications of the election problem.
The Bully Algorithm
As a first example, consider
the bully algorithm devised by Garcia-Molina (1982). When any process notices
that the coordinator is no longer responding to requests, it initiates an
election. A process, P, holds an election as follows:
P sends an ELECTION message
to all processes with higher numbers.
If no one responds, P wins
the election and becomes coordinator.
If one of the higher-ups
answers, it takes over. P's job is done.
At any moment, a process can
get an ELECTION message from one of its lower-numbered colleagues. When such a
message arrives, the receiver sends an OK message back to the sender to
indicate that it is alive and will take over. The receiver then holds an
election, unless it is already holding one. Eventually, all processes give up
but one, and that one is the new coordinator. It announces its victory by
sending all processes a message telling them that starting immediately it is
the new coordinator.
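A Python sketch of this decision logic is given below. It is not Garcia-Molina's code: the reliable send callable and the timeout-based wait_for_ok helper are assumptions, and detecting that the coordinator has failed is left out.

class BullyProcess:
    # Sketch of the bully algorithm's decision logic. The messaging layer is
    # assumed: send(dest, msg) delivers reliably, and wait_for_ok() returns True
    # if any higher-numbered process answered our ELECTION within a timeout.

    def __init__(self, my_id, all_ids, send, wait_for_ok):
        self.my_id, self.all_ids = my_id, all_ids
        self.send, self.wait_for_ok = send, wait_for_ok
        self.coordinator = max(all_ids)

    def hold_election(self):
        higher = [p for p in self.all_ids if p > self.my_id]
        for p in higher:
            self.send(p, ("ELECTION", self.my_id))
        if not higher or not self.wait_for_ok():
            self.become_coordinator()       # nobody bigger is alive
        # otherwise a higher-numbered process takes over; wait for COORDINATOR

    def on_election(self, sender_id):
        if sender_id < self.my_id:
            self.send(sender_id, ("OK", self.my_id))   # I am alive and take over
            self.hold_election()

    def become_coordinator(self):
        self.coordinator = self.my_id
        for p in self.all_ids:
            if p != self.my_id:
                self.send(p, ("COORDINATOR", self.my_id))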
If a process that was
previously down comes back up, it holds an election. If it happens to be the
highest-numbered process currently running, it will win the election and take
over the coordinator's job. Thus the biggest guy in town always wins, hence the
name "bully algorithm."
In Fig. 6-20 we see an
example of how the bully algorithm works. The group consists of eight
processes, numbered from 0 to 7. Previously process 7 was the coordinator, but
it has just crashed. Process 4 is the first one to notice this, so it sends
ELECTION messages to all the processes higher than it, namely 5, 6, and 7, as
shown in Fig. 6-20(a). Processes 5 and 6 both respond with OK, as shown in Fig.
6-20(b). Upon getting the first of these responses, 4 knows that its job is over.
It knows that one of these bigwigs will take over and become coordinator. It
just sits back and waits to see who the winner will be (although at this point
it can make a pretty good guess).
Figure 6-20. The bully
election algorithm. (a) Process 4 holds an election. (b) Processes 5 and 6
respond, telling 4 to stop. (c) Now 5 and 6 each hold an election. (d) Process
6 tells 5 to stop. (e) Process 6 wins and tells everyone.
In Fig. 6-20(c), both 5 and
6 hold elections, each one only sending messages to those processes higher than
itself. In Fig. 6-20(d) process 6 tells 5 that it will take over. At this point
6 knows that 7 is dead and that it (6) is the winner. If there is state
information to be collected from disk or elsewhere to pick up where the old
coordinator left off, 6 must now do what is needed. When it is ready to take
over, 6 announces this by sending a COORDINATOR message to all running
processes. When 4 gets this message, it can now continue with the operation it
was trying to do when it discovered that 7 was dead, but using 6 as the
coordinator this time. In this way the failure of 7 is handled and the work can
continue.
If process 7 is ever
restarted, it will just send all the others a COORDINATOR message and bully
them into submission.
A Ring Algorithm
Another election algorithm
is based on the use of a ring. Unlike some ring algorithms, this one does not
use a token. We assume that the processes are physically or logically ordered,
so that each process knows who its successor is. When any process notices that
the coordinator is not functioning, it builds an ELECTION message containing
its own process number and sends the message to its successor. If the successor
is down, the sender skips over the successor and goes to the next member along
the ring, or the one after that, until a running process is located. At each
step along the way, the sender adds its own process number to the list in the
message effectively making itself a candidate to be elected as coordinator.
Eventually, the message gets
back to the process that started it all. That process recognizes this event
when it receives an incoming message containing its own process number. At that
point, the message type is changed to COORDINATOR and circulated once again,
this time to inform everyone else who the coordinator is (the list member with
the highest number) and who the members of the new ring are. When this message
has circulated once, it is removed and everyone goes back to work.
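The two message-handling rules can be sketched as follows. The succ(p) callable, assumed to return the next live process on the ring, and the reliable send callable are not part of the original description.

def start_election(me, succ, send):
    # Called when a process notices that the coordinator is down.
    send(succ(me), ("ELECTION", [me]))

def on_election(me, succ, send, members):
    # Handle an ELECTION message carrying the candidate list `members`.
    if me in members:
        # The message has come all the way around: announce the winner.
        winner = max(members)                        # highest number wins
        send(succ(me), ("COORDINATOR", winner, members))
    else:
        send(succ(me), ("ELECTION", members + [me]))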
In Fig. 6-21 we see what
happens if two processes, 2 and 5, discover simultaneously that the previous
coordinator, process 7, has crashed. Each of these builds an ELECTION message
and each of them starts circulating its message, independent of the other
one. Eventually, both messages will go all the way around, and both 2 and 5
will convert them into COORDINATOR messages, with exactly the same members and
in the same order. When both have gone around again, both will be removed. It
does no harm to have extra messages circulating; at worst it consumes a little
bandwidth, but this is not considered wasteful.
Figure 6-21. Election
algorithm using a ring.
6.5.2. Elections in Wireless
Environments
Traditional election
algorithms are generally based on assumptions that are not realistic in
wireless environments. For example, they assume that message passing is
reliable and that the topology of the network does not change. These
assumptions are false in most wireless environments, especially those for
mobile ad hoc networks.
Only a few election protocols have been developed that work in ad hoc networks.
Vasudevan et al.
(2004) propose a solution that can handle failing nodes and partitioning networks.
An important property of their solution is that the best leader can be elected
rather than just a random one, as was more or less the case in the previously
discussed solutions. Their protocol works as follows. To simplify our
discussion, we concentrate only on ad hoc networks and ignore that nodes can
move.
Consider a wireless ad hoc
network. To elect a leader, any node in the network, called the source, can
initiate an election by sending an ELECTION message to its immediate neighbors
(i.e., the nodes in its range). When a node receives an ELECTION for the first
time, it designates the sender as its parent, and subsequently sends out an
ELECTION message to all its immediate neighbors, except for the parent. When a
node receives an ELECTION message from a node other than its parent, it merely
acknowledges the receipt.
When node R has designated
node Q as its parent, it forwards the ELECTION message to its immediate
neighbors (excluding Q) and waits for acknowledgments to come in before
acknowledging the ELECTION message from Q. This waiting has an important
consequence. First, note that neighbors that have already selected a parent
will immediately respond to R. More specifically, if all neighbors already have
a parent, R is a leaf node and will be able to report back to Q quickly. In
doing so, it will also report information such as its battery lifetime and
other resource capacities.
This information will later
allow Q to compare R's capacities to those of other downstream nodes, and select
the best eligible node for leadership. Of course, Q had sent an ELECTION
message only because its own parent P had done so as well. In turn, when Q
eventually acknowledges the ELECTION message previously sent by P, it will pass
the most eligible node to P as well. In this way, the source will eventually
get to know which node is best to be selected as leader, after which it will
broadcast this information to all other nodes.
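The following sketch mimics what the protocol computes, collapsing the build-tree phase and the reporting phase into a single depth-first traversal over the neighbor graph. The graph, capacities, and function names are illustrative assumptions; real message passing, acknowledgments, and node mobility are ignored.

def best_leader(graph, capacity, source):
    # Depth-first walk that mimics the build-tree phase and the aggregation
    # of the best (capacity, node) pair toward the source.
    # graph: dict mapping node -> list of neighbors within radio range.
    visited = {source}

    def report(node):                       # what `node` reports to its parent
        best = (capacity[node], node)
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)             # nb picks `node` as its parent
                best = max(best, report(nb))
        return best

    cap, leader = report(source)
    return leader, cap

# Assumed toy network (not Fig. 6-22): capacities in a small ad hoc graph.
g = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
caps = {"a": 2, "b": 4, "c": 1, "d": 6}
print(best_leader(g, caps, "a"))            # -> ('d', 6)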
This process is illustrated
in Fig. 6-22. Nodes have been labeled a to j, along with their capacity. Node a
initiates an election by broadcasting an ELECTION message to nodes b and j, as
shown in Fig. 6-22(b). After that step, ELECTION messages are propagated to all
nodes, ending with the situation shown in Fig. 6-22(e), where we have omitted
the last broadcast by nodes f and i. From there on, each node reports to its
parent the node with the best capacity, as shown in Fig. 6-22(f). For example,
when node g receives the acknowledgments from its children e and h, it will
notice that h is the best node, propagating [h, 8] to its own parent, node b.
In the end, the source will note that h is the best leader and will broadcast
this information to all other nodes.
Figure 6-22. Election
algorithm in a wireless network, with node a as the source. (a) Initial
network. (b)–(e) The build-tree phase (last broadcast step by nodes f and i not
shown). (f)
Reporting of best node to
source.
When multiple elections are
initiated, each node will decide to join only one election. To this end, each
source tags its ELECTION message with a unique identifier. Nodes will
participate only in the election with the highest identifier, stopping any ongoing
participation in other elections.
With some minor adjustments,
the protocol can be shown to operate also when the network partitions, and when
nodes join and leave. The details can be found in Vasudevan et al. (2004).
6.5.3. Elections in
Large-Scale Systems
The algorithms we have been
discussing so far generally apply to relatively small distributed systems.
Moreover, the algorithms concentrate on the selection of only a single node.
There are situations when several nodes should actually be selected, such as in
the case of superpeers in peer-to-peer networks, which we discussed in Chap. 2.
In this section, we concentrate specifically on the problem of selecting
superpeers.
Lo et al. (2005) identified
the following requirements that need to be met for superpeer selection:
Normal nodes should have
low-latency access to superpeers.
Superpeers should be evenly
distributed across the overlay network.
There should be a predefined
proportion of superpeers relative to the total number of nodes in the overlay network.
Each superpeer should not
need to serve more than a fixed number of normal nodes.
Fortunately, these
requirements are relatively easy to meet in most peer-to-peer systems, given
the fact that the overlay network is either structured (as in DHT-based
systems), or randomly unstructured (as, for example, can be realized with
gossip-based solutions). Let us take a look at solutions proposed by Lo et al.
(2005).
In the case of DHT-based
systems, the basic idea is to reserve a fraction of the identifier space for
superpeers. Recall that in DHT-based systems each node receives a random and
uniformly assigned m-bit identifier. Now suppose we reserve the first (i.e.,
leftmost) k bits to identify superpeers. For example, if we need N superpeers,
then the first k = ⌈log₂(N)⌉ bits of any key can be used to identify these nodes.
To explain, assume we have a
(small) Chord system with m = 8 and k = 3. When looking up the node responsible
for a specific key p, we can first decide to route the lookup request to the
node responsible for the pattern
p AND 11100000
which is then treated as the
superpeer. Note that each node with identifier id can check whether it is a superpeer by
looking up
id AND 11100000
to see if this request is
routed to itself. Provided node identifiers are uniformly assigned to nodes, it
can be seen that with a total of N nodes the number of superpeers is, on
average, equal to 2^(k−m) N.
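In code, the check amounts to a bit mask, as sketched below for m = 8 and k = 3. The lookup callable stands in for the DHT's ordinary key-resolution step and is assumed to be available.

M, K = 8, 3                                 # m-bit identifiers, first k bits mark superpeers
SUPER_MASK = ((1 << K) - 1) << (M - K)      # 0b11100000 for m = 8, k = 3

def superpeer_key(key):
    # The id pattern whose responsible node acts as superpeer for `key`.
    return key & SUPER_MASK

def is_superpeer(node_id, lookup):
    # A node checks whether the lookup of its own masked id routes back to itself.
    # lookup(key) is the DHT's normal resolution function, assumed to be given.
    return lookup(node_id & SUPER_MASK) == node_id

print(bin(superpeer_key(0b10110101)))       # -> 0b10100000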
A completely different
approach is based on positioning nodes in an m-dimensional geometric space as
we discussed above. In this case, assume we need to place N superpeers evenly
throughout the overlay. The basic idea is simple: a total of N tokens are
spread across N randomly-chosen nodes. No node can hold more than one token.
Each token represents a repelling force by which another token is inclined to
move away. The net effect is that if all tokens exert the same repulsion force,
they will move away from each other and spread themselves evenly in the geometric
space.
This approach requires that
nodes holding a token learn about other tokens. To this end, Lo et al. propose
to use a gossiping protocol by which a token's force is disseminated throughout
the network. If a node discovers that the total forces that are acting on it
exceed a threshold, it will move the token in the direction of the combined
forces, as shown in Fig. 6-23.
Figure 6-23. Moving tokens
in a two-dimensional space using repulsion forces.
When a token is held by a
node for a given amount of time, that node will promote itself to superpeer.
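A sketch of one gossip round at a token-holding node is given below. It treats the token as a point in the plane and uses an inverse-square repulsion force, both of which are our assumptions; in the actual protocol, "moving" a token means handing it to a node lying in the direction of the combined force.

def token_step(my_pos, other_token_positions, threshold=0.5, step=0.1):
    # Sum the repulsion forces exerted by the other tokens and move the token
    # a little in the direction of the combined force if it is strong enough.
    fx = fy = 0.0
    for (ox, oy) in other_token_positions:
        dx, dy = my_pos[0] - ox, my_pos[1] - oy
        d2 = max(dx * dx + dy * dy, 1e-9)
        fx += dx / d2 ** 1.5                 # unit vector away from the other token,
        fy += dy / d2 ** 1.5                 # scaled by 1/d^2 (assumed force law)
    if (fx * fx + fy * fy) ** 0.5 > threshold:
        return (my_pos[0] + step * fx, my_pos[1] + step * fy)
    return my_pos                            # combined force too weak: stay put

print(token_step((0.0, 0.0), [(1.0, 0.0), (0.0, 2.0)]))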