Monday, April 26, 2010

Tight bounds for Clock Synchronization [Lenzen, Locher, Wattenhofer PODC '09]

A distributed system consists of autonomous nodes (usually with an individual clock) communicating through a network. Such systems need a common notion of time to run most tasks. For instance, a multi-person video game on a network needs a way to order actions on different computers that is very close to their actual order. Similar requirements might be needed of, say, a global network of sensors or for a system time-sharing a common resource (like TDMA networks). While this may be easy if all the nodes of the system shared an identical copy of a clock that always kept time perfectly, real clocks are not perfect and tend to drift. If left alone uncorrected, the skew (difference) between clocks in these nodes can become too large and intolerable.

One way to improve this situation is to let the nodes have a logical clock on top of their hardware clocks. They could communicate their logical clock values and adjust them based on the values of other nodes in the system (according to some algorithm). The main practical challenges here are that the hardware clock drift rate and the latency of the communications between nodes is variable and unknown to these nodes. At best, the nodes may be aware of some upper bounds on the hardware clock drift rate and message latencies ($\tau$). Further, an algorithm that adjusts logical clock values can be expected to have certain reasonable properties:

  1. The largest skew between any two logical clocks in the network (global skew) is small.
  2. The largest skew between the logical clocks of any pair of adjacent nodes in the network (local skew) is small.
  3. The maximum possible skew between any pair of nodes degrades smoothly as a function of the distance between the nodes (gradient property).
  4. The rate of progress of logical clock is neither too small nor too large. This is to ensure that the spacing between logical times given to events at one node compares well with their real times.

Evidently, the skew depends on the size of network, and more specifically, on the diameter $D$ of the network. It has been shown that no algorithm can achieve a better bound on global skew than $D/2$ (Assume $\tau = 1$) [Biaz, Welch '01]. More surprisingly, [Fan, Lynch '04] showed that no algorithm could achieve a local skew better than $\Omega(\log D/ \log\log D)$. While algorithms that match the bound on global skews were known, matching the local skew bound has been elusive. The paper we will be discussing [LLW-PODC'09, J.ACM'10] solves this problem, and also presents an improved lower bound for local skew.

In the discussion, we will see a more formal statement of the model. We will discuss why certain simple algorithms like setting a node's logical clock to the average or maximum of its neighbors fail. We will then see how ideas from these failed algorithms can be put together to construct a simple algorithm (the proofs, however, are a bit long) that has optimal global and local skews and a good gradient property. We will also talk about lower bounds on global and (time permitting) local skews.

Note: In case you want to read the paper, [LLW-PODC'09] contains only the statements of the results. Also, I found the two associated tech reports diffcult to follow. Their recent journal article in J.ACM-Jan'10 is much more readable and contains proofs in good detail. Locher and Wattenhofer's earlier paper in DISC'06 is a good background reading and helps you appreciate the algorithm better.

No comments:

Post a Comment