The design of a dependable distributed system is a highly complex task that requires a structured approach. In this research, we adopt a layered design philosophy in which fundamental primitive services such as reliable multicast and fault detection are used to build more complex services such as consensus, distributed fault diagnosis, and checkpoint/rollback. One of the most fundamental services in a dependable distributed system is clock synchronization. A synchronization primitive simplifies the specification of many important aspects of distributed systems including, among others, process coordination, total event ordering, checkpointing, at-most-once message delivery, cache consistency, atomic broadcast, and deadline observance.
In this talk, we present a novel approach to clock synchronization in large distributed systems known as multistep interactive convergence (m-ICV). m-ICV has a communication cost that is orders of magnitude lower than that of traditional approaches, and it achieves significantly tighter synchronization than approaches of comparable communication cost. We give an overview of m-ICV and its performance, compare it to alternative synchronization approaches, and discuss its implementation and validation on several platforms.