Much of our critical infrastructure is controlled by large software systems whose participants are distributed across the Internet. As our dependence on these systems continues to grow, it becomes increasingly important that they meet strict availability and performance requirements, even in the face of malicious attacks, including attacks that succeed in compromising parts of the system. This talk presents the first replication protocols capable of guaranteeing correctness, availability, and good performance even when some of the servers are compromised, enabling the construction of highly available and highly resilient systems for our critical infrastructure.
Prior to this work, intrusion-tolerant replication protocols were designed to perform well in fault-free executions, and they were evaluated under the same conditions. We point out that many state-of-the-art protocols are vulnerable to significant performance degradation caused by a small number of malicious processors. We present Prime, a new intrusion-tolerant replication protocol that bounds the amount of performance degradation that compromised machines can cause, assuming the network is sufficiently stable. Using Prime as a building block, we show how to design and implement an attack-resilient, large-scale intrusion-tolerant replication system for wide-area networks. Our results provide evidence that it is possible to construct highly resilient, large-scale survivable systems that perform well even when some of the servers (and some entire sites) are compromised.