Highly Available Distributed Storage Systems

Distributed data storage systems play an important role in building reliable and efficient distributed computing environments, such as the RAIN (Redundant Array of Independent Nodes) at Caltech. This talk will discuss key issues in achieving high availability (reliability and efficiency) in distributed storage systems. In particular, I will describe the design and implementation of a novel approach to improving the performance of reliable (n,k) data servers, which are natural generalizations of RAID systems. I will present a systematic framework for designing such servers, along with performance evaluation results.
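
As a rough illustration of the (n,k) idea, the sketch below assumes the usual meaning of the notation, which the abstract itself does not spell out: data is spread over n nodes so that any k of them suffice to recover it. It shows only the simplest RAID-like case, n = k + 1 with a single XOR parity block. It is not the B-Code or the server design presented in the talk, and the function names are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def encode(data_blocks):
    """Place k data blocks on n = k + 1 nodes by appending one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def decode(surviving, k):
    """Recover the k data blocks from {node_index: block} for at least k surviving nodes.

    Node indices 0..k-1 hold data; node index k holds the parity block.
    """
    if len(surviving) < k:
        raise ValueError("need at least k surviving nodes")
    missing = [i for i in range(k) if i not in surviving]
    if not missing:
        return [surviving[i] for i in range(k)]
    # Exactly one data block is lost, so the parity block is among the survivors:
    # XOR-ing every surviving block cancels the known data and leaves the lost block.
    rebuilt = xor_blocks(list(surviving.values()))
    return [surviving.get(i, rebuilt) for i in range(k)]

# A (3,2) layout: lose any single node and the data is still recoverable.
data = [b"RAIN", b"DATA"]
nodes = encode(data)
for lost in range(len(nodes)):
    surviving = {i: block for i, block in enumerate(nodes) if i != lost}
    assert decode(surviving, k=2) == data
```

A single parity block tolerates only one node failure; tolerating more failures requires a stronger error-correcting code.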

The theory of error-correcting codes serves as the mathematical foundation for these storage systems. I will describe my recent work on designing new classes of error-correcting codes, one of which is called B-Code, that have efficient encoding and decoding procedures as well as other features that make them well suited to storage systems. The talk will conclude with possible directions for future research.