With the explosion of data and the increasing complexity of data analysis, large- scale data analysis imposes significant challenges in systems design. While current research focuses on scaling out to large clusters, these scale-out solutions introduce a significant amount of overhead. This thesis is motivated by the advance of new I/O technologies such as flash memory. Instead of scaling out, we explore efficient system designs in a single commodity machine with non-uniform memory architec- ture (NUMA) and scale to large datasets by utilizing commodity solid-state drives (SSDs). This thesis explores the impact of the new I/O technologies on large-scale data analysis. Instead of implementing individual data analysis algorithms for SSDs, we develop a data analysis ecosystem called FlashX to target a large range of data analysis tasks. FlashX includes three subsystems: SAFS, FlashGraph and Flash- Matrix. SAFS is a user-space filesystem optimized for a large SSD array to deliver maximal I/O throughput from SSDs. FlashGraph is a general-purpose graph analy- sis framework that processes graphs in a semi-external memory fashion, i.e., keeping vertex state in memory and edges on SSDs, and scales to graphs with billions of vertices by utilizing SSDs through SAFS. FlashMatrix is a matrix-oriented program- ming framework that supports both sparse matrices and dense matrices for general data analysis. Similar to FlashGraph, it scales matrix operations beyond memory capacity by utilizing SSDs. We demonstrate that with the current I/O technologies FlashGraph and FlashMatrix in the (semi-)external-memory meets or even exceeds state-of-the-art in-memory data analysis frameworks while scaling to massive datasets for a large variety of data analysis tasks.
Speaker Biography
Da Zheng received the BS degree in computer science from Zhejiang University, China, in 2006, and the MS degree in computer science from Ecole polytechnique federale de Lausanne, in 2009. Since 2010, he is a PhD student of computer science at Johns Hopkins University. His research interests include high-performance computing, large-scale data analysis systems, and large-scale machine learning.