As some scientific projects begin to collect data well into the petabyte range, the technology used to analyze and retrieve such data must evolve appropriately. When presented with such a large volume, prior data delivery techniques, such as the time-honored system of simply copying all data to local storage prior to analysis, become impractical. This thesis studies the use of databases to host large-scale scientific data. We describe early systems, such as SkyServer and CASJobs, and discuss an implementation of automatic provenance within them. We also discuss observations regarding query patterns collected from such systems over time. The thesis concludes with a discussion of TileDB, a novel distributed computing framework using independent shared-nothing databases. In addition to features common to many distributed systems, such as automatic parallelization and incremental fault tolerance, TileDB combines several features that are fairly novel within the field, such as dynamic allocation of both data and work, long-term adaptive data curation, and transparent integration with existing database deployments.
Speaker Biography
Nolan Li first attended Johns Hopkins University as an undergraduate in 1998. After some deliberation, he chose to pursue Computer Science as a course of study. He has been working with the Sloan Digital Sky Survey, as an employee or student, since 2001. He is the author of several tools currently in use by the project, including CASJobs and Open SkyQuery. His research focuses on large-scale data science, and his papers have been presented at conferences such as ADASS, Microsoft E-Science, and Supercomputing.