Estimating ultra-large phylogenies and alignments

Biomolecular sequences evolve under processes that include substitutions, insertions and deletions (indels), as well as other events, such as duplications. The estimation of evolutionary history from sequences is then used to answer fundamental questions about biology, and also has applications in a wide range of biomedical research.

From a computational perspective, however, phylogenetic (evolutionary) tree estimation is enormously hard: all favored approaches are NP-hard, and even the best heuristics can take months or years on only moderately large datasets. Furthermore, while there are very good heuristics for estimating trees from sequences that are already placed in a multiple alignment (a step that is used when sequences evolve with indels), errors in alignment estimation produce errors in tree estimation, and the standard alignment estimation methods fail to produce highly accurate alignments on large highly divergent datasets. Thus, the estimation of highly accurate phylogenetic trees from large datasets of unaligned sequences is beyond the scope of standard methods.

In this talk, I will describe new algorithmic tools that my group has developed, and which make it possible, for the ﬁrst time, to obtain highly accurate estimates of trees from very large datasets, even when the sequences have evolved under high rates of substitution and indels. In particular, I will describe SAT´e (Liu et al. 2009, Science Vol 324, no. 5934). SAT´e simultaneously estimates a tree and alignment; our study shows that SAT´e is shows that SAT´e is very fast, and produces dramatically more accurate trees and alignments than competing methods, even on datasets with 1000 taxa and high rates of indels and substitutions. I will also describe our new method, DACTAL (not yet submitted). DACTAL stands for “Divide-and-Conquer Trees without Alignments”, and uses an iterative procedure combined with a novel divide-and-conquer strategy to estimate trees from unaligned sequences. Our study, using both real and simulated data, shows that DACTAL produces trees of higher accuracy than SAT´e, and does so without ever constructing an alignment on the entire set of sequences. Furthermore, DACTAL is extremely fast, producing highly accurate estimates of datasets in a few days that take many other methods years. Time permitting, I will show how DACTAL can be used to improve the speed and accuracy of other phylogeny reconstruction methods, and in particular in the context of phylogenetic analyses of whole genomes.

Speaker Biography

Tandy Warnow is David Bruton Jr. Centennial Professor of Computer Sciences at the University of Texas at Austin. Her research combines mathematics, computer science, and statistics to develop improved models and algorithms for reconstructing complex and large-scale evolutionary histories in both biology and historical linguistics. Tandy received her PhD in Mathematics at UC Berkeley under the direction of Gene Lawler, and did postdoctoral training with Simon Tavare and Michael Waterman at USC. She received the National Science Foundation Young Investigator Award in 1994, the David and Lucile Packard Foundation Award in Science and Engineering in 1996, a Radcliffe Institute Fellowship in 2006, and a Guggenheim Foundation Fellowship for 2011. Tandy is a member of five graduate programs at the University of Texas, including Computer Science; Ecology, Evolution, and Behavior; Molecular and Cellular Biology; Mathematics; and Computational and Applied Mathematics. Her current research focuses on phylogeny and alignment estimation for very large datasets (10,000 to 500,000 sequences), estimating species trees from collections of gene trees, and genome rearrangement phylogeny estimation.