Fast and accurate estimation of large-scale phylogenetic alignments and trees
Abstract
Phylogenetics is the study of evolutionary relationships. Phylogenetic trees and alignments play important roles in a wide range of biological research, including reconstruction of the Tree of Life (the evolutionary history of all organisms on Earth) and the development of vaccines and antibiotics. Today's phylogenetic studies seek to reconstruct trees and alignments on a greater number and variety of organisms than ever before, primarily due to exponential growth in affordable sequencing and computing power. The importance of phylogenetic trees and alignments motivates the need for methods to reconstruct them accurately and efficiently on large-scale datasets.
Traditionally, phylogenetic studies proceed in two phases: first, an alignment is produced from biomolecular sequences with differing lengths, and, second, a tree is produced using the alignment. My dissertation presents the first empirical performance study of leading two-phase methods on datasets with up to hundreds of thousands of sequences. Relatively accurate alignments and trees were obtained using methods with high computational requirements on datasets with a few hundred sequences, but as datasets grew past 1000 sequences and up to tens of thousands of sequences, the set of methods capable of analyzing a dataset diminished and only the methods with the lowest computational requirements and lowest accuracy remained.
Alternatively, methods have been developed to simultaneously estimate phylogenetic alignments and trees. Methods that address the treelength optimization problem, the most widely used approach for simultaneous estimation, have not been shown to return more accurate trees and alignments than two-phase approaches. I demonstrate that treelength optimization under a particular class of optimization criteria represents a promising means for inferring accurate trees and alignments. The other methods for simultaneous estimation are not known to support analyses of datasets with a few hundred sequences due to their high computational requirements.
The main contribution of my dissertation is SATe, the first fast and accurate method for simultaneous estimation of alignments and trees on datasets with up to several thousand nucleotide sequences. SATe improves upon the alignment and topological accuracy of all existing methods, especially on the most difficult-to-align datasets, while retaining reasonable computational requirements.