Browsing by Subject "Multi-species coalescent model"
Item Estimating species trees from gene trees despite gene tree incongruence under realistic model conditions (2016-12)
Bayzid, Md. Shamsuzzoha; Ghosh, Joydeep; Warnow, Tandy, 1955-; Ramachandran, Vijaya; Plaxton, Greg; Ravikumar, Pradeep
Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. With the rapid growth in the number of newly sequenced genomes, species tree inference from multiple genes has become one of the basic and popular tasks in comparative and evolutionary biology. However, combining data on multiple genes is not a trivial task, since genes evolve through biological processes that include deep coalescence (also known as incomplete lineage sorting (ILS)), duplication and loss, horizontal gene transfer, etc., so that the individual gene histories can differ from each other. In this dissertation, we focus on advancing phylogenomic analyses, with particular attention to gene tree discordance. In addition to gene tree discordance, we consider other challenging conditions that frequently arise in genome-scale data. One of these major challenges is incomplete gene trees, meaning that not all gene trees have individuals from all the species. We performed an extensive simulation study under the multi-species coalescent (MSC) model that shows that existing methods have poor accuracy when gene trees are incomplete. We formalized the optimal completion problem, which seeks to add the missing taxa (species) into the gene trees with respect to a species tree such that the distance (in terms of ILS) between the gene tree and the species tree is minimized. We developed an algorithm for solving this problem. We formalized optimization problems in the context of species tree estimation from a set of incomplete gene trees under the multi-species coalescent model, and proposed algorithms for solving these problems.
We formulated different mathematical models for “gene loss” based on different reasons for incompleteness. Next, we addressed the Minimize Gene Duplication (MGD) problem, which seeks to find a species tree from a set of gene trees so as to minimize the total number of duplications needed to explain the evolutionary history. We proposed exact and heuristic algorithms to solve this NP-hard problem. Next, we showed in a comprehensive experimental study that existing methods are susceptible to poorly estimated gene trees in the presence of ILS. We proposed a new technique called “binning” that dramatically improves the performance of species tree estimation methods when gene trees are poorly estimated. We developed a novel technique called “naive binning” and subsequently proposed an improved version called “weighted statistical binning” to address the problem of gene tree estimation error. Finally, we addressed the computational challenges of reconstructing highly accurate species trees from large-scale genomic data. We developed divide-and-conquer-based meta-methods that can make existing methods scalable to very large datasets (in terms of the number of species). Overall, this dissertation contributes to understanding the limitations of the existing methods under realistic model conditions, developing new approaches to handle the challenging issues that frequently arise in phylogenomics, and improving and scaling the existing methods to larger datasets.

Item Novel scalable approaches for multiple sequence alignment and phylogenomic reconstruction (2015-08)
Mir arabbaygi, Siavash; Pingali, Keshav; Warnow, Tandy, 1955-; Hillis, David; Ghosh, Joydeep; Berger, Bonnie; Mooney, Ray
The amount of biological sequence data is increasing rapidly, a promising development that could transform biology if we can develop methods that analyze large-scale data efficiently and accurately.
A fundamental problem in evolutionary biology is building the tree of life: a reconstruction of relationships between organisms in evolutionary time. Reconstructing phylogenetic trees from molecular data is an optimization problem that involves many steps. In this dissertation, we argue that to answer long-standing phylogenetic questions with large-scale data, several challenges need to be addressed in various steps of the pipeline. One challenge is aligning a large number of sequences so that evolutionarily related positions in all sequences are put in the same column. Constructing alignments is necessary for phylogenetic reconstruction, but also for many other types of evolutionary analyses. In response to this challenge, we introduce PASTA, a scalable and accurate algorithm that can align datasets with up to a million sequences. A second challenge is related to the interesting fact that various parts of the genome can have different evolutionary histories. Reconstructing a species tree from genome-scale data needs to account for these differences. A main approach for species tree reconstruction is to first reconstruct a set of “gene trees” from different parts of the genome, and then to summarize these gene trees into a single species tree. We argue that this approach can suffer from two challenges: reconstruction of individual gene trees from limited data can be plagued by estimation error, which translates to errors in the species tree; and methods that summarize gene trees are not scalable or accurate enough under some conditions. To address the first challenge, we introduce statistical binning, a method that re-estimates gene trees by grouping them into bins. We show that binning improves gene tree accuracy, and consequently species tree accuracy. To address the second challenge, we introduce ASTRAL, a new summary method that can run on a thousand genes and a thousand species in a day and has outstanding accuracy.
We show that the development of these methods has enabled biological analyses that were otherwise not possible.
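
The gene tree discordance that both abstracts refer to can be made concrete with a small sketch. The Python below is not taken from either dissertation and is not how PASTA, ASTRAL, or binning work; it only assumes rooted trees written as nested tuples of leaf names and counts the clades that appear in exactly one of two trees (the rooted Robinson-Foulds distance). The function names `clades` and `clade_distance` and the four-taxon example trees are hypothetical, chosen purely for illustration.

```python
def clades(tree):
    """Return the non-trivial clades of a rooted tree given as nested tuples.

    Each clade is a frozenset of leaf names. Single leaves and the root
    clade (all leaves) are left out because every tree on the same leaf
    set shares them.
    """
    found = set()

    def leaves(node):
        # A leaf is a plain string; an internal node is a tuple of children.
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)
    return found


def clade_distance(tree_a, tree_b):
    """Count clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))


# Hypothetical species tree and three gene trees on taxa A-D.
species_tree = ((("A", "B"), "C"), "D")
gene_trees = [
    ((("A", "B"), "C"), "D"),  # same topology as the species tree
    ((("A", "C"), "B"), "D"),  # A groups with C instead of B
    (("A", "B"), ("C", "D")),  # balanced topology
]

distances = [clade_distance(species_tree, g) for g in gene_trees]
print(distances)
```

A distance of zero means the gene tree has the same rooted topology as the species tree; larger values indicate more conflicting clades. Summary methods of the kind described above take many such possibly conflicting gene trees as input.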