Understanding the importance of taxonomic sampling for large-scale phylogenetic analyses by simulating evolutionary processes under complex models






Appropriate and extensive taxon sampling is one of the most important determinants of accurate phylogenetic estimation. In addition, the accuracy of inferences about evolutionary processes obtained from phylogenetic analyses is improved significantly by thorough taxon sampling. Much of the previous work examining the impact of taxon sampling on phylogenetic accuracy has focused on the effects of random taxon sampling or directed taxon addition or removal. As a result, the effect of realistic, nonrandom taxon sampling strategies on the accuracy of large-scale phylogenetic reconstruction is not well understood. Typically, broad systematic studies of diverse clades select species according to current classification to span the diversity within the group of interest. I simulated phylogenies under a realistic model of cladogenesis and used these trees to generate sequence data. Using these simulations, I explored the effect of taxonomy-based taxon sampling on the accuracy of maximum likelihood reconstruction. The results demonstrate that taxonomy-based sampling has a stronger negative effect on phylogenetic accuracy than random taxon sampling. Therefore, it is recommended that systematists conducting phylogenetic analyses of diverse clades concentrate on improving sampling density within their group of interest by selecting multiple representatives from each taxonomic level.
Phylogenetic tree imbalance is often used to make inferences about the macroevolutionary processes that generate patterns of tree shape. However, these patterns may be obscured by non-biological factors that can bias tree shape. Using published trees inferred from biological data and trees simulated under a realistic branching model, I investigated the effect of random taxon omission on phylogenetic tree imbalance. My results indicate that incomplete taxon sampling in the presence of variable rates of speciation and extinction may be sufficient to explain much of the imbalance observed in empirical phylogenies.
Previous research has indicated that some methods of phylogenetic inference can produce biased tree topologies and shapes. Using simulated model tree topologies and sequence data, I investigated the non-biological factors that lead to biases in phylogenetic tree imbalance. Based on my results, I concluded that phylogenetic noise is the primary cause of tree shape bias. Methods that account for unobserved substitutions, such as maximum likelihood, can overcome the systematic bias toward imbalanced topologies.
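As an illustration only, the random taxon omission experiment summarized above could be sketched roughly as follows. The abstract names neither the imbalance statistic nor the branching model, so this sketch assumes the Colless index and an equal-rates random-joining model as simple stand-ins. The function names `random_tree`, `prune`, and `colless` are hypothetical and are not taken from the original work.

```python
import random

def random_tree(n_tips, rng):
    """Build a random binary topology by repeatedly joining two lineages.

    Assumption: an equal-rates stand-in, not the realistic branching
    model with variable speciation and extinction used in the study.
    Tips are integers; internal nodes are 2-tuples.
    """
    lineages = list(range(n_tips))
    while len(lineages) > 1:
        a = lineages.pop(rng.randrange(len(lineages)))
        b = lineages.pop(rng.randrange(len(lineages)))
        lineages.append((a, b))
    return lineages[0]

def prune(tree, keep):
    """Restrict the tree to the tips in `keep`; return None if none remain."""
    if not isinstance(tree, tuple):
        return tree if tree in keep else None
    left, right = prune(tree[0], keep), prune(tree[1], keep)
    if left is None:
        return right
    if right is None:
        return left
    return (left, right)

def colless(tree):
    """Return (tip count, Colless index) for a binary tree.

    The Colless index sums |left tips - right tips| over internal nodes;
    it is assumed here as one common imbalance measure.
    """
    if not isinstance(tree, tuple):
        return 1, 0
    n_left, c_left = colless(tree[0])
    n_right, c_right = colless(tree[1])
    return n_left + n_right, c_left + c_right + abs(n_left - n_right)

if __name__ == "__main__":
    rng = random.Random(0)
    full = random_tree(64, rng)
    # Omit half of the taxa uniformly at random, then compare imbalance.
    kept = set(rng.sample(range(64), 32))
    sampled = prune(full, kept)
    print("full tree (tips, Colless):", colless(full))
    print("sampled tree (tips, Colless):", colless(sampled))
```

Under an equal-rates stand-in like this one, random omission alone is not expected to shift imbalance systematically. The effect described in the abstract concerns trees generated with variable rates of speciation and extinction, which would require replacing `random_tree` with a richer simulator.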