Investigating the behaviors and limitations of phylogenetic models of protein-coding sequence evolution

Field | Value | Language
dc.contributor.advisor | Wilke, C. (Claus) | en
dc.contributor.committeeMember | Bull, James | en
dc.contributor.committeeMember | Barrick, Jeffrey | en
dc.contributor.committeeMember | Hillis, David | en
dc.contributor.committeeMember | Hofmann, Hans | en
dc.creator | Spielman, Stephanie Jill | en
dc.creator.orcid | 0000-0002-9090-4788 | en
dc.date.accessioned | 2016-06-30T18:49:04Z |
dc.date.accessioned | 2018-01-22T22:30:11Z |
dc.date.available | 2016-06-30T18:49:04Z |
dc.date.available | 2018-01-22T22:30:11Z |
dc.date.issued | 2016-05 | en
dc.date.submitted | May 2016 |
dc.date.updated | 2016-06-30T18:49:04Z |
dc.description.abstract | Probabilistic models which infer the strength and direction of natural selection from protein-coding sequences are among the most widely-used tools in comparative sequence analysis. A variety of phylogenetic models of coding-sequence evolution have been developed. However, these models have been produced independently from one another. As a consequence, it has been entirely unknown whether inferences from different models reveal similar or incompatible information about the evolutionary process. In this dissertation, I derive and study the mathematical relationship between two probabilistic models of protein-coding sequence evolution: dN/dS-based models, which estimate evolutionary rates, and mutation–selection models, which estimate site-specific amino-acid fitnesses. I demonstrate how this relationship reveals the behavioral properties, limitations, and applicabilities of different inference frameworks, which leads to concrete recommendations for how these models should best be employed in evolutionary sequence analysis. In Chapter 2, I develop a flexible and extendable software, implemented as a module in the Python programming language, for simulating sequences along phylogenies according to standard evolutionary models. This software platform provides an independent and user-friendly platform for testing model behavior, or indeed developing novel evolutionary models, thus enabling robust comparisons of modeling frameworks. In Chapter 3, I derive a mathematical relationship between dN/dS and amino-acid fitness values, and I show that mutation–selection models fully encompass information encoded in dN/dS models, provided that sequences are evolving under purifying selection. I further use this relationship to show that certain commonly-used dN/dS-based models are strongly and systematically biased. I additionally show that standard metrics used for model selection in phylogenetics (e.g. Akaike Information Criterion) may be positively misleading and indicate strong support for incorrect models. Finally, in Chapter 4, I apply the mathematical relationship developed in Chapter 3 to study the accuracy of two competing mutation–selection inference implementations, whose relative merits have been heavily debated in the literature. My approach demonstrates that mutation–selection inference platforms that treat amino-acid fitnesses as fixed-effect variables precisely estimate site-specific evolutionary constraints. By contrast, inference platforms that treat fitnesses as random-effect variables systematically underestimate the strength of natural selection across sites. Taken together, the work presented in this dissertation yields novel insights into how these popular evolutionary models can best be applied to sequence data, how their results should be interpreted, and finally how future model development should be conducted in order to yield robust and reliable inference methods. | en
dc.description.department | Ecology, Evolution and Behavior | en
dc.format.mimetype | application/pdf | en
dc.identifier | doi:10.15781/T2GH9B88K | en
dc.identifier.uri | http://hdl.handle.net/2152/38770 | en
dc.language.iso | en | en
dc.subject | Molecular evolution | en
dc.subject | Protein evolution | en
dc.subject | Phylogenetic models | en
dc.title | Investigating the behaviors and limitations of phylogenetic models of protein-coding sequence evolution | en
dc.type | Thesis | en
dc.type.material | text | en
