Family of Hidden Markov Models and its applications to phylogenetics and metagenomics

Date

2014-08

Abstract

A profile Hidden Markov Model (HMM) is a statistical model for representing a multiple sequence alignment (MSA). Profile HMMs are important tools for sequence homology detection and have been used in a wide range of bioinformatics applications, including protein structure prediction, remote homology detection, and sequence alignment. Profile HMM methods produce accurate alignments on datasets with evolutionarily similar sequences; however, I will show that on datasets with evolutionarily divergent sequences, the accuracy of HMM-based methods degrades. My dissertation presents a new statistical model that represents an MSA with a set of HMMs: the family of HMMs (fHMM) approach uses multiple HMMs instead of a single HMM. I present a new algorithm for sequence alignment using the fHMM technique, and I show that it produces more accurate alignments than the single HMM approach. Because sequence alignment is a fundamental step in many bioinformatics pipelines, improvements to sequence alignment lead to improvements across many different fields. I show the applicability of fHMM to three specific problems: phylogenetic placement, taxonomic profiling and identification, and MSA estimation. In phylogenetic placement, the problem is how to insert a query sequence into an existing tree. In taxonomic identification and profiling, the problems are how to taxonomically classify a query sequence and how to estimate a taxonomic profile for a set of sequences. Finally, both profile HMMs and fHMM require a backbone MSA as input in order to align the query sequences. In MSA estimation, the problem is how to estimate a "de novo" MSA without the use of an existing backbone alignment. For each problem, I present a software pipeline that implements the fHMM approach for that domain: SEPP for phylogenetic placement, TIPP for taxonomic profiling and identification, and UPP for MSA estimation.
I show that SEPP is more accurate than the single HMM approach. I also show that SEPP produces more accurate phylogenetic placements than existing placement methods and is more computationally efficient, in both peak memory usage and running time. I show that TIPP classifies novel sequences more accurately than the single HMM approach, and that TIPP estimates more accurate taxonomic profiles than leading methods on simulated metagenomic datasets. I show how UPP can estimate "de novo" alignments using fHMM. I present results showing that UPP is more accurate and efficient than existing alignment methods, and that it estimates accurate alignments and trees on datasets containing both full-length and fragmentary sequences. Finally, I show that UPP can estimate a very accurate alignment on a dataset with 1,000,000 sequences in less than 2 days without the need for a supercomputer.


Computer Sciences

Description

text
