Probabilistic Approaches in Comparative Analysis of Biological Networks and Sequences

Date

2013-01-07

Authors

Sahraeian, Sayed 1983-

Abstract

Comparative analysis of genomic data investigates how genome structure and function relate across different biological species, to shed light on their similarities and differences. In this dissertation, we study two important problems in comparative genomics, namely comparative sequence analysis and comparative network analysis.

In comparative sequence analysis, we study the multiple sequence alignment of protein and DNA sequences as well as the structural alignment of multiple RNA sequences. For closely related sequences, multiple sequence alignment can be performed efficiently through progressive techniques. For divergent sequences, however, predicting an accurate alignment is very challenging. Here, we introduce PicXAA, an efficient non-progressive technique for multiple protein and DNA sequence alignment. We further extend PicXAA to PicXAA-R for structural alignment of RNA sequences. PicXAA and PicXAA-R greedily build up the alignment from sequence regions with high local similarity, thereby yielding an accurate global alignment that effectively captures local similarities among sequences.
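The greedy idea described above can be illustrated with a deliberately reduced toy example. The sketch below is not PicXAA: it handles only two sequences, and it assumes that pairwise alignment probabilities for residue pairs are already available as a dictionary. The function name `greedy_pairwise_alignment` and its parameters are hypothetical. Residue pairs are visited from most to least probable, and a pair is accepted only if it is consistent with (does not cross or reuse a position of) every pair accepted so far.

```python
def greedy_pairwise_alignment(posteriors, threshold=0.0):
    """Greedily pick mutually consistent residue pairs (toy illustration).

    posteriors: dict mapping (i, j) -> assumed probability that position i
                of the first sequence aligns with position j of the second.
    threshold:  pairs with probability at or below this value are ignored.
    Returns the accepted (i, j) pairs sorted by position.
    """
    accepted = []
    # Visit candidate pairs from the most confident to the least confident.
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    for (i, j), prob in ranked:
        if prob <= threshold:
            break
        # A new pair is compatible when it lies strictly before or strictly
        # after every accepted pair in both sequences: no crossing, no reuse.
        if all((i - a) * (j - b) > 0 for a, b in accepted):
            accepted.append((i, j))
    return sorted(accepted)


# Hypothetical probabilities for two short sequences.
example = {(0, 0): 0.9, (1, 2): 0.8, (2, 1): 0.7, (2, 3): 0.6, (1, 1): 0.5}
alignment = greedy_pairwise_alignment(example)
```

In this example the pair (2, 1) is rejected because it crosses the more confident pair (1, 2), and (1, 1) is rejected because position 1 of the first sequence is already aligned. Extending this to more than two sequences requires a more involved consistency check across all sequences, which this sketch does not attempt.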

As another important research area in comparative genomics, we also investigate the comparative network analysis problem. Translating the increasing number of large-scale biological networks into meaningful biological insights requires efficient computational techniques. One such example is network querying, which aims to identify subnetwork regions in a large target network that are similar to a given query network. Here, we introduce RESQUE, an efficient algorithm for querying large-scale biological networks. RESQUE adopts a semi-Markov random walk model to probabilistically estimate the correspondence scores between nodes that belong to different networks. The target network is iteratively reduced based on the estimated correspondence scores until the best matching subnetwork emerges. The proposed network querying scheme is computationally efficient, can handle any network query with an arbitrary topology, and yields accurate querying results. We also extend the idea used in RESQUE to develop SMETANA, an efficient algorithm for alignment of multiple large-scale biological networks. SMETANA outperforms state-of-the-art network alignment techniques in terms of both computational efficiency and alignment accuracy.
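The estimate-then-reduce loop described above can be sketched in a simplified form. The code below is not RESQUE: in place of the semi-Markov random walk model it substitutes a plain random walk with restart on pairs of nodes, and it removes one target node per round. Networks are assumed to be undirected and given as dictionaries of neighbor sets; the function names, the `similarity` input, and the parameters `alpha`, `iterations`, and `keep` are all hypothetical.

```python
def correspondence_scores(query, target, similarity, alpha=0.6, iterations=50):
    """Score node pairs with a random walk with restart (simplified stand-in).

    query, target: dict mapping node -> set of neighbor nodes (undirected).
    similarity:    dict mapping (query_node, target_node) -> non-negative
                   node similarity; assumed to have a positive total.
    """
    pairs = [(u, v) for u in query for v in target]
    total = sum(similarity.get(p, 0.0) for p in pairs)
    prior = {p: similarity.get(p, 0.0) / total for p in pairs}
    scores = dict(prior)
    for _ in range(iterations):
        updated = {}
        for u, v in pairs:
            # Score flowing in from neighboring pairs, each of which splits
            # its score evenly among its own neighboring pairs.
            inflow = sum(
                scores[(a, b)] / (len(query[a]) * len(target[b]))
                for a in query[u]
                for b in target[v]
            )
            updated[(u, v)] = alpha * inflow + (1 - alpha) * prior[(u, v)]
        scores = updated
    return scores


def reduce_target(query, target, similarity, keep, **kwargs):
    """Iteratively drop the worst-matching target node until `keep` remain."""
    target = {v: set(nbrs) for v, nbrs in target.items()}  # work on a copy
    while len(target) > keep:
        scores = correspondence_scores(query, target, similarity, **kwargs)
        # Best score each target node reaches against any query node.
        best = {v: max(scores[(u, v)] for u in query) for v in target}
        worst = min(best, key=best.get)
        del target[worst]
        for nbrs in target.values():
            nbrs.discard(worst)
    return set(target)


# Toy example: the query edge a-b should match the x-y end of the path x-y-z-w.
query_net = {"a": {"b"}, "b": {"a"}}
target_net = {"x": {"y"}, "y": {"x", "z"}, "z": {"y", "w"}, "w": {"z"}}
node_similarity = {("a", "x"): 1.0, ("b", "y"): 1.0}
match = reduce_target(query_net, target_net, node_similarity, keep=2)
```

Recomputing the scores after every removal keeps the sketch short but is costly; a practical implementation would presumably remove many nodes per round and use sparse matrix operations.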

These studies have enabled us to provide a coherent framework for a probabilistic approach to the comparative analysis of biological sequences and networks. Such a probabilistic framework helps us employ rigorous mathematical schemes to find accurate and efficient solutions to these problems.