Browsing by Subject "Gene Ontology"

Item
An algorithm for identifying clusters of functionally related genes in genomes (2009-05-15)
Yi, Gang Man
An increasing body of literature shows that genomes of eukaryotes can contain clusters of functionally related genes. Most approaches to identifying gene clusters use microarray data or metabolic pathway databases to find groups of genes on chromosomes that are linked by common attributes. A generalized method that can find gene clusters, regardless of the mechanism of origin, would provide researchers with an unbiased way to find clusters and to study the evolutionary forces that give rise to them. I present the basis of an algorithm to identify gene clusters in eukaryotic genomes that uses functional categories defined in graph-based vocabularies such as the Gene Ontology (GO). Clusters identified in this manner need only have a common function and are not constrained by gene expression or other properties. I tested the algorithm by analyzing genomes of a representative set of species. I identified species-specific variation in the percentage of clustered genes as well as in properties of gene clusters, including size distribution and functional annotation. These properties may be diagnostic of the evolutionary forces that lead to the formation of gene clusters. The approach finds all gene clusters in the data set and ranks them by their likelihood of occurrence by chance. The method successfully identified clusters.

Item
Automatic Assignment of Protein Function with Supervised Classifiers (2010-01-16)
Jung, Jae
High-throughput genome sequencing and sequence analysis technologies have created the need for automated annotation and analysis of large sets of genes. The Gene Ontology (GO) provides a common controlled vocabulary for describing gene function. However, proteins are usually annotated with GO terms through a tedious manual curation process carried out by trained professional annotators.
With the wealth of genomic data that are now available, there is a need for accurate automated annotation methods. The overall objective of my research is to improve our ability to automatically annotate proteins with GO terms. The first method, Automatic Annotation of Protein Functional Class (AAPFC), employs protein functional domains as features and learns an independent Support Vector Machine classifier for each GO term. This approach relies only on protein functional domains as features, and demonstrates that statistical pattern recognition can outperform expert-curated mapping of protein functional domain features to protein functions. The second method, Predict of Gene Ontology (PoGO), is a meta-classification method that integrates multiple heterogeneous data sources. This method achieves better performance than the protein domain method can achieve alone. Apart from these two methods, several systems have been developed that employ pattern recognition to assign gene function using a variety of features, such as sequence similarity, presence of protein functional domains, and gene expression patterns. Most of these approaches have not considered the hierarchical relationships among the terms, which take the form of a directed acyclic graph (DAG). The DAG represents the functional relationships between the GO terms and should therefore be an important component of an automated annotation system. I describe a Bayesian network used as a multi-layered classifier that incorporates the relationships among GO terms found in the GO DAG. I also describe an inference algorithm for quickly assigning GO terms to unlabeled proteins. A comparative analysis of the method against previously described annotation systems shows that the method provides improved annotation accuracy when performance on individual GO terms is compared.
More importantly, this method enables the assignment of significantly more GO terms to more proteins than was previously possible.

Item
Statistical Models for Next Generation Sequencing Data (2013-04-01)
Wang, Yiyi
Three statistical models are developed to address problems in Next-Generation Sequencing data. The first two models are designed for RNA-Seq data and the third is designed for ChIP-Seq data. The first of the RNA-Seq models uses a Bayesian nonparametric model to detect genes that are differentially expressed across treatments. A negative binomial sampling distribution is used for each gene's read count such that each gene may have its own parameters. Despite the consequent large number of parameters, parsimony is imposed by a clustering inherent in the Bayesian nonparametric framework. A Bayesian discovery procedure is adopted to calculate the probability that each gene is differentially expressed. A simulation study and real data analysis show that this method performs at least as well as leading existing methods in some cases. The second RNA-Seq model shares the framework of the first model, but replaces the usual random partition prior from the Dirichlet process with a random partition prior indexed by distances from the Gene Ontology (GO). The use of the external biological information yields improvements in statistical power over the original Bayesian discovery procedure. The third model addresses the problem of identifying protein binding sites in ChIP-Seq data. An exact test via a stochastic approximation is used to test the hypothesis that the treatment effect is independent of the sequence count intensity effect. The sliding window procedure for ChIP-Seq data is followed. The p-value and the adjusted false discovery rate are calculated for each window.
For the sites identified as peak regions, three candidate models are proposed for characterizing the bimodality of the ChIP-Seq data, and the stochastic approximation Monte Carlo (SAMC) method is used to select the best of the three. Real data analysis shows that this method produces results comparable to those of other existing methods and is advantageous in identifying bimodality in the data.
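
The abstract above mentions computing a p-value and an adjusted false discovery rate for each sliding window, without naming the adjustment. As an illustration only, the sketch below applies the Benjamini-Hochberg procedure, which is a common choice but an assumption here and not necessarily the method used in the thesis. The function name and the example p-values are hypothetical.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: BH is used purely for illustration; the source abstract
    does not say which FDR adjustment was applied to the windows.
    """
    m = len(p_values)
    # Indices of the windows, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # stay monotone (a smaller p-value never gets a larger adjustment).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical per-window p-values, one per sliding window.
window_p_values = [0.01, 0.04, 0.03, 0.5]
window_q_values = benjamini_hochberg(window_p_values)
```

Windows whose adjusted value falls below a chosen threshold would then be reported as candidate peak regions; the threshold itself is a design choice that the abstract does not specify.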