Automatic Assignment of Protein Function with Supervised Classifiers






High-throughput genome sequencing and sequence analysis technologies have created the need for automated annotation and analysis of large sets of genes. The Gene Ontology (GO) provides a common controlled vocabulary for describing gene function. However, proteins are usually annotated with GO terms through tedious manual curation by trained professional annotators. With the wealth of genomic data now available, there is a need for accurate automated annotation methods. The overall objective of my research is to improve our ability to automatically annotate proteins with GO terms. The first method, Automatic Annotation of Protein Functional Class (AAPFC), employs protein functional domains as features and learns an independent Support Vector Machine classifier for each GO term. Relying on these features alone, it demonstrates that statistical pattern recognition can outperform expert-curated mappings of protein functional domains to protein functions. The second method, Predict of Gene Ontology (PoGO), is a meta-classification method that integrates multiple heterogeneous data sources and achieves better performance than the protein domain method alone. Apart from these two methods, several systems have been developed that employ pattern recognition to assign gene function using a variety of features, such as sequence similarity, the presence of protein functional domains, and gene expression patterns. Most of these approaches have not considered the hierarchical relationships among GO terms, which form a directed acyclic graph (DAG). The DAG represents the functional relationships between the GO terms, so it should be an important component of an automated annotation system. I describe a Bayesian network used as a multi-layered classifier that incorporates the relationships among GO terms found in the GO DAG. I also describe an inference algorithm for quickly assigning GO terms to unlabeled proteins. A comparative analysis against previously described annotation systems shows that the method provides improved annotation accuracy when the performance on individual GO terms is compared. More importantly, this method enables the assignment of significantly more GO terms to more proteins than was previously possible.
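
To illustrate why the DAG matters for automated annotation, the following minimal sketch shows one simple way to make independent per-term scores consistent with a term hierarchy: each ancestor's score is raised to at least the score of any of its descendants. This is a plain ancestor-propagation heuristic, not the Bayesian network classifier or the inference algorithm described above. The term names, the toy graph, and the functions `ancestors` and `propagate` are all hypothetical and chosen only for illustration.

```python
# Hypothetical toy fragment of a GO-like DAG: child term -> list of parent terms.
# These identifiers are made up; real GO terms use numeric IDs.
PARENTS = {
    "GO:root": [],
    "GO:binding": ["GO:root"],
    "GO:catalytic": ["GO:root"],
    "GO:nucleotide_binding": ["GO:binding"],
    "GO:kinase": ["GO:catalytic", "GO:nucleotide_binding"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagate(scores, parents=PARENTS):
    """Raise each ancestor's score to at least that of any scored descendant."""
    consistent = {term: scores.get(term, 0.0) for term in parents}
    for term, score in scores.items():
        for anc in ancestors(term, parents):
            consistent[anc] = max(consistent[anc], score)
    return consistent


# Assumed per-term classifier outputs for one protein (illustrative values only).
raw_scores = {"GO:kinase": 0.9, "GO:binding": 0.4, "GO:catalytic": 0.2}
consistent_scores = propagate(raw_scores)
```

In this toy case a confident prediction for the most specific term lifts the scores of all of its more general ancestors, so the final annotations never assert a specific function while denying the general one.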