Development of high throughput functional annotation system with distributed capabilities

Zaragoza, Joaquin

Date

2006-05

Authors

Zaragoza, Joaquin

Publisher

Texas Tech University

Abstract

Background

The genomes of over 250 species, including the human genome, have been sequenced to date. However, the technologies that produce these genetic sequences (genes and/or proteins) are advancing much faster than the science and technologies that allow scientists to determine how each individual gene or protein functions within the cell and to annotate it based upon these functions. Specifically, the process of functional annotation assigns a functional descriptor to unknown genetic sequences. This process includes (1) formatting raw genetic sequence data, (2) using pre-existing annotated databases to assign putative annotations to the raw genetic sequence data based upon sequence similarity searches, (3) providing visualization tools to facilitate curation and formal assignment of the automated mappings obtained in step two, and finally (4) formatting the final datasets into standard output formats conducive to downstream analysis.

Rationale

Thus, the primary need in the field of computational genetics is a unified functional annotation solution providing (1) improved overall throughput, (2) organized visualizations of complex datasets, (3) user interface interactivity with limited latency, and (4) processing of annotated data into formats conducive to downstream analysis. Although solutions currently exist for individual tasks in the annotation process, a unified solution is needed that provides efficient computational methods at the key high-throughput steps in the process. To address these issues, we have developed the High Throughput Gene Ontology Functional Annotation Toolkit (HTGOFAT).

Material and Methods

HTGOFAT was implemented in C# using the Microsoft .NET Framework. The key high-throughput steps in the annotation process were identified as the handling of input sequences, the computationally intensive sequence similarity searches, and local and remote database interactions. In addition, data visualizations are generated within HTGOFAT to complete the functional annotation process. First, HTGOFAT integrates an indexing scheme for input sequence databases that allows sequences to be indexed and retrieved from a flat text file while leaving the file intact. Second, HTGOFAT integrates a distributed algorithm that parallelizes the similarity search using the Microsoft .NET Remoting framework. Third, using key indexing terms obtained during the similarity search, further data mining is conducted against a remote MySQL database to obtain the actual functional annotations. Lastly, graphical representations of the resulting annotations are presented as functional biological pathways, directed acyclic graphs, and grid-formatted tables that allow for curation and analysis. Quantitative assessments of the improvements in these high-throughput steps were performed by comparing the methods developed as part of HTGOFAT with similar automated methods as well as manual methods.
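The idea of indexing a flat file in place can be illustrated with a minimal sketch. It is written in Python rather than the C# used by HTGOFAT, it assumes FASTA-style input (header lines beginning with ">"), and the function names `build_fasta_index` and `fetch_sequence` are hypothetical; the abstract does not describe the actual index layout, so this shows only the general technique of recording byte offsets and seeking to them.

```python
def build_fasta_index(path):
    """Map each sequence identifier to the byte offset of its header line.

    The input file is only read, never rewritten or split.
    """
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()      # position of the line about to be read
            line = handle.readline()
            if not line:                # end of file
                break
            if line.startswith(b">"):
                tokens = line[1:].split()
                if tokens:
                    # Assumption: the identifier is the first token of the header.
                    index[tokens[0].decode("ascii")] = offset
    return index


def fetch_sequence(path, index, seq_id):
    """Seek directly to one record and return its residues as a single string."""
    parts = []
    with open(path, "rb") as handle:
        handle.seek(index[seq_id])
        handle.readline()               # skip the header line
        for line in handle:
            if line.startswith(b">"):   # next record begins here
                break
            parts.append(line.strip().decode("ascii"))
    return "".join(parts)
```

Building the index requires one pass over the file; after that, each retrieval costs a single seek plus the length of one record, which is why this approach can avoid reformatting the input into one file per sequence.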

Results

We have developed a standalone, unified application to address the computational requirements of the functional annotation process. Improvements in each of the key high-throughput steps were realized. Using an input file indexing method, rather than the common method of reformatting the flat files into single files, yielded a reported 900% decrease in the time to process a 1616 KB input file containing 5000 sequences. Improvements in the sequence similarity search bottleneck were realized by implementing a distributed algorithm. On eight workstations, each with a 2.4 GHz processor and 1 GB of memory, the parallel BLAST of 100, 500, 1000, and 1500 sequences took 5.04, 29.50, 58.58, and 81.70 minutes, respectively. A serial BLAST on one workstation with the same configuration took 38.15, 194.68, 461.81, and 662.68 minutes, respectively. An average speedup of 7.825 was achieved, which corresponds to an efficiency of 97.81% for the eight-workstation test cases. On average, the parallel BLAST took 13% of the time required by a serial BLAST. Computational performance analysis was performed to validate the implementation of fetching associations from the database server, comparing three automated methods with one manual method. In this analysis, the optimal automated method completed in nearly half the time of the next-best automated method and was over 2000 times faster than performing the task manually. For comparison, manually fetching 40,000 annotations is estimated to take an expert over 166.7 hours, as opposed to 303.969 seconds using the optimal automated algorithm. Finally, computational performance analysis of the data acquisition methods used to access the underlying databases determined the optimal implementation for presenting visualizations within the user interface with minimal latency.
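The speedup and efficiency figures follow the standard definitions: speedup is serial time divided by parallel time, and efficiency is speedup divided by the number of workstations. The sketch below (Python, with hypothetical helper names) applies these definitions to the timings quoted above. The abstract does not state how the reported average speedup was computed, so the per-size ratios produced here are illustrative and need not reproduce that average exactly.

```python
# Timings in minutes, copied from the results above, keyed by number of sequences.
SERIAL_MINUTES = {100: 38.15, 500: 194.68, 1000: 461.81, 1500: 662.68}
PARALLEL_MINUTES = {100: 5.04, 500: 29.50, 1000: 58.58, 1500: 81.70}
WORKSTATIONS = 8


def speedup(serial, parallel):
    """Speedup = serial run time / parallel run time."""
    return serial / parallel


def efficiency(speedup_value, workers):
    """Efficiency = speedup / number of workers (1.0 would be ideal scaling)."""
    return speedup_value / workers


# Speedup for each input size, computed from the quoted timings.
per_size_speedup = {
    n: speedup(SERIAL_MINUTES[n], PARALLEL_MINUTES[n]) for n in SERIAL_MINUTES
}

# Fraction of the serial time taken by the parallel run, averaged over input sizes.
mean_time_fraction = sum(
    PARALLEL_MINUTES[n] / SERIAL_MINUTES[n] for n in SERIAL_MINUTES
) / len(SERIAL_MINUTES)

# Efficiency implied by the reported average speedup of 7.825 on eight workstations.
reported_efficiency = efficiency(7.825, WORKSTATIONS)

if __name__ == "__main__":
    for n, s in sorted(per_size_speedup.items()):
        print(f"{n} sequences: speedup {s:.2f}")
    print(f"mean parallel/serial time fraction: {mean_time_fraction:.1%}")
    print(f"efficiency at reported speedup: {reported_efficiency:.2%}")
```

Dividing the reported speedup of 7.825 by eight workstations gives 0.9781, matching the stated 97.81% efficiency, and the mean time fraction computed from the quoted timings rounds to the stated 13%.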

Conclusions

This thesis describes the High Throughput Gene Ontology Functional Annotation Toolkit (HTGOFAT), which automates the functional annotation process by reducing the bottleneck associated with processing many sequences concurrently and, at the same time, allows additional post-processing of the resulting data in order to visualize and analyze the functional annotations obtained.