Bayesian learning in bioinformatics

Gold, David L.

Date

2009-05-15

Authors

Gold, David L.

Abstract

Life sciences research is advancing in breadth and scope, affecting many areas of life including medical care and government policy. The field of Bioinformatics, in particular, is growing very rapidly with the help of computer science, statistics, applied mathematics, and engineering. New high-throughput technologies are making it possible to measure genomic variation across phenotypes in organisms at costs that were once inconceivable. In conjunction, and partly as a consequence, massive amounts of information about the genomes of many organisms are becoming accessible in the public domain. One of the important and exciting questions in the post-genomics era is how to integrate all of the information available from diverse sources. Learning in complex systems biology requires that information be shared in a natural and interpretable way, to integrate knowledge and data. The statistical sciences can support the advancement of learning in Bioinformatics in many ways, not the least of which is by developing methodologies that support the synchronization of efforts across sciences, offering real-time learning tools that can be shared across many fields, from basic science to clinical applications. This research introduces several current research problems in Bioinformatics that address the integration of information, and discusses statistical methodologies from the Bayesian school of thought that may be applied. Bayesian statistical methodologies are proposed to integrate biological knowledge and improve statistical inference for three relevant Bioinformatics applications: gene expression arrays, BAC and aCGH arrays, and real-time gene expression experiments. A unified Bayesian model is proposed to detect genes and gene classes, defined from historical pathways, with gene expression arrays. A novel Bayesian statistical method is proposed to infer chromosomal copy number aberrations in clinical populations with BAC or aCGH experiments.
A theoretical model, motivated by historical work in mathematical biology, is proposed for inference with real-time gene expression experiments and is fit with Bayesian methods. Simulation and case studies indicate that Bayesian methodologies hold great promise for improving the way we learn from high-throughput Bioinformatics experiments.