An Application Of Parallel And Distributed Computing Methods To Approximate Pattern Matching Of Genetic Regulatory Motifs
Bioinformatics is a relatively new scientific field concerned with providing computational means and support to research in molecular biology and genetics. It draws from many different areas of computer science, including database theory, algorithm design and analysis, and artificial intelligence, to name just a few. In many applications, such as one described in this thesis, a biologist is interested in locating a particular pattern, or sequence motif, in a given string or set of strings over the four-letter DNA alphabet.
In this thesis we present an efficient approach to locating promoter and other regulatory sequences in entire genomes or in specific target areas. Promoters are short conserved sequences located upstream of the genes they regulate, and they have an essential role in driving the expression of the genes. However, in higher organisms promoter sequences are very diverse, and their motifs can feature substantial sequence variation, including character (chemical base) substitutions, insertions and deletions. We were thus interested in designing methods for the efficient search for approximate matches of the target sequences to putative promoter consensus. To achieve this goal we have used a combination of classic pattern-matching algorithms and high performance computing, parallelizing the search over a number of processors.
We have used these methods in the genome-wide search for a 14-base promoter element in the genome of the fruit fly Drosophila melanogaster, postulated to have a role in the testis-specific expression of the genes it controls. The list of testis-specific genes has been provided by our collaborator from the UTA Department of Biology, and we have obtained the raw sequence and other genetic data from Flybase, a public database maintained by a consortium of Drosophila researchers. Although the number of genes we were interested in was moderate, 791, the number of exact patterns approximately matching the 14 base consensus was very large, approximately 38,000. This problem has thus guided the design of the methods we describe in this thesis.