Combinatorial motif analysis in yeast gene promoters: the benefits of a biological consideration of motifs



Journal Title

Journal ISSN

Volume Title


Texas A&M University


There are three main categories of algorithms for identifying small transcription regulatory sequences in the promoters of genes, phylogenetic comparison, expectation maximization and combinatorial. For convenience, the combinatorial methods typically define motifs in terms of a canonical sequence and a set of sequences that have a small number of differences compared to the canonical sequence. Such motifs are referred to as (l, d)-motifs where l is the length of the motif and d indicates how many mismatches are allowed between an instance of the motif and the canonical motif sequence. There are limits to the complexity of the patterns of motifs that can be found by combinatorial methods. For some values of l and d, there will exist many sets of random words in a cluster of gene promoters that appear to form an (l, d)-motif. For these motifs, it will be impossible to distinguish biological motifs from randomly generated motifs. A better formalization of motifs is the (l, f, d)-motif that is derived from a biological consideration of motifs. The motivation for (l, f, d)-motifs comes from an examination of known transcription factor binding sites where typically a few positions in the motif are invariant. It is shown that there exist (l, f, d)-motifs that can be found in the promoters of gene clusters that would not be recognizable from random sequences if they were described as (l, d)-motifs. The inclusion of the f-value in the definition of motifs suggests that the sequence space that is occupied by a motif will consist of a several clusters of closely related sequences. An algorithm, CM, has been developed that identifies small sets of overabundant sequences in the promoters from a cluster of genes and then combines these simple sets of sequences to form complex (l, f, d)-motif models. A dataset from a yeast gene expression experiment is analyzed with CM. Known biological motifs and novel motifs are identified by CM. The performance of CM is compared to that of a popular expectation maximization algorithm, AlginACE, and to that from a simple combinatorial motif finding program.