Fragment Based Protein Active Site Analysis Using Markov Random Field Combinations of Stereochemical Feature-Based Classifications
Abstract
Recent improvements in structural genomics efforts have greatly increased the number of hypothetical proteins in the Protein Data Bank. Several computational methodologies have been developed to determine the function of these proteins, but none has successfully accounted for the diversity in sequence and structural conformation observed among proteins that share the same function. An additional complication is the flexibility of both the protein active site and the ligand. In this dissertation, I propose novel approaches to deal with both ligand flexibility and diversity in stereochemistry. The active site analysis problem is formalized as a classification problem: for a given test protein, the goal is to predict the class of ligand most likely to bind the active site, based on the site's stereochemical nature, and thereby to define the protein's function. Traditional methods that have adopted a similar methodology have struggled to account for the flexibility observed in large ligands. Therefore, I propose a novel fragment-based approach to dealing with larger ligands. The advantage of the fragment-based methodology is that considering the protein-ligand interactions in a piecewise manner does not affect the active site patterns, and it also provides a way to account for the problems associated with flexible ligands. I also propose two feature-based methodologies to account for the diversity observed in sequences and structural conformations among proteins with the same function. The feature-based methodologies provide detailed descriptions of the active site stereochemistry and are capable of identifying stereochemical patterns within the active site despite this diversity. Finally, I propose a Markov Random Field (MRF) approach to combine the individual ligand fragment classifications (based on the stereochemical descriptors) into a single multi-fragment ligand class.
This probabilistic framework combines the information provided by stereochemical features with information about the geometric constraints between ligand fragments to make a final ligand class prediction. The feature-based fragment identification methodology had an accuracy of 84% across a diverse set of ligand fragments, and the Markov Random Field (MRF) analysis was able to successfully combine the various ligand fragments (identified by feature-based analysis) into one final ligand based on statistical models of ligand fragment distances. This novel approach to protein active site analysis was additionally tested on three proteins with very low sequence and structural similarity to other proteins in the PDB (a challenge for traditional methods), and in each of these cases the approach successfully identified the cognate ligand. The approach thus addresses the two main issues that affect the accuracy of current automated methodologies in protein function assignment: ligand flexibility and the diversity in sequence and structure among proteins with the same function.
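
To make the combination step concrete, the following is a minimal sketch of how per-fragment classifier scores and fragment-distance statistics could be merged in a pairwise MRF. It is an illustration under stated assumptions, not the dissertation's implementation: the abstract does not specify the potentials or the inference method, so the Gaussian distance models, the exhaustive MAP search, the function names, the fragment labels, and all numbers below are hypothetical.

```python
import itertools
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a normal distribution (assumed form of the distance model)."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))


def map_labeling(unary, distances, pair_models, floor=-20.0):
    """Return the highest-scoring joint fragment labeling and its log score.

    unary       -- list of dicts, one per sub-site: fragment label -> classifier probability
    distances   -- dict (i, j) -> observed distance between sub-sites i and j
    pair_models -- dict (label_a, label_b) -> (mean, std) of the fragment distance
    floor       -- log penalty for label pairs with no distance model (assumption)
    """
    best, best_score = None, -math.inf
    # Exhaustive search is only feasible for a handful of sub-sites and labels.
    for labels in itertools.product(*(sorted(u) for u in unary)):
        # Unary term: stereochemical feature-based classification of each fragment.
        score = sum(math.log(unary[i][lab]) for i, lab in enumerate(labels))
        # Pairwise term: geometric constraint between neighboring fragments.
        for (i, j), d in distances.items():
            model = (pair_models.get((labels[i], labels[j]))
                     or pair_models.get((labels[j], labels[i])))
            score += gaussian_log_pdf(d, *model) if model else floor
        if score > best_score:
            best, best_score = labels, score
    return best, best_score


# Hypothetical two-site example: the classifier alone slightly prefers
# "phosphate" at site 1, but the observed 3.6 A distance favors "ribose".
unary = [
    {"adenine": 0.6, "nicotinamide": 0.4},
    {"ribose": 0.45, "phosphate": 0.55},
]
distances = {(0, 1): 3.6}
pair_models = {
    ("adenine", "ribose"): (3.5, 0.5),
    ("adenine", "phosphate"): (7.0, 0.5),
}

best, best_score = map_labeling(unary, distances, pair_models)
print(best)
```

In this toy case the distance term overrides the marginally higher classifier score for the second fragment, which is the kind of correction a geometric constraint is meant to supply. A realistic setting with many sub-sites would need approximate inference (for example belief propagation) rather than enumeration.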