Browsing by Subject "bioinformatics"
Item Algorithms for Gene Clustering Analysis on Genomes (2012-07-16) Yi, Gang Man
The increased availability of data in biological databases provides many opportunities for understanding biological processes through these data. As recent attention has shifted from sequence analysis to higher-level analysis of genes across multiple genomes, there is a need to develop efficient algorithms for these large-scale applications that can help us understand the functions of genes. The overall objective of my research was to develop improved methods that can automatically assign groups of functionally related genes in large-scale data sets by applying new gene clustering algorithms. Proposed gene clustering algorithms that can help us understand gene function and genome evolution include new algorithms for protein family classification, a window-based strategy for gene clustering on chromosomes, and an exhaustive strategy that allows all clusters of small size to be enumerated. I investigate the problems of gene clustering in multiple genomes, define them mathematically, and solve them by developing efficient and effective algorithms. For protein family classification, I developed two supervised classification algorithms that can assign proteins to existing protein families in public databases and, by taking into account similarities between the unclassified proteins, allow for progressive construction of new families from proteins that cannot be assigned. This approach is useful for rapid assignment of protein sequences from genome sequencing projects to protein families. A comparison of the method with previously developed methods shows that the algorithm has a higher accuracy rate and a lower misclassification rate than algorithms based on multiple sequence alignments and hidden Markov models.
The proposed algorithm performs well even on families with very few proteins and on families with low sequence similarity. Apart from the analysis of individual sequences, identifying genomic regions that descended from a common ancestor helps us study gene function and genome evolution. In distantly related genomes, clusters of homologous gene pairs serve as evidence used in function prediction, operon detection, etc. Thus, reliable identification of gene clusters is critical to functional annotation and analysis of genes. I developed an efficient gene clustering algorithm that can be applied to hundreds of genomes at the same time. This approach allows for large-scale study of evolutionary relationships of gene clusters and study of operon formation and destruction. By placing a stricter limit on the maximum cluster size, I developed another algorithm that uses a different formulation based on constraining the overall size of a cluster and statistical estimates that allow direct comparisons of clusters of different sizes. A comparative analysis of the proposed algorithms shows that more biological insight can be obtained by analyzing gene clusters across hundreds of genomes, which can help us understand operon occurrences, gene orientations and gene rearrangements.

Item Bayesian methods in bioinformatics (Texas A&M University, 2007-04-25) Baladandayuthapani, Veerabhadran
This work is directed towards developing flexible Bayesian statistical methods in the semi- and nonparametric regression modeling framework, with special focus on analyzing data from biological and genetic experiments. This dissertation attempts to solve two such problems in this area. In the first part, we study penalized regression splines (P-splines), which are low-order basis splines with a penalty to avoid undersmoothing. Such P-splines are typically not spatially adaptive, and hence can have trouble when functions are varying rapidly.
We model the penalty parameter inherent in the P-spline method as a heteroscedastic regression function. We develop a full Bayesian hierarchical structure to do this and use Markov Chain Monte Carlo techniques for drawing random samples from the posterior for inference. We show that the approach achieves very competitive performance as compared to other methods. The second part focuses on modeling DNA microarray data. Microarray technology enables us to monitor the expression levels of thousands of genes simultaneously and hence to obtain a better picture of the interactions between the genes. In order to understand the biological structure underlying these gene interactions, we present a hierarchical nonparametric Bayesian model based on Multivariate Adaptive Regression Splines (MARS) to capture the functional relationship between genes and also between genes and disease status. The novelty of the approach lies in the attempt to capture the complex nonlinear dependencies between the genes which could otherwise be missed by linear approaches. The Bayesian model is flexible enough to identify significant genes of interest as well as model the functional relationships between the genes. The effectiveness of the proposed methodology is illustrated on leukemia and breast cancer datasets.

Item Cuts and Partitions in Graphs/Trees with Applications (2013-07-23) Fan, Jia-Hao
Both the maximum agreement forest problem and the multicut on trees problem are NP-hard, and thus cannot be solved efficiently if P ≠ NP. The maximum agreement forest problem is motivated by the study of evolutionary trees in bioinformatics, in which we are given two leaf-labeled trees and are asked to find a maximum forest that is a subgraph of both trees. The multicut on trees problem has applications in networks, in which we are given a forest and a set of pairs of terminals and are asked to find a cut that separates all pairs of terminals.
We develop combinatorial and algorithmic techniques that lead to improved parameterized algorithms, approximation algorithms, and kernelization algorithms for these problems. For the maximum agreement forest problem, we proceed from the bottommost level of trees and extend solutions to whole trees. With this technique, we show that the maximum agreement forest problem is fixed-parameter tractable in general trees, resolving an open problem in this area. We also provide the first constant ratio approximation algorithm for the problem in general trees. For the multicut on trees problem, we take a new look at the problem through the lens of the vertex cover problem. This connection allows us to develop a kernelization algorithm for the problem, which gives an upper bound of O(k^3) on the kernel size, significantly improving the previous best upper bound of O(k^6). We further exploit this connection to give a parameterized algorithm for the problem that runs in time O*(1.62^k), thus improving the previous best algorithm of running time O*(2^k). In the protein complex prediction problem, which comes directly from the study of bioinformatics, we are given a protein-protein interaction network, and are asked to find dense regions in this graph. We formulate this problem as a graph clustering problem and develop an algorithm to refine the results for identifying protein complexes. We test our algorithm on yeast protein-protein interaction networks, and we show that our algorithm is able to identify complexes more accurately than other existing algorithms.

Item Finding conserved patterns in biological sequences, networks and genomes (2009-05-15) Yang, Qingwu
Biological patterns are widely used for identifying biologically interesting regions within macromolecules, classifying biological objects, predicting functions and studying evolution.
Good pattern finding algorithms will help biologists to formulate and validate hypotheses in an attempt to obtain important insights into the complex mechanisms of living things. In this dissertation, we aim to improve and develop algorithms for five biological pattern finding problems. For the multiple sequence alignment problem, we propose an alternative formulation in which a final alignment is obtained by preserving pairwise alignments specified by edges of a given tree. In contrast with traditional NP-hard formulations, our preserving alignment formulation can be solved in polynomial time without using a heuristic, while having very good accuracy. For the path matching problem, we take advantage of the linearity of the query path to reduce the problem to finding a longest weighted path in a directed acyclic graph. We can find the k top-scoring paths in a network for the query path in polynomial time. As many biological pathways are not linear, our graph matching approach allows a non-linear graph query to be given. Our graph matching formulation overcomes the common weakness of previous approaches that there is no guarantee on the quality of the results. For the gene cluster finding problem, we investigate a formulation based on constraining the overall size of a cluster and develop statistical significance estimates that allow direct comparisons of clusters of different sizes. We explore both a restricted version, which requires that orthologous genes are strictly ordered within each cluster, and the unrestricted problem, which allows paralogous genes within a genome and clusters that may not appear in every genome. We solve the first problem in polynomial time and develop practical exact algorithms for the second one. In the gene cluster querying problem, based on a querying strategy, we propose an efficient approach for investigating clustering of related genes across multiple genomes for a given gene cluster.
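The reduction mentioned in this abstract, finding a longest weighted path in a directed acyclic graph, can be illustrated with a minimal sketch. This is not the dissertation's implementation: the edge-list encoding, the function name `longest_weighted_path`, and the convention that a path may start at any node with score 0 are assumptions made for the example.

```python
from collections import defaultdict, deque


def longest_weighted_path(edges):
    """Return (score, path) for a maximum-weight path in a DAG.

    edges: iterable of (u, v, weight) tuples (hypothetical encoding).
    Node labels are assumed to be sortable so that ties break predictably.
    """
    graph = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        graph[u].append((v, w))
        indegree[v] += 1
        nodes.update((u, v))
    if not nodes:
        return 0.0, []

    # best[n] is the best score of any path ending at n; a path may start anywhere.
    best = {n: 0.0 for n in nodes}
    prev = {n: None for n in nodes}

    # Kahn's algorithm: relax outgoing edges in topological order.
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    processed = 0
    while queue:
        u = queue.popleft()
        processed += 1
        for v, w in graph[u]:
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                prev[v] = u
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if processed != len(nodes):
        raise ValueError("graph contains a cycle")

    # Walk predecessor links back from the best end node.
    end = max(sorted(nodes), key=lambda n: best[n])
    path = []
    node = end
    while node is not None:
        path.append(node)
        node = prev[node]
    return best[end], path[::-1]
```

Because every node is relaxed exactly once in topological order, the running time is linear in the number of nodes and edges, which is what makes a reduction of this kind attractive.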
By analyzing gene clustering in 400 bacterial genomes, we show that our algorithm is efficient enough to study gene clusters across hundreds of genomes.

Item Functional genomic analysis of PPAR-gamma in human colorectal cancer cells (2006-12-14) Craig Randall Bush; E. Brad Thompson; Rudy Guerra; Larry A. Denner; E. Aubrey Thompson; Bruce A. Luxon; Allan R. Brasier
The gamma isoform of peroxisome-proliferator activated receptor (PPAR-gamma) is a member of the superfamily of nuclear hormone receptors and shows much promise as a chemopreventative and therapeutic target for colorectal cancer. Activation of PPAR-gamma by thiazolidinediones (TZDs) inhibits proliferation and induces differentiation in human colon cancer cells. RS5444, a novel TZD, is a high affinity and high specificity ligand for PPAR-gamma. We have shown that RS5444 markedly reduced the proliferation of MOSER S human colorectal cancer cells under anchorage dependent and independent conditions. The inhibitory effect of RS5444 was irreversible. RS5444 also significantly repressed the invasive phenotype, but not motility, of these tumor cells.
Towards elucidating the activated PPAR-gamma controlled genomic program responsible for these observed phenotypes, functional genomic analysis was performed using a two-class longitudinal microarray data set in the presence and absence of RS5444. Differential expression of genes was obtained using an empirical Bayesian modification to the multivariate Hotelling T^2 score. We have demonstrated this statistical machine learning technique to be superior to more commonly used algorithms for two-class analysis in controlling type II error in our dataset.
Likewise, through the use of several bioinformatics techniques, including frequency-based pathway analysis and functional ontology analysis, we found an as yet unappreciated tumor-suppressing network involving a feedback mechanism between PPAR-gamma, DSCR1 and calcineurin-mediated signaling of NFATc in colorectal cancer cells. To this end, we have demonstrated a direct connection between NFATc and DSCR1 in MOSER S colorectal cancer cells. Likewise, we have demonstrated a correlation between the sensitivity of PPAR-gamma in other colorectal cancer cells and the messenger abundance of DSCR1. Finally, we have demonstrated that knockdown of DSCR1 messenger obviates the phenotypic effects of activated PPAR-gamma in vitro.
To our knowledge these data represent, for the first time, a network between PPAR-gamma, DSCR1, and NFATc signaling in the context of tumor-suppressor activity. This preliminary evidence is consistent with our working hypothesis that an oncology patient’s receptiveness to TZD treatment may be largely dependent on the specific tumor’s endogenous abundance of DSCR1. We believe without a critical endogenous level of DSCR1, activated PPARγ may revert to a tumor-activator instead of a tumor-suppressor.

Item Implementation of genomics and bioinformatics approaches for identification and characterization of tomato ripening-related genes (Texas A&M University, 2004-09-30) Fei, Zhangjun
Initial activities were focused on isolation and characterization of fruit ripening-related genes from tomato. Screening of four tomato cDNA libraries at low stringency with 10 fruit development and ripening-related genes yielded ~3000 positive clones. Microarray expression analysis of half of these positives in mature green and breaker stage fruits resulted in eight ripening-induced genes. RNA gel-blot analysis and previously published data confirmed expression for seven of the eight. One novel gene, designated LeEREBP1, was chosen for further characterization.
LeEREBP1 encodes an AP2/ERF-domain transcription factor and is ethylene inducible. The expression profiles of LeEREBP1 parallel previously characterized ripening-related genes from tomato. Transgenic plants with increased and decreased expression of LeEREBP1 were generated and are currently being characterized to define the function of LeEREBP1. A large public tomato EST dataset was mined to gain insight into the tomato transcriptome. By clustering genes according to the respective expression profiles of individual tissues, tissue and developmental expression patterns were generated and genes with similar functions grouped together. Tissues effectively clustered for relatedness according to their profiles, confirming the integrity of the approach used to calculate gene expression. Statistical analysis of EST prevalence in fruit and pathogenesis-related libraries resulted in 333 genes being classified as fruit ripening-induced, 185 as fruit ripening-repressed, and 169 as pathogenesis-related. We performed a parallel analysis on public EST data for grape and compared the results for ripening-induced genes to tomato to identify similar and distinct ripening factors in addition to candidates for conserved regulators of fruit ripening. An online interactive database for tomato gene expression data, the Tomato Expression Database (TED), was implemented. TED contains normalized expression data for approximately 12,000 ESTs over ten time points during fruit development. It also contains comprehensive annotation of each EST. Through TED, we provide multiple approaches to pursue analysis of specific genes of interest and/or access the larger microarray dataset to identify sets of genes that may behave in a pattern of interest.
In addition, a set of useful data mining and data visualization tools were developed and are under continuing expansion.

Item Structural, Functional and Evolutionary Characterization of Sense-Antisense Transcripts in Mammals (2010-07-14) Dickens, Charles
Sense-antisense transcripts (SATs) are messenger RNA (mRNA) transcripts that have regions that are complementary to regions of other mRNA transcripts. SATs may play an influential role in the regulation of gene expression. One evolutionary event that has had a dramatic impact on many genomes is the widespread dispersal of repetitive sequences, which include transposable elements (TEs) as well as simple and tandem repeats. Approximately 45% of the human genome and 37.5% of the mouse genome are composed of repeats derived from transposable elements. A group of SATs was identified as resulting from transposable elements integrating into the coding strand of some genes and into the template strand of the coding region of other genes. These SATs may add to the complexity of an organism's regulatory network or they may be the result of rather recent TE activities yet to succumb to sequence divergence. The human, mouse and bovine genomes were analyzed for SATs using publicly available datasets and bioinformatics analysis tools. Each sense-antisense binding region (SABR) was aligned to transposable elements from the RepBase repeat database, revealing many SABRs containing TE sequence in a large portion of the sequence. A Gene Ontology analysis on subsets of the data showed enrichments for the functional category of "DNA repair" and the component category "cytoplasm". An analysis of the substitution rates in human and mouse across the 3' UTRs of transcripts containing SABRs at the 5' end of their 3' UTRs showed that the substitution rate in the region of the SABR was lower than at the beginning of the 3' UTR.
The lower percent GC composition found at the 3' end of the 3' UTRs could be attributed to conserved poly-A signals in this region.

Item Use of bioinformatics to investigate and analyze transposable element insertions in the genomes of caenorhabditis elegans and drosophila melanogaster, and into the target plasmid pGDV1 (Texas A&M University, 2005-02-17) Julian, Andrea Marian
Transposable elements (TEs) are utilized for the creation of a wide range of transgenic organisms. However, in some systems, this technique is not very efficient due to low transposition frequencies and integration into unstable or transcriptionally inactive genomic regions. One approach to ameliorate this problem is to increase knowledge of how transposons move and where they integrate into target genomes. Most transposons do not insert randomly into their host genome, with class II TEs utilizing target sequences of 2 to 8 bp in length, which are duplicated upon insertion. Furthermore, amongst insertion sites, certain sites are preferred for insertion and hence are classified as hot spots, while others not targeted by TEs are referred to as cold spots. The hypothesis tested in this analysis is that in addition to the primary consensus target sequence, secondary and tertiary DNA structures have a significant influence on TE target site preference. Bioinformatics was used to predict and analyze the structure of the flanking DNA around known insertion sites and cold spots for various TEs, to understand why insertion sites are used preferentially to cold spots for element integration. Hidden Markov models were built and trained to analyze datasets of insertions of the P element in the Drosophila melanogaster genome, the Tc1 element in the Caenorhabditis elegans genome, and insertions of the Mos1, piggyBac and Hermes transposons into the target plasmid pGDV1.
Analysis of the DNA structural profiles of the insertion sites for the P element and Hermes transposons revealed that both transposons targeted regions of DNA with a relatively high degree of bendability/flexibility at the insertion site. However, similar trends were not observed for the Tc1, Mos1 or piggyBac transposons. Hence, it is believed that the secondary structural features of DNA can contribute to target site preference for some, but not all, transposable elements.

Item Wavelet methods and statistical applications: network security and bioinformatics (Texas A&M University, 2005-11-01) Kwon, Deukwoo
Wavelet methods possess versatile properties for statistical applications. We explore the advantages of using wavelets in analyses in two different research areas. First, we develop an integrated tool for online detection of network anomalies. We consider statistical change point detection algorithms, both for local changes in the variance and for jump detection, and propose modified versions of these algorithms based on moving window techniques. We investigate performance on simulated data and on network traffic data with several superimposed attacks. All detection methods are based on wavelet packet transformations. We also propose a Bayesian model for the analysis of high-throughput data where the outcome of interest has a natural ordering. The method provides a unified approach for identifying relevant markers and predicting class memberships. This is accomplished by building a stochastic search variable selection method into an ordinal model. We apply the methodology to the analysis of proteomic studies in prostate cancer. We explore wavelet-based techniques to remove noise from the protein mass spectra. The goal is to identify protein markers associated with prostate-specific antigen (PSA) level, an ordinal diagnostic measure currently used to stratify patients into different risk groups.
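
The moving-window idea behind variance change point detection, mentioned in the last abstract, can be sketched in a few lines. This is a deliberately simplified illustration that works on the raw series and omits the wavelet packet transformation the dissertation relies on; the function name `variance_change_points`, the window length, and the ratio threshold are all assumptions chosen for the example.

```python
from statistics import pvariance


def variance_change_points(series, window=20, ratio_threshold=4.0):
    """Flag indices where the variance just after differs sharply from just before.

    Compares the population variance of the `window` samples before index t
    with that of the `window` samples starting at t (hypothetical rule).
    """
    flagged = []
    for t in range(window, len(series) - window + 1):
        before = pvariance(series[t - window:t])
        after = pvariance(series[t:t + window])
        low, high = sorted((before, after))
        # The small floor avoids treating two constant windows as a change.
        if high > ratio_threshold * max(low, 1e-12):
            flagged.append(t)
    return flagged
```

A true change typically flags a run of neighbouring indices rather than a single point, so in practice the run would be collapsed to one estimate, for example its midpoint.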