Browsing by Subject "Bioinformatics"
Item Algorithms for next generation sequencing data analysis (2015-12) Das, Shreepriya; Vikalo, Haris; Dhillon, Inderjit S; Ravikumar, Pradeep; Sanghavi, Sujay; Tewfik, Ahmed
The field of genomics has witnessed tremendous achievements in the past two decades. The advances in sequencing technology have enabled acquisition of massive amounts of data that reveals information about individual genetic blueprint and is revolutionizing the field of molecular biology. Interpretation of such data requires solving mathematical (statistical and computational) problems rendered difficult by the complex interacting processes that are characteristic of biological systems; the data is high dimensional, typically noisy and often incomplete. Algorithm design in these settings requires deep understanding of the underlying biological principles, good mathematical abstractions permitting tractable inference and fast, scalable and accurate solutions using ideas from diverse fields such as optimization, probability, statistics and algorithms. This dissertation deals with two such problems occurring in the field of bioinformatics/computational biology. First, for the problem of basecalling for sequencing-by-synthesis (Illumina) platforms, I describe novel computationally tractable statistical models and signal processing schemes that are fast and have lower error rates than existing state-of-the-art basecallers. Extensions to a soft information exchange setup to do joint basecalling and SNP calling are also explored. Next, I describe two novel single individual haplotyping inference schemes using an (optimal) branch and bound framework and (scalable) low rank semidefinite programming ideas for diploid and polyploid species.
In addition to improving the quality of basecalling, SNP calling, genotyping and haplotyping, I also developed user-friendly software that can be used by the biological research community for various purposes including cancer genomics and metagenomics studies.

Item An object-oriented framework to organize genomic data (2009-05-15) Wei, Ning
Bioinformatics resources should provide simple and flexible support for genomics research. A huge amount of gene mapping data, micro-array expression data, expressed sequence tags (EST), BAC sequence data and genome sequence data are already, or will soon be, available for a number of livestock species. These species will have different requirements compared to typical biomedical model organisms and will need an informatics framework to deal with the data. This study addresses how to organize complex, intertwined genomic data so that they can be explored. We therefore investigated two issues: one is an independent informatics framework including both back end and front end; the other is how an informatics framework simplifies the user interface to explore data. We have developed a fundamental informatics framework that makes it easy to organize and manipulate the complex relations between genomic data, and allows query results to be presented via a user-friendly web interface. A genome object-oriented framework (GOOF) was proposed with object-oriented Java technology and is independent of any database system. This framework seamlessly links the database system and web presentation components. The data models of GOOF collect the data relationships in order to provide users with access to relations across different types of data, meaning that users avoid constructing queries within the interface layer. Moreover, the module-based interface provided by GOOF could allow different users to access data in different interfaces and ways.
In other words, GOOF not only gives a whole solution to informatics infrastructure, but also simplifies the organization of data modeling and presentation. In order to be a fast development solution, GOOF provides an automatic code engine by using meta-programming facilities in Java, which could allow users to generate a large amount of routine program code. Moreover, the pre-built data layer in GOOF connecting with Chado simplifies the process of managing genomic data in the Chado schema. In summary, we studied the way to model genomic data into an informatics framework, a one-stop approach, to organize the data and addressed how GOOF constructs a bioinformatics infrastructure for users to access genomic data.

Item Automated Prediction of Human Disease Genes (2012-12) Blom, Martin; Marcotte, Edward M; Dhillon, Inderjit S; Gonzalez, Oscar; Press, William H; Wilke, Claus O
The completion of the human genome project has led to a flood of new genetic data that has proved surprisingly hard to interpret. Network "guilt by association" (GBA) is a proven approach for identifying novel disease genes based on the observation that similar mutational phenotypes arise from functionally related genes. However, GBA has been shown to work poorly in genome-wide association studies (GWAS), where many genes are somewhat implicated, but few are known with very high certainty. In the first part of this work, I resolve this by explicitly modeling the uncertainty of the associations and incorporating the uncertainty for the seed set into the GBA framework. I demonstrate a significant boost in the power to detect validated candidate genes for Crohn’s disease and type 2 diabetes by comparing the predictions from my method to results from follow-up meta-analyses, with incorporation of the network serving to highlight the JAK-STAT pathway and associated adaptors GRB2/SHC1 in Crohn’s disease and BACH2 in type 2 diabetes.
Consideration of the network during GWAS thus conveys some of the benefits of enrolling more participants in the study. More generally, we demonstrate that a functional network of human genes provides a valuable statistical framework for prioritizing candidate disease genes in GWAS-based studies. Furthermore, functional gene networks are not the only kind of information that can be used to predict gene-phenotype associations. In the second part of this thesis, I show that gene-phenotype associations in model species, even species as distantly related to humans as E. coli, are another valuable source of information that can be mined using methods similar to those used in recommender systems. Finally, in the last part of this thesis, I present a machine learning formalism that combines the functional gene network and model species phenotype information. I show that this approach outperforms the state-of-the-art methods for gene-phenotype association prediction using cross-validation.

Item Bayesian Analysis of Transposon Mutagenesis Data (2012-07-16) DeJesus, Michael A.
Determining which genes are essential for growth of a bacterial organism is an important question to answer as it is useful for the discovery of drugs that inhibit critical biological functions of a pathogen. To evaluate essentiality, biologists often use transposon mutagenesis to disrupt genomic regions within an organism, revealing which genes are able to withstand disruption and are therefore not required for growth. The development of next-generation sequencing technology augments transposon mutagenesis by providing high-resolution sequence data that identifies the exact location of transposon insertions in the genome. Although this high-resolution information has already been used to assess essentiality at a genome-wide scale, no formal statistical model capable of quantifying significance has been developed.
This thesis presents a formal Bayesian framework for analyzing sequence information obtained from transposon mutagenesis experiments. Our method assesses the statistical significance of gaps in transposon coverage that are indicative of essential regions through a Gumbel distribution, and utilizes a Metropolis-Hastings sampling procedure to obtain posterior estimates of the probability of essentiality for each gene. We apply our method to libraries of M. tuberculosis transposon mutants, to identify genes essential for growth in vitro, and show concordance with previous essentiality results based on hybridization. Furthermore, we show how our method is capable of identifying essential domains within genes, by detecting significant sub-regions of open-reading frames unable to withstand disruption. We show that several genes involved in PG biosynthesis have essential domains.

Item Bayesian learning in bioinformatics (2009-05-15) Gold, David L.
Life sciences research is advancing in breadth and scope, affecting many areas of life including medical care and government policy. The field of Bioinformatics, in particular, is growing very rapidly with the help of computer science, statistics, applied mathematics, and engineering. New high-throughput technologies are making it possible to measure genomic variation across phenotypes in organisms at costs that were once inconceivable. In conjunction, and partly as a consequence, massive amounts of information about the genomes of many organisms are becoming accessible in the public domain. Some of the important and exciting questions in the post-genomics era are how to integrate all of the information available from diverse sources. Learning in complex systems biology requires that information be shared in a natural and interpretable way, to integrate knowledge and data.
The statistical sciences can support the advancement of learning in Bioinformatics in many ways, not the least of which is by developing methodologies that can support the synchronization of efforts across sciences, offering real-time learning tools that can be shared across many fields from basic science to the clinical applications. This research is an introduction to several current research problems in Bioinformatics that address integration of information, and discusses statistical methodologies from the Bayesian school of thought that may be applied. Bayesian statistical methodologies are proposed to integrate biological knowledge and improve statistical inference for three relevant Bioinformatics applications: gene expression arrays, BAC and aCGH arrays, and real-time gene expression experiments. A unified Bayesian model is proposed to perform detection of genes and gene classes, defined from historical pathways, with gene expression arrays. A novel Bayesian statistical method is proposed to infer chromosomal copy number aberrations in clinical populations with BAC or aCGH experiments. A theoretical model is proposed, motivated from historical work in mathematical biology, for inference with real-time gene expression experiments, and fit with Bayesian methods. Simulation and case studies show that Bayesian methodologies show great promise to improve the way we learn with high-throughput Bioinformatics experiments.

Item Bayesian Semiparametric Models for Heterogeneous Cross-platform Differential Gene Expression (2012-02-14) Dhavala, Soma Sekhar
We are concerned with testing for differential expression and consider three different aspects of such testing procedures. First, we develop an exact ANOVA type model for discrete gene expression data, produced by technologies such as Massively Parallel Signature Sequencing (MPSS), Serial Analysis of Gene Expression (SAGE) or other next generation sequencing technologies.
We adopt two Bayesian hierarchical models: one parametric and the other semiparametric with a Dirichlet process prior that has the ability to borrow strength across related signatures, where a signature is a specific arrangement of the nucleotides. We utilize the discreteness of the Dirichlet process prior to cluster signatures that exhibit similar differential expression profiles. Tests for differential expression are carried out using non-parametric approaches, while controlling the false discovery rate. Next, we consider ways to combine expression data from different studies, possibly produced by different technologies resulting in mixed type responses, such as Microarrays and MPSS. Depending on the technology, the expression data can be continuous or discrete and can have different technology dependent noise characteristics. Adding to the difficulty, genes can have an arbitrary correlation structure both within and across studies. Performing several hypothesis tests for differential expression could also lead to false discoveries. We propose to address all the above challenges using a Hierarchical Dirichlet process with a spike-and-slab base prior on the random effects, while smoothing splines model the unknown link functions that map different technology dependent manifestations to latent processes upon which inference is based. Finally, we propose an algorithm for controlling different error measures in a Bayesian multiple testing under generic loss functions, including the widely used uniform loss function. We do not make any specific assumptions about the underlying probability model but require that indicator variables for the individual hypotheses are available as a component of the inference.
Given this information, we recast multiple hypothesis testing as a combinatorial optimization problem, in particular the 0-1 knapsack problem, which can be solved efficiently using a variety of algorithms, both approximate and exact in nature.

Item Biochemical Analysis of the Drosophila RNAI Pathway (2009-01-14) Jiang, Feng; Liu, Qinghua
RNA interference is post-transcriptional gene silencing mediated by (21-26 nt) miRNAs and siRNAs. In Drosophila, the RNase III enzymes Dicer-1 and Dicer-2 generate miRNAs and siRNAs, respectively. Nascent miRNA and siRNA duplexes are assembled into distinct RNA induced silencing complexes termed miRISC and siRISC, of which AGO1 and AGO2 are the respective catalytic subunits. My dissertation project is focused on identifying new RNAi components and understanding mechanisms of RISC assembly by biochemical reconstitution. Our group previously identified a novel dsRNA-binding protein named R2D2 which functioned in complex with Dicer-2 to process dsRNA into siRNA. Only the Dicer-2/R2D2 complex, but neither Dicer-2 nor R2D2 alone, efficiently interacts with duplex siRNA. Furthermore, the tandem dsRNA binding domains of R2D2 are required for siRNA binding. Therefore, although R2D2 is dispensable for siRNA production, it is required for incorporating siRNA onto the siRISC complex. Generation of recombinant AGO2 protein is essential for in vitro reconstitution of the RNAi pathway. We believe that the unique poly glutamine repeat region of fly AGO2 may be problematic for expression. Thus, a series of truncated AGO2 baculoviruses that remove some or all polyQ repeats of AGO2 were generated. Co-expression with AGO1 increases the expression level of AGO2 by at least 10 fold. Affinity purified full length and one truncated form of AGO2 show minimal RISC activity, i.e., they could be programmed with single stranded siRNA and perform sequence specific cleavage of mRNA.
Most interestingly, adding purified recombinant Dicer-2/R2D2 complex to recombinant Ago2 generated dsRNA- and siRNA-initiated RISC activity. A catalytic mutant of Ago2 is unable to reconstitute RISC activity with recombinant Dicer-2/R2D2 complex, showing that the RISC activity is specific. Therefore, the three component system, Dicer-2, R2D2, and Ago2, can reconstitute the RNAi pathway of Drosophila. By a bioinformatics approach, a novel protein named Loquacious (Loqs) was identified with considerable sequence homology to R2D2. Loqs and Dicer-1 interact with each other by co-immunoprecipitation in S2 cell extract. Recombinant Loqs could enhance miRNA production by Dicer-1 by increasing its affinity for the pre-miRNA substrate. Furthermore, depleting Loqs or Dicer-1 by dsRNA knockdown resulted in reduction of the miRNA-generating activity and accumulation of pre-miRNA in S2 cells. To study the physiological function of loqs in flies, we obtained a piggyBac (PB) fly strain in which the PB transposon was inserted into the first exon and before the translation start site of the loqs gene. Pre-miRNAs accumulate in the loqs PB flies, indicating they are defective for miRNA biogenesis. However, while both siRISC and miRISC activities are greatly reduced in dcr-1 null extract, these activities are not affected in loqs null extract, indicating that loqs is not essential for miRISC assembly. To test whether the known components are sufficient to reconstitute the miRNA pathway, recombinant AGO1 protein was expressed using the insect cell expression system. It is generally believed that siRISC slices, whereas miRISC represses translation of cognate mRNA in animals. However, recombinant AGO1 can be programmed by single stranded miRNA into a minimal miRISC and sequence specifically cleaves complementary mRNA in vitro. Furthermore, the catalytic activity of AGO1 is dependent on the consensus catalytic DDH motif.
My present studies suggest that recombinant Dicer-1, Loqs and AGO1 are not sufficient to reconstitute the miRNA pathway, indicating that there are other unknown components to be discovered.

Item Computational identification and evolutionary analysis of metazoan microRNAs (2009-05-15) Anzola Lagos, Juan Manuel
MicroRNAs are a large family of 21-26 nucleotide non-coding RNAs with a role in the post-transcriptional regulation of gene expression. In recent years, microRNAs have been proposed to play a significant role in the expansion of organism complexity. MicroRNAs are expressed in a cell or tissue-specific manner during embryonic development, suggesting a role in cellular differentiation. For example, Let-7 is a metazoan microRNA that acts as a developmental timer between larval stages in C. elegans. We conducted a comparative study that determined the distribution of microRNA families among metazoans, including the identification of new family members for several species. MicroRNA families appear to have evolved in bursts of evolution that correlate with the advent of major metazoan groups such as vertebrates, eutherians, primates and hominids. Most microRNA families identified in these organisms appeared with or after the advent of vertebrates. Only a few of them appear to be shared between vertebrates and invertebrates. The distribution of these microRNA families supports the idea that at least one whole genome duplication event (WGD) predates the advent of vertebrates. Gene ontology analyses of the genes these microRNA families regulate show enrichments for functions related to cell differentiation and morphogenesis. MicroRNA genes appear to be under great selective constraints. Identification of conserved regions by comparative genomics allows for the computational identification of microRNAs.
We have identified and characterized ultraconserved regions between the genomes of the honey bee (Apis mellifera) and the parasitic wasp (Nasonia vitripennis), and developed a strategy for the identification of microRNAs based on regions of ultraconservation. Ultraconserved regions preferentially localize within introns and intergenic regions, and are enriched in functions related to neural development. Introns harboring ultraconserved elements appear to be under negative selection and under a level of constraint that is higher than in their exonic counterparts. This level of constraint suggests functional roles yet to be discovered and suggests that introns are major players in the regulation of biological processes. Our computational strategy was able to identify new microRNA genes shared between honey bee and wasp. We recovered 41 of 45 previously validated microRNAs for these organisms, and we identified several new ones. A significant fraction of these microRNA candidates are located in introns and intergenic regions and are organized in genomic clusters. Expression of 13 of these new candidates was verified by 454 sequencing.

Item A computational systems biology approach to predictive oncology : a computer modeling and bioinformatics study predicting tumor response to therapy and cancer phenotypes (2009-08) Sanga, Sandeep; Cristini, Vittorio, 1970-
Technological advances in the recent decades have enabled cancer researchers to probe the disease at multiple resolutions. This wealth of experimental data combined with computational systems biology methods is now leading to predictive models of cancer progression and response to therapy. We begin by presenting our research group’s multi-scale in silico framework for modeling cancer, whose core is a tissue-scale computational model capable of tracking the progression of tumors from a diffusion-limited avascular phase through angiogenesis, and into invasive lesions with realistic, complex morphologies.
We adapt this core model to consider the delivery of systemically-administered anticancer agents and their effect on lesions once they reach their intended nuclear target. We calibrate the model parameters using in vitro data from the literature, and demonstrate through simulation that transport limitations affecting drug and oxygen distributions play a significant role in hampering the efficacy of chemotherapy, a result that has since been validated by in vitro experimentation. While this study demonstrates the capability of our adapted core model to predict distributions (e.g., cell density, pressure, oxygen, nutrient, drug) within lesions and consequent tumor morphology, the underlying factors driving tumor-scale behavior occur at finer scales. What is needed in our multi-scale approach is to parallel reality, where molecular signaling models predict cellular behavior, and ultimately drive what is seen at the tumor level. Models of signaling pathways linked to cell models are already beginning to surface in the literature. We next transition our research to the molecular level, where we employ data mining and bioinformatics methods to infer signaling relationships underlying a subset of breast cancer that might benefit from targeted therapy of Androgen Receptor and associated pathways. Defining the architecture of signaling pathways is a critical first step towards development of pathway models underlying tumor models, while also providing valuable insight for drug discovery. Finally, we develop an agent-based, cell-scale model focused on predicting motility in response to chemical signals in the microenvironment, generally accepted to be a necessary feature of cancer invasion and metastasis. This research demonstrates the use of signaling models to predict emergent cell behavior, such as motility.
The research studies presented in this dissertation are critical steps towards developing a predictive, in silico computational model for cancer progression and response to therapy. Our Laboratory for Computational & Predictive Oncology, in collaboration with research groups throughout the United States and Europe, is following a computational systems biology paradigm where model development is fueled by biological knowledge, and model predictions are refining experimental focus. The ultimate objective is a virtual cancer simulator capable of accurately simulating cancer progression and response to therapy on a patient-specific basis.

Item Distance-based indexing and its applications in bioinformatics (2007-12) Mao, Rui, 1975-; Miranker, Daniel P.

Item Evaluation of Microbial Communities from Extreme Environments as Inocula in a Carboxylate Platform for Biofuel Production from Cellulosic Biomass (2013-08-06) Cope, Julia Lee
The carboxylate biofuels platform (CBP) involves the conversion of cellulosic biomass into carboxylate salts by a mixed microbial community. Chemical engineering approaches to convert these salts to a variety of fuels (diesel, gasoline, jet fuel) are well established. However, prior to initiation of this project, little was known about the influence of inoculum source on platform performance. The studies in this dissertation test the hypothesis that microbial communities from particular environments in nature (e.g. saline and/or thermal sediments) are pre-adapted to similar industrial process conditions and, therefore, exhibit superior performances. We screened an extensive collection of sediment samples from extreme environments across a wide geographic range to identify and characterize microbial communities with superior performances in the CBP. I sought to identify aspects of soil chemistry associated with superior CBP fermentation performance.
We showed that CBP productivity was influenced by both fermentation conditions and inocula; thus it is clearly reasonable to expect that both can be optimized to target desired outcomes. Also, we learned that fermentation performance is not as simple as finding one soil parameter that leads to increases in all performance parameters. Rather, there are complex multivariate relationships that are likely indicative of trade-offs associated within the microbial communities. An analysis of targeted locus pyrosequence data for communities with superior performances in the fermentations provides clear associations between particular bacterial taxa and particular performance parameters. Further, I compared microbial community compositions across three different process screen technologies employed in research to understand and optimize CBP fermentations. Finally, we assembled and characterized an isolate library generated from a systematic culture approach. Based on partial 16S rRNA gene sequencing, I estimated operational taxonomic units (OTUs), and inferred a phylogeny of the OTUs. This isolate library will serve as a tool for future studies of assembled communities and bacterial adaptations useful within the CBP fermentations. Taken together, the tools and results developed in this dissertation provide for refined hypotheses for optimizing inoculum identification, community composition, and process conditions for this important second generation biofuel platform.

Item The fluviageny, a method for analyzing temporal river fragmentation using phylogenetics (2015-05) Gordon, Andrew Lloyd; Howison, James; Arctur, David K
Phylogenetic trees have historically been used to determine evolutionary relatedness between organisms. In the past few decades, as we've developed increasingly powerful computational algorithms and toolsets for performing analyses using phylogenetic methods, the use of these trees has expanded into other areas, including biodiversity informatics and geoinformatics.
This report proposes using phylogenetic methods to create "fluviagenies" - trees that represent the effects of river fragmentation over time caused by damming. Faculty at the Center for Research in Water Resources at the University of Texas worked to develop tools and documentation for automating the creation of river segment codes (a.k.a., "fluvcodes") based on spatiotemporal data. Python was used to generate fluviageny trees from lists of these codes. The resulting trees can be exported into the appropriate data format for use with various phylogenetics programs. The Fishes of Texas Database (fishesoftexas.org), a comprehensive geospatial database of Texas fish occurrences aggregated and normalized from 42 museum collections around the world, was employed to create an example of how this tool might be used to analyze and hypothesize changes in fish populations as a consequence of river fragmentation. Additionally, this paper serves to theorize and analyze past and future potential uses for phylogenetic trees in various other fields of informatics.

Item Framework for automated phylogenetic analysis of molecular sequence databases (Texas Tech University, 2002-12) Shen, Lishuang
Utilizing the large amount of biological sequence data from the databases is a powerful tool for molecular phylogenetics. But dealing with the storage and analysis of the sequence data is tedious with manual methods, especially when tens of thousands of sequences need to be analyzed. To address this problem, I developed the automatic framework for general molecular phylogenetics analysis (AGMPA) system. The system automates the routine work in molecular phylogenetics. Perl scripts were used to glue together the programs used and to parse analysis outputs. This system also implements databases for information storage, retrieval and presentation. The databases integrate different types of data in molecular phylogenetics and support database queries in different ways.
The system provides a graphical user interface (GUI) for all the functions and for calling the bioinformatics programs BLAST, FASTA, CLUSTAL, PHYLIP and TREE-PUZZLE. The system is implemented with Perl/Tk to ensure its cross-platform compatibility. The system was tested with 52,499 nucleotide sequences and 1165 protein sequences from the Gossypium genus and with 36,495 protein sequences from the Poaceae family. Phylogenetic analysis results from these two test datasets are presented.

Item Functional analysis of select Arabidopsis ABA INSENSITIVE1/2-Like protein phosphatases (2008-05) Zhang, Tiantian; Chris, Rock
We have utilized a maize protoplast transient gene expression system to investigate the possible roles of several Arabidopsis type 2C protein phosphatases (PP2Cs), which are predicted to be involved in ABA responses by hierarchical clustering meta-analysis of transcriptome profiling datasets and sequence homologies to the known PP2C effector ABA INSENSITIVE1. We recombined Cre-lox UPS host acceptor vectors (pCR701-705) containing the maize Ubiquitin promoter and N-terminal epitope tags with pUNI donor vectors containing various full length Arabidopsis PP2C cDNAs. On the basis of my maize protoplast transient expression data that suggested ABA antagonist effects of several overexpressed PP2Cs, I performed a reverse-genetic screen for AP2C9, AP2C12 and AP2C15 T-DNA insertion mutants and found supportive evidence from physiological assays on seed germination and root growth that these PP2Cs may be involved in ABA signaling.
I conclude that the utilization of the maize protoplast transient expression system for moderate-throughput functional screening and systems-approach classification of gene activities can contribute to elucidation of plant abscisic acid (ABA) signal transduction pathways when combined with complementary genetic and physiological approaches.

Item Genetic Analysis of Stem Composition Variation in Sorghum Bicolor (2012-10-19) Evans, Joseph
Sorghum (Sorghum bicolor [L.] Moench) is the world's fifth most economically important cereal crop, grown worldwide as a source of food for both humans and livestock. Sorghum is a C4 grass that is well adapted to hot and arid climes and is popular for cultivation on lands of marginal quality. Recent interest in development of biofuels from lignocellulosic biomass has drawn attention to sorghum, which can be cultivated in areas not suitable for more traditional crops, and is capable of generating plant biomass in excess of 40 tons per acre. While the quantity of biomass and low water consumption make sorghum a viable candidate for biofuels growth, the biomass composition is enriched in lignin, which is problematic for enzymatic and chemical conversion techniques. The genetic basis for stem composition was analyzed in sorghum populations using a combination of genetic, genomic, and bioinformatics techniques. Utilizing acetyl bromide extraction, the variation in stem lignin content was quantified across several sorghum cultivars, confirming that lignin content varied considerably among sorghum cultivars. Previous work identifying sorghum reduced-lignin lines has involved the monolignol biosynthetic pathway; all steps in the pathway were putatively identified in the sorghum genome using sequence analysis. A bioinformatics toolkit was constructed to allow for the development of genetic markers in sorghum populations, and a database and web portal were generated to allow users to access previously developed genetic markers.
Recombinant inbred lines were analyzed for stem composition using near infrared reflectance spectroscopy (NIR) and genetic maps constructed using restriction site-linked polymorphisms, revealing 34 quantitative trait loci (QTL) for stem composition variation in a BTx642 x RTx7000 population, and six QTL for stem composition variation in an SC56 x RTx7000 population. Sequencing the genomes of BTx642 and RTx7000 to a depth of ~11x using Illumina sequencing revealed approximately 1.4 million single nucleotide polymorphisms (SNPs) and 1 million SNPs, respectively. These polymorphisms can be used to identify putative amino acid changes in genes within these genotypes, and can also be used for fine mapping. Plotting the density of these SNPs revealed patterns of genetic inheritance from shared ancestral lines both between the newly sequenced genotypes and relative to the reference genotype BTx623.
Item Global survey of the immunoglobulin repertoire using next generation sequencing technology (2014-12) Hoi, Kam Hon; Georgiou, George
Specific and sensitive recognition of foreign agents is a critical attribute of the overall effective immune system required for maintaining host protection against challenge from pathogenic cells. In the humoral arm of the immune system, this recognition attribute is carried out by the cell surface bound immunoglobulin-like receptors (BCR) and their soluble forms, i.e., antibodies. Over several million years of evolution, the immune system has adopted several strategies for diversifying the antibody sequence and thus its ability to recognize an astronomical variety of molecules through the combinatorial assembly of a small number of DNA segments or genes. Among these immunoglobulin gene diversification strategies, antibody somatic VDJ recombination and junctional diversity are the fundamental mechanisms in generating a broad range of antibody specificities. 
Understanding how the genetic diversity of antibodies is affected in health and disease is critical for a wide range of medical applications, from vaccine evaluation to diagnostics and therapeutics discovery. Because of the very large number of distinct antibodies encoded by the more than 100 billion B cells in humans, it is essential to use high throughput next generation sequencing technologies in order to obtain an adequate sampling of the sequences and relative abundance of different antibodies expressed by B cells in clinical samples. The process requires, first, rigorous methods for experimentally determining the sequences of antibodies in a sample and, second, informatics tools designed to distill this information for practical purposes. This dissertation describes a variety of experimental approaches and informatics tools developed for the determination and mining of the antibody repertoire. The information from this work has led to major conclusions regarding the nature of the antibody repertoire in healthy individuals, in volunteers following vaccination, and in HIV-1 patients.
Item Identification of the Influenza A nucleoprotein sequence that interacts with the viral polymerase (2011-08) Marklund, Jesper Karl; Krug, Robert M.; Marcotte, Edward M.; Sawyer, Sara L.; Stevens, Scott W.; Sullivan, Christopher S.; Wilke, Claus O.
Influenza A is a negative-stranded RNA virus with a segmented genome. Once the virus infects a cell it must replicate its full length viral genomic RNA (vRNA) through a positive sense complementary intermediate RNA (cRNA) as well as transcribe viral messenger RNA (mRNA) using the vRNA as a template. The regulation of whether the viral polymerase replicates the genome by synthesizing cRNA, or produces mRNA in order to make viral protein, involves the viral nucleoprotein (NP). We tried to find the sequence residues of NP that directly interact with the viral polymerase. 
We mutated to alanine several residues on NP that are surface exposed on recently solved crystal structures as well as those thought to be oriented toward the viral polymerase complex in cryo-EM studies. As a first screen, we tested these mutants in a mini-genome assay where the NP stimulation of the viral polymerase can be studied in transfected cells. Through this screen we found that the NP mutations that most hindered its ability to stimulate polymerase activity were located in a loop between two alpha helices in the head domain of NP, at residues 203 to 209. Specifically, the NP single mutants of R204, W207, and R208 were inactive in the mini-genome assay. Using RT-PCR we found that the cRNA to vRNA step of replication is severely inhibited by these mutations. Immunoprecipitation using transfected cells showed that the NP mutants lost the ability to bind all three polymerase subunits. This indicates that this loss of polymerase binding may be the reason the NP mutant fails to stimulate polymerase activity. To make sure that this loss of polymerase stimulation was not due to altering other functions of NP, we verified that the protein had proper cellular localization, oligomerization, and RNA binding abilities. Using immunofluorescence we found that mutant NP localized to the nucleus just like wild type. In order to test RNA binding and oligomerization we tested NP purified from a baculovirus expression system. Using fluorescence polarization we found that mutant NP binds single-stranded RNA with an affinity similar to that of wild type. Using gel filtration we found that mutant NP forms oligomers just like wild type. Using covariation analysis of how different positions in an amino acid alignment change relative to each other, we predicted possible binding sites between NP and the three polymerase subunits PA, PB1 and PB2. 
Due to more complete crystal structure data we focused on the PA-NP interaction and found that covariation aided in finding binding sequence residues on PA but not NP. Another outcome of developing the covariation method was developing a program to view broad primary structure changes in large sequence alignments. This method has been informative in evaluating how amino acid positions in influenza have changed over time, as well as what defines specific residues as belonging to human or avian viruses.
Item Improving the quality of multiple sequence alignment (2009-05-15) Lu, Yue
Multiple sequence alignment is an important bioinformatics problem, with applications in diverse types of biological analysis, such as structure prediction, phylogenetic analysis and critical sites identification. In recent years, the quality of multiple sequence alignment has improved considerably thanks to newly developed methods, although constructing accurate alignments remains difficult, especially for divergent sequences. In this dissertation, we propose three new methods (PSAlign, ISPAlign, and NRAlign) for further improving the quality of multiple sequence alignment. In PSAlign, we propose an alternative formulation of multiple sequence alignment based on the idea of finding a multiple alignment which preserves all the pairwise alignments specified by edges of a given tree. In contrast with traditional NP-hard formulations, our preserving alignment formulation can be solved in polynomial time without using a heuristic, while still retaining very good performance when compared to traditional heuristics. 
In ISPAlign, by using additional hits from database search of the input sequences, a few strategies have been proposed to significantly improve alignment accuracy, including the construction of profiles from the hits while performing profile alignment, the inclusion of high scoring hits into the input sequences, the use of intermediate sequence search to link distant homologs, and the use of secondary structure information. In NRAlign, we observe that it is possible to further improve alignment accuracy by taking into account alignment of neighboring residues when aligning two residues, thus making better use of horizontal information. By modifying existing multiple alignment algorithms to make use of horizontal information, we show that this strategy is able to consistently improve over existing algorithms on all the benchmarks that are commonly used to measure alignment accuracy.
Item Logos Ex Machina: A reasoned approach toward Cancer (2012-05) Avila, Andrew; Gollahon, Lauren; Strauss, Richard E.; Rice, Sean H.; Butler, Boyd; Watson, Richard
Limitations in our current ability to integrate a diverse spectrum of genetic information in an effort to elucidate the underlying causes of cancer have created the need for a novel cancer modeling approach. Public repositories of biological pathways and gene expression experiments were combined in order to provide a systems biology approach toward cancer. Furthermore, by unifying these sources of knowledge, the ability to predict expression levels of unmeasured genes was developed. This technique was then applied to a variety of cancer types in order to resolve commonalities between heretofore divergent (or disparate) cancers. The results generated in this manner revealed characteristics that challenge the current prevailing paradigm of cancer. 
Specifically, the results predicted by the Somatic Mutation Theory of Cancer, a significant upregulation of oncogenes and a significant downregulation of tumor suppressor genes, were not found. In contrast, it was found that oncogenes were significantly downregulated and tumor suppressor genes were upregulated among the cancers examined. Furthermore, the results demonstrate the differential expression, in cancer cells, of genes involved in the cellular differentiation and wound healing processes. These results were used as a springboard to develop a novel oncogenesis hypothesis, named Umbracesis. In short, the Umbracesis hypothesis proposes that disruption of the wound healing process via carcinogens occurs in such a way as to prevent organismic homeostasis from being recovered or prevent full re-differentiation of dedifferentiated cells. The former concept is implicated in inflammatory cancers, whereas the latter is implicated in cancers that show characteristics associated with embryonic tissues. It was concluded that the instrumental use of the modeling approach developed within this study has implications beyond cancer and may be of use within other areas of biomedical concern.
Item On multiple sequence alignment (2007-12) Wang, Shu, 1973-; Miranker, Daniel P.; Ambler, Tony
The tremendous increase in biological sequence data presents us with an opportunity to understand the molecular and cellular basis for cellular life. Comparative studies of these sequences have the potential, when applied with sufficient rigor, to decipher the structure, function, and evolution of cellular components. The accuracy and detail of these studies are directly proportional to the quality of these sequence alignments. Given the large number of sequences per family of interest, and the increasing number of families to study, improving the speed, accuracy and scalability of MSA is becoming an increasingly important task. 
In the past, much of the interest has been in global MSA. In recent years, the focus for MSA has shifted from global MSA to local MSA. Local MSA is needed to align variable sequences from different families/species. In this dissertation, we developed two new algorithms for fast and scalable local MSA, a three-way-consistency-based MSA and a biclustering-based MSA. The first MSA algorithm is a three-way-Consistency-Based MSA (CBMSA). CBMSA applies alignment consistency heuristics in the form of a new three-way alignment to MSA. While the three-way consistency approach is able to maintain the same time complexity as the traditional pairwise consistency approach, it provides more reliable consistency information and better alignment quality. We quantify the benefit of using three-way consistency as compared to pairwise consistency. We have also compared CBMSA to a suite of leading MSA programs and CBMSA consistently performs favorably. We also developed another new MSA algorithm, a biclustering-based MSA. Biclustering is a clustering method that simultaneously clusters both the domain and range of a relation. A challenge in MSA is that the alignment of sequences is often intended to reveal groups of conserved functional subsequences. Simultaneously, the grouping of the sequences can impact the alignment; precisely the kind of dual situation biclustering algorithms are intended to address. We define a representation of the MSA problem enabling the application of biclustering algorithms. We develop a computer program for local MSA, BlockMSA, that combines biclustering with divide-and-conquer. BlockMSA simultaneously finds groups of similar sequences and locally aligns subsequences within them. Further alignment is accomplished by dividing both the set of sequences and their contents. The net result is both a multiple sequence alignment and a hierarchical clustering of the sequences. BlockMSA was compared with a suite of leading MSA programs. 
With respect to quantitative measures of MSA, BlockMSA scores comparably to or better than the other leading MSA programs. With respect to biological validation of MSA, the other leading MSA programs lag BlockMSA in their ability to identify the most highly conserved regions.
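The abstract above does not say which quantitative measures were used to compare MSA programs. As an illustration only, one widely used measure is the sum-of-pairs score, which adds up a score for every pair of residues in every alignment column. The sketch below is a minimal version; the function name and the match, mismatch and gap values are assumptions chosen for the example and are not taken from the dissertation.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score for an alignment given as equal-length strings.

    '-' marks a gap. The scoring values are illustrative assumptions,
    not parameters from any of the programs named above.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned rows must have the same length")

    total = 0
    # zip(*alignment) walks the alignment column by column.
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # gap-gap pairs contribute nothing
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


if __name__ == "__main__":
    toy_alignment = ["AC-T", "ACGT", "A-GT"]
    print(sum_of_pairs(toy_alignment))
```

Benchmark suites often report a variant that counts how many residue pairs of a trusted reference alignment are reproduced, rather than a raw substitution score; the column-by-column pair enumeration is the same in both cases.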