Browsing by Subject "Genomics"
Item An assessment of health educators' likelihood of adopting genomic competencies for the public health workforce (2009-05-15) Chen, Lei-Shih
Although the completion of the Human Genome Project helps develop efficient treatment/prevention programs, it will raise new and non-trivial public health issues. Many of these issues fall under the professional purview of health educators. Yet, no studies have evaluated if health educators (HEs) are ready to adopt genomic competencies into health promotion. This dissertation addresses this issue by examining three research questions in three separate studies: 1) Why must HEs develop genomic competencies? 2) What are HEs' knowledge of, and attitudes toward, genomic competencies? and 3) What is HEs' likelihood of adopting genomic competencies into health promotion? The first theoretical study proposed five arguments supporting the need for HEs to develop their genomic competencies and integrate public health genomics into health promotion. These arguments touched on various dimensions of HEs' professional goals and ranged from professional responsibilities and competencies, to the availability of funding for genomic-related research or interventions and opportunities for future employment. For the second study, a web-based survey was developed and distributed to all members of four major health education organizations. A total of 1,925 HEs completed the survey and 1,607 responses were utilized in the final analysis. This study indicated that participants had deficient knowledge and unfavorable attitudes toward the CDC-proposed genomic competencies. In the third study, a theoretical model was developed to predict HEs' likelihood to incorporate genomic competencies into their practice. Using techniques from Structural Equation Modeling (SEM), the model was tested with the same data of the second study. Findings supported the proposed theoretical model. 
While genomic knowledge, attitudes, and self-efficacy were significantly associated with HEs' likelihood to incorporate genomic competencies into their practice, attitudes were the strongest predictor of likelihood. In summary, these studies indicated that participating HEs had deficient genomic knowledge, unfavorable attitudes toward a set of CDC-proposed genomic competencies, and low likelihood to adopt genomic competencies into health promotion. Relevant training should be developed and advocated. As the SEM analysis indicated, the survey findings supported the proposed theoretical model, which can be utilized to steer future training for HEs.
Item Analysis of Genomic Imprinting of UBE3A in Neurons (2015-05-05) Hillman, Paul Randolph
Angelman syndrome (AS), chromosome 15q11-q13 duplication syndrome (Dup15q), and Prader-Willi syndrome (PWS) are neurodevelopmental disorders associated with dysregulated expression of imprinted genes located within the human 15q11-13 imprinted region. Angelman syndrome is caused by loss-of-function or loss-of-expression of the maternally inherited UBE3A allele; Dup15q syndrome is attributed to maternally inherited copy number gains of UBE3A; and paternally inherited deletions of the SNORD116 cluster cause PWS. The UBE3A gene is imprinted in the brain with maternal-specific expression and biallelically expressed in all other cell types. The imprint is regulated by expression of the UBE3A antisense transcript (UBE3A-AS), which is expressed only in neurons and imprinted with paternal-specific expression. The UBE3A-AS represents the 3' end of a long polycistronic transcript that includes the SNORD116 and SNORD115 gene clusters. Thus, the genes causing AS, Dup15q, and PWS are transcriptionally linked; however, the functional significance of the neuron-specific imprint is largely unknown. In this dissertation, it was hypothesized that imprinting of UBE3A evolved as a mechanism to negatively regulate UBE3A protein levels in neurons. 
This hypothesis was tested by examining allelic expression patterns and associated protein levels of the mouse 7c imprinted region, the orthologous region of human 15q11-q13. Analyses revealed that imprinted expression of Ube3a in the brain resulted in elevated RNA and protein levels compared to tissues where Ube3a was biallelically expressed. Likewise, Snord116, Snord115, and Ube3a-AS transcripts were highly expressed in the brain. The elevated Ube3a protein levels in the brain were due to increased maternal-allelic expression during neurogenesis concurrent with paternal-allelic suppression. Analysis of UBE3A expression in the opossum, a metatherian mammal lacking an orthologous imprinted region, showed that the UBE3A imprint did not evolve to negatively regulate UBE3A protein levels in the brain. Extensive alternative splicing of Ube3a-AS was detected in the brain, which generated at least two transcripts containing novel open reading frames. Novel Ube3a alternatively spliced transcripts were also identified in the brain. Collectively, these data reject the hypothesis that the UBE3A imprint evolved to negatively regulate UBE3A protein levels in the brain; instead, they suggest that the UBE3A imprint may allow co-expression of the UBE3A and SNORD gene cluster in neurons, which may also facilitate or regulate the expression of novel brain-specific UBE3A transcripts.
Item Analytic Study of Performance of Error Estimators for Linear Discriminant Analysis with Applications in Genomics (2012-02-14) Zollanvari, Amin
Error estimation must be used to find the accuracy of a designed classifier, an issue that is critical in biomarker discovery for disease diagnosis and prognosis in genomics and proteomics. 
This dissertation is concerned with the analytical formulation of the joint distribution of the true error of misclassification and two of its commonly used estimators, resubstitution and leave-one-out, as well as their marginal and mixed moments, in the context of the Linear Discriminant Analysis (LDA) classification rule. In the first part of this dissertation, we obtain the joint sampling distribution of the actual and estimated errors under a general parametric Gaussian assumption. Exact results are provided in the univariate case and an accurate approximation is obtained in the multivariate case. We show how these results can be applied in the computation of conditional bounds and the regression of the actual error, given the observed error estimate. In practice the unknown parameters of the Gaussian distributions, which figure in the expressions, are not known and need to be estimated. Using the usual maximum-likelihood estimates for such parameters and plugging them into the theoretical exact expressions provides a sample-based approximation to the joint distribution, and also sample-based methods to estimate upper conditional bounds. In the second part of this dissertation, exact analytical expressions for the bias, variance, and Root Mean Square (RMS) for the resubstitution and leave-one-out error estimators in the univariate Gaussian model are derived. All probabilistic characteristics of an error estimator are given by the knowledge of its joint distribution with the true error. Partial information is contained in their mixed moments, in particular, their second mixed moment. Marginal information regarding an error estimator is contained in its marginal moments, in particular, its mean and variance. Since we are interested in estimator accuracy and wish to use the RMS to measure that accuracy, we desire knowledge of the second-order moments, marginal and mixed, with the true error. 
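The link between the RMS and the second-order moments mentioned above can be written out explicitly. The notation is an assumption made here for illustration, not the dissertation's own: \varepsilon denotes the true error and \hat{\varepsilon} an error estimator such as resubstitution or leave-one-out.

```latex
\mathrm{RMS}(\hat{\varepsilon})
  = \sqrt{E\big[(\hat{\varepsilon}-\varepsilon)^{2}\big]}
  = \sqrt{E[\hat{\varepsilon}^{2}] - 2\,E[\hat{\varepsilon}\,\varepsilon] + E[\varepsilon^{2}]}
```

The middle term is the second mixed moment and the outer terms are the marginal second moments, which is why all three are needed to evaluate the RMS.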
In the multivariate case, using the double asymptotic approach with the assumption of knowing the common covariance matrix of the Gaussian model, analytical expressions for the first moments, second moments, and mixed moment with the actual error for the resubstitution and leave-one-out error estimators are derived. The results provide accurate small sample approximations and this is demonstrated in the present situation via numerical comparisons. Application of the results is discussed in the context of genomics.
Item Application of Logic Synthesis Toward the Inference and Control of Gene Regulatory Networks (2013-06-27) Lin, Pey Chang K
In the quest to understand cell behavior and cure genetic diseases such as cancer, the fundamental approach being taken is undergoing a gradual change. It is becoming more acceptable to view these diseases as an engineering problem, and systems engineering approaches are being deployed to tackle genetic diseases. In this light, we believe that logic synthesis techniques can play a very important role. Several techniques from the field of logic synthesis can be adapted to assist in the arguably huge effort of modeling cell behavior, inferring biological networks, and controlling genetic diseases. Genes interact with other genes in a Gene Regulatory Network (GRN) and can be modeled as a Boolean Network (BN) or equivalently as a Finite State Machine (FSM). As the expression of genes determines cell behavior, important problems include (i) inferring the GRN from observed gene expression data from biological measurements, and (ii) using the inferred GRN to explain how genetic diseases occur and determine the "best" therapy towards treatment of disease. We report results on the application of logic synthesis techniques that we have developed to address both these problems. 
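The Boolean-network view of a GRN described above can be illustrated with a toy example. This is a minimal sketch, not the dissertation's method: the three genes A, B, C and their update rules are hypothetical, and synchronous updating is an assumption made here.

```python
from itertools import product

# Hypothetical three-gene network; these update rules are illustrative only
# and are not taken from the dissertation.
RULES = {
    "A": lambda s: s["B"] and not s["C"],
    "B": lambda s: s["A"],
    "C": lambda s: s["A"] or s["B"],
}
GENES = sorted(RULES)


def step(state):
    """Synchronously apply every gene's Boolean update rule to a state dict."""
    return {gene: bool(rule(state)) for gene, rule in RULES.items()}


def transition_table():
    """Enumerate all 2^n states and their successors (the FSM view of the BN)."""
    table = {}
    for bits in product([False, True], repeat=len(GENES)):
        nxt = step(dict(zip(GENES, bits)))
        table[bits] = tuple(nxt[g] for g in GENES)
    return table


if __name__ == "__main__":
    for current, following in transition_table().items():
        print(current, "->", following)
```

Each key of the table is one state of the finite state machine and each value is its successor, so inferring a GRN in this framing amounts to finding rules whose table agrees with the observed expression data.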
In the first technique, we present Boolean Satisfiability (SAT) based approaches to infer the predictor (logical support) of each gene that regulates melanoma, using gene expression data from patients who are suffering from the disease. From the output of such a tool, biologists can construct targeted experiments to understand the logic functions that regulate a particular target gene. Our second technique builds upon the first, in which we use a logic synthesis technique, implemented using SAT, to determine gene regulating functions for predictors and gene expression data. This technique determines a BN (or family of BNs) to describe the GRN and is validated on a synthetic network and the p53 network. The first two techniques assume binary valued gene expression data. In the third technique, we utilize continuous (analog) expression data, and present an algorithm to infer and rank predictors using modified Zhegalkin polynomials. We demonstrate our method to rank predictors for genes in the mutated mammalian and melanoma networks. The final technique assumes that the GRN is known, and uses weighted partial Max-SAT (WPMS) towards cancer therapy. Cancer is modeled using a stuck-at fault model, and ATPG techniques are used to characterize genes leading to cancer and select drugs to treat cancer. To steer the GRN state towards a desirable healthy state, the optimal selection of drugs is formulated using WPMS. Our techniques can be used to find a set of drugs with the least side-effects, and are demonstrated in the context of growth factor pathways for colon cancer.
Item Characterizing miRNA mediated regulation of proliferation (2014-05) Polioudakis, Damon Constantine; Iyer, Vishwanath R.
Cell proliferation is a fundamental biological process, and the ability of human cells to transition from a quiescent to proliferative state is essential for tissue homeostasis. 
Most cells in eukaryotic organisms are in a quiescent state, but on appropriate physiological or pathological stimuli, many types of somatic cells may exit quiescence, re-enter the cell cycle and begin to proliferate. The ability of cells to remain viable while quiescent, exit quiescence and re-enter into the cell cycle is the basis for varied physiological processes such as wound healing, lymphocyte activation and hepatocyte regeneration, but is also a hallmark of cancer. The transition of mammalian cells from quiescence to proliferation is accompanied by the differential expression of several microRNAs (miRNAs) and transcription factors. Our understanding of miRNA biology has significantly improved, but the miRNA regulatory networks that govern cell proliferation are still largely unknown. We characterized a miR-22 Myc network that mediates proliferation through regulation of the interferon response and multiple cell cycle arrest genes. We identified several cell cycle arrest genes that regulate the effects of the tumor suppressor p53 as direct targets of miR-22, and discovered that miR-22 suppresses interferon gene expression. We go on to show that miR-22 is activated by the transcription factor Myc as quiescent cells enter proliferation, and that miR-22 represses the Myc transcriptional repressor MXD4, mediating a feed forward loop to elevate Myc expression levels. To more effectively determine miRNA targets, we utilized a combination of RNA-induced silencing complex immunoprecipitations and gene expression profiling. Using this approach for miR-191, we constructed an extensive transcriptome wide miR-191 target set. We show that miR-191 regulates proliferation, and targets multiple proto-oncogenes, including CDK9, NOTCH2, and RPS6KA3. Recent advances in determining miRNA targetomes have revealed widespread non-canonical miRNA-target pairing. 
We experimentally identified the transcriptome wide targets of miR-503, miR-103, and miR-494, and observed evidence of non-canonical target pairing for these miRNAs. We went on to confirm that miR-503 requires pairing outside of the canonical 5' seed region to directly target the oncogene DDHD2. Further bioinformatics analysis implicated miR-503 and DDHD2 in breast cancer tumorigenesis.
Item Comparative genomics, antimicrobial resistance determinants, and pathogenicity of community-associated Staphylococcus aureus (2016-05) Lee, Grace Choi; Frei, Christopher R.; Lawson, Kenneth A; Wilson, James P; Wang, Yufeng; Olsen, Randall
Staphylococcus aureus is a major human pathogen and a global public health issue. It is considered an opportunistic pathogen as it asymptomatically colonizes its host, but can occasionally cause diseases that range in severity from relatively minor skin and soft tissue infections (SSTI) to life-threatening cases of pneumonia and endocarditis. There is a critical need to better understand mechanisms that lead to the evolution, resistance, and severity of S. aureus infections. Bacterial whole genome sequencing (WGS) techniques have offered new insights into S. aureus genomic populations and have the potential to predict antimicrobial resistance and infection severity. This study applied WGS 1) to describe the diversity and distribution of resistance mechanisms among community-associated S. aureus isolates, and 2) to identify S. aureus genetic signatures associated with SSTI isolates and derive a predictive risk model. WGS was performed on S. aureus isolates from patients within 14 primary care clinics in the South Texas Ambulatory Research Network from 2007 to 2015. The bacterial genomes were compared to a reference genome, FPR3757 (USA300 strain) to identify single nucleotide polymorphisms (SNPs). Phylogenetic analyses were conducted using concatenated SNP nucleotides in the core genomes. 
In the first study, the resistome was assembled by identifying antimicrobial resistance determinants related to the phenotypically derived antibiogram. This study found that multidrug-resistant S. aureus isolates have emerged in the South Texas community; approximately one-third were multidrug-resistant. There was an increasing resistance pattern to fluoroquinolones. Furthermore, the genotype proved to be highly predictive of antimicrobial resistance (very major error rate=0% and major error rate=1.4%). These findings highlight the genomic diversity of S. aureus strains in the South Texas community and demonstrate the utility of next generation sequencing to define the diversity and distribution of resistance mechanisms within S. aureus. Further work to explore antimicrobial selective pressures is needed. The second study utilized a bacterial genome-wide association study to identify specific variants associated with S. aureus pathogenicity. This study revealed the heterogeneity of S. aureus SSTI and nasal colonization isolates and identified potentially novel pathogenic mechanisms.
Item Computational and experimental methods in functional genomics : the good, the bad, and the ugly of systems biology (2008-08) Hart, Glen Traver; Marcotte, Edward M.
Seven years into the postgenomic era, we sit atop a mountain of data whose generation was enabled by gene sequencing. The creation, integration, and analysis of these large scale data sets allow us to move forward toward the complementary goals of determining the individual roles of the thousands of uncharacterized mammalian genes and understanding how they work together to produce a healthy human being -- or, perhaps more importantly, how their malfunction results in disease. Collapsing the results of large-scale assays into gene networks provides a useful framework from which we can glean information that advances both of these goals. 
However, the utility of networks is limited by the quality of the data that goes into them. This study seeks to shed some light on the quality and breadth of protein interaction networks, describes a new experimental technique for functional genetic assays in mammalian cell lines, and ultimately suggests a strategy for how to improve the overall utility of the output generated by the systems biology community.
Item Development of high throughput functional annotation system with distributed capabilities (Texas Tech University, 2006-05) Zaragoza, Joaquin; Temkin, Bharti H.; Dowd, Scot E.
Background: The genomes of over 250 species including the human genome have been sequenced to date. However, technologies that produce these genetic sequences (genes and/or proteins) are advancing at a much faster rate compared to the science and technologies that make it possible for the scientists to determine how each individual gene/protein functions within the cell and to annotate the genes/proteins based upon these functions. Specifically, this process of functional annotation assigns a functional descriptor to unknown genetic sequences. The process of functional annotation includes (1) formatting raw genetic sequence data, (2) the use of pre-existing annotated databases to assign, based upon sequence similarity searches, putative annotations to the raw genetic sequence data, (3) providing visualization tools to facilitate curation and formal assignment of the automated mappings obtained as part of step two, and finally (4) formatting the final datasets into standard output formats conducive to downstream analysis. Rationale: Thus the primary need in the field of computational genetics is a unified functional annotation solution providing (1) improved overall throughput, (2) organized visualizations of complex datasets, (3) user interface interactivity with limited latency, and (4) processing annotated data into formats conducive to downstream analysis. 
Although current solutions exist for individual tasks in the annotation process, a unified solution is needed that provides efficient computational methods at key high throughput steps in the process. To address these issues and provide a solution, we have developed the High Throughput Gene Ontology Functional Annotation Toolkit (HTGOFAT). Material and Methods: HTGOFAT was encoded with C# using the Microsoft .NET Framework. The key high-throughput steps in the annotation process were identified as handling input sequences, the computationally intensive nature of sequence similarity searches, and local and remote database interactions. In addition, data visualizations are generated within HTGOFAT to complete the functional annotation process. First, an indexing schema for handling input sequence databases is integrated into HTGOFAT that allows for indexing and retrieving sequences from a flat text file while leaving the file intact. Secondly, HTGOFAT integrates a distributed algorithm for the parallelization of the similarity search utilizing the Microsoft .NET Remoting framework. Third, utilizing key indexing terms obtained during the similarity search, further data mining is conducted using data within a remote MySQL database to obtain the actual functional annotations. Lastly, graphical representations of the attained annotations are presented in functional biologic pathways, directed acyclic graphs, and grid formatted tables that allow for curation and analysis. Quantitative assessments of the improvements in these high-throughput steps of the annotation process were performed by comparing similar automated methods as well as manual methods to the methods developed as part of HTGOFAT. Results: We have developed a standalone, unified application to address the computational requirements of the functional annotation process. Improvements in each of the key high-throughput steps were realized. 
Using an input file indexing method, rather than the common method of reformatting the flat files into single files, yielded a reported 900% decrease in the time to process a 1616 KB input file containing 5000 sequences. Improvements in the sequence similarity search bottleneck were realized by the implementation of a distributed algorithm. When utilizing eight workstations with 2.4 GHz processors and 1 GB of memory, the parallel BLAST of 100, 500, 1000, and 1500 sequences took 5.04, 29.50, 58.58, and 81.70 minutes, respectively. A serial BLAST on one workstation with the same configuration took 38.15, 194.68, 461.81, and 662.68 minutes, respectively. An average speedup of 7.825 was achieved, which corresponds to an efficiency of 97.81% for the eight workstation test cases. On average, the parallel BLAST took 13% of the total time compared to a serial BLAST. Computational performance analysis was performed to validate the implementation of fetching associations from the database server, where three automated methods were compared to one manual method. From the performance analysis, the optimal automated method completed in nearly half the time of the next automated method and is over 2000 times faster than performing the task manually. For comparison, manually fetching 40,000 annotations is estimated at over 166.7 hours for an expert as opposed to 303.969 seconds using the optimal automated algorithm. Finally, computational performance analysis for data acquisition methods to access the underlying databases determined the optimal implementation to present visualizations within the user interface with minimal latency. 
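The speedup and efficiency figures above follow the usual definitions (speedup = serial time / parallel time; efficiency = speedup / number of workers). The sketch below applies them to the timings quoted in the abstract. The function and variable names are illustrative, and because the abstract does not state how its average was aggregated, the per-run values computed here need not reproduce the reported 7.825.

```python
# Timings in minutes, copied from the abstract for 100, 500, 1000, and 1500 sequences.
SEQUENCE_COUNTS = [100, 500, 1000, 1500]
SERIAL_MINUTES = [38.15, 194.68, 461.81, 662.68]
PARALLEL_MINUTES = [5.04, 29.50, 58.58, 81.70]
WORKERS = 8  # eight workstations, as stated in the abstract


def speedup(serial, parallel):
    """Ratio of serial run time to parallel run time."""
    return serial / parallel


def efficiency(speedup_value, workers):
    """Fraction of ideal linear speedup achieved with the given worker count."""
    return speedup_value / workers


per_run = [speedup(s, p) for s, p in zip(SERIAL_MINUTES, PARALLEL_MINUTES)]

if __name__ == "__main__":
    for count, value in zip(SEQUENCE_COUNTS, per_run):
        print(f"{count} sequences: speedup {value:.3f}, "
              f"efficiency {efficiency(value, WORKERS):.3f}")
    # Efficiency implied by the speedup reported in the abstract.
    print(f"reported speedup 7.825 -> efficiency {efficiency(7.825, WORKERS):.4f}")
```

Dividing the reported speedup by eight workers gives 0.978125, which matches the 97.81% efficiency quoted in the abstract.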
Conclusions: This thesis describes the High Throughput Gene Ontology Functional Annotation Toolkit (HTGOFAT) that automates the functional annotation process by reducing the bottleneck associated with processing many sequences concurrently, and at the same time, allows for additional post-processing of the resultant data in order to visualize and analyze the attained functional annotations.
Item Estimating population histories using single-nucleotide polymorphisms sampled throughout genomes (2013-05) McTavish, Emily Jane Bell; Hillis, David M., 1958-
Genomic data facilitate opportunities to track complex population histories of divergence and gene flow. We used 47,506 single-nucleotide polymorphisms (SNPs) to investigate cattle population history. Cattle are descendants of two independently domesticated lineages, taurine and indicine, that diverged 200,000 or more years ago. We found that New World cattle breeds, as well as many related breeds of cattle in southern Europe, exhibit ancestry from both the taurine and indicine lineages. Although European cattle are largely descended from the taurine lineage, gene flow from African cattle (partially of indicine origin) contributed substantial genomic components to both southern European cattle breeds and their New World descendants. We extended these analyses to compare timing of admixture in several breeds of taurine-indicine hybrid origin. We developed a metric, scaled block size (SBS), that uses the unrecombined block size of introgressed regions of chromosomes to differentiate between recent and ancient admixture. By comparing test individuals to standards with known recent hybrid ancestry, we were able to differentiate individuals of recent hybrid origin from other admixed individuals using the SBS metric. We genotyped SNP loci using the bovine 50K SNP panel. 
The selection of sites to include in SNP analyses can influence inferences from the data, especially when particular populations are used to select the array of polymorphic sites. To test the impact of this bias on the inference of population genetic parameters, we used empirical and simulated data representing the three major continental groups of cattle: European, African, and Indian. We compared the inference of population histories for simulated data sets across different ascertainment conditions using F_ST and principal components analysis (PCA). Ascertainment bias that results in an over-representation of within-group polymorphism decreases estimates of F_ST between groups. Geographically biased selection of polymorphic SNPs changes the weighting of principal component axes and can bias inferences about proportions of admixture and population histories using PCA. By combining empirical and simulated data, we were able to both test methods for inferring population histories from genomic SNP data and apply these methods to practical problems.
Item Functional characteristics of genes involved in brassinosteroid signaling in cotton (Texas Tech University, 2003-12) Sun, Yan
Cotton fibers are highly elongated single celled trichomes that grow from the seed integument. Elongation of fiber cells begins almost after ovule fertilization and continues for approximately 20 days. In this study, I have shown that BR signaling is necessary for cotton fiber initiation and elongation, using in vitro cultured cotton ovule with Brassinosteroids (BR) and brassinazole (Brz, BR biosynthesis inhibitor). BRs are polyhydroxylated sterol derivatives of plant origin that are required for normal plant development. Several Arabidopsis genes that encode critical components of this pathway have been identified through genetic screening. BRM encodes a membrane-bound leucine-rich receptor-like kinase that apparently acts as the BR receptor. 
BIN2, which acts downstream of BRM in this pathway, encodes a GSK3/SHAGGY-like kinase that down-regulates BR signaling. To understand the role of cotton orthologous genes in fiber development, cotton ESTs similar to the Arabidopsis BRM and BIN2 genes were identified. These ESTs were used to clone the corresponding full-length GhBRM and GhBIN2 cDNAs. The GhBRM was cloned from a cotton cDNA library and then amplified from cotton genomic DNA. This 3561 bp gene contains no introns and encodes a protein with 1187 amino acids. Database analyses show that the GhBRM protein has all the distinct domains characteristic of BRM. Four GhBIN2 cDNAs were also cloned. They all include a coding sequence of 1146 base pairs in length and encode derived proteins of 381 amino acids. Sequence comparison with mammalian GSK3β and Drosophila GSK3/SHAGGY-like kinase showed that GhBIN2 proteins share many conserved regions with these two GSK3/SHAGGY-like kinases. Analysis of the expression patterns of the GhBRM and GhBIN2 genes using quantitative real-time PCR showed that they are expressed throughout cotton plants, including leaves, buds, hypocotyls, roots, sepals, ovules, bolls, and fibers. To identify the functions of these genes, gene constructs that express GhBRM and GhBIN2 under control of a CaMV 35S promoter were developed. Expression of the GhBRM transgene in the dwarf bri1-5 mutant Arabidopsis plants restored them to normal height. Expression analysis showed that the heights of the transgenic plants were significantly correlated with the GhBRM expression level (r = 0.97). These results strongly suggest that the GhBRM gene encodes a functional BR receptor protein. Conversely, expression of the GhBIN2 transgene in wildtype Arabidopsis plants resulted in severe stunting similar to strong BR deficient or insensitive mutants. 
Expression analysis showed that the heights of the transgenic plants were inversely correlated with the GhBIN2 expression levels (the average correlation level r = -0.90). These results indicate that the GhBIN2 genes function as negative regulators of the BR signal transduction pathway. These results confirm that the GhBRM and GhBIN2 cDNAs encode proteins that are capable of functioning in the BR signaling pathway. The BR signal transduction pathway could provide the basis for genetic modification of fiber development.
Item A functional genomics approach to map transcriptional and post-transcriptional gene regulatory networks (2009-08) Bhinge, Akshay Anant; Iyer, Vishwanath R.
It has been suggested that organismal complexity correlates with the complexity of gene regulation. Transcriptional control of gene expression is mediated by binding of regulatory proteins to cis-acting sequences on the genome. Hence, it is crucial to identify the chromosomal targets of transcription factors (TFs) to delineate transcriptional regulatory networks underlying gene expression programs. The development of ChIP-chip technology has enabled high throughput mapping of TF binding sites across the genome. However, there are many limitations to the technology including the availability of whole genome arrays for complex organisms such as human or mouse. To circumvent these limitations, we developed the Sequence Tag Analysis of Genomic Enrichment (STAGE) methodology that is based on extracting short DNA sequences or “tags” from ChIP-enriched DNA. With improvements in sequencing technologies, we applied the recently developed ChIP-Seq technique, i.e. ChIP followed by ultra high throughput sequencing, to identify binding sites for the TF E2F4 across the human genome. We identified previously uncharacterized E2F4 binding sites in intergenic regions and found that several microRNAs are potential E2F4 targets. 
Binding of TFs to their respective chromosomal targets requires access of the TF to its regulatory element, which is strongly influenced by nucleosomal remodeling. In order to understand nucleosome remodeling in response to transcriptional perturbation, we used ultra high throughput sequencing to map nucleosome positions in yeast that were subjected to heat shock or were grown normally. We generated nucleosome remodeling profiles across yeast promoters and found that specific remodeling patterns correlate with specific TFs active during the transcriptional reprogramming. Another important aspect of gene regulation operates at the post-transcriptional level. MicroRNAs (miRNAs) are ~22 nucleotide non-coding RNAs that suppress translation or mark mRNAs for degradation. MiRNAs regulate TFs and in turn can be regulated by TFs. We characterized a TF-miRNA network involving the oncofactor Myc and the miRNA miR-22 that suppresses the interferon pathway as primary fibroblasts enter a stage of rapid proliferation. We found that miR-22 suppresses the interferon pathway by inhibiting nuclear translocation of the TF NF-kappaB. Our results show how the oncogenic TF Myc cross-talks with other TF regulatory pathways via a miRNA intermediary.
Item Functional genomics of a model ecological species, Daphnia pulex (2013-12) Malcom, Jacob Wesley; Leibold, Mathew A.; Juenger, Thomas
Determining the molecular basis of heritable variation in complex, quantitative ecologically important traits will provide insight into the proximate mechanisms driving phenotypic and ecological variation, and the molecular evolutionary history of these traits. Furthermore, if the study organism is a “keystone species” whose presence or absence shapes ecological communities, then we extend our understanding of the effects of molecular variation to the level of communities. 
I examined the molecular basis of variation in 32 ecologically important traits in the freshwater pond keystone species Daphnia pulex, and identified thousands of candidate genes for which variation may affect not just Daphnia phenotypes, but the structure of communities. I extended the basic results to address two questions: what genes are associated with the offspring size-number trade-off in Daphnia; and can we identify candidate “keystone gene networks” for which variation may have a particularly strong influence on eco-evolutionary dynamics of limnetic communities? I found that different genes, with different biological functions, are associated with the trade-off in subsequent broods, and propose a model linking evolutionary frameworks to molecular biological functions. Next I found that quantitative genetic variation in keystone traits appears to co-vary with the selection regimes to which Daphnia is subject, and identified two candidate gene networks that may underpin this genetic variation. Not only do these results provide a host of molecular hypotheses to be tested as Daphnia matures as a model genomic organism, but they also suggest models that link molecular research with broader themes in ecology, evolution, and behavior.
Item Genetic Analysis of Stem Composition Variation in Sorghum Bicolor (2012-10-19) Evans, Joseph
Sorghum (Sorghum bicolor [L.] Moench) is the world's fifth most economically important cereal crop, grown worldwide as a source of food for both humans and livestock. Sorghum is a C4 grass that is well adapted to hot and arid climes and is popular for cultivation on lands of marginal quality. Recent interest in development of biofuels from lignocellulosic biomass has drawn attention to sorghum, which can be cultivated in areas not suitable for more traditional crops, and is capable of generating plant biomass in excess of 40 tons per acre.
While the quantity of biomass and low water consumption make sorghum a viable candidate for biofuels growth, the biomass composition is enriched in lignin, which is problematic for enzymatic and chemical conversion techniques. The genetic basis for stem composition was analyzed in sorghum populations using a combination of genetic, genomic, and bioinformatics techniques. Utilizing acetyl bromide extraction, the variation in stem lignin content was quantified across several sorghum cultivars, confirming that lignin content varied considerably among sorghum cultivars. Previous work identifying sorghum reduced-lignin lines has involved the monolignol biosynthetic pathway; all steps in the pathway were putatively identified in the sorghum genome using sequence analysis. A bioinformatics toolkit was constructed to allow for the development of genetic markers in sorghum populations, and a database and web portal were generated to allow users to access previously developed genetic markers. Recombinant inbred lines were analyzed for stem composition using near infrared reflectance spectroscopy (NIR) and genetic maps constructed using restriction site-linked polymorphisms, revealing 34 quantitative trait loci (QTL) for stem composition variation in a BTx642 x RTx7000 population, and six QTL for stem composition variation in an SC56 x RTx7000 population. Sequencing the genome of BTx642 and RTx7000 to a depth of ~11x using Illumina sequencing revealed approximately 1.4 million single nucleotide polymorphisms (SNPs) and 1 million SNPs, respectively. These polymorphisms can be used to identify putative amino acid changes in genes within these genotypes, and can also be used for fine mapping.
Plotting the density of these SNPs revealed patterns of genetic inheritance from shared ancestral lines both between the newly sequenced genotypes and relative to the reference genotype BTx623.
Item Genome-wide approaches to explore transcriptional regulation in eukaryotes (2014-05) Park, Daechan; Iyer, Vishwanath R.; Marcotte, Edward M; Paull, Tanya T; Miller, Kyle M; Stevens, Scott W
Transcriptional regulation is a complicated process controlled by numerous factors such as transcription factors (TFs), chromatin remodeling enzymes, nucleosomes, post-transcriptional machineries, and cis-acting DNA sequence. I explored the complex transcriptional regulation in eukaryotes through three distinct studies to comprehensively understand the functional genomics at various steps. Although a variety of high throughput approaches have been developed to understand this complex system on a genome wide scale with high resolution, a lack of accurate and comprehensive annotation of transcription start sites (TSS) and polyadenylation sites (PAS) has hindered precise analyses even in Saccharomyces cerevisiae, one of the simplest eukaryotes. We developed Simultaneous Mapping Of RNA Ends by sequencing (SMORE-seq) and identified the strongest TSS and PAS of over 90% of yeast genes with single nucleotide resolution. Owing to the high accuracy of TSS identified by SMORE-seq, we detected 150 possibly mis-annotated genes that have a TSS downstream of the annotated start codon. Furthermore, SMORE-seq showed that 5’-capped non-coding RNAs were highly transcribed divergently from TATA-less promoters in wild-type cells under normal conditions. Mapping of DNA-protein interactions is essential to understanding the role of TFs in transcriptional regulation. ChIP-seq is the most widely used method for this purpose.
However, careful attention has not been given to technical bias reflected in final target calling due to many experimental steps of ChIP-seq including fixation and shearing of chromatin, immunoprecipitation, sequencing library construction, and computational analysis. While analyzing large-scale ChIP-seq data, we observed that unrelated proteins appeared to bind to the gene bodies of highly transcribed genes across datasets. Control experiments including input, IgG ChIP in untagged cells, and the Golgi factor Mnn10 ChIP also showed strong binding at the same loci, indicating that the signals were obviously derived from bias that is devoid of biological meaning. In addition, the appearance of nucleosomal periodicity in ChIP-seq data for proteins localizing to gene bodies is another bias that can be mistaken for interactions with nucleosomes. We alleviated these biases by correcting data with proper negative controls, but the biases could not be completely removed. Therefore, caution is warranted in interpreting the results from ChIP-seq. Nucleosome positioning is another critical mechanism of transcriptional regulation. Global mapping of nucleosome occupancy in S. cerevisiae strains deleted for chromatin remodeling complexes has elucidated the role of these complexes on a genome wide scale. In this study, loss of chromodomain helicase DNA binding protein 1 (Chd1) resulted in severe disorganization of nucleosome positioning. Despite the difficulties of performing ChIP-seq for chromatin remodeling complexes due to their transient and dynamic localization on chromatin, we successfully mapped the genome-wide occupancy of Chd1 and quantitatively showed that Chd1 co-localizes with early transcription elongation factors, but not late transcription elongation factors.
Interestingly, Chd1 occupancy was independent of the methylation levels at H3K36, indicating the necessity of a new working model describing Chd1 localization.
Item Genomic Insights into Sexual Selection and the Evolution of Reproductive Genes in Teleost Fishes (2012-10-19) Small, Clayton
Sexual selection has long been a working explanation for the elaboration of appreciable traits in plants and animals, but the idea that it is an equally potent agent of change at the level of individual molecules is relatively recent. Indications that genes associated with reproductive biology evolve especially rapidly planted this notion, but many details about the genomics of sex remain elusive. Numerous studies have characterized rapid sequence and expression divergence of sex-related molecules, but few if any have demonstrated convincingly that these patterns exist as a result of sexual selection. This dissertation describes several genome-scale studies related to reproduction and the sexes in teleost fishes, a group of animals underexploited in regard to this topic. Using commercial microarrays I measured the extent of sexually dimorphic gene expression in the zebrafish, Danio rerio. Sex-biased patterns of gene expression in this species are similar to those described in other animals. A number of genes expressed at high levels in ovaries and testes relative to the body were identified as a product of the study, and these data may be useful for future studies of reproductive genes in Danio fishes. In a second study, the recent advent of high throughput cDNA pyrosequencing was leveraged to characterize the relationships between tissue-, sex-, and species-specific expression patterns of genes and rates of sequence evolution in swordtail fishes (Xiphophorus). I discovered ample evidence for expression biases of all three types, and a generally positive but idiosyncratic relationship between the magnitude of expression bias and rates of protein-coding sequence evolution.
Pyrosequencing of cDNA was also used to explore the possibility that postcopulatory sexual selection drives the rapid evolution of male pregnancy genes, a novel class of reproductive molecules unique to syngnathid fishes (seahorses and pipefishes). Genes differentially expressed in the male brooding tissues as a function of pregnancy status evolve more rapidly at the amino acid level than genes exhibiting static expression. Brooding tissue genes expressed during male pregnancy have evolved especially rapidly in polyandrous lineages, a finding that supports the hypothesized relationship between postcopulatory sexual selection and the adaptive evolution of reproductive molecules.
Item Genomics analysis on the responses of E. coli cells to varying environmental conditions (2016-05) Yan, Xiwei; Wilke, C. (Claus); Lin, Lizhen
The natural living environments of E. coli cells are diverse, varying from mammalian gastrointestinal tracts to soil. Each environment might require distinct metabolic pathways and transporter systems, and long-term evolution has established elaborate regulatory systems for E. coli cells to quickly adapt to the changing conditions. Sensing outside stresses and then adopting a different phenotype enables them to take advantage of any possible nutrients and defend against hostile environments. Many regulatory mechanisms have been identified by genetic, biochemical and molecular biology methods, and our study aims to build a systematic view of the response of the whole genome to four different environmental conditions. We used statistical tests including Pearson’s tests and Spearman’s tests and multiple testing adjustments to identify feature genes that are induced or repressed significantly across treatment levels. The feature genes identified were partially supported by the previous literature, and some of the novel genes not found in any previous studies may point to a potential research blind spot.
Additionally, we compared the correlation tests to the implementation of machine learning algorithms, and discussed the advantages and drawbacks of each method.
Item Plastid genome rearrangement, gene loss, and sequence divergence in geraniaceae, passifloraceae, and annonaceae. (2013-12) Blazier, John Christensen; Jansen, Robert K., 1954-
Plastid genomes of flowering plants are largely identical in gene order and content, but a few lineages have been identified with many gene and intron losses, genomic rearrangements, and accelerated rates of nucleotide substitutions. These aberrant lineages present an opportunity to understand the modes of selection acting on these genomes as well as their long-term stability. My research has focused on two areas within plastid genome evolution in Geraniaceae: first, an investigation of the diversity of unusual plastid genomes in a single genus, Erodium (Geraniaceae) for chapters one and three. Chapter two focuses on the evolution of subunits of the plastid-encoded RNA polymerase (PEP). The first chapter described the loss of plastid-encoded NADPH dehydrogenase (ndh) genes from a clade of 13 Erodium species. Divergence time estimates indicate this clade is less than 5 million years old. This recent loss of ndh genes in Erodium presents an opportunity to investigate changes in photosynthetic function through comparative biochemistry between Erodium species with and without plastid-encoded ndh genes. Second, I examined the evolution of the gene encoding the alpha subunit (rpoA) of PEP in three disparate angiosperm lineages—Pelargonium (Geraniaceae), Passiflora (Passifloraceae), and Annonaceae—in which this gene has diverged so greatly that it is barely recognizable. PEP is conserved in the plastid genomes of all photosynthetic angiosperms. I found multiple lines of evidence indicating that the genes remain functional despite retaining only ~30% sequence identity with rpoA genes from outgroups.
The genomes containing these divergent rpoA genes have undergone significant rearrangement due to illegitimate recombination and gene conversion, and I hypothesized that these phenomena have also driven the divergence of rpoA. Third, I conducted a survey of plastid genome evolution in Erodium with the completion of 15 additional whole genomes. Except for Erodium and some legumes, all angiosperm plastid genomes share a quadripartite structure with large and small single copy regions (LSC, SSC) and two inverted repeats (IR). I discovered a species of Erodium that has re-formed a large inverted repeat. Demonstrating a precedent for loss and regain of the IR also impacts models of evolution for other highly rearranged plastid genomes.
Item Regulation of Brain-Derived Neurotrophic Factor in the Adult Mouse Brain (2005-08-11) Malkovska, Irena; Parada, Luis
In the adult central nervous system (CNS) brain-derived neurotrophic factor (BDNF) has been implicated in neuroprotection and synaptic plasticity among other functions. However, relatively little is known of its regulation. In this thesis, we attempted to learn more about BDNF regulation by means of: an in situ hybridization study of the four distinct untranslated exons in the adult mouse brain; use of transgenic animals to define BDNF promoter regions; and use of comparative genomics to identify evolutionarily conserved regions of BDNF. The in situ hybridization study suggests that the four distinct BDNF promoters are differentially regulated and that neighboring promoters are coregulated. Also, it appears that all four promoters function in most of the same nuclei of the adult CNS. In spite of the large size of the transgenic constructs used in this study specific to exons 1/2 and 3/4 (11.4 kb and 16 kb respectively), they were insufficient to mediate endogenous-like BDNF expression in the adult CNS.
However, this study suggests that these regions may drive endogenous-like expression in a subset of nuclei (random chance integration cannot however be ruled out). The bioinformatics study revealed 9 highly conserved elements that are good candidates for cis-regulatory elements of BDNF. In conclusion, the regulation of the BDNF gene appears far more complicated than was previously predicted.
Item Statistical Methods for High Dimensional Biomedical Data (2013-03-27) Ball, Robyn Lynn
This dissertation consists of four different topics in the areas of proteomics, genomics, and cardiology. First, a data-based method was developed to assign the subcellular localization of proteins. We applied the method to data on the bacterium Rhodobacter sphaeroides 2.4.1 and compared the results to PSORTb v.3.0. We found that the method compares well to PSORTb and a simulation study revealed that the method is sound and produces accurate results. Next, we investigated genomic features involved in the lethality of the knockout mouse using the random forest technique. We achieved an accuracy rate of 0.725 and found that among other features, the evolutionary age of the gene was a good predictor of lethality. Third, we analyzed DNA breakpoints across eight different cancer types to determine if common hotspots or cancer-type specific hotspots can be well-predicted by various genomic features and investigated which of the genomic features best predict the number of breakpoints. Using the random forest technique, we found that cancer-type specific hotspots are poorly predicted by genomic features but common hotspots can be predicted using the relevant genomic features. Additionally, we found that among the genomic features analyzed, indel rate and substitution rate were consistently chosen as the top predictors of breakpoint frequency. Lastly, we developed a method to predict the hypothetical heart age of a subject based on the subject's electrocardiogram (ECG).
The heart age predictions are consistent with current ECG science and knowledge of cardiac health.
Item The Bootstrap in Supervised Learning and its Applications in Genomics/Proteomics (2012-07-16) Vu, Thang
The small-sample size issue is a prevalent problem in Genomics and Proteomics today. Bootstrap, a resampling method which aims at increasing the efficiency of data usage, is considered to be an effort to overcome the problem of limited sample size. This dissertation studies the application of bootstrap to two problems of supervised learning with small sample data: estimation of the misclassification error of Gaussian discriminant analysis, and the bagging ensemble classification method. Estimating the misclassification error of discriminant analysis is a classical problem in pattern recognition and has many important applications in biomedical research. Bootstrap error estimation has been shown empirically to be one of the best estimation methods in terms of root mean squared error. In the first part of this work, we conduct a detailed analytical study of bootstrap error estimation for the Linear Discriminant Analysis (LDA) classification rule under Gaussian populations. We derive the exact formulas of the first and the second moment of the zero bootstrap and the convex bootstrap estimators, as well as their cross moments with the resubstitution estimator and the true error. Based on these results, we obtain the exact formulas of the bias, the variance, and the root mean squared error of the deviation from the true error of these bootstrap estimators. This includes the moments of the popular .632 bootstrap estimator. Moreover, we obtain the optimal weight for unbiased and minimum-RMS convex bootstrap estimators. In the univariate case, all the expressions involve Gaussian distributions, whereas in the multivariate case, the results are written in terms of bivariate doubly non-central F distributions.
In the second part of this work, we conduct an extensive empirical investigation of bagging, which is an application of bootstrap to ensemble classification. We investigate the performance of bagging in the classification of small-sample gene-expression data and protein-abundance mass spectrometry data, as well as the accuracy of small-sample error estimation with this ensemble classification rule. We observed that, under t-test and RELIEF filter-based feature selection, bagging generally does a good job of improving the performance of unstable, overfitting classifiers, such as CART decision trees and neural networks, but that improvement was not sufficient to beat the performance of single stable, non-overfitting classifiers, such as diagonal and plain linear discriminant analysis, or 3-nearest neighbors. Furthermore, the ensemble method did not improve the performance of these stable classifiers significantly. We give an explicit definition of the out-of-bag estimator that is intended to remove estimator bias, by formulating carefully how the error count is normalized, and investigate the performance of error estimation for bagging of common classification rules, including LDA, 3NN, and CART, applied on both synthetic and real patient data, corresponding to the use of common error estimators such as resubstitution, leave-one-out, cross-validation, basic bootstrap, bootstrap 632, bootstrap 632 plus, bolstering, semi-bolstering, in addition to the out-of-bag estimator. The results from the numerical experiments indicated that the performance of the out-of-bag estimator is very similar to that of leave-one-out; in particular, the out-of-bag estimator is slightly pessimistically biased. The performance of the other estimators is consistent with their performance with the corresponding single classifiers, as reported in other studies.
The results of this work are expected to provide helpful guidance to practitioners who are interested in applying the bootstrap in supervised learning applications.