Introduction to Genomics and Bioinformatics

Maureen J. Donlin • Departments of Molecular Microbiology & Immunology and Biochemistry & Molecular Biology • donlinmj@slu.edu • 6/3/2014

Goals for the course • Find and use publicly available datasets and tools for genomics and bioinformatics • Use these tools and datasets in your research • Interpret the output from various analysis and prediction programs • Learn to write a results section for a manuscript

Exercise format • Each exercise will consist of 2-4 sections which represent a biological question to be answered with bioinformatics tools/resources • You’ll provide the answer in the same format as you would write for the results section of a paper • Why did you do this experiment or analysis? • What did you actually do? • What did you observe? • What does it mean?

Grading • Exercises 70% • Final exam 20% • Class attendance 10% • Grading policy handout • Details about late assignments and tests

Logistics • Course website: • http://biochem.slu.edu/bchm628/ • Contact: • Phone: 977-8858 • Email: donlinmj@slu.edu • Office – DRC 507 • Call or email. • Usually at WashU on Wednesdays

Lecture outline • Overview of theme for this course • Large datasets = long lists of genes • How to interrogate gene lists using publicly available data • Introduction to sequence databases • Quality control and annotation

Host-pathogen interactions in model organisms • Caenorhabditis elegans will be the model organism • Various bacteria (S. aureus) and fungi (C. albicans) will be the pathogens • Examine data from microarrays, RNA sequencing and proteomic studies • Use various public databases and tools to interrogate and analyze the data

Aspects of host-pathogen interactions • Pathogen virulence factors • High-throughput expression analysis of pathogens during infection • Genetic differences between closely related species that differ in their ability to infect & kill C. elegans • Host innate immune response • High-throughput expression analysis of host during infection • Comparison of host response to different pathogens • Factors that mediate infection • Screen for pathogen & host factors that affect virulence and susceptibility to infection

Types of worm killing • Disease Models & Mech. (2008) 1:205 • Curr. Opin. Microbiol. (2008) 11:251 • App. & Env. Microbiol. (2012) 78:2075

Dataset 1: Response to fungal infection • “Candida albicans Infection of Caenorhabditis elegans Induces Antifungal Immune Defenses” Pukkila-Worley R., Ausubel FM and Mylonakis E (2011) PLoS Pathogens 7:e1002074 PMID: 21731485 • Study innate immune response to C. albicans in a model host • Live yeast establish intestinal infection but heat-killed yeast are avirulent • Identified 313 genes differentially expressed (DE or DEG) with infection by C. albicans • 56% of those genes were also DE with heat-killed yeast • Not much overlap with genes DE in response to S. aureus or P. aeruginosa
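The overlap figures on this slide come down to comparing gene lists as sets. A minimal sketch of that comparison follows; the gene identifiers are hypothetical placeholders, not genes from the paper, and the function name `overlap_summary` is an assumption made for illustration.

```python
def overlap_summary(list_a, list_b):
    """Return (shared genes, percent of list_a that also appears in list_b)."""
    set_a, set_b = set(list_a), set(list_b)
    shared = set_a & set_b
    percent = 100.0 * len(shared) / len(set_a) if set_a else 0.0
    return shared, percent


# Hypothetical placeholder identifiers, not genes from the paper.
live_yeast_de = ["gene-1", "gene-2", "gene-3", "gene-4"]
heat_killed_de = ["gene-2", "gene-4", "gene-9"]

shared, percent = overlap_summary(live_yeast_de, heat_killed_de)
print(sorted(shared), f"{percent:.0f}% of the live-yeast list")
```

Note that the percentage depends on which list is the denominator. On the slide, the 56% is relative to the 313 genes DE with live C. albicans.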

Starting point for Exercise 1 • Supplementary table 3, which lists the >300 genes DE in response to C. albicans and also gives the overlap with heat-killed C. albicans • Goals are to use NCBI to find information about a few genes from the list • Use Excel to bring additional data into your list of genes
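The exercise does this step in Excel. For readers who prefer a script, the same idea (look up each gene in a second table and copy over its extra columns) can be sketched in Python. Everything below is an assumption made for illustration: the column names, the gene identifiers and the descriptions are hypothetical and are not taken from Supplementary table 3.

```python
import csv
import io

# Hypothetical stand-ins for two spreadsheet tabs exported as CSV.
gene_list_csv = """gene_id,fold_change
gene-1,4.2
gene-2,-3.1
gene-3,2.5
"""
annotation_csv = """gene_id,description
gene-1,hypothetical description A
gene-3,hypothetical description B
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def left_join(rows, lookup_rows, key, missing=""):
    """Add the lookup columns to each row, matching on `key`.

    Rows without a match are kept and filled with `missing`,
    much like a lookup formula that finds no hit.
    """
    lookup = {r[key]: r for r in lookup_rows}
    extra_cols = [c for c in (lookup_rows[0].keys() if lookup_rows else []) if c != key]
    joined = []
    for row in rows:
        match = lookup.get(row[key], {})
        merged_row = dict(row)
        for col in extra_cols:
            merged_row[col] = match.get(col, missing)
        joined.append(merged_row)
    return joined


merged = left_join(read_rows(gene_list_csv), read_rows(annotation_csv), "gene_id")
for row in merged:
    print(row)
```

Keeping unmatched genes in the output, rather than dropping them, makes it easy to see which entries still lack annotation.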

Biological Databases • DNA -> RNA -> Protein • DNA archives – genomes, ESTs (Genbank/EMBL) • Annotated mRNAs/Genes (Gene) • RNA (miRNAs, snoRNA, structures) • Protein databases • Automated translation (GenPept/TrEMBL) • Curated (Uniprot) • Structures (PDB)

Biological databases • Store data in a form that allows users to search and retrieve • Use defined relationships between data to allow finding related records • Genome linked to genes • Genes linked to transcript isoforms • Each transcript linked to encoded protein • Genbank records include all cross-database records as active links
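The linked relationships listed on this slide can be pictured as nested records. The sketch below is only a toy model under stated assumptions: the class names, field names and accessions are hypothetical and do not reflect the actual schema of Genbank or any other database.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Protein:
    accession: str


@dataclass
class Transcript:
    accession: str
    protein: Optional[Protein] = None  # non-coding transcripts encode no protein


@dataclass
class Gene:
    gene_id: str
    transcripts: List[Transcript] = field(default_factory=list)


@dataclass
class Genome:
    assembly: str
    genes: List[Gene] = field(default_factory=list)


def proteins_for_gene(genome, gene_id):
    """Follow genome -> gene -> transcript isoforms -> encoded proteins."""
    for gene in genome.genes:
        if gene.gene_id == gene_id:
            return [t.protein.accession for t in gene.transcripts if t.protein]
    return []


# Hypothetical accessions, for illustration only.
toy_genome = Genome(
    assembly="toy-assembly",
    genes=[
        Gene(
            gene_id="gene-1",
            transcripts=[
                Transcript("tx-1a", Protein("prot-1a")),
                Transcript("tx-1b", Protein("prot-1b")),
                Transcript("tx-1c"),  # no encoded protein
            ],
        )
    ],
)

print(proteins_for_gene(toy_genome, "gene-1"))
```

Finding related records then amounts to following these links from one record type to the next.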

Quality control and annotation • Genbank – an archive • Submitters retain exclusive rights to update their own records • All submissions reviewed/approved by NCBI

Growth of Genbank

NCBI

Gene annotation • Assign or define: • Gene name • Gene structure • Molecular function • Biological process • Cellular component • Etc. • Ideally, this data is known experimentally • Curate: pull this data from the literature

Annotation • Time-consuming and costly • Not keeping pace with rate of genome sequencing • 2008: 2nd assembly of C. neoformans type A • 2013: Only 1st assembly in Genbank • 2014: Refined gene models using NGS data • Organism-specific databases often have better annotation • Curated databases aimed at a particular field • EuPath (Eukaryotic pathogens)

Gene database • Gene – derived database • Curators at NCBI review submissions/literature and create annotated records of every gene and gene product for a subset of organisms • Currently: • ~244 million sequence records in Genbank • ~16 million records in the Gene database
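Besides the web interface, the Gene database can be searched through NCBI's E-utilities. The sketch below only builds a search URL and does not contact NCBI. The gene symbol `abf-2` is a placeholder chosen for illustration, not asserted to be in the exercise's gene list, and the helper name `gene_search_url` is an assumption.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_search_url(symbol, organism):
    """Build an ESearch URL that looks up a gene symbol in the Gene database."""
    term = f"{symbol}[Gene Name] AND {organism}[Organism]"
    params = {"db": "gene", "term": term, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Placeholder symbol, used only to show the shape of the query.
url = gene_search_url("abf-2", "Caenorhabditis elegans")
print(url)
```

Opening the printed URL in a browser should return the matching Gene record IDs, which can then be viewed on the NCBI site.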

Curation & annotation of all known proteins • Provide […] comprehensive, high-quality and freely accessible resource of protein sequence and functional information. • www.uniprot.org

Uniprot databases • 545,000 reviewed (UniprotKB/Swiss-Prot) • ~56 million not yet reviewed (UniprotKB/TrEMBL)
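A quick calculation with the two counts on this slide shows how small the reviewed portion is. Both numbers are the approximate figures given above and will have changed since.

```python
reviewed = 545_000        # UniprotKB/Swiss-Prot, from the slide
unreviewed = 56_000_000   # UniprotKB/TrEMBL, approximate, from the slide

# Share of all UniprotKB entries that have been manually reviewed.
reviewed_fraction = reviewed / (reviewed + unreviewed)
print(f"Reviewed share of UniprotKB: {reviewed_fraction:.1%}")
```

By these figures, roughly one entry in a hundred has been manually reviewed, so most protein records you meet will carry automated annotation only.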

Other databases • Genome databases (Thursday's topic) • Organism-specific (Yeast, Drosophila, C. elegans, etc.) • Expression patterns • Protein domains • Metabolic pathways • … • NAR Database issue: 1st issue of every year • See handout • http://nar.oxfordjournals.org/content/42/D1.toc
