Course Expectations High-throughput sequencing technology Very large datasets

Learn how next-generation sequencing technologies are used in biomedical research, analyze gene lists, and write a results section. Course website: http://biochem.slu.edu/bchm628/

  1. 6/13/19 Course Expectations High-throughput sequencing technology Very large datasets

  2. Goals for the course • Understand how next-generation sequencing technologies are used in biomedical research • Learn how to use publicly available databases/websites to find specific information about genes • Learn how to analyze gene lists to form hypotheses that can be tested experimentally • Learn to write a results section for a manuscript

  3. Logistics • Course website: • http://biochem.slu.edu/bchm628/ • Contact: • Phone: 977-8858 • Email: Maureen.donlin@health.slu.edu • Office – DRC 503 • Call or email. • Usually at WashU on Thursday afternoons • Lab – DRC 551

  4. Grading • Grading: • Exercises 65 % • Final exam 25 % • Class attendance 10 % • Grading policy handout • Details about late assignment and tests • Computer exercise handout • Example computer exercise
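
As a worked example of the weighting above, the following minimal sketch combines the three components into one course percentage. The weights come from the slide; the function name and the example scores are hypothetical and only illustrate the arithmetic.

```python
# Weights taken from the Grading slide; everything else here is illustrative.
WEIGHTS = {"exercises": 0.65, "final_exam": 0.25, "attendance": 0.10}


def course_grade(scores):
    """Return the weighted course percentage from component scores (0-100)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student: 90% on exercises, 80% on the final, full attendance.
example_scores = {"exercises": 90.0, "final_exam": 80.0, "attendance": 100.0}
print(round(course_grade(example_scores), 2))
```

Because the exercises carry 65% of the grade, a ten-point change in the exercise average moves the course grade by 6.5 points, compared with 2.5 points for the same change on the final exam.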

  5. Exercise format • There will be 7 exercises, each consisting of 2-4 sections which represent a biological question to be answered with bioinformatics tools/resources from that week or earlier weeks. • You’ll provide the answer in the same format as you would write for the results section of a paper • Why did you do this experiment or analysis? • What did you actually do? • What did you observe? • What does it mean? • Include supporting data • Figures with figure legends • Correctly formatted tables of data.

  6. Exercise due dates • Exercises handed out on Thursday are due the following Wednesday at 4:00 pm • Exercises handed out on Tuesday are due on Friday of the same week at 4:00 pm • All exercises are to be sent by email • There is a penalty for turning in your exercises after the deadline. The timestamp on your email is the final determination of whether an exercise is on time or not.
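
The due-date rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated on the slide: the function names are hypothetical, and treating an email stamped at exactly 4:00 pm as on time is an assumption, not a stated policy.

```python
from datetime import date, datetime, time, timedelta

DUE_TIME = time(16, 0)  # 4:00 pm, from the slide


def exercise_due(handed_out: date) -> datetime:
    """Return the deadline for an exercise handed out on a Tuesday or Thursday."""
    weekday = handed_out.weekday()  # Monday is 0, Sunday is 6
    if weekday == 3:  # Thursday: due the following Wednesday
        offset = 6
    elif weekday == 1:  # Tuesday: due Friday of the same week
        offset = 3
    else:
        raise ValueError("exercises are handed out on Tuesdays or Thursdays")
    return datetime.combine(handed_out + timedelta(days=offset), DUE_TIME)


def is_on_time(email_timestamp: datetime, handed_out: date) -> bool:
    """Assumption: a timestamp exactly at the deadline still counts as on time."""
    return email_timestamp <= exercise_due(handed_out)


# The first lecture was on Thursday 6/13/19.
print(exercise_due(date(2019, 6, 13)))
```

For the first lecture date this gives Wednesday, June 19th at 4:00 pm, which matches the Exercise 1 deadline announced for the computer lab.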

  7. Exercises, cont • Exercise in Word or PDF format • Supplemental data in Excel, Word or PDF format. • The exercise should print in portrait orientation. • The exercise should include a header with your name at the top. • All files should have a name that includes your last name: • Your Name-Ex # or Name_SuppData#

  8. Final project • This will be a project summary of the analyses that you will have done in exercises 1-7. • You will be asked to choose 3 genes from your gene lists that you would follow-up on at the bench. • You will be asked to give a rationale for making the choices that you did. • You will analyze the three genes virtually using some of the tools from the exercises • You will be asked to propose hypothetical bench experiments for the genes • Final project will be due July 18th at 4:00 pm.

  9. A few tips on data presentation

  10. Data tables • Table 1: Gene expression for WT cells under conditions X, Y, Z. • Table 2: Comparison of clinical parameters for groups 1 and 2. • 1 Statistical significance was determined by a Mann-Whitney test. • 2 Statistical significance was determined by 2-tailed t-test. • Columns describe attributes. • Rows contain the individual data. • The first row contains a header. • If you have lots of data, it is generally formatted to have more rows than columns.

  11. Data tables, cont • For the purposes of this class, the tables should be formatted to fit onto a letter size page in portrait orientation. • If your table is so wide that it forces the page into landscape orientation, then it should be included as a supplemental attachment to the exercise. If the table extends past 1 page, then include it as a supplemental attachment. • Refer to supplemental tables in your write-up and number them, naming the files YourName_SuppTable1, etc. • Supplemental tables can be in Excel format.

  12. Figures • Export the figure from whatever program in jpeg or png format; those can be inserted into a Word document easily. • PDFs can be converted to other formats using Illustrator • There are some online converters • http://www.wikihow.com/Convert-PDF-to-JPEG • Screen capture and placement may also work. • Talk to me if you have issues. • Super high resolution is not necessary.

  13. Figures, cont. • Figures should have figure legends. The figure legends should describe the experiment that led to the data in the figure and include an explanation for any symbols used. • Figures should be numbered consecutively and should not take up more than ¼ of the page. If larger than that, include as supplemental data. • Create a text box in Word, write the figure legend and then insert the figure above the figure legend. This will allow you to resize as necessary. • Again, talk to me if you have issues.

  14. Remainder of this lecture • Overview of sequencing a genome • Next generation sequencing • High-throughput experiments by sequencing • Biomolecular databases

  15. Genome sequencing Approach depends on the source, size, complexity and goals for a given organism • Goal? • De novo sequencing • Re-sequencing for annotation • Sequencing to identify variations • Size and complexity • Virus, bacterial, single-celled eukaryote, mammal, plant • Quasi-species or repetitive sequences • Sample prep • Can it be cultured? • Tissue source: unlimited or limited quantities? • Virus levels, RNA or DNA

  16. Genome sizes • Homo sapiens: 3 Bbp • Hepatitis C virus: 10,000 bp • Arabidopsis thaliana: 135 Mbp • Saccharomyces cerevisiae: 12 Mbp • Axolotl: 32 Bbp
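
To get a feel for what these sizes mean for a sequencing run, the sketch below estimates how many reads are needed to reach a given average coverage, using the standard relation coverage = (number of reads × read length) / genome size. The genome sizes are the ones listed above and the 150 bp read length is the Illumina figure quoted later in the lecture; the 30x coverage target and the function name are assumptions chosen for illustration.

```python
# Genome sizes in base pairs, as listed on the slide.
GENOME_SIZES_BP = {
    "Hepatitis C virus": 10_000,
    "Saccharomyces cerevisiae": 12_000_000,
    "Arabidopsis thaliana": 135_000_000,
    "Homo sapiens": 3_000_000_000,
    "Axolotl": 32_000_000_000,
}


def reads_needed(genome_bp: int, read_length_bp: int, coverage: int) -> int:
    """Number of reads so that reads * read_length / genome_bp >= coverage."""
    total_bases = genome_bp * coverage
    # Integer ceiling division, so a partial read counts as a whole read.
    return (total_bases + read_length_bp - 1) // read_length_bp


READ_LENGTH_BP = 150  # Illumina read length quoted in the lecture
TARGET_COVERAGE = 30  # assumed target, not from the slides

for organism, size in GENOME_SIZES_BP.items():
    print(f"{organism}: {reads_needed(size, READ_LENGTH_BP, TARGET_COVERAGE):,} reads")
```

The estimate ignores uneven coverage, duplicate reads and repetitive regions, so it should be read as a lower bound on the amount of sequencing required.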

  17. Types of sequencing • Throughput • Accuracy • Read-length • Cost • Library prep

  18. Sanger sequencing technology 1970s thru 1980s: >SEQ ATAGCCGTACTTAGCTGAGGAGTCGATAAC 1990s to today: • Long read lengths (500-900 bp) & >99.9% correct • Need to clone or PCR amplify the DNA to obtain enough for sequencing reaction, no library preparation • Very high accuracy, relatively long reads, very low throughput
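
The `>SEQ` line above is a FASTA-style record: a header line starting with `>` followed by the sequence. A minimal sketch of reading such a record and computing two simple properties is shown below. The parser and function names are hypothetical and written only for illustration; in practice an established library would normally be used instead.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after '>'.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


def gc_fraction(sequence):
    """Fraction of bases in the sequence that are G or C."""
    return sum(base in "GC" for base in sequence) / len(sequence)


# The record shown on the slide, with the header on its own line.
slide_record = ">SEQ\nATAGCCGTACTTAGCTGAGGAGTCGATAAC\n"
sequences = parse_fasta(slide_record)
print(len(sequences["SEQ"]), round(gc_fraction(sequences["SEQ"]), 3))
```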

  19. Illumina NGS • Read lengths up to 150 bp • Need to make bar coded libraries, which can be technically challenging • Longer run time • Very high throughput: up to 1 TB of data per run • High accuracy • Very high throughput • Short reads

  20. PacBio • Single molecule sequencing • Very long read lengths (up to 10 Kb) • High error rate, but stochastic and can be dealt with by multiple passes • No cloning • Very long read lengths, low-medium accuracy, medium throughput
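
The reason multiple passes help is that stochastic errors rarely hit the same position twice, so a per-position majority vote recovers the true base. The toy sketch below illustrates that idea only. It assumes the passes are already aligned, have equal length and contain substitution errors only, whereas real reads also contain insertions and deletions and need a proper alignment step. The function name and example reads are made up.

```python
from collections import Counter


def majority_consensus(passes):
    """Per-position majority vote over equal-length, pre-aligned passes."""
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(read) != length for read in passes):
        raise ValueError("this toy example requires equal-length passes")
    # zip(*passes) yields one column of bases per position; ties go to the
    # base seen first in that column.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


true_sequence = "ATAGCCGTAC"
# Three made-up passes over the same molecule, each with one random error.
noisy_passes = ["ATTGCCGTAC", "ATAGCCGAAC", "ATAGGCGTAC"]
print(majority_consensus(noisy_passes) == true_sequence)
```

Each individual pass here is only 90% correct, yet the consensus matches the true sequence because no two passes share an error at the same position.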

  21. Ion Torrent: semi-conductor sequencing

  22. Illumina vs. Ion Torrent • Illumina has greater capacity but longer run times • Ion Torrent has longer read lengths (~200 bp) • Library prep similar to Illumina in complexity • SLU has an Ion Torrent machine • Cost is ~$250/sample, including the sequencing • Get strand specific sequencing without additional library prep

  23. Oxford nanopore sequencing • A protein nanopore is set in an electrically resistant polymer membrane. An ionic current is passed through it to generate a charge. As analyte passes through the pore, it creates a characteristic disruption in current, which differs depending on the base. • MinION flow cell: attaches directly to computer for data analysis • Long read lengths • High error rate • No cloning and direct RNA sequencing • Medium to high throughput • Cheapest of the current technologies with simplest library prep

  24. Bioinformatics challenges • Each flow cell in the Illumina HiSeq 2500 can generate a billion bases of sequence • Raw read files are Tb in size • Processed read files are 700-800 Mb each • Alignment files 150-300 Mb • Assembly of millions of short (75-100 bp) reads into vertebrate genome • Need high-performance compute (HPC) cluster for vertebrate sized genomes* • What biomolecular species to interrogate? • 25,000 genes • 160,000 transcripts • miRNA, non-coding RNA

  25. Sequencing has become a standard technique • RNA sequencing for expression • ChIP sequencing for TF site identification • DNA sequencing for variants • Identification of populations/genetic changes in highly variable viruses and bacteria • Single cell RNA sequencing (Rich DiPaolo) • Metagenomics • Identification of unknown/non-culturable communities of bacteria/viruses/fungi

  26. Which technology? • De novo genome sequencing and assembly: • Combination of Illumina and PacBio • Or PacBio and Nanopore • Resequencing for variant analysis • Illumina, Ion Torrent (smaller genomes) • RNAseq: • Illumina or Ion Torrent • Nanopore for direct RNA sequencing (no cDNA step) • Exome sequencing • Illumina or Ion Torrent • Metagenomics (16S sequencing) • Nanopore

  27. Where is all this data stored? • National Center for Biotechnology Information (NCBI): • >45,000 genomes • >26,000 RNAseq datasets • European Bioinformatics Institute (EBI): • >51,000 genomes • No simple web interface to expression data • Joint Genome Institute (JGI, DOE funded): • >150,000 genomes • No expression data • Genome data by download or programming interface

  28. Who analyzes all of this data? • Our ability to generate sequencing data is vastly outstripping our ability to analyze and annotate new genomes • Annotation: prediction or identification of all functional genetic elements including protein coding genes, ncRNAs, etc. • Two major public databases store and annotate or validate annotation for all genomes

  29. Main sequence archives https://www.ncbi.nlm.nih.gov/ https://useast.ensembl.org/index.html

  30. Pros and Cons of different archives • NCBI: National Center for Biotechnology Information • Databases are well integrated • Well integrated with literature (PubMed) • EBI: European Bioinformatics Institute • Same base data as NCBI, but offers different front-end • Much better list-based searching • Not as well integrated with literature • Transcript variants differ from NCBI because of different annotation pipelines • UNIPROT • All protein information from EBI is hosted here

  31. Exercise 1: Finding gene related information • Genes are annotated by: • NCBI/EBI for certain organisms (Human, Chicken, Dog) • Organism specific groups for model organisms like yeast, mouse, C. elegans, Drosophila • Genome sequencing centers • Ideally, genes have an official gene symbol and gene name and that is what is used in manuscripts, etc. • In reality, genes are identified by different groups at the same time and given different names • The Human Genome Organization (HUGO), Mouse Genome Informatics (MGI), etc. are organizations that define official gene nomenclature

  32. Exercise 1: Transcripts • Transcripts: • NCBI & EBI use different computational pipelines for predicting and annotating transcripts • There can be differences between them but typically at least the verified transcripts agree • Transcript isoforms have different accession numbers • BRCA1 has 4 transcript isoforms:

  33. Human TDP-43 gene • HGNC: HUGO Gene Nomenclature Committee • RNA binding protein with role in neurodegenerative disorders • HGNC official symbol: TARDBP • HGNC official name: TAR DNA binding protein • However: a search of PubMed using TARDBP returns 434 records, while using TDP-43 returns 2543 records • Use the most common name, but refer to the official gene symbol at least once in your manuscript, especially in the abstract
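
The PubMed comparison above can also be repeated programmatically. The sketch below uses the NCBI E-utilities `esearch` endpoint to request only the record count for a search term. The helper names are hypothetical, the JSON layout (`esearchresult` containing `count`) is what the service is understood to return and should be checked against the E-utilities documentation, and the counts will differ from the numbers on the slide because PubMed keeps growing.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_count_url(term):
    """Build an esearch URL that asks PubMed for matches to a search term."""
    query = {"db": "pubmed", "term": term, "retmax": 0, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(query)


def pubmed_count(term):
    """Fetch the number of PubMed records matching term (needs network access)."""
    with urlopen(pubmed_count_url(term), timeout=30) as response:
        payload = json.load(response)
    # Assumed response layout: {"esearchresult": {"count": "<number>", ...}}
    return int(payload["esearchresult"]["count"])
```

Calling `pubmed_count("TARDBP")` and `pubmed_count("TDP-43")` from an interactive session gives the two numbers to compare. NCBI asks users to keep automated requests infrequent, so this is suited to a handful of gene symbols rather than a whole gene list.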

  34. Genome viewers • Provides chromosomal context to the gene(s) of interest • See transcript variants in graphical view • Have “tracks” of additional information: • Variants (SNPs) • Expression data • Repetitive sequences • Comparative data (with other species) • Download genomic sequence • Ensembl genome viewer (useast.ensembl.org) • UCSC genome viewer (genome.ucsc.edu)

  35. TARDBP (TDP-43) in UCSC Genome Browser UCSC has its own annotation pipeline and so has different annotated transcripts

  36. Take home points • Rapid, high-throughput sequencing has opened up new ways to interrogate biological systems • Generates Tb (Fb, Pb) of data • Need computers to find it and analyze it • Hypothesis generating (usually), with follow-up in bench experiments on a limited number of genes • Public databases are a treasure trove of data that can be mined with different questions

  37. Today in computer lab • Exercise 1 is due on Wednesday, June 19th • Finding genes and transcripts using NCBI and EBI • Finding gene specific information using NCBI & EBI • Visualization of genes and transcripts with the EBI and UCSC genome browsers • Extra credit: write a 500-700 word abstract on why you would study Lyme disease and/or B. burgdorferi

  38. Source of data for the exercises • Lyme Borreliosis • Ixodes scapularis • Borrelia burgdorferi • LB is the most prevalent arthropod-borne infectious disease in N. America • Photo credit: Content provider(s): CDC. This media comes from the Centers for Disease Control and Prevention's Public Health Image Library (PHIL), with identification number #6631. • Carreras-Gonzalez A. et al. “A multi-omics analysis reveals the regulatory role of CD180 during the response of macrophages to Borrelia burgdorferi” Emerging Microbes and Infections (2018) 7: 1-13

  39. A few background papers (optional) Available for download or linked from the course website https://www.ncbi.nlm.nih.gov/pubmed/27976670 “Lyme Borreliosis”, Nat. Rev. Dis. Primers (2016) https://www.ncbi.nlm.nih.gov/pubmed/27900646 “The Potential of Omics Technologies in Lyme Disease Biomarker Discovery and Early Detection”, Infect. Dis. Ther. (2017) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5986905/ “Ixodes Immune Responses Against Lyme Disease Pathogens” Frontiers in Cell. and Infec. Microbiology (2018) https://www.ncbi.nlm.nih.gov/books/NBK532894/ Borrelia Burgdorferi. StatPearls (2018). Full text available.
