
BIO-454 Bio Computing

BIO-454 Bio Computing. Lecture 19: Next Generation Sequencing (NGS) (Cont’d). Dr. Mohammad Nassef, Computer Science Department, Faculty of Computers and Information, Cairo University.



Presentation Transcript


  1. BIO-454 Bio Computing, Lecture 19: Next Generation Sequencing (NGS) (Cont’d). Dr. Mohammad Nassef, Computer Science Department, Faculty of Computers and Information, Cairo University

  2. The Human Genome Project Setia Pramana

  3. The Human Genome Project • First draft of the human genome in 2001, final draft in 2004 • Estimated cost: $3 billion (about one dollar per nucleotide) • Time: 13 years • Used Sanger sequencing • Today: Illumina: 1 week, $9,500; Exome: 6 weeks, $1,000

  4. Next Generation Sequencing (NGS) • New technologies that allow the massive production of tens of millions of short sequencing fragments. • These techniques can be used to • address the same problems as microarrays, • and many others as well. • They raised the promise of personalized medicine.

  5. Next Generation Sequencing (NGS) • Also called: • Second Generation Sequencing • High-throughput Sequencing • Massively-parallel Sequencing

  6. Next Generation Sequencing (NGS) • NGS sequences a huge number of short DNA fragments. The resulting short reads can either be: • Overlapped to rebuild the original genome from scratch (de novo assembly); this is similar to the Newspaper Problem. • Aligned to a previously sequenced reference genome (reference-based assembly); the short reads that align to specific locations in the genome can provide information about the active/genetic regions at those locations.
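The two options on this slide can be illustrated with a toy Python sketch. It is not how production assemblers or aligners work: the greedy merge strategy, the exact-match alignment, and the names `overlap`, `greedy_assemble`, and `align_exact` are all assumptions made for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Toy de novo assembly: repeatedly merge the best-overlapping pair."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


def align_exact(read: str, reference: str) -> list[int]:
    """Toy reference-based alignment: every exact match position of `read`."""
    positions, start = [], reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


reads = ["ATGCGT", "GCGTAC", "TACGGA"]
print(greedy_assemble(reads))
print(align_exact("GCGTAC", "ATGCGTACGGA"))
```

The pairwise overlap search inside the loop is quadratic in the number of reads, which is one reason real assemblers rely on more elaborate data structures.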

  7. The Newspaper Problem: Sequencing of Genomes [Figure labels: Biological Genomes; Short Digital DNA Reads] • These reads must be assembled to form the entire genome. FCI-CU-EG

  8. The Newspaper Problem as an Overlapping Puzzle

  9. Modern Sequencing • Researchers take a small tissue or blood sample containing millions of cells with identical DNA. • They use biochemical methods to break the identical copies of the genome into fragments at random locations. • They then sequence these fragments to produce short reads.
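The procedure on this slide can be mimicked in a few lines of Python. This is only a simulation sketch: the function names, the fixed read length, and the choice to read the first `read_len` bases of each fragment are assumptions, not part of the lecture.

```python
import random


def fragment(genome: str, n_cuts: int, rng: random.Random) -> list[str]:
    """Break one genome copy at `n_cuts` random positions."""
    cuts = sorted(rng.sample(range(1, len(genome)), n_cuts))
    bounds = [0] + cuts + [len(genome)]
    return [genome[s:e] for s, e in zip(bounds, bounds[1:])]


def shotgun_reads(genome: str, n_copies: int, n_cuts: int,
                  read_len: int, seed: int = 0) -> list[str]:
    """Fragment many identical copies and 'sequence' each fragment.

    Assumption: a read is the first `read_len` bases of a fragment, and
    fragments shorter than `read_len` are discarded.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_copies):
        for frag in fragment(genome, n_cuts, rng):
            if len(frag) >= read_len:
                reads.append(frag[:read_len])
    return reads


genome = "ATGCGTACGGATTACAGGCTTAACCGGTTAGC"
reads = shotgun_reads(genome, n_copies=5, n_cuts=4, read_len=5)
print(len(reads), reads[:3])
```

Because each copy is cut at different random positions, reads from different copies overlap one another, which is what makes later assembly possible.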

  10. Challenges

  11. NGS vs. Microarray Technologies • The most common reasons researchers prefer microarrays are: • Low cost • Well-established technology, in use for around two decades • Abundant datasets • A large number of data analysis tools • Can handle a large number of samples • However, NGS technologies are more accurate.

  12. NGS Technologies/Platforms

  13. NGS • NGS has reduced sequencing costs significantly, making large-scale or whole-genome sequencing (WGS) studies much more affordable

  14. NGS Technologies/Platforms

  15. Differences between platforms • Run times vary from hours to days • Output ranges from Mb to Gb • Read length ranges from <100 bp to >1500 bp • Error rate per base ranges from 0.1% to 15% • Cost per base varies

  16. NGS Application [Diagram of NGS applications: RNA-seq, Whole Genome Seq, Gene Regulation, ExomeSeq, Epigenetics, Resequencing, Metagenomics]

  17. NGS Application • Whole genome re-sequencing • Ancient genomes • Metagenomics • Cancer genomics • Exome sequencing (targeted) • RNA sequencing • Chromatin immunoprecipitation (ChIP)-Seq: protein interaction with DNA • Genomic epidemiology • Epigenomics • Human genetic variation: SNP, CNV (diseases) • Anything with DNA

  18. Sequencing Factory: Beijing Genomics Institute • Purchased 128 HiSeq 2000 sequencers from Illumina in January 2010 • Each of them can produce 25 billion base pairs of sequence a day
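To put these figures in perspective, here is a small back-of-the-envelope calculation in Python. The genome size (about 3 billion nucleotides, as implied by the Human Genome Project slide) and the 30x coverage are assumptions used only for illustration.

```python
SEQUENCERS = 128
BASES_PER_SEQUENCER_PER_DAY = 25_000_000_000  # 25 billion base pairs

# Assumptions for illustration only:
GENOME_SIZE = 3_000_000_000  # ~3 billion nucleotides in a human genome
COVERAGE = 30                # each position sequenced ~30 times

total_bases_per_day = SEQUENCERS * BASES_PER_SEQUENCER_PER_DAY
genomes_per_day = total_bases_per_day / (GENOME_SIZE * COVERAGE)

print(f"Total output: {total_bases_per_day:,} bp/day")
print(f"Human genomes at {COVERAGE}x: about {genomes_per_day:.1f} per day")
```

Under these assumptions the facility produces 3.2 trillion base pairs a day, enough for roughly 35 human genomes at 30x coverage.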

  19. NGS Application: Whole Genome Seq

  20. NGS Application: Exome Genome Seq

  21. NGS Application: RNA Sequencing

  22. Bioinformatics Challenges of NGS

  23. Sequencing Has Gotten Cheaper and Faster • Cost of one human genome: • HGP: $3 billion (13 years) • 2004: $30,000,000 • 2008: $100,000 • 2010: $30,000 • 2011: $10,000 • 2012-13: $7,000 • 2014: $4,000 (~1 week) • ???: $1,000 • The race for the $1,000 genome
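The size of this price drop is easier to appreciate as a ratio. The short Python sketch below uses only the figures from the slide; the variable names are illustrative.

```python
# Cost of one human genome in US dollars, as listed on the slide.
cost_per_genome = {
    "HGP": 3_000_000_000,
    "2004": 30_000_000,
    "2008": 100_000,
    "2010": 30_000,
    "2011": 10_000,
    "2012-13": 7_000,
    "2014": 4_000,
}

hgp_cost = cost_per_genome["HGP"]
for label, cost in cost_per_genome.items():
    fold = hgp_cost / cost  # how many times cheaper than the HGP
    print(f"{label:>8}: ${cost:>13,}  ({fold:,.0f}x cheaper than the HGP)")
```

By the 2014 figure, a genome costs 750,000 times less than the Human Genome Project did.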

  24. Sequencing Is Getting Cheaper • NGS has reduced sequencing costs significantly, making large-scale or whole-genome sequencing (WGS) studies much more affordable

  25. NGS Challenges

  26. Huge Data Storage and HPC Demand

  27. Generalized NGS Analysis

  28. NGS Challenges • The highest cost is (almost) no longer the sequencing itself but storage and analysis. • A standard human whole-genome sequencing run (30-40x coverage) creates about 100 Gb of data • Extreme data size causes problems • Just transferring and storing the data is difficult • Standard all-against-all comparisons fail (they scale as N*N) • Standard tools cannot be used • Think in terms of fast, parallel programs
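Both numbers on this slide can be checked with simple arithmetic. The Python sketch below assumes a 3-billion-base genome and a read length of 100 bp; neither value is given on the slide, so treat them as illustrative.

```python
def bases_sequenced(genome_size: int, coverage: int) -> int:
    """Total bases produced when every position is covered `coverage` times."""
    return genome_size * coverage


def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n items: grows roughly as N*N."""
    return n * (n - 1) // 2


GENOME_SIZE = 3_000_000_000  # assumption: ~3 billion bases
READ_LEN = 100               # assumption: 100 bp reads

for coverage in (30, 40):
    total = bases_sequenced(GENOME_SIZE, coverage)
    n_reads = total // READ_LEN
    print(f"{coverage}x: {total / 1e9:.0f} Gb, {n_reads:,} reads, "
          f"{pairwise_comparisons(n_reads):.2e} read pairs")
```

At 30-40x coverage the output is 90-120 gigabases, consistent with the slide's figure of about 100 Gb, and comparing every read against every other would require on the order of 10^17 comparisons.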

  29. Bioinformatics Challenges of NGS • Need for large amounts of CPU power - Informatics groups must manage compute clusters - Existing software must be parallelized, or algorithms redesigned, to work in a parallel environment - This adds another level of software complexity and makes interoperability harder
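As a small illustration of redesigning a task for a parallel environment, the Python sketch below splits reads into chunks, processes the chunks concurrently, and merges the partial results. The task (counting bases) and all names are assumptions. Threads are used to keep the sketch portable; for CPU-bound work in CPython one would typically swap in `ProcessPoolExecutor`, and on a cluster the same split/process/merge pattern would be spread across nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_bases(chunk: list[str]) -> Counter:
    """Work done independently on one chunk of reads."""
    counts = Counter()
    for read in chunk:
        counts.update(read)
    return counts


def chunks(items: list[str], size: int):
    """Split a list into consecutive pieces of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parallel_base_counts(reads: list[str], workers: int = 4,
                         chunk_size: int = 2) -> Counter:
    """Split, process chunks concurrently, then merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_bases, chunks(reads, chunk_size)):
            total += partial
    return total


reads = ["ACGT", "AACC", "GGTT", "ACGT", "TTTT"]
print(parallel_base_counts(reads))
```

The merge step works here because per-chunk counts can simply be added; algorithms whose chunks depend on each other need a deeper redesign, which is the difficulty the slide points to.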

  30. Bioinformatics Challenges of NGS • VERY large text files (~10 million lines long) - "Business as usual" with familiar tools such as Perl/Python scripts no longer works - Impossible memory usage and execution time - Impossible to browse for problems • Sequence quality filtering is needed
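The memory problem usually comes from loading a whole file at once. One common workaround, sketched below in Python, is to stream the file in fixed-size batches of lines so that memory use stays bounded regardless of file size. The helper names and the batch size are assumptions, and `io.StringIO` stands in for a real file handle.

```python
import io
from itertools import islice


def batches(handle, batch_size: int):
    """Yield lists of up to `batch_size` lines without reading the whole file."""
    while True:
        batch = list(islice(handle, batch_size))
        if not batch:
            return
        yield batch


def count_lines(handle, batch_size: int = 1000) -> int:
    """Count lines while holding at most one batch in memory."""
    return sum(len(batch) for batch in batches(handle, batch_size))


# A stand-in for a large text file; a real one would be opened with open(path).
fake_file = io.StringIO("ACGT\n" * 2500)
print(count_lines(fake_file))
```

Any per-line or per-record processing can replace the counting step, as long as it does not need the entire file in memory at the same time.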

  31. Data Management Issues • Raw data are large. How long should they be kept? • Processed data are manageable for most people • 20 million reads (50 bp) ~ 1 Gb • More of an issue for a facility: HiSeq recommends 32 CPU cores, each with 4 GB RAM • Certain studies are much more data intensive than others • Whole genome sequencing at 30X coverage: one genome pair (tumor/normal) ~ 500 GB; 50 genome pairs ~ 25 TB
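The storage figures on this slide follow from straightforward multiplication, as the Python sketch below shows. The function names and the decimal convention (1 TB = 1000 GB) are assumptions.

```python
GB_PER_TB = 1000  # assumption: decimal units


def total_bases(n_reads: int, read_len: int) -> int:
    """Total number of bases in a set of equal-length reads."""
    return n_reads * read_len


def cohort_storage_tb(n_pairs: int, gb_per_pair: float = 500) -> float:
    """Storage for `n_pairs` tumor/normal genome pairs, in terabytes."""
    return n_pairs * gb_per_pair / GB_PER_TB


print(f"{total_bases(20_000_000, 50) / 1e9:.0f} Gb in 20 million 50 bp reads")
print(f"{cohort_storage_tb(50):.0f} TB for 50 genome pairs")
```

The same two functions can be reused to budget storage for a planned study by changing the read count, read length, or number of genome pairs.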

  32. Bioinformatics Challenges of NGS • In NGS we have to process very large amounts of data, which is not trivial in computing terms. • Big NGS projects require supercomputing infrastructure: not everyone can study everything. Small facilities must choose their projects carefully so that they scale with their computing capabilities

  33. Computational Infrastructure for NGS • We can start with: - Computing cluster: multiple nodes (servers), each with multiple cores • High-performance storage (TB, PB level) • Fast networks (10 Gb Ethernet, InfiniBand) - Enough space and suitable conditions for the equipment ("server room") - Skilled people (system administrators, developers)

  34. Big Computing Infrastructure • Distributed memory cluster: starting at 20 computing nodes, 60 to 240 cores, at least 48 GB RAM per node • Fast networks: 10 Gbit, InfiniBand • Optional MPI and GPU environment, depending on project requirements • Starting at €200,000 (hardware only)

  35. Middle-Size Infrastructure • "Small" distributed file system (around 50 TB). • "Small" cluster (around 10 nodes, 80 to 120 cores). • At least a gigabit Ethernet network. • Price range: €50,000-100,000 (hardware only)

  36. Small Infrastructure • At least 2 machines recommended • 8 or 12 cores per machine • 48 GB RAM minimum per machine • BIG local disk: at least 4 TB per machine, and as many local disks as we can afford • Price range: starting at €8,000-10,000 (2 machines)

  37. Alternatives • Cloud Computing • Grid Computing

  38. Swedish National Infrastructure for Large Scale DNA sequencing (SNISS)

  39. UPPNEX • UPPmax NEXt generation sequence cluster & storage • Located at UPPMAX, the Uppsala Multidisciplinary Center for Advanced Computational Science • Dedicated computer cluster (500 nodes) • UPPNEX serves over 240 projects and hosts over 800 TB of data

  40. Interpretation Bottleneck

  41. Big Collaboration • Collaborative expertise (human intelligence and intuition) is required for meaning and interpretation (Bergeron 2002) • This includes on-demand communication and sharing of protocols, electronic resources, data, and findings among the stakeholders • Collaboration with other Big Data sources: national registers, BPJS, hospitals, etc.

  42. Next Generation Projects • 1000 Genomes Project (to provide a comprehensive resource on human genetic variation) • TCGA (The Cancer Genome Atlas) • MalariaGEN: sequencing thousands of malaria isolates • 1001 Genomes Project: Arabidopsis WGS • UK10K: sequencing 10,000 healthy and disease-affected individuals • Southeast Asia Mycobacterium tuberculosis complex (MTBC) DB: sequencing MTBC isolates • Many more...

  43. Collaboration Challenges • Potential conflict between traditional silo researchers and those embracing Big Collaboration • Compatible technologies and Cloud infrastructures • IT management of groups with different tools, requirements and expectations • Ownership of data • Government regulations and policies • Accessible data repositories and lack of transparency in findings • Resources to support bioinformatics • Patient privacy

  44. Five Domains of Genomic Research (Green. 2011. Nature 470, 204-213)

  45. Summary • Unraveling bioinformatics (big) data would enable the right decisions at the right time for the right patients. • The problem is not producing the data but interpreting them. • Bioinformatician is one of the hottest jobs.

  46. Summary • Challenges: • Still expensive • Lack of infrastructure (in developing countries) • Lack of skilled personnel in bioinformatics • Need for (large-scale) collaborations • Integrating different technologies and systems • Making it all clinically relevant

  47. Manipulating RNA-seq Data in R • Processed RNA-seq datasets come in two formats: • A large dataset that contains all the sequenced RNA reads along with other information about each read. • Use this kind of dataset when you want to analyze the RNA sequences themselves (matches and differences between the RNA sequences of different samples). • A dataset that reflects the expression level of genes according to the amount of sequenced RNA that has been aligned to genetic regions in a reference genome. • Use this kind of dataset when you want to compare gene expression levels between different samples.
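The second kind of dataset can be thought of as a table of read counts per gene. The sketch below shows the idea in Python rather than R, purely for illustration: the gene coordinates, the alignment positions, and the function name are all made up, and real pipelines use dedicated counting tools.

```python
def count_reads_per_gene(read_positions: list[int],
                         genes: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Count aligned reads that start inside each gene region.

    `genes` maps a gene name to a half-open (start, end) interval on a
    single reference sequence; reads outside every gene are ignored.
    """
    counts = {name: 0 for name in genes}
    for pos in read_positions:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return counts


# Hypothetical gene regions and alignment start positions for two samples.
genes = {"geneA": (0, 100), "geneB": (200, 300)}
samples = {
    "sample1": [5, 50, 250, 150],
    "sample2": [10, 210, 220, 290],
}

expression = {name: count_reads_per_gene(positions, genes)
              for name, positions in samples.items()}
for sample, counts in expression.items():
    print(sample, counts)
```

Comparing the per-gene counts across samples (after suitable normalization, which is omitted here) is what "comparing gene expression levels" amounts to.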

  48. Manipulating RNA-seq Data in R: Kind 1 • This kind of dataset is stored in FASTA/FASTQ files. • The difference: • FASTA files: each RNA read has 2 lines (info + sequence) • FASTQ files: each RNA read has 4 lines (info + sequence + separator line + quality info)
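The difference between the two layouts is easiest to see in a parser. The sketch below uses Python rather than R for illustration and makes several assumptions: each FASTA record occupies exactly 2 lines as described on the slide (real FASTA files may wrap a sequence over several lines), quality characters use the common Phred+33 encoding, and the quality threshold of 20 is arbitrary.

```python
def parse_fasta(lines: list[str]) -> list[tuple[str, str]]:
    """Parse 2-line FASTA records: '>info' followed by the sequence."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    return [(lines[i][1:], lines[i + 1]) for i in range(0, len(lines), 2)]


def parse_fastq(lines: list[str]) -> list[tuple[str, str, str]]:
    """Parse 4-line FASTQ records: '@info', sequence, '+', quality string."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    records = []
    for i in range(0, len(lines), 4):
        header, seq, _separator, qual = lines[i:i + 4]
        records.append((header[1:], seq, qual))
    return records


def mean_quality(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)


fasta = [">read1 sample", "ACGT", ">read2", "GGCC"]
fastq = ["@read1", "ACGT", "+", "IIII", "@read2", "GGCC", "+", "!!!!"]

print(parse_fasta(fasta))
good = [rec for rec in parse_fastq(fastq) if mean_quality(rec[2]) >= 20]
print(good)
```

The extra quality line is what makes quality filtering possible with FASTQ files; FASTA files carry no per-base quality information.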

  49. Sample FASTA File
