
Cancer Genome Assemblies and Variations between Normal and Tumour Human Cells






Presentation Transcript


  1. Cancer Genome Assemblies and Variations between Normal and Tumour Human Cells. Zemin Ning, The Wellcome Trust Sanger Institute

  2. Outline of the Talk:
     • Project Background
     • Why De novo Assembly
     • The New Phusion Pipeline
     • Kmer Words Hashing
     • Relational Matrix
     • 454 Reads and Assembly
     • Cancer Genome Assemblies from Solexa Reads
     • Variations between Cell Samples

  3. ICGC: International Cancer Genome Consortium

  4. Large-Scale Studies of Cancer Genomes
     • Johns Hopkins: > 18,000 genes analyzed for mutations; 11 breast and 11 colon tumors. L.D. Wood et al., Science, Oct. 2007
     • Wellcome Trust Sanger Institute: 518 genes analyzed for mutations; 210 tumors of various types. C. Greenman et al., Nature, Mar. 2007
     • TCGA (NIH): multiple technologies; brain (glioblastoma multiforme), lung (squamous carcinoma), and ovarian (serous cystadenocarcinoma). F.S. Collins & A.D. Barker, Sci. Am., Mar. 2007

  5. Melanoma cell line COLO-829. Paul Edwards, Departments of Pathology and Oncology, University of Cambridge

  6. Melanoma-Skin Cancer Disease

  7. Sequencing COLO-829 on Illumina: Strategy
     • Tumour or normal genomic DNA
     • Fragments of defined size: 0.2, 2, 3, 4 kb
     • Sequencing: 75 bp reads (short insert), 50 bp reads (long insert); sequencing performed at Illumina
     • Alignment using bwa, ssaha2
     • De novo assembly
     • Somatic mutations; germline variants

  8. Read Coverage (COLO-829): 40X tumour, 32X normal

  9. Why De novo Assemblies
     • Reference is not complete: the current human genome reference consists of hundreds of contigs, and its sequence representation is only ~90%.
     • Reference is mosaic: the DNA samples for the current reference came from 8 individuals, although one dominant individual represents > 80%.
     • Limitations of alignment against reference: read alignment can reliably call SNPs and short indels, where the detectable indel length depends on the read length, but it is very hard to find structural variants, particularly long novel insertion elements.
     • Genomes without references
     • Loss of one haplotype in a diploid sample

  10. De Bruijn vs Read Overlap

  11. New Phusion Assembler [pipeline diagram; labels: Assembly, Data Process, Solexa Reads, Supercontig, Long Insert Reads, PRono, Contigs, Reads Group, Fuzzypath, 2x75 or 2x100, Velvet, Phrap]

  12. Repetitive Contig and Read Pairs [figure: three depth plots; reads grouped by Phusion]

  13. Kmer Word Hashing
     Read: ATGGCGTGCAGTCCATGTTCGGATCA
     • Contiguous Base Hash, K = 12: ATGGCGTGCAGT, TGGCGTGCAGTC, GGCGTGCAGTCC, GCGTGCAGTCCA, CGTGCAGTCCAT
     • Gap-Hash 4x3: ATGGGCAGATGT, TGGCCAGTTGTT, GGCGAGTCGTTC, GCGTGTCCTTCG
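
The two hashing schemes on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Phusion implementation: the function names are hypothetical, and the gap of three skipped bases between the 4-base blocks is inferred from the example words on the slide rather than stated there.

```python
def contiguous_kmers(seq, k=12):
    """Return every window of k consecutive bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def gapped_kmers(seq, block=4, blocks=3, gap=3):
    """Return words made of `blocks` runs of `block` bases,
    skipping `gap` bases between runs (gap=3 is inferred from the slide)."""
    step = block + gap
    span = blocks * block + (blocks - 1) * gap  # bases covered by one word
    words = []
    for i in range(len(seq) - span + 1):
        parts = [seq[i + b * step:i + b * step + block] for b in range(blocks)]
        words.append("".join(parts))
    return words


def encode(word):
    """Pack a word into an integer using 2 bits per base (A=0, C=1, G=2, T=3)."""
    code = 0
    for base in word:
        code = (code << 2) | "ACGT".index(base)
    return code


read = "ATGGCGTGCAGTCCATGTTCGGATCA"
print(contiguous_kmers(read)[:5])
print(gapped_kmers(read)[:4])
```

With the read from the slide, the first five contiguous words and the first four gapped words match the lists shown above. The 2-bit encoding is a common way to turn a 12-base word into a 24-bit hash key; whether Phusion uses exactly this base ordering is an assumption.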

  14. Word use distribution for the mouse sequence data at ~7.5 fold [figure: real data curve vs Poisson curve, with the useful region marked]
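
The Poisson curve that the real data is compared against can be reproduced roughly as follows. This is a sketch under stated assumptions: only the ~7.5-fold depth comes from the slide, while the read length of 500 and k = 12 are placeholder values chosen for illustration.

```python
import math


def poisson_pmf(count, mean):
    """Probability that a word occurs exactly `count` times under a Poisson model."""
    return math.exp(-mean) * mean ** count / math.factorial(count)


def kmer_depth(read_depth, read_len, k):
    """Expected word depth: a read of length L contains L - k + 1 words of length k."""
    return read_depth * (read_len - k + 1) / read_len


READ_DEPTH = 7.5  # from the slide
READ_LEN = 500    # assumed, not given on the slide
K = 12            # assumed, borrowed from the hashing slide

mean = kmer_depth(READ_DEPTH, READ_LEN, K)
for count in range(0, 21, 4):
    print(count, round(poisson_pmf(count, mean), 4))
```

Real word counts typically depart from this curve at the low end (words created by sequencing errors) and at the high end (words from repeats), which is presumably why the slide marks only a middle band as the useful region.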

  15. Sorted List of Each k-Mer and Its Read Indices
     • ACAGAAAAGC: 10h06.p1c, 12a04.q1c, 13d01.p1c, 16d01.p1c, 26g04.p1c, 33h02.q1c, 37g12.p1c, 40d06.p1c
     • ACAGAAAAGG: 16a02.p1c, 20a10.p1c, 22a03.p1c, 26e12.q1c, 30e12.q1c, 47a01.p1c
     Labels on the slide: High bits, Low bits; 64 - 2k, 2k
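
One way to build such a sorted list is to pack each (k-mer, read index) pair into a single 64-bit integer and sort the integers. The sketch below is illustrative only. The transcript does not make clear which field occupies the high bits, so the layout here (k-mer in the top 2k bits, read index in the remaining 64 - 2k bits) is an assumption chosen so that a plain integer sort groups identical k-mers; the numeric read indices are also made up.

```python
from itertools import groupby

K = 10                  # word length used in the slide's example
READ_BITS = 64 - 2 * K  # bits left for the read index (assumed layout)


def encode(word):
    """Pack a word into an integer using 2 bits per base (A=0, C=1, G=2, T=3)."""
    code = 0
    for base in word:
        code = (code << 2) | "ACGT".index(base)
    return code


def decode(code, k=K):
    """Inverse of encode for a word of length k."""
    return "".join("ACGT"[(code >> 2 * (k - 1 - i)) & 3] for i in range(k))


def pack(word, read_index):
    """Place the k-mer code above the read index in one 64-bit value."""
    return (encode(word) << READ_BITS) | read_index


def read_of(packed):
    """Extract the read index from a packed value."""
    return packed & ((1 << READ_BITS) - 1)


# Hypothetical (k-mer, read index) pairs in arbitrary input order.
pairs = [("ACAGAAAAGG", 8), ("ACAGAAAAGC", 0), ("ACAGAAAAGG", 9), ("ACAGAAAAGC", 1)]
packed = sorted(pack(word, read) for word, read in pairs)

# After sorting, identical k-mers are adjacent and can be grouped in one pass.
table = {
    decode(code): [read_of(p) for p in group]
    for code, group in groupby(packed, key=lambda p: p >> READ_BITS)
}
print(table)
```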

  16. Relation Matrix: R(i,j) – number of kmer words shared between read i and read j
           j=1    2    3    4    5    6  …  N
     i=1     -   41    0    0    0    0
       2    41    -   37    0    0    0
       3     0   37    -    0   22    0
       4     0    0    0    -    0   27
       5     0    0   22    0    -    0
       6     0    0    0   27    0    -
       …
       N
     Group 1: (1,2,3,5)
     Group 2: (4,6)
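
The grouping shown on this slide can be reproduced by treating R(i,j) as a graph and collecting connected components. The following is a minimal sketch, not the Phusion code: the function names, the sparse dictionary representation, and the `min_shared` threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def relation_matrix(kmer_to_reads):
    """Count shared words per read pair: R[(i, j)] with i < j (sparse form)."""
    shared = defaultdict(int)
    for reads in kmer_to_reads.values():
        for i, j in combinations(sorted(set(reads)), 2):
            shared[(i, j)] += 1
    return dict(shared)


def group_reads(shared, n_reads, min_shared=1):
    """Union reads linked by at least `min_shared` words (hypothetical threshold)."""
    parent = list(range(n_reads + 1))  # reads are numbered 1..n_reads

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), count in shared.items():
        if count >= min_shared:
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for read in range(1, n_reads + 1):
        groups[find(read)].append(read)
    # Singletons are dropped; each group lists its reads in increasing order.
    return sorted(g for g in groups.values() if len(g) > 1)


# Non-zero entries of the matrix shown on the slide (upper triangle only).
slide_matrix = {(1, 2): 41, (2, 3): 37, (3, 5): 22, (4, 6): 27}
print(group_reads(slide_matrix, 6))
```

Run on the slide's matrix, this yields the two groups shown there, (1,2,3,5) and (4,6). Raising `min_shared` would split groups that are held together only by a few shared words, which is one plausible way to keep repeats from merging unrelated reads.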

  17. Relation Matrix: R(i,j) – Implementation [figure: rows are read indices 1 … N, columns j run 1 … 500; each entry R(i,j) holds a number of shared kmer words (< 63)]

  18. Stats of 454 Reads – NA12878
     • Number of reads: 160.86 m
     • Total number of bases: 35.9 Gb
     • Reference genome size: 3.0 Gb
     • Sequencing platform: FLX & Titanium
     • Read length: 50-500 bp
     • Average read length: 224 bp
     • Estimated read coverage: ~10X
     • Number of reads uniquely placed: 152.81 m
     • Ratio of uniquely placed reads: 95.0%
     • Vector sources: Unknown

  19. Stats of the Assembly
     Contigs:
     • Total assembled bases: 2.78 Gb
     • Number of contigs: 526,437
     • Average contig length: 5,280
     • Contig N50: 11,000
     • Largest contig: 85,538
     Supercontigs:
     • Total assembled bases: 3.17 Gb
     • Number of supercontigs: 54,487
     • Average supercontig length: 58,263
     • Supercontig N50: 1,122,317
     • Largest supercontig: 8,015,559
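
For readers unfamiliar with the N50 figures quoted on this slide, the statistic can be computed as in the sketch below. It uses the common definition (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembled bases); the slide does not say which exact variant was used, and the example lengths are made up.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def mean_length(lengths):
    """Average contig length, for comparison with N50."""
    return sum(lengths) / len(lengths) if lengths else 0.0


# Hypothetical contig lengths, not taken from the assembly.
example = [2, 3, 4, 5, 6]
print(n50(example), mean_length(example))
```

N50 weights long contigs more heavily than the plain average does, which is why the slide's N50 values are larger than the corresponding average lengths.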

  20. Paired Reads Separated by “NN”

  21. Error Bases Correction

  22. Genome Assembly – Normal Cell
     Solexa reads:
     • Number of reads: 557 million
     • Finished genome size: 3.0 Gb
     • Read length: 2x75 bp
     • Estimated read coverage: ~25X
     • Insert size: 190/50-300 bp
     • Number of reads clustered: 458 million
     Assembly features (contig stats):
     • Total number of contigs: 1,020,346
     • Total bases of contigs: 2.713 Gb
     • N50 contig size: 8,344
     • Largest contig: 107,613
     • Average contig size: 2,659
     • Contig coverage over the genome: ~90%
     • Mis-assembly errors: ?

  23. Genome Assembly – Tumour Cell
     Solexa reads:
     • Number of reads: 562 million
     • Finished genome size: 3.0 Gb
     • Read length: 2x75 bp
     • Estimated read coverage: ~25X
     • Insert size: 190/50-300 bp
     • Number of reads clustered: 449 million
     Assembly features (contig stats):
     • Total number of contigs: 1,249,719
     • Total bases of contigs: 2.690 Gb
     • N50 contig size: 6,073
     • Largest contig: 72,123
     • Average contig size: 2,152
     • Contig coverage over the genome: ~90%
     • Mis-assembly errors: ?

  24. Deletions – Normal Cell. Alus: ~300 bp; LINEs: ~6000 bp

  25. Deletions – Tumour Cell. Alus: ~300 bp; LINEs: ~6000 bp

  26. Tumour Specific Indels
     • Number of deletions: 18,449
     • Number of insertions: 15,899
     • These numbers are higher than expected (3000-4000 deletions/insertions)
     • Experimental validation: ?

  27. Acknowledgements: • Jim Mullikin • Yong Gu • Tony Cox • Elizabeth Murchison • Erin Pleasance • Mike Stratton
