1 / 45

THESIS DEFENSE

THESIS DEFENSE. Jonathan Laserson advised by Daphne Koller October 20 th , 2011. Bayesian Assembly of Reads from High Throughput Sequencing. High-Throughput Sequencing. CGGACT. Output. CGGACTATCC ACTATCCACCG ACCCTGTGGGTCC CCGTCAGA ACCCGCCCTA.

manchu
Download Presentation

THESIS DEFENSE

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. THESIS DEFENSE Jonathan Laserson advised by Daphne Koller October 20th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing

  2. High-Throughput Sequencing CGGACT Output CGGACTATCC ACTATCCACCG ACCCTGTGGGTCC CCGTCAGA ACCCGCCCTA . . . • Millions of noisy short reads from a sample • A snapshot of the genomic material in the sample • Roche 454 GS-FLX Titanium • Avg read length 380b • reads in a run 500k • Base error rate 0.74% (84% indels) • Cost per base 0.05$/kb genomes [You et al., 2011] reads

  3. Population Sequencing sequences noise mutations • Depth • Phylogenetic trees • Immune-seq • Breadth • Metagenomics • de novo assembly x x x x reads x x x x x x x x x x x x x x

  4. ACTGCCCTGGGAAAAGCCCAGCCCCTCATTTC ACTGCCCTGG ACTGCCCTG GCCCTGGGA CTGGGAAAAG CTGGGAAAAG AAAGCCCAG AGCCCAGCCC AGCCCAGCCCC CCCTCATTTC reads Input CTGGGAAAAG AGCCCAGCCCC CGTAGATCA CTGGAAAAG GCTGACGTAG AAAGCCCAG ACTGACCTG CCCTCATTTC ACTGCCCTGG GACGTAGATCA GCCCTGGTGA CGGCTGACGT AGATCATAGA AGCCCAGCCC CATAGAAACGAT GENOVO reads Output contigs CGGCTGACGTAGATCATAGAAACGAT CGGCTGACGT GCTGACGTAG GACGTAGATCA CGTAGATCA AGATCATAGA CATAGAAACGAT an assembly

  5. Previous Work • Velvet, Euler: • Assume most reads have no error • Assume a single genome • Trash low-abundance reads • Fix reads with one error a-priori • Newbler: No code / publication

  6. Our Approach: Probabilistic Model • Fully Bayesian model for high-throughput sequencing • Jointly performing 3 tasks: • de-noising the reads • de novo sequence assembly • multiple alignment • We do not throw away reads • Can handle unbalanced populations of genomes

  7. Our Approach: Probabilistic Model bso ~ U[A,C,G,T] π α β s ~ CRP(α, N) s=1..∞ ρs ~ Beta(1,1+β) oi ~ G(ρs_i) si oi ρs i=1..N yi ~ read_model(b,si,oi) yi bso o=-∞..+∞ log P(y,b,s,o|ρ) ≈#errors · log(ε) -#bases · log(4) + #seq · log(α) min read errors min bases we want to find argmax P(b,s,o | y) b,s,o

  8. Inference • MCMC • Making moves in a probabilistic space • Stochastic process • escape local optimum • multiple hypotheses

  9. Move 1: Read Mapping 10 18 candidate locations P(si=s,oi=o | y,b,ρ,s-i,o-i) α Geometric Term Stickiness Term Likelihood Term A READ samples a sequence and an offset (inside the seq) oi si

  10. Comparison:How many bases did you BLAST into? horizontal line: total basescovered byall methodstogether vertical line: threshold for a perfect match

  11. Comparison:How many bases did you BLAST into? horizontal line: total basescovered byall methodstogether vertical line: threshold for a perfect match

  12. A Score for an Assembly min bases min read errors α controls minimal overlapbetween adjacent reads #errors · log(ε) -#bases · log(4) + #seq · log(α) =

  13. Contributions • Fully Bayesian model for high-throughput sequencing • An algorithm jointly performing 3 tasks: • de-noising the reads • de novo sequence Assembly • multiple alignment • A score for de novo assembly • Code freely available Laserson J, Jojic V, Koller D, Genovo: de novo Assembly for Metagenomes,Journal of Computational Biology, 2011. RECOMB 2010.

  14. Population Sequencing sequences noise mutations • Depth • Phylogenetic trees • Immune-seq • Breadth • Metagenomics • de novo assembly x x x x reads x x x x x x x x x x x x x x

  15. Immune System • 1010 B-Cells in an individual

  16. Immune System 57 x 27 x 6 • 1010 B-Cells in an individual • Phase 1: generate basic seq • choose V,D,J from catalog [Illustration: Wikipedia] ~24 bases ~52 bases ~297 bases V D J

  17. Immune System • 1010 B-Cells in an individual • Phase 1: generate basic seq • choose V,D,J • stitch V-D and D-J by: • eat up a few letters from ends • insert a few letters [Illustration: Wikipedia] ~360 bases V N1 D N2 J

  18. Immune System • 1010 B-Cells in an individual • Phase 1: generate basic seq • Phase 2: hyper-mutate • If detecting an invader, clone yourself with lots-of-mutations • Even better detection? keep cloning. • Remember the successful sequences for future reference, and keep producing them. • This set of sequences is one clone. [Illustration: Janeway, 2001] V N1 D N2 J

  19. 6 1 V1 V4 V7 J J J D D D x x 6 2 x x 1 1 x x x x x x x x x x x x x x x Input reads igForest germline Output clones sequencing noise mutations

  20. ReperTree De Novo Annotation forB-Cells • Phase 1: Divide reads into clones • 1M reads! • In a healthy individual almost every readis from a different clone. • Phase 2: Construct a mutation tree for each clone. • How to tell a mutation from a sequencing error?

  21. Overview J J J J J J J J J J J J J J J J J J D D D D D D D D D D D D D D D D D D • 10 (healthy) Organ donors • 4 samples from each individual • Blood, spleen, two lymph nodes • 380k reads total igForest • 8,000 clones • 165k other B-cells variants (not clones)

  22. A Repertoire Vs (57) Js (6)

  23. Variants (173k) Vs (57) patients patients Js (6) spleen lymph node 1 lymph node 2 blood

  24. Repertoire Factorization RMSE = 1.8e-3 X(v,j,donor, source) = a(v,j,source) xb(v,j,donor) *1e-3

  25. Properties Of Variants J J J J J J J J J J D D D D D D D D D D distance to germline fraction of variants length of variable region length of N1 length of N2

  26. blood spleen Mesenteric Mediastinal J J J D D D Clones (8,000) shared 2 sources shared 3 sources pure clones patients Evidence of clone migration How do the trees look like?

  27. Current Phylogenetic Trees igForest traditional phylo trees • X*100 sequences • Weeks to run • No sequencing/PCR errors • Single solution • Binary tree • Data at the leaves • Continuous mutation model • X*10,000 √ • Hours √ • Noise model √ • Multiple/statistics √ • √ • √ • Discrete √

  28. Previous Work on B Cells • ihmmune-align [Gaëta et al. 2007], SoDA2[Munshaw and Kepler 2010] • examine each read independently, no tree • igTree [Barak et al. 2008] • heuristic search, no high-throughput support • Our Approach: • Joint probabilistic model for hypermutation and sequencing noise igForest

  29. Generate Tree (Birth/death process) • Start with one live cell at t=0 • Give birth at rate λ • Die at rate δ • Stop when t= “snapshot time”. t=0 time t=snapshot root

  30. Generate Sequences • Start with the inferred germline at the root • Generate each child sequencefrom the sequence of its parent • Using a Mutation Model • Continue so, recursively. t=0 time t=snapshot x root x x x x x x x x x x x x x x

  31. Generate Reads • Total N reads. • Assign reads to live cells (uniformly). • Copy read sequence from cell, based on read model. live cells t=0 t=snapshot reads

  32. Inference • MCMC • Moves to manipulate tree and read assignments • Generating samples of trees along the way

  33. Output Tree t=0 t=snapshot 2 1 1 0 1 2 mutation distance 1

  34. Output Tree 2 1 1 8 6 2 1 6 2 t=0 t=snapshot 2 1 1 0 1 2 1

  35. Sample Trees blood spleen Mesenteric Mediastinal mutations compared to germline reads

  36. Sample Trees blood spleen Mesenteric Mediastinal

  37. Properties Of Trees max mutational height fraction of clones median mutational height median mutational distance (edge length) max mutational distance (edge length)

  38. Mutation Analysis – V region fraction of mutations mutations binned by their aligned codon position variants CDR3 FR3 FR2 CDR2 J J J J J J J J D D D D D D D D clones clones FR3 FR3 CDR3 CDR3 FR2 FR2 CDR2 CDR2 Clones root down the tree variants fraction of mutations Clones germline to root

  39. Mutation Analysis – V region fraction of mutations mutations binned by their aligned codon position variants CDR3 FR3 FR2 CDR2 J J J J J J J J D D D D D D D D clones clones FR3 FR3 CDR3 CDR3 FR2 FR2 CDR2 CDR2 Clones root down the tree variants fraction of mutations Clones germline to root

  40. Contributions • Fully Bayesian phylogenetic analysis • O(10k) reads • Handles sequencing noise • Big Picture view of the B cell repertoire • Intuitive clone description • Insight on the Immune System • Evidence of clone migration • Differences between tissues • Deep hypermutation analysis • Code (will be) freely available

  41. Thank you! sequences reads noise mutations • Depth • Breadth x x x x x x x x x x x x x x x x x x

  42. Acknowledgements Daphne Koller

  43. Acknowledgements Funding • National Science Foundation • Stanford Graduate Fellowship • Vladimir Jojic • Chair: Richard Olshen • Scott Boyd • Andrew Fire • Katherine Jackson • Elyse Hope

  44. DAGS team • Ben Packer • Karen Sachs • Geremy Heitz • Serafim Batzoglou and his group • Orly Fuhrman • Leor Vartikovski • My Parents

More Related