Bioinformatics: Variant Identification, Focusing on SVs

Mark Gerstein, Yale University, gersteinlab.org/courses/452 (last edit in spring ’17).

Presentation Transcript


  1. Bioinformatics: Variant Identification, Focusing on SVs. Mark Gerstein, Yale University, gersteinlab.org/courses/452 (last edit in spring ’17)

  2. Main Steps in Genome Resequencing [Snyder et al. Genes & Dev. ('10)]

  3. Main Steps in Genome Resequencing [Snyder et al. Genes & Dev. ('10)]

  4. Bayes’ Theorem to detect genomic variant
     $$P(G \mid D) = \frac{P(D \mid G)\,P(G)}{\sum_{i=1}^{n} P(D \mid G_i)\,P(G_i)}$$
     • In the above equation: • $D$ refers to the observed data • $G$ is the genotype whose probability is being calculated • $G_i$ refers to the $i$th possible genotype, out of $n$ possibilities

  5. Calculating the conditional distribution Assuming an error-free model, for each heterozygous SNP site of the diploid genome covered by K reads, the number of reads representing one of the two alleles follows a binomial distribution. With errors, the calculation is more complicated. In general:
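
A minimal sketch of how the Bayes step and the binomial read-count model fit together at a single biallelic site. This is an illustration under stated assumptions, not the formula from the slide: the symmetric per-read error rate `error`, the genotype labels `RR`/`RA`/`AA`, and the function names are all hypothetical.

```python
from math import comb

def genotype_likelihoods(k_alt, n_reads, error=0.01):
    """P(data | genotype) for k_alt alternate-allele reads out of n_reads.

    Assumption: each read independently shows the alternate allele with
    probability `error` (hom-ref), 0.5 (het) or 1 - error (hom-alt).
    """
    p_alt = {"RR": error, "RA": 0.5, "AA": 1.0 - error}
    return {
        g: comb(n_reads, k_alt) * p ** k_alt * (1.0 - p) ** (n_reads - k_alt)
        for g, p in p_alt.items()
    }

def genotype_posteriors(k_alt, n_reads, priors, error=0.01):
    """Bayes' theorem: normalise likelihood * prior over all genotypes."""
    lik = genotype_likelihoods(k_alt, n_reads, error)
    joint = {g: lik[g] * priors[g] for g in lik}
    total = sum(joint.values())
    return {g: v / total for g, v in joint.items()}

# Hypothetical flat priors; 10 of 20 reads carry the alternate allele.
flat = {"RR": 1 / 3, "RA": 1 / 3, "AA": 1 / 3}
post = genotype_posteriors(10, 20, flat)
best = max(post, key=post.get)
```

With an even split of reads, the heterozygous genotype `RA` receives almost all of the posterior mass; real callers also fold in base and mapping qualities, which this sketch ignores.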

  6. Main Steps in Genome Resequencing [Snyder et al. Genes & Dev. ('10)]

  7. Methods to Find SVs • 1. Paired ends • 2. Split read • 3. Read depth (or aCGH) • 4. Local reassembly [Figure: a deletion relative to the reference as seen by each method: mapping of sequenced paired-ends, mapping of a split read, and read count against a zero level] [Snyder et al. Genes & Dev. ('10)]

  8. Read Depth

  9. Counting mapped reads: reads from an individual genome are mapped to the reference genome, giving a read depth signal measured against a zero level [Figure: array signal and read depth across LCRs A, B, C and D for patients 98-135, 99-199 and 97-237] [Urban et al. ('06) PNAS; Wang et al. Gen. Res. ('09); Abyzov et al. Gen. Res. (’11)]

  10. Reads to Signal Track • Reads (fasta) + quality scores (fastq) + mapping (BAM) • Reads => Signal (intermediate file) • Accumulating @ >1 Pbp/yr (currently), ~20% of tot. HiSeq output [PLOS CB 4:e1000158]
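
The "Reads => Signal" step can be sketched as counting mapped read starts in fixed-width bins. This is a simplified illustration: it assumes the mapped positions have already been extracted (for example from a BAM file) into a plain list of 0-based coordinates, and the function name and the default `bin_size` are hypothetical.

```python
def read_depth_signal(read_starts, chrom_length, bin_size=100):
    """Count mapped read start positions in consecutive fixed-width bins.

    read_starts: iterable of 0-based start coordinates on one chromosome.
    Returns a list with one count per bin (the last bin may be shorter).
    """
    n_bins = (chrom_length + bin_size - 1) // bin_size  # ceiling division
    signal = [0] * n_bins
    for pos in read_starts:
        if 0 <= pos < chrom_length:  # silently skip out-of-range positions
            signal[pos // bin_size] += 1
    return signal

# Five reads on a hypothetical 300 bp chromosome, 100 bp bins.
signal = read_depth_signal([0, 5, 99, 100, 250], chrom_length=300)
```

A deletion would show up as a run of bins with counts near the zero level, a duplication as a run of elevated counts; in practice the raw counts are usually corrected (e.g. for GC content) before segmentation.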

  11. Example of Application to RD data [Abyzov et al. Gen. Res. (’11)] NA12878, Solexa 36 bp paired reads, ~30x coverage

  12. To get the highest resolution on breakpoints, need to smooth & segment the signal • BreakPtr: prediction of breakpoints, dosage and cross-hybridization using a system based on Hidden Markov Models [Figure: fluorescence log2 ratio (-0.5 to 0.5) plotted above the DNA sequence (not to scale), feeding an HMM] Korbel*, Urban* et al., PNAS (2007)

  13. Statistically integrates array signal and DNA sequence signatures (using a discrete-valued bivariate HMM) [Figure labels: states Duplication, Normal and Deletion; transitions A, B, A’ and B’; array values and sequence values] Korbel*, Urban* et al., PNAS (2007)

  14. CNVnator: mean-shift-based (MSB) segmentation, no explicit model • For each bin, the attraction (mean-shift) vector points in the direction of the bins with the most similar RD signal • No prior assumptions about number, sizes, haplotype, frequency and density of CNV regions • Not model-based (unlike, e.g., an HMM), so no global optimization, distributional assumptions or parameters (e.g. number of segments) • Achieves discontinuity-preserving smoothing • Derived from image-processing applications [Abyzov et al. Gen. Res. (’11)]

  15. Intuitive Description of MSB • Observed depth of coverage counts as samples from PDF • Kernel-based approach to estimate local gradient of PDF • Iteratively follow grad to determine local modes • Objective: find the densest region [Figure: distribution of identical billiard balls, with the region of interest, its center of mass and the mean shift vector] [Adapted from S Ullman et al. "Advanced Topics in Computer Vision," www.wisdom.weizmann.ac.il/~vision/courses/2004_2]
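
The "follow the gradient to a local mode" idea can be sketched as generic one-dimensional mean shift with a Gaussian kernel. This is not CNVnator's implementation (which shifts along genomic bins with its own bandwidth scheme); the function name, the `bandwidth` value and the toy samples are assumptions made for illustration.

```python
from math import exp

def mean_shift_mode(samples, start, bandwidth=1.0, tol=1e-6, max_iter=500):
    """Move `start` uphill on a kernel density estimate until it stops.

    Each step replaces x by the kernel-weighted mean (center of mass) of the
    samples around it; the difference new_x - x is the mean-shift vector.
    """
    x = float(start)
    for _ in range(max_iter):
        weights = [exp(-0.5 * ((s - x) / bandwidth) ** 2) for s in samples]
        total = sum(weights)
        if total == 0.0:  # no sample within numerical reach of the kernel
            break
        new_x = sum(w * s for w, s in zip(weights, samples)) / total
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

# Toy depth values clustered around two levels (e.g. two copy-number states).
depths = [1.9, 2.0, 2.1, 5.9, 6.0, 6.1]
low_mode = mean_shift_mode(depths, start=1.5, bandwidth=0.5)
high_mode = mean_shift_mode(depths, start=6.5, bandwidth=0.5)
```

Starting points on either side converge to the nearest dense cluster without the number of clusters being specified in advance, which is the property the slides emphasise.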

  16. Split Read

  17. Read-depth works well on a variety of sequencing platforms but provides imprecise breakpoints [Abyzov et al. Gen. Res. (’11)] [NA18505]

  18. Split-read Analysis [Figure: reads from the target genome split at the breakpoints when mapped to the reference, shown for a deletion and for an insertion]

  19. Deletions are the Easiest to Identify [Zhang et al. ('11) BMC Genomics]

  20. Creative application of dynamic programming to a new problem • Problem: map insertions and deletions to a reference genome • Solution: Smith-Waterman (SW) alignment from both ends; combine max scoring alignments • AGE: Alignment with Gap Excision [Abyzov et al. Bioinfo. (’11)] • much more detail in SV section later
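
The "align from both ends and combine the maximum scores" idea can be illustrated with a deliberately reduced sketch. It is not the AGE algorithm: it assumes ungapped, end-anchored alignment (the read starts at the first reference base and ends at the last), a single deletion, and +1/-1 match/mismatch scores. The function name `excise_gap` is hypothetical.

```python
def excise_gap(ref, read):
    """Locate one deletion in `read` relative to `ref`.

    left[k]  = score of read[:k] aligned to the start of ref
    right[k] = score of read[k:] aligned to the end of ref
    The split k maximising left[k] + right[k] marks the breakpoint, and the
    excised reference interval is ref[k : k + len(ref) - len(read)].
    """
    m, n = len(read), len(ref)
    if m > n:
        raise ValueError("read must not be longer than the reference window")
    left = [0] * (m + 1)
    for k in range(1, m + 1):
        left[k] = left[k - 1] + (1 if read[k - 1] == ref[k - 1] else -1)
    right = [0] * (m + 1)
    for k in range(m - 1, -1, -1):
        right[k] = right[k + 1] + (1 if read[k] == ref[n - m + k] else -1)
    best_k = max(range(m + 1), key=lambda k: left[k] + right[k])
    return best_k, best_k + (n - m), left[best_k] + right[best_k]

# Toy example: "CCC" (reference positions 3..5) is deleted in the read.
start, end, score = excise_gap("AAACCCGGGTTT", "AAAGGGTTT")
```

When the sequence flanking the deletion is repetitive, several splits tie for the best score, and `max` simply reports the first one; that ambiguity is one reason exact breakpoints can be hard to define.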

  21. Difficulties in Defining Exact Breakpoints [Abyzov & Gerstein (’11) Bioinfo.]

  22. Paired-End

  23. Paired-End Mapping: paired-end sequencing of the target and mapping to the reference • Deletion: span >> expected • Insertion: span << expected • Inversion: altered end orientation • Limitations: • Both paired-ends may map within repeats. • The distance between pairs is limited; therefore, neither very large nor very small rearrangements can be detected
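
The paired-end signatures (mapped span much larger than expected for a deletion, much smaller for an insertion, altered end orientation for an inversion) can be sketched as a simple per-pair classifier. The function name, the expected span of 3000 bp and the tolerance are hypothetical values for illustration; real callers estimate the span distribution from the library and require clusters of supporting pairs rather than a single pair.

```python
def classify_pair(span, orientation_ok, expected_span=3000, tolerance=600):
    """Classify one mapped read pair by its span on the reference.

    span: distance between the two mapped ends on the reference.
    orientation_ok: True if the ends map in the expected relative orientation.
    """
    if not orientation_ok:
        return "inversion"
    if span > expected_span + tolerance:
        return "deletion"    # target is missing sequence present in reference
    if span < expected_span - tolerance:
        return "insertion"   # target carries extra sequence between the ends
    return "concordant"

calls = [
    classify_pair(3100, True),
    classify_pair(9000, True),
    classify_pair(1200, True),
    classify_pair(3000, False),
]
```

The tolerance also makes the stated limitation concrete: an event smaller than the tolerance is indistinguishable from normal span variation, and an insertion longer than the expected span cannot be bridged by a pair at all.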

  24. High-Resolution Paired-End Mapping (HR-PEM) • Genomic DNA • Shear to 3 kb • Adaptor ligation ("Bio" labels in the figure) • Circularize • Random cleavage • Select fragments of 200-300 bp • 454 massively parallel sequencing (250 bp/read, 400k reads/run) • Map paired ends to human reference genome. Korbel et al., 2007 Science

  25. Overall Strategy for Analysis of NextGen Seq. Data to Detect Structural Variants [Korbel et al., Science ('07); Korbel et al., GenomeBiol. (‘09)]

  26. Pseudogenes & Genomic Duplications

  27. Pseudogenes are among the most interesting intergenic elements • Formal Properties of Pseudogenes (G) • Inheritable • Homologous to a functioning element – ergo a repeat! • Non-functional • No selection pressure so free to accumulate mutations • Frameshifts & stops • Small Indels • Inserted repeats (LINE/Alu) • What does this mean? no transcription, no translation?… [Mighell et al. FEBS Letts, 2000]

  28. Identifiable Features of a Pseudogene (ψRPL21) [Gerstein & Zheng. Sci Am 295: 48 (2006).]

  29. Two Major Genomic Remodeling Processes Give Rise to Distinct Types of Pseudogenes [Gerstein & Zheng. Sci Am 295: 48 (2006).]

  30. Impact of Genetic Variability: Loss-of-function (LoF) [Figure labels: gene; polymorphic pseudogene] • LoF variant types: truncating nonsense SNPs; splice-disrupting SNPs; frameshift-causing indels; disrupting structural variants • Previously, LoF variants were considered to have a high probability of being deleterious • Surprisingly, there are ~100 LoF variants per genome, and 20 genes are completely inactivated • Among the ~100 LoFs, we estimate 2 recessive and close to 0 dominant disease nonsense variants per healthy genome.

  31. Genomic Variation: The Genome Remodeling Process [Figure labels: ancestral state; gene; Alu]

  32. Genomic Variation: The Genome Remodeling Process [Figure labels added: segmental duplication (SD); non-allelic homologous recombination (NAHR); dup. gene]

  33. Genomic Variation: The Genome Remodeling Process [Figure labels added: syntenic ortholog; SD duplicate; paralog; dup. gene family; dup. ψgene]

  34. Genomic Variation: The Genome Remodeling Process [Figure labels added: retro-transpose; pssd. ψgene]

  35. Genomic Variation: The Genome Remodeling Process [Figure labels added: insertion; deletion; inversion; VNTR; L1; retro-elements]

  36. Genomic Variation: The Genome Remodeling Process [Figure labels added: CNV (type of SV); "polymorphic" genes & pseudogenes]

  37. RDV & Mobile Elements

  38. Retroduplication variation (RDV): retroduplications (pot. retro-pseudogenes or retro-genes) [Figure: a gene and its mRNA; a known retroduplication and a novel retroduplication, each compared across the reference and persons 1, 2, 3, ...] [Abyzov et al. Gen. Res. ('13)]

  39. Novel retroduplication: pipeline to identify novel retro-dups. from 3 evidence sources • Evidence from alignment (read pairs aligned to the reference; unaligned reads aligned to a splice-junction library) • Evidence from cluster • Evidence from read depth (against a zero level) [Abyzov et al. Gen. Res. ('13)]

  40. Frequency of novel retroduplications by populations. Abyzov A et al. Genome Res. 2013;23:2042-2052

  41. Can Alkan, Bradley P. Coe & Evan E. Eichler Nature Reviews Genetics 12, 363-376 (May 2011)

  42. 1000G summary

  43. 1000G SV (Pilot, Phase I & III) • Many different callers compared & used • including SRiC & CNVnator but also VariationHunter, Cortex, NovelSeq, PEMer, BreakDancer, Mosaik, Pindel, GenomeSTRiP, mrFast…. • Merging • Genotyping (GenomeSTRiP) • Breakpoint assembly (AGE & Tigra_SV) • Mechanism Classification [1000 Genomes Consortium, Nature (2010, 2012); Mills et al., Nature (2011)]

  44. Summary Stats of 1000GP SV Phase 3 • 68,818 SVs • 2,504 unrelated individuals • 26 populations • 37,250 SVs with resolved breakpoints [2] 1000GP Phase3 SV paper. Submitted to Nature, 2015. [3] 1000GP Consortium. Submitted to Nature, 2015.

  45. Phase 3: Median Autosomal Variant Sites Per Genome [3] 1000GP Consortium. Submitted to Nature, 2015.

  46. A Typical Genome • A typical genome differs from the reference genome at 4.09 – 5.02 million sites. • The typical genome contains 2,100 – 2,500 SVs, covering ~20 million bases. • A typical genome contains 149 – 182 sites with protein truncating variants, 10 – 12 thousand sites with peptide sequence altering variants, and 459 – 565 thousand variant sites overlapping regulatory regions. [3] 1000GP Consortium. Submitted to Nature, 2015.

  47. Human Genetic Variation (class, origin and prevalence of variants) • Population of 2,504 people: common; rare (~75%) • A typical genome: common; rare* (1-4%) • A cancer genome: passenger; driver (~0.1%) * Variants with allele frequency < 0.5% are considered rare variants in the 1000 Genomes Project. The 1000 Genomes Project Consortium, Nature. 2015. 526:68-74; Khurana E. et al. Nat. Rev. Genet. 2016. 17:93-108

  48. Association of Variants with Diseases [Figure labels: healthy; diseased; common variants; GWAS; rare or somatic variants; pooled variants; burden test; high functional impact; positive]

  49. Structural Variations (SVs) • SVs make up the majority of varying nucleotides among humans. • More base pairs are altered as a result of SVs than of single-nucleotide variations. – On the haploid reference assembly, a median of 8.9 Mbp are affected by SVs, while 3.6 Mbp are affected by SNPs. [1] Weischenfeldt J, et al. Nat Rev Genet, 2013. [2] 1000GP Phase3 SV paper. Submitted to Nature, 2015.

  50. Distribution of Different SVs in Normal Human Populations: total ~70K SVs from over 2,500 normal individuals (the 1000 Genomes Project)
