1 / 53

SNP Resources: Finding SNPs Discovery and Databases

SNP Resources: Finding SNPs Discovery and Databases Mark J. Rieder, PhD SeattleSNPs Workshop March 20-21, 2006 SNP Resources: SNP discovery and cataloging SNP discovery/genotyping: Genome-wide approaches The current state of SNP resources Comprehensive SNP discovery

jacob
Download Presentation

SNP Resources: Finding SNPs Discovery and Databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. SNP Resources: Finding SNPs Discovery and Databases Mark J. Rieder, PhD SeattleSNPs Workshop March 20-21, 2006

  2. SNP Resources: SNP discovery and cataloging • SNP discovery/genotyping: Genome-wide approaches • The current state of SNP resources • Comprehensive SNP discovery Seattle SNPs - Program for Genomic Application SNP Databases - “How to” Manual for finding SNPs In class - Tutorial

  3. Genetic Markers: Overview • RFLPs (SNPs circa 1980) - sequence variants that lead to a change in restriction site detected by Southern Blot Analysis 2. Microsatellites (di-, tri-, tetranucleotide repeats) • 1/50,000 bp • Linkage Studies - 300-600 markers (~1 Mbp) • Multi-allelic/High heterozygosity/informative • Complex genotyping assays 3. Single Nucleotide Polymorphisms (SNPs) • Most frequent genetic variant (base substitutions) • 1/1000 bp (comparing two randomly selected chromosomes) • Biallelic/less informative • Simplified genotyping platforms (+/- calling)

  4. Development of a genome-wide SNP map: How many SNPs? Nickerson and Kruglyak, Nature Genetics, 2001 ~ 10 million common SNPs (> 1- 5% MAF) - 1/300 bp How has SNP discovery progressed toward this goal?

  5. Finding SNPs: Marker Discovery and Methods SNP discovery has proceeded in two distinct phases: 1 - SNP Identification Define the alleles Map this to a unique place in the genome 2 - SNP Characterization Determination of the genotype in many individuals Population frequency of SNPs

  6. Finding SNPs: Marker Discovery and Methods SNP Discovery has proceeded in two distinct phases: 1 - SNP Discovery** Human Genome Project - overlaps of BAC clones and ESTs The SNP Consortium - Reduced Representation Sequencing The HapMap 2 - SNP Discovery/Characterization** The HapMap

  7. Genomic mRNA BAC Library RRS Library cDNA Library BAC Overlap Shotgun Overlap EST Overlap Finding SNPs: Sequence-based SNP Mining How do you find SNPs if you don’t have a reference sequence RT errors? DNA SEQUENCING Sequencing Quality Sequence Overlap - SNP Discovery GTTACGCCAATACAGGATCCAGGAGATTACC GTTACGCCAATACAGCATCCAGGAGATTACC

  8. Finding SNPs: Sequence-based SNP Mining BAC = Bacterial Artificial Chromosome Primary vector for DNA cloning in the HGP DNA from multiple individuals Clone large fragments into BACs (unknown sequence) Fragment DNA Sequence and Reassemble (known sequence) Assembly with other overlapping BACs GTTACGCCAATACAGGATCCAGGAGATTACC GTTACGCCAATACAGCATCCAGGAGATTACC

  9. Finding SNPs: Marker Discovery and Methods $ 45 Million - 2 years (1999, 2001 - 2003) Goals: Identify 300,000 SNPs and map 150,000 (April 1999) Determine allele frequency of SNPs If you don’t have a reference genome - how do you find SNPs?

  10. Finding SNPs: Sequence-based SNP Mining RRS = Reduced Representation Sequencing Genomic DNA (multiple individuals) RE to generate fragments Clone DNA fragments into plasmid vectors Altshuler, et al. Nature (2000) Sequence and align and cluster GTTACGCCAATACAGGATCCAGGAGATTACC GTTACGCCAATACAGCATCCAGGAGATTACC From overlap identify mismatches = SNPs

  11. TSC and HGP: High Resolution SNP Map Feb. 2001 - Human Genome Project and TSC

  12. Development of a genome-wide SNP map: How many SNPs? Nickerson and Kruglyak, Nature Genetics, 2001 ~ 10 million common SNPs (> 1 - 5% MAF) - 1/300 bp Feb 2001 - 1.42 million (1/1900 bp)

  13. SNP Discovery: dbSNP database dbSNP -NCBI SNP database

  14. Unique mapping to a genome location (reference SNP = rs#) SNPs submitted By research communtiy (submitted SNPs = ss#) (by 2hit-2allele) SNP data submitted to dbSNP: Clustering dbSNP processing of SNPs

  15. Finding SNPs: Marker Discovery and Methods SNP Discovery has proceeded in two distinct phases: 1 - SNP Discovery** Human Genome Project - overlaps of BAC clones and ESTs The SNP Consortium - Reduced Representation Sequencing The HapMap 2 - SNP Discovery/Characterization** The HapMap

  16. HapMap Project Proposed: Map more SNPs and genotype • Increase SNP density over the first 6 - 12 months • Ultimately produce a fine scale genetic map (HapMap) which • would serve as a common resource for all biomedical reseseachers • Genotype 600,000 SNPs genome-wide • Four populations: CEPH (Europe), Yoruban (Africa), Japanese/ • Chinese (Asian)

  17. Nov 2003 - 5.7 million (2 million validated) - 1/1500 bp Feb 2004 - 7.2 million (3.3 million validated) - 1/900 bp Genomic DNA (multiple individuals) Random Shotgun Sequencing Sequence and align (reference sequence) Draft Human Genome GTTACGCCAATACAGGATCCAGGAGATTACC HapMap SNP Discovery: Prior to Genotyping Initiation of project planning (July 2001): 2.8 million SNPs (1.4 million validated) - 1/1900 bp Generate more SNPs: Other Sources of SNPs: Perlegen (Affymetrix chips) SNP data (chr22) Sequence chromatograms from Celera project TACGCCTATA TCAAGGAGAT

  18. HapMap SNP Discovery HapMap Discovery Increased SNP Density and Validated SNPs 10 million rs SNPs 5 million validated rs SNPs

  19. Development of a genome-wide SNP map: How many SNPs? Nickerson and Kruglyak, Nature Genetics, 2001 ~ 10 million common SNPs (> 1- 5% MAF) - 1/300 bp Feb 2001 - 1.42 million (1/1900 bp) Nov 2003 - 2.0 million (1/1500 bp) Feb 2004- 3.3 million (1/900 bp) Mar 2005 - 5.0 million (validated - 1/600 bp) When will we have them all?

  20. mRNA cDNA Library BAC Library EST Overlap BAC Overlap Finding SNPs: Sequence-based SNP Mining Genomic RRS Library Random Shotgun DNA SEQUENCING Shotgun Overlap Align to Reference RANDOM Sequence Overlap - SNP Discovery GTTACGCCAATACAGGATCCAGGAGATTACC GTTACGCCAATACAGCATCCAGGAGATTACC

  21. 1.0 8 Fraction of SNPs Discovered 0.5 2 0.0 0.0 0.1 0.2 0.3 0.4 0.5 Minor Allele Frequency (MAF) SNP discovery is dependent on your sample population size { GTTACGCCAATACAGGATCCAGGAGATTACC GTTACGCCAATACAGCATCCAGGAGATTACC 2 chromosomes 8

  22. SNP Characterization/Genotyping Nickerson and Kruglyak, Nature Genetics, 2001 ~ 10 million common SNPs (>1- 5% MAF) - 1/300 bp Mar 2005 - 5.0 million (validated/mapped - 1/600 bp) 5.0/10.0 = 50% of all common SNPs (validated)!

  23. HapMap Project Proposed: Map more SNPs and genotype • Genotype 600,000 SNPs genome-wide • Four populations: • CEPH (CEU) (Europe - n = 90, trios) • Yoruban (YRI) (Africa - n = 90, trios) • Japanese (JPT) (Asian - n = 45) • Chinese (HCB) (Asian - n =45)

  24. Finding SNPs: Genotype Data Adds Value to SNPs HapMap Genotyping • Confirms SNP as “real” and “informative” • Minor Allele Frequency (MAF) - common or rare • MAF in different populations • Detection of SNP x SNP correlations (Linkage Disequilibrium) • Determine haplotypes

  25. Few SNPs in dbSNPs had Genotype Data

  26. Perlegen Large-scale Genotyping Capacity 1.58 millions SNPs genotyped 71 individuals from 3 American populations European, African and Asian ancestry

  27. HapMap Completion Nature - Oct 27 (2005) HapMap + Perlegen

  28. dbSNP: Increasing numbers of SNPs now have genotype data HapMap Phase II Perlegen Perlegen Data

  29. Current State of dbSNP Many SNPs left to validate and characterize.

  30. 15,367 dbSNP 16,248 New SNPs 50% of SNPs in dbSNP 5 Mb/31,500 SNPs = 1/160 bp Increasing SNP Density: HapMap ENCODE Project ENCODE = ENCyclopedia Of DNA Elements Catalog all functional elements in 1% of the genome (30 Mb) 10 Regions x 500 kb/region (Pilot Project) David Altschuler (Broad), Richard Gibbs (Baylor) 16 CEU, 16 YRI, 8 HCB, 8 JPT Comprehensive PCR based resequencing across these regions

  31. Development of a genome-wide SNP map: How many SNPs? Nickerson and Kruglyak, Nature Genetics, 2001 ~ 10 million common SNPs (>1- 5% MAF) - 1/300 bp Mar 2005 - 5.0 million (validated - 1/600 bp) ~4.0 million validated SNPs with genotypes! (HapMap confirmed, allele frequency/population, SNPxSNP correlations (LD), haplotypes)

  32. 1.0 96 48 24 16 8 8 Fraction of SNPs Discovered 0.5 2 0.0 0.0 0.1 0.2 0.3 0.4 0.5 Minor Allele Frequency (MAF) SNP discovery is dependent on your sample population size { GTTACGCCAATACAGGATCCAGGAGATTACC GTTACGCCAATACAGCATCCAGGAGATTACC 2 chromosomes

  33. Goal: Comprehensively identify all common sequence variation in candidate genes Initial biological focus: Inflammation, Coagulation, Complement Approach: Direct resequencing of genes Sample:p1 = 24 African-Americans, 23 CEPH parents, 1 chimp p2 = 24 HapMap-YRI, 23 HapMap-CEU, 1 gorilla Status: 271 genes (21 kb ave)

  34. Targeted SNP Discovery Directed analysis: cSNPs 5’ 3’ Val-Val Arg-Cys PCR amplicons Complete analysis: cSNP and Haplotype Structure Analysis 5’ 3’ Arg-Cys Val-Val PCR amplicons • Generate SNP data from complete genomic resequencing • (i.e. 5’ regulatory, exon, intron, 3’ regulatory sequence)

  35. Comprehensive SNP Discovery: Resequencing • Overlapping PCR Amplicons across entire gene • Make no assumptions about sequence function • Sequence diversity and genetic structure for each gene is different • Proper association studies can only be designed in this context • Complete resequencing facilitates population genetics methods

  36. Sequence-based SNP Identification Sequence Amplify DNA Phred Phrap Base-calling Contig assembly Sequence each end 5’ 3’ Quality determination Final quality determination of the fragment. PolyPhred Polymorphism detection Sequencing Capacity: ABI 3730 – 96 capillary ~2000 individual chromatograms/day ~1,00,000 bp/day ~20 kbp sequence scanned in 48 individuals/wk Consed Sequence viewing Polymorphism tagging Analysis Polymorphism reporting Individual genotyping Phylogenetic analysis

  37. Homozygous C/C Heterozygous C/T Homozygous T/T

  38. Ford

  39. Aston-Martin of SNP Detection - PolyPhred 5.0 * Matthew Stephens Peggy Dyer-Robertson Jim Sloan C/C C/T C/T T/T

  40. Comparison PolyPhred v4.29 versus v5.0 PolyPhred v4.29 PolyPhred v5.0

  41. PolyPhred 5 Scores - Provide Quantitative Assessment of SNP Genotype PolyPhred 5.0 Perlegen Double-Coverage - Automation = 93% of all SNPs, 100% of high-frequency SNPs, with no false positive SNPs identified, and 99.9% genotyping accuracy.

  42. Comparison of PolyPhred v5.0 with others Mutation Surveyor PolyPhred v4.29 novoSNP PolyPhred v5.0

  43. SeattleSNPs Candidate Gene Resource

  44. 60% SNP Discovery: dbSNP database NIEHS SNPs dbSNP (Perlegen/HapMap) { { 15% 25% Minor Allele Freq. (MAF) Rarer and population specific SNPs are found by resequencing

  45. Development of a genome-wide SNP map: How many SNPs? Nickerson and Kruglyak, Nature Genetics, 2001 ~ 10 million common SNPs (>1- 5% MAF) - 1/300 bp HapMap ENCODE = 1/160 bp (n = 48, 4 pops) SeattleSNPs = 1/180 bp (n = 48, 2 pops) NIEHS SNPs = 1/180 bp (n = 95, 4 pops) Comprehensive resequencing can identify the vast majority of SNPs in a region

  46. Study Designs Utilizing Resequencing Approaches Detection of SNPs in an unstudied population (or not genotyped) Population specific rare SNPs Detection of SNPs directly in study samples (no discovery panel) Sequencing for Association Detection of SNPs in samples representing the tails of the phenotypic distribution How much do rare, possibly more penetrant SNPs, explain of extreme phenotypes

  47. p1 = 47 individuals - 23 CEPH and 24 African-American (Non-HapMap, mostly) Selection of informative (high frequency, coding, etc) SNPs to be genotyped in defined populations (~4500 SNPs) SeattleSNPs Characterization HapMap Populations European (CEU,n=60) African (YRI, n =60) Asian (HCB, n = 45 and JPT, n = 45) Non-HapMap Populations Hispanic (n = 60) African-American (n = 62)

  48. Illumina SeattleSNPs Genotyping • Each well samples 1536 SNPs in one individual • For each HapMap sample 3 x 1536 ( 4600 genotyped SNPs) • 1,764,472 genotypes generated (total ~384 samples)

More Related