
Haplotype Blocks

Haplotype Blocks. An Overview A. Polanski Department of Statistics Rice University. Key Papers. N. Patil et al., (2001), Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21, Science, vol. 294, pp. 1719-1723

Presentation Transcript


  1. Haplotype Blocks An Overview A. Polanski Department of Statistics Rice University

  2. Key Papers • N. Patil et al., (2001), Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21, Science, vol. 294, pp. 1719-1723 • N. Wang et al., (2002), Distribution of Recombination Crossovers and the Origin of Haplotype Blocks: The Interplay of Population History, Recombination and Mutation, Am. J. Hum. Genet., vol. 71, pp. 1227-1234. • K. Zhang et al., (2002), A Dynamic Programming Algorithm for Haplotype Block Partitioning, PNAS, vol. 99, pp. 7335-7339

  3. Supplementary Papers • R. Hudson, N. Kaplan, (1985), Statistical Properties of the Number of Recombination Events in the History of a Sample of DNA Sequences, Genetics, vol. 111, pp. 147-164 • R. Hudson, (2002), Generating Samples under a Wright-Fisher Neutral Model of Genetic Variation, Bioinformatics, vol. 18, pp. 337-338 • D. Reich et al., (2001), Linkage Disequilibrium in the Human Genome, Nature, vol. 411, pp. 199-204

  4. What are Haplotype Blocks ? • Haplotype block = a sequence of contiguous markers on DNA, homogeneous according to some criterion • Markers = Single Nucleotide Polymorphisms (SNPs)

  5. Data (Patil et al. 2001) • Chromosome 21 • The two copies of chromosome 21 were physically separated using a rodent-human somatic cell hybrid technique • Sample of 20 copies of chromosome 21 (32397439 bases) • Found: 35989 SNPs

  6. Fig. 2 from (Patil et al. 2001)

  7. Haplotype data: 20 chromosomes (rows) by SNP number i = 1, 2, …, 35989 (columns), entries coded 0/1. Excerpt:
     0100000000000000000010000000000000010000111000000000100000001001000000001001000000000000000000001000000001101000010101010
     0000000010000000000010000000000100100001000000000000001011001001001010001001000000000010010001011000000001101010010101010
     0000000001000100010110001010000000010100011000000000010100000000000100000100110000011101001000000110000110001000100011010
     0000000000000100010010001010000000010100011000000000010100000000000100000100110000011101001000000110000110001000100011010
     0000000010000000000010000100000100100000000000000000001001001001001010001001000000000010010001011000000001100100000000000
     0010000000100001000010010000000000010000011000000000010100000000100100110100010000000010000001001000001001110100000000000
     0000000010000000000010000100110100100000000000000000001001001001001010001001000000000010010001011000000001100100000000000
     1000100000000000000001000001000101000000000000000001000000001001000001001001000000100000000100001000000001101010010101010
     0000000000001000000010000000000000010000011000000000000000001001000000001001000000000000000000001000000001101000010101010
     0000000010000000000010000100000100100000000000000000001001001101001010001001000000000010010001011000000001100100000000000
     1000100000000000000001000001000101000000000000000001000000001001000000001001000000000000000000001000010001101010010101010
     0000100000000000100001000000000101000000000000000000000000001001010000001001000000000000000000001000000001101000010101010
     1000100000000000000001000001000101000000000000000001000000001001000001001001000000100000000100001000000001101010010101010
     0000000100100000000010010000000000011000011010000000010100000010100100100100010010000010100001001000001001110100000000000
     1000100000000010000001000001000101000000000000000001000000001001000001001001000000100000000100001000000001101010010101010
     0000000000100000000010010000000000010000011000000000010100000000100100100100010000000010000001001000001001110101000000001
     0000000000100000000010010000000000010000011010000000010100000010100100100100010010000010000001001001001001110100000000000
     0001001000010000001000100000001010000000011001111110000000110000000000000010011101010000001010100100000000001000001011110
     0000100000000000100001000000000101000000000000000000000000001001010000001001000000000000000000001000000001101000010101010
     0001010000000000001000000000000010000010011101000010000000100000000000000010010001010000001000100100100000001000001011010
     ……

  8. Problems

  9. How do we determine boundaries between blocks ? • Average value of the standardized coefficient of linkage disequilibrium is greater than some threshold (Wang et al. 2002, Reich et al. 2001) • Infer sites in the sample of DNA sequences where recombination events happened in the past (Wang et al. 2002, Hudson, 2002) • Chromosome coverage – minimum number of SNPs to account for the majority of haplotypes (Patil et al. 2001, Zhang et al. 2002)
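As an illustration of the first criterion, here is a minimal sketch of one common standardized linkage disequilibrium coefficient, |D'|, for a pair of biallelic SNPs. It assumes alleles are coded 0/1 as in the data matrix; the function name `d_prime` is hypothetical, and the slides do not say which standardized coefficient or threshold the cited papers use.

```python
from typing import Sequence


def d_prime(snp_a: Sequence[int], snp_b: Sequence[int]) -> float:
    """Return |D'| for two biallelic SNPs coded 0/1 over the same haplotypes."""
    n = len(snp_a)
    if n == 0 or n != len(snp_b):
        raise ValueError("SNP vectors must be non-empty and of equal length")
    p_a = sum(snp_a) / n                                  # frequency of allele 1 at SNP A
    p_b = sum(snp_b) / n                                  # frequency of allele 1 at SNP B
    p_ab = sum(a & b for a, b in zip(snp_a, snp_b)) / n   # frequency of haplotype 1-1
    d = p_ab - p_a * p_b                                  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return 0.0                                        # at least one site is monomorphic
    return abs(d) / d_max
```

Averaging this value over the SNP pairs inside a candidate block and comparing it with a chosen threshold would give one possible block criterion.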

  10. What evolutionary forces are responsible for haplotype block formation ? • Mutation • Genetic drift • Recombination • Recombination hot spots

  11. Methods

  12. Method 1 (Wang et al. 2002) Infer sites in the sample of DNA sequences where recombination events happened in the past history

  13. Three gamete condition Consider a pair of SNPs, SNP1 and SNP2. If there was no recombination between SNP1 and SNP2, they must satisfy the three gamete condition: at most three of the four possible gametes appear in the sample, e.g. (SNP1, SNP2) = A C, G C, G T

  14. Four gamete test, 4GT (Hudson and Kaplan, 1985) If we see all four gametes at SNP1 and SNP2, (SNP1, SNP2) = A C, G C, G T, A T, then there must have been a recombination event between these sites in their past history
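The test on this slide can be sketched in a few lines. This is a minimal illustration that assumes each SNP is biallelic and is given as a sequence of alleles, one per sampled chromosome; the name `four_gamete_test` is hypothetical.

```python
from typing import Sequence


def four_gamete_test(snp1: Sequence[str], snp2: Sequence[str]) -> bool:
    """Return True if all four gametes are observed at two biallelic SNPs."""
    if len(snp1) != len(snp2):
        raise ValueError("both SNPs must be typed on the same chromosomes")
    gametes = set(zip(snp1, snp2))   # distinct (allele at SNP1, allele at SNP2) pairs
    return len(gametes) == 4
```

For the gametes listed on the slide, `four_gamete_test("AGGA", "CCTT")` sees A C, G C, G T and A T and therefore reports a recombination, while dropping the last chromosome leaves only three gametes.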

  15. Array of pairwise 4GT test results (Hudson and Kaplan, 1985) D = [d_ij], where d_ij = 0 if there are fewer than 4 gametes at SNPs i and j, and d_ij = 1 if there are 4 gametes. What is the minimal number of recombinations that could explain the observed data ? Statistic FR (Hudson and Kaplan, 1985)
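A minimal sketch of the array D and of a lower bound on the number of recombinations follows. It assumes haplotypes are given as equal-length strings over two alleles per site. The bound counts the largest set of non-overlapping SNP intervals that fail the four gamete test, in the spirit of Hudson and Kaplan (1985); whether this coincides exactly with the statistic FR named on the slide is an assumption. Function names are hypothetical.

```python
from typing import List, Sequence


def four_gamete_matrix(haplotypes: Sequence[str]) -> List[List[int]]:
    """d[i][j] = 1 if SNPs i and j show all four gametes, else 0."""
    n_snps = len(haplotypes[0])
    d = [[0] * n_snps for _ in range(n_snps)]
    for i in range(n_snps):
        for j in range(i + 1, n_snps):
            gametes = {(h[i], h[j]) for h in haplotypes}
            if len(gametes) == 4:
                d[i][j] = d[j][i] = 1
    return d


def min_recombinations(d: List[List[int]]) -> int:
    """Lower bound: maximum number of disjoint intervals (i, j) with d[i][j] = 1."""
    n = len(d)
    # Store each interval as (right, left) so that sorting orders by right endpoint.
    intervals = sorted((j, i) for i in range(n) for j in range(i + 1, n) if d[i][j])
    count, last_right = 0, 0
    for right, left in intervals:
        if left >= last_right:       # interval starts at or after the previous one ends
            count += 1
            last_right = right
    return count
```

Each selected interval must contain at least one recombination event, and the selected intervals do not overlap, so their count bounds the number of events from below.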

  16. Fig. 1 from Wang et al., 2002 D Block 1 Block 2 Block 3

  17. Wang et al., 2002 - Study • R. Hudson’s program for simulating genealogies with mutation, drift and recombination under various demographic scenarios • Study of dependence of average lengths of blocks on different factors • Comparison of simulation results to data from Patil et al., 2001

  18. Dependence of average lengths of blocks on recombination frequency

  19. … on sample size

  20. ... on mutation intensity

  21. Comparison to data from Patil et al. 2001 • Compute distribution of haplotype block lengths in the data from Patil et al. 2001 • Try to tune parameters θ and R to obtain similar distribution in the simulations

  22. … Failed

  23. Try a mixture of two different recombination frequencies - better

  24. Method 2 (Patil et al. 2001) Chromosome coverage – minimum number of SNPs to account for the majority of haplotypes

  25. Fig. 2 from (Patil et al. 2001)

  26. Problem formulation Define block boundaries to minimize the number of SNPs that distinguish at least α percent of the haplotypes in each block

  27. Common haplotypes Haplotypes represented more than once in the block

  28. Condition Common haplotypes must constitute at least α = 80 percent of all haplotypes in the block. Blocks that do not satisfy this are not allowed
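The condition can be checked directly from the data matrix. This is a minimal sketch that assumes haplotypes are given as equal-length strings and that a block is a 0-based, inclusive range of SNP columns; the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def common_haplotype_coverage(haplotypes: Sequence[str], start: int, end: int) -> float:
    """Fraction of chromosomes carrying a haplotype seen more than once in SNPs start..end."""
    counts = Counter(h[start:end + 1] for h in haplotypes)
    common = sum(c for c in counts.values() if c > 1)   # chromosomes with a common haplotype
    return common / len(haplotypes)


def block_allowed(haplotypes: Sequence[str], start: int, end: int, alpha: float = 0.8) -> bool:
    """True if common haplotypes make up at least a fraction alpha of the block."""
    return common_haplotype_coverage(haplotypes, start, end) >= alpha
```

Extending a block can only split haplotype classes, so the coverage never increases as SNPs are added.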

  29. Fragment of Fig. 2 from Patil et al., 2001

  30. Notation • B – block defined as a set of consecutive SNP numbers, e.g., B = {45, 46, …, 50}, or B = {i, i+1, …, j} • L(B) – length of the block (number of SNPs) • f(B) – minimum number of SNPs required to distinguish common haplotypes
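A minimal brute-force sketch of f(B) is given below. It assumes haplotypes are equal-length strings and that a block is a 0-based, inclusive column range. It tries SNP subsets in order of increasing size, which is exponential in L(B) and meant only for small blocks; it is not the procedure used in the cited papers, and the name `min_distinguishing_snps` is hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Sequence


def min_distinguishing_snps(haplotypes: Sequence[str], start: int, end: int) -> int:
    """f(B): smallest number of SNPs in start..end that tells all common haplotypes apart."""
    counts = Counter(h[start:end + 1] for h in haplotypes)
    common = [h for h, c in counts.items() if c > 1]    # haplotypes seen more than once
    length = end - start + 1                            # L(B)
    for k in range(length + 1):
        for cols in combinations(range(length), k):
            signatures = {tuple(h[c] for c in cols) for h in common}
            if len(signatures) == len(common):          # every common haplotype is unique
                return k
    return length  # not reached: all L(B) SNPs always separate distinct haplotypes
```

With zero or one common haplotype the function returns 0, since nothing needs to be distinguished.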

  31. Greedy Solution (applied to the haplotype matrix of slide 7, with a block running from SNP Start to SNP End) 0. Fix Start = End 1. Increment End 2. Compute the ratio L(B)/f(B) 3. Stop at the maximum 4. Go to 0
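The steps above can be sketched as a left-to-right scan. This is a minimal illustration under several assumptions that the slide does not state: haplotypes are equal-length strings, "stop at max" means taking the allowed End with the largest ratio (ties go to the longer block), f(B) = 0 is treated as 1 in the ratio, and a single SNP that fails the coverage condition still forms its own block. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence, Tuple


def block_stats(haplotypes: Sequence[str], start: int, end: int) -> Tuple[float, int]:
    """Return (coverage by common haplotypes, f(B)) for SNPs start..end, 0-based inclusive."""
    counts = Counter(h[start:end + 1] for h in haplotypes)
    common = [h for h, c in counts.items() if c > 1]
    coverage = sum(counts[h] for h in common) / len(haplotypes)
    length = end - start + 1
    for k in range(length + 1):                 # brute force over SNP subsets, small blocks only
        for cols in combinations(range(length), k):
            if len({tuple(h[c] for c in cols) for h in common}) == len(common):
                return coverage, k
    return coverage, length


def greedy_partition(haplotypes: Sequence[str], alpha: float = 0.8) -> List[Tuple[int, int]]:
    """Partition SNPs into blocks, each chosen to maximize L(B)/f(B) from its start."""
    n_snps = len(haplotypes[0])
    blocks, start = [], 0
    while start < n_snps:                       # step 0: fix Start = End
        best_end, best_ratio = start, 0.0
        for end in range(start, n_snps):        # step 1: increment End
            coverage, f = block_stats(haplotypes, start, end)
            if coverage < alpha:
                break                           # coverage cannot recover as the block grows
            ratio = (end - start + 1) / max(f, 1)   # step 2: L(B)/f(B)
            if ratio >= best_ratio:
                best_end, best_ratio = end, ratio   # step 3: remember the maximum
        blocks.append((start, best_end))
        start = best_end + 1                    # step 4: go to 0
    return blocks
```

For example, `greedy_partition(["0000", "0011", "1100", "1111"])` splits the four SNPs into the blocks (0, 1) and (2, 3), because any block spanning SNPs 1 and 2 has no common haplotypes.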

  32. Results • 4563 representative SNPs (13%) • 4135 blocks

  33. Method 3 (Zhang et al. 2002) Solves the same problem of 80% chromosome coverage, but using the better method of dynamic programming

  34. Dynamic programming solution (applied to the haplotype matrix of slide 7) Optimal partition of SNPs 1, 2, …, i into blocks B1(i), B2(i), B3(i), … Assume that for all i = 1, 2, …, j-1 we know the optimal block partition, B1(i), B2(i), …, Bk(i), that minimizes the total number of representative SNPs, f(B1(i)) + f(B2(i)) + … + f(Bk(i))

  35. Bellman’s equation
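The equation itself is not preserved in this transcript. A plausible reconstruction from the notation of slides 30 and 34, stated here as an assumption, is S(j) = min over allowed blocks B = {i, …, j} of S(i-1) + f(B), with S(0) = 0, where S(j) is the minimum total number of representative SNPs for SNPs 1, …, j. The sketch below implements that recursion with 0-based column indices; the names are hypothetical, f(B) is found by brute force, and tie-breaking rules (for example preferring fewer blocks) are omitted.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence, Tuple


def block_stats(haplotypes: Sequence[str], start: int, end: int) -> Tuple[float, int]:
    """Return (coverage by common haplotypes, f(B)) for SNPs start..end, 0-based inclusive."""
    counts = Counter(h[start:end + 1] for h in haplotypes)
    common = [h for h, c in counts.items() if c > 1]
    coverage = sum(counts[h] for h in common) / len(haplotypes)
    length = end - start + 1
    for k in range(length + 1):                 # brute force over SNP subsets, small blocks only
        for cols in combinations(range(length), k):
            if len({tuple(h[c] for c in cols) for h in common}) == len(common):
                return coverage, k
    return coverage, length


def dp_partition(haplotypes: Sequence[str], alpha: float = 0.8) -> Tuple[int, List[Tuple[int, int]]]:
    """Return (minimum total f, blocks) over all partitions into allowed blocks."""
    n = len(haplotypes[0])
    inf = float("inf")
    best = [0] + [inf] * n          # best[j]: optimum for the first j SNPs
    prev = [0] * (n + 1)            # prev[j]: where the last block of that optimum starts
    for j in range(1, n + 1):
        for i in range(j, 0, -1):   # last block covers columns i-1 .. j-1
            coverage, f = block_stats(haplotypes, i - 1, j - 1)
            if coverage < alpha:
                break               # longer blocks ending at j cannot regain coverage
            if best[i - 1] + f < best[j]:
                best[j], prev[j] = best[i - 1] + f, i - 1
    if best[n] == inf:
        raise ValueError("no partition into allowed blocks exists for this alpha")
    blocks, j = [], n
    while j > 0:                    # backtrack through the stored block starts
        blocks.append((prev[j], j - 1))
        j = prev[j]
    return best[n], blocks[::-1]
```

On `["0000", "0011", "1100", "1111"]` this returns a total of 2 representative SNPs with blocks (0, 1) and (2, 3).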

  36. Results • 3582 representative SNPs (compared to 4563 from greedy algorithm) • 2575 blocks (compared to 4135 blocks from greedy algorithm)

  37. Conclusions • Studying haplotype block partitions is important for 1. constructing haplotype maps for genetic traits 2. understanding recombination in the human genome

  38. What to expect • Many more papers in this area appearing in scientific journals
