
Introduction to SNP and Haplotype Analysis


Yao-Ting Huang

Kun-Mao Chao

Algorithms and Computational Biology Lab, Department of Computer Science & Information Engineering, National Taiwan University, Taiwan.

- The genetic variations in DNA sequences (e.g., insertions, deletions, and mutations) have a major impact on genetic diseases and phenotypic differences.
- Any two humans share more than 99% of their DNA sequence.
- A genetic variation in a coding region may change a codon and thereby alter the amino acid sequence.

- A Single Nucleotide Polymorphism (SNP), pronounced “snip,” is a genetic variation in which a single nucleotide (i.e., A, T, C, or G) is altered and the change is passed on through heredity.
- SNP: a single DNA base variation found in more than 1% of the population.
- Mutation: a single DNA base variation found in less than 1% of the population.

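[Figure: the sequence C T T A G C T T and its variant C T T A G T T T. When 99.9% of the population carries the original base and 0.1% carries the variant, the variation is a mutation; when 94% carries the original base and 6% carries the variant, it is a SNP.]

[Figure: timeline from a common ancestor to the present, showing mutations and SNPs among the observed genetic variations.]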

- SNPs are the most frequent form of genetic variation.
- About 90% of human genetic variation comes from SNPs.
- SNPs occur about once every 300 to 600 base pairs.
- Millions of SNPs have been identified (e.g., by HapMap and Perlegen).

- SNPs have become the preferred markers for association studies because of their high abundance and the availability of high-throughput SNP genotyping technologies.

- A SNP is usually assumed to be a binary variable.
- The probability of a repeat mutation at the same SNP locus is quite small.
- Tri-allelic cases are usually considered to be the effect of genotyping errors.
- The nucleotide at a SNP locus is called a major allele (if its allele frequency > 50%) or a minor allele (if its allele frequency < 50%).

[Figure: at the last position of A C T T A G C T T, 94% of the sequences carry T (the major allele) and 6% carry C (the minor allele), giving A C T T A G C T C.]
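As a minimal sketch of these definitions, the snippet below counts the alleles observed at one locus and labels the more frequent one as the major allele. The function name `classify_alleles` and the sample of 100 chromosomes are assumptions made for illustration; the 1% cut-off is the SNP/mutation threshold stated earlier.

```python
from collections import Counter

def classify_alleles(observed, threshold=0.01):
    """Return (major, minor, minor_frequency, kind) for a biallelic locus."""
    counts = Counter(observed)
    if len(counts) != 2:
        raise ValueError("expected exactly two distinct alleles")
    (major, _), (minor, minor_count) = counts.most_common(2)
    minor_frequency = minor_count / len(observed)
    # A variant found in more than 1% of the sample counts as a SNP.
    kind = "SNP" if minor_frequency > threshold else "mutation"
    return major, minor, minor_frequency, kind

# Hypothetical sample: 94 chromosomes carry T and 6 carry C.
sample = ["T"] * 94 + ["C"] * 6
result = classify_alleles(sample)
print(result)
```

With a 99.9% / 0.1% split instead, the same function would label the variant a mutation.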

[Figure: three chromosome copies, -A C T T A G C T T-, -A C T T T G C T C-, and -A A T T T G C T C-, differ at three loci (SNP1, SNP2, SNP3). Reading only the SNP loci gives the haplotypes CAT, CTC, and ATC, respectively.]

- A haplotype stands for a set of linked SNPs on the same chromosome.
- A haplotype can be simply considered as a binary string since each SNP is binary.
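The binary view can be sketched as follows, using the three haplotypes CAT, CTC, and ATC over SNP1 to SNP3. The encoding convention (0 for the major allele, 1 for the minor allele) and the function name `encode_haplotypes` are assumptions, and the major allele is estimated from this tiny sample only.

```python
from collections import Counter

def encode_haplotypes(haplotypes):
    """Encode each haplotype as a binary string: 0 = major allele, 1 = minor allele."""
    n_loci = len(haplotypes[0])
    majors = []
    for locus in range(n_loci):
        counts = Counter(h[locus] for h in haplotypes)
        majors.append(counts.most_common(1)[0][0])  # most frequent allele at this locus
    return [
        "".join("0" if h[i] == majors[i] else "1" for i in range(n_loci))
        for h in haplotypes
    ]

haps = ["CAT", "CTC", "ATC"]
encoded = encode_haplotypes(haps)
print(encoded)  # ['011', '000', '100']
```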

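[Figure: haplotype data versus genotype data at two loci (SNP1 and SNP2). Haplotype data lists the alleles carried on each chromosome copy separately; genotype data gives only the unordered pair of alleles at each locus.]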

- The use of haplotype information has been limited because the human genome is diploid, so the two haplotypes of an individual are not observed separately.
- In large sequencing projects, genotypes instead of haplotypes are collected due to cost considerations.

[Figure: genotype data with alleles {A, C} at SNP1 and {G, T} at SNP2. It can be explained by the haplotype pair (AG, CT) or by the haplotype pair (AT, CG).]

- Genotypes only tell us the alleles at each SNP locus.
- But we don’t know how the alleles at different SNP loci are connected.
- There could be several possible haplotype pairs for the same genotype.
- We don’t know which haplotype pair is real.
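The ambiguity can be made concrete with a short enumeration. This is a sketch under an assumed representation: a genotype is a list with one `(allele, allele)` tuple per SNP locus, and `haplotype_pairs` is a hypothetical helper name.

```python
from itertools import product

def haplotype_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype."""
    pairs = set()
    # For every locus, choose which of its two alleles sits on the first chromosome.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

# Alleles {A, C} at SNP1 and {G, T} at SNP2.
pairs = haplotype_pairs([("A", "C"), ("G", "T")])
print(pairs)  # [('AG', 'CT'), ('AT', 'CG')]
```

A genotype with k heterozygous loci has 2^(k-1) candidate pairs, so the number of explanations grows quickly with the number of SNPs.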

Outline:

- SNP Database
- Haplotype Inference: Maximum Parsimony, Perfect Phylogeny, Statistical Methods
- Tag SNP Selection: Haplotype block, LD bin, Prediction Accuracy
- …

- The problem of inferring the haplotypes from a set of genotypes is called haplotype inference.
- This problem is known to be not only NP-hard but also APX-hard.

- Most combinatorial methods use the maximum parsimony model to solve this problem.
- This model assumes that the number of distinct haplotypes in a natural population is small.
- The solution of this problem is a minimum set of haplotypes that can explain the given genotypes.
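To illustrate the maximum parsimony objective on a toy instance, the sketch below tries haplotype subsets of increasing size until every genotype is explained by some pair inside the subset. It is an exhaustive search written only for illustration (exponential time, not one of the published algorithms); the genotype representation and function names are assumptions.

```python
from itertools import combinations, product

def resolving_pairs(genotype):
    """All unordered haplotype pairs that explain one unphased genotype."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(frozenset((h1, h2)))
    return pairs

def min_haplotype_set(genotypes):
    """Smallest haplotype set that resolves every genotype (brute force)."""
    options = [resolving_pairs(g) for g in genotypes]
    candidates = sorted(set().union(*(pair for opts in options for pair in opts)))
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            # Every genotype needs at least one resolving pair inside the subset.
            if all(any(pair <= chosen for pair in opts) for opts in options):
                return chosen
    return set()

# Hypothetical genotypes: G1 is heterozygous at both loci, G2 is homozygous.
g1 = [("A", "C"), ("G", "T")]
g2 = [("A", "A"), ("T", "T")]
solution = min_haplotype_set([g1, g2])
print(sorted(solution))  # ['AT', 'CG']
```

G1 alone could be explained by (AG, CT) or (AT, CG); because G2 requires AT, the pair (AT, CG) explains both genotypes with only two haplotypes.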

[Figure: two genotypes, G1 and G2, over SNP1 and SNP2. G1 can be resolved by the haplotype pair (h1, h2) or by the pair (h3, h4); G2 is resolved by the pair (h1, h1).]

- Find a minimum set of haplotypes to explain the given genotypes.

- Statistical methods:
- Niu et al. (2002) developed a PL-EM algorithm called HAPLOTYPER.
- Stephens and Donnelly (2003) designed an MCMC algorithm based on Gibbs sampling called PHASE.

- Combinatorial methods:
- Gusfield (2003) proposed an integer linear programming algorithm.
- Wang and Xu (2003) developed a branch-and-bound algorithm called HAPAR to find the optimal solution.
- Brown and Harrower (2004) proposed a new integer linear programming formulation of this problem.

- We formulated this problem as an integer quadratic programming (IQP) problem.
- We proposed an iterative semidefinite programming (SDP) relaxation algorithm to solve the IQP problem.
- This algorithm achieves an O(log n) approximation.

- We implemented this algorithm in MATLAB and compared it with existing methods.
- Huang, Y.-T., Chao, K.-M., and Chen, T., 2005, “An Approximation Algorithm for Haplotype Inference by Maximum Parsimony,” Journal of Computational Biology, 12: 1261-1274.

[Figure: genotype G1 over SNP1 and SNP2 resolved by the haplotype pair (h1, h2), and genotype G2 resolved by the pair (h1, h1); the two haplotypes h1 and h2 explain both genotypes.]

- Input:
- A set of n genotypes and m possible haplotypes.

- Output:
- A minimum set of haplotypes that can explain the given genotypes.

- Define $x_i$ as an integer variable with value 1 or -1.
- $x_i = 1$ if the i-th haplotype is selected.
- $x_i = -1$ if the i-th haplotype is not selected.

- Minimizing the number of selected haplotypes amounts to minimizing an integer quadratic function of the $x_i$.

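[Figure: genotype G1 over SNP1 and SNP2 with its four candidate haplotypes h1, h2, h3, and h4; G1 can be resolved by the pair (h1, h2) or by the pair (h3, h4).]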

- Each genotype must be resolved by at least one pair of haplotypes.
- For genotype G1, an integer quadratic constraint over the $x_i$ must be satisfied: it holds if h1 and h2 are both selected, or if h3 and h4 are both selected.
- The objective function and the constraint functions (one per genotype) together form the IQP.
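The exact formulas are not reproduced in this text, so the quadratic forms below are an assumption chosen to be consistent with the definition of $x_i$: since $(1 + x_i)/2$ is 1 for a selected haplotype and 0 otherwise, the objective can be written as $\sum_{i=1}^{m} \left(\frac{1 + x_i}{2}\right)^2$ and the constraint for a genotype as $\sum_{(h_j, h_k)} \frac{1 + x_j}{2} \cdot \frac{1 + x_k}{2} \ge 1$, summed over the pairs that resolve it. The sketch checks this assumed formulation by brute force on a hypothetical instance; it does not implement the SDP relaxation.

```python
from itertools import product

def objective(x):
    """Number of selected haplotypes, written as a quadratic in x_i in {+1, -1}."""
    return sum(((1 + xi) / 2) ** 2 for xi in x)

def resolved(x, pairs):
    """Quadratic constraint: some resolving pair (j, k) has both haplotypes selected."""
    return sum(((1 + x[j]) / 2) * ((1 + x[k]) / 2) for j, k in pairs) >= 1

# Hypothetical instance: haplotypes h1..h4 are indices 0..3.
# G1 is resolved by (h1, h2) or (h3, h4); G2 is resolved by (h1, h1).
constraints = [[(0, 1), (2, 3)], [(0, 0)]]

feasible = (
    x for x in product((1, -1), repeat=4)
    if all(resolved(x, pairs) for pairs in constraints)
)
best = min(feasible, key=objective)
print(best, objective(best))  # (1, 1, -1, -1) 2.0
```

Enumerating all sign vectors is only feasible for a handful of haplotypes, which is why a relaxation is needed for realistic inputs.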

- Maximum parsimony: find a minimum set of haplotypes to resolve all genotypes.
- We use the SDP-relaxation technique to solve this IQP problem.

[Flowchart: Integer Quadratic Programming (NP-hard) → reformulation → Vector Formulation → relax the integer constraint → Semidefinite Programming (P) → existing SDP solver → SDP Solution → incomplete Cholesky decomposition → Vector Solution → randomized rounding → Integral Solution → all genotypes resolved? If yes, done; if no, repeat this algorithm.]


- The number of SNPs is still too large to be used for association studies.
- There are millions of SNPs in the human genome.
- To reduce the SNP genotyping cost, we wish to use as few SNPs as possible for association studies.

- Tag SNPs are a small subset of SNPs that is sufficient for performing association studies without losing the power of using all SNPs.
- There are many definitions of tag SNPs.
- We will first study one definition of tag SNPs based on the haplotype block model.

- Recent studies have shown that the chromosome can be partitioned into haplotype blocks interspersed by recombination hotspots (Daly et al., Patil et al.).
- Within a haplotype block, little or no recombination has occurred.
- The SNPs within a haplotype block tend to be inherited together.

- Within a haplotype block, a small subset of SNPs (called tag SNPs) is sufficient to distinguish each pair of haplotype patterns in the block.
- We only need to genotype tag SNPs instead of all SNPs within a haplotype block.

[Figure: a chromosome with SNP loci S1 to S12 partitioned into haplotype blocks separated by recombination hotspots; each block shows four haplotype patterns P1 to P4, with the major or minor allele marked at every locus.]

- Patil et al. (Science, 2001) partitioned chromosome 21 into 4,135 haplotype blocks covering 24,047 SNPs.
- Blue box: major allele
- Yellow box: minor allele

- Suppose we wish to identify an unknown haplotype sample.
- We can genotype all SNPs to identify the haplotype sample.

[Figure: haplotype patterns P1 to P4 over SNP loci S1 to S12, next to an unknown haplotype sample; the major or minor allele is marked at each locus.]

- In fact, it is not necessary to genotype all SNPs.
- SNPs S3, S4, and S5 can form a set of tag SNPs.

[Figure: haplotype patterns P1 to P4 over SNP loci S1 to S12, with the alleles at S3, S4, and S5 extracted; the four patterns remain distinct.]

- SNPs S1, S2, and S3 cannot form a set of tag SNPs because P1 and P4 would be ambiguous.

[Figure: haplotype patterns P1 to P4 over SNP loci S1 to S12, with the alleles at S1, S2, and S3 extracted; P1 and P4 are identical at these three loci.]

- SNPs S1 and S12 can form a set of tag SNPs.
- This set of SNPs is the minimum solution in this example.

[Figure: haplotype patterns P1 to P4 over SNP loci S1 to S12, with the alleles at S1 and S12 extracted; the four patterns remain distinct.]
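The tag SNP examples above can be reproduced with a small check. The allele values below are hypothetical (the original figure is not available); they were chosen only so that the three statements about S3/S4/S5, S1/S2/S3, and S1/S12 hold. Each string lists the alleles of P1 to P4 at one SNP (0 = major, 1 = minor).

```python
from itertools import combinations

def is_tag_set(columns, snps):
    """True if the chosen SNPs give every haplotype pattern a distinct signature."""
    n_patterns = len(next(iter(columns.values())))
    signatures = {tuple(columns[s][p] for s in snps) for p in range(n_patterns)}
    return len(signatures) == n_patterns

def min_tag_snps(columns):
    """Smallest tag SNP set by exhaustive search (small blocks only)."""
    names = list(columns)
    for size in range(1, len(names) + 1):
        for snps in combinations(names, size):
            if is_tag_set(columns, snps):
                return snps
    return None

# Hypothetical block with six of the twelve SNPs.
block = {"S1": "0110", "S2": "0010", "S3": "0100",
         "S4": "0010", "S5": "0001", "S12": "0101"}

print(is_tag_set(block, ["S3", "S4", "S5"]))  # True
print(is_tag_set(block, ["S1", "S2", "S3"]))  # False: P1 and P4 look identical
print(min_tag_snps(block))                    # ('S1', 'S12')
```

Two binary SNPs are the fewest that can separate four patterns, which is why no single SNP is ever returned here.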

- There are $\binom{4}{2} = 6$ pairs of patterns among P1, P2, P3, and P4.
- The relation between SNPs and pairs of haplotype patterns can be formulated as a bipartite graph.
- S1 can distinguish (P1, P3), (P1, P4), (P2, P3), and (P2, P4).
- S2 can distinguish (P1, P4), (P2, P4), and (P3, P4).

[Figure: bipartite graph with SNP nodes S1 to S4 on one side and pattern-pair nodes (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) on the other; an edge joins a SNP to each pair of patterns it distinguishes.]

- A set of SNPs can form a set of tag SNPs if each pair of patterns is connected by at least one edge to a SNP in the set.
- e.g., S1 and S3 can form a set of tag SNPs.
- e.g., S1 and S2 cannot be tag SNPs.
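A sketch of the bipartite view: each SNP is mapped to the set of pattern pairs it distinguishes, and a SNP set is a tag set when the union of those sets contains all six pairs. The columns for S1 and S2 reproduce the edges listed above; the column for S3 is a hypothetical choice that makes S1 and S3 a valid tag set.

```python
from itertools import combinations

def distinguished_pairs(column):
    """Pattern pairs (i, j), 1-based, whose alleles differ at this SNP."""
    n = len(column)
    return {(i + 1, j + 1) for i, j in combinations(range(n), 2)
            if column[i] != column[j]}

# Alleles of P1..P4 at each SNP (0 = major, 1 = minor).
snps = {"S1": "0011", "S2": "0001", "S3": "0101"}
edges = {name: distinguished_pairs(col) for name, col in snps.items()}

def covers_all_pairs(chosen, n_patterns=4):
    """True if every pattern pair has an edge to at least one chosen SNP."""
    all_pairs = set(combinations(range(1, n_patterns + 1), 2))
    covered = set().union(*(edges[s] for s in chosen))
    return covered == all_pairs

print(sorted(edges["S1"]))                # [(1, 3), (1, 4), (2, 3), (2, 4)]
print(covers_all_pairs(["S1", "S3"]))     # True
print(covers_all_pairs(["S1", "S2"]))     # False: (1, 2) has no edge
```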

- The problem of finding the minimum set of tag SNPs is known to be NP-hard.
- This problem is equivalent to the minimum test set problem.
- A number of methods have been proposed to find the minimum set of tag SNPs (Bafna et al., Zhang et al.).
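Because the exact problem is NP-hard, heuristics are common in practice. The sketch below is a generic greedy set-cover heuristic, not the method of any of the cited papers: it repeatedly picks the SNP that separates the most still-unseparated pattern pairs. The input columns are hypothetical, and the result is not guaranteed to be minimum.

```python
from itertools import combinations

def greedy_tag_snps(columns):
    """Greedy heuristic for tag SNP selection over binary pattern columns."""
    n = len(next(iter(columns.values())))
    uncovered = set(combinations(range(n), 2))
    # For each SNP, the pattern pairs (0-based) whose alleles differ at that SNP.
    splits = {name: {(i, j) for i, j in uncovered if col[i] != col[j]}
              for name, col in columns.items()}
    chosen = []
    while uncovered:
        best = max(splits, key=lambda name: len(splits[name] & uncovered))
        if not splits[best] & uncovered:
            raise ValueError("some patterns are identical at every SNP")
        chosen.append(best)
        uncovered -= splits[best]
    return chosen

# Alleles of P1..P4 at each SNP (0 = major, 1 = minor); hypothetical values.
columns = {"S1": "0011", "S2": "0001", "S3": "0101"}
tags = greedy_tag_snps(columns)
print(tags)  # ['S1', 'S3']
```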

- In reality, we may fail to obtain some tag SNPs if they do not pass the data quality threshold.
- In the current genotyping environment, the missing rate of SNPs is around 5% to 10%.
- We proposed two greedy algorithms and one linear programming relaxation algorithm to select additional tag SNPs that tolerate such missing data.

- Huang, Y.-T., Zhang, K., Chen, T. and Chao, K.-M., 2005, “Selecting Additional Tag SNPs for Tolerating Missing Data in Genotyping,” BMC Bioinformatics, 6: 263.
- Chang, C.-J., Huang, Y.-T., and Chao, K.-M., 2006, “A Greedier Approach for Finding Tag SNPs,” Bioinformatics, 22: 685-691.


- The problem of finding tag SNPs can also be approached from a statistical point of view.
- We can measure the correlation between SNPs and identify sets of highly correlated SNPs.
- For each set of correlated SNPs, only one SNP needs to be genotyped, and it can be used to predict the values of the other SNPs.

- Linkage Disequilibrium (LD) is a measure that estimates such correlation between two SNPs.
- LD is defined formally later.

- The statistical methods for finding tag SNPs are based on the analysis of LD among all SNPs.
- An LD bin is a set of SNPs such that SNPs within the same bin are highly correlated with each other.
- The value of a single SNP in one LD bin can predict the values of the other SNPs in the same bin.
- These methods try to identify the minimum set of LD bins.

- SNP1 and SNP2 cannot form an LD bin.
- e.g., A in SNP1 may imply either G or A in SNP2.

- SNP1, SNP2, and SNP3 can form an LD bin.
- Any SNP in this bin is sufficient to predict the values of the others.

- There are three LD bins, and only three tag SNPs need to be genotyped (e.g., SNP1, SNP2, and SNP4).

- Haplotype blocks are based on the assumption that SNPs in close proximity tend to be correlated with each other.
- Recombination is less likely to occur between nearby SNPs.

- LD bins can group correlated SNPs that are distant from each other.
- A disease is usually affected by multiple genes instead of a single one.

- The SNPs in one LD bin can be shared by other bins.
- The SNPs in a haplotype block do not appear in another block.

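[Figure: two SNP loci, SNP1 and SNP2, on a chromosome. SNP1 has alleles A and a; SNP2 has alleles B and b. The possible haplotypes are AB, Ab, aB, and ab.]

- A, B: major alleles; a, b: minor alleles.
- $P_A$: probability of allele A at SNP1; $P_a = 1 - P_A$: probability of allele a at SNP1.
- $P_B$: probability of allele B at SNP2; $P_b = 1 - P_B$: probability of allele b at SNP2.
- $P_{AB}$: probability of haplotype AB; $P_{ab}$: probability of haplotype ab (and similarly $P_{Ab}$ and $P_{aB}$).

- Linkage equilibrium: the alleles at the two loci are independent.
- $P_{AB} = P_A P_B$
- $P_{Ab} = P_A P_b = P_A (1 - P_B)$
- $P_{aB} = P_a P_B = (1 - P_A) P_B$
- $P_{ab} = P_a P_b = (1 - P_A)(1 - P_B)$

- Linkage disequilibrium: the alleles at the two loci are not independent.
- $P_{AB} \neq P_A P_B$
- $P_{Ab} \neq P_A P_b = P_A (1 - P_B)$
- $P_{aB} \neq P_a P_B = (1 - P_A) P_B$
- $P_{ab} \neq P_a P_b = (1 - P_A)(1 - P_B)$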

- Suppose we have three haplotypes: AG, CG, and CC.
- There is no AC haplotype, i.e., $P_{AC} = 0$.

- Note that $P_{AC} = 0$ and $P_A P_C = 1/9$ (with $P_C = 1/3$ taken at SNP2), so $P_{AC} \neq P_A P_C$.
- These two SNPs are in linkage disequilibrium.

[Figure: the three haplotypes AG, CG, and CC. Allele frequencies: $P_A = 1/3$ and $P_C = 2/3$ at SNP1; $P_G = 2/3$ and $P_C = 1/3$ at SNP2.]

- After recombination,
- $P_{AG} = P_A P_G = 1/4$,
- $P_{CG} = P_C P_G = 1/4$,
- $P_{CC} = P_C P_C = 1/4$, and
- $P_{AC} = P_A P_C = 1/4$.

- These two SNPs are in linkage equilibrium.

[Figure: before recombination the haplotypes are AG, CG, and CC; after recombination the haplotype AC also appears and all four haplotypes are equally frequent. Allele frequencies after recombination: $P_A = 1/2$ and $P_C = 1/2$ at SNP1; $P_G = 1/2$ and $P_C = 1/2$ at SNP2.]
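The two cases can be checked numerically. The sketch computes the difference between the observed haplotype frequency and the product of the allele frequencies, here called `linkage_d` (a name chosen for illustration), using exact fractions; each haplotype is assumed to be equally frequent in its list.

```python
from fractions import Fraction

def linkage_d(haplotypes, allele1, allele2):
    """P(allele1, allele2) - P(allele1) * P(allele2) for two-locus haplotypes."""
    n = len(haplotypes)
    p1 = Fraction(sum(h[0] == allele1 for h in haplotypes), n)    # frequency at SNP1
    p2 = Fraction(sum(h[1] == allele2 for h in haplotypes), n)    # frequency at SNP2
    p12 = Fraction(sum(h == allele1 + allele2 for h in haplotypes), n)
    return p12 - p1 * p2

before = ["AG", "CG", "CC"]        # no AC haplotype
after = ["AG", "CG", "CC", "AC"]   # recombination has created AC

d_before = linkage_d(before, "A", "C")
d_after = linkage_d(after, "A", "C")
print(d_before)  # -1/9
print(d_after)   # 0
```

A nonzero difference signals linkage disequilibrium; a difference of zero corresponds to linkage equilibrium.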

- There are many formulas to compute LD between two SNPs, and most of them are normalized to the range -1 to 1 or 0 to 1.
- LD = 1 (perfect positive correlation)
- LD = 0 (no correlation or linkage equilibrium)
- LD = -1 (perfect negative correlation)
- LD = 0.8 (strong positive correlation)
- LD = 0.12 (weak positive correlation)

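- Mathematical formulas for computing LD, with $D = P_{AB} - P_A P_B$:
- $r^2$ or $\Delta^2$: $r^2 = \frac{D^2}{P_A P_a P_B P_b}$
- $D'$: $D' = \frac{D}{D_{\max}}$, where $D_{\max}$ is the largest magnitude that $D$ can attain given the allele frequencies.
- Chi-square test.
- P value.

- The correlation between two random variables A and B can be measured by the correlation coefficient: $\rho_{A,B} = \frac{\operatorname{Cov}(A, B)}{\sigma_A \sigma_B}$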
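A minimal sketch of two of these measures, using the standard textbook definitions $D = P_{AB} - P_A P_B$, $r^2 = D^2 / (P_A P_a P_B P_b)$, and $|D'| = |D| / D_{\max}$. The function name `ld_measures` is an assumption, and both SNPs are assumed to be polymorphic (otherwise the denominator of $r^2$ is zero).

```python
from fractions import Fraction

def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) from the AB haplotype frequency and the A, B allele frequencies."""
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2

# Haplotypes AG, CG, CC with equal frequency: P_AG = 1/3, P_A = 1/3, P_G = 2/3.
d, d_prime, r2 = ld_measures(Fraction(1, 3), Fraction(1, 3), Fraction(2, 3))
print(d, d_prime, r2)  # 1/9 1 1/4
```

In this example $|D'| = 1$ because one of the four possible haplotypes is absent, while $r^2 = 1/4$ shows that neither SNP fully predicts the other.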

- This problem asks for a minimum set of LD bins.
- The minimum LD value required between two SNPs in one bin is usually set to 0.8.

- This problem is equivalent to the minimum clique cover problem (Huang and Chao, 2005).
- Consider each SNP as a node in a graph.
- There exists an edge between two nodes iff the LD of these two SNPs ≥ 0.8.
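The graph construction can be sketched as follows. The pairwise LD values are hypothetical, and the helper names `ld_graph` and `is_ld_bin` are assumptions; under the definition above, a valid LD bin corresponds to a clique in this graph.

```python
from itertools import combinations

def ld_graph(ld, threshold=0.8):
    """Adjacency sets: two SNPs are linked iff their pairwise LD >= threshold."""
    nodes = {s for pair in ld for s in pair}
    adj = {s: set() for s in nodes}
    for (u, v), value in ld.items():
        if value >= threshold:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def is_ld_bin(snps, adj):
    """A valid LD bin is a clique: every pair of its SNPs is linked."""
    return all(v in adj[u] for u, v in combinations(snps, 2))

# Hypothetical pairwise LD values for four SNPs.
ld = {("SNP1", "SNP2"): 0.95, ("SNP1", "SNP3"): 0.85, ("SNP2", "SNP3"): 0.90,
      ("SNP1", "SNP4"): 0.10, ("SNP2", "SNP4"): 0.30, ("SNP3", "SNP4"): 0.82}
adj = ld_graph(ld)

print(is_ld_bin(["SNP1", "SNP2", "SNP3"], adj))  # True
print(is_ld_bin(["SNP2", "SNP3", "SNP4"], adj))  # False: SNP2 and SNP4 are not linked
```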

- The minimum clique cover problem is hard to approximate.
- The relaxed problem asks for a minimum set of LD bins such that at least one SNP in each LD bin has $r^2 \ge 0.8$ with every other SNP in the same bin.

- The relaxed problem is equivalent to the minimum dominating set problem.
- The minimum dominating set problem is still NP-hard but is easier to approximate.

- Given a graph G(V, E), the minimum dominating set C is the smallest set of nodes such that each node in V is either in C or adjacent to a node in C.
- Consider each node as a SNP and each edge as strong LD ($r^2 \ge 0.8$) between two SNPs.
- The minimum dominating set of this graph is the set of tag SNPs.
- Genotyping only this set of SNPs suffices to predict the other SNPs.
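A common way to approximate a minimum dominating set is a greedy heuristic, sketched below on a hypothetical LD graph. This is a generic illustration rather than the specific algorithm used by any cited work, and the result is not guaranteed to be minimum.

```python
def greedy_dominating_set(adj):
    """Repeatedly pick the node that dominates the most still-undominated nodes."""
    undominated = set(adj)
    chosen = []
    while undominated:
        # A node dominates itself and all of its neighbours.
        best = max(sorted(adj), key=lambda u: len(({u} | adj[u]) & undominated))
        chosen.append(best)
        undominated -= {best} | adj[best]
    return chosen

# Hypothetical LD graph: an edge means r^2 >= 0.8 between the two SNPs.
adj = {
    "SNP1": {"SNP2", "SNP3"},
    "SNP2": {"SNP1"},
    "SNP3": {"SNP1"},
    "SNP4": {"SNP5"},
    "SNP5": {"SNP4"},
    "SNP6": set(),
}
tags = greedy_dominating_set(adj)
print(tags)  # ['SNP1', 'SNP4', 'SNP6']
```

Each chosen SNP acts as the tag SNP of one bin: SNP1 predicts SNP2 and SNP3, SNP4 predicts SNP5, and the isolated SNP6 must be genotyped itself.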

- Hinds et al. (2005) identified 1,586,383 SNPs across three human populations: Americans of African, European, and Asian ancestry.

- The database provides both genotype data and inferred haplotype data.