Authors: Lan Liu , Yonghui Wu, PowerPoint Presentation


Efficient Algorithms for Genome-wide TagSNP Selection across Populations via the Linkage Disequilibrium Criterion

Authors: Lan Liu, Yonghui Wu, Stefano Lonardi and Tao Jiang

Outline

- Introduction
- The MCTS Model
- Our Algorithms
- Experimental Result

Introduction

Motivation

- With the rapid development of genotyping technologies, the dbSNP database now contains more than 10 million verified single-nucleotide polymorphisms (SNPs).
- We aim to select a subset of informative SNPs (i.e., tagSNPs) in order to
  - save the cost of genotyping all SNPs;
  - perform disease association mapping.

TagSNP Selection

- Haplotype-based methods
  - Require the information of the phased multilocus haplotypes
- Haplotype-free methods
  - Do not require haplotype information
  - TagSNP selection via the $r^2$ linkage disequilibrium statistic

$r^2$ Linkage Disequilibrium Statistics

- Given a pair of genetic markers 1 and 2, the $r^2$ statistic is

$$r^2 = \frac{(p_{AB} - p_{A\cdot}\,p_{\cdot B})^2}{p_{A\cdot}(1 - p_{A\cdot})\,p_{\cdot B}(1 - p_{\cdot B})}$$

- If $r^2$ is no less than a given threshold $r_0$, marker 1 can tag marker 2, and vice versa.
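The statistic can be estimated directly from a sample of two-locus haplotypes. The following is a minimal sketch, not the authors' code: it assumes each haplotype is given as a two-character string such as "AB" or "ab", takes $p_{AB}$ as the frequency of haplotype AB and $p_{A\cdot}$, $p_{\cdot B}$ as the marginal frequencies of alleles A and B, and (as an assumption of this sketch) returns 0 when either marker is monomorphic. The function names `r_squared` and `can_tag` are hypothetical.

```python
from collections import Counter


def r_squared(haplotypes):
    """Estimate r^2 from two-locus haplotypes such as "AB", "Ab", "aB", "ab"."""
    counts = Counter(haplotypes)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("need at least one haplotype")
    p_ab = counts["AB"] / n                      # frequency of haplotype AB
    p_a = (counts["AB"] + counts["Ab"]) / n      # marginal frequency of allele A
    p_b = (counts["AB"] + counts["aB"]) / n      # marginal frequency of allele B
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # assumption of this sketch: monomorphic markers carry no LD signal
    return (p_ab - p_a * p_b) ** 2 / denom


def can_tag(haplotypes, r0):
    """True if the two markers can tag each other at threshold r0."""
    return r_squared(haplotypes) >= r0


# Illustrative (made-up) sample of 10 haplotypes.
sample = ["AB"] * 6 + ["Ab"] + ["ab"] * 3
print(r_squared(sample), can_tag(sample, 0.5), can_tag(sample, 0.8))
```

In this made-up sample the pair passes a threshold of 0.5 but not 0.8, which illustrates how a stricter threshold yields fewer LD edges and hence more tagSNPs.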

(a) SNP markers and their LD patterns in a population

(b) TagSNPs for the population

The TagSNP Selection Problem

- Instance: a set $V$ of SNP markers and the LD patterns $E = \{(v_{j_1}, v_{j_2}) \mid r^2(v_{j_1}, v_{j_2}) \ge r_0,\ v_{j_1}, v_{j_2} \in V\}$ for a given threshold $r_0$.
- Feasible solution: a subset $V' \subseteq V$ such that for any $v \in V$ there exists a $v' \in V'$ with $r^2(v, v') \ge r_0$.
- Objective: minimize $|V'|$.
- If we define $G = (V, E)$, a tagSNP set is equivalent to a dominating set of $G$.
- This model was introduced by Carlson et al. (2004). It is a simple and popular tagging method.
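The dominating-set view can be made concrete with a small feasibility check. This is an illustrative sketch under stated assumptions: pairwise $r^2$ values arrive as a dictionary keyed by marker pairs, and the names `ld_graph` and `is_tag_set` are hypothetical rather than taken from the presentation.

```python
def ld_graph(markers, r2, r0):
    """Build G = (V, E): connect two markers when their r^2 is at least r0."""
    adj = {v: set() for v in markers}
    for (u, v), value in r2.items():
        if value >= r0:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def is_tag_set(adj, candidate):
    """True if every marker is in the candidate set or adjacent to one of its members."""
    candidate = set(candidate)
    return all(v in candidate or bool(adj[v] & candidate) for v in adj)


# Made-up r^2 values for four markers.
markers = ["s1", "s2", "s3", "s4"]
r2 = {("s1", "s2"): 0.90, ("s2", "s3"): 0.85, ("s3", "s4"): 0.30}
graph = ld_graph(markers, r2, r0=0.8)
print(is_tag_set(graph, {"s2", "s4"}), is_tag_set(graph, {"s2"}))
```

Here marker s4 has no LD edge at the chosen threshold, so any feasible tagSNP set must contain s4 itself.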

The MCTS Model

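$r^2$ Statistics in Single and Admixed Populations

- SNP 1: A, a
- SNP 2: B, b

Population 1 ($r^2 = 0$)

|       | B      | b      | Total |
|-------|--------|--------|-------|
| A     | 0.0025 | 0.0475 | 0.05  |
| a     | 0.0475 | 0.9025 | 0.95  |
| Total | 0.05   | 0.95   |       |

Population 2 ($r^2 = 0$)

|       | B      | b      | Total |
|-------|--------|--------|-------|
| A     | 0.9025 | 0.0475 | 0.95  |
| a     | 0.0475 | 0.0025 | 0.05  |
| Total | 0.95   | 0.05   |       |

Admixed population: 50% population 1, 50% population 2 ($r^2 = 0.6561$)

|       | B      | b      | Total |
|-------|--------|--------|-------|
| A     | 0.4525 | 0.0475 | 0.5   |
| a     | 0.0475 | 0.4525 | 0.5   |
| Total | 0.5    | 0.5    |       |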
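As a check on the admixed figure, the frequencies of the 50/50 mixture are the averages of the two single-population frequencies, and substituting them into the $r^2$ formula reproduces the value on the slide:

$$p_{AB} = 0.5(0.0025) + 0.5(0.9025) = 0.4525, \qquad p_{A\cdot} = p_{\cdot B} = 0.5(0.05) + 0.5(0.95) = 0.5$$

$$r^2 = \frac{(0.4525 - 0.5 \cdot 0.5)^2}{0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5} = \frac{0.2025^2}{0.0625} = 0.6561$$

In each single population $p_{AB} = p_{A\cdot}\,p_{\cdot B}$ exactly (for example $0.0025 = 0.05 \cdot 0.05$), so the numerator vanishes and $r^2 = 0$; the LD in the mixture comes purely from the difference in allele frequencies between the populations.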

TagSNP Selection across Populations

- A pair of SNPs
  - may have remarkably different marker frequencies and very weak LD in two populations with different evolutionary histories, but
  - may show strong LD in the admixed population.
- TagSNPs picked from the admixed population or from one of the populations might not be sufficient to capture the variations in all populations.

(a) SNP markers and their LD patterns in two populations.

(b) The minimum TagSNP set for these two populations.

The MCTS Model

- Given a set of SNP markers and their LD patterns in multiple populations, we want to find a minimum set of tagSNPs that is a valid tagSNP set in each of the populations.
- This problem is called the minimum common tagSNP selection (MCTS) problem.

Our Algorithms

- The upper bound: the number of tagSNPs obtained by our algorithms
- The lower bound: the minimum number of tagSNPs needed
  - GreedyTag_lb
  - LRTag_lb
- The MCTS problem can be easily formulated as an integer linear program.
- We first apply some data reduction rules, then use one of the following algorithms:
  - A greedy algorithm: GreedyTag
  - A Lagrangian relaxation algorithm: LRTag
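One natural way to write the integer linear program is sketched below; the notation is an assumption of this sketch and not necessarily the formulation used by the authors. Let $V_i$ be the markers occurring in population $i$, let $E_i$ be the LD edges of that population, and let $N_i[v] = \{v\} \cup \{u \mid (u, v) \in E_i\}$ be the markers that can tag $v$ in population $i$. With a binary variable $x_v$ indicating whether marker $v$ is chosen:

$$\min \sum_{v \in V} x_v \quad \text{subject to} \quad \sum_{u \in N_i[v]} x_u \ge 1 \ \ \text{for every population } i \text{ and every } v \in V_i, \qquad x_v \in \{0, 1\}$$

Each constraint corresponds to one occurrence of a marker in one population, which is why the same marker can give rise to several constraints.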

Data Reduction Rules

- Remove less informative markers
  - Example: among markers 1, 2 and 6, remove markers 1 and 2.
- Remove less stringent occurrences
  - Example: between the occurrences of markers 4 and 5 in population 2, remove the occurrence of marker 4.
- Pick all irreplaceable markers
  - Example: marker 7
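The third rule can be illustrated with a short sketch. It rests on an assumption about what "irreplaceable" means, since the slide gives only an example: here a marker is treated as irreplaceable if, in at least one population, no other marker is in LD with it, so that nothing else can tag that occurrence. The input layout `ld[population][marker]` and the function name are hypothetical.

```python
def irreplaceable_markers(ld):
    """Markers that must be picked because some occurrence has no other tagger.

    ld[population][marker] is the set of markers in LD with `marker`
    (r^2 >= r0) in that population; this layout is an assumption of the sketch.
    """
    forced = set()
    for graph in ld.values():
        for marker, neighbours in graph.items():
            if not (set(neighbours) - {marker}):
                forced.add(marker)
    return forced


# Made-up LD patterns for two populations.
ld = {
    "pop1": {"s1": {"s2"}, "s2": {"s1", "s3"}, "s3": {"s2"}},
    "pop2": {"s1": {"s2"}, "s2": {"s1"}, "s3": set()},
}
print(irreplaceable_markers(ld))
```

Marker s3 is isolated in the second population, so it has to be selected whatever happens in the first population.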

A Greedy Algorithm

1. Apply the data reduction rules.
2. While there is an un-tagged occurrence, pick the marker that tags the most of the remaining occurrences as a tagSNP.
3. Output the tagSNPs.
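The greedy loop can be sketched as follows. This is a simplified illustration and not the authors' GreedyTag: it skips the data reduction rules, treats an occurrence as a (population, marker) pair, and assumes the hypothetical input layout `ld[population][marker]` holding the set of markers in LD with `marker` in that population. Ties are broken by marker name so the result is deterministic.

```python
def greedy_common_tags(ld):
    """Greedy selection of a common tagSNP set across populations (sketch)."""
    cover = {}        # cover[u] = occurrences (population, marker) that u tags
    untagged = set()  # occurrences not yet tagged
    for pop, graph in ld.items():
        for v, neighbours in graph.items():
            untagged.add((pop, v))
            cover.setdefault(v, set()).add((pop, v))      # a marker tags itself
            for u in neighbours:
                cover.setdefault(u, set()).add((pop, v))  # u tags v in this population

    tags = []
    while untagged:
        # Pick the marker that tags the most remaining occurrences.
        best = max(sorted(cover), key=lambda u: len(cover[u] & untagged))
        tags.append(best)
        untagged -= cover[best]
    return tags


# Made-up LD patterns for two populations.
ld = {
    "pop1": {"s1": {"s2"}, "s2": {"s1", "s3"}, "s3": {"s2"}},
    "pop2": {"s1": {"s2"}, "s2": {"s1"}, "s3": set()},
}
print(greedy_common_tags(ld))  # -> ['s2', 's3']
```

The loop always terminates because every occurrence is tagged at least by its own marker, so each iteration tags at least one new occurrence.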

A Lagrangian Relaxation Algorithm

1. Introduce the Lagrangian multipliers λ and obtain the relaxed integer program.
2. Initialize λ and obtain the tagSNP set based on λ.
3. Set iteration := 0.
4. While iteration++ < max_iter, update λ towards the subgradient direction and update the tagSNP set based on λ.
5. Output the tagSNPs.
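A generic subgradient scheme of this kind can be sketched as below. It is a textbook-style Lagrangian relaxation of the covering constraints, not the authors' LRTag: the step-size rule, the greedy repair step, and the input layout `cover[u]` (the set of occurrences that marker `u` tags) are all assumptions of this sketch. For any non-negative multipliers the relaxed objective is a valid lower bound on the optimum, which is the role LRTag_lb plays in the presentation.

```python
def lagrangian_tags(cover, max_iter=100, step0=1.0):
    """Subgradient sketch; returns (tags, lower_bound)."""
    universe = set().union(*cover.values())
    markers = sorted(cover)
    lam = {e: 0.0 for e in universe}  # one multiplier per covering constraint
    best_tags, best_lb = None, 0.0

    for it in range(max_iter):
        # Relaxed problem: a marker is picked iff its reduced cost is negative.
        reduced = {u: 1.0 - sum(lam[e] for e in cover[u]) for u in markers}
        chosen = [u for u in markers if reduced[u] < 0]
        lower_bound = sum(lam.values()) + sum(reduced[u] for u in chosen)
        best_lb = max(best_lb, lower_bound)

        # Repair the relaxed solution greedily so that it tags every occurrence.
        tags = list(chosen)
        untagged = universe - set().union(*(cover[u] for u in tags))
        while untagged:
            u = max(markers, key=lambda m: len(cover[m] & untagged))
            tags.append(u)
            untagged -= cover[u]
        if best_tags is None or len(tags) < len(best_tags):
            best_tags = tags

        # Subgradient step: raise multipliers of uncovered occurrences,
        # lower those of occurrences covered more than once.
        step = step0 / (it + 1)
        for e in universe:
            times_covered = sum(1 for u in chosen if e in cover[u])
            lam[e] = max(0.0, lam[e] + step * (1 - times_covered))

    return best_tags, best_lb


# Made-up coverage sets: occurrences are (population, marker) pairs.
cover = {
    "s1": {("p1", "s1"), ("p1", "s2"), ("p2", "s1"), ("p2", "s2")},
    "s2": {("p1", "s1"), ("p1", "s2"), ("p1", "s3"), ("p2", "s1"), ("p2", "s2")},
    "s3": {("p1", "s2"), ("p1", "s3"), ("p2", "s3")},
}
tags, lower_bound = lagrangian_tags(cover)
print(tags, lower_bound)
```

The gap between the size of the returned tag set and the rounded-up lower bound indicates how far the solution can be from optimal, mirroring the upper-bound and lower-bound comparison reported in the results.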

Experimental Result

- There are four populations in the HapMap data:
  - CEU: European descendants.
  - CHB: Chinese, Beijing.
  - JPT: Japanese, Tokyo.
  - YRI: Yoruba people of Ibadan, Nigeria.
- We select tagSNPs for the following two datasets:
  - ENCODE regions: all 10 ENCODE regions, 10,859 markers.
  - Human genome: chromosomes 1–22, 2,862,454 markers.
- We apply our algorithms to real HapMap data (release #19, NCBI build 34, October 2005).

Experimental Results for ENCODE Regions

- We compare our GreedyTag and LRTag with MultiPop-TagSelect (MPS).
  - MultiPop-TagSelect first generates the tagSNPs for each single population, then combines the obtained tagSNPs for multiple populations.
- The gap between LRTag_lb and LRTag:
  - $r^2 = 0.5$: at most two for each region, six in total for all regions.
  - $r^2 = 0.8$: there is no gap.

Experimental Results for the Human Genome

- The gap between LRTag_lb and LRTag for the whole genome (2,862,454 SNPs in total):
  - $r^2 = 0.5$: 1,061
  - $r^2 = 0.8$: 142
- The numbers of tagSNPs selected by our algorithms are almost optimal.

Running Time of Our Algorithms

- Running environment
  - a 32-processor SGI Altix 4700 supercomputer system
  - 1.6 GHz CPU
  - 64 GB shared memory
  - 15 threads in parallel
- Running time
  - $r^2 = 0.5$:
    - ENCODE regions: < 7 seconds for each region, < 1 minute for all regions.
    - Human genome: < 12 minutes for each chromosome, < 1 hour for the genome.
  - $r^2 > 0.5$: our algorithms run faster than the times above.
