Efficient Algorithms for Genome-wide TagSNP Selection across Populations via the Linkage Disequilibrium Criterion

Authors: Lan Liu, Yonghui Wu, Stefano Lonardi and Tao Jiang


Outline

  • Introduction

  • The MCTS Model

  • Our Algorithms

  • Experimental Result



Motivation

  • With the rapid development of genotyping technologies, there are more than 10 million verified single-nucleotide polymorphisms (SNPs) in the dbSNP database.

  • We aim to select a subset of informative SNPs (i.e., tagSNPs), in order to

    • Reduce the cost of genotyping all SNPs.

    • Perform disease association mapping.


TagSNP Selection

  • Haplotype-based methods

    • Require the information of the phased multilocus haplotypes

  • Haplotype-free methods

    • Do not require haplotype information

    • TagSNP selection via r² linkage disequilibrium statistics


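r² Linkage Disequilibrium Statistics

  • Given a pair of genetic markers 1 and 2.

  • r² statistic:

    $$r^2 = \frac{(p_{AB} - p_{A\cdot}\, p_{\cdot B})^2}{p_{A\cdot}\,(1 - p_{A\cdot})\, p_{\cdot B}\,(1 - p_{\cdot B})}$$

    where $p_{AB}$ is the frequency of haplotype AB, $p_{A\cdot}$ is the frequency of allele A at marker 1, and $p_{\cdot B}$ is the frequency of allele B at marker 2.

  • If r² is no less than a given threshold r0, marker 1 (or marker 2) can tag marker 2 (or marker 1, respectively).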
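
As a minimal sketch, the r² statistic and the tagging rule can be written in a few lines of Python. The names `r_squared`, `can_tag`, `p_ab`, `p_a` and `p_b`, and the example frequencies, are illustrative assumptions rather than anything taken from the slides.

```python
def r_squared(p_ab, p_a, p_b):
    """r^2 between two biallelic markers.

    p_ab: frequency of haplotype AB; p_a, p_b: frequencies of alleles A and B.
    Assumes both markers are polymorphic (0 < p_a < 1 and 0 < p_b < 1).
    """
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def can_tag(p_ab, p_a, p_b, r0):
    """True if the two markers can tag each other at threshold r0."""
    return r_squared(p_ab, p_a, p_b) >= r0


# Hypothetical frequencies, for illustration only.
print(r_squared(0.5, 0.5, 0.5))        # alleles A and B always occur together
print(can_tag(0.4, 0.5, 0.5, r0=0.8))  # weaker LD, checked against r0 = 0.8
```

Because r² is symmetric in the two markers, a single check covers both tagging directions.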


The TagSNP Selection Problem

[Figure: (a) SNP markers and their LD patterns in a population; (b) tagSNPs for the population]

  • Instance: a set V of SNP markers and the LD patterns

    $E = \{(v_{j_1}, v_{j_2}) \mid r^2(v_{j_1}, v_{j_2}) \ge r_0,\ v_{j_1}, v_{j_2} \in V\}$, where r0 is a given threshold.

  • Feasible solution: a subset V' of V such that, for any v in V, there exists a v' in V' with r²(v, v') no less than r0.

  • Objective: minimize |V'|.

  • If we define G = (V, E), a tagSNP set is equivalent to a dominating set on G.

  • This model was introduced by Carlson et al. (2004). It is a simple and popular tagging method.
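
To make the dominating-set view concrete, here is a small sketch that builds G from pairwise r² values and finds a minimum tagSNP set by brute force. The marker labels, the r² values and the function names are hypothetical, and the exhaustive search is only meant for toy inputs. It is not one of the algorithms presented in these slides.

```python
from itertools import combinations


def ld_graph(markers, r2, r0):
    """Adjacency sets of G: u and v are linked when r2[(u, v)] >= r0."""
    graph = {v: set() for v in markers}
    for (u, v), value in r2.items():
        if value >= r0:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def is_tag_set(graph, tags):
    """Every marker is a tagSNP itself or has a tagSNP among its neighbors."""
    return all(v in tags or bool(graph[v] & tags) for v in graph)


def min_tag_set(graph):
    """Smallest dominating set by exhaustive search (exponential time)."""
    markers = sorted(graph)
    for k in range(1, len(markers) + 1):
        for cand in combinations(markers, k):
            if is_tag_set(graph, set(cand)):
                return set(cand)
    return set()


# Hypothetical pairwise r^2 values for five markers.
r2 = {(1, 2): 0.90, (2, 3): 0.85, (3, 4): 0.40, (4, 5): 0.95}
graph = ld_graph([1, 2, 3, 4, 5], r2, r0=0.8)
print(min_tag_set(graph))
```

With these values the pair (3, 4) falls below the threshold, so G splits into the path 1-2-3 and the edge 4-5, and two tagSNPs suffice.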



r² Statistics in Single and Admixed Populations

  • SNP 1: A, a

  • SNP 2: B, b

  • Population 1 (r² = 0):

              B        b        Total
      A       0.0025   0.0475   0.05
      a       0.0475   0.9025   0.95
      Total   0.05     0.95

  • Population 2 (r² = 0):

              B        b        Total
      A       0.9025   0.0475   0.95
      a       0.0475   0.0025   0.05
      Total   0.95     0.05

  • Admixed population, 50% population 1 and 50% population 2 (r² = 0.6561):

              B        b        Total
      A       0.4525   0.0475   0.5
      a       0.0475   0.4525   0.5
      Total   0.5      0.5
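
The numbers on this slide can be checked with a short script. The haplotype frequencies are the ones given in the slide; the dictionary layout and the function names `r_squared` and `admix` are assumptions made for this sketch.

```python
def r_squared(hap):
    """r^2 from a table mapping haplotypes 'AB', 'Ab', 'aB', 'ab' to frequencies."""
    p_a = hap["AB"] + hap["Ab"]   # frequency of allele A at SNP 1
    p_b = hap["AB"] + hap["aB"]   # frequency of allele B at SNP 2
    d = hap["AB"] - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def admix(hap1, hap2, w=0.5):
    """Haplotype frequencies of a mixture: fraction w of hap1, 1 - w of hap2."""
    return {h: w * hap1[h] + (1 - w) * hap2[h] for h in hap1}


pop1 = {"AB": 0.0025, "Ab": 0.0475, "aB": 0.0475, "ab": 0.9025}
pop2 = {"AB": 0.9025, "Ab": 0.0475, "aB": 0.0475, "ab": 0.0025}
mixed = admix(pop1, pop2)

for name, hap in [("population 1", pop1), ("population 2", pop2), ("admixed", mixed)]:
    print(name, round(r_squared(hap), 4))
```

In each single population the haplotype frequency equals the product of the allele frequencies (for example 0.0025 = 0.05 × 0.05), so D = 0. In the mixture, D = 0.4525 − 0.5 × 0.5 = 0.2025, and 0.2025² / 0.0625 = 0.6561.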


TagSNP Selection across Populations

  • A pair of SNPs

    • may have remarkably different marker frequencies and very weak LD

      in two populations with different evolutionary histories,

    • yet may show strong LD in the admixed population.

  • TagSNPs picked from the admixed population, or from only one of the populations, might not be sufficient to capture the variation in all populations.


The MCTS Model

[Figure: (a) SNP markers and their LD patterns in two populations; (b) the minimum tagSNP set for these two populations]

  • Given a set of SNP markers and their LD patterns in multiple populations, we want to find a minimum common tagSNP set, that is, a smallest set of markers that is a tagSNP set for each of the populations.

  • The above problem is called the minimum common tagSNP selection problem (MCTS).
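
A common tagSNP set has to pass the single-population check in every population at once. The sketch below illustrates this on two hypothetical LD graphs. The marker labels, the graphs and the function names are assumptions for illustration, not data from the slides.

```python
def is_tag_set(graph, tags):
    """graph maps each marker to the set of markers it is in LD with (r^2 >= r0)."""
    return all(v in tags or bool(graph[v] & tags) for v in graph)


def is_common_tag_set(populations, tags):
    """True if `tags` is a valid tagSNP set in every population."""
    return all(is_tag_set(graph, tags) for graph in populations)


# Hypothetical LD patterns of the same four markers in two populations.
pop1 = {1: {2}, 2: {1, 3}, 3: {2}, 4: set()}
pop2 = {1: {2}, 2: {1}, 3: {4}, 4: {3}}

print(is_tag_set(pop2, {1, 3}))                  # valid for population 2 alone
print(is_common_tag_set([pop1, pop2], {1, 3}))   # marker 4 is untagged in population 1
print(is_common_tag_set([pop1, pop2], {2, 4}))   # valid in both populations
```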



Our Algorithms

  • The MCTS problem can be easily formulated as an integer linear program.

  • We first apply some data reduction rules, then use one of the following algorithms:

    • A greedy algorithm: GreedyTag

    • A Lagrangian relaxation algorithm: LRTag

  • The upper bound: the number of tagSNPs obtained by our algorithms.

  • The lower bound: a bound on the minimum number of tagSNPs needed.

    • GreedyTag_lb

    • LRTag_lb
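
One plausible way to write the integer linear program is the set-cover form below. The slides do not show the formulation, so the notation is an assumption: there are $k$ populations, $V_i$ is the set of markers present in population $i$, $N_i[v]$ is marker $v$ together with the markers in LD with it ($r^2 \ge r_0$) in population $i$, and $x_u = 1$ if marker $u$ is chosen as a tagSNP.

```latex
\begin{aligned}
\min\ & \sum_{u \in V} x_u \\
\text{s.t.}\ & \sum_{u \in N_i[v]} x_u \ge 1
  && \text{for every population } i = 1, \dots, k \text{ and every } v \in V_i, \\
& x_u \in \{0, 1\} && \text{for every } u \in V.
\end{aligned}
```

Each constraint corresponds to one occurrence of a marker in a population and states that at least one chosen tagSNP must tag that occurrence.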


Data Reduction Rules

  • Remove less stringent occurrences

    • Example: between the occurrences of markers 4 and 5 in population 2, remove the occurrence of marker 4.

  • Pick all irreplaceable markers

    • Example: marker 7
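
The slides do not spell out the two rules, so the sketch below rests on an assumed reading borrowed from standard set-cover preprocessing. An occurrence is taken to be "less stringent" than another when every marker that can tag the other can also tag it. A marker is taken to be "irreplaceable" when it is the only marker able to tag some occurrence. The data layout (`taggers`, keyed by hypothetical `(population, marker)` pairs) and the function name are also assumptions.

```python
def reduce_instance(taggers):
    """taggers maps each occurrence to the set of markers that can tag it.

    Returns (picked, remaining): the markers forced into the solution and the
    occurrences that still need a tagSNP after both rules stop applying.
    """
    taggers = {occ: set(cands) for occ, cands in taggers.items()}
    picked = set()
    changed = True
    while changed:
        changed = False
        # Rule "irreplaceable": a single candidate must be picked; every
        # occurrence it tags is then settled.
        for occ, cands in list(taggers.items()):
            if occ in taggers and len(cands) == 1:
                (marker,) = cands
                picked.add(marker)
                for done in [o for o, s in taggers.items() if marker in s]:
                    del taggers[done]
                changed = True
        # Rule "less stringent": drop o1 if some other occurrence o2 has
        # candidates that are a subset of o1's (tagging o2 also tags o1).
        occs = list(taggers)
        for o1 in occs:
            for o2 in occs:
                if (o1 != o2 and o1 in taggers and o2 in taggers
                        and taggers[o2] <= taggers[o1]):
                    del taggers[o1]
                    changed = True
                    break
    return picked, taggers


# Hypothetical instance: keys are (population, marker) occurrences.
example = {
    (1, 1): {1, 2}, (1, 2): {1, 2, 3}, (1, 3): {2, 3}, (1, 4): {4},
    (2, 1): {1, 2}, (2, 2): {1, 2}, (2, 3): {3, 4}, (2, 4): {3, 4},
}
picked, remaining = reduce_instance(example)
print(picked, remaining)
```

In this toy instance marker 4 is the only candidate for its occurrence in population 1, so it is picked. The occurrences that survive carry the tightest candidate sets.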


A Greedy Algorithm

  • Apply the data reduction rules.

  • While there is an un-tagged occurrence: pick the marker which tags the most of the remaining occurrences as a tagSNP.

  • Output the tagSNPs.
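
The greedy loop can be sketched as follows. This is a generic greedy cover written for illustration, not the authors' GreedyTag implementation. It skips the data reduction step, and the input layout (one adjacency dictionary per population) and the tie-breaking rule (smallest marker label first) are assumptions.

```python
def greedy_tag(populations):
    """populations: list of dicts mapping a marker to its LD neighbors (r^2 >= r0).

    An occurrence (i, v) is marker v in population i. It can be tagged by v
    itself or by any of v's neighbors in population i.
    """
    taggers = {(i, v): {v} | set(nbrs)
               for i, graph in enumerate(populations)
               for v, nbrs in graph.items()}
    markers = sorted(set().union(*taggers.values()))
    covers = {m: {o for o, s in taggers.items() if m in s} for m in markers}

    untagged = set(taggers)
    tags = []
    while untagged:
        # Pick the marker that tags the most of the remaining occurrences.
        best = max(markers, key=lambda m: len(covers[m] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags


# Hypothetical LD patterns of four markers in two populations.
pop1 = {1: {2}, 2: {1, 3}, 3: {2}, 4: set()}
pop2 = {1: {2}, 2: {1}, 3: {4}, 4: {3}}
print(greedy_tag([pop1, pop2]))
```

Here marker 2 tags five of the eight occurrences and is picked first. Marker 4 then tags the three that remain.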


A Lagrangian Relaxation Algorithm

  • Introduce the Lagrangian multipliers λ and obtain the relaxed integer program.

  • Initialize λ and obtain the tagSNP set based on λ.

  • Set iteration := 0.

  • While iteration++ < max_iter: update λ towards the subgradient direction, then update the tagSNP set based on λ.

  • Output the tagSNPs.
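
The loop above follows the usual pattern of subgradient optimization. The sketch below is a textbook Lagrangian relaxation of a set-cover program, offered only as an illustration of that pattern. It is not the authors' LRTag. The step-size schedule, the greedy repair step and all names (`lagrangian_tag`, `repair`, `taggers`) are assumptions.

```python
def repair(chosen, taggers, covers):
    """Turn the relaxed solution into a feasible tagSNP set, then prune it."""
    tags = set(chosen)
    untagged = {o for o, s in taggers.items() if not (s & tags)}
    while untagged:
        best = max(covers, key=lambda m: len(untagged.intersection(covers[m])))
        tags.add(best)
        untagged.difference_update(covers[best])
    for m in sorted(tags):  # drop markers that are not needed
        if all(s & (tags - {m}) for s in taggers.values()):
            tags.discard(m)
    return tags


def lagrangian_tag(taggers, max_iter=200, step=1.0):
    """taggers maps each occurrence to the non-empty set of markers that tag it.

    Returns (tags, lower_bound): a feasible tagSNP set and a lower bound on
    the minimum number of tagSNPs.
    """
    markers = sorted(set().union(*taggers.values()))
    covers = {m: [o for o, s in taggers.items() if m in s] for m in markers}
    lam = {o: 0.0 for o in taggers}           # one multiplier per occurrence
    best_tags, best_lb = set(markers), 0.0
    for it in range(max_iter):
        # Relaxed problem: choose every marker whose reduced cost is negative.
        cost = {m: 1.0 - sum(lam[o] for o in covers[m]) for m in markers}
        chosen = {m for m in markers if cost[m] < 0}
        lower = sum(lam.values()) + sum(cost[m] for m in chosen)
        best_lb = max(best_lb, lower)
        tags = repair(chosen, taggers, covers)
        if len(tags) < len(best_tags):
            best_tags = tags
        # Subgradient step: raise multipliers of untagged occurrences and
        # lower those of occurrences tagged more than once.
        t = step / (it + 1)
        for o, s in taggers.items():
            lam[o] = max(0.0, lam[o] + t * (1 - len(s & chosen)))
    return best_tags, best_lb


# Hypothetical instance: keys are (population, marker) occurrences.
example = {
    (1, 1): {1, 2}, (1, 2): {1, 2, 3}, (1, 3): {2, 3}, (1, 4): {4},
    (2, 1): {1, 2}, (2, 2): {1, 2}, (2, 3): {3, 4}, (2, 4): {3, 4},
}
tags, lower_bound = lagrangian_tag(example)
print(sorted(tags), lower_bound)
```

For any non-negative multipliers the relaxed optimum cannot exceed the true minimum, which is why the best value seen can be reported as a lower bound next to the size of the feasible set.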



Experimental Result

  • We apply our algorithms to real HapMap data (release #19, NCBI build 34, October 2005).

  • We get tagSNPs for the following two datasets:

    • ENCODE regions

      • all 10 ENCODE regions

      • 10,859 markers

    • Human genome

      • chromosomes 1-22

      • 2,862,454 markers


Experiment Result for ENCODE Regions

  • We compare our GreedyTag and LRTag with MultiPop-TagSelect (MPS).

    • MultiPop-TagSelect first generates the tagSNPs for each single population, then combines the obtained tagSNPs together for multiple populations.

  • The gap between LRTag_lb and LRTag

    • r² = 0.5: at most two for each region, six in total for all regions.

    • r² = 0.8: there is no gap.


Experiment Result for Human Genome

  • The gap between LRTag_lb and LRTag for the whole genome

    • 2,862,454 SNPs in total

    • r² = 0.5: 1,061

    • r² = 0.8: 142

The numbers of tagSNPs selected by our algorithms are almost optimal.


Running Time of Our Algorithms

  • Running environment

    • a 32-processor SGI Altix 4700 supercomputer system

    • 1.6 GHz CPU

    • 64 GB shared memory

    • 15 threads in parallel.

  • Running time

    • r² = 0.5:

      • ENCODE regions: < 7 seconds for each region, < 1 minute for all regions.

      • Human genome: < 12 minutes for each chromosome, < 1 hour for the genome.

    • r² > 0.5: our algorithms run faster than the times above.



Thanks for your time and attention!

