SNP and Haplotype Analysis Algorithms and Applications. Eran Halperin International Computer Science Institute Berkeley, California. “Computational Genetics”. The Human Genome Project.


SNP and Haplotype Analysis Algorithms and Applications

Eran Halperin

International Computer Science Institute

Berkeley, California

CPM 2006

The Human Genome Project

“What we are announcing today is that we have reached a milestone…that is, covering the genome in…a working draft of the human sequence.”

“But our work previously has shown… that having one genetic code is important, but it's not all that useful.” (referring to comparative genomics).

“I would be willing to make a prediction that within 10 years, we will have the potential of offering any of you the opportunity to find out what particular genetic conditions you may be at increased risk for…”

Washington, DC, June 26, 2000.

Individually Tailored Medicine

People react to different drugs in different ways.

The vision: a simple DNA test would help to determine which medicine to prescribe.


The International HapMap Project: an international consortium that aims to genotype the genomes of 270 individuals from four different populations.

  • Launched in 2002. The first phase was finished in October (Nature, 2005).

Motivation

Complex disease:

  • Genetic factors (50%)
  • Environmental factors (50%)

Multiple genes may affect the disease.

Therefore, the effect of every single gene may be negligible.

Disease Association Studies: The search for genetic factors
  • Comparing the DNA contents of two populations:
    • Cases - individuals carrying the disease.
    • Controls - background population.

A significant discrepancy between the two populations is evidence of a causal gene.


Where should we look?

SNP = Single Nucleotide Polymorphism

Usually SNPs are bi-allelic (only two letters appear).

Figure label: Associated SNP

Cases:

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC

AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC

AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC

AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC

AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC

Controls:

AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC

AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC

AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC

AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC

AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC

AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC

AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC

AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC
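
The case/control comparison above can be sketched in a few lines. This is only an illustration: the function name `allelic_chi2`, the toy sequences, and the choice of a 2x2 chi-square statistic on allele counts are assumptions made here, not something the slides specify. The sketch scans every position, keeps the bi-allelic ones, and ranks them by how differently the two alleles are distributed between cases and controls.

```python
def allelic_chi2(cases, controls):
    """Rank bi-allelic positions by a 2x2 chi-square statistic on allele counts.

    cases / controls: lists of equal-length sequences (hypothetical input format).
    """
    ranked = []
    for pos in range(len(cases[0])):
        alleles = sorted({s[pos] for s in cases + controls})
        if len(alleles) != 2:
            continue  # not a bi-allelic SNP in this sample
        ref = alleles[0]
        a = sum(s[pos] == ref for s in cases)     # ref allele in cases
        b = len(cases) - a                        # other allele in cases
        c = sum(s[pos] == ref for s in controls)  # ref allele in controls
        d = len(controls) - c                     # other allele in controls
        n = a + b + c + d
        chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
        ranked.append((pos, tuple(alleles), chi2))
    # Largest statistic first: the most discrepant position leads the list.
    return sorted(ranked, key=lambda r: r[2], reverse=True)


# Toy data (not taken from the slide): position 1 separates the groups best.
toy_cases = ["AC", "AC", "AT", "AC"]
toy_controls = ["AT", "GT", "AT", "GT"]
ranking = allelic_chi2(toy_cases, toy_controls)
```

A real study would also need a significance threshold and a correction for testing many SNPs; the sketch only produces the ranking.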

Genotyping Technology
  • Extracting the allele information for a SNP from a DNA sample.
  • Considerable reductions in genotyping cost over the last couple of years.
  • Current cost allows for the genotyping of 500,000 SNPs for ~$1000, i.e. ~0.2 cents per SNP (compared to ~50 cents per SNP 3-4 years ago).

Haplotypes
  • SNPs in physical proximity are correlated.
  • A sequence of alleles along a chromosome is called a haplotype.

Haplotype Block Structure

(Daly et al., 2001) Block 6 from Chromosome 5q31


Haplotypes as Proxies for Rare SNPs

Common haplotypes:

  • 011000111 (23% of population)
  • 000001111 (55% of population)
  • 111111111 (14% of population)

Tag SNPs: 000, 001, 111
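
To make the idea of tag SNPs concrete, here is a minimal sketch that searches for the smallest set of SNP positions whose values tell the common haplotypes apart. The function name `smallest_tag_set` and the exhaustive search are assumptions for illustration; the slide's own figure uses three tag SNPs, and this sketch does not claim to reproduce that choice.

```python
from itertools import combinations


def smallest_tag_set(haplotypes):
    """Return the first smallest tuple of 0-based positions that distinguishes
    all haplotypes (brute force, fine for a handful of SNPs)."""
    m = len(haplotypes[0])
    distinct = len(set(haplotypes))
    for size in range(1, m + 1):
        for cols in combinations(range(m), size):
            projected = {tuple(h[c] for c in cols) for h in haplotypes}
            if len(projected) == distinct:
                return cols
    return tuple(range(m))


common = ["011000111", "000001111", "111111111"]
tags = smallest_tag_set(common)  # (0, 1): two SNPs suffice for these three haplotypes
```

Once the tag values of a chromosome are known, the remaining SNPs can be read off the matching common haplotype, which is what lets a haplotype act as a proxy for SNPs that were never typed.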

Tag SNP Selection
  • Input: a set of genotypes
  • Goal: find a set of t tag SNPs such that using these SNPs only, the error rate for the prediction of all other SNPs is minimized.

Formulation by [H., Kimmel, Shamir, 05’] (STAMPA)


Tag SNPs

Correlations between SNPs

Cases:

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

Controls:

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA


Basic Assumption

Given two SNPs j and k, the probabilities of the values at any intermediate SNPs do not change if we know the values of additional distal ones.

(Figure: SNP j and SNP k flanking the intermediate SNPs.)
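
One way to write this assumption down formally is sketched here. The notation $s_i$ for the value at SNP $i$ is introduced only for this illustration and does not appear in the slides.

```latex
% For SNPs j < i < k and any distal SNP d (d < j or d > k):
\Pr\left[\, s_i \mid s_j,\ s_k,\ s_d \,\right] \;=\; \Pr\left[\, s_i \mid s_j,\ s_k \,\right]
```

In words: once the two flanking SNPs are known, SNPs outside the interval carry no further information about the SNPs inside it.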


STAMPA (Selection of TAg SNPs to Maximize Prediction Accuracy)

(Figure: a test genotype, with SNP j and SNP k flanking the intermediate SNPs.)

1. Put aside one test genotype. Use the rest of the data to develop a majority rule for each pair of SNPs to predict the values of the intermediate SNPs.

2. The prediction error, averaged over all test genotypes, gives a score to the pair j and k.

3. Apply dynamic programming to obtain the best set of tag SNPs.
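
The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the published STAMPA implementation: rows are treated as plain lists of integer SNP values, the names `pair_score` and `select_tags` are hypothetical, ties in the majority vote go to the smaller value, a test row with no matching training rows counts as an error, and SNPs before the first tag or after the last tag are predicted from that single tag.

```python
from collections import Counter, defaultdict


def pair_score(G, j, k):
    """Leave-one-out errors when SNPs strictly between tags j and k are
    predicted by majority vote among rows that agree with the test row at
    the tags. j=None / k=None means there is no tag on that side."""
    m = len(G[0])
    lo = -1 if j is None else j
    hi = m if k is None else k
    tags = [p for p in (j, k) if p is not None]
    errors = 0
    for i in range(lo + 1, hi):
        votes = defaultdict(Counter)
        for row in G:
            votes[tuple(row[p] for p in tags)][row[i]] += 1
        for row in G:
            c = Counter(votes[tuple(row[p] for p in tags)])
            c[row[i]] -= 1                      # step 1: put the test row aside
            if sum(c.values()) == 0:
                errors += 1                     # nothing left to vote with
                continue
            pred = max(sorted(c), key=lambda v: c[v])
            errors += pred != row[i]            # step 2: accumulate the error
    return errors


def select_tags(G, t):
    """Step 3: dynamic programming over tag positions; returns (errors, tags)."""
    m = len(G[0])
    inf = float("inf")
    best = [[inf] * m for _ in range(t + 1)]    # best[a][i]: a tags, last at i
    prev = [[None] * m for _ in range(t + 1)]
    for i in range(m):
        best[1][i] = pair_score(G, None, i)
    for a in range(2, t + 1):
        for i in range(m):
            for j in range(i):
                cand = best[a - 1][j] + pair_score(G, j, i)
                if cand < best[a][i]:
                    best[a][i], prev[a][i] = cand, j
    total, last = min((best[t][i] + pair_score(G, i, None), i) for i in range(m))
    tags, a = [], t
    while last is not None:
        tags.append(last)
        last, a = prev[a][last], a - 1
    return total, tags[::-1]


# Toy data: SNPs 0 and 1 always agree, SNPs 2 and 3 always agree,
# and the two blocks are independent of each other.
toy = [[0, 0, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 1, 1]] * 2
errors_two_tags, two_tags = select_tags(toy, 2)
errors_one_tag, one_tag = select_tags(toy, 1)
```

On the toy data two tags, one per block, predict everything correctly, while a single tag cannot predict the other block. The sketch recomputes `pair_score` inside the loops for clarity; caching the scores would be the obvious first optimization.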


Comparison: STAMPA vs. ldSelect

Plot legend: x = STAMPA; the second marker = ldSelect.

52 sets of Yoruba genotypes (Gabriel et al., 2002).


The haplotype ancestral structure of two subtypes of NHL.

The trees are automatically generated by HAP (H., Eskin, 04’).


Phasing

  • Cost effective genotyping technology gives genotypes and not haplotypes.

Haplotypes (mother chromosome and father chromosome): ATCCGA, AGACGC

Genotype: A {T,G} {C,A} C G {A,C}

Possible phases: ATACGA / AGCCGC, AGACGA / ATCCGC, ….
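
The ambiguity can be made explicit by enumerating every phase that is consistent with a genotype. The sketch below assumes a genotype is given as one pair of alleles per site (the function name `possible_phases` and this input format are choices made for the illustration). With h heterozygous sites there are 2^(h-1) distinct unordered phases.

```python
from itertools import product


def possible_phases(genotype):
    """genotype: list of (allele, allele) pairs, one per site.
    Returns every unordered pair of haplotypes consistent with it."""
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    phases = []
    # Keep the first heterozygous site fixed so that each unordered pair of
    # haplotypes is produced exactly once; flip the others in every combination.
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        h1, h2 = [], []
        for i, (a, b) in enumerate(genotype):
            if flip_at.get(i, False):
                a, b = b, a
            h1.append(a)
            h2.append(b)
        phases.append(("".join(h1), "".join(h2)))
    return phases


# The genotype from the slide: three heterozygous sites, hence four phases.
slide_genotype = [("A", "A"), ("T", "G"), ("C", "A"), ("C", "C"), ("G", "G"), ("A", "C")]
phases = possible_phases(slide_genotype)
```

The true pair ATCCGA / AGACGC is one of the four results, and nothing in the genotype alone singles it out.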

Public Genotype Data Growth

  • Daly et al. (Nature Genetics): 103 SNPs, 40,000 genotypes
  • Gabriel et al. (Science): 3000 SNPs, 400,000 genotypes
  • TSC Data (Nucleic Acids Research): 35,000 SNPs, 4,500,000 genotypes
  • NCBI dbSNP (Genome Research): 3,000,000 SNPs, 286,000,000 genotypes
  • Perlegen Data (Science): 1,570,000 SNPs, 100,000,000 genotypes
  • HapMap Phase 2: 5,000,000+ SNPs, 600,000,000+ genotypes

Time axis: 2001 to 2006.

  • HAP’s speed allows it to phase whole-genome datasets.
  • HAP is very accurate (Marchini et al., 2006).

HAP Phasing Model

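  • A directed phylogenetic tree.
  • {0,1} alphabet.
  • Each site mutates at most once.
  • No recombination.
  • Goal: Finding a phase that fits the tree model.
  • Formulation: [Gusfield, 2003]

Example tree, rooted at 00000 (each edge is labeled with the site that mutates):

  • 00000 -(2)-> 01000
  • 01000 -(1)-> 11000
  • 01000 -(5)-> 01001
  • 11000 -(3)-> 11100
  • 11100 -(4)-> 11110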
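
Whether a set of haplotypes can fit such a tree at all can be checked with a standard pairwise test, sketched here. The test used is the well-known condition for a perfect phylogeny rooted at the all-zero haplotype: no two sites may show all three of the combinations 01, 10 and 11. The function name `fits_directed_tree` is hypothetical, and the sketch only decides feasibility; it does not build the tree or phase genotypes.

```python
from itertools import combinations


def fits_directed_tree(haplotypes):
    """True if the 0/1 haplotypes are consistent with a tree rooted at 00...0
    in which every site mutates at most once and there is no recombination."""
    m = len(haplotypes[0])
    forbidden = {("0", "1"), ("1", "0"), ("1", "1")}
    for i, j in combinations(range(m), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        if forbidden <= seen:
            return False  # sites i and j cannot both sit on a single tree
    return True


tree_haplotypes = ["00000", "01000", "11000", "01001", "11100", "11110"]
ok = fits_directed_tree(tree_haplotypes)
bad = fits_directed_tree(["01", "10", "11"])
```

Here `tree_haplotypes` are the node labels of the example tree, so the check passes, while the second call contains the forbidden combination and fails.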


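Example

Tree, rooted at 00000 (each edge is labeled with the site that mutates):

  • 00000 -(2)-> 01000
  • 01000 -(1)-> 11000
  • 01000 -(5)-> 01001
  • 11000 -(3)-> 11100
  • 01001 -(4)-> 01011

Genotypes: 02022, 22200, 21222, 21200, 02000, 01022

Haplotypes: 00000, 01000, 11100, 01011

Given the tree and the haplotypes the phase is unique.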
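
The uniqueness claim can be checked mechanically for this example. The sketch assumes the usual genotype encoding, which the slide does not spell out: 0 and 1 mark sites that are homozygous for that allele, and 2 marks a heterozygous site. The helper names `genotype_of` and `phase_with` are made up for the illustration.

```python
from itertools import combinations_with_replacement


def genotype_of(h1, h2):
    """Genotype produced by two haplotypes: equal alleles stay, unequal become 2."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def phase_with(haplotypes, genotype):
    """All unordered pairs from `haplotypes` that explain `genotype`."""
    return [
        (h1, h2)
        for h1, h2 in combinations_with_replacement(haplotypes, 2)
        if genotype_of(h1, h2) == genotype
    ]


haplotypes = ["00000", "01000", "11100", "01011"]
genotypes = ["02022", "22200", "21222", "21200", "02000", "01022"]
solutions = {g: phase_with(haplotypes, g) for g in genotypes}
```

Each of the six genotypes is explained by exactly one pair of the four haplotypes, for example 21222 by 11100 and 01011.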

Phasing via Greedy
  • A simple heuristic:
    • Find a haplotype that is compatible with as many genotypes as possible.
    • Assign the haplotype for these genotypes.
    • Continue with the rest of the genotypes.
  • Intuition: Haplotypes with missing data.

Haplotypes with missing data

Input:

111*11*1

00*01*1*

01*000*0

11*11*11

*111**00

1111*11*

01*00010

Output:

11111111

00001111

01000010

11111111

11110000

11111111

01000010

Goal: Find a maximum likelihood phase.
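
A minimal sketch of the greedy heuristic on this missing-data input follows. Two simplifications are assumed here: candidate haplotypes are found by brute force over all 2^L binary strings, which is only practical for short toy inputs, and ties as well as positions that stay unconstrained are resolved arbitrarily (in lexicographic order). The function names are hypothetical.

```python
from itertools import product


def compatible(haplotype, row):
    """A complete haplotype fits a row if it agrees at every non-missing site."""
    return all(r in ("*", h) for h, r in zip(haplotype, row))


def greedy_fill(rows):
    """Repeatedly pick the haplotype compatible with the most unresolved rows."""
    length = len(rows[0])
    candidates = ["".join(bits) for bits in product("01", repeat=length)]
    remaining = set(range(len(rows)))
    result = [None] * len(rows)
    while remaining:
        best = max(
            candidates,
            key=lambda h: sum(compatible(h, rows[i]) for i in remaining),
        )
        covered = [i for i in remaining if compatible(best, rows[i])]
        for i in covered:
            result[i] = best
        remaining -= set(covered)
    return result


slide_input = [
    "111*11*1", "00*01*1*", "01*000*0", "11*11*11",
    "*111**00", "1111*11*", "01*00010",
]
filled = greedy_fill(slide_input)
```

On the slide's input the sketch groups the rows the same way as the output shown above: three rows resolve to 11111111 and two to 01000010. For the two rows that stand alone, the missing positions can be filled in any way, so the completions chosen here may differ from the slide's.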

Greedy Analysis (H., Karp, 2005)
  • Maximum likelihood == minimum entropy solution.
  • Entropy(Greedy) < Entropy(OPT) + 3.
  • Can be viewed as a variant of set cover.
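
The link between likelihood and entropy can be checked numerically. If each haplotype h is used n_h times out of n, the likelihood is largest when the haplotype frequencies equal n_h/n, and the resulting log-likelihood is exactly -n times the entropy of that distribution. The sketch below is an illustration under that standard model, with hypothetical function names; it applies both quantities to the resolved haplotypes from the missing-data example.

```python
from collections import Counter
from math import log2


def entropy(haplotypes):
    """Empirical entropy (in bits) of the haplotype distribution."""
    n = len(haplotypes)
    return -sum(c / n * log2(c / n) for c in Counter(haplotypes).values())


def max_log_likelihood(haplotypes):
    """log2-likelihood of the list under its own empirical frequencies."""
    n = len(haplotypes)
    return sum(c * log2(c / n) for c in Counter(haplotypes).values())


resolved = [
    "11111111", "00001111", "01000010", "11111111",
    "11110000", "11111111", "01000010",
]
h = entropy(resolved)
ll = max_log_likelihood(resolved)  # equals -len(resolved) * h
```

Fewer distinct haplotypes, reused more often, give lower entropy and therefore higher likelihood, which is why a heuristic that reuses haplotypes greedily is a natural fit.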

Mother, Father, Child Trios
  • Advantages:
    • Better phasing results (Marchini et al., 06’).
    • Robustness to population stratification (Spielman et al., 93’).
  • Disadvantage:
    • 50% more expensive (and thus, reduces power).


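Inferring Haplotypes From Trios

Genotypes and the haplotypes inferred from them (in the genotypes, 0 and 1 are homozygous sites and 2 is a heterozygous site; ? marks a site that cannot be resolved):

  • Parent 1: genotype 122112, haplotypes 10011? (transmitted) and 11111? (untransmitted)
  • Parent 2: genotype 210022, haplotypes 11000? (transmitted) and 01001? (untransmitted)
  • Child: genotype 120222, haplotypes 10011? (from Parent 1) and 11000? (from Parent 2)

Assumption: No recombination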
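
The site-by-site reasoning behind this inference can be sketched as below. The sketch assumes genotypes are strings over 0, 1 and 2 (0 and 1 for homozygous sites, 2 for heterozygous sites), assumes there are no genotyping errors or Mendelian inconsistencies, and uses hypothetical names throughout.

```python
def phase_trio(parent1, parent2, child):
    """Resolve each parent's transmitted and untransmitted haplotype.
    Sites that cannot be resolved are reported as '?'."""

    def transmitted(parent, other, kid):
        if parent != 2:
            return parent        # homozygous parent: only one allele to pass on
        if kid != 2:
            return kid           # homozygous child: both parents passed this allele
        if other != 2:
            return 1 - other     # het child, other parent known: take the other allele
        return None              # all three heterozygous: ambiguous

    def untransmitted(parent, t):
        if parent != 2:
            return parent
        return None if t is None else 1 - t

    def show(allele):
        return "?" if allele is None else str(allele)

    out = {"p1_t": "", "p1_u": "", "p2_t": "", "p2_u": ""}
    for p1, p2, c in zip(map(int, parent1), map(int, parent2), map(int, child)):
        t1 = transmitted(p1, p2, c)
        t2 = transmitted(p2, p1, c)
        out["p1_t"] += show(t1)
        out["p1_u"] += show(untransmitted(p1, t1))
        out["p2_t"] += show(t2)
        out["p2_u"] += show(untransmitted(p2, t2))
    return out


trio = phase_trio("122112", "210022", "120222")
```

The child's two haplotypes are the two transmitted ones, `trio["p1_t"]` and `trio["p2_t"]`. Only the last site, heterozygous in all three individuals, stays unresolved.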


Configuration                             1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Mother transmitted allele                 A  A  A  A  A  A  A  A  G  G  G  G  G  G  G  G
Mother untransmitted allele               A  A  A  A  G  G  G  G  A  A  A  A  G  G  G  G
Father transmitted allele                 A  A  G  G  A  A  G  G  A  A  G  G  A  A  G  G
Father untransmitted allele               A  G  A  G  A  G  A  G  A  G  A  G  A  G  A  G
Father and Child pool – allele frequency  0  1  2  3  0  1  2  3  1  2  3  4  1  2  3  4
Mother and Child pool – allele frequency  0  0  1  1  1  1  2  2  2  2  3  3  3  3  4  4

  • Every configuration yields a different pair of pool values, except for configurations 7 and 10 (het-het-het).
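
The table and the claim about configurations 7 and 10 can be reproduced by enumeration. The sketch assumes that each pool value is the number of G alleles among the four alleles of the two pooled individuals, and that configurations are numbered in the order mother transmitted, mother untransmitted, father transmitted, father untransmitted, with A before G. Both assumptions match the table; the function name is hypothetical.

```python
from collections import defaultdict
from itertools import product


def pool_collisions():
    """Group the 16 trio configurations by their pair of pool values and
    return only the pairs shared by more than one configuration."""
    g = {"A": 0, "G": 1}
    by_counts = defaultdict(list)
    for idx, (mt, mu, ft, fu) in enumerate(product("AG", repeat=4), start=1):
        child = g[mt] + g[ft]                     # child carries both transmitted alleles
        father_child = g[ft] + g[fu] + child      # G count in the father + child pool
        mother_child = g[mt] + g[mu] + child      # G count in the mother + child pool
        by_counts[(father_child, mother_child)].append(idx)
    return {k: v for k, v in by_counts.items() if len(v) > 1}


collisions = pool_collisions()  # {(2, 2): [7, 10]}
```

In both colliding configurations the mother, the father and the child are all heterozygous, so the two pools cannot tell which allele each parent passed on.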

Genotyping Unrelated Individuals

Edge size ↔ pool size (accuracy)

Vertex degree ↔ amount of DNA used

For every m, what is the largest n, so that m equations uniquely determine the n {0,1,2} variables?

Formally: for every $m$, what is the largest $n$ for which

$\exists A \in \{0,1\}^{m \times n}$ s.t. $\forall x, x' \in \{0,1,2\}^n$: $Ax = Ax' \Rightarrow x = x'$
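
For small cases the condition can be tested by brute force. The sketch below, with the hypothetical name `uniquely_determines`, takes a 0/1 pooling matrix as a list of rows and checks that no two genotype vectors over {0,1,2} give the same pool measurements. It enumerates all 3^n vectors, so it is only meant for tiny n.

```python
from itertools import product


def uniquely_determines(A):
    """True if x -> Ax is injective on {0,1,2}^n for the 0/1 matrix A."""
    n = len(A[0])
    seen = set()
    for x in product((0, 1, 2), repeat=n):
        y = tuple(sum(a * v for a, v in zip(row, x)) for row in A)
        if y in seen:
            return False  # two different genotype vectors give the same pools
        seen.add(y)
    return True


one_pool_each = uniquely_determines([[1, 0], [0, 1]])  # True: one pool per individual
single_pool = uniquely_determines([[1, 1]])            # False: (0, 1) and (1, 0) collide
```

The interesting regime is the one the question asks about, where the number of pools m is smaller than the number of individuals n.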

Lower Bound
  • A random matrix $A$.
    • For every $x \in \{-2,-1,0,1,2\}^n$, $A_i x = 0$ with prob. $O(k^{-0.5})$, where $k$ is the number of non-zero elements of $x$.
    • Since the rows are independent, the probability that $Ax = 0$ is $O(k^{-m/2})$.
    • Using the union bound, $n = \Omega(m \log m)$.

Upper Bound
  • Counting argument:
    • There are at most $(2n)^m$ different values that $Ax$ can take.
    • There are $3^n$ values for $x$.
    • $3^n < (2n)^m$ and so $n = O(m \log m)$.
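
The last step of the counting argument can be spelled out by taking logarithms. This is a sketch of the standard calculation; the constant $c$ is introduced only for this illustration.

```latex
3^n < (2n)^m
\;\Longrightarrow\; n \log 3 < m \log (2n)
\;\Longrightarrow\; \frac{n}{\log (2n)} < \frac{m}{\log 3}
% n / \log(2n) is increasing in n, and at n = c\, m \log m it exceeds
% m / \log 3 once c is a large enough constant, hence:
\;\Longrightarrow\; n = O(m \log m)
```

Together with the lower bound this pins the answer down to order $m \log m$.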

Further Challenges
  • Population stratification
    • In case/control studies and in family based studies.
    • Admixed populations.
  • Other pooling schemes
    • Practical considerations: error rates, missing data, scalability, etc.
  • Inferring evolutionary processes (e.g. selection, recombination rate, haplotype ancestry, etc.).

Summary
  • Exciting times in genetics: changes in medicine may be felt in our lifetime.
    • An opportunity for Computer Scientists to have a huge impact.
  • Interdisciplinary work is needed. It involves computer science, statistics, genetics, biology, and medicine.


Acknowledgement

  • UCSD: Eleazar Eskin
  • Tel-Aviv U.: Ron Shamir, Gad Kimmel, Noga Alon
  • HIIT: Matti Kaariainen
  • Sequenom Inc.: Andreas Braun, Ken Abel
  • Perlegen Sciences: David Hinds, David Cox
  • UC Berkeley: Richard Karp, Chris Skibola
  • MPI: Rene Beier
  • CHORI: Kenny Beckman