Computational and Statistical Challenges in Association Studies

Eleazar Eskin

University of California, Los Angeles

The Human Genome Project

“What we are announcing today is that we have reached a milestone…that is, covering the genome in…a working draft of the human sequence.”

“I would be willing to make a prediction that within 10 years, we will have the potential of offering any of you the opportunity to find out what particular genetic conditions you may be at increased risk for…”

Washington, DC

June 26, 2000.

Human Genetics

Disease Risk
  • “Genetic” factors account for 20%-80% of disease risk.
  • Many genes contribute to “complex” diseases.

[Figure: risk factors passed from mother and father to child.]

  • Personalized Medicine
    • Treatment decisions influenced by diagnostics
  • Understanding Disease Biology
    • New drug targets.
    • Understanding of mechanism of disease.

Where are the risk factors?

(Genetic Basis of Disease)

Disease Association Studies: The Search for Genetic Factors
  • Comparing the DNA contents of two populations:
    • Cases - individuals carrying the disease.
    • Controls - background population.

Differences within a gene between the two populations are evidence that the gene is involved in the disease.

Single Nucleotide Polymorphisms (SNPs)

AGAGCCGTCGACAGGTATAGCCTA

AGAGCCGTCGACATGTATAGTCTA

AGAGCAGTCGACAGGTATAGTCTA

AGAGCAGTCGACAGGTATAGCCTA

AGAGCCGTCGACATGTATAGCCTA

AGAGCAGTCGACATGTATAGCCTA

AGAGCCGTCGACAGGTATAGCCTA

AGAGCCGTCGACAGGTATAGCCTA

  • Human Variation
    • Humans differ by 0.1% of their DNA.
    • A significant fraction of this variation is accounted for by SNPs.
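
As a small illustration of what a SNP is, the eight aligned sequences on this slide can be scanned column by column for positions where individuals differ. This is only a sketch: the function name `find_snp_positions` and the 0-based position convention are assumptions made for the example, not part of the talk.

```python
def find_snp_positions(sequences):
    """Return the 0-based positions at which the aligned sequences differ."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must all have the same length")
    # zip(*sequences) walks the alignment one column at a time.
    return [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]


# The eight sequences shown on the slide.
SEQUENCES = [
    "AGAGCCGTCGACAGGTATAGCCTA",
    "AGAGCCGTCGACATGTATAGTCTA",
    "AGAGCAGTCGACAGGTATAGTCTA",
    "AGAGCAGTCGACAGGTATAGCCTA",
    "AGAGCCGTCGACATGTATAGCCTA",
    "AGAGCAGTCGACATGTATAGCCTA",
    "AGAGCCGTCGACAGGTATAGCCTA",
    "AGAGCCGTCGACAGGTATAGCCTA",
]

snp_positions = find_snp_positions(SEQUENCES)
print(snp_positions)
```

On these sequences the scan reports three variable positions; every other column is identical across individuals.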
Single Nucleotide Polymorphisms Association Analysis

Associated SNP

Cases: (Individuals with the disease)

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC

AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC

AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC

AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC

AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC

Controls: (Healthy individuals)

AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC

AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC

AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC

AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC

AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC

AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC

AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC

AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC

Single Nucleotide Polymorphisms Association Analysis

Challenges:
  • Millions of common SNPs
  • Correlations between SNPs

False positives

Cases:

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

Controls:

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTA

Single Nucleotide Polymorphisms (SNPs)

[Same case and control sequences as on the previous slide.]

Challenges:
  • Millions of common SNPs
  • Correlations between SNPs
  • SNP locations unknown

False positives

Successor to the Human Genome Project
  • International consortium that aims to genotype the genomes of 270 individuals from four different populations.
  • Launched in 2002. First phase was finished in October (Nature, 2005).
  • Collected genotypes for 3.9 million SNPs.
  • Location and correlation structure of many common SNPs.
Public Genotype Data Growth

[Chart: public genotype datasets on a timeline from 2001 to 2006.]
  • Daly et al. (Nature Genetics): 103 SNPs, 40,000 genotypes
  • Gabriel et al. (Science): 3,000 SNPs, 400,000 genotypes
  • TSC Data (Nucleic Acids Research): 35,000 SNPs, 4,500,000 genotypes
  • Perlegen Data (Science): 1,570,000 SNPs, 100,000,000 genotypes
  • NCBI dbSNP (Genome Research): 3,000,000 SNPs, 286,000,000 genotypes
  • HapMap Phase 2: 5,000,000+ SNPs, 600,000,000+ genotypes

Why the growth matters:
  • More SNPs increase genome coverage in association studies.
  • More genotypes allow for discovery of weaker associations.
Some Computational Challenges

(Tools covered in this talk: HAP, SAT Tagger, WHAP.)
  • Genetics - identifying disease genes
    • Haplotype phasing - preprocessing SNPs
    • Association study design
    • Association study analysis
    • Population stratification
    • Inferring evolutionary processes (recombination rates, selection, haplotype ancestry).
    • Etc…
  • Genomics - functions of disease genes
    • Predicting functional effect of variation
    • Understanding disease effect on gene regulation
    • Understanding disease effect on metabolic pathways
    • Combining systems biology with genetics
    • Etc…
Haplotype Phasing

  • High-throughput, cost-effective sequencing technology gives genotypes and not haplotypes.

Haplotypes (mother chromosome / father chromosome):
  ATACGA / AGCCGC

Genotype (unordered alleles at each site):
  A {G,T} {A,C} C G {A,C}

Possible phases:
  ATACGA / AGCCGC
  AGACGA / ATCCGC
  ATCCGA / AGACGC
  ….
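
To make the ambiguity concrete, the sketch below enumerates every haplotype pair compatible with a genotype. It assumes the 0/1/2 encoding used in the example phasing slide later in the talk (0 and 1 are homozygous sites, 2 is a heterozygous site); the function name `possible_phases` is hypothetical.

```python
from itertools import product


def possible_phases(genotype):
    """Enumerate the unordered haplotype pairs compatible with a genotype.

    genotype is a string over {'0', '1', '2'}; '2' marks a heterozygous site.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == "2"]
    if not het_sites:
        return [(genotype, genotype)]
    phases = []
    # Fix the first heterozygous site to '1' on the first haplotype so that
    # each unordered pair is produced exactly once.
    for bits in product("01", repeat=len(het_sites) - 1):
        first = list(genotype)
        second = list(genotype)
        for site, bit in zip(het_sites, ("1",) + bits):
            first[site] = bit
            second[site] = "0" if bit == "1" else "1"
        phases.append(("".join(first), "".join(second)))
    return phases


for pair in possible_phases("22000001"):
    print(pair)
```

A genotype with k heterozygous sites has 2^(k-1) possible phases, which is why phasing long regions by enumeration is not feasible.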

Haplotype Limited Diversity
  • Previous studies on local haplotype structure:
    • (Daly et al., 2001) chromosome 5q31.
    • (Patil et al., 2001) chromosome 21.
  • Study findings:
    • The SNPs on each haplotype are correlated.
    • SNPs can be separated into blocks of limited diversity.
    • Local regions have few haplotypes.
Haplotype Data in a Block

(Daly et al., 2001) Block 6 from Chromosome 5q31

Example Phasing

Each haplotype is followed by the number of times it appears in that resolution.

Genotype   1st possible resolution       2nd possible resolution
22222222   11111110 (2) / 00000001 (7)   11100110 (1) / 00011001 (1)
22000001   11000001 (3) / 00000001 (7)   10000001 (2) / 01000001 (2)
22022002   11011000 (2) / 00000001 (7)   01011001 (1) / 10000000 (1)
22222222   11111110 (2) / 00000001 (7)   10101110 (1) / 01010001 (1)
22000001   11000001 (3) / 00000001 (7)   11000001 (1) / 00000001 (1)
22022002   11011000 (2) / 00000001 (7)   11001000 (1) / 00010001 (1)
22000001   11000001 (3) / 00000001 (7)   01000001 (2) / 10000001 (2)

Maximum Likelihood Criterion

Maximum likelihood haplotype inference is an NP-hard problem.
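
The two resolutions above can be compared numerically. The sketch below uses one common form of the maximum likelihood criterion, assumed here for illustration and not necessarily the exact objective used in the talk: haplotype frequencies are estimated from the resolution itself, and the log-likelihood is the sum of log frequencies over all haplotype copies (a constant factor for heterozygous pairs is the same in both resolutions and is omitted).

```python
from collections import Counter
from math import log

# Haplotype pairs for the seven genotypes, as listed on the slide.
FIRST = [
    ("11111110", "00000001"), ("11000001", "00000001"), ("11011000", "00000001"),
    ("11111110", "00000001"), ("11000001", "00000001"), ("11011000", "00000001"),
    ("11000001", "00000001"),
]
SECOND = [
    ("11100110", "00011001"), ("10000001", "01000001"), ("01011001", "10000000"),
    ("10101110", "01010001"), ("11000001", "00000001"), ("11001000", "00010001"),
    ("01000001", "10000001"),
]


def log_likelihood(resolution):
    """Sum of log estimated haplotype frequencies over all haplotype copies."""
    haplotypes = [h for pair in resolution for h in pair]
    counts = Counter(haplotypes)
    total = len(haplotypes)
    return sum(log(counts[h] / total) for h in haplotypes)


print(log_likelihood(FIRST), log_likelihood(SECOND))
```

Under this criterion the first resolution, which reuses a small number of haplotypes, scores higher than the second, which needs twelve distinct haplotypes.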

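Narrowing the Search: Perfect Phylogeny

  • A directed phylogenetic tree.
  • {0,1} alphabet.
  • Each site mutates at most once.
  • No recombination.

Example tree, rooted at 00000 (each edge is labeled with the site that mutates):
  00000 → 01000 (site 2)
  01000 → 11000 (site 1)
  01000 → 01001 (site 5)
  11000 → 11100 (site 3)
  11100 → 11110 (site 4)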
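
Whether a set of binary haplotypes fits such a tree can be checked with the standard pairwise test for a directed perfect phylogeny rooted at the all-zero haplotype: no pair of sites may show all three of the combinations 01, 10, and 11. The sketch below applies that test; the function name is hypothetical.

```python
from itertools import combinations

FORBIDDEN = {("0", "1"), ("1", "0"), ("1", "1")}


def admits_directed_perfect_phylogeny(haplotypes):
    """Return True if the haplotypes fit a perfect phylogeny rooted at all zeros."""
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        observed = {(h[i], h[j]) for h in haplotypes}
        # Seeing 01, 10 and 11 together means one of the two sites would
        # have to mutate more than once.
        if FORBIDDEN <= observed:
            return False
    return True


TREE_HAPLOTYPES = ["00000", "01000", "11000", "01001", "11100", "11110"]
print(admits_directed_perfect_phylogeny(TREE_HAPLOTYPES))
```

Adding a haplotype such as 10001 to the example would break the condition, since sites 1 and 2 would then show 01, 10, and 11.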

The Perfect Phylogeny Haplotype Problem (PPH)
  • Given genotypes over a short region.
  • Find compatible haplotypes which correspond to a perfect phylogeny tree.
  • [Gusfield 02’].
  • PPH deficiency – the data does not fit the model.
Solving PPH
  • A very simple $O(nm^2)$ algorithm for the PPH problem. (Also Gusfield 02, Bafna et al., 2003)

But – in practice, we do not expect to see perfect phylogeny in biological data.

We extend our algorithms to the case where the data is almost perfect phylogeny.

Eskin, Halperin, Karp, "Large Scale Reconstruction of Haplotypes from Genotype Data." RECOMB 2003.

HAP Algorithm
  • HAP Local Predictions
    • http://research.calit2.net/hap/
    • Over 6,000 users of webserver.
  • Main Ideas:
    • Imperfect Phylogeny
    • Maximum Likelihood Criterion
  • Extremely efficient.
    • Orders of magnitude faster than other algorithms.


Public Genotype Data Growth

[Chart repeated from the earlier Public Genotype Data Growth slide, with the HAP timeline added: Eskin, Halperin, Karp, RECOMB 2003.]

Phasing Methods
  • HAP is one of many phasing algorithms.
    • Clark, 1990, Excoffier and Slatkin, 1995, PHASE – Stephens et al., 2001, HAPLOTYPER - Niu et al., 2002. Gusfield, 2000, Lancia et al. 2001. Many more…
  • Algorithms were designed for only 4-12 SNPs!
  • How do we phase entire chromosomes?
  • HAP “tiling” extension: phasing for long regions.
  • Leverages the speed of HAP.
Scaling to Whole Genomes: HAP-TILE

genotypes

Local predictions

  • For each window we compute the haplotypes using HAP
  • We tile the windows using dynamic programming
Haplotype Tiling Problem

(ignoring homozygous positions)

Local predictions (one haplotype pair per window):
  001000 / 110111
  010000 / 101111
  011111 / 100000
  000101 / 111010
  000011 / 111100
  100110 / 011001

Tiled haplotypes (minimum number of conflicts):
  00100000110 / 11011111001

  • NP-Hard Problem
  • Dynamic Programming Solution
    • (Eskin et al. 2004.)
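
The objective on this slide can be checked by hand. The sketch below only scores a candidate tiling against the local predictions; it is not the dynamic programming algorithm of HAP-TILE. It assumes that the six windows start at consecutive positions 0 through 5, which is an inference from the window and result lengths, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def tiling_conflicts(haplotype, windows, offsets):
    """Count disagreements between a tiled haplotype and local predictions.

    Each window is a pair of complementary haplotypes, so the better
    matching member of the pair is used.
    """
    total = 0
    for (h1, h2), start in zip(windows, offsets):
        segment = haplotype[start:start + len(h1)]
        total += min(hamming(segment, h1), hamming(segment, h2))
    return total


WINDOWS = [
    ("001000", "110111"), ("010000", "101111"), ("011111", "100000"),
    ("000101", "111010"), ("000011", "111100"), ("100110", "011001"),
]

conflicts = tiling_conflicts("00100000110", WINDOWS, range(6))
print(conflicts)
```

Under these assumptions the tiled haplotype from the slide disagrees with the local predictions at two positions, one in the fourth window and one in the sixth.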
Public Genotype Data Growth

[Chart repeated from the earlier Public Genotype Data Growth slide, annotated with running times of 12 hours, 24 hours, and 48 hours. HAP timeline: Eskin, Halperin, Karp, RECOMB 2003; Perlegen collaboration; NCBI dbSNP collaboration.]

RECOMB 2003 Submission

Only 103 SNPs, 0.02% of the genome!

Association Statistics
  • Assume we are given N/2 cases and N/2 control individuals.
  • Since each individual has 2 chromosomes, we have a total of N case chromosomes and N control chromosomes.
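  • At SNP A, let $\hat{p}_A^+$ and $\hat{p}_A^-$ be the observed case and control frequencies respectively, and let $p_A^+$ and $p_A^-$ be the true frequencies.
  • We know that:

$\hat{p}_A^+ \sim N\left(p_A^+,\; p_A^+(1-p_A^+)/N\right)$

$\hat{p}_A^- \sim N\left(p_A^-,\; p_A^-(1-p_A^-)/N\right)$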

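Association Statistics

$\hat{p}_A^+ \sim N\left(p_A^+,\; p_A^+(1-p_A^+)/N\right)$

$\hat{p}_A^- \sim N\left(p_A^-,\; p_A^-(1-p_A^-)/N\right)$

$\hat{p}_A^+ - \hat{p}_A^- \sim N\left(p_A^+ - p_A^-,\; \left(p_A^+(1-p_A^+) + p_A^-(1-p_A^-)\right)/N\right)$

We approximate

$p_A^+(1-p_A^+) + p_A^-(1-p_A^-) \approx 2\hat{p}_A(1-\hat{p}_A)$

then, if $p_A^+ = p_A^-$, the difference $\hat{p}_A^+ - \hat{p}_A^-$ has mean 0 and variance approximately $2\hat{p}_A(1-\hat{p}_A)/N$.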

Association Statistic
  • Under the null hypothesis, $p_A^+ - p_A^- = 0$.
  • We compute the statistic $S_A$.
  • If $S_A < \Phi^{-1}(\alpha/2)$ or $S_A > -\Phi^{-1}(\alpha/2)$, then the association is significant at level $\alpha$.
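
The test can be sketched in a few lines. The definition of $S_A$ used here is an assumption reconstructed from the normal approximation on the preceding slides (the frequency difference divided by its approximate standard deviation), and the pooled frequency is taken as the average of the case and control frequencies, which presumes equal numbers of case and control chromosomes. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def association_statistic(case_freq, control_freq, n):
    """Standardized case/control allele frequency difference.

    n is the number of case chromosomes (equal to the number of control
    chromosomes in the setup described above).
    """
    pooled = (case_freq + control_freq) / 2  # assumed estimate of p_A
    return (case_freq - control_freq) / sqrt(2 * pooled * (1 - pooled) / n)


def is_significant(statistic, alpha=0.05):
    """Two-sided test against the standard normal distribution."""
    threshold = NormalDist().inv_cdf(alpha / 2)  # a negative number
    return statistic < threshold or statistic > -threshold


s_a = association_statistic(0.30, 0.20, 1000)
print(round(s_a, 3), is_significant(s_a))
```

With 1,000 chromosomes per group and frequencies of 0.30 and 0.20, the statistic is a little over 5 and is significant at the 0.05 level.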

Association Power
  • Let's assume that SNP A is causal and $p_A^+ \neq p_A^-$.
  • Given the true $p_A^+$ and $p_A^-$, if we collect N individuals and compute the statistic $S_A$, the probability that $S_A$ is significant at level $\alpha$ is the power.
  • Power is the chance of detecting an association of a certain strength with a certain number of individuals.
Association Statistic
  • Let's assume that $p_A^+ \neq p_A^-$; then the mean of $S_A$ is no longer zero.
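
The equation for this slide did not survive in the transcript. A reconstruction that follows from the normal approximation on the earlier slides is sketched below; the symbol $\lambda_A$ and the exact form of $S_A$ are assumptions, chosen to be consistent with the later references to a non-centrality parameter.

$$
S_A = \frac{\hat{p}_A^+ - \hat{p}_A^-}{\sqrt{2\hat{p}_A(1-\hat{p}_A)/N}}
\;\sim\; N\!\left(\lambda_A\sqrt{N},\; 1\right),
\qquad
\lambda_A = \frac{p_A^+ - p_A^-}{\sqrt{2p_A(1-p_A)}}
$$

The mean follows from dividing the expected difference $p_A^+ - p_A^-$ by the approximate standard deviation $\sqrt{2p_A(1-p_A)/N}$, and it reduces to the null distribution $N(0,1)$ when $p_A^+ = p_A^-$.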
Association Power

[Figure: power of the association test, showing the threshold for significance and the non-centrality parameter.]

Association Power
  • Statistical power of an association with N individuals, non-centrality parameter $\lambda\sqrt{N}$, and significance threshold $\alpha$ is written $P(\alpha, \lambda\sqrt{N})$.
  • Note that if $\lambda = 0$, power is always $\alpha$.
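
The closed form of the power function is missing from the transcript. The sketch below computes two-sided power under the assumption that the statistic is normally distributed with unit variance and mean equal to the non-centrality parameter; this is a reconstruction, and the function name `power` is hypothetical.

```python
from statistics import NormalDist


def power(alpha, ncp):
    """Two-sided power of a unit-variance normal test with mean ncp."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(alpha / 2)  # lower rejection threshold (negative)
    # Probability of falling below z or above -z when the mean is ncp,
    # written in the symmetric form with +ncp.
    return std_normal.cdf(z + ncp) + 1 - std_normal.cdf(-z + ncp)


for ncp in (0.0, 1.0, 2.0, 3.0):
    print(ncp, round(power(0.05, ncp), 3))
```

With a non-centrality parameter of zero the function returns the significance threshold itself, matching the note on the slide, and power rises toward 1 as the non-centrality parameter grows.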
Indirect Association
  • Now let's assume that we have 2 markers, A and B. Let us assume that marker B is the causal mutation, but we are observing marker A.
  • If we observed marker B directly our statistic would be
Indirect Association
  • However, we are observing A where our statistic is
  • What is the relation between SA and SB?
Indirect Association
  • We want to relate
  • to
Indirect Association
  • We assume conditional probability distributions are equal in case and control samples
Indirect Association
  • How many individuals, $N_A$, do we need to collect at marker A to achieve the same power as if we collected $N_B$ individuals at marker B?
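
The answer given later in the talk for the haplotype case is a factor of $1/r_h^2$. For two markers, the analogous standard result is $N_A = N_B / r^2$, where $r^2$ is the squared correlation between the markers. The sketch below computes $r^2$ from haplotype and allele frequencies and applies that relation; the function names and the example frequencies are made up for illustration.

```python
def r_squared(p_ab, p_a, p_b):
    """Squared correlation between two biallelic markers.

    p_ab is the frequency of the haplotype carrying both reference alleles;
    p_a and p_b are the reference allele frequencies at each marker.
    """
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def individuals_needed(n_b, r2):
    """Sample size at the observed marker that matches n_b at the causal one."""
    return n_b / r2


r2 = r_squared(p_ab=0.4, p_a=0.5, p_b=0.5)
print(round(r2, 2), round(individuals_needed(1000, r2)))
```

With these example frequencies $r^2$ is 0.36, so roughly 2,778 individuals at marker A would be needed to match 1,000 individuals at marker B.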
Visualization in terms of Power

[Figure: power of the association test, showing the threshold for significance and the non-centrality parameters.]

Correlating Haplotypes with the Disease
  • The disease may be correlated with a SNP not in the panel.
  • The disease may be more correlated with a haplotype (group of SNPs) than with any single SNP in the panel.
  • Haplotype tests:
    • Which haplotypes should we test?
    • Which blocks should we pick?
Key Problem: Indirect Association
  • We have the HapMap.
    • Information on 4,000,000 SNPs.
  • Affymetrix gene chip collects information on 500,000 SNPs.
  • What about the remaining 3,500,000 SNPs?
  • So far, we have designed studies by picking tag SNPs with high $r^2$.
  • Can we use the HapMap when performing association?
    • Multi-Tag methods.
Basic MultiMarker Method
  • For each SNP in HapMap, find the haplotype among genotyped SNPs that has the highest $r^2$ to the SNP.
  • Perform association at each SNP and each added haplotype.
  • Now instead of performing 500,000 tests, we perform 4,000,000 tests.
Weighted Haplotype Test
  • For each haplotype h, we assign a weight wh
  • We use a “weighted” allele frequency statistic:
  • This statistic is the weighted numerator in SA.
  • What is the variance of this statistic?
    • Complication: Haplotype frequencies are not independent!
Weighted Haplotype Example
  • Assume we have 4 haplotypes AB, Ab, aB and ab.
  • If we set the weights so that wAB=wAb=1 and waB=wab=0, this is equivalent to looking at the single SNP A.
  • If we set the weights so that wAB=1 and wAb=waB=wab=0, this is equivalent to looking at the single haplotype AB.
  • Other weights can be something in between.
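
The example can be verified numerically. The formula for the weighted statistic is missing from the transcript, so the sketch below assumes the natural form suggested by the text: the sum over haplotypes of the weight times the case minus control frequency difference. The haplotype frequencies are invented for the example.

```python
def weighted_difference(weights, case_freqs, control_freqs):
    """Weighted sum of case minus control haplotype frequency differences."""
    return sum(w * (case_freqs[h] - control_freqs[h]) for h, w in weights.items())


# Hypothetical haplotype frequencies in cases and controls.
CASES = {"AB": 0.4, "Ab": 0.2, "aB": 0.1, "ab": 0.3}
CONTROLS = {"AB": 0.3, "Ab": 0.2, "aB": 0.2, "ab": 0.3}

single_snp_a = {"AB": 1, "Ab": 1, "aB": 0, "ab": 0}
single_haplotype_ab = {"AB": 1, "Ab": 0, "aB": 0, "ab": 0}

print(weighted_difference(single_snp_a, CASES, CONTROLS))
print(weighted_difference(single_haplotype_ab, CASES, CONTROLS))
```

With the first set of weights the result equals the allele frequency difference at SNP A (0.6 in cases against 0.5 in controls), and with the second it equals the frequency difference of haplotype AB alone.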
The -test
  • Each haplotype h is assigned a weight wh.
    • N is the number of individuals.
    • $p_h$ - the probability for h in cases/controls, or average.
    • Under the null, the test statistic is $\chi^2$ distributed.
Non-Centrality Parameter
  • Under weights w1,w2,w3,w4 and true case/control probabilities p1+,p2+,p3+,p4+ and p1-,p2-,p3-,p4-, Wh is expected to be
  • When normalizing for the variance, the non-centrality parameter is
$W_h$ and indirect association
  • Let us assume that SNP C is causal with non-centrality parameter $\lambda_C$.
  • If we perform weighted haplotype association, the non-centrality parameter is $\lambda_h$.
  • How are they related? (i.e., what is the power of the weighted haplotype association test?)
  • Using the same technique, we can show that $\lambda_h = r_h \lambda_C$, where $r_h$ is the conceptual equivalent of r in the two-SNP case.
The Relation to Power

The power of detecting the SNP with N individuals is the same as using the tag SNPs with $N/r_h^2$ individuals.

Choosing the Weights

Optimal weights:

wh(s5) = P(s5 = ‘A’ | h) = qAh
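
One way to obtain such weights is to estimate the conditional probability from a reference panel in which both the tag haplotype and the untyped SNP are known. The sketch below does this by simple counting; the data layout, the function name `estimate_weights`, and the toy reference panel are assumptions for illustration.

```python
from collections import Counter


def estimate_weights(reference, allele="A"):
    """Estimate w_h = P(untyped SNP carries `allele` | tag haplotype h).

    reference is a list of (tag_haplotype, allele_at_untyped_snp) pairs,
    one per reference chromosome.
    """
    totals = Counter(h for h, _ in reference)
    hits = Counter(h for h, a in reference if a == allele)
    return {h: hits[h] / totals[h] for h in totals}


# Toy reference panel: tag haplotype and allele at the untyped SNP.
REFERENCE = [("GT", "A"), ("GT", "A"), ("GT", "C"), ("CT", "C"), ("CA", "A")]

weights = estimate_weights(REFERENCE)
print(weights)
```

A haplotype that always carries the allele gets weight 1, one that never carries it gets weight 0, and haplotypes that only partly predict the allele get intermediate weights.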

The Relation to Power
  • This is exactly $r^2$ in the case of one tag SNP.
  • WHAP always has at least as much power as:
    • single SNP test
    • single haplotype test
    • haplotype group test
    • $\chi^2$ with k degrees of freedom.
  • Cases: 0.5M SNPs. Controls: 0.5M SNPs.
  • HapMap: 4M SNPs. Use as training dataset to get the weights.
  • Apply tests: T1,…,T4M.
  • Positive results give evidence for a causal SNP; this can be verified by a follow-up/two-stage study.

Power Simulations
  • Relative power to using all SNPs.
  • Tested on the ENCODE regions, Affy 500k tag SNPs.
Practical Issues
  • We assume we have the haplotype frequencies in the HapMap (not the phase).
  • We assume the case/control populations are coming from the same population as the HapMap.
  • Over-fitting:
    • Train with half of the data, test the other half.
    • No correlation between the haps and random SNPs.
Associations using WHAP. Red lines are associations at collected SNPs.

Blue lines are associations at uncollected SNPs inferred by WHAP.

Satisfiability and SAT Solvers
  • Boolean variables; a literal is a variable or its negation
  • Logical operators
    • AND ∧
    • OR ∨
    • NOT ¬
  • Example:
    • (s1 ∨¬s2) ∧ (s2 ∨ s3∨ s1)
    • s1 = false; s2 = false; s3 = true
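
The example formula and assignment can be checked mechanically. In the sketch below a formula in conjunctive normal form is a list of clauses, and each literal is a hypothetical (variable, negated) pair; this representation is chosen for the example only.

```python
def evaluate_cnf(clauses, assignment):
    """Return True if every clause contains at least one true literal."""
    return all(
        any(assignment[var] != negated for var, negated in clause)
        for clause in clauses
    )


# (s1 OR NOT s2) AND (s2 OR s3 OR s1)
FORMULA = [
    [("s1", False), ("s2", True)],
    [("s2", False), ("s3", False), ("s1", False)],
]
ASSIGNMENT = {"s1": False, "s2": False, "s3": True}

print(evaluate_cnf(FORMULA, ASSIGNMENT))
```

The assignment from the slide satisfies both clauses: the first through NOT s2 and the second through s3.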
Negation Normal Form

[Figure: a rooted DAG (circuit) of "or" and "and" nodes over the literals A, B, ¬B, C, ¬C, D, ¬D. Credit: A. Darwiche.]

Local Single SNP $r^2$ Tagging
  • Generate a clause for each SNP
    • Clause for SNP si contains all covers
  • Input CNF as conjunction of all clauses
  • Compile with minSAT solver
  • Find solutions by traversal of NNF
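
The encoding can be sketched on a toy example. Each SNP gets a clause listing the SNPs that cover it, taken here to mean $r^2$ at or above a threshold, and a tag set is valid if it satisfies every clause. The 0.8 threshold, the function names, and the matrix are assumptions, and a brute-force search stands in for the solver compilation and NNF traversal named on the slide.

```python
from itertools import combinations


def tagging_clauses(r2, threshold=0.8):
    """One clause per SNP: the set of SNPs whose r^2 to it meets the threshold."""
    n = len(r2)
    return [{j for j in range(n) if r2[i][j] >= threshold} for i in range(n)]


def minimum_tag_set(clauses):
    """Smallest set of SNPs that intersects every clause (exhaustive search)."""
    n = len(clauses)
    for size in range(1, n + 1):
        for tags in combinations(range(n), size):
            chosen = set(tags)
            if all(clause & chosen for clause in clauses):
                return chosen
    return set()


# Toy pairwise r^2 matrix for four SNPs forming two correlated pairs.
R2 = [
    [1.00, 0.90, 0.10, 0.00],
    [0.90, 1.00, 0.20, 0.10],
    [0.10, 0.20, 1.00, 0.85],
    [0.00, 0.10, 0.85, 1.00],
]

print(minimum_tag_set(tagging_clauses(R2)))
```

Here one tag from each correlated pair is enough. Exhaustive search is exponential in the number of SNPs, which is the motivation for handing the same clauses to a SAT-based solver.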
UCLA:

Adnan Darwiche

Arthur Choi

Knot Pipatsrisawat

ICSI:

Eran Halperin

Richard Karp

Perlegen Sciences:

David Hinds

David Cox

Ph.D. Students:

Buhm Han

Nils Homer

Hyun Min Kang

Sean O’Rourke

Jimmie Ye

Noah Zaitlen

Webserver Hosted By:
