Roadmap - PowerPoint PPT Presentation





PowerPoint Slideshow about ' Roadmap' - karl


Presentation Transcript
Roadmap
  • Discovering Patterns
    • Structure-preserving patterns
      • Strings, Networks
    • Permuting patterns
      • Combinatorics
      • Algorithmics
      • Statistics
  • Analyzing Patterns
    • Genographic Project
    • LD Patterns
      • Then &
      • Now (IRIS)
Who? National Geographic and IBM, in a five-year study launched in April 2005

What? Although fossil records fix human origins in Africa, little is known about the great journey that took Homo sapiens to the far reaches of the earth. How did we, each of us, end up where we are?

How? Using genetics as a tool: samples are being collected all around the world, and the mtDNA and the NRY (non-recombining region of the Y chromosome) are being analyzed

phylogeographic question

Public Participation

Over 250,000 public participants to date (April, 2008)

www.nationalgeographic.com/genographic

www.ibm.com/genographic

www.ibm.com/dna

How?
  • Each of us carries ancestral material, marked by signatures due to imperfections in DNA replication
      • SNPs (Single Nucleotide Polymorphisms)
      • STR numbers (Short Tandem Repeats)
      • Inversions
      • etc.
  • Uni-parental Model (topology=tree)
      • Non-recombining segments of genome
mtDNA Micro-Phylogeny Tree

22 (coding-region) SNPs

The Genographic Project Public Participation MtDNA Database, Behar et al., PLoS Genetics, 2007

Locus

16,000 bp

58 million bp

0.38%

Bi-parental Model

past

GMRCA

MRCA

present

What is recombination?

Genetic recombination is the process by which a strand of DNA is broken and then joined to the end of a different DNA molecule.

It occurs during meiosis, between paired chromosomes. This process leads to offspring having different combinations of genes from their parents.
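
The break-and-join step can be sketched on toy haplotypes. This is a minimal illustration, assuming a single crossover point and haplotypes written as equal-length strings; the function name `recombine` and the parameter `cut` are hypothetical and do not come from the slides.

```python
def recombine(parent_a: str, parent_b: str, cut: int) -> str:
    """Single-crossover recombinant: prefix of parent_a joined to suffix of parent_b.

    Illustrative assumption: both haplotypes cover the same sites, and `cut`
    is the number of leading sites inherited from parent_a.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("haplotypes must have equal length")
    if not 0 <= cut <= len(parent_a):
        raise ValueError("cut point out of range")
    return parent_a[:cut] + parent_b[cut:]


# The child carries the first two sites of one parent and the rest of the other.
child = recombine("AAAAAA", "CCCCCC", 2)
print(child)  # AACCCC
```

A real meiosis can involve several crossovers; chaining calls with different cut points would model that under the same assumptions.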

Recombinations Capture: Requirements Specification
  • Enumerate the (multiple) recombinations
    • Statistical averages are not adequate
  • Identify the participating lineages
  • Detect ancient recombinations as well as recent ones
Then our task is to:

Estimate the phylogenetic network, called the Ancestral Recombinations Graph (ARG)

ARG coined by Griffiths & Marjoram, 1996

Joint work with Marta Melé, Jaume Bertranpetit, Francesc Calafell

An Inconvenient Truth

Theorem: Given data D, the problem of computing the ARG G with minimum number of recombinations is NP-complete.

Recall other inconvenient truths….

Theorem: The problem of computing the most parsimonious tree T is NP-complete.
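
The hardness lies in searching over tree topologies; scoring one fixed tree is easy. Below is a minimal sketch of Fitch's small-parsimony count for a single site, assuming a rooted binary tree written as nested pairs. The function name and the input format are illustrative and are not taken from the slides.

```python
def fitch_score(tree, states):
    """Minimum number of state changes for one site on a fixed rooted binary tree.

    tree:   nested 2-tuples with leaf names at the tips, e.g. (("a", "b"), "c")
    states: dict mapping each leaf name to its observed state
    """
    def visit(node):
        if not isinstance(node, tuple):            # leaf: its own state, no cost
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:                                 # children agree: no new change
            return common, left_cost + right_cost
        # children disagree: keep both options and count one change
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


tree = (("a", "b"), ("c", "d"))
print(fitch_score(tree, {"a": "A", "b": "A", "c": "G", "d": "G"}))  # 1
print(fitch_score(tree, {"a": "A", "b": "G", "c": "A", "d": "G"}))  # 2
```

Finding the most parsimonious tree means minimizing this score over all topologies, and the number of topologies grows super-exponentially with the number of leaves.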

Flavors of hardness

(Uni-parental)

  • In a NON infinite-sites model, TREE construction is hard
    • Infinite sites: no back mutations; no parallel mutations
    • But reality is infinite-sites
    • Yet, problem is tractable, in practice
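
Under the infinite-sites assumption (no back or parallel mutations) and without recombination, no pair of biallelic sites can show all four gametes 00, 01, 10 and 11. A minimal sketch of that pairwise check follows, assuming haplotypes are given as equal-length strings of '0' and '1'; the function name and input format are illustrative, not from the slides.

```python
from itertools import combinations


def incompatible_pairs(haplotypes):
    """Return site pairs (i, j) at which all four gametes 00, 01, 10, 11 occur.

    Assumes binary (0/1) sites, so four distinct pairs means all four gametes.
    """
    n_sites = len(haplotypes[0])
    bad = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            bad.append((i, j))
    return bad


haps = ["000", "010", "100", "111"]
print(incompatible_pairs(haps))  # [(0, 1)]
```

An empty result means the sites fit a tree with one mutation per site; a flagged pair signals either a repeated mutation or a recombination between the two sites.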

(Bi-parental)

  • In a pure recombinations model, problem is hard
    • Generally a “statistical average” has been pursued through LD (linkage disequilibrium)…
    • Combining potentially misleading mutations with recombinations makes the general problem intractable in practice
Tractability Model (Balance between reality and simplicity)
  • Use characteristics of the observed haplotypes
  • Use a compatible network model (not a generic phylogenetic model)
IRIS (Identifying Recombinations In Sequences)

Stage Haplotypes: use SNP block patterns

biological insights

Segment along the length: infer trees

computational insights

Infer network (ARG)


Stage 1: Staging the Input

       0  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8
 0)    1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1    3  1
 1)    1  3  4  2  3  2  2  2  2  3  4  3  3  3  3  2  2  3  3  4  4  3  2  2  2  2  2  2  1    2  1
 2)    1  3  4  2  3  2  2  2  2  3  4  3  3  3  3  2  2  3  3  4  4  3  2  4  1  1  4  3  2    6  1
 3)    1  3  5  2  3  2  2  2  2  4  5  1  1  1  1  1  3  1  4  1  5  4  3  5  3  4  5  3  1    3  1
 4)    2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  0  3  0  3  3  1    2  1
 5)    2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1   15  1
 6)    2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  2  2  2  2  1    3  1
 7)    2  1  2  2  2  2  2  2  2  2  2  1  1  1  2  2  2  2  1  1  1  1  1  3  2  3  3  3  1    3  1
 8)    2  1  2  2  2  2  2  2  2  2  2  1  1  1  2  2  2  2  1  2  2  2  2  2  2  2  2  2  1   11  1
 9)    2  3  4  2  3  2  2  2  2  3  4  3  3  3  3  2  2  3  3  4  4  3  2  4  1  1  4  3  2    8  1
10)    2  3  5  2  3  2  2  2  2  4  5  1  1  1  1  1  3  1  4  1  5  4  3  5  3  4  5  3  1    5  1
11)    3  2  3  2  2  3  2  3  2  2  3  2  2  2  3  2  2  2  2  3  1  1  1  3  2  3  3  3  1    6  1
12)    3  2  3  2  2  3  2  3  2  2  3  2  2  2  3  2  2  2  2  3  1  1  1  0  1  1  4  3  2    2  1
13)    5  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1    2  1
14)    5  1  6  2  2  2  2  2  2  5  3  4  2  2  3  3  2  3  1  1  1  1  1  1  1  1  1  1  1    2  1
15)    7  2  3  2  2  3  2  3  2  2  3  2  2  2  3  2  2  2  2  3  1  1  1  3  2  3  3  3  1    2  1
16)    7  2  3  2  2  3  2  3  2  2  3  2  2  2  3  2  2  2  2  3  1  1  1  6  1  1  1  1  1    2  1
17)    3  2  3  2  2  3  2  3  2  2  3  2  2  2  3  2  2  2  2  3  0  2  2  2  2  2  2  2  1    1  1
18)    2  1  1  1  1  1  1  1  1  1  4  3  3  3  3  2  2  3  3  4  4  3  2  4  1  1  4  3  2    1  1
19)    1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  5  1    1  1
20)    3  2  3  2  2  3  2  3  2  2  3  2  2  2  3  2  2  3  3  4  4  3  2  4  1  1  4  3  2    1  1
21)    2  3  5  2  3  2  2  2  2  4  5  1  1  1  1  1  3  1  4  1  5  4  3  5  2  2  2  2  1    1  1
22)    5  1  2  2  2  2  2  2  2  2  2  1  1  1  2  2  2  2  1  2  2  2  2  2  2  2  2  2  1    1  1
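
One way to read the staged input is that each column is a block of SNPs and each entry is the id of the pattern a haplotype shows in that block. The sketch below illustrates that reading only. It assumes fixed-size blocks and ids handed out in order of first appearance, starting at 1; the slides do not state how blocks are chosen or how ids are numbered, and the function name is hypothetical.

```python
def stage_haplotypes(haplotypes, block_size):
    """Encode each haplotype as a tuple of per-block pattern ids.

    Assumptions (not from the slides): blocks are consecutive runs of
    `block_size` sites, and ids are 1-based in order of first appearance.
    """
    n_sites = len(haplotypes[0])
    starts = range(0, n_sites, block_size)
    id_tables = [dict() for _ in starts]           # one pattern->id table per block
    staged = []
    for hap in haplotypes:
        row = []
        for table, start in zip(id_tables, starts):
            pattern = hap[start:start + block_size]
            # reuse the id if this pattern was seen in this block, else assign next
            row.append(table.setdefault(pattern, len(table) + 1))
        staged.append(tuple(row))
    return staged


haps = ["0011", "0010", "1111", "0011"]
print(stage_haplotypes(haps, 2))  # [(1, 1), (1, 2), (2, 1), (1, 1)]
```

Identical staged rows could then be collapsed into one row with a multiplicity.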

Ancestral Recombination Graph (ARG) (our characterization)

ARG is a fortified compatible graph:

  • Defined on k segments G(k)
  • A node can have at most 2 incoming edges (2 parents)
  • When 2 parents: denotes recombination of two segments; incoming edge is labeled by one segment each
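
The two structural constraints above (at most 2 incoming edges, each labeled by a segment) can be captured in a small data structure. This is a minimal sketch under stated assumptions: segments are numbered 0 to k-1, and the class and method names are hypothetical rather than part of IRIS.

```python
from dataclasses import dataclass, field


@dataclass
class ArgNode:
    name: str
    # each entry is (parent name, segment index labeling that incoming edge)
    parents: list = field(default_factory=list)

    def add_parent(self, parent_name: str, segment: int, k: int) -> None:
        """Attach an incoming edge labeled with one of the k segments."""
        if not 0 <= segment < k:
            raise ValueError("segment index must be in range(k)")
        if len(self.parents) >= 2:
            raise ValueError(f"{self.name} already has 2 parents")
        self.parents.append((parent_name, segment))

    @property
    def is_recombinant(self) -> bool:
        """A node with 2 parents denotes a recombination of two segments."""
        return len(self.parents) == 2


node = ArgNode("r")
node.add_parent("p", segment=0, k=3)
node.add_parent("q", segment=1, k=3)
print(node.is_recombinant)  # True
```

Rejecting a third parent at insertion time keeps every graph built this way within the "at most 2 incoming edges" rule.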
Stage 3: Trees to Forest (DSR Algorithm)

Input: Two graphs G1 and G2

Output: Consensus ARG G

Optimization

Topology

DSR

DSR Algorithm Overview

Initialization:

  • Let G1 and G2 be defined on leaf labels L
  • Let universe U ← L
    • P1 and P2 are partitions on U at leaf level

Iterative loop:

  • DO
    • Build a network structure with nodes in G and the labels derived from P1 and P2
    • Universe U ← the nodes in G
    • Increment layer and update P1 and P2 as sets on U of this layer
      • P1 has labels from G1
      • P2 has labels from G2
  • WHILE (P1 is nonempty) OR (P2 is nonempty)

Walk-through of DSR Algorithm

(0 2 5 9 12-14 16 18-20 23-24 28-30 33 35-36) 2 7:24:1 7:25:1

(1 6 8 17 21-22 26 31) 1 7:25:2

(7 11 15 34 37) 1 7:25:3

(25) 1 7:70:7

(3 10 27) 1 7:25:4

(4) 1 7:00:9

(32) 1 7:00:0

(0 5 13-14 16 23-24 29-30) 1 8:27:1

(1 6 8 17 21-22 26) 1 8:27:2

(19 36) 1 8:27:5

(2 9 12 18 20 31) 1 8:28:2

(3-4 7 10-11 15 25 27-28 32 34-35 37) 1 8:80:6

(33) 1 8:00:0
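
The two groups of sets above each cover the leaves 0 to 37 exactly once, so they can be read as two partitions of the same leaf set. The sketch below parses the range notation and tabulates how the blocks overlap. Treating the "7:" group as P1 and the "8:" group as P2 is an assumption made for illustration, as are the function and variable names.

```python
def parse_block(text):
    """Parse '0 2 5 12-14' into {0, 2, 5, 12, 13, 14}."""
    members = set()
    for token in text.split():
        if "-" in token:
            low, high = map(int, token.split("-"))
            members.update(range(low, high + 1))
        else:
            members.add(int(token))
    return members


# Blocks copied from the walk-through; the P1/P2 assignment is an assumption.
P1_TEXT = ["0 2 5 9 12-14 16 18-20 23-24 28-30 33 35-36", "1 6 8 17 21-22 26 31",
           "7 11 15 34 37", "25", "3 10 27", "4", "32"]
P2_TEXT = ["0 5 13-14 16 23-24 29-30", "1 6 8 17 21-22 26", "19 36",
           "2 9 12 18 20 31", "3-4 7 10-11 15 25 27-28 32 34-35 37", "33"]

p1 = [parse_block(t) for t in P1_TEXT]
p2 = [parse_block(t) for t in P2_TEXT]

# overlap[i][j] = number of leaves shared by block i of P1 and block j of P2
overlap = [[len(a & b) for b in p2] for a in p1]
for row in overlap:
    print(row)
```

The first printed row is `[9, 0, 2, 5, 2, 1]`: the largest P1 block is split across five P2 blocks, which is the kind of disagreement between the two segments that the algorithm has to reconcile.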

DSR: Dominant, Subdominant, Recombinant
  • Dominant: labels of G1 AND G2
  • Subdominant: label of G1 OR G2
  • Recombinant: no labels (NEITHER G1 NOR G2)

Rules:

1. Each row and each column has at most one dominant; ELSE has at most one subdominant; ELSE all recombinants

2. A non-recombinant can have non-recombinants either in its row or its column but NOT both
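
The three categories and Rule 2 can be written down directly. The sketch below is one reading of the slide, not the authors' implementation: it assumes each cell is already marked 'D', 'S' or 'R' in a rectangular grid, and it checks only Rule 2. All names are hypothetical.

```python
def classify(in_g1: bool, in_g2: bool) -> str:
    """D if labeled in both G1 and G2, S if labeled in exactly one, R if in neither."""
    if in_g1 and in_g2:
        return "D"
    if in_g1 or in_g2:
        return "S"
    return "R"


def satisfies_rule_2(grid) -> bool:
    """Rule 2 as read here: a non-recombinant cell may share its row OR its
    column with another non-recombinant cell, but not both.

    grid: list of equal-length strings over the letters 'D', 'S', 'R'.
    """
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell == "R":
                continue
            in_row = any(x != "R" for i, x in enumerate(row) if i != c)
            in_col = any(grid[i][c] != "R" for i in range(len(grid)) if i != r)
            if in_row and in_col:
                return False
    return True


print(classify(True, False))             # S
print(satisfies_rule_2(["DR", "RS"]))    # True
print(satisfies_rule_2(["DS", "SR"]))    # False
```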

DSR Algorithm: X-matrix

[Figure: X-matrix with rows labeled by P1 and columns labeled by P2]

DSR Algorithm: rows & cols DSR

[Figure, shown over three slides: rows and columns of the X-matrix, labeled by P1 and P2]


DSR Feasible Topology

Next layer

Last layer

chr21:14505500-14602168

Chinese (2 subpops: CBx, HNx); Japanese (JTx) data

  • Around 200 SNPs
  • Around 100 haplotypes
Network

Median-joining networks for inferring intraspecific phylogenies, Bandelt, Forster & Röhl, Molecular Biology and Evolution, Vol 16, 37-48, 1999

IRIS (Identifying Recombinations In Sequences)

[Figure: IRIS output along the locus (image not preserved)]

IRIS: Non-recombining Cluster Ids

[Figure: non-recombining cluster ids (image not preserved)]

Chr 21 locus: Preliminary Results
  • Not distinguishable
    • Share recent & ancient recombinations
  • No pop-specific mutation/recombination
The Big Picture

Ecosystem

Population Genomics

Species

Organism

Physiology

Metabolism

Network

Function

Structure

Sequence

Thank You!

… success stories in bioinformatics will depend on algorithmic and statistical ingenuity.

Pavel Pevzner
