The Lion, the Leopard, the Wolf, or the Boar…
Download
1 / 129

The Lion, the Leopard, the Wolf, or the Boar… Why not more? - PowerPoint PPT Presentation


  • 84 Views
  • Uploaded on

The Lion, the Leopard, the Wolf, or the Boar… Why not more?. Comparative Genomics. Bud Mishra. Professor of Computer Science, Mathematics and Cell Biology ¦ Courant Institute, NYU School of Medicine, Tata Institute of Fundamental Research, and Mt. Sinai School of Medicine. Outline. Biology

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' The Lion, the Leopard, the Wolf, or the Boar… Why not more?' - zoe


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

The Lion, the Leopard, the Wolf, or the Boar…

Why not more?

Comparative Genomics


Bud mishra

Bud Mishra

Professor of Computer Science, Mathematics and Cell Biology

¦

Courant Institute, NYU School of Medicine, Tata Institute of Fundamental Research, and Mt. Sinai School of Medicine


Outline
Outline

  • Biology

    • Evolution by Duplication

    • Segmental Duplications in mammalian genomes

  • Mathematical Challenges in Systems Biology

  • Compare Genomes All-vs-All

    • Models:

    • Blast and SWAT:

    • MUMMER:

    • PRIZM

  • Nondeterminism, Probability and Determinism: Computing the Priors and Mixed Strategies.

  • Demos: IT WORKS!



Introduction to biology1
Introduction to Biology

  • Genome:

    • Hereditary information of an organism is encoded in its DNA and enclosed in a cell (unless it is a virus). All the information contained in the DNA of a single organism is its genome.

  • DNA molecule can be thought of as a very long sequence of nucleotidesor bases:

    S = {A, T, C, G}


Complementarity
Complementarity

  • DNA is a double-stranded polymer and should be thought of as a pair of sequences over S. However, there is a relation of complementarity between the two sequences:

    • A , T, C , G

    • That is if there is an A (respectively, T, C, G) on one sequence at a particular position then the other sequence must have a T (respectively, A, G, C) at the same position.

  • We will measure the sequence length (or the DNA length) in terms of base pairs (bp): for instance, human (H. sapiens) DNA is 3.3 £ 109 bp measuring about 6 ft of DNA polymer completely stretched out!


Inert rigid
Inert & Rigid

  • Complementary base pairs (A-T and C-G) are connected by hydrogen bonds and the base-pair forms a coplanar “rung” connecting the two strands.

    • Cytosine and thymine are smaller (lighter) molecules, called pyrimidines

    • Guanine and adenine are bigger (bulkier) molecules, called purines.

    • Adenine and thymine allow only for double hydrogen bonding, while cytosine and guanine allow for triple hydrogen bonding.

  • Thus the chemical (through hydrogen bonding) and the mechanical (purine to pyrimidine) constraints on the pairing lead to the complementarity and makes the double stranded DNA both chemically inert and mechanically quite rigid and stable.


J b s haldane
J.B.S. Haldane

  • “If I were compelled to give my own appreciation of the evolutionary process…, I should say this: In the first place it is very beautiful. In that beauty, there is an element of tragedy…In an evolutionary line rising from simplicity to complexity, then often falling back to an apparently primitive condition before its end, we perceive an artistic unity …

  • “To me at least the beauty of evolution is far more striking than its purpose.”

    • J.B.S. Haldane, The Causes of Evolution. 1932.



Mer scape
Mer-scape…

  • Overlapping words of different sizes are analyzed for their frequencies along the whole human genome

    • Red: 24-mers,

    • Green: 21-mers

    • Blue:18 mers

    • Gray:15 mers

  • To the very left is a ubiquitous human transposon Alu. The high frequency is indicative of its repetitive nature.

  • To the very right is the beginning of a gene. The low frequency is indicative of its uniqueness in the whole genome.


Doublet repeats
Doublet Repeats

  • Serendipitous discovery of a new uncataloged class of short duplicate sequences; doublet repeats.

    • almost always < 100 bp

    • appear to use duplicative machinery that does not have the hallmarks of transposition, segmental duplication, or pseudogene formation.

  • (Top) . The distance between the two loci of a doublet is plotted versus the chromosomal position of the first locus.

  • (Bottom) : Distribution of doublets (black) and segmental duplications (red) across human chromosome 2


J.B.S. Haldane (1932):

“A redundant duplicate of a gene may acquire divergent mutations and eventually emerge as a new gene.”

Susumu Ohno (1970):

“Natural selection merely modified, while redundancy created.”

EBD


Evolution by duplication
Evolution By Duplication

  • Tandem:

    • Polymerase slippage

    • Unequal crossover: γ-globin duplication mediated by L1.

  • Interspersed:

    • Transposons

    • Exon shuffling

    • Segmental duplication

  • Whole genome

    • S. cerevisiae; early vertebrates; A. thaliana, etc


Duplicated genes
Duplicated Genes

  • Mutation during nonfunctionalization (MDN)

  • Gene sharing, Duplication-degeneration-complementary (DDC)

  • Dosage compensation

  • Multi-function constraint


Chromosomal aberrations
Chromosomal Aberrations

  • Breakage

  • Translocation (Among non-homologous chromosomes.)

  • Formation of acentric and dicentric chromosomes.

  • Gene Conversions

  • Amplifications and deletions

  • Point mutations

  • Jumping genes a Transposition of DNA segments

  • Programmed rearrangements a E.g., antibody responses.


Recent segmental duplications
Recent Segmental Duplications

Human

  • 3.5% ~ 5% of the human genome is found to contain

    • segmental duplications, with length > 5 or 1kb, identity > 90%.

  • These duplications are estimated to have emerged about 40Mya under neutral assumption.

  • The duplications are mostly interspersed (non-tandem), and happen both inter- and intra-chromosomally.

From [Bailey, et al. 2002]


The model

f - -

f - -

deletion or mutation

insertion

f + -

f + -

Duplication by recombination between other repeats or other mechanisms

deletion or mutation

insertion

f ++

f ++

Duplication by recombination between repeats

Mutation accumulation in the duplicated sequences

The Model


The mathematical model
The Mathematical Model

Time after duplication

1-α-2β

1-α-2β

1-α-2β

h0--

α

α

α

α

f - -

h1--

γ

γ

γ

h0+-

h0

α

α

α

α

H0

f + -

1-α-β/2-γ

1-α-β/2-γ

1-α-β/2-γ

β/2

β/2

β/2

h0++

α

α

α

α

H1

f ++

h1

h1++

1-α-2γ

1-α-2γ

1-α-2γ

0 ≤ d < ε

ε ≤ d < 2ε

(k-1)ε ≤ d < kε

h1: proportion of duplications by repeat recombination;

h1++: proportion of duplications by recombination of the specific repeat;

h1- -: proportion of duplications by recombination of other repeats;

h0: proportion of duplications by other repeat-unrelated mechanism;

h0++: proportion of h0 with common specific repeat in the flanking regions;

h0+-: proportion of h0 with no common specific repeat in the flanking regions;

h0- -: proportion of h0 with no specific repeat in the flanking regions;

α: mutation rate in duplicated sequences;

β: insertion rate of the specific repeat;

γ: mutation rate in the specific repeat;

d: divergence level of duplications;

ε: divergence interval of duplications.


Mechanisms of segmental duplications in mammalian genomes
Mechanisms ofSegmental Duplications in Mammalian Genomes

  • To test and quantify the hypothesis, we designed a Markov model that incorporates:

    • evolutionary dynamics of the sequence elements

    • interactions between different elements

  • Segmental duplication is a multi-mechanism process.

  • 12% was caused by recombination between the most active interspersed repeats.

  • Physical instability in the DNA sequences is also involved in duplication mechanism.

  • The results showed dynamic interaction between duplications at different scales.


Tasks
TASKS

  • Paralogs:

    • Compare one genome against itself

  • Orthologs:

    • Compare one genome against another

  • Compare multiple sets of genomes

  • Metagenomes:



Mathematical challenges in systems biology
Mathematical challenges in Systems Biology

  • How to fully decipher the (digital) information content of the genome

  • How to do all-vs-all comparisons of 1000s of genomes

  • How to extract protein and gene regulatory networks from 1 & 2

  • How to integrate multiple high-throughout data types dependably

  • How to visualize & explore large- scale, multi- dimensional data

  • How to convert static network maps into dynamic mathematical models

  • How to predict protein function ab initio

  • How to identify signatures for cellular states (e. g. healthy vs. diseased)

  • How to build hierarchical models across multiple scales of time & space

  • How to reduce complex multi- dimensional models to underlying principles


Evolution by mutation
Evolution by Mutation

  • Mutant gene or DNA sequence:

    • Substitution, Insertion/Deletion, Recombination, Duplication, Gene Conversion

    • Spread through a population by genetic drift and/or natural selection and eventually is fixed in a species

  • Nucleotide substitution can be divided into two classes:

    • Transitions: Substitution of a purine by purine (A, G) or a pyrimidine by a pyrimidine (C,T)

    • Transversions: The other types.

  • More specific properties:

    • Frameshift mutation, nonsense mutation, synonymous or silent substitutions, non-synonymous or amino acid replacement substitution

    • Transposons, gene conversions, horizontal gene transfer.


Jukes cantor
Jukes Cantor

  • The nucleotide substitution occurs at any site with equal frequency, and that at each site a nucleotide changes to of the three remaining nucleotides with a probability of a per year.

  • Probability of a change of a nucleotide to another is r = 3 a

  • qt = Proportion of identical nucleotides between two sequences

    qt+1 = (1-2r) qt +(2/3) r (1-qt)

    q(t) = 1 – ¾(1-e-8rt/3)


Kimura 2 parameters
Kimura 2 Parameters

  • The rate of transitional substitution per site per year = a

  • The rate of transversional substitution per site per year = 2b

  • Total substitution rate, r = a + 2 b

  • Pt is the proportion of identical transition-type pairs AG, GA, CT, TC

  • Qt is the proportion of identical transversion-type pairs AG, GA, CT, TC

    P(t) = (1/4)(1- 2 e-4(a+b)t +e-8b t)

    Q(t) = (1/2)(1 –e-8b t)


Other models
Other Models

  • Tajima & Nei:

    • Substitutions that seem to be rather insesnsitive to various disturbing factors.

  • Tamura:

    • Takes into account varying GC content

  • Hasegawa et al. (HKY)

  • Rzhetsky & Nei

  • Tamura & Nei


Blast and swat blast and the world blasts with you swat and you swat alone

Blast and SWAT:Blast, and the world Blasts with you; SWAT, and you SWAT alone…


Pair wise alignment task definition
Pair-wise Alignment:Task Definition

  • Given

    • a pair of sequences (DNA or protein)

    • a method for scoring the similarity of a pair of characters

  • Do

    • determine the correspondences between substrings in the sequences such that the similarity score is maximized


Dp alignment algorithms
DP Alignment Algorithms

  • Needleman & Wunsch

    • Improvement for Affine Gap-penalty, by Smith & Waterman (Swat)

    • Dynamic programming:

      • Determine alignment of two sequences by determining alignment of all prefixes of the sequences


Comparing syntenic regions
Comparing Syntenic Regions

Comparison of Human and Fugu orthologous genomic sequence using Plains and EMBOSS. This sequence contains six exon regions. Both Plains and EMBOSS correctly identify the exon region 2 in Fugu with 3 in Human, but only Plains correctly aligns the exon region 5 in Fugu with 5 in Human.


Heuristic alignment
Heuristic Alignment

  • FASTA [Pearson & Lipman, 1988]

  • BLAST [Altschul et al., 1990]

    • BLAST heuristically finds high scoring segment pairs (HSPs):

      • identical length segments from 2 sequences with statistically significant match scores

      • i.e. ungapped local alignments

      • key tradeoff: sensitivity vs. speed


The mummer system
The MUMmer System

  • Delcher et al., Nucleic Acids Research, 1999

  • Given genomes A and B

    • Find all maximal, unique, matching subsequences (MUMs)

    • Extract the longest possible set of matches that occur in the same order in both genomes

    • Close the gaps

    • Output the alignment


Find longest subsequence
Find Longest Subsequence

  • Sort MUMs according to position in genome A

  • Solve variation of Longest Increasing Subsequence (LIS) problem to find sequences in ascending order in both genomes

    • Requires O(k log k) time where k is number of MUMs


The more the merrier

Algorithm

Homology Seed

Indexing

Reference

BLAST

exact k-mer

Scanned with DFA*

[31]

WABA

wobble base degenerate k-mer

Array

[32]

LSH-ALL-PAIRS

randomly projected k-mer with <d mismatches

Sorted Array

[33]

BLASTZ

discontinuous exact k-mer

Hash Table

[34][35]

PatternHunter

discontinuous exact k-mer

Hash Table

[36]

BLAT

exact or inexact k-mer

Hash Table

[37]

CHAOS

exact or inexact k-mer

T-Trie

[38]

PASH

discontinuous exact k-mer

Hash Table

[39]

REPuter

maximal exact repeat

Suffix Tree

[40]

FORRepeats

exact repeat

Factor Oracle

[41]

The More the Merrier


Summary
Summary

  • Many innovative sequence alignment tools are available for detailed comparative genomic studies.

    • Recent segmental duplications in mammalian genomes (with identity level >90%) can be detected using BLAST and many other tools.  

  • They use exact or inexact k-mers as homology seeds.

  • As homology levels become lower, they encounter a dilemma between sensitivity and computational efficiency

    • homologous segments,

    • segmental duplications, or

    • homology-based phylogentic distances.


Summary1
Summary

  • To improve sensitivity they rely on exhaustive searches of exact matches with short mers or inexact matches with long mers.

    • They encounter too many false-positives, to be later filtered through an expensive post-processing step.

  • Instead, more stringent search criteria (longer mers with more exact matches) may be used to improve efficiency.

    • Then these algorithms fail to detect low-homology regions, such as ancient duplication events.

  • In order to detect less-recent duplications, orthologous genes have been used as “anchors” to map out the duplication blocks. But, for obvious reasons, they are unsuitable for identifying duplications that are not subject to strong selection process.


Prizm paxia rastogi zhou mishra

#2

PRIZM(Paxia, Rastogi, Zhou, Mishra)

How to do all-vs-all comparisons of 1000s of genomes


Prizm
PRIZM

  • It uses a Bayesian scheme.

    • It is efficient…Linear time.

    • It computes homologous regions between two genomes even when the homology level drops to a value around 65%.

      • Incorporates background knowledge about genome evolution, by experimenting with several priors (noninformative improper prior, exponential and Gamma priors, and priors based on Juke-Cantor one parameter and Kimura’s two parameters models of evolution).

      • The results appear to be unaffected by these choices, while the computational efficiency is only mildly affected by what prior is employed.


A simple observation
A Simple Observation

  • Homology is hard to compute but easy to verify. Quadratic vs. Linear time.

    • Can a probabilistic approach replace nondeterminism? If so, we can expect a probabilistic linear time algorithm.

    • Unlikely! But if we can use priors based on the underlying distributions, there is hope.

    • Many computational biology problems share this feature!


#2

The genomic sequences under comparison: A) M. genitalium (Y-axis) and M. pneumoniae (X-axis), computed using 300 probes from six iterations, taking about 5 seconds using a non-optimized algorithm


#2

The mouse chromosome 10 and rat chromosome 1 share a syntenic region about 20Mb at the beginning of the two chromosomes (arrow).


#2

Alignment between Mouse Chromosome 8 andhuman chromosome 4.

In silico experiment uses 21 mers (5,105,3), and takes about 45 seconds.


Demo

Human vs. Mouse

M. genitalium vs. M. pneumoniae


Homology curve
Homology Curve

  • An m-mer is a word of length m, selected from either genome.

  • Consider a location in the first genome, G1[a] and a short window, starting at a.

    W1,a = G1[a, a+m−1]

  • Compare this window with a word of equal length from the second genome starting at G2[b]:

    W2,b = G2[b, b+m − 1].

  • Define the homology level for the locations G1[a] and G2[b] and a window of size m as

    h(2)(G1[a], G2[b], m) = (1/m) åg=1m-1IG1[a+g] = G2[a + g].


Homology curve1
Homology Curve

  • Let us define, h(a) to denote the highest homology level for genome G1 at position a and computed with respect to G2:

    h(a) = max1 ·b <|G2| - m h(2)(G1[a], G2[b], m)

  • The “homology curve” for the first genome, G1 with the respect to the second genome G2 is then defined as:

    h : [1..G1 −m − 1] → [0, 1]

    : aa h(a)


Prizm1
PRIZM

  • Replacing non-determinism with a probabilistic guessing scheme.

    • The probability distributions are based on biologically meaningful priors.

    • Using these priors it guesses a local homology curve, and designs and performs an in silico experiment.

    • It uses the results to verify its guess (in linear time) and refines the local homology curve and the probability distributions for the next iteration.


Probabilistic guesses
Probabilistic guesses

  • Use a Bayesian scheme and a boosting approach to modify the probability distributions of the “guessing experiments” from one iteration to next.

  • At each iteration, a sequence of words with a specific distribution is selected from one genome, and is optimally partitioned into groups for “in silico experiments” involving

    • exact-match search,

    • inexact-match search with one error,

    • inexact match search with two errors, etc.


Probabilistic guesses1
Probabilistic guesses

  • These searches can be efficiently conducted over the second genome,

    • Assuming that the other genome has been preprocessed and stored in an efficient data structure (e.g., suffix arrays or hash table).

    • From the results of the experiments, a Bayesian estimator can compute the local homology levels for the genome, and use it to verify and sharpen the probabilistic distributions for the next iteration.

  • The algorithm converges to the true local homology levels after a few iterations.


In silico experiments
In Silico Experiments

  • Let b, B = | G2 |/b, w, W, m, N0, N1, . . ., Nk (k ≤ m, and in our applications usually k = 2) be some pre-specified parameters.

    • Choose k+1 random subsets, S0, S1, . . ., Sk, of words each of length m, randomly (i.i.d. uniform) from G1[a, a+w−1], such that

      |S0| = N0, |S1| = N1, . . . , and |Sk| = Nk.

    • Consider a block in the second genome of length b: Bb = G2 [b, b+b−1]. Let X0 (X1, . . ., Xk, respectively) be defined as the number of m-mers from set S0 (S1, . . .,Sk, respectively) that match exactly (with one, two and so on up to k mismatches, respectively) to an m-mer in G2[b, b +b − 1].


Experiment design
Experiment Design

  • By examining the sensitivity (∂(Xi/Ni)/∂h = a′i[h]) we can divide the interval for h into three intervals: [1/4, θ1] ≈ [1/4, (m−2)/m], (θ1, θ2] ≈ [(m−2)/m, (m−1)/m] and (θ1, 1] ≈ ((m−1)/m, 1], such that the choices of (N0,N1,N2) are based on the following mixed strategies:

    N0 = (K/b) sq21 pi,I(h) dh

    N1 = (K/3bm) sq1q2 pi,I(h) dh

    N2 = (2K/9b(m2 –m)) s1/4q1 pi,I(h) dh

    ….


In silico experiments1
In Silico Experiments

  • Thus Xi’s for i in [0..k] are binomially distributed random variables whose parameters depend on the homology level h. 

  • We can estimate the local homology by the following robust estimators:

    h h | X0, X1, … Xki

    = s01 h p(h | X0, …, Xk) dh

    = s01 h p(h) p(X0, …, Xk|h) dh/ s01 p(h) p(X0, …, Xk|h) dh

    • Similarly, compute the mean, standard deviation and confidence of the homology function over Bb.

    • Let b*= arg maxb mean(Bb). Then the homology function is estimated at a by mean(Bb*).


Conditional probabilities
Conditional Probabilities

ri = b pmåj=1i C[m,j] 3j

si = åj=1i C[m,j] hm−j (1 − h)j

bi = (1 − si)(1 − ri)

ai = 1 − bi = si + ri − si ri.

p(X1, X2, …, XK|h) /Õj aj Xj bjNj – Xj


Initial priors
Initial Priors

  • Using Jukes-Cantor: the random variable r represents the rate of nucleotide substitution per site per year.

    • In this model, it is assumed that nucleotide substitution occurs at any nucleotide site with equal frequency and at each site a nucleotide changes to one of the three remaining nucleotides with a probability a per year: r = 3 a.

    • The substitution rate is often higher at functionally less important sites than at functionally more important sites.

    • Case 1: r » Exponential(λ): fexp(r) = λe−λr. In that case

      p(h) »(4h − 1)3λ/8T−1

    • Case 2: r » Gamma(λ, ν): f Γ(r) = λn e−λrrn−1/Γ(n). In that case

      p(h) »(4h − 1)3λ/8T−1 ln[3/(4h − 1)/T]n−1


Initial priors1
Initial Priors

  • A more complex structures arise as we consider multi-parameter models: e.g., Kimura’s Two-Parameter Method. In this model, the rate of transitional substitution per site per year (a) is assumed to be different from that of transversional substitution (2β).

    h = (1 − P)(1 − Q)

    P = (1/4) (1 − 2e−4(a+β)T + e−8βT)

    Q = (1/2) (1 − e8βT)


Initial priors2
Initial Priors

  • Assume that α » Exponential(λa) and β » Exponential(λβ) and they are independent.

    p(h) = (λa λβ/8T2)

    s01(1 − P)2+(λa+λb)/8T(2h + P − 1)(λa−λb)/8T

    (h − 2P + P2)−λa/4T

    /(2h + P − 1)(h − 2P + P2) dP.


Refining priors
Refining Priors

  • In iteration i, let us consider an interval I with k homology estimates:

    hm1, s1i, hm2, s2i, …, hmk, ski

  • Assume that the homology values h1, h2, . . ., hk is sampled from a distribution

    h »N(μ, σ2).

  • Furthermore, we assume the following:

    μ »N(ξ, t2)

    r = (σ2 + t2)−1» Γ(a, β).


New prior
New Prior

  • Prior = Kummer’s hypergeometric function of order 1

    f(h|ξ, t, a , β)

    »s01/τ2 ra-2 1/√(1 − r t2) e−Br dr

    »1F1(a − 1, a − 1/2,−B/ t2)

    ¼ ((h-x)2 + 2 b)/2 t2)-a+1

  • Estimates

    ξ = h μji

    t2 = h μ2i − h μji2

    a/β = h (σ2i + t2)−1i

    a/β2 = h σ2i + t2)−2i - h σ2i + t2)−1i2


Optimizing the parameters
Optimizing the parameters

  • The parameter choices are as follows:

    • Let the number of blocks (b) and the number of windows (w) be chosen a priori based on the needed resolution for homology.

    • We may choose these parameters so that b = O(p(G2)) and w = O(p(G1)). We assign K = O(1) amount of work to a region defined by a combination of any single block with any single window.

    • Thus the amount of work is roughly K(G1G2)/(w b) = O(G1+G2) per iteration. 

    • The mer size parameter ‘m’ is chosen so that the probability of a “hit” in a block containing a homologous sequence much higher than in a random block:

      (b/4m) << E(h0,G)m.


How to identify signatures for cellular states e g healthy vs diseased

#7

How to identify signatures for cellular states (e. g. healthy vs. diseased)


Microarray analysis

Normal DNA

Tumor DNA

Normal LCR

Tumor LCR

Label

Hybridize

Microarray Analysis

  • Representations are reproducible samplings of DNA populations in which the resulting DNA has a new format and reduced complexity.

    • We array probes derived from low complexity representations of the normal genome

    • We measure differences in gene copy number between samples ratiometrically

    • Since representations have a lower nucleotide complexity than total genomic DNA, we obtain a stronger specific hybridization signal relative to non-specific and noise


Copy number fluctuation
Copy Number Fluctuation

A1

B1

C1

A2

B2

C2

A3

B3

C3


A map maximum a posteriori estimators

Priors:

Deletion + Amplification

Data:

Priors + Noise

Goal: Find the most plausible hypothesis of regional changes and their associated copy numbers

Generalizes HMM:The prior depends on two parameters pe and pb.

pe is the probability of a particular probe being “normal”.

pb is the average number of intervals per unit length.

(pe,pb) max at (0.55,0.01)

A MAP (Maximum A Posteriori) Estimators




Prior selection f criterion

(pe,pb) max at (0.55,0.01)

Prior Selection: F criterion

  • For each break we have a T2 statistic and the appropriate tail probability (p value) calculated from the distribution of the statistic. In this case, this is an F distribution.

  • The best (pe,pb) is the one that leads to the maximum min p-value.


Current implementation
Current Implementation

  • NYU Array CGH

  • Versatile:

    • Works well for BAC array, ROMA & Agilent

    • Raw Affymetrix data is noisier, but our new algorithm for “background correction and summarization (BCS)” makes Affy-data significantly better than the competitors.

    • (BCS also generalizes to gene expression and performs better than RMA, Li-Wong, etc.)

  • Global Analysis (LOH analysis, detecting TSG and onco-genes)

  • Fast implementation, with visualization and integration (to be released soon…)


Finding cancer genes
Finding Cancer Genes

  • LOH/Deletion Analysis analysis

  • Hypothesize a TSG (Tumor Suppressor Gene)

  • Score function for each possible genomic region containing the TSG

    • Evolutionary history

    • Interactions

    • Parameters

  • This score can be computed using estimation from data and also prior information on how the deletions arise. We use a simple approximation; we assume there is a Poisson process that generates breakpoints along the genome and an Exponential process that models the length of the deletions.


Simulations

There are S = 200 individuals. Deletions occur in each individual. A cell having incurred a deletion overlapping one of the TSGs, will multiply quickly and generate many copies of itself. These copies evolve independently for a while until we collect the information.

Several scenarios:

Model1: one TSG [300, 350]. All individuals are diseased because of homozygous deletion of the TSG.

Model2: one TSG [300, 350]; 50% of the individuals are diseased because of homozygous deletion of the TSG and the rest are diseased because of hemizygous deletion of the TSG.

Model3: one TSG [300, 350]. All individuals are diseased because of hemizygous deletion of the TSG.

Model4: two TSGs; 50% of the people are diseased because of homozygous deletion in the first TSG and the other half are diseased because of homozygous deletion in the second TSG.

Simulations



Host pathogen interaction

#10

Host-Pathogen Interaction


BioCOMP/Biospice

GOALIE

Gene Ontology Algorithmic Logic Invariant Extractor


A picture
A Picture

Biological System:

Part-List + Functional Relations

Measurements

Regulatory, Metabolic & Signaling Relations

Recomputation

Redescription

Descriptive Relation

Numerical Traces

KRIPKE MODELS

Invariants


Yeast cell cycle data spellman et al
Yeast-Cell Cycle Data:Spellman et al.


Invariants
Invariants

G0

G1(I)

M

G1(II)

G2

S


Another example
Another Example

  • Pre-Apoptosis 6 time points data analysis

  • Six time-point data at 2h, 4h, 6h, 8h, 12h, 24h

    • Cells treated with SEB

    • Control: untreated cells

  • Data from Jett-Lab (Walter-Reed)



From the GO web site. The

path back to each ontology

from a gene.

We will call each term in a

path a split.


Structure of a go annotation
Structure of a GO annotation ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome…

Each gene can have several annotated GOs and each GO can have several splits. E.g. DNA topoisomerase II alpha has 8 GO annotations and 11 splits


Time coarse go categorization
Time Coarse GO categorization ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome…

  • GOALIE partitions the temporal data in order to understand local behavior. Data are grouped by considering (weighted) windows of time points (2-4-6, 4-6-8, 6-8-12, 8-12-24)

  • Next GOALIE uses a K-Means algorithm (genesis) to produce clusters for each window

  • GOALIE then associates to each cluster a set of GO descriptors (with p-values and controlled FDR, false discovery rates)

  • GOALIE computes “patterns” of GO categories across the time points

  • All these steps can be run from one unifying interface that GOALIE provides. GOALIE will be embedded inside VALIS.


Female, Sings, Overweight ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome…

X1

X2

X3

X4

Male, Talks, Thin

Male, Sings, Overweight

Female, Sings, Underweight


X ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome…3

X4

X3

X2

X1

X1

X4

X2


X ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome…3

X4

X3

X2

X1

X1

X4

X2


X ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome…3

X4

X3

X2

X1

X1

X4

X2


X ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome…2

X4

X3

X4

X1

X2

Finale

X3

X1


X ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome…2

X4

: O, : FLS

X3

X4

X1

X2

Finale

X3

: O, : FLS

: O, : FLS

O

: O, FLS

X1

: O, FLS


X ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome…2

X4

: O, : FLS

: O UFLS

X3

X4

X1

X2

Finale

X3

: O, : FLS

: O, : FLS

: O UFLS

O

: O, FLS

X1

: O, FLS

: O UFLS


X ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome…2

X4

: O, : FLS

: O UFLS

X3

X4

X1

X2

Finale

X3

: O, : FLS

: O UFLS

: O, : FLS

: O UFLS

O

: O, FLS

X1

: O, FLS

: O UFLS


It ain’t over ‘til ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome…

the fat lady sings

X2

X4

: O, : FLS

: O UFLS

X3

X4

X1

X2

Finale

X3

: O, : FLS

: O UFLS

A[: O UFLS]

: O, : FLS

: O UFLS

O

: O, FLS

X1

: O, FLS

: O UFLS


Goalie go algorithmic logic for invariant extraction
GOALIE: GO Algorithmic Logic for Invariant Extraction ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome…

Clusters connection treeEach level a “window”

Micro-array accessions

GO categories

Clusters information

Connection information

Cluster Information


Goalie go algorithmic logic for invariant extraction1
GOALIE: GO Algorithmic Logic for Invariant Extraction ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome…

GO categories describing “source” cluster but not “destination”

GO categoriesdescribing “destination” cluster but not “source”

GO categories shared with “destination” cluster

GO categories describing genes in “source” cluster


Goalie seb analysis preliminary results
GOALIE: SEB Analysis Preliminary Results ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome…

  • Time Course Window 1 to Time Course Window 2: Connection 1:9 to 2:18.By inspecting the first cluster in the first window (Cluster~1:9), we note that one of the connection to the cluster2 in the second window (Cluster~2:18) is labeled (among many others) by the GO categories “circulation” (GO:0008015), and by the category “negative regulation of heart rate” (GO:0045822). This represents a constant set of biological processes shared by this cluster chain, traversing Cluster 3:17, to Cluster 4:13.

  • Time Course Window 1 to Time Course Window 2: Connection 1:9 to 2:6.The connection between Cluster 1:9 and Cluster 2:6 is interesting because it shows how the category “regulation of lymphocyte proliferation” (GO:0050670) becomes activated in the next time-window (Cluster 2:6), while the categories “antigen presentation” and “antigen processing” became inactive. This should indicate that some of the genes in the clusters start a response to the pathogen in the second time point.


Hidden kripke model
Hidden Kripke Model ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome…

  • “Hidden Kripke Model” reconstruction via ontology based redescription of time-sliced clusters of time-course measurements (arrays) offers a novel viewpoint form which formulate biological hypotheses and eventually reconstruct “biological first principles”

  • The GOALIE tool, in its embryonic state, has already been proven “correct” and offered a wide range of insights into a number of biological datasets

    • Spellman’s Yeast Cell Cycle

    • SEB host-pathogen data from WRAIR

  • New datasets being analyzed now include

    • P. falciparum dataset [Bozdech et al, 1(1):085]

    • Subset of Genome Module Map dataset [Segal et al]


Story generation

Dataset ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome…

FormulaGenerator

Dataset

Dataset

Dataset

Formula

Formula

Formula

Temporal

Logic Analysis

Natural LanguageStory Generation

HTML

file

Story generation

  • Temporal Logic formulae can be rendered in English.

  • Temporal Logic formulae can be generated automatically (with care).

  • Each formula can be tested against a set of datasets; differences can then be noted.


Chr1 ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome…

Ns

ATs

Reps

MER57A

L1P

CDs

ΔG

Dup

Copy#

Mer

Freq

#4

How to integrate multiple high-throughout data types dependably; How to visualize & explore large- scale, multi- dimensional data

#5

VALISvast¢active¢living¢intelligence¢system


Databases

Sequence ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome…

GENBANK, EMBL, DDBJ

SWISS-PROT, GenPept, TREMBL, PIR

PDB, SCOP,…

Genetic (Flybase, Wormbase, GDB, AtDB, SGD,…)

Expression (RNA Expression from microarrays,...)

Metabolic (KEGG, WIT,…)

Literature (MEDLINE,…)

Bioperl

International open-source collaboration

7 Years of development

600 Modules

100,000 lines of code

Easy-to-use, stable, and consistent programming interface for bioinformatics application programmers

Databases


Bioperl module statistics

mean 144 ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome…

BioPerl Module Statistics


Bioinformatics data

Majority of the data kept in (indexed) flat files & Relational Databases

Flat Files

Unstructured/Difficult to update

Relational Databases

Strings are atomic objects

Blobs

Example: GoldenPath /UCSC Genome Browser

Rough Estimates:

800 Tables

14000 fields

1600 varchars [most varchar(255) ]

1300 blobs

Blobs:

exonStart, ExonEnd, qStarts, tStarts, …

Bioinformatics Data


Key feature of valis
Key Feature of Valis Relational Databases

  • State of the art of rapid prototyping in bioinformatics

  • Multilanguage Scripting

  • Data storage

  • Graphical User interfaces


Multi scripting
Multi-Scripting Relational Databases

  • A Valis script can be written in any supported language:

    • JScript, VBScript, Python, PERL, Lisp, R and SETL.

    • All the scripts see the same Valis class hierarchy.

    • For example, once a user learns that a Valis Sequence Object has a method called Input that will read the sequence from a file, the user can subsequently use this same primitive from all the different languages.


Data storage
Data Storage Relational Databases

#4

  • Based on Extended B-Trees

  • At the lowest level there is an Heap of pages

  • Must correctly keep track of the reference counts of each record/object to implement value semantics


Visualization
Visualization Relational Databases

#5

  • Once the processing is completed, it is very important to be able to quickly visualize the results.

  • For this reason Valis provides numerous visualization tools that allow a user to quickly display

    • sequences, maps,

    • microarray data,

    • tables,

    • graphs and annotations.

  • These widget can be customized from the scripts.


Glycogen Relational Databases

P_i

Glucose

Glucose-1-P

Phosphorylase a

Phosphoglucomutase

Glucokinase

Glucose-6-P

Phosphoglucose isomerase

Fructose-6-P

Phosphofructokinase

How to convert static network maps into dynamic mathematical models; How to predict protein function ab initio; How to build hierarchical models across multiple scales of time & space;How to reduce complex multi- dimensional models to underlying principles

#6

#9

Glycolysis

#10

SIMPATHICA


Simpathica system
SimPathica System Relational Databases


Simpathica is a multi functional system

Front End Relational Databases

Analysis

XSSYS (TL)Time/FrequencyCombined TL/TF

Model data structures

Equations HandlingOctave/Matlabcode generation

Dash-board

Simulation

Visualization

MatlabOctave

PtPlotgnuplot

mArray DB

NYUMAD

Simpathica is a multi-functional system


Simpathica is a modular system
Simpathica is a modular system Relational Databases

Canonical Form:

  • Characteristics:

  • Predefined Modular Structure

  • Automated Translation from Graphical to Mathematical Model

  • Scalability


Purine metabolism
Purine Metabolism Relational Databases

  • Purine Metabolism

    • Provides the organism with building blocks for the synthesis of DNA and RNA.

    • The consequences of a malfunctioning purine metabolism pathway are severe and can lead to death.

  • The entire pathway is almost closed but also quite complex. It contains

    • several feedback loops,

    • cross-activations and

    • reversible reactions

  • Thus is an ideal candidate for reasoning with computational tools.


Simple model
Simple Model Relational Databases


Biochemistry of purine metabolism
Biochemistry of Purine Metabolism Relational Databases

  • The main metabolite in purine biosynthesis is 5-phosphoribosyl-a-1-pyrophosphate (PRPP).

    • A linear cascade of reactions converts PRPP into inosine monophosphate (IMP). IMP is the central branch point of the purine metabolism pathway.

    • IMP is transformed into AMP and GMP.

    • Guanosine, adenosine and their derivatives are recycled (unless used elsewhere) into hypoxanthine (HX) and xanthine (XA).

    • XA is finally oxidized into uric acid (UA).


Purine metabolism1
Purine Metabolism Relational Databases


Queries

Variation of the initial concentration of PRPP does not change the steady state.(PRPP = 10 * PRPP1) implies steady_state()

This query will be true when evaluated against the modified simulation run (i.e. the one where the initial concentration of PRPP is 10 times the initial concentration in the first run – PRPP1).

Persistent increase in the initial concentration of PRPP does cause unwanted changes in the steady state values of some metabolites.

If the increase in the level of PRPP is in the order of 70% then the system does reach a steady state, and we expect to see increases in the levels of IMP and of the hypoxanthine pool in a “comparable” order of magnitude. Always (PRPP = 1.7*PRPP1) implies steady_state()

Queries

TRUE

TRUE


Queries1

Consider the following statement: change the steady state.

Eventually

(Always (PRPP = 1.7 * PRPP1)impliessteady_state()and Eventually

Always(IMP < 2* IMP1))and Eventually (Always (hx_pool < 10*hx_pool1)))

where IMP1 and hx_pool1 are the values observed in the unmodified trace. The above statement turns out to be false over the modified experiment trace..

In fact, the increase in IMP is about 6.5 fold while the hypoxanthine pool increase is about 60 fold.

Since the above queries turn out to be false over the modified trace, we conclude that the model “over-predicts” the increases in some of its products and that it should therefore be amended

Queries

False


Final model
Final Model change the steady state.


Purine metabolism2
Purine Metabolism change the steady state.


Quantum leaps

Quantum leaps change the steady state.

“provoke creative

dreaming”


Shotgun mapping
Shotgun Mapping change the steady state.

  • Schematics

    • Experiment Design

    • Robotics

    • BioChemistry

    • Imaging

    • Image Analysis

    • Statistical Algorithms

    • Visualization


Gentig s successes

E. coli change the steady state.

P. falciparum ¦ D. radiodurans ¦ Y. Pestis

Rhodobacter sphaeroides ¦ Shigella flexneri ¦ Salmonella enterica

Aspergillus fumigatus

The automated Gentig system is routinely used

to map a microbe genome quickly & effortlessly

by a scientist with no quantitative or computational training.

E. coli

Y.Pestis

P. falciparum

Gentig’s Successes

Shotgun Optical Mapping of Genomes


Some interesting applications
Some Interesting Applications change the steady state.

  • Sequence Validation

  • Haplotyping

  • Sequencing

  • Comparative Genomics

  • Rearrangement events

  • Hemizygous Deletions

  • Epigenomics

  • Characterizing cDNAs

    • Expression Profiling

    • Alternate Splicing


Sequencing in post genomic era
Sequencing in Post-Genomic Era change the steady state.

  • Haplotypic Sequencing of 6.6 Billion Base Pairs in a Diploid Human genome

  • Less than $700

  • Less than 24 Hours

  • Draft Quality (Not Resequencing)


Ingredients
Ingredients change the steady state.

  • Single Molecule Optical Mapping

    • Methylation Sensitive Restriction Enzymes

    • Multiple-Enzyme Maps

  • Probe Hybridization on the Surface

    • PNA, LNA, TFO

  • Sequencing by Hybridization

    • Localize algorithms


Psbh problem
PSBH problem change the steady state.

  • The PSBH problem:

    • Can overcome the complexity issues associated with the SBH problem.

    • For each L-mer probe, in addition to knowing whether it hybridizes with the unknown sequence (with or without count) we are provided constraints on the location of the L-mer in the sequence:

    • The constraint takes the form of a set of permissible locations for each L-mer (which need not be contiguous).

    • However it is known that if the constraint limits each L-mer to no more than two exact locations on the sequence, then it admits an efficient algorithmic solution—If 3 or more locations are possible the reconstruction problem becomes NP-complete.

    • However if the location constraints are in the form of K contiguous locations then the reconstruction problem is exponential in K, rather than the sequence length m.


Multiple psbh problem
Multiple PSBH problem change the steady state.

  • For our data set of probe maps for all 6-mers, we have multiple instances of each L-mer, for L=6 (about one every 4Kb on each strand of the DNA) in the sequence, and for each instance we can constrain the location to within about  200 base pairs depending on the optical resolution.

  • A special case of the PSBH problem, which we will call the “Multiple Positional Sequence by Hybridization” (Multiple PSBH) problem, where we have separate constraints for each of the multiple instances of each L-mer.

    • By focusing on a small window of about 2000bp, in which most L-mers will occur only once, we can solve the standard PSBH problem where separate constraints for multiple instances of each L-mer are not important.

    • However if we solve each local PSBH problem for each 2000bp window separately, the worst case exponential time reconstruction is unlikely to apply to most windows, so if we simply limit the amount of time spent in each window to some upper bound linear in the window size, we can still be able to reconstruct the sequence in most windows in linear time.


Initial experiments
Initial Experiments change the steady state.


The end change the steady state.


The end change the steady state.


ad