Genome-wide Shotgun Mapping, Validating, Sequence Aligning & Population Studies

NYU Faculty Research Descriptions, October 16, 2001 • Genome-wide Shotgun Mapping, Validating, Sequence Aligning & Population Studies • Bud Mishra, Professor of Computer Science & Mathematics (Courant Institute), Professor (Cold Spring Harbor Laboratory) • http://www.cs.nyu.edu/mishra

People • Senior Research Scientists: Marco Antoniotti (CS), Archisman Rudra (CS) • Junior Research Scientists: Toto Paxia (CS), Marc Rejali (Bio/CS), Valerio Luccio (CS), Joe McQuown (Stat/OR) • Graduate Students: Joey Zhou (Biology), Will Casey (Math), Vera Cherepinsky (Math) • Collaborators: • Mt Sinai School of Medicine: Harel Weinstein, Bob Desnick • Courant Institute: Misha Gromov • Biology Dept, NYU: Gloria Coruzzi, Phil Benfey • Cold Spring Harbor Lab: Mike Wigler, Dick McCombie, Vivek Mittal • University of Wisconsin: Tom Anantharaman, David Schwartz • Memorial Sloan Kettering: Larry Norton

Shotgun Sequence Assembly • A jigsaw puzzle • Assembles a collection of words while minimizing errors. • An Example (somewhat unrealistic) • Words: complete, correct, -ly, but, the, sequence, in-, human, published, is, genome. • Solution 1: Minimize unused letters: • The published human genome sequence is incomplete but correct. • Solution 2: Minimize the number of spaces: • The published human genome sequence is completely incorrect.

Validation, Alignment & Assembly

Shotgun Mapping • Large fragments of genomic DNA of length from 2Mb to 12Mb are optically mapped • The resulting ordered restriction maps are automatically contiged by “Gentig” • The consensus map computed by Gentig is free of errors due to partial digestion, sizing error and false cuts

Shotgun Mapping • Schematics • Surface Chemistry • Robotics • BioChemistry • Imaging • Image Analysis • Statistical Algorithms • Visualization

Gentig Maps: Plasmodium falciparum • A. Gap-free consensus BamHI & NheI maps for all 14 chromosomes. • B. BamHI map • C. NheI map • D. NheI map of Chromosome 3 displayed by ConVEx

Validation & Its Software Architecture

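Dynamic Programming Recurrence

$$T[i,j] := \min_{u \le i,\ v \le j} \Big\{\, T[i-u,\, j-v] + \ln \frac{\{2\pi(\sigma_i^2 + \cdots + \sigma_{i-u}^2)\}^{1/2}}{p_c} + \frac{\{(l_j + \cdots + l_{j-v}) - (a_i + \cdots + a_{i-u})\}^2}{2(\sigma_j^2 + \cdots + \sigma_{j-v}^2)} + (u-1)\ln\frac{1}{1-p_c} + (v-1)\ln\frac{1}{p_f} \,\Big\}$$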
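The recurrence above can be sketched in a few lines of Python. This is only an illustrative reading of the slide, not the validation software itself. Several things are assumptions: the function name `align_cost`, the boundary condition `T[0][0] = 0` (a global alignment), the `max_merge` cap on `u` and `v`, reading each step as merging exactly `u` fragments of `a` with `v` fragments of `l`, and using one standard deviation per `a`-fragment in both variance sums (the slide shows both `i` and `j` indices).

```python
import math

def align_cost(a, sigma, l, p_c, p_f, max_merge=3):
    """Fill T[i][j]: minimum cost of aligning a[:i] against l[:j].

    a      -- fragment sizes of one map (hypothetical input)
    sigma  -- standard deviation for each fragment of a (assumed indexing)
    l      -- fragment sizes of the other map
    p_c    -- cut (digestion) probability
    p_f    -- false-cut probability
    """
    n, m = len(a), len(l)
    T = [[math.inf] * (m + 1) for _ in range(n + 1)]
    T[0][0] = 0.0  # assumed boundary condition, not stated on the slide
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for u in range(1, min(i, max_merge) + 1):
                var = sum(s * s for s in sigma[i - u:i])
                a_sum = sum(a[i - u:i])
                for v in range(1, min(j, max_merge) + 1):
                    prev = T[i - u][j - v]
                    if prev == math.inf:
                        continue
                    diff = sum(l[j - v:j]) - a_sum
                    cost = (prev
                            + math.log(math.sqrt(2 * math.pi * var) / p_c)
                            + diff * diff / (2 * var)
                            + (u - 1) * math.log(1 / (1 - p_c))
                            + (v - 1) * math.log(1 / p_f))
                    T[i][j] = min(T[i][j], cost)
    return T
```

The triple merge loop makes the cost of each cell proportional to `max_merge` squared, which is why the sketch caps `u` and `v` instead of letting them range up to `i` and `j`.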

P. falciparum c14 Alignment

Map Assisted Sequence Assembly • Using Multi-Enzyme Optical Maps to anchor Sequence Contigs. • Sequence Assembly (Speed + Accuracy) • Sequence Validation • Sequence Contig Phasing • Characterizing the gaps and finishing

Sequence Anchoring • Probability that a random sequence $Y$ ($|Y| = L$) gets anchored at the wrong position in a map of genome length $G$: $P_{FP} \approx 1 - e^{-prG}$ • where $r \le 2\,(pL)^m\, e^{-mpL(1-b/2)}$ • Number of enzymes = $m$ • Cutting rate = $p$ • Relative sizing error = $b$ • For BACs, about 5 enzymes suffice to anchor 5 Kb sequence contigs
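As a rough illustration of how the anchoring bound behaves, the sketch below plugs the upper bound on $r$ into $P_{FP} \approx 1 - e^{-prG}$. The function name and all numeric inputs used to call it are hypothetical; the slide gives no concrete cutting rate, so any values passed in are assumptions.

```python
import math

def false_anchor_probability(G, L, m, p, b):
    """Upper estimate of P_FP = 1 - exp(-p * r * G), using the bound
    r <= 2 * (p*L)**m * exp(-m * p * L * (1 - b/2)).

    G -- genome (or clone) length, L -- sequence contig length,
    m -- number of enzymes, p -- cutting rate per base,
    b -- relative sizing error. Lengths must share one unit.
    """
    r_bound = 2.0 * (p * L) ** m * math.exp(-m * p * L * (1.0 - b / 2.0))
    return 1.0 - math.exp(-p * r_bound * G)
```

Because $pL\,e^{-pL(1-b/2)} < 1$ for any sizing error $b < 1$, the bound on $r$ shrinks geometrically with each added enzyme, which is the mechanism behind using several enzymes for anchoring.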

A Simple Analysis • $C_R$ = Coverage of the reference • $C_T$ = Coverage of the individual • $G$ = Length of the genome • $p_E$ = Cutting frequency of the enzyme • $b$ = Sizing error • $L$ = Length of the genomic fragment • $X$ = A region of differences • Pr[A single fragment covers “X”] $= (L-X)/(G-L)$ • Pr[None of $n$ random fragments covers “X”] $= [1 - (L-X)/(G-L)]^n \approx e^{-c(1-X/L)}$, with coverage $c = nL/G$, so Pr[At least one of $n$ random fragments covers “X”] $\approx 1 - e^{-c(1-X/L)}$ • Probability that the difference of length “X” is detected by genome-wide map $= \{1 - e^{-C_R(1-X/L)} - e^{-C_T(1-X/L)}\} \times \{1 - (b/2)^{(X p_E - 1)}\}$ • (See Mathematica Demo: GenomeCompare.nb)

An Example • Genome Length = $G$ = 3300 Mb; • Average Fragment Length = $L$ = 2 Mb; • Coverages: • $C_R$ = 12; • $C_T$ = 6; • $l$ = 25 Kb; $e$ = 3 Kb; $b = e/l$; • $p_E = 1/l$; • Probability that the difference of length “X” is detected by genome-wide map = • $\{1 - e^{-C_R(1-X/L)} - e^{-C_T(1-X/L)}\}$ • $\times\ \{1 - (b/2)^{(X p_E - 1)}\}$ • (See Mathematica Demo: GenomeCompare.nb)
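The example parameters can be plugged into the detection formula directly. The sketch below is not the GenomeCompare.nb demo; it is a small stand-in with a hypothetical function name. It assumes the last factor is $(b/2)$ raised to the power $X p_E - 1$, which is one reading of the flattened slide notation, and it works in kilobases throughout.

```python
import math

def detection_probability(X, L, CR, CT, b, pE):
    """Probability that a difference of length X is detected,
    following the slide's formula (exponent reading assumed)."""
    frac = 1.0 - X / L
    coverage_term = 1.0 - math.exp(-CR * frac) - math.exp(-CT * frac)
    sizing_term = 1.0 - (b / 2.0) ** (X * pE - 1.0)
    return coverage_term * sizing_term

# Parameters from the example slide, in kilobases.
L_kb = 2000.0        # average fragment length, 2 Mb
CR, CT = 12, 6       # coverages of reference and individual
l_kb, e_kb = 25.0, 3.0
b = e_kb / l_kb      # relative sizing error
pE = 1.0 / l_kb      # cutting frequency

for X_kb in (50.0, 100.0, 500.0):
    print(X_kb, detection_probability(X_kb, L_kb, CR, CT, b, pE))
```

Under this reading, the coverage term falls as $X$ approaches $L$ while the sizing term rises with $X$, so the formula only makes sense for differences spanning more than one restriction fragment.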

Some Interesting Applications • Haplotyping • Phasing haplotypes unambiguously (both for SNPs and RFLPs) • Rearrangement events • Amplifications and Deletions • Translocations • Synteny groups • Hemizygous Deletions

Improving the Resolution • Markers, characterized by genome-wide optical maps, can be indexed to genes. • Functionality of these genes can be established from • homology searches • motif identification or • simply literature searches • Many genes will need to be characterized in the context of experimental systems and populations. • Genotype/phenotype relations can be established through large population studies, or kinship analysis, where the analysis is a combination of • high-resolution cytogenetics (optical maps) and • RFLP analysis.

GENTIG ALGORITHM

Genomic Contig Problem: GCP • Given: $M$ intervals (genomic DNA segments), each of length $L$: $D_1, D_2, \ldots, D_M$ • Each $D_j$ carries cut sites $0 < c_{j1} < c_{j2} < \cdots < c_{jn} < L$, each of which is a true or false (optical) cut site • $p$ = Digestion rate, • $k$ ($> 3$) = Goodness • Goal: Place the $M$ intervals on the real line by fixing the alignment (orientation + position) of each $D_j$: $D_j \mapsto A_j = (D_j, \mathrm{sgn}, x_j)$, $\mathrm{Int}(D_j) = I_j = [x_j, x_j + L]$ • Subject to additional constraints

Constraints for GCP • Composite Map: $0 < m_1 < m_2 < \cdots < m_K$, with $\forall m_i:\ |\{ m_i \in A_j \}| \,/\, |\{ m_i \in I_j \}| > p$ • Every admissible placement induces a permutation of the $D_i$'s (determined by the positions of their left ends): $\pi \to$ Permutation, $A_{\pi(1)}, A_{\pi(2)}, \ldots, A_{\pi(M)}$ • Goodness: $c(A_1, \ldots, A_M) = \min |\{ m_i \in A_{\pi(j)} \cap A_{\pi(j+1)} \}| \ge k$

GCP is NP-Complete • Transformation from the Hamiltonian Path Problem restricted to cubic graphs. Choose $p = 3/4$ & $k = M$

NP-Completeness • If $G$ has a Hamiltonian path $v_1, v_2, \ldots, v_M$, then the admissible placement is $D_1, D_2, \ldots, D_M$ with at most two intervals $I_j$ & $I_{j+1}$ overlapping with $k$ cuts in common. • Conversely, any admissible placement with a goodness $> k$ induces a permutation $\pi$ on the indices of the vertices of $G$: $v_{\pi(1)}, v_{\pi(2)}, \ldots, v_{\pi(M)}$ is a Hamiltonian path. • (Figure: vertices v1, v2, v3 and the corresponding intervals D1, D2, D3 with their consensus map.)

Overlap Rule • Comparing Two Genomic Restriction Maps: Given two maps A and B, we say that they overlap if: 1. $k$ or more of the restriction fragments align positionally (subject to sizing error); 2. the number of unmatched fragments in either prefix is bounded by $r$.

Comparing Maps: Effect of Partial Digestion • Parameters: • Partial digestion probability, $p$ • Relative sizing error, $b$ • # Restriction fragments, $n$ • Overlap threshold ratio, $q$ • $m = np$ = Expected # detected restriction fragments. • Controlling False Negatives: choose $k \le n p^4 q / 2$ and $r = k_1 / p^4$, with $k_1 \approx 2$. If the clones A and B do in fact overlap, then we will detect it with probability at least $(1 - e^{-k_1})\,(1 - e^{-n p^4 q / 8})$
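To see how strongly the false-negative bound depends on the digestion rate, the expressions above can be evaluated numerically. This is a minimal sketch; the function name and the sample inputs are hypothetical and do not come from the slides.

```python
import math

def overlap_detection_lower_bound(n, p, q, k1=2.0):
    """Return (k_max, r, prob): the largest allowed overlap threshold
    k = n*p**4*q/2, the prefix bound r = k1/p**4, and the lower bound
    (1 - exp(-k1)) * (1 - exp(-n*p**4*q/8)) on detecting a true overlap."""
    k_max = n * p**4 * q / 2.0
    r = k1 / p**4
    prob = (1.0 - math.exp(-k1)) * (1.0 - math.exp(-n * p**4 * q / 8.0))
    return k_max, r, prob

# Hypothetical inputs: 200 fragments per map, overlap ratio 0.2.
for p in (0.6, 0.8, 0.95):
    print(p, overlap_detection_lower_bound(n=200, p=p, q=0.2))
```

The $p^4$ factor means that a modest drop in digestion rate sharply lowers the admissible threshold $k$ and raises the prefix allowance $r$.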

Overlap Rule • Controlling False Positives: Consider an arbitrary alignment. Let the random variable $W$ denote the number of fragments in clone A that positionally match the fragments of clone B: $E\big[\binom{W}{i}\big] = \binom{m}{i}\,(b/2)^i \approx \frac{1}{i!}\,(npb/2)^i$ • By Brun’s sieve, $\Pr[W = i] = \frac{1}{i!}\,(bnp/2)^i\, e^{-bnp/2}$, i.e. $W$ is Poisson with mean $\sim bnp/2$ • and the false positive probability is $4r \sum_{i=k}^{\infty} \frac{1}{i!}\,(bnp)^i\, e^{-bnp/2}$ • Make $r$ as small & $k$ as large as possible
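The Poisson step can be checked numerically. The binomial moments above correspond to $m = np$ independent fragments, each matching by chance with probability $b/2$; the sketch below compares that binomial distribution with the Poisson distribution of mean $bnp/2$. The function names and the sample values of $n$, $p$ and $b$ are assumptions chosen for illustration only.

```python
import math

def binomial_pmf(i, m, prob):
    """Pr[W = i] when each of m fragments matches independently."""
    return math.comb(m, i) * prob**i * (1.0 - prob) ** (m - i)

def poisson_pmf(i, mean):
    """Poisson approximation to Pr[W = i]."""
    return math.exp(-mean) * mean**i / math.factorial(i)

# Hypothetical parameters: n fragments, digestion rate p, sizing error b.
n, p, b = 200, 0.8, 0.12
m = round(n * p)          # expected number of detected fragments
mean = b * n * p / 2.0    # Poisson mean b*n*p/2

for i in (0, 5, 10, 15, 20):
    print(i, binomial_pmf(i, m, b / 2.0), poisson_pmf(i, mean))
```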

Experimental Design • Relation among the error parameters: $3bnp/4 \le k \le n p^4 q / 2 \ \Rightarrow\ p = \left(\frac{3b}{2q}\right)^{1/3}$ • Parameter choice for shotgun mapping: make the partial digestion probability rather high (close to 1) or the relative sizing error as low … for instance by using a rare cutter.
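The relation above follows from requiring the window for $k$ to be non-empty: $3bnp/4 \le np^4q/2$ holds exactly when $p^3 \ge 3b/(2q)$. A small sketch of that calculation is given below. It is not the GentigFeasibility.nb demo; the function names are hypothetical, and the overlap threshold ratio $q = 0.35$ in the usage lines is an assumed value, since the slides do not state one.

```python
def minimum_digestion_rate(b, q):
    """Smallest p for which 3*b*n*p/4 <= n*p**4*q/2 can hold."""
    return (3.0 * b / (2.0 * q)) ** (1.0 / 3.0)

def k_window(n, p, b, q):
    """Lower and upper limits on the overlap threshold k."""
    return 3.0 * b * n * p / 4.0, n * p**4 * q / 2.0

b = 3.0 / 25.0   # 3 kb sizing error on 25 kb fragments
q = 0.35         # assumed overlap threshold ratio (not from the slides)
p_min = minimum_digestion_rate(b, q)
print(p_min, k_window(200, p_min, b, q))
```

At `p_min` the two limits coincide, so any lower digestion rate leaves no admissible value of $k$.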

Contour Plot as a Function of Sizing Error (x-axis) and Digestion Rate (y-axis) • The calculation is for the human genome, G = 3,300 Mb. • The average molecule length = 5 Mb, with an overlap of 1 Mb • The average restriction fragment length = 25 kb • For a sizing error of 3 kb, the required digestion rate is ~80% • If the sizing error is reduced to 2 kb, the required digestion rate drops to ~70%… • (See Mathematica Demo: GentigFeasibility.nb)

Gentig (GENomic conTIG) Algorithm • Scoring Function - An upper bound estimate of the false positive overlap probability - A Bayesian probability estimate for the proposed placement Maximize the Bayesian Probability Density subject to the False Positive Probability Constraint *GREEDY ALGORITHM*

Other Ongoing Projects • Valis Bioinformatic Environment & Language • (Funded by DOE & NYSTAR) • Microarray-based Genomic Mapping • (In collaboration with CSHL & funded by NCI/NIH) • Expression Data Analysis • (In collaboration with NYU Biology & funded by NSF) • Cell Informatics • (Funded by DARPA)

Valis Architecture • Valis aims to address all aspects of post-genomic biology. • With this goal in mind we built a powerful computational infrastructure • With a distributed architecture consisting of a Linux cluster and customized special hardware for homology search • A large database system for massive amounts of biological data in multiple forms • Mathematical, statistical and algorithmic tools that can handle the multitude of scientific problems arising from bioinformatics, comparative and functional genomics, cell informatics, population genomics, etc.

Algorithmic Support • Wide classes of mathematical and computational tools are integrated into Valis: • Most of the work and the interesting technical developments are algorithmic. The tools rely on mathematical ideas from • combinatorics, • probabilistic methods, • statistics, • kinetic modeling, • and dynamical and discrete event systems. • For example, to construct probe maps using microarrays, the algorithm relies on a probabilistic analysis of when nearby probes get hybridized by a low coverage sample from a clone library. This analysis is built into the design of the microarray experiments, and also exploited by the data structures used in the algorithm.

Bio-computing: • Joint Project involving Cold Spring Harbor Lab & Courant Institute: • “Algorithmic Tools and Computational Frameworks for Cell Informatics:” 2001-2004 • Two areas of interest • Computational Tools: • Valis Informatics Tools • Simulation Tools • Reasoning Tools • Biological Experiments: • DNA Evolution • Cell Communication

Cocultivation Experiments: • Cells signal through communication proteins. • Many communication proteins fall into two classes: extracellular factors and external receptors. • Factor-receptor interactions occur in pairs and influence the genes and proteins that cells express. • Factors and receptors are encoded by genes, about a thousand of each class. • Only a few of each class are expressed in cells of a particular type. • The pairings of factors and receptors are largely unknown. • The consequences of most factor-receptor interactions are unknown. • These pairings and their consequences are explored by cell cocultivation experiments. • We examine cell types A and B alone, and when cocultured (“A c B”)… • We examine the genes expressed by cells using DNA microarrays, which quantitate tens of thousands of genes simultaneously.

Experimental Results • Cell A is a carcinoma (derived from ectoderm), • Cell B is a sarcoma (derived from mesoderm), • The data displayed are ratios of expressed genes, each point the ratio of either A or B alone, or A and B cocultured (“A c B”) vs the combination of both A and B cultured alone and then combined (“A+B”).

A Rudimentary Simulation with Mathematica Specifications of cell types- Simulation & Inference

Concluding Remarks • The central claim of this proposal is that, by drawing upon mathematical approaches developed in the context of dynamical systems, kinetic analysis, computational theory and logic, it is possible to create powerful simulation, analysis and reasoning tools for working biologists to be used in deciphering existing data, devising new experiments and ultimately, understanding functional properties of genomes, proteomes, cells, organs and organisms. • Risks & Challenges: • Effectively integrating diverse methodologies to study a monolithic, heterogeneous and complex system • Multiple hierarchical levels of fidelity • Multiple spatio-temporal scales • Automatically designing experiments that can falsify and eventually revise existing models. • Milestones and Deliverables: • Kinetic-modeling prototyping tool with VALIS interface • A quantitative sequence analysis tool • A genome-generator (based on a stochastic context-free grammar and simulation of genome-rearrangements) • A preliminary integrated tool combining simulation, visualization, numerical integration and symbolic algebraic analysis • A hybrid system simulator and modal logic reasoning system • A simulation and combinatorial/probabilistic analysis system for DNA repair, polymerase stuttering and unselected DNA drift (using kinetic models) • A multi-fidelity model of signal transduction • Experimental validation of microarray-based, computationally driven model of the RAS pathway
