Annotation and Alignment of the Drosophila Genomes

Annotation and Alignment of the Drosophila Genomes. Centro de Ciencias Genomicas, May 29, 2006.



Genes or Regulation?

  • “10,516 putative orthologs have been identified as a core gene set conserved over 25–55 million years (Myr) since the pseudoobscura/melanogaster divergence”

  • “Cis-regulatory sequences are more conserved than random and nearby sequences between the species—but the difference is slight, suggesting that the evolution of cis-regulatory elements is flexible”

Richards et al., Comparative genome sequencing of Drosophila pseudoobscura: Chromosomal, gene, and cis-element evolution, Genome Res., Jan 2005.



http://rana.lbl.gov/drosophila/wiki/



BP England, U Heberlein, R Tjian. Purified Drosophila transcription factor, Adh distal factor-1 (Adf-1), binds to sites in several Drosophila promoters and activates transcription, J Biol Chem 1990.



  • S. Chatterji and L. Pachter, GeneMapper: Reference based annotation with GeneMapper, in press.

http://bio.math.berkeley.edu/genemapper/



Genes or Regulatory Elements?

  • “10,516 → 10,867 putative orthologs have been identified as a core gene set conserved over 25–55 million years (Myr) since the pseudoobscura/melanogaster divergence”

  • “Cis-regulatory sequences are more conserved than random and nearby sequences between the species—but the difference is slight, suggesting that the evolution of cis-regulatory elements is flexible”

Richards et al., Comparative genome sequencing of Drosophila pseudoobscura: Chromosomal, gene, and cis-element evolution, Genome Res., Jan 2005.



Alignment of coding sequence

DroAna_20041206_ GTCGCTCAACCAGCATTTGCAAAAGTCGCAGAACTTGCGCTCATTGGATTTCCAGTACTC

DroMel_4_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC

DroMoj_20041206_ GTCGCTTAACCAGCATTTACAGAAATCGCAATACTTGCGTTCATTGGATTTCCAGTACTC

DroPse_1_ GTCGCTCAGCCAGCACTTGCAGAAGTCGCAGTACTTGCGCTCGTTTGATTTCCAGAATTC

DroSim_20040829_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC

DroVir_20041029_ GTCGCTCAACCAGCATTTGCAGAAGTCGCAATACTTGCGTTCATTCGACTTCCAGTACTC

DroYak_1_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTCCGCTCGTTTGACTTCCAGTACTC

****** * ****** ** ** ** ***** **** ** ** ** ** ****** * **

Alignment of non-coding sequence

DroAna_20041206_ CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG

DroMel_4_ CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT

DroMoj_20041206_ CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA-------

DroPse_1_ CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----

DroSim_20040829_ CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT-

DroVir_20041029_ CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA-------

DroYak_1_ CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT

*** * * *

DroAna_20041206_ AATC-----ACTTAC

DroMel_4_ ATTCTATGGACTCAC

DroMoj_20041206_ ----TATTTACTCAC

DroPse_1_ ------TGTACTTAC

DroSim_20040829_ ATTCTATGGACTCAC

DroVir_20041029_ ----TATTTACTCAC

DroYak_1_ ATTTCATAAACTCAC

*** **



Alignment of coding sequence

DroAna_20041206_ GTCGCTCAACCAGCATTTGCAAAAGTCGCAGAACTTGCGCTCATTGGATTTCCAGTACTC

DroMel_4_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC

DroMoj_20041206_ GTCGCTTAACCAGCATTTACAGAAATCGCAATACTTGCGTTCATTGGATTTCCAGTACTC

DroPse_1_ GTCGCTCAGCCAGCACTTGCAGAAGTCGCAGTACTTGCGCTCGTTTGATTTCCAGAATTC

DroSim_20040829_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC

DroVir_20041029_ GTCGCTCAACCAGCATTTGCAGAAGTCGCAATACTTGCGTTCATTCGACTTCCAGTACTC

DroYak_1_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTCCGCTCGTTTGACTTCCAGTACTC

****** * ****** ** ** ** ***** **** ** ** ** ** ****** * **

Alignment of non-coding sequence

droAna1.2448876 CTGAAGGAATTCTA--TATTAAAG-------------------------------

dm2.chr2L CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC

droMoj1.contig_2959 CTGGAATAGTTAATTTCATTGTAA---------CACATAAA--CGTTTTAAATTC

dp3.chr4_group3 CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG

droSim1.chr2L CTGCGGGATTAGGAGTCATTAGAG---------TGCGGAAAAGCGGG--TTATTC

droVir1.scaffold_6 CTGCAGCAGTTAA-ATAATTGTAA---------TAAACAA----TTCTCTAATTT

droYak1.chr2L CTGCGGGATTAGCGGTCATTGGTG---------TGAAGAATAGATCCT-TTATTT

*** * * * *

droAna1.2448876 AAGATTTCTCATCATTGGTTGAATC---------------------ACTTAC

dm2.chr2L -----------------------------------------TATGGACTCAC

droMoj1.contig_2959 -------------------------AAATATTT--------TATTGACTCAC

dp3.chr4_group3 -----------------------------------------TGT--ACTTAC

droSim1.chr2L -----------------------------------------TATGGACTCAC

droVir1.scaffold_6 ---------------------------------AAATATTTGGTCCACTCAC

droYak1.chr2L -----------------------------------------CATAAACTCAC

*** **



UUCCCUAG--------CAAGUACCUCA------------------
UUCCCUAG--------CAAGUACCUCA------------------
UUCCCUAG--------CAAGUACCUCA------------------
UUCCUUAGACUCUUAGCAAGUACCUCA------------------
UUCCUUAGACUCUUAGAAAGUACCUCAAAAACGAAAUGCGAACAC
GACUCU----UUUUAGCAAGUACCUCAAAAUAUUUAAUUAAA-AC
ACUCUU----UUUUAGCAAGUACCUCAAGAAUUACAAUUAAAUAU

let-7: AUGGAGU

Grün et al., microRNA target predictions across seven Drosophila species and comparison to mammalian targets, PLoS Computational Biology, June 2005.

Lall et al., A genome-wide map of conserved microRNA targets in C. elegans, Current Biology, February 2006.

Example of a conserved microRNA target
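A minimal sketch of what "conserved" can mean for this example. It assumes that the run of aligned RNA on the slide splits into seven 45-column rows, and that AUGGAGU is the let-7 sequence written 3'→5' so that it pairs base by base with the target. The helper names are hypothetical and G-U wobble pairs are ignored.

```python
# Watson-Crick RNA pairs only (an assumption; G-U wobble is ignored here).
PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}

def pairing_site(mirna_3to5: str) -> str:
    """Target site (5'->3') pairing column by column with a miRNA written 3'->5'."""
    return "".join(PAIR[base] for base in mirna_3to5)

def site_columns(row: str, site: str) -> list[int]:
    """Alignment columns (0-based) at which `site` starts in a gapped row."""
    k = len(site)
    return [i for i in range(len(row) - k + 1) if row[i:i + k] == site]

# Rows as split from the slide text (the split itself is an assumption).
ROWS = [
    "UUCCCUAG--------CAAGUACCUCA------------------",
    "UUCCCUAG--------CAAGUACCUCA------------------",
    "UUCCCUAG--------CAAGUACCUCA------------------",
    "UUCCUUAGACUCUUAGCAAGUACCUCA------------------",
    "UUCCUUAGACUCUUAGAAAGUACCUCAAAAACGAAAUGCGAACAC",
    "GACUCU----UUUUAGCAAGUACCUCAAAAUAUUUAAUUAAA-AC",
    "ACUCUU----UUUUAGCAAGUACCUCAAGAAUUACAAUUAAAUAU",
]

site = pairing_site("AUGGAGU")
columns = [site_columns(row, site) for row in ROWS]
# Conserved here means: the site occurs in every row at the same alignment columns.
conserved = bool(columns[0]) and all(c == columns[0] for c in columns)
```

Requiring the same alignment columns in every row is stricter than requiring the site to occur somewhere in each sequence, which is why the quality of the alignment matters for target prediction.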



Richards et al., Comparative genome sequencing of Drosophila pseudoobscura: Chromosomal, gene, and cis-element evolution, Genome Res., Jan 2005.



How is an alignment made from two sequences?

Given two sequences of lengths n,m:

>dm2.chr2L

CTGCGGGATTAGGGGTCATTAGAGTGCCGAAAAGCGAGTTTATTCTATGGACTCAC

>dp3.chr4_group3

CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCGTGTACTTAC

n=50

m=62

?

dm2.chr2L CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC

dp3.chr4_group3 CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG

dm2.chr2L TATGGACTCAC

dp3.chr4_group3 TGT--ACTTAC



dm2.chr2L CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC

dp3.chr4_group3 CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG

dm2.chr2L TATGGACTCAC

dp3.chr4_group3 TGT--ACTTAC

#M=31, #X=22, #G=3, #S=12

DroMel_4_ CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT

DroPse_1_ CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----

DroMel_4_ ATTCTATGGACTCAC

DroPse_1_ ------TGTACTTAC

#M=27, #X=18, #G=3, #S=28

Each alignment can be summarized by counting the number of matches (#M), mismatches (#X), gaps (#G), and spaces (#S).

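Every match or mismatch column uses one base from each sequence and every space uses one base, so 2(#M+#X)+#S = n+m = 112. #M is therefore determined by the other counts, and #X, #G and #S suffice to specify a summary.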
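The counting rule can be sketched as follows, assuming an alignment is given as two equal-length gapped rows, that a space is a single "-" character, and that a gap is a maximal run of spaces within one row. The function name and the toy rows are hypothetical, not taken from the slides.

```python
from itertools import groupby

def summarize(row_a: str, row_b: str, gap: str = "-") -> dict:
    """Count matches (M), mismatches (X), gaps (G) and spaces (S) of a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    counts = {"M": 0, "X": 0, "G": 0, "S": 0}
    states = []  # per-column state, used to find maximal runs of spaces
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            raise ValueError("a column cannot consist of two spaces")
        if a == gap:
            states.append("space_in_a")
            counts["S"] += 1
        elif b == gap:
            states.append("space_in_b")
            counts["S"] += 1
        else:
            states.append("aligned")
            counts["M" if a == b else "X"] += 1
    # Each maximal run of spaces in the same row is one gap.
    counts["G"] = sum(1 for state, _ in groupby(states) if state != "aligned")
    return counts

# Toy alignment (not from the slides): sequences of length 6 and 7.
toy = summarize("ACG--TTA", "A-GCCTGA")
```

For the toy rows the identity 2(#M+#X)+#S = n+m reads 2(4+1)+3 = 13 = 6+7.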



The summary of an alignment is a point in 3-dimensional space with coordinates (#X, #G, #S).

For example, the two alignments just shown correspond to the points (22,3,12) and (18,3,28).

In the example of our two sequences there are 379522884096444556699773447791552717765633 different alignments, but only 53890 different summaries. So we don’t need to plot that many points.

But 53890 is still quite a large number. Fortunately, there are only 69 vertices on the convex hull of the 53890 points.

These are the interesting ones, and we can even draw them…
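The slides do not say how the number of alignments was obtained. One standard way to count pairwise alignments, assumed here, treats an alignment as a path through an (n+1) by (m+1) grid in which each step is an aligned column, a space in the first row, or a space in the second row. Such paths are counted by the Delannoy numbers. The sketch below shows that recurrence; it is not claimed to reproduce the figure quoted above, whose exact counting convention is not stated.

```python
def count_alignments(n: int, m: int) -> int:
    """Number of alignments of sequences of lengths n and m (Delannoy number D(n, m))."""
    prev = [1] * (m + 1)  # with an empty first sequence there is exactly one alignment
    for _ in range(n):
        cur = [1]  # with an empty second sequence there is exactly one alignment
        for j in range(1, m + 1):
            # last column is: space in row 2, space in row 1, or an aligned pair
            cur.append(prev[j] + cur[j - 1] + prev[j - 1])
        prev = cur
    return prev[m]
```

Python integers are arbitrary precision, so the same function can be called with lengths in the range of the example sequences without overflow.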


For the sequences:

>mel

CTGCGGGATTAGGGGTCATTAGAGTGCCGAAAAGCGAGTTTATTCTATGGAC

>pse

CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCGTGTAC

the alignment polytope is:

[Figure: alignment polytope; one vertex is labeled 49 with #X=24, #S=10, #G=2]

There are eight alignments that have this summary.



mel CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGAGTTTATTCTATGGAC

pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGAGTTTATTCTATGGAC

pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC

mel CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGTTTATTCTATGGAC

pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGTTTATTCTATGGAC

pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC

mel CTGCGGGATTAGGGGTCATTAGA---------GTGCCGAAAAGCGAGTTTATTCTATGGAC

pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel CTGCGGGATTAGGGGTCATTAGA---------GTGCCGAAAAGCGAGTTTATTCTATGGAC

pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC

mel CTGCGGGATTAGGGGTCATTAG---------AGTGCCGAAAAGCGAGTTTATTCTATGGAC

pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel CTGCGGGATTAGGGGTCATTAG---------AGTGCCGAAAAGCGAGTTTATTCTATGGAC

pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC



mel CTGCGGGATTAGGGGTCATTAGAGT===------===GCCGAAAAGCGAGTTTATTCTA=TGGAC

pse CTGGAAGAGTTTTGATTAGTAG===GGGATCCATGGGGGCGAGGAGAGGCCATCATC==GTGTAC

Consensus at a vertex



49 #x=24, #S=10, #G=2

The vertices of the polytope have special significance.

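Given parameters for a model, e.g. the default parameters for MULTIZ:

M = 100, X = -100, S = -30, G = -400

the summary of an optimal alignment is the result of maximizing the score M*(#M)+X*(#X)+S*(#S)+G*(#G) over the polytope. Substituting #M = (n+m-#S)/2 - #X turns the score into the constant 50(n+m) plus the linear form

-200*(#X)-400*(#G)-80*(#S)

Thus, the vertices of the polytope correspond to optimal alignments.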
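A small numerical check of this reduction, using only the three summaries quoted in these slides, written as (#X, #G, #S), and the total length n+m = 112. The function names are hypothetical. The sketch evaluates the full score and the linear form on each summary and picks the maximizer among these three points only, not over the whole polytope.

```python
def full_score(x, g, s, total_len, M=100, X=-100, S=-30, G=-400):
    """Score M*#M + X*#X + S*#S + G*#G, with #M recovered from 2(#M+#X)+#S = n+m."""
    matches = (total_len - s) // 2 - x
    return M * matches + X * x + S * s + G * g

def linear_form(x, g, s):
    """The reduced objective -200*#X - 400*#G - 80*#S."""
    return -200 * x - 400 * g - 80 * s

TOTAL_LEN = 112                                   # n + m
summaries = [(22, 3, 12), (18, 3, 28), (24, 2, 10)]
best = max(summaries, key=lambda p: linear_form(*p))
```

Among these three points the maximizer is (24, 2, 10), the summary attached to the labeled vertex, and the two objectives differ by the same constant 50*112 on every point.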



Needleman-Wunsch Alignment

What is usually done is that a single set of parameters is specified (M = 100, X = -100, S = -30, G = -400 is a standard default) and then the optimal vertex is identified using dynamic programming. An alignment optimal for the vertex is then selected. The running time of the algorithm is O(nm) [Needleman-Wunsch, 1970; Smith-Waterman, 1981] and it requires O(n+m) space [Hirschberg, 1975].

Standard scoring schemes are:

Parameters            Model
M, X, S               Jukes-Cantor with linear gap penalty
M, X, S, G            Jukes-Cantor with affine gap penalty
M, XTS, XTV, S, G     Kimura 2-parameter with affine gap penalty
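The dynamic program can be sketched for the simplest scheme in the table, the linear gap model with parameters M, X and S (no separate gap-open term G). This is an illustrative implementation under that assumption, not the code behind any of the tools named in the talk. It stores the full table, so it uses O(nm) space rather than the O(n+m) of Hirschberg's refinement.

```python
def needleman_wunsch(a: str, b: str, M: int = 100, X: int = -100, S: int = -30):
    """Global alignment with linear gap penalty; returns (score, row_a, row_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * S              # a[:i] aligned against spaces only
    for j in range(1, m + 1):
        score[0][j] = j * S              # b[:j] aligned against spaces only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (M if a[i - 1] == b[j - 1] else X)
            score[i][j] = max(diag, score[i - 1][j] + S, score[i][j - 1] + S)

    # Trace back one optimal path from the bottom-right corner.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = M if i > 0 and j > 0 and a[i - 1] == b[j - 1] else X
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + S:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))

best_score, top, bottom = needleman_wunsch("AC", "AGC")
```

The traceback returns only one optimal alignment; as the slides point out, several alignments can share the same optimal summary.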



Building Drosophila whole genome multiple alignments

  • MAVID

  • http://hanuman.math.berkeley.edu/kbrowser

  • MULTIZ

  • http://genome.ucsc.edu/

    (currently no D. erecta)



DroAna_20041206_ CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG

DroMel_4_ CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT

DroMoj_20041206_ CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA-------

DroPse_1_ CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----

DroSim_20040829_ CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT-

DroVir_20041029_ CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA-------

DroYak_1_ CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT

*** * * *

DroAna_20041206_ AATC-----ACTTAC

DroMel_4_ ATTCTATGGACTCAC

DroMoj_20041206_ ----TATTTACTCAC

DroPse_1_ ------TGTACTTAC

DroSim_20040829_ ATTCTATGGACTCAC

DroVir_20041029_ ----TATTTACTCAC

DroYak_1_ ATTTCATAAACTCAC

*** **

MAVID

N. Bray and L. Pachter, MAVID: Constrained ancestral alignment of multiple sequences, Genome Research 14 (2004) p 693--699



Needleman-Wunsch



droAna1.2448876 CTGAAGGAATTCTA--TATTAAAG-------------------------------

dm2.chr2L CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC

droMoj1.contig_2959 CTGGAATAGTTAATTTCATTGTAA---------CACATAAA--CGTTTTAAATTC

dp3.chr4_group3 CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG

droSim1.chr2L CTGCGGGATTAGGAGTCATTAGAG---------TGCGGAAAAGCGGG--TTATTC

droVir1.scaffold_6 CTGCAGCAGTTAA-ATAATTGTAA---------TAAACAA----TTCTCTAATTT

droYak1.chr2L CTGCGGGATTAGCGGTCATTGGTG---------TGAAGAATAGATCCT-TTATTT

*** * * * *

droAna1.2448876 AAGATTTCTCATCATTGGTTGAATC---------------------ACTTAC

dm2.chr2L -----------------------------------------TATGGACTCAC

droMoj1.contig_2959 -------------------------AAATATTT--------TATTGACTCAC

dp3.chr4_group3 -----------------------------------------TGT--ACTTAC

droSim1.chr2L -----------------------------------------TATGGACTCAC

droVir1.scaffold_6 ---------------------------------AAATATTTGGTCCACTCAC

droYak1.chr2L -----------------------------------------CATAAACTCAC

*** **

MULTIZ

Blanchette et al., Aligning multiple sequences with the threaded blockset aligner, Genome Research 14 (2004) p 708--715



One (possibly wrong) alignment is not enough: the history of parametric inference

  • 1992: Waterman, M., Eggert, M. & Lander, E.

    • Parametric sequence comparisons, Proc. Natl. Acad. Sci. USA 89, 6090-6093.

  • 1994: Gusfield, D., Balasubramanian, K. & Naor, D.

    • Parametric optimization of sequence alignment, Algorithmica 12, 312-326.

  • 2003: Wang, L., Zhao, J.

    • Parametric alignment of ordered trees, Bioinformatics, 19 2237-2245.

  • 2004: Fernández-Baca, D., Seppäläinen, T. & Slutzki, G.

    • Parametric Multiple Sequence Alignment and Phylogeny Construction, Journal of Discrete Algorithms, 2 271-287.

XPARAL

by Kristian Stevens and Dan Gusfield



Whole Genome Parametric Alignment

Colin Dewey, Peter Huggins, Lior Pachter, Bernd Sturmfels and Kevin Woods

  • Mathematics and Computer Science

    • Parametric alignment in higher dimensions.

    • Faster new algorithms.

    • Deeper understanding of alignment polytopes.

  • Biology

    • Whole genome parametric alignment.

    • Biological implications of alignment parameters.

    • Alignment with biology rather than for biology.



CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG

CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT

CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA-------

CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----

CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT-

CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA-------

CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT

analysis





computational geometry



A Whole Genome Parametric Alignment of D. melanogaster and D. pseudoobscura

  • Divided the genomes into 1,116,792 constrained and 877,982 unconstrained segment pairs.

    • This is an orthology map of the two genomes.

  • 2d, 3d, 4d, and 5d alignment polytopes were constructed for each of the 877,802 unconstrained segment pairs.

    • For each segment pair, obtain all possible optimal summaries for all parameters in a Needleman-Wunsch scoring scheme.

  • Computed the Minkowski sum of the 877,802 2d polytopes.

    • There are only 838 optimal alignments of the two Drosophila genomes if the same match, mismatch and gap parameters are used for all the segment pair alignments.
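The Minkowski sum step can be illustrated in two dimensions. The sketch below is a brute-force version that sums every pair of vertices and takes the convex hull with Andrew's monotone chain. It is an assumption-laden illustration, not the method used for the genome-scale computation, where summing hundreds of thousands of polytopes would call for a faster edge-merging approach. Function names and the toy polygons are hypothetical.

```python
def convex_hull(points):
    """Vertices of the 2d convex hull in counterclockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()          # drop points that are not strict left turns
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def minkowski_sum(P, Q):
    """Minkowski sum of two convex polygons given by their vertex lists."""
    return convex_hull([(px + qx, py + qy) for px, py in P for qx, qy in Q])

triangle = [(0, 0), (1, 0), (0, 1)]
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
total = minkowski_sum(triangle, square)
```

A vertex of the sum is the sum of one vertex from each summand that are optimal for a common direction, which is why the vertices of the summed polytope correspond to parameter choices shared by all segment pairs.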


How do we build the polytope for

>mel

CTGCGGGATTAGGGGTCATTAGAGTGCCGAAAAGCGAGTTTATTCTATGGAC

>pse

CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCGTGTAC

?



Alignment polytopes are small

Theorem: The number of vertices of an alignment polytope for two sequences of length n and m is O((n+m)^(d(d-1)/(d+1))), where d is the number of free parameters in the scoring scheme.

Examples:

Parameters            Model                                    Vertices
M, X, S               Jukes-Cantor with linear gap penalty     O((n+m)^(2/3))
M, X, S, G            Jukes-Cantor with affine gap penalty     O((n+m)^(3/2))
M, XTS, XTV, S, G     K2P with affine gap penalty              O((n+m)^(12/5))
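The exponents in the examples follow from the theorem if the three schemes have d = 2, 3 and 4 free parameters respectively. Those values of d are inferred here from the exponents themselves (each is one less than the number of listed parameters) rather than stated on the slide. A short check with exact fractions, using a hypothetical helper name:

```python
from fractions import Fraction

def vertex_exponent(d: int) -> Fraction:
    """Exponent d(d-1)/(d+1) in the O((n+m)^...) bound on the number of vertices."""
    return Fraction(d * (d - 1), d + 1)

exponents = {d: vertex_exponent(d) for d in (2, 3, 4)}
```

Since d(d-1)/(d+1) is less than d, the bound grows more slowly than the number of lattice points in a d-dimensional box of side n+m, which is the sense in which alignment polytopes are small.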

L. Pachter and B. Sturmfels, Parametric inference for biological sequence analysis, Proceedings of the National Academy of Sciences, Volume 101, Number 46 (2004), p 16138--16143.

L. Pachter and B. Sturmfels, Tropical geometry of statistical models, Proceedings of the National Academy of Sciences, Volume 101, Number 46 (2004), p 16132--16137.

L. Pachter and B. Sturmfels (eds.), Algebraic Statistics for Computational Biology, Cambridge University Press.



Back to Adf1

BP England, U Heberlein, R Tjian. Purified Drosophila transcription factor, Adh distal factor-1 (Adf-1), binds to sites in several Drosophila promoters and activates transcription, J Biol Chem 1990.



Back to Adf1

mel TGTGCGTCAGCGTCGGCCGCAACAGCG

pse TGT-----------------GACTGCG

*** ** ***

BLASTZ alignment



mel TGTG----CGTCAGC--G----TCGGCC---GC-AACAG-CG

pse TGTGACTGCG-CTGCCTGGTCCTCGGCCACAGCCAAC-GTCG

**** ** * ** * ****** ** *** * **

mel TGTGCGTCAGC------GTCGGCCGCAACAGCG

pse TGTGACTGCGCTGCCTGGTCCTCGGCCACAGC-

**** * ** *** * ** *****


80.4%

85.1%

86.5%

79.1%



Applications

  • Conservation of cis-regulatory elements

  • Phylogenetics: branch length estimation

Jukes-Cantor correction:

This is the expected number of mutations per site in an alignment with summary (x,s).
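The formula on this slide did not survive in the transcript. The standard Jukes-Cantor correction is d = -(3/4) ln(1 - (4/3) p), where p is the observed proportion of differing sites. The sketch below assumes that p is taken as #X divided by the number of aligned columns, (n+m-#S)/2, which is one natural reading of "an alignment with summary (x,s)" but is not confirmed by the source. The function name is hypothetical.

```python
import math

def jukes_cantor_distance(x: int, s: int, total_len: int) -> float:
    """Expected substitutions per site under Jukes-Cantor for a summary (x, s).

    total_len is n + m; the number of aligned columns is (n + m - s) / 2.
    """
    aligned = (total_len - s) / 2
    if aligned <= 0:
        raise ValueError("alignment has no aligned columns")
    p = x / aligned                       # observed proportion of mismatches
    if p >= 0.75:
        raise ValueError("correction undefined for p >= 3/4 (saturation)")
    return -0.75 * math.log(1 - 4 * p / 3)

# Summary of the labeled polytope vertex: #X = 24, #S = 10, n + m = 112.
d_vertex = jukes_cantor_distance(24, 10, 112)
```

Because the estimate depends on the summary, different vertices of the alignment polytope give different branch length estimates for the same pair of sequences.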



  • Algebraic Statistics

  • -- A language for unifying and developing many of the algorithms for biological sequence analysis --

    • The few inference functions theorem

    • Polytope propagation

    • Phylogenetic tree reconstruction

    • Evolutionary models

    • Maximum likelihood estimation

    • Mutagenic tree models

