Comparative Analysis of Promoter Sequences: The discovery of the Pribnow-box and some follow-up discoveries

Comparative Analysis of Promoter Sequences: The discovery of the Pribnow-box and some follow-up discoveries. Philipp Bucher In Silico Analysis of Proteins Celebrating the 20th Anniversary of Swiss-Prot Fortaleza – Brazil, Aug 3 2006.

Comparative Analysis of Promoter Sequences: The discovery of the Pribnow-box and some follow-up discoveries

Philipp Bucher

In Silico Analysis of Proteins

Celebrating the 20th Anniversary of Swiss-Prot

Fortaleza – Brazil, Aug 3 2006

Why a talk on promoters at a protein meeting? Aren’t promoters DNA sequences?

No. Promoters are not DNA sequences.

Any general representation of promoters, or algorithm to predict promoters, does not relate to intrinsic properties of DNA.

In fact, a profile or hidden Markov model representing promoter sequences constitutes a description of the DNA-binding surfaces of a protein in terms of base pair preferences.

Not surprisingly, therefore, the first consensus sequence for an E. coli promoter element was derived from seven sequences originating from six different species, including a eukaryotic virus.

Early comparative analysis of E. coli promoter sequences

FIG. 4. Comparison of promoter sequences (see text). b, Homologous sequence probably engaged by RNA polymerase; i, mRNA initiation point (underlined). Hyphens have been omitted. SV40, simian virus 40; w.t., wild type.

Among the promoter sequences, there is a homologous, 7-base sequence lying to the left of the initiation points. I feel that the DNA sequence

5' T-A-T-Pu-A-T-G 3'

3' A-T-A-Py-T-A-C 5'

is implicated in the formation of a tight binary complex with RNA polymerase.

Text and Figures from: Pribnow (1975) Proc. Nat. Acad. Sci. USA 72, 784-788.
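
The 7-base consensus quoted above can be written with IUPAC codes as TATRATG (R = purine). The following is a minimal sketch of how such a consensus could be scanned against a sequence; the function names and the example sequence are made up for illustration and are not taken from the talk or from Pribnow's paper.

```python
import re

# IUPAC codes used here: R = purine (A or G), Y = pyrimidine (C or T), N = any base.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus into a regex; the lookahead allows overlapping hits."""
    body = "".join(IUPAC[code] for code in consensus.upper())
    return re.compile("(?=(" + body + "))")


def find_matches(sequence, consensus="TATRATG"):
    """Return (0-based start, matched text) for every exact consensus match."""
    pattern = consensus_to_regex(consensus)
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


# Hypothetical example sequence, not a real promoter.
example = "ccgTATAATGcaggTATGATGtt"
for start, text in find_matches(example):
    print(start, text)
```

An exact consensus match is an all-or-nothing criterion; the weight matrices discussed later in the talk replace it with a graded score.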

E. coli promoters: Chapter 2

A second sequence motif, located about 35 bp upstream of the initiation site (position -35), was discovered based on a larger promoter sequence collection.

E. coli promoters: Chapter 3

The figure below illustrates the concept of functional homology between two promoter sequences. In particular, these footprint results confirm that the -35 and -10 elements are correctly assigned even though the spacing between the two elements is different (Siebenlist et al. 1980, Cell 20, 269-281).

The program TargSearch implements an early sequence profile method using position-specific residue weights and scores for alternative spacer lengths.
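
As a rough illustration of the idea (position-specific weights for two elements plus a score for each allowed spacer length), here is a minimal sketch. It is not the TargSearch implementation: the 0/1 weights built from the textbook hexamers TTGACA and TATAAT, the spacer scores, and all function names are assumptions chosen only to make the example runnable.

```python
def make_matrix(consensus):
    """Toy weight matrix (assumption): 1.0 for the consensus base, 0.0 otherwise."""
    return [{base: (1.0 if base == c else 0.0) for base in "ACGT"} for c in consensus]


MINUS35 = make_matrix("TTGACA")   # illustrative -35 element weights
MINUS10 = make_matrix("TATAAT")   # illustrative -10 element weights
SPACER_SCORE = {16: -0.5, 17: 0.0, 18: -0.5}  # hypothetical spacer-length scores


def score_window(matrix, seq, start):
    """Sum of position-specific weights for the window beginning at `start`."""
    return sum(col.get(seq[start + i], 0.0) for i, col in enumerate(matrix))


def best_promoter_score(seq):
    """Return (score, start of -35 element, spacer length) of the best placement, or None."""
    seq = seq.upper()
    best = None
    for i in range(len(seq) - len(MINUS35) + 1):
        s35 = score_window(MINUS35, seq, i)
        for spacer, spacer_score in SPACER_SCORE.items():
            j = i + len(MINUS35) + spacer          # start of the -10 element
            if j + len(MINUS10) > len(seq):
                continue
            total = s35 + spacer_score + score_window(MINUS10, seq, j)
            if best is None or total > best[0]:
                best = (total, i, spacer)
    return best


# Hypothetical sequence with both elements separated by a 17-bp spacer.
print(best_promoter_score("GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"))
```

Scoring the spacer explicitly lets two promoters with different element spacing receive comparable scores, which is the point of the footprint comparison on the previous slide.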

Early work on E. coli promoters: Important contributions to computational biology
  • Representation of functional molecular sequence motifs by IUPAC consensus sequences and weight matrices
  • A definition of functional homology and an experimental criterion for correct alignment of DNA sequence motifs.
  • Prediction algorithms using profile or HMM-like target description.
  • The idea that quantitative promoter prediction scores can and perhaps should be viewed as predictors of a protein property: the selectivity of RNA polymerase for a particular DNA ligand sequence.
Eukaryotic promoters: Differences with regard to E. coli promoters and other biological facts
  • Eukaryotic polymerases do not have intrinsic affinity to specific promoter sequences.
  • Eukaryotic promoters are recognized by a variety of transcription factors, each recognizing a specific target motif.
  • The binding sites of proteins that direct RNA polymerase to the promoter may be located at larger and more variable distances from the initiation sites. Moreover, these sites may occur in either orientation, or even downstream of the start site.
  • Tissue and developmental stage-specificity.
  • Epigenetic silencing mediated by chromatin condensation or DNA methylation.
EPD Essentials

Promoter definition: An experimentally mapped transcription initiation site.

Important assumption: A capped 5’ end of a eukaryotic mRNA is generated by transcriptional initiation, not endonucleolytic cleavage.

Primary data: (i) RNA sequencing, nuclease protection, and primer extension data published in journal articles, (ii) 5’ ESTs from cDNA clones obtained with the oligo-capping method (only recently).

Purpose: (i) Comparative analysis of promoter elements, (ii) training and test set for promoter prediction algorithms (iii) resource for experimental researchers.

Signal Search Analysis Essentials
  • History: Signal Search Analysis is an ancient method developed by myself in the early eighties in Max Birnstiel’s lab in Zurich (first published in 1984)
  • Purpose: to discover and characterize sequence motifs that occur at constrained distances from physiologically defined sites in nucleic acid sequences.
  • Recent event: Adaptation of software to new environment, SSA web server, application to promoters and translational start sites.
  • Note the difference: SSA programs serve to characterize
    • motifs that occur at constrained distances from sites
  • not:
    • motifs that are over-represented within sequence sets
  • There are hundreds of programs that address the latter problem, but only very few that serve the same purpose as the SSA programs!
Definition of a Locally Over-represented Sequence Motif

The definition of a locally over-represented sequence motif has three components:

  • A weight matrix or consensus sequence defining the motif
  • A cut-off value
  • A preferred region of occurrence with respect to a functional site, e.g. a transcription initiation site

The weight matrix or consensus sequence allows one to compute a match score for any subsequence of a promoter that has the same length as the matrix.

The cut-off value determines which subsequence constitutes a motif match.

The preferred region is the third criterion necessary to decide whether a given promoter contains a given locally over-represented sequence motif or not.

The difference in occurrence frequency inside and outside of the preferred region can be used as an objective function to optimize the three components of a locally over-represented sequence motif listed above.

A weight matrix definition for the TATA-box motif

See also: Bucher 1990, J. Mol. Biol. 212, 563-578.

Promoter prediction

Benchmark results from Fickett & Hatzigeorgiou 1997, Genome Res. 7, 861-878

Note: The false/random discovery rates (about 1 in 1 kb) are about 2 orders of magnitude too high if one assumes one promoter per 100 kb for the human genome (perhaps an underestimation).

At this unacceptably high false discovery rate the sensitivity barely exceeds 50% for most of the programs.
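
The arithmetic behind this note can be checked with a short calculation. The sketch below uses only the figures quoted above (about one false prediction per kb, one true promoter per 100 kb, sensitivity of about 50%); the variable and function names are illustrative.

```python
import math

false_per_kb = 1.0          # about one false/random prediction per kb (from the note)
true_per_kb = 1.0 / 100     # assumed density: one promoter per 100 kb
sensitivity = 0.5           # roughly the sensitivity reached by most programs


def positive_predictive_value(sensitivity, true_density, false_density):
    """Fraction of predictions that are real promoters."""
    true_hits = sensitivity * true_density
    return true_hits / (true_hits + false_density)


# False predictions outnumber real promoters by this factor.
excess = false_per_kb / true_per_kb
print(excess, math.log10(excess))
print(positive_predictive_value(sensitivity, true_per_kb, false_per_kb))
```

Under these assumptions false predictions outnumber true promoters by a factor of 100 (two orders of magnitude), so fewer than 1 in 100 predictions would be a real promoter.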

Why is eukaryotic promoter prediction so hard?

Technical reasons:

  • Too few promoters mapped experimentally
  • Low quality of experimental data resulting in inexact or wrong transcription initiation site mapping

Biological reasons:

  • Transcription initiation often appears to be a fuzzy process. The initiation sites pertaining to one promoter may be scattered over 50 bp or more.
  • There may be many useless promoters giving rise to rapidly degraded non-functional transcripts.
  • There may be too many promoter classes recognized by different combinations of transcription factors.
  • Tissue and developmental stage specificity. Most promoters are in fact silent in most tissues. Promoter prediction is partly a tissue-specific problem.
Progress may come from new technologies

Introduction of high throughput technologies for cDNA (mRNA) 5’end sequencing. Recent papers:

Oligo-capping technique: Suzuki et al. (2001) Identification and Characterization of the Potential Promoter Regions of 1031 kinds of human genes. Genome Res. 11:677-684.

CAGE: Carninci et al. (2006) Genome-wide analysis of mammalian promoter architecture and evolution. Nat. Genet. doi:10.1038/ng1789.

Close to one million 5’ tags of human transcripts have been analyzed with these techniques.

Processing of cDNA 5’tags has tripled the number of promoter entries in EPD in less than two years.

We have coined the term “in silico primer extension” designating the process of TSS mapping with cDNA 5’tag data.

In Silico Primer Extension: Essentials

Purpose:

  • to map transcription start sites to a genome,
  • to study the regulation of alternative promoter usage

Experimental procedures:

  • full-length cDNA synthesis (e.g. oligo-capping method)
  • Generation of 5’tags (EST sequencing, 5’SAGE, CAGE)

Computational procedures:

  • mapping of 5’ tags to the genome,
  • identification of clusters in mRNA 5’end profiles
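
The two computational steps can be illustrated with a minimal sketch: build an mRNA 5’ end profile from mapped tag positions, then group nearby positions into clusters. This uses a simple gap-based rule and is not the method implemented in MADAP; the thresholds, function names, and tag positions are assumptions for illustration.

```python
from collections import Counter


def five_prime_profile(tag_starts):
    """Count how many mapped 5' tags begin at each genomic position."""
    return Counter(tag_starts)


def cluster_profile(profile, max_gap=10, min_tags=2):
    """Group positions separated by at most `max_gap` bp; drop weakly supported groups."""
    groups, current = [], []
    for pos in sorted(profile):
        if current and pos - current[-1] > max_gap:
            groups.append(current)
            current = []
        current.append(pos)
    if current:
        groups.append(current)

    clusters = []
    for positions in groups:
        total = sum(profile[p] for p in positions)
        if total >= min_tags:
            clusters.append({
                "start": positions[0],
                "end": positions[-1],
                "tags": total,
                # Most frequently used start position within the cluster.
                "peak": max(positions, key=lambda p: profile[p]),
            })
    return clusters


# Made-up tag start positions on one strand of one chromosome.
tags = [100, 101, 101, 103, 150, 151, 151, 151, 300]
for cluster in cluster_profile(five_prime_profile(tags)):
    print(cluster)
```

Each resulting cluster would correspond to one candidate promoter, with the peak position as a candidate transcription start site.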
Promoter region defined by transcription start sites (TSS)

[Figure labels: genomic DNA; promoter; TSS; cDNAs; conventional primer extension experiment with gene-specific primer]

In Silico (Digital) versus in Vitro (Analog) Primer Extension

ccgagtcccctcacccctttccttcccacAGGTCCCTGGCCAAAGATTTATTTCTCTTGACAACCA

Our in Silico Primer Extension Pipeline

[Figure: pipeline flowchart. Labels: GenBank/EMBL 5’ EST entries of selected libraries; Unigene entry; RefSeq entry; Blast; trace files; cDNA 5’ tag (50 nt); genome sequence (2 kb); profile-based multiple sequence alignment method; mRNA 5’ end profile; 1-D clustering by MADAP; zero to several promoter entries]

Definition of Promoter Sites and Classes from cDNA 5’end Profiles with the Program MADAP

[Figure: number of 5’ ends of NEDO transcripts versus genomic position. Scale bars: 10 bp and 45 bp. Labeled coordinate ranges: 84047148-84047231 and 84046905-84046987]

In silico PE versus conventional techniques

[Figure: number of 5’ ends of DBTSS transcripts versus genomic position. Scale bar: 100 bp]

Characterization of three optional promoters in the 5' region of the human aldolase A gene.

Maire P. et al (1987) J. Mol. Biol. 197, 425-438

Comparative Evaluation of Human promoter Sets Compiled by Different Methods

Questions addressed:

  • What is the overlap and agreement in transcription start site definitions between the four data sets?
  • Is any of the data sets contaminated by a substantial number of non-promoter sequences?
  • Which method defines the transcription start site most accurately?
  • Is any of the four promoter compilations biased with regard to promoter subclasses?
Comparative Evaluation of Human promoter Sets Compiled by Different Methods

Goal of the project: to compare four different promoter (transcription start site) compilations:

  • EPD: Manually compiled promoter collection based primarily on nuclease protection and primer extension experiments published in the biological journal literature.
  • PRESTA: Automatically compiled promoter collection relying on author-submitted sequence feature annotations in EMBL sequence entries and confirmatory evidence from public EST sequences.
  • DBTSS (NEDO): Transcription start sites inferred from 5’ end sequences of full-length enriched cDNA libraries obtained with the oligo-capping method.
  • MGC: Transcription start sites inferred from 5’ end sequences of full-length enriched cDNA libraries from the Mammalian Gene Collection (MGC) program.
Promoter Elements and Sequence Properties used for the Evaluation of Different Promoter Sets

Locally over-represented sequence motifs:

  • TATA-box: site selector element, occurs around position –27, estimated frequency in human promoters: 64%.
  • Initiator: site selector element, presumably occurs exactly at initiation site, estimated frequency in human promoters: 50%.
  • CCAAT-box: upstream promoter element, occurs in a large upstream region with peak frequency at –80, estimated frequency in human promoters: 23%.
  • GC-box: upstream promoter element, occurs in a large upstream region with peak frequency at –50, estimated frequency in human promoters: 52%.

Other known sequence features:

  • CpG islands: regions of 200-1000 bp with a ratio of observed to expected CpG (CpGobs / CpGexp) > 0.6 and a C+G content > 50%, occurring around transcription initiation sites, estimated frequency based on promoters in EPD: 39%.
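
The CpG island criteria above can be turned into a small check. The sketch below assumes the commonly used definition of the expected CpG count, (number of C × number of G) / sequence length, which the slide does not spell out; the function names and test sequences are illustrative.

```python
def cpg_stats(seq):
    """Return (C+G fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    observed = seq.count("CG")
    # Assumed definition of the expected CpG count: (#C * #G) / length.
    expected = c_count * g_count / n if n else 0.0
    gc_fraction = (c_count + g_count) / n if n else 0.0
    ratio = observed / expected if expected else 0.0
    return gc_fraction, ratio


def meets_cpg_island_criteria(seq, min_len=200, min_ratio=0.6, min_gc=0.5):
    """Apply the thresholds from the slide: length >= 200 bp, obs/exp > 0.6, C+G > 50%."""
    gc_fraction, ratio = cpg_stats(seq)
    return len(seq) >= min_len and ratio > min_ratio and gc_fraction > min_gc


# Artificial sequences, chosen only to exercise the thresholds.
print(meets_cpg_island_criteria("CG" * 100))             # CpG-rich
print(meets_cpg_island_criteria("C" * 100 + "G" * 100))  # GC-rich but CpG-poor
```

The second example shows why both criteria are needed: a sequence can be rich in C and G while containing almost no CpG dinucleotides.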
In silico analysis of larger promoter sequence sets

The previous results have shown that in silico primer extension is accurate, perhaps even more accurate than conventional methods.

However:

Was data set size really the bottleneck in promoter analysis?

Have we already gained new insights into promoter structure from analyzing larger promoter sets defined by in silico primer extension?

A recent study of about 2000 Drosophila promoters may give a preliminary answer to this question.

The best conserved and most abundant Drosophila core promoter elements as found by Uwe Ohler and coworkers

In particular, the most significant and undoubtedly most frequent, most conserved, and thus probably most important Drosophila promoter element corresponds to the following motif:

30 years of very intensive and expensive wet lab molecular biology research have not uncovered that motif!

Back to Proteins:

What is the protein that binds to the most important promoter element of Drosophila?

Guesses from the audience may be sent to:

[email protected]
