Comparative Analysis of Promoter Sequences: The discovery of the Pribnow-box and some follow-up discoveries. Philipp Bucher In Silico Analysis of Proteins Celebrating the 20th Anniversary of Swiss-Prot Fortaleza – Brazil, Aug 3 2006.
No. Promoters are not DNA sequences.
Any general representation of promoters, or algorithm to predict promoters, does not relate to intrinsic properties of DNA.
In fact, a profile or hidden Markov model representing promoter sequences constitutes a description of the DNA-binding surfaces of a protein in terms of base pair preferences.
Not surprisingly, therefore, the first consensus sequence for an E. coli promoter element was derived from seven sequences originating from six different species, including a eukaryotic virus.
FIG. 4. Comparison of promoter sequences (see text). b, Homologous sequence probably engaged by RNA polymerase; i, mRNA initiation point (underlined). Hyphens have been omitted. SV40, simian virus 40; w.t., wild type.
Among the promoter sequences, there is a homologous, 7-base sequence lying to the left of the initiation points. I feel that the DNA sequence
5' T-A-T-Pu-A-T-G 3'
3' A-T-A-Py-T-A-C 5'
is implicated in the formation of a tight binary complex with RNA polymerase.
Text and Figures from: Pribnow (1975) Proc. Nat. Acad. Sci. USA 72, 784-788.
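The 7-base consensus quoted above can be turned into a simple pattern search. The following is a minimal sketch, assuming that "Pu" stands for either purine (A or G) and that only exact matches on the given strand are of interest; the function name and the test sequence are hypothetical and are not taken from Pribnow's paper.

```python
import re

# Assumption: Pu (purine) = A or G, so T-A-T-Pu-A-T-G becomes TAT[AG]ATG.
# The lookahead lets overlapping matches be reported as well.
PRIBNOW_PATTERN = re.compile(r"(?=(TAT[AG]ATG))")


def find_pribnow_matches(sequence):
    """Return (0-based start, matched heptamer) for every exact consensus match."""
    sequence = sequence.upper()
    return [(m.start(), m.group(1)) for m in PRIBNOW_PATTERN.finditer(sequence)]


# Hypothetical sequence containing both purine variants of the consensus.
example = "CCGGTATAATGCCTTATGATGAA"
print(find_pribnow_matches(example))
```

An exact consensus search like this misses promoters that deviate from the consensus at even one position, which is one motivation for weighted profiles.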
A second sequence motif, located about 35 bp upstream of the initiation site (the -35 element), was discovered based on a larger promoter sequence collection.
The figure below illustrates the concept of functional homology between two promoter sequences. In particular, these footprint results confirm that the -35 and -10 elements are correctly assigned even though the spacing between the two elements is different (Siebenlist et al. 1980, Cell 20, 269-281).
The program TargSearch implements an early sequence profile method using position-specific residue weights and scores for alternative spacer lengths.
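To illustrate what "position-specific residue weights and scores for alternative spacer lengths" can look like in practice, here is a minimal sketch. It is not the TargSearch implementation: the weights (+2 for a consensus base, -1 otherwise), the textbook consensus hexamers TTGACA and TATAAT, and the spacer penalties are all assumptions chosen for illustration.

```python
# Hypothetical weights: NOT the parameters used by TargSearch.
def consensus_weights(consensus, match=2.0, mismatch=-1.0):
    """Build one weight column per position from a consensus string."""
    return [{b: (match if b == c else mismatch) for b in "ACGT"} for c in consensus]


MINUS_35 = consensus_weights("TTGACA")   # assumed -35 element
MINUS_10 = consensus_weights("TATAAT")   # assumed -10 element
SPACER_SCORES = {16: -1.0, 17: 0.0, 18: -1.0}  # assumed spacer-length scores


def score_element(weights, segment):
    """Sum the position-specific weights; unknown bases get the mismatch weight."""
    return sum(col.get(base, -1.0) for col, base in zip(weights, segment))


def best_promoter_match(sequence):
    """Return (score, start of -35 element, spacer length) of the best match, or None."""
    sequence = sequence.upper()
    n35, n10 = len(MINUS_35), len(MINUS_10)
    best = None
    for start in range(len(sequence)):
        for spacer, spacer_score in SPACER_SCORES.items():
            start10 = start + n35 + spacer
            end = start10 + n10
            if end > len(sequence):
                continue  # this spacer length does not fit at this start
            score = (score_element(MINUS_35, sequence[start:start + n35])
                     + spacer_score
                     + score_element(MINUS_10, sequence[start10:end]))
            if best is None or score > best[0]:
                best = (score, start, spacer)
    return best


print(best_promoter_match("GG" + "TTGACA" + "A" * 17 + "TATAAT" + "CC"))
```

Scoring the spacer length separately lets two promoters with different spacing between the elements both be recognized, with a preference for the assumed optimal distance.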
Promoter definition: An experimentally mapped transcription initiation site.
Important assumption: A capped 5’ end of a eukaryotic mRNA is generated by transcriptional initiation, not by endonucleolytic cleavage.
Primary data: (i) RNA sequencing, nuclease protection, and primer extension data published in journal articles, (ii) 5’ ESTs from cDNA clones obtained with the oligo-capping method (only recently).
Purpose: (i) Comparative analysis of promoter elements, (ii) training and test set for promoter prediction algorithms, (iii) resource for experimental researchers.
The definition of a locally over-represented sequence motif has three components:
The weight matrix or consensus sequence allows one to compute a match score for any subsequence of a promoter that has the same length as the matrix.
The cut-off value determines which subsequence constitutes a motif match.
The preferred region is the third criterion necessary to decide whether a given promoter contains a given locally over-represented sequence motif or not.
The difference in occurrence frequency inside and outside of the preferred region can be used as an objective function to optimize the three components of a locally over-represented sequence motif listed above.
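The three components and the objective function described above can be sketched as follows. This is a simplified stand-in, not the procedure of the cited paper: the per-position frequency difference used as the objective, the toy weight matrix, and all function names are assumptions made for illustration. Promoters are given as (sequence, index of the transcription start site within the sequence).

```python
def match_positions(sequence, tss_index, matrix, cutoff):
    """Yield motif start positions (relative to the TSS) whose score reaches the cut-off."""
    width = len(matrix)
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        score = sum(col.get(base, 0.0) for col, base in zip(matrix, window))
        if score >= cutoff:
            yield i - tss_index


def contains_motif(sequence, tss_index, matrix, cutoff, region):
    """True if at least one motif match starts inside the preferred region."""
    first, last = region
    return any(first <= p <= last
               for p in match_positions(sequence, tss_index, matrix, cutoff))


def enrichment(promoters, matrix, cutoff, region):
    """Match frequency per position inside the preferred region minus outside it."""
    first, last = region
    width = len(matrix)
    inside_hits = inside_sites = outside_hits = outside_sites = 0
    for sequence, tss_index in promoters:
        hits = set(match_positions(sequence, tss_index, matrix, cutoff))
        for i in range(len(sequence) - width + 1):
            p = i - tss_index
            if first <= p <= last:
                inside_sites += 1
                inside_hits += p in hits
            else:
                outside_sites += 1
                outside_hits += p in hits
    inside = inside_hits / inside_sites if inside_sites else 0.0
    outside = outside_hits / outside_sites if outside_sites else 0.0
    return inside - outside


# Toy matrix (hypothetical): 1.0 for each base of "TATA", 0.0 otherwise.
TOY_MATRIX = [{"T": 1.0}, {"A": 1.0}, {"T": 1.0}, {"A": 1.0}]
toy_promoters = [("GGGGTATAGGGG", 8)]
print(contains_motif("GGGGTATAGGGG", 8, TOY_MATRIX, 4.0, (-5, -3)))
print(enrichment(toy_promoters, TOY_MATRIX, 4.0, (-5, -3)))
```

An optimization loop would vary the matrix, the cut-off, and the region boundaries and keep the combination with the highest enrichment value.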
See also: Bucher 1990, J. Mol. Biol. 212, 563-578.
Benchmark results from Fickett & Hatzigeorgiou 1997, Genome Res. 7, 861-878
Note: The false/random discovery rates (about 1 in 1 kb) are about 2 orders of magnitude too high if one assumes one promoter per 100 kb for the human genome (perhaps an underestimation).
At this unacceptably high false discovery rate the sensitivity barely exceeds 50% for most of the programs.
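The consequence of these numbers can be worked out with a short calculation. The sketch below uses only the figures quoted above (about one false prediction per kb, an assumed one promoter per 100 kb, and a sensitivity of about 50%); the function name and the choice of a 100 kb window are illustrative assumptions.

```python
def expected_predictions(region_kb, promoters_per_kb, false_positives_per_kb, sensitivity):
    """Return expected (true positives, false positives, fraction of predictions that are real)."""
    true_pos = region_kb * promoters_per_kb * sensitivity
    false_pos = region_kb * false_positives_per_kb
    total = true_pos + false_pos
    fraction_real = true_pos / total if total else 0.0
    return true_pos, false_pos, fraction_real


# 100 kb of genome, one promoter per 100 kb, one false hit per kb, 50% sensitivity.
tp, fp, fraction_real = expected_predictions(100, 1 / 100, 1.0, 0.5)
print(f"true positives: {tp:.1f}, false positives: {fp:.0f}, "
      f"fraction real: {fraction_real:.4f}")
```

Under these assumptions, roughly 0.5 true promoters are recovered for every 100 false predictions, so well under 1% of the predictions would be real.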
Introduction of high-throughput technologies for cDNA (mRNA) 5’ end sequencing. Recent papers:
Oligo-capping technique: Suzuki et al. (2001) Identification and Characterization of the Potential Promoter Regions of 1031 kinds of human genes. Genome Res. 11:677-684.
CAGE: Carninci et al. (2006) Genome-wide analysis of mammalian promoter architecture and evolution. Nat. Genet. doi:10.1038/ng1789.
Close to one million 5’ tags of human transcripts have been analyzed with these techniques.
Processing of cDNA 5’ tags has tripled the number of promoter entries in EPD in less than two years.
We have coined the term “in silico primer extension” to designate the process of mapping transcription start sites (TSS) with cDNA 5’ tag data.
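The idea of in silico primer extension can be sketched in a few lines: each cDNA 5’ tag is located in a genomic region, and the positions of the tag 5’ ends are counted to give an mRNA 5’ end profile. This is a toy illustration, not the pipeline used for EPD: exact string matching stands in for a real sequence alignment method, only the forward strand is searched, and discarding tags that map to more than one position is an assumption.

```python
from collections import Counter


def five_prime_end_profile(genome_region, tags):
    """Count, per genomic position, how many uniquely mapping tags start there."""
    genome_region = genome_region.upper()
    profile = Counter()
    for tag in tags:
        tag = tag.upper()
        if not tag:
            continue
        # Collect every exact occurrence of the tag (forward strand only).
        hits = []
        pos = genome_region.find(tag)
        while pos != -1:
            hits.append(pos)
            pos = genome_region.find(tag, pos + 1)
        # Assumption: tags with zero or several hits are ignored.
        if len(hits) == 1:
            profile[hits[0]] += 1
    return profile


def most_frequent_start(profile):
    """Return the position supported by the most tag 5' ends, or None if empty."""
    return max(profile, key=profile.get) if profile else None


# Hypothetical genomic region and tags.
region = "AAACCCGGGTTTACGTACGA"
tags = ["CCCGGG", "CCCGGGT", "GGGTTT", "ACG", "TTTTTT"]
profile = five_prime_end_profile(region, tags)
print(dict(profile), most_frequent_start(profile))
```

The resulting profile plays the role of the band pattern in a conventional primer extension experiment: positions supported by many tag 5’ ends are candidate transcription start sites.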
Figure (labels only): in silico primer extension compared with a conventional primer extension experiment using a gene-specific primer. Labels: 5’ EST entries; cDNA 5’ tag (50 nt); genome sequence (2 kb); sequence alignment method; mRNA 5’ end profile. Axes: number of 5’ ends of NEDO transcripts; number of 5’ ends of DBTSS transcripts.
Characterization of three optional promoters in the 5' region of the human aldolase A gene.
Maire P. et al. (1987) J. Mol. Biol. 197, 425-438
Goal of the project: to compare four different compilations of promoters (transcription start sites):
Locally over-represented sequence motifs:
Other known sequence features:
The previous results have shown that in silico primer extension is accurate, perhaps even more accurate than conventional methods.
Was data set size really the bottleneck in promoter analysis?
Have we already gained new insights into promoter structure from analyzing larger promoter sets defined by in silico primer extension?
A recent study of about 2000 Drosophila promoters may give a preliminary answer to this question.
In particular, the most significant and undoubtedly most frequent, most conserved, and thus probably most important Drosophila promoter element corresponds to the following motif:
30 years of very intensive and expensive wet lab molecular biology research have not uncovered that motif!!!
What is the protein that binds to the most important promoter element of Drosophila?
Guesses from the audience may be sent to: