Sequence features of DNA binding sites reveal structural class of associated transcription factor
Narlikar L and Hartemink AJ. Bioinformatics. 2006 Jan 15;22(2):157-63.
Carol Sniegoski
The Central Dogma of Molecular Biology
Double-stranded chain of nucleotide bases (A-T, C-G)
Single-stranded chain of nucleotide bases (A,U,C,G)
Primary protein structure
The order of amino acids
Secondary protein structure
Common repeating structures, often formed by hydrogen bonds
Tertiary protein structure
The full 3-dimensional folded structure
Quaternary protein structure
Proteins composed of multiple polypeptide chains
Activating the gene structure
Initiating transcription of mRNA from DNA
Processing the mRNA transcript
Transporting the processed transcript from nucleus to cytoplasm
Translating mRNA into protein
Controlling mRNA degradation
New RNA transcript
DNA double helix
Example DNA sequences
TFIIIA binds to a site within the promoter region
TFIIIC binds to form a stable complex
TFIIIB (with 3 subunits) now binds to its binding site near the startpoint of transcription
Finally RNA polymerase binds and begins transcribing the gene
Basal transcription complex
PSSM (Position-Specific Scoring Matrix)
A sequence S is scored by comparing two quantities:
Probability of seeing S in the motif
Probability of seeing S outside the motif
A 3 2 0 12 0 0 0 0 1 3
C 5 2 12 0 12 0 1 0 2 1
G 3 7 0 0 0 12 0 7 5 4
T 1 1 0 0 0 0 11 5 4 4
PSSM matrix built from an alignment of 12 binding sites of length 10 bp for yeast TF Pho4p
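As a rough illustration of how the Pho4p counts above could be turned into scores, the sketch below converts each column to log-odds values and scores a candidate site. The pseudocount of 1, the uniform background of 0.25 per base, the base-2 logarithm, and the function names `build_pssm` and `score` are all assumptions made for this example, not details taken from the slides or the paper.

```python
import math

# Counts from the Pho4p alignment above (12 sites, 10 positions).
COUNTS = {
    "A": [3, 2, 0, 12, 0, 0, 0, 0, 1, 3],
    "C": [5, 2, 12, 0, 12, 0, 1, 0, 2, 1],
    "G": [3, 7, 0, 0, 0, 12, 0, 7, 5, 4],
    "T": [1, 1, 0, 0, 0, 0, 11, 5, 4, 4],
}


def build_pssm(counts, pseudocount=1.0, background=0.25):
    """Return log2-odds scores per base and position (assumed scoring scheme)."""
    length = len(counts["A"])
    pssm = {base: [] for base in counts}
    for pos in range(length):
        # Column total including one pseudocount per base.
        total = sum(counts[b][pos] for b in counts) + 4 * pseudocount
        for base in counts:
            p_motif = (counts[base][pos] + pseudocount) / total
            pssm[base].append(math.log2(p_motif / background))
    return pssm


def score(pssm, seq):
    """Sum the per-position log-odds scores of a site of motif length."""
    return sum(pssm[base][pos] for pos, base in enumerate(seq))


pssm = build_pssm(COUNTS)
# Most frequent base in each column (ties go to the first base in A, C, G, T order).
consensus = "".join(max(COUNTS, key=lambda b: COUNTS[b][pos]) for pos in range(10))
```

A site that matches the frequent bases, such as `consensus`, receives a higher score than a site made of rarely observed bases.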
Goal: Predict the type of DNA-binding domain that a TF has based on features of the DNA sequences to which it binds.
TRANSFAC® is a database on eukaryotic cis-acting regulatory DNA elements and trans-acting factors. It covers the whole range from yeast to human. It started in 1988 as a printed compilation and was transferred into computer-readable format in 1990.
The FACTOR table contains 6133 entries in 50 classes, but this figure does not reflect the number of independent transcription factors. Homologous factors from different species such as human and mouse SRF are given different entries since they may differ in some molecular aspects. Factors originally described by different research groups as binding to different genes may turn out identical when cloned. Also, more factors are recognized as representatives of whole TF families that are products of distinct but similar genes or alternative splice products. We have in general not entered proteins just because of the presence of a putative DNA-binding motif. Thus there are many more zinc finger or homeo domain proteins known than are included in FACTOR, but for many no data about DNA-binding specificity or other gene regulatory features are available.
The SITE table gives information on individual (putatively) regulatory protein binding sites. It contains 7915 entries. 6360 of them refer to sites within 1504 eukaryotic genes. 1295 are artificial sequences. 260 have consensus binding sequences given in the IUPAC code.
1 Superclass: Basic Domains
*1.1 Class: Leucine zipper factors (bZIP). (IV)
*1.2 Class: Helix-loop-helix factors (bHLH). (III)
1.3 Class: Helix-loop-helix / leucine zipper factors (bHLH-ZIP).
1.4 Class: NF-1
1.5 Class: RF-X
1.6 Class: bHSH
2 Superclass: Zinc-coordinating DNA-binding domains
*2.1 Class: Cys4 zinc finger of nuclear receptor type. (II)
2.2 Class: diverse Cys4 zinc fingers.
*2.3 Class: Cys2His2 zinc finger domain. (I)
2.4 Class: Cys6 cysteine-zinc cluster.
2.5 Class: Zinc fingers of alternating composition
3 Superclass: Helix-turn-helix
*3.1 Class: Homeo domain. (VI)
3.2 Class: Paired box.
*3.3 Class: Fork head / winged helix. (V)
3.4 Class: Heat shock factors
3.5 Class: Tryptophan clusters.
3.6 Class: TEA domain.
4 Superclass: beta-Scaffold Factors with Minor Groove Contacts
4.1 Class: RHR (Rel homology region).
4.2 Class: STAT
4.3 Class: p53
4.4 Class: MADS box.
4.5 Class: beta-Barrel alpha-helix transcription factors
4.6 Class: TATA-binding proteins
Transcription Factor Classification
Last modified 2002-10-01
1 Superclass: Basic Domains
1.1 Class: Leucine zipper factors (bZIP).
1.1.1 Family: AP-1(-like) components
1.1.1.1 Subfamily: Jun
1.1.1.1.1 XBP-1 (human). 1.1.1.1.2 v-Jun (ASV). 1.1.1.1.3 c-Jun (mouse); c-Jun (rat); c-Jun (human); c-Jun (chick). 1.1.1.1.4 JunB (mouse). 1.1.1.1.5 JunD (mouse). 1.1.1.1.6 dJRA
1.1.1.2 Subfamily: Fos
1.1.1.2.1 v-Fos (FBR MuLV); v-Fos (FBJ MuLV); v-Fos (NK24). 1.1.1.2.2 c-Fos (mouse); c-Fos (human); c-Fos (rat); c-Fos (chick). 1.1.1.2.3 FosB (mouse).
1.1.1.2.3.1 FosB1 1.1.1.2.3.2 FosB2
1.1.1.2.4 Fra-1 (mouse); Fra-1 (rat). 1.1.1.2.5 Fra-2 (chick); Fra-2 (human).
Drilldown on 1.1 Class: Leucine zipper factors (bZIP) lists factors in the class:
CL basic region + leucine zipper; 1.1.
CC A DNA-binding basic region is followed by a leucine zipper. The leucine zipper consists of repeated leucine residues at every seventh position and mediates protein dimerization as a prerequisite for DNA-binding. The leucines are directed towards one side of an alpha-helix. The leucine side chains of two polypeptides are thought to interdigitate upon dimerization (knobs-into-holes model). The leucine zipper dictates dimerization specificity. Upon DNA-binding of the dimer, the basic regions adopt alpha-helical conformation as well. Possibly, a sharp angulation point separates two alpha-helices of the subregions A and B leading to the scissors grip model for the bZIP-DNA complex. The DNA is contacted through the major groove over a whole turn.
BF T03820 ABF1; Species: thale cress, Arabidopsis thaliana.
BF T03823 ABF2; Species: thale cress, Arabidopsis thaliana.
BF T03824 ABF3; Species: thale cress, Arabidopsis thaliana.
BF T03825 ABF4; Species: thale cress, Arabidopsis thaliana.
BF T04543 ABI5; Species: thale cress, Arabidopsis thaliana.
BF T04565 ACA1; Species: yeast, Saccharomyces cerevisiae.
BF T00027 AP-1; Species: clawed frog, Xenopus.
BF T00029 AP-1; Species: human, Homo sapiens.
BF T00030 AP-1; Species: monkey, Cercopithecus aethiops.
BF T00031 AP-1; Species: rat, Rattus norvegicus.
BF T00032 AP-1; Species: mouse, Mus musculus.
BF T03199 ARR1; Species: yeast, Saccharomyces cerevisiae.
BF T02783 ATB-2; Species: thale cress, Arabidopsis thaliana.
Drilldown on factor ABF1 lists the sequences to which it binds:
1364 integer features encoding subsequence frequency for subsequences up to length 5:
4^1 = 4 features for subsequences of length 1 (A, T, C, G)
4^2 = 16 for subsequences of length 2 (AA, AT, AC, AG, TA, TT, TC, TG, …)
4^3 = 64 for subsequences of length 3
4^4 = 256 for subsequences of length 4
4^5 = 1024 for subsequences of length 5
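The subsequence-frequency features can be sketched as a count of every k-mer for k = 1 to 5. The example below is a minimal illustration that assumes overlapping occurrences are counted on a single strand of an uppercase A/C/G/T sequence; the slides do not state these details. The function names and the 11-bp example site are hypothetical.

```python
from itertools import product

ALPHABET = "ATCG"


def kmer_feature_names(max_k=5):
    """List every subsequence of length 1..max_k over the DNA alphabet."""
    names = []
    for k in range(1, max_k + 1):
        names.extend("".join(p) for p in product(ALPHABET, repeat=k))
    return names


def kmer_counts(seq, max_k=5):
    """Count overlapping occurrences of each k-mer in seq (single strand)."""
    counts = dict.fromkeys(kmer_feature_names(max_k), 0)
    for k in range(1, max_k + 1):
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts


# Hypothetical 11-bp binding site, used only to exercise the function.
features = kmer_counts("GCCACGTGGGG")
```

Because a short site contains only a handful of distinct k-mers, most of the 1364 counts come out as zero.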
7 binary features encoding the presence or absence of a special sequence identified in the literature as over-represented in the binding sites of certain classes of TF.
G . . G Cys2His2 (I)
G . . G . . G Cys2His2 (I)
[GC] . . [GC] . . [GC] Cys2His2 (I)
AGGTCA | TGACCT Cys4 (II)
CA . . TG bHLH (III)
TGA .* TCA bZip (IV)
TAAT | ATTA Homeodomain (VI)
Regular expression representation:
. Any single character.
[ ] Any single character inside the brackets.
| Either the expression preceding or the expression following.
* Zero or more of the preceding expression.
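The seven special-sequence features listed above can be evaluated directly with regular expressions once the spaces in the slide notation are removed. The sketch below uses Python's `re.search` on a single strand, which is an assumption; the function name and the example sequences are hypothetical.

```python
import re

# Patterns from the list above, with the associated TF class for reference.
SPECIAL_PATTERNS = [
    ("G..G", "Cys2His2 (I)"),
    ("G..G..G", "Cys2His2 (I)"),
    ("[GC]..[GC]..[GC]", "Cys2His2 (I)"),
    ("AGGTCA|TGACCT", "Cys4 (II)"),
    ("CA..TG", "bHLH (III)"),
    ("TGA.*TCA", "bZip (IV)"),
    ("TAAT|ATTA", "Homeodomain (VI)"),
]


def special_features(seq):
    """Return 7 binary features: 1 if the pattern occurs anywhere in seq, else 0."""
    return [1 if re.search(pattern, seq) else 0 for pattern, _ in SPECIAL_PATTERNS]


# Hypothetical site containing CACGTG, which matches the bHLH pattern CA..TG.
example = special_features("GCCACGTGGGG")
```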
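Length 1 (4 features): A = 1, C = 3, G = 6, T = 1
Length 2 (16 features): 6 features = 1 or 2, 10 features = 0
Length 3 (64 features): 9 features = 1, 55 features = 0
Length 4 (256 features): 8 features = 1, 248 features = 0
Length 5 (1024 features): 7 features = 1, 1017 features = 0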
1 feature = 1
7 features = 0
8 features = 0
At least 1345 of the 1387 features for this binding sequence are zero-valued.
n = 587 columns, one for each TF
d = 1390 rows, one for each feature
1-of-m class encoding
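A 1-of-m class encoding represents the class of each TF as a vector with a single 1. The sketch below assumes m = 6, matching the six classes labeled (I) through (VI) in these slides, and uses zero-based class indices; the function name is hypothetical.

```python
def one_of_m(class_index, m=6):
    """Return a length-m vector with a 1 at class_index and 0 elsewhere."""
    if not 0 <= class_index < m:
        raise ValueError("class_index must be in range(m)")
    return [1 if i == class_index else 0 for i in range(m)]


# Example: a TF in the third class (index 2).
encoded = one_of_m(2)
```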
Sparse Multinomial Logistic Regression
Model/predict a dependent variable as a linear function of independent variables:
$y_i = b_1 x_{i1} + b_2 x_{i2} + \dots + b_n x_{in} + \varepsilon_i$
Find the best-fit line (i.e., estimate the $b_i$'s) by minimizing the sum of the squares of the vertical deviations from each data point to the line:
$R^2 = \sum_i \left[ y_i - f(x_i; b_1, b_2, \dots, b_n) \right]^2$
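For a single independent variable, the least-squares fit has a closed form. The sketch below is a minimal illustration with made-up data; it includes an intercept term, which is an assumption (it is equivalent to adding a constant feature to the model above), and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical deviations from the data points to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Made-up data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
```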
Used when dependent variable y is binary.
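The logit function of p is expressed as a linear combination of the $x_i$:
$\operatorname{logit}(p) = \log \frac{p}{1-p} = w_0 + w_1 x_1 + \dots + w_n x_n = w^T x$
$p = P(y = 1 \mid x, w) = \frac{e^{w^T x}}{1 + e^{w^T x}}$ = probability that x belongs to class y, given x and w
$w = [\, w_0 \; w_1 \; \dots \; w_n \,]^T$: single weight vector of length d
$x = [\, x_0 \; x_1 \; \dots \; x_n \,]^T$: d feature values for one sample (with $x_0 = 1$, so that $w_0$ acts as the intercept)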
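The relationship between the logit and the probability p can be checked numerically. In the sketch below, the function names and the example weights are hypothetical, and x is assumed to start with a constant 1 so that the first weight acts as the intercept.

```python
import math


def logistic_probability(w, x):
    """P(y = 1 | x, w) = e^(w.x) / (1 + e^(w.x)), written as 1 / (1 + e^(-w.x))."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


# Hypothetical weights and one sample; x[0] = 1 is the constant term.
w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 0.75]
p = logistic_probability(w, x)  # w.x = 0.5 - 2.0 + 1.5 = 0, so p = 0.5
```

Applying `logit` to the returned probability recovers the linear combination w.x, which is the defining property used above.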
Generalization of logistic regression.
Used when dependent variable y is multiclass.
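$P(y^{(i)} = 1 \mid x, w) = \frac{\exp(w^{(i)T} x)}{\sum_{k=1}^{m} \exp(w^{(k)T} x)}$ = probability that x belongs to the class encoded by $y^{(i)} = 1$, given w
$w = [\, w^{(1)T} \; w^{(2)T} \; \dots \; w^{(m)T} \,]^T$: weight vectors of length d for each of m classes
$x = [\, x_0 \; x_1 \; \dots \; x_d \,]^T$: d feature values for one sample
$y = [\, y^{(1)} \; y^{(2)} \; \dots \; y^{(m)} \,]^T$: one-of-m class encoding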
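In the multiclass case, each class has its own weight vector and the class probabilities are obtained by normalizing the exponentiated scores. The sketch below is a minimal pure-Python illustration; the function name and example weights are hypothetical, and subtracting the maximum score is a standard numerical-stability step that does not change the result.

```python
import math


def class_probabilities(weights, x):
    """Return P(y^(i) = 1 | x, w) for each class i, given one weight vector per class."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
    top = max(scores)  # shift scores so the largest exponent is 0
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weights for m = 3 classes and d = 2 features.
weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
probs = class_probabilities(weights, [2.0, 0.5])
```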
In logistic regression, w is usually estimated using maximum likelihood (ML).
Want to find w that maximizes the probability of classifying samples correctly.
P ( yj | xj , w ) = probability of classifying sample xj correctly, given the values of w.
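log-likelihood $l(w) = \sum_j \log P(y_j \mid x_j, w)$
$= \sum_j \log \frac{\exp(w_j^T x_j)}{\sum_{k=1}^{m} \exp(w^{(k)T} x_j)}$, where $w_j$ indicates the weight vector for the class to which $x_j$ belongs
$= \sum_j \left[ w_j^T x_j - \log \sum_{k=1}^{m} \exp(w^{(k)T} x_j) \right]$
$= \sum_j \left[ \sum_{i=1}^{m} y_j^{(i)} w^{(i)T} x_j - \log \sum_{i=1}^{m} \exp(w^{(i)T} x_j) \right]$, where $y_j^{(i)}$ is 1 only when $x_j$ is in class i, and 0 otherwise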
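The log-likelihood can be evaluated sample by sample as the score of the true class minus the log of the normalizing sum. The sketch below assumes one-of-m label vectors and one weight vector per class; the function name and toy data are hypothetical.

```python
import math


def log_likelihood(weights, samples, labels):
    """Sum over samples of [score of true class - log sum of exp(scores)]."""
    total = 0.0
    for x, y in zip(samples, labels):
        scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
        top = max(scores)
        # log-sum-exp, shifted by the maximum score for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        # y is one-of-m, so this picks out the score of the true class
        total += sum(yi * s for yi, s in zip(y, scores)) - log_norm
    return total


# Toy data: 2 samples, 2 features, m = 3 classes.
samples = [[1.0, 2.0], [1.0, -1.0]]
labels = [[1, 0, 0], [0, 0, 1]]
zero_weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
baseline = log_likelihood(zero_weights, samples, labels)
```

With all weights at zero every class is equally likely, so each sample contributes log(1/3) to the total.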
We want w to be sparse, with many zero values, deselecting many features.
Use the maximum a posteriori (MAP) method:
Penalize the ML estimate by placing a prior p(w) on the parameters w.
Choose a prior distribution that induces sparsity: the Laplace distribution.
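$w_{MAP} = \arg\max_w L(w) = \arg\max_w \left( l(w) + \log p(w) \right)$
l(w): sum of log-likelihoods of $x_i$ being classified correctly, given $x_i$ and w
p(w): probability that w comes from a Laplace distribution
Laplace density: $p(x) = \frac{1}{2b} e^{-|x - \mu| / b}$
Larger $|w_j|$ → smaller p(w) → very negative ln p(w)
Smaller $|w_j|$ → larger p(w) → less negative ln p(w)
ln p(w) is at its maximum when w = 0: there $e^{-|w|/b} = e^0 = 1$, so, ignoring the normalizing constant $\frac{1}{2b}$, p(w) = 1 and ln p(w) = 0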
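The MAP objective adds the log of the Laplace prior to the log-likelihood, which acts as a penalty on the absolute values of the weights. The sketch below assumes an independent zero-mean Laplace prior with scale b on every weight; that choice, the function names, and the toy data are assumptions for illustration only.

```python
import math


def log_likelihood(weights, samples, labels):
    """Multinomial logistic log-likelihood with one-of-m labels."""
    total = 0.0
    for x, y in zip(samples, labels):
        scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += sum(yi * s for yi, s in zip(y, scores)) - log_norm
    return total


def log_laplace_prior(weights, b=1.0):
    """ln p(w) for independent Laplace(mu = 0, scale = b) priors on each weight."""
    return sum(-abs(wi) / b - math.log(2.0 * b) for w in weights for wi in w)


def map_objective(weights, samples, labels, b=1.0):
    """Quantity maximized by the MAP estimate: l(w) + ln p(w)."""
    return log_likelihood(weights, samples, labels) + log_laplace_prior(weights, b)


# Toy data: 2 samples, 2 features, 2 classes.
samples = [[1.0, 2.0], [1.0, -1.0]]
labels = [[1, 0], [0, 1]]
dense = [[0.5, 1.0], [-0.5, -1.0]]
sparse = [[0.0, 1.0], [0.0, -1.0]]
```

With b = 0.5 the density at zero is exactly 1, so `log_laplace_prior` of an all-zero weight vector is 0, and any non-zero weight makes it negative.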
error rate × (TFs in class) = #TFs misclassified
.23 × 97 = 22.31
.09 × 97 = 8.73
.11 × 61 = 6.71
.08 × 165 = 13.2
.17 × 52 = 8.84
.15 × 115 = 17.25
Overall: 77.04 of 587 TFs misclassified, an error rate of about .13
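The arithmetic behind these figures can be reproduced directly: each per-class error rate is multiplied by the number of TFs in that class, and the products are summed and divided by the total of 587. The sketch below only restates the numbers listed above; the variable names are hypothetical and the classes are left unlabeled because the slide text does not name them.

```python
# (error rate, number of TFs in the class), in the order listed above.
RESULTS = [(0.23, 97), (0.09, 97), (0.11, 61), (0.08, 165), (0.17, 52), (0.15, 115)]

misclassified = [rate * size for rate, size in RESULTS]
total_tfs = sum(size for _, size in RESULTS)
total_misclassified = sum(misclassified)
overall_rate = total_misclassified / total_tfs
```

The overall rate is the size-weighted average of the per-class rates, which is why it rounds to .13 even though .13 × 587 is not exactly 77.04.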