STOCHASTIC CONTEXT FREE GRAMMAR
PARSING & USE
OUTLINE
- Introduction to Stochastic Context Free Grammar (SCFG)
- Parsing of SCFG
- Use for RNA secondary structure prediction
SCFG
Chomsky hierarchy:
CONTEXT FREE GRAMMAR
It's a triple G=(∑,V,R) where:
∑ = set of terminal symbols (alphabet)
V = set of nonterminal symbols
R = set of production rules
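A string $w_2$ can be derived from another string $w_1$ ($w_1 \Rightarrow w_2$) if:
$w_1 = xAz$ and $w_2 = xyz$, where $x$, $y$, $z$ are strings of terminal and nonterminal symbols and $A$ is a nonterminal symbol,
and the production $A \to y$ is a production of the grammar.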
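A single derivation step can be sketched in Python as follows. The toy grammar (S --> a S u, S --> g) and the helper name `derive_step` are assumptions made for illustration only; strings are represented as tuples of symbols.

```python
def derive_step(sentential, position, production):
    """Rewrite the nonterminal at `position` using production (lhs, rhs)."""
    lhs, rhs = production
    if sentential[position] != lhs:
        raise ValueError("production does not match the symbol at this position")
    # Keep everything left and right of the rewritten symbol unchanged.
    return sentential[:position] + rhs + sentential[position + 1:]


# Hypothetical productions of a toy grammar.
rule_wrap = ("S", ("a", "S", "u"))
rule_end = ("S", ("g",))

step1 = derive_step(("S",), 0, rule_wrap)   # S ==> a S u
step2 = derive_step(step1, 1, rule_end)     # a S u ==> a g u
print(step1, step2)
```

Each call is one application of a production, so a derivation is simply a chain of such calls starting from the start symbol.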
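A Stochastic Context Free Grammar is a quadruple G=(∑,V,R,P), where P assigns a probability to every production in R and the probabilities of all productions with the same left-hand side sum to 1.
Def.: Let G=(∑,V,R,P) be a SCFG and $d = S \Rightarrow w_1 \Rightarrow w_2 \Rightarrow \dots \Rightarrow w_n$ a derivation sequence,
where each $w_i$ is a string of terminal and nonterminal symbols and $r_i$ is the production applied at step $i$. The probability of the derivation d is:
$P(d) = \prod_{i=1}^{n} P(r_i)$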
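As a small worked example, the probability of a derivation can be computed as the product of the probabilities of the productions it uses. The grammar and its probabilities below are invented for illustration; only the product rule itself is the point.

```python
from math import prod

# Hypothetical toy SCFG: the probabilities of all rules for S sum to 1.
rule_prob = {
    ("S", ("a", "S", "u")): 0.3,
    ("S", ("g", "S", "c")): 0.3,
    ("S", ("g",)): 0.4,
}


def derivation_probability(rules_used):
    """Multiply the probabilities of the productions applied in a derivation."""
    return prod(rule_prob[rule] for rule in rules_used)


# Derivation S ==> a S u ==> a g S c u ==> a g g c u
d = [("S", ("a", "S", "u")), ("S", ("g", "S", "c")), ("S", ("g",))]
p = derivation_probability(d)
print(p)  # 0.3 * 0.3 * 0.4
```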
Productions are of the form $A \to BC$ or $A \to \alpha$, where:
B and C are nonterminal symbols
α is a single terminal symbol
Given a sequence and a grammar, which is the best parse tree that generates the sequence, that is, which parse tree has the highest probability?
Table P[n,n,M] = 0 // set all values in table to 0; M = number of nonterminal symbols
Triples G[n,n,M] = triples of (position,nonterminal1,nonterminal2) // traceback pointers
for j = 1 to n do
    for all unit productions of type V --> α do
        if s[j] == α then
            set P[j,1,V] = Pv(V --> α) // the probability of the production
            set G[j,1,V] = new Triple(0,0,0) // indicates no further traceback, i.e. a leaf node
// i is the length of the span, j the start and k where to split into two subspans
for i = 2 to n do
    for j = 1 to n-i+1 do
        for k = 1 to i-1 do
            for all productions of type V --> XY do
                set newprob = P[j,k,X] * P[j+k,i-k,Y] * Pv(V --> XY)
                if newprob > P[j,i,V] then
                    set P[j,i,V] = newprob
                    set G[j,i,V] = new Triple(k,X,Y) // new traceback point
P[1,n,S], where S is the start symbol of the grammar, holds the probability of the most likely parse.
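The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it uses 0-based positions, dictionaries instead of dense tables, and a made-up grammar for the language a^n b^n; the function names `viterbi_cyk` and `build_tree` are assumptions.

```python
def viterbi_cyk(s, unary, binary, start="S"):
    """Return (probability of the best parse, traceback table)."""
    n = len(s)
    best = {}   # best[(j, i, V)]: best probability of deriving s[j:j+i] from V
    back = {}   # back[(j, i, V)]: None for a leaf, else (k, X, Y)
    for j, symbol in enumerate(s):
        for (V, terminal), p in unary.items():
            if terminal == symbol and p > best.get((j, 1, V), 0.0):
                best[(j, 1, V)] = p
                back[(j, 1, V)] = None
    for i in range(2, n + 1):               # span length
        for j in range(0, n - i + 1):       # span start
            for k in range(1, i):           # length of the left subspan
                for (V, X, Y), p in binary.items():
                    newprob = (best.get((j, k, X), 0.0)
                               * best.get((j + k, i - k, Y), 0.0) * p)
                    if newprob > best.get((j, i, V), 0.0):
                        best[(j, i, V)] = newprob
                        back[(j, i, V)] = (k, X, Y)
    return best.get((0, n, start), 0.0), back


def build_tree(s, back, j, i, V):
    """Follow the traceback pointers to rebuild the most likely parse tree."""
    entry = back[(j, i, V)]
    if entry is None:
        return (V, s[j])
    k, X, Y = entry
    return (V, build_tree(s, back, j, k, X),
            build_tree(s, back, j + k, i - k, Y))


# Hypothetical grammar in Chomsky normal form.
unary = {("A", "a"): 1.0, ("B", "b"): 1.0}
binary = {("S", "A", "B"): 0.6, ("S", "A", "C"): 0.4, ("C", "S", "B"): 1.0}

prob, back = viterbi_cyk("aabb", unary, binary)
tree = build_tree("aabb", back, 0, 4, "S")
print(prob, tree)
```

Using dictionaries keeps the sketch short; a dense array indexed by (start, length, nonterminal) matches the pseudocode more literally.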
n = length of the input string
M = number of nonterminal symbols
T = number of production rules of the type V --> XY
The algorithm takes O(n^3 · T) time and O(n^2 · M) space.
Stem & loop
Sequence analysis of RNA is more difficult than that of DNA and proteins
- Prediction of RNA secondary structure for a single sequence?
Analogy with SCFG
Calculate the most likely “parse tree” that derives a string
RNA secondary structure
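The analogy can be illustrated with a deliberately tiny sketch: the most likely parse of an RNA string under a toy SCFG is read off as a stem-loop in dot-bracket notation. The grammar (S --> x S x' for a Watson-Crick pair, S --> x S for an unpaired base, S --> x to end) and all probabilities are assumptions chosen for illustration, not a trained RNA model, and the grammar cannot express branched structures.

```python
from functools import lru_cache

# Hypothetical rule probabilities: 4 pair rules, 4 unpaired rules and
# 4 end rules, which sum to 4*0.15 + 4*0.05 + 4*0.05 = 1.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
P_PAIR, P_LEFT, P_END = 0.15, 0.05, 0.05


def most_likely_structure(seq):
    """Return (probability, dot-bracket string) of the best parse of seq."""
    if not seq:
        raise ValueError("sequence must not be empty")

    @lru_cache(maxsize=None)
    def best(i, j):
        # Best parse of seq[i..j] (inclusive) from the single nonterminal S.
        if i == j:
            return P_END, "."                       # S --> x
        prob, struct = best(i + 1, j)
        result = (P_LEFT * prob, "." + struct)      # S --> x S
        if j - i >= 2 and (seq[i], seq[j]) in PAIRS:
            prob, struct = best(i + 1, j - 1)
            candidate = (P_PAIR * prob, "(" + struct + ")")  # S --> x S x'
            if candidate[0] > result[0]:
                result = candidate
        return result

    return best(0, len(seq) - 1)


prob, structure = most_likely_structure("GGGAAACCC")
print(structure)
```

Each paired production contributes one matching "(" and ")" and each unpaired production one ".", so the parse tree and the secondary structure carry the same information.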