Understanding Context-Free Grammars and Language Recognition

Languages • A Language L over a finite alphabet  is a set of strings of characters from the alphabet. • Examples are : • L = { s | s is a grammatical English sentence} and  = { w | w is a word in the dictionary} (consisting of over a hundred thousand words, each treated as an atomic symbol). • L = { s | s is a string of digits} and  = { 0,1,2,3,4,5,6,7,8,9} • L = { P | P is a string of ASCII characters forming a Java program} and  = the printable ASCII character set.

Grammars A context free grammar G is a 4-tuple : G = ( V,,P,S ) where 1.V is a set of nonterminals (or string variables), each representing a sublanguage from which the variable takes its values. Examples are <noun phrase> which can take on values such as “the big box” and T which can take on string values used to represent products in an algebraic expression. 2. is a finite alphabet. Examples are the English vocabulary (consisting of over a hundred thousand words, each treated as an atomic symbol). Another example is the printable ASCII character set. The binary alphabet consists of {0,1}. The alphabet contains the symbols from which language strings are formed. 3.P is a finite set of productions or rules used to define the sublanguages represented by the nonterminals. In a context free grammar, a rule has the format A  X where A  V and X  ( V  )* . The interpretation is that the strings in the sublanguage represented by A can be constructed according to the format indicated by X. For a terminal character in X, the terminal character is used in the A string and for a variable in X, a string in the sublanguage is substituted for the variable. Examples are <noun phrase>  <determiner> <adj-list> <noun> and T a * T. 4.S is a designated variable (referred to as the start symbol or the head of the language). It represents the language being defined by the grammar G.

Grammars and Derivations Derivations If u,v are strings in ( V  )* , A is in V and A  X is in P, then uAv  uXv , referred to as uAv “derives” uXv by application of the rule A  X. For repeated applications of 0 or more rules, the symbol * is used. Language Definition The language L(G) defined by G is { x | x *, S * x }

Parsing • Given a Grammar G with distinguished nonterminal S and a string X over the alphabet, does S * X? • Parsing attempts to find a sequence of rules by which • S * X

Parse tree for d d . d d d I d I d I • D d D d D d Grammar for Decimal Numbers I  d I I  d I  • D D  d D D  d A parse tree has intermediate nodes for nonterminals, a child node for each RHS character in the production used to replace the nonterminal, a leaf node for each character in the language string produced by the derivation. The language is the set of strings for which there exist parse trees.

A Grammar for Sentences

A Grammar for Sentences Alphabet or Vocabulary

Top down Left to Right Parse Repeat Select a rule to replace the leftmost nonterminal whose right hand side will ultimately generate a prefix of the remaining source.

Top down Left to Right Parse Leftmost character of the sentential form is <Det>. Select the rule <Determiner>  [the] and click <Det> to “expand”.

Top down Left to Right Parse

Lexemes and Tokens • A lexeme is a string of terminal characters belonging to some lexical class such as adjective, determiner, noun, etc. Examples are : • “young” – adjective - a • “the” - determiner • “woman” - noun • A token with a syntactic or lexical code. Examples are : • <“young”,a> • <“the”,d> • <“woman”,n>

Finite state automata and language recognition d I d S · · F D d d Finite state automaton has  = {d,•} , start state S and legal final states I and D. The transition function is represented by above diagram or table below: d • S I F I I D F D D D - Accepts : ddd, d.dd, .ddd Rejects d.dd.d

Top down Left to Right Parse • LL(1) Parsing: • Start with the nonterminal representing the language as the unmatched sentential form • Repeat until source string has been generated or until failure • Let X be the leftmost character • If X is terminal it must the first character of the remaining source (otherwise failure) • If X is nonterminal then the rules for X must not overlap as far as the 1st character generated by a rule. • Select the rule which generates (in 1 step or more) a 1st character matching the next source character and apply this rule.

Example Parse LL(1) parse table • S  NvNP • P   • P  pN • N  dAn • A  aA • A   Grammar

LL(1) Parsing FIRST: Define First(X) as the set of characters which can begin a string derived from grammar symbol X Follow: Define Follow(X) as the set of characters which can follow grammar symbol X in a string derived from the start symbol S First: If X is a terminal then First(X) = {X} If X is a nonterminal and X → λ then add λ to First(X) If X → X1X2..XkXk+1..Xn with λ in First(Xi) , 1 <= i <= k, then add First(Xk+1) to First(X) and if λ in First(Xi) , 1 <= i <= n, add λ to First(X) Follow: $ is in Follow(S) If A → αBβ with β <> λ, then add First(β) – { λ} to Follow(B) If A → αB or A → αBβ with λ in First(β), then add Follow(A) to Follow(B) LL(1) parse table Let T be a table with rows for nonterminals and columns for terminals. If Ri A → α and t in First(α) then enter i in T(A,t). If Ri A → α and λ in First(α) and t in Follow(A) then enter i in T(A,t).

LL(1) Parsing – Computation of First & Follow

Example of First & Follow LL(1) parse table

LL(1) parse – Example 2 : the dog bit the young boy in the leg  dnvdanpdn (tokens generated by lexical analyzer)

Understanding Context-Free Grammars and Language Recognition

Understanding Context-Free Grammars and Language Recognition

Presentation Transcript

LANGUAGES

Languages

LANGUAGES

Languages

Languages .

Languages

Languages

Languages

Languages

Languages

Intonation languages and tone languages

Languages

Languages

Languages

Languages

Languages