CS153 Intro to Lexing & Parsing
Notes
• Problem Set 0 due Friday at 5pm.
• See Lucas or me if you need help!
• I will be having office hours from 5:45-7:15 on Thu night in the Science Center.
• Reading:
  • Relevant chapters on Lexing & Parsing in Appel
  • OCamlLex and OCamlYacc documentation
  • “Monadic Parsing in Haskell” by G. Hutton and E. Meijer.
Parsing
• Two pieces conceptually:
  • Recognizing syntactically valid phrases.
  • Extracting semantic content from the syntax.
    • E.g., What is the subject of the sentence?
    • E.g., What is the verb phrase?
    • E.g., Is the syntax ambiguous? If so, which meaning do we take?
      • “Fruit flies like a banana”
      • “2 * 3 + 4”
      • “x ^ f y”
• In practice, solve both problems at the same time.
Specifying Syntax
We use grammars to specify the syntax of a language.
exp ::= int | var | exp ‘+’ exp | exp ‘*’ exp | ‘let’ var ‘=’ exp ‘in’ exp ‘end’
int ::= ‘-’? digit+
var ::= alpha (alpha | digit)*
digit ::= ‘0’ | ‘1’ | ‘2’ | ‘3’ | ‘4’ | … | ‘9’
alpha ::= [a-zA-Z]
Naïve Matching
To see if a sentence is legal, start with the first (start) non-terminal, and keep expanding non-terminals until you can match against the sentence.
N ::= ‘a’ | ‘(’ N ‘)’
Sentence: “((a))”
N → ‘(’ N ‘)’ → ‘(’ ‘(’ N ‘)’ ‘)’ → ‘(’ ‘(’ ‘a’ ‘)’ ‘)’ = “((a))”
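The expansion search described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names GRAMMAR and derives are made up here, a sentential form is represented as a list of symbols, and the length-based pruning is sound only because no production in this grammar has an empty right-hand side.

```python
from collections import deque

# Grammar from the slide: N ::= 'a' | '(' N ')'
# Non-terminals are the keys of GRAMMAR; every other symbol is a terminal.
GRAMMAR = {"N": [["a"], ["(", "N", ")"]]}

def derives(start, sentence):
    """Breadth-first search over sentential forms, always expanding the
    leftmost non-terminal. Forms longer than the sentence are discarded,
    which is safe here because no production shrinks a form."""
    target = list(sentence)
    queue = deque([[start]])
    seen = set()
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        key = tuple(form)
        if key in seen or len(form) > len(target):
            continue
        seen.add(key)
        # Locate the leftmost non-terminal, if there is one.
        idx = next((i for i, sym in enumerate(form) if sym in GRAMMAR), None)
        if idx is None:
            continue  # only terminals left, and they are not the sentence
        if form[:idx] != target[:idx]:
            continue  # the terminals already produced disagree with the sentence
        for rhs in GRAMMAR[form[idx]]:
            queue.append(form[:idx] + rhs + form[idx + 1:])
    return False
```

For example, derives("N", "((a))") follows the derivation shown on the slide, while derives("N", "((a)") runs out of candidate forms and reports failure.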
Alternatively
Start with the sentence, and replace phrases with corresponding non-terminals, repeating until you derive the start non-terminal.
N ::= ‘a’ | ‘(’ N ‘)’
Sentence: “((a))”
‘(’ ‘(’ ‘a’ ‘)’ ‘)’ → ‘(’ ‘(’ N ‘)’ ‘)’ → ‘(’ N ‘)’ → N
Highly Non-Deterministic
• For real grammars, automating this non-deterministic search is non-trivial.
  • As we’ll see, naïve implementations must do a lot of back-tracking in the search.
• Ideally, given a grammar, we would like an efficient, deterministic algorithm to see if a string matches it.
  • There is a very general cubic time algorithm.
    • Only linguists use it.
    • (In part, we don’t because recognition is only half the problem.)
• Certain classes of grammars have much more efficient implementations.
  • Essentially linear time with constant state (DFAs).
  • Or linear time with stack-like state (Pushdown Automata).
Tools in your Toolbox
• Manual parsing (say, recursive descent).
  • Tedious, error prone, hard to maintain.
  • But fast & good error messages.
• Parsing combinators
  • Encode grammars as (higher-order) functions.
  • Basically, functions that generate recursive-descent parsers.
  • Makes it easy to write & maintain grammars.
  • But can do a lot of back-tracking, and requires a limited form of grammar (e.g., no left-recursion).
• Lex and Yacc
  • Domain-specific languages that generate very efficient, table-driven parsers for general classes of grammars.
  • Learn about the theory in 121.
  • Need to know a bit here to understand how to effectively use these tools.
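To make the parsing-combinator idea concrete, here is a minimal Python sketch in the "list of successes" style. All names (char, seq, alt, lazy, n, accepts) are hypothetical and chosen for this illustration; the course tools themselves are OCaml-based, so treat this as a picture of the idea and not of any particular library.

```python
# A parser is a function (text, position) -> list of (value, new_position).
# An empty list means failure; several entries mean several possible parses.

def char(c):
    """Parser that accepts exactly the character c."""
    def parse(s, i):
        return [(c, i + 1)] if i < len(s) and s[i] == c else []
    return parse

def seq(*parsers):
    """Run the parsers one after another, collecting their values in a list."""
    def parse(s, i):
        results = [([], i)]
        for p in parsers:
            results = [(vals + [v], k) for vals, j in results for v, k in p(s, j)]
        return results
    return parse

def alt(*parsers):
    """Try every alternative at the same position (the source of back-tracking)."""
    def parse(s, i):
        return [r for p in parsers for r in p(s, i)]
    return parse

def lazy(thunk):
    """Delay construction of a parser so a grammar rule can refer to itself."""
    def parse(s, i):
        return thunk()(s, i)
    return parse

def n():
    """N ::= 'a' | '(' N ')'"""
    return alt(char("a"), seq(char("("), lazy(n), char(")")))

def accepts(parser, s):
    """True if some parse consumes the whole input."""
    return any(j == len(s) for _, j in parser(s, 0))
```

With these definitions, accepts(n(), "((a))") succeeds. A left-recursive rule such as N ::= N ‘+’ N written the same way would call itself at the same input position forever, which is the restriction the slide mentions.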
CS153 Regular Expressions & Finite-State Automata
Regular Expressions
• Non-recursive grammars
  • ∅ (matches no string)
  • ε (epsilon: matches the empty string)
  • Literals (‘a’, ‘b’, ‘2’, ‘+’, etc.) drawn from alphabet
  • Concatenation (R1 R2)
  • Alternation (R1 | R2)
  • Kleene star (R*)
Formally
• A regular expression denotes a set of strings:
  • [[∅]] = { }
  • [[ε]] = { “” }
  • [[‘a’]] = { “a” }
  • [[R1 R2]] = { s | s = s1 ^ s2 & s1 in [[R1]] & s2 in [[R2]] }
  • [[R1 | R2]] = { s | s in [[R1]] or s in [[R2]] }
  • [[R*]] = [[ε | R R*]] = { s | s = “” or s = s1 ^ s2 & s1 in [[R]] & s2 in [[R*]] }
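The denotation above can be explored by enumerating the members of [[R]] up to a length bound. The sketch below assumes a tuple encoding of regular expressions that is not from the slides: ("empty",) for ∅, ("eps",) for ε, ("chr", c), ("cat", r1, r2), ("alt", r1, r2), and ("star", r). The function name strings is likewise made up.

```python
def strings(r, n):
    """Return the set of strings in [[r]] whose length is at most n."""
    tag = r[0]
    if tag == "empty":
        return set()
    if tag == "eps":
        return {""}
    if tag == "chr":
        return {r[1]} if n >= 1 else set()
    if tag == "alt":
        return strings(r[1], n) | strings(r[2], n)
    if tag == "cat":
        return {a + b
                for a in strings(r[1], n)
                for b in strings(r[2], n)
                if len(a + b) <= n}
    # tag == "star": keep appending non-empty members of [[R]] until the
    # length bound stops producing anything new.
    base = strings(r[1], n) - {""}
    result = {""}
    frontier = {""}
    while frontier:
        frontier = {a + b for a in frontier for b in base
                    if len(a + b) <= n} - result
        result |= frontier
    return result
```

For the expression (ab | ac)* and a bound of 4, this yields the empty string, “ab”, “ac”, and the four two-block combinations such as “abac”.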
Example
We might recognize numbers as:
digit ::= [0-9]
number ::= ‘-’? digit+
• [0-9] is shorthand for ‘0’ | ‘1’ | … | ‘9’
• ‘-’? is shorthand for (‘-’ | ε)
• digit+ is shorthand for (digit digit*)
So number ::= (‘-’ | ε) ((‘0’ | ‘1’ | … | ‘9’)(‘0’ | ‘1’ | … | ‘9’)*)
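As a quick sanity check of the shorthand expansion, the two forms can be compared with Python's re module. This is an aside under an assumption: Python's engine is a back-tracking matcher, not the DFA-based approach these slides build toward, so it only demonstrates that the two patterns agree on sample inputs. The names shorthand, expanded, and same_verdict are made up here.

```python
import re

# Shorthand form from the slide: '-'? digit+
shorthand = re.compile(r"-?[0-9]+")

# Expanded form using only literals, alternation, concatenation, and star.
# "(?:-|)" is ('-' | ε): the empty alternative plays the role of ε.
digit = "(?:0|1|2|3|4|5|6|7|8|9)"
expanded = re.compile(rf"(?:-|){digit}{digit}*")

def same_verdict(s):
    """True if both patterns agree on whether s is a number."""
    return bool(shorthand.fullmatch(s)) == bool(expanded.fullmatch(s))
```

Running same_verdict over strings such as "-12", "007", "-", and "1a" should report agreement in every case.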
Graphical Representation
[Figure: automaton for the regular expression a b (c | d)* b, with a start state, an accept state, edges labeled a, b, c, d, and ε-edges.]
Non-Deterministic, Finite-State Automaton
[Figure: the same automaton for a b (c | d)* b.]
• Formally:
  • an alphabet Σ
  • a set V of states
  • a distinguished start state
  • one or more accepting states
  • transition relation: δ : V × (Σ + ε) × V → bool
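Translating RegExps
• Epsilon: [diagram: a single ε-edge]
• Literal ‘a’: [diagram: a single edge labeled a]
• R1 R2: [diagram: the automaton for R1 joined to the automaton for R2 by an ε-edge]
• R1 | R2: [diagram: ε-edges branch into the automata for R1 and R2, and ε-edges join them again]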
Translating RegExps (continued)
• R*: [diagram: the automaton for R wrapped in ε-edges]
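The translation can be sketched as a recursive function that returns an NFA fragment for each form of regular expression. The sketch assumes the standard textbook (Thompson-style) wiring, which may differ in detail from the diagrams on the slides, and a made-up tuple encoding: ("empty",), ("eps",), ("chr", c), ("cat", r1, r2), ("alt", r1, r2), ("star", r). The helper names build, closure, and nfa_accepts are hypothetical.

```python
import itertools

_fresh = itertools.count()  # supply of new state numbers

def build(r):
    """Return (start, accept, edges) for regex r.
    Each edge is (src, label, dst); label None is an epsilon move."""
    tag = r[0]
    if tag == "cat":
        s1, t1, e1 = build(r[1])
        s2, t2, e2 = build(r[2])
        return s1, t2, e1 + e2 + [(t1, None, s2)]
    s, t = next(_fresh), next(_fresh)
    if tag == "eps":
        return s, t, [(s, None, t)]
    if tag == "chr":
        return s, t, [(s, r[1], t)]
    if tag == "alt":
        s1, t1, e1 = build(r[1])
        s2, t2, e2 = build(r[2])
        return s, t, e1 + e2 + [(s, None, s1), (s, None, s2),
                                (t1, None, t), (t2, None, t)]
    if tag == "star":
        s1, t1, e1 = build(r[1])
        return s, t, e1 + [(s, None, s1), (t1, None, s1),
                           (s, None, t), (t1, None, t)]
    return s, t, []  # "empty": no path from start to accept

def closure(states, edges):
    """All states reachable from `states` using only epsilon moves."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for src, label, dst in edges:
            if src == q and label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def nfa_accepts(r, text):
    """Simulate the NFA for r by tracking the set of possible states."""
    start, accept, edges = build(r)
    current = closure({start}, edges)
    for ch in text:
        moved = {dst for src, label, dst in edges
                 if src in current and label == ch}
        current = closure(moved, edges)
    return accept in current
```

Tracking a set of current states, as nfa_accepts does, is the same idea that the subset construction turns into a precomputed table.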
Converting to Deterministic
• Naively:
  • Give each state a unique ID (1, 2, 3, …).
  • Create super-states:
    • One super-state for each subset of all possible states in the original NFA.
    • (e.g., {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3})
  • For each super-state (say {2,3}):
    • For each original state s in the super-state and character c:
      • Find the set of states accessible from s on c (say {1,2}), skipping over epsilons.
      • Add an edge labeled by c from the super-state to the super-state formed by the union of these sets.
• In practice, super-states are created lazily.
[Figure sequence: the subset construction applied step by step to the NFA for a b (c | d)* b, whose states are numbered 1 to 7. The resulting DFA has the super-states {1}, {2}, {3,6}, {4,6,3}, {5,6,3}, and {7}, with edges labeled a, b, c, and d.]
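The lazy version of the construction can be sketched as a worklist algorithm. The NFA below is a reconstruction of the seven-state example and is an assumption: the exact placement of the ε-edges in the original figure cannot be recovered, so EDGES is one wiring that yields the super-states named in the figure. The names closure and subset_construction are made up.

```python
from collections import deque

# Assumed NFA for a b (c | d)* b; None marks an epsilon edge.
EDGES = [
    (1, "a", 2), (2, "b", 3),
    (3, None, 6), (6, None, 3),
    (3, "c", 4), (3, "d", 5),
    (4, None, 6), (5, None, 6),
    (6, "b", 7),
]
START, ACCEPT = 1, {7}
ALPHABET = "abcd"

def closure(states):
    """Extend a set of NFA states with everything reachable via epsilons."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for src, label, dst in EDGES:
            if src == q and label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return frozenset(seen)

def subset_construction():
    """Create super-states lazily: only those reachable from the start."""
    start = closure({START})
    table = {}              # super-state -> {character: super-state}
    work = deque([start])
    while work:
        S = work.popleft()
        if S in table:
            continue
        table[S] = {}
        for c in ALPHABET:
            moved = {dst for src, label, dst in EDGES
                     if src in S and label == c}
            target = closure(moved)
            if target:      # an empty target means "no transition"
                table[S][c] = target
                work.append(target)
    accepting = {S for S in table if S & ACCEPT}
    return start, table, accepting
```

Only six of the possible subsets of the seven states are ever created here, which is why building super-states lazily matters in practice.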
Once We Have a DFA
• Deterministic finite-state automata are easy to simulate:
  • For each state and character, there is at most one transition we can take.
  • Usually record the transition function as an array, indexed by states and characters.
• Lexer starts off with a variable s initialized to the start state:
  • Reads a character, uses transition table to find next state.
• Look at the output of Lex!
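A table-driven simulation looks roughly like the following sketch. The state numbering, the DEAD marker, and the names TABLE, COLUMN, and run are assumptions made for this illustration; the table encodes a DFA for a b (c | d)* b, and a generated lexer's tables will be laid out differently.

```python
# Assumed numbering of DFA states for a b (c | d)* b:
# 0 = start, 1 = seen "a", 2 = seen "ab", 3 = just read c, 4 = just read d,
# 5 = accept (read the final b).
ACCEPTING = {5}
DEAD = -1                                  # marks "no transition"
COLUMN = {"a": 0, "b": 1, "c": 2, "d": 3}  # character -> table column

TABLE = [
    # a     b     c     d
    [1,    DEAD, DEAD, DEAD],  # state 0
    [DEAD, 2,    DEAD, DEAD],  # state 1
    [DEAD, 5,    3,    4],     # state 2
    [DEAD, 5,    3,    4],     # state 3
    [DEAD, 5,    3,    4],     # state 4
    [DEAD, DEAD, DEAD, DEAD],  # state 5
]

def run(text):
    """Return True if the DFA accepts text."""
    s = 0  # the variable s from the slide, initialized to the start state
    for ch in text:
        col = COLUMN.get(ch)
        if col is None:
            return False       # character outside the alphabet
        s = TABLE[s][col]
        if s == DEAD:
            return False       # no transition available
    return s in ACCEPTING
```

Each input character costs one array lookup, which is where the linear running time of DFA-based lexers comes from.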
An Algebraic Approach
• There’s a purely algebraic way to compute whether a string is in the set denoted by a regular expression.
• It’s based on computing the symbolic derivative of a regular expression with respect to the characters in a string.
Some Algebraic Structure
Think of ε as one, ∅ as zero, concatenation as multiplication, and alternation as addition:
∅ | R = R | ∅ = R
∅ R = R ∅ = ∅
ε R = R ε = R
R | R = R
ε* = ε
R** = R*
R2 | R1 = R1 | R2
Derivatives
• The derivative of a regular expression R with respect to a character ‘a’ is: [[deriv R a]] = { s | “a” ^ s in [[R]] }
• So it’s the residual strings once we remove the ‘a’ from the front.
  • e.g., deriv (abab) ‘a’ = bab
  • e.g., deriv (abab | acde) ‘a’ = (bab | cde)
Symbolic Differentiation
deriv ∅ c = ∅
deriv ε c = ∅
deriv c c = ε
deriv c c’ = ∅ (when c ≠ c’)
deriv (R1 | R2) c = (deriv R1 c) | (deriv R2 c)
deriv (R1 R2) c = (deriv R1 c) R2 | (empty R1) (deriv R2 c)
deriv (R*) c = (deriv R c) R*
Symbolic Empty
Think of ε as “true” and ∅ as “false”.
empty ∅ = ∅
empty ε = ε
empty c = ∅
empty (R1 | R2) = (empty R1) | (empty R2)
empty (R1 R2) = (empty R1) (empty R2)
empty (R*) = ε
Algebraic Recognition
• Given a regular expression R and a string of characters c1, c2, …, cn.
• We can test whether the string is in [[R]] by calculating:
  • deriv (… deriv (deriv R c1) c2 …) cn
• And then test whether the resulting regular expression accepts the empty string.
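The deriv and empty rules, together with the recognition procedure, can be written down almost directly. The sketch below assumes a tuple encoding of regular expressions and "smart constructors" (alt, cat, star) that apply the simplification laws from the algebraic-structure slide as expressions are built; those names and the encoding are not from the slides.

```python
EMPTY = ("empty",)  # ∅: matches no string
EPS = ("eps",)      # ε: matches only the empty string

def lit(c):
    return ("chr", c)

def alt(r1, r2):
    if r1 == EMPTY: return r2          # ∅ | R = R
    if r2 == EMPTY: return r1          # R | ∅ = R
    if r1 == r2: return r1             # R | R = R
    return ("alt", r1, r2)

def cat(r1, r2):
    if r1 == EMPTY or r2 == EMPTY: return EMPTY  # ∅ R = R ∅ = ∅
    if r1 == EPS: return r2                      # ε R = R
    if r2 == EPS: return r1                      # R ε = R
    return ("cat", r1, r2)

def star(r):
    if r == EPS: return EPS            # ε* = ε
    if r[0] == "star": return r        # R** = R*
    return ("star", r)

def empty(r):
    """Return EPS if r accepts the empty string, otherwise EMPTY."""
    tag = r[0]
    if tag in ("eps", "star"): return EPS
    if tag in ("empty", "chr"): return EMPTY
    if tag == "alt": return alt(empty(r[1]), empty(r[2]))
    return cat(empty(r[1]), empty(r[2]))          # tag == "cat"

def deriv(r, c):
    """Symbolic derivative of r with respect to the character c."""
    tag = r[0]
    if tag in ("empty", "eps"): return EMPTY
    if tag == "chr": return EPS if r[1] == c else EMPTY
    if tag == "alt": return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == "cat":
        return alt(cat(deriv(r[1], c), r[2]),
                   cat(empty(r[1]), deriv(r[2], c)))
    return cat(deriv(r[1], c), r)                 # tag == "star"

def matches(r, text):
    """Differentiate by each character, then test for the empty string."""
    for c in text:
        r = deriv(r, c)
    return empty(r) == EPS
```

Because the constructors simplify eagerly, taking the derivative of (ab | ac)* with respect to ‘a’ and then ‘b’ yields a value equal to the original expression.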
Example:
Take R = (ab | ac)*
Take the string to be [‘a’; ‘b’; ‘a’; ‘c’]
deriv R ‘a’ = (deriv (ab | ac) ‘a’) R
= ((deriv (ab) ‘a’) | (deriv (ac) ‘a’)) R
= (b | c) R
So deriv R ‘a’ = (b | c) R
Continuing
Now we must compute:
deriv ((b | c) R) ‘b’ = (deriv (b | c) ‘b’) R | (empty (b | c)) (deriv R ‘b’)
Here empty (b | c) = (empty b) | (empty c) = ∅ | ∅ = ∅, so the second alternative is ∅ (deriv R ‘b’) = ∅.
So this simplifies to:
deriv ((b | c) R) ‘b’ = (deriv (b | c) ‘b’) R
Continuing
(deriv (b | c) ‘b’) R = (deriv b ‘b’ | deriv c ‘b’) R
= (ε | ∅) R
= ε R
= R
So starting with R = (ab | ac)* and calculating the derivative w.r.t. ‘a’ and then ‘b’, we are back where we started with R.
Building a DFA with Derivatives
• Given your regular expression R, associate it with an initial state S0.
• Calculate the derivative of R w.r.t. each character in the alphabet, and generate new states for each unique resulting regular expression.
Example: Start State
[Figure sequence: the start state is (ab | ac)*. Its derivative w.r.t. ‘a’ is the new state (b | c)((ab | ac)*); its derivatives w.r.t. ‘b’ and w.r.t. ‘c’ are both ∅.]
Simplified
[Figure: (ab | ac)* has an edge labeled ‘a’ to (b | c)((ab | ac)*) and a single edge labeled ‘b’, ‘c’ to ∅.]
Then…
• Then continue calculating the derivatives for each new state.
• You stop when you don’t generate a new state.
• (Important: must do algebraic simplifications or this process won’t stop.)
Only One Non-Zero State
[Figure: the automaton so far. The only new state other than ∅ is (b | c)((ab | ac)*).]
Calculate its Derivatives
[Figure sequence: the derivatives of (b | c)((ab | ac)*) w.r.t. ‘b’ and w.r.t. ‘c’ are both (ab | ac)*; its derivative w.r.t. ‘a’ is ∅.]
And then simplify: no new states
[Figure: the final automaton. (ab | ac)* goes to (b | c)((ab | ac)*) on ‘a’ and to ∅ on ‘b’, ‘c’; (b | c)((ab | ac)*) goes back to (ab | ac)* on ‘b’, ‘c’ and to ∅ on ‘a’.]
Accepting States
• Then figure out the accepting states by seeing if they accept the empty string.
  • i.e., run empty on the associated regular expression and see if its simplified form is ε or ∅.
Which states accept empty?
[Figure: the final automaton with states (ab | ac)*, (b | c)((ab | ac)*), and ∅.]
Just this one (our start state): (ab | ac)*
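The whole construction, from start state to accepting states, can be sketched as a worklist loop over derivatives. The encoding and the names alt, cat, nullable, deriv, and build_dfa are assumptions made for this illustration. The simplifications applied are only the identity, zero, and idempotence laws, which are enough for this example; in general, alternations also need to be normalized for commutativity and associativity, or equivalent states may be treated as distinct and the process may not stop.

```python
EMPTY, EPS = ("empty",), ("eps",)

def alt(a, b):
    if a == EMPTY: return b
    if b == EMPTY or a == b: return a
    return ("alt", a, b)

def cat(a, b):
    if EMPTY in (a, b): return EMPTY
    if a == EPS: return b
    if b == EPS: return a
    return ("cat", a, b)

def nullable(r):
    """True if r accepts the empty string (empty r simplifies to ε)."""
    tag = r[0]
    if tag in ("eps", "star"): return True
    if tag in ("empty", "chr"): return False
    if tag == "alt": return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])      # tag == "cat"

def deriv(r, c):
    tag = r[0]
    if tag in ("empty", "eps"): return EMPTY
    if tag == "chr": return EPS if r[1] == c else EMPTY
    if tag == "alt": return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == "cat":
        left = cat(deriv(r[1], c), r[2])
        return alt(left, deriv(r[2], c)) if nullable(r[1]) else left
    return cat(deriv(r[1], c), r)                 # tag == "star"

def build_dfa(r, alphabet):
    """States are simplified regular expressions, numbered as discovered."""
    states = {r: 0}   # regular expression -> state number (start is 0)
    trans = {}        # (state number, character) -> state number
    work = [r]
    while work:
        q = work.pop()
        for c in alphabet:
            d = deriv(q, c)
            if d not in states:        # a new regular expression: new state
                states[d] = len(states)
                work.append(d)
            trans[(states[q], c)] = states[d]
    accepting = {i for q, i in states.items() if nullable(q)}
    return states, trans, accepting
```

Running build_dfa on (ab | ac)* over the alphabet "abc" produces three states, namely the start state, (b | c)((ab | ac)*), and ∅, with the start state as the only accepting one, which agrees with the example worked through on the slides.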