Lexing and Parsing: Understanding Syntax and Semantics

CS153: Intro to Lexing & Parsing

Notes • Problem Set 0 due Friday at 5pm. • See Lucas or me if you need help! • I will be having office hours from 5:45-7:15 on Thu night in the Science Center. • Reading: • Relevant chapters on Lexing & Parsing in Appel • OCamlLex and OCamlYacc documentation • “Monadic Parsing in Haskell” by G. Hutton and E. Meijer.

Parsing • Two pieces conceptually: • Recognizing syntactically valid phrases. • Extracting semantic content from the syntax. • E.g., What is the subject of the sentence? • E.g., What is the verb phrase? • E.g., Is the syntax ambiguous? If so, which meaning do we take? • “Fruit flies like a banana” • “2 * 3 + 4” • “x ^ f y” • In practice, solve both problems at the same time.

Specifying Syntax
We use grammars to specify the syntax of a language.
exp ::= int | var | exp ‘+’ exp | exp ‘*’ exp | ‘let’ var ‘=’ exp ‘in’ exp ‘end’
int ::= ‘-’? digit+
var ::= alpha (alpha | digit)*
digit ::= ‘0’ | ‘1’ | ‘2’ | ‘3’ | ‘4’ | … | ‘9’
alpha ::= [a-zA-Z]

Naïve Matching
To see if a sentence is legal, start with the first non-terminal, and keep expanding non-terminals until you can match against the sentence.
Grammar: N ::= ‘a’ | ‘(’ N ‘)’
Sentence: “((a))”
N ⇒ ‘(’ N ‘)’ ⇒ ‘(’ ‘(’ N ‘)’ ‘)’ ⇒ ‘(’ ‘(’ ‘a’ ‘)’ ‘)’ = “((a))”
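The top-down expansion above can be sketched as a tiny recursive-descent recognizer for N ::= ‘a’ | ‘(’ N ‘)’. This is only an illustration in Python; the names parse_n and is_sentence are made up for this sketch and do not come from the course materials.

```python
from typing import Optional


def parse_n(s: str, i: int = 0) -> Optional[int]:
    """Try to match non-terminal N starting at index i.

    Returns the index just past the match, or None if N does not match here.
    """
    if i < len(s) and s[i] == "a":            # N ::= 'a'
        return i + 1
    if i < len(s) and s[i] == "(":            # N ::= '(' N ')'
        j = parse_n(s, i + 1)
        if j is not None and j < len(s) and s[j] == ")":
            return j + 1
    return None


def is_sentence(s: str) -> bool:
    """A string is a legal sentence if N matches all of it."""
    return parse_n(s) == len(s)
```

For this grammar the first character decides which alternative to try, so no back-tracking is needed; that is not true of grammars in general.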

Alternatively
Start with the sentence, and replace phrases with corresponding non-terminals, repeating until you derive the start non-terminal.
Grammar: N ::= ‘a’ | ‘(’ N ‘)’
Sentence: “((a))”
‘(’ ‘(’ ‘a’ ‘)’ ‘)’, then ‘(’ ‘(’ N ‘)’ ‘)’, then ‘(’ N ‘)’, then N
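The bottom-up direction can be sketched the same way: rewrite phrases into non-terminals until only the start symbol remains. The function name reduces_to_start is hypothetical. For this particular grammar the order of reductions does not matter, which is an assumption that fails for most real grammars, where choosing the reduction is the hard part.

```python
def reduces_to_start(sentence: str) -> bool:
    """Bottom-up recognition for N ::= 'a' | '(' N ')'."""
    if any(ch not in "a()" for ch in sentence):
        return False                      # not even in the alphabet
    form = sentence.replace("a", "N")     # reduce every 'a' to N
    while "(N)" in form:
        form = form.replace("(N)", "N", 1)  # reduce one '(' N ')' to N
    return form == "N"
```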

Highly Non-Deterministic • For real grammars, automating this non-deterministic search is non-trivial. • As we’ll see, naïve implementations must do a lot of back-tracking in the search. • Ideally, given a grammar, we would like an efficient, deterministic algorithm to see if a string matches it. • There is a very general cubic time algorithm. • Only linguists use it. • (In part, we don’t because recognition is only half the problem.) • Certain classes of grammars have much more efficient implementations. • Essentially linear time with constant state (DFAs). • Or linear time with stack-like state (Pushdown Automata).

Tools in your Toolbox • Manual parsing (say, recursive descent). • Tedious, error prone, hard to maintain. • But fast & good error messages. • Parsing combinators • Encode grammars as (higher-order) functions. • Basically, functions that generate recursive-descent parsers. • Makes it easy to write & maintain grammars. • But can do a lot of back-tracking, and requires a limited form of grammar (e.g., no left-recursion.) • Lex and Yacc • Domain-Specific-Languages that generate very efficient, table-driven parsers for general classes of grammars. • Learn about the theory in 121 • Need to know a bit here to understand how to effectively use these tools.
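To make “grammars as higher-order functions” concrete, here is a minimal back-tracking combinator sketch in Python. It is not the Hutton and Meijer library or any real package; lit, seq, alt, lazy and recognizes are names invented for this sketch. A parser is a function from a string and a position to the list of positions where a match can end.

```python
from typing import Callable, List

Parser = Callable[[str, int], List[int]]


def lit(ch: str) -> Parser:
    """Match exactly one given character."""
    def p(s: str, i: int) -> List[int]:
        return [i + 1] if s[i:i + 1] == ch else []
    return p


def seq(*parsers: Parser) -> Parser:
    """Match the parsers one after another."""
    def p(s: str, i: int) -> List[int]:
        ends = [i]
        for q in parsers:
            ends = [k for j in ends for k in q(s, j)]
        return ends
    return p


def alt(*parsers: Parser) -> Parser:
    """Try every alternative and keep all results (this is the back-tracking)."""
    def p(s: str, i: int) -> List[int]:
        return [k for q in parsers for k in q(s, i)]
    return p


def lazy(thunk: Callable[[], Parser]) -> Parser:
    """Delay looking up a parser so that grammars can be recursive."""
    def p(s: str, i: int) -> List[int]:
        return thunk()(s, i)
    return p


# N ::= 'a' | '(' N ')'
N = alt(lit("a"), seq(lit("("), lazy(lambda: N), lit(")")))


def recognizes(parser: Parser, s: str) -> bool:
    """Accept if some parse consumes the whole string."""
    return len(s) in parser(s, 0)
```

A left-recursive rule such as exp ::= exp ‘+’ exp would make this sketch loop forever, which is the restriction mentioned on the slide.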

CS153: Regular Expressions & Finite-State Automata

Regular Expressions • Non-recursive grammars • ∅ (matches no string) • ε (epsilon – matches empty string) • Literals (‘a’, ‘b’, ‘2’, ‘+’, etc.) drawn from alphabet • Concatenation (R1 R2) • Alternation (R1 | R2) • Kleene star (R*)

Formally
• A regular expression denotes a set of strings:
• [[∅]] = { }
• [[ε]] = { “” }
• [[‘a’]] = { “a” }
• [[R1 R2]] = { s | s = s1 ^ s2 & s1 in [[R1]] & s2 in [[R2]] }
• [[R1 | R2]] = { s | s in [[R1]] or s in [[R2]] }
• [[R*]] = [[ε | R R*]] = { s | s = “” or (s = s1 ^ s2 & s1 in [[R]] & s2 in [[R*]]) }

Example
We might recognize numbers as:
digit ::= [0-9]
number ::= ‘-’? digit+
• [0-9] is shorthand for ‘0’ | ‘1’ | … | ‘9’
• ‘-’? is shorthand for (‘-’ | ε)
• digit+ is shorthand for (digit digit*)
So number ::= (‘-’ | ε) ((‘0’ | ‘1’ | … | ‘9’)(‘0’ | ‘1’ | … | ‘9’)*)
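As a quick check of the shorthand, the sketch below writes the number pattern both ways using Python's standard re module and confirms they agree on a few samples. The names NUMBER, DESUGARED and is_number are chosen for this illustration only.

```python
import re

# Shorthand form: '-'? digit+
NUMBER = re.compile(r"-?[0-9]+")

# Expanded form: ('-' | ε) (digit digit*), with digit spelled out as '0' | ... | '9'
digit = "(?:" + "|".join("0123456789") + ")"
DESUGARED = re.compile("(?:-|)(?:" + digit + digit + "*)")


def is_number(s: str) -> bool:
    """True if the whole string is a number according to the shorthand form."""
    return NUMBER.fullmatch(s) is not None


samples = ["0", "-42", "007", "-", "", "4-2", "x1"]
# Both spellings should accept exactly the same samples.
agree = all(
    (NUMBER.fullmatch(s) is None) == (DESUGARED.fullmatch(s) is None)
    for s in samples
)
```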

Graphical Representation [Figure: automaton for the regular expression a b (c | d)* b, with a start state, an accept state, and edges labeled a, b, c, d, and ε.]

Non-Deterministic, Finite-State Automaton [Figure: automaton with start and accept states and edges labeled a, b, c, d, and ε.] • Formally: • an alphabet Σ • a set V of states • a distinguished start state • one or more accepting states • transition relation: δ : V * (Σ + ε) * V → bool

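Translating RegExps • Epsilon: • Literal ‘a’: • R1 R2 • R1 | R2 [Figures: one automaton fragment per case, joined by ε-edges; the diagrams are not recoverable from this transcript.]

Translating RegExps • R* [Figure: automaton fragment for R*, built from the fragment for R and ε-edges.]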
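Since the diagrams did not survive, here is one way the translation might look in code. It follows the usual Thompson-style construction, which is an assumption on my part and may differ in detail from the fragments drawn on the slides. Regular expressions are nested tuples, and the class NFA and the functions build, eps_closure and accepts are names invented for this sketch.

```python
from itertools import count


class NFA:
    def __init__(self):
        self.edges = []          # (source, label, target); label None means ε
        self._ids = count()

    def new_state(self):
        return next(self._ids)

    def add(self, src, label, dst):
        self.edges.append((src, label, dst))


def build(nfa, r):
    """Add a fragment for regex r and return its (start, accept) states."""
    start, accept = nfa.new_state(), nfa.new_state()
    kind = r[0]
    if kind == "eps":
        nfa.add(start, None, accept)
    elif kind == "lit":
        nfa.add(start, r[1], accept)
    elif kind == "cat":                      # R1 R2: chain the two fragments
        s1, a1 = build(nfa, r[1])
        s2, a2 = build(nfa, r[2])
        nfa.add(start, None, s1)
        nfa.add(a1, None, s2)
        nfa.add(a2, None, accept)
    elif kind == "alt":                      # R1 | R2: branch into either fragment
        for sub in r[1:]:
            s, a = build(nfa, sub)
            nfa.add(start, None, s)
            nfa.add(a, None, accept)
    elif kind == "star":                     # R*: skip the fragment or loop through it
        s, a = build(nfa, r[1])
        nfa.add(start, None, accept)
        nfa.add(start, None, s)
        nfa.add(a, None, s)
        nfa.add(a, None, accept)
    else:
        raise ValueError("unknown regex node: %r" % (kind,))
    return start, accept


def eps_closure(nfa, states):
    """All states reachable from `states` using only ε-edges."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for src, label, dst in nfa.edges:
            if src == s and label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(r, text):
    """Simulate the NFA for r on text by tracking the set of live states."""
    nfa = NFA()
    start, accept = build(nfa, r)
    current = eps_closure(nfa, {start})
    for ch in text:
        moved = {dst for src, label, dst in nfa.edges
                 if src in current and label == ch}
        current = eps_closure(nfa, moved)
    return accept in current


# a b (c | d)* b
R = ("cat",
     ("cat", ("cat", ("lit", "a"), ("lit", "b")),
      ("star", ("alt", ("lit", "c"), ("lit", "d")))),
     ("lit", "b"))
```

This version adds more ε-edges than strictly necessary; it favours a uniform shape for every case over a small automaton.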

Converting to Deterministic • Naively: • Give each state a unique ID (1,2,3,…) • Create super-states: • One super-state for each subset of all possible states in the original NDFA. • (e.g., {1},{2},{3},{1,2},{1,3},{2,3},{1,2,3}) • For each super-state (say {2,3}): • For each original state s in the super-state and each character c: • Find the set of states accessible from s on c (say {1,2}), skipping over epsilons. • Add an edge labeled by c from the super-state to the super-state formed by the union of these sets. • In practice, super-states are created lazily.
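A lazy version of this construction can be sketched as follows. The edge table NFA_EDGES is my reconstruction of an NDFA for a b (c | d)* b with states numbered 1 to 7; the exact edges are an assumption, since the figure itself is lost. The names eps_closure and subset_construction are likewise invented for this sketch.

```python
EPS = None  # label used for ε-edges

# Hypothetical NDFA for a b (c | d)* b; state 1 is the start, state 7 accepts.
NFA_EDGES = {
    (1, "a"): {2}, (2, "b"): {3},
    (3, EPS): {6}, (3, "c"): {4}, (3, "d"): {5},
    (4, EPS): {6}, (5, EPS): {6},
    (6, EPS): {3}, (6, "b"): {7},
}
ALPHABET = "abcd"


def eps_closure(states):
    """The given states plus everything reachable by skipping over epsilons."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA_EDGES.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)


def subset_construction(start):
    """Build only the super-states reachable from the start state."""
    start_set = eps_closure({start})
    table = {}                      # super-state -> {character: super-state}
    worklist = [start_set]
    while worklist:
        current = worklist.pop()
        if current in table:
            continue                # already expanded
        table[current] = {}
        for c in ALPHABET:
            moved = set()
            for s in current:
                moved |= NFA_EDGES.get((s, c), set())
            if moved:
                target = eps_closure(moved)
                table[current][c] = target
                worklist.append(target)
    return start_set, table


start_set, table = subset_construction(1)
```

Only six of the 2^7 possible super-states are ever created here, which is why the lazy approach is used in practice.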

[Figures: step-by-step conversion of the NDFA (states numbered 1 to 7) into a DFA whose super-states are 1, 2, {3,6}, {4,6,3}, {5,6,3}, and 7, with edges labeled a, b, c, and d added one build at a time.]

Once We Have a DFA • Deterministic finite-state automata are easy to simulate: • For each state and character, there is at most one transition we can take. • Usually record the transition function as an array, indexed by states and characters. • Lexer starts off with a variable s initialized to the start state: • Reads a character, uses transition table to find next state. • Look at the output of Lex!
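A table-driven simulation might look like the sketch below. The table is a hand-written DFA for a b (c | d)* b with states 0 to 3; it is an illustration, not actual Lex output, and the names TABLE, DEAD, ACCEPTING and run_dfa are assumptions of this sketch.

```python
ALPHABET = "abcd"      # column order of the table
DEAD = -1              # marks "no transition"

# TABLE[state][column]: next state on the character for that column
TABLE = [
    # a     b     c     d
    [1,    DEAD, DEAD, DEAD],   # state 0: start, expects 'a'
    [DEAD, 2,    DEAD, DEAD],   # state 1: expects 'b'
    [DEAD, 3,    2,    2],      # state 2: loops on 'c'/'d', 'b' finishes
    [DEAD, DEAD, DEAD, DEAD],   # state 3: accepting, no way out
]
ACCEPTING = {3}


def run_dfa(text: str) -> bool:
    s = 0                                  # variable s starts at the start state
    for ch in text:
        col = ALPHABET.find(ch)
        if col < 0:
            return False                   # character outside the alphabet
        s = TABLE[s][col]                  # one table lookup per character
        if s == DEAD:
            return False
    return s in ACCEPTING
```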

An Algebraic Approach • There’s a purely algebraic way to compute whether a string is in the set denoted by a regular expression. • It’s based on computing the symbolic derivative of a regular expression with respect to the characters in a string.

Some Algebraic Structure
Think of ε as one, ∅ as zero, concatenation as multiplication, and alternation as addition:
∅ | R = R | ∅ = R
∅ R = R ∅ = ∅
ε R = R ε = R
R | R = R
ε* = ε
R** = R*
R2 | R1 = R1 | R2

Derivatives • The derivative of a regular expression R with respect to a character ‘a’ is: [[deriv R a]] = { s | “a” ^ s in [[R]] } • So it’s the residual strings once we remove the ‘a’ from the front. • e.g., deriv (abab) ‘a’ = bab • e.g., deriv (abab | acde) ‘a’ = (bab | cde)

Symbolic Differentiation
deriv ∅ c = ∅
deriv ε c = ∅
deriv c c = ε
deriv c c’ = ∅ (when c’ ≠ c)
deriv (R1 | R2) c = (deriv R1 c) | (deriv R2 c)
deriv (R1 R2) c = (deriv R1 c) R2 | (empty R1) (deriv R2 c)
deriv (R*) c = (deriv R c) R*

Symbolic Empty
Think of ε as “true” and ∅ as “false”.
empty ∅ = ∅
empty ε = ε
empty c = ∅
empty (R1 | R2) = (empty R1) | (empty R2)
empty (R1 R2) = (empty R1) (empty R2)
empty (R*) = ε

Algebraic Recognition • Given a regular expression R and a string of characters c1, c2, …, cn. • We can test whether the string is in [[R]] by calculating: • deriv (… deriv (deriv R c1) c2 …) cn • And then test whether the resulting regular expression accepts the empty string.
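The recognition procedure can be sketched directly from the deriv and empty rules. In this Python illustration regular expressions are nested tuples, and nullable is a boolean helper I introduce so that empty can return ε or ∅ without needing a simplifier; all of the names are assumptions of this sketch.

```python
NULL = ("null",)   # ∅
EPS = ("eps",)     # ε


def lit(c):
    return ("chr", c)


def alt(r1, r2):
    return ("alt", r1, r2)


def cat(r1, r2):
    return ("cat", r1, r2)


def star(r):
    return ("star", r)


def nullable(r):
    """True exactly when r accepts the empty string."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("null", "chr"):
        return False
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])      # cat


def empty(r):
    """The slides' empty: ε plays the role of true, ∅ of false."""
    return EPS if nullable(r) else NULL


def deriv(r, c):
    tag = r[0]
    if tag in ("null", "eps"):
        return NULL
    if tag == "chr":
        return EPS if r[1] == c else NULL
    if tag == "alt":
        return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == "cat":
        return alt(cat(deriv(r[1], c), r[2]),
                   cat(empty(r[1]), deriv(r[2], c)))
    return cat(deriv(r[1], c), r)                 # star


def matches(r, text):
    """deriv (... deriv (deriv R c1) c2 ...) cn, then test for the empty string."""
    for c in text:
        r = deriv(r, c)
    return nullable(r)


# R = (ab | ac)*
R = star(alt(cat(lit("a"), lit("b")), cat(lit("a"), lit("c"))))
```

Without algebraic simplification the intermediate expressions keep growing, which is harmless for matching one string but matters as soon as expressions are used as DFA states.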

Continuing With R = (ab | ac)*, the derivative w.r.t. ‘a’ simplifies to (b | c) R. Now we must compute: deriv ((b | c) R) ‘b’ = (deriv (b | c) ‘b’) R | (empty (b | c)) (deriv R ‘b’)

Continuing (deriv (b | c) ‘b’) R = (deriv b ‘b’ | deriv c ‘b’) R = (ε | ∅) R = ε R = R. The second summand vanishes because empty (b | c) = ∅. So starting with R = (ab | ac)* and calculating the derivative w.r.t. ‘a’ and then ‘b’, we are back where we started with R.

Building a DFA with Derivatives • Given your regular expression R • associate it with an initial state S0. • Calculate the derivative of R w.r.t. each character in the alphabet, and generate new states for each unique resulting regular expression.

Example: Start State (ab | ac)*
• deriv w.r.t. ‘a’: (b | c)((ab | ac)*)
• deriv w.r.t. ‘b’: ∅
• deriv w.r.t. ‘c’: ∅

Simplified [Figure: (ab | ac)* goes to (b | c)((ab | ac)*) on ‘a’ and to ∅ on ‘b’, ‘c’.]

Then… • Then continue calculating the derivatives for each new state. • You stop when you don’t generate a new state. • (Important: must do algebraic simplifications or this process won’t stop.)

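Only One Non-Zero State [Figure: (ab | ac)* goes to (b | c)((ab | ac)*) on ‘a’ and to ∅ on ‘b’, ‘c’.]

Calculate its Derivatives [Figure: as before, plus the derivative of (b | c)((ab | ac)*) w.r.t. ‘b’, which is (ab | ac)*.]

And then simplify: no new states [Figure: (ab | ac)* goes to (b | c)((ab | ac)*) on ‘a’ and to ∅ on ‘b’, ‘c’; (b | c)((ab | ac)*) goes back to (ab | ac)* on ‘b’, ‘c’ and to ∅ on ‘a’.]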

Accepting States • Then figure out the accepting states by seeing if they accept the empty string. • i.e., run empty on the associated regular expression and see if its simplified form is ε or ∅.

Which states accept empty? Just this one (our start state): (ab | ac)*. The other two states, (b | c)((ab | ac)*) and ∅, do not.
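The whole construction for (ab | ac)* can be sketched in a few dozen lines. The constructors below apply the algebraic laws as they build expressions, which is what lets the search stop; they do not normalize associativity, so this is a sketch that works for this example rather than a complete simplifier. All names (alt, cat, star, nullable, deriv, build_dfa) are assumptions of this illustration.

```python
NULL, EPS = ("null",), ("eps",)   # ∅ and ε


def lit(c):
    return ("chr", c)


def alt(r1, r2):
    if r1 == NULL:
        return r2                                   # ∅ | R = R
    if r2 == NULL:
        return r1                                   # R | ∅ = R
    if r1 == r2:
        return r1                                   # R | R = R
    return ("alt",) + tuple(sorted((r1, r2), key=repr))   # R2 | R1 = R1 | R2


def cat(r1, r2):
    if NULL in (r1, r2):
        return NULL                                 # ∅ R = R ∅ = ∅
    if r1 == EPS:
        return r2                                   # ε R = R
    if r2 == EPS:
        return r1                                   # R ε = R
    return ("cat", r1, r2)


def star(r):
    if r in (NULL, EPS):
        return EPS                                  # ε* = ε (and ∅* = ε)
    if r[0] == "star":
        return r                                    # R** = R*
    return ("star", r)


def nullable(r):
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("null", "chr"):
        return False
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])


def deriv(r, c):
    tag = r[0]
    if tag in ("null", "eps"):
        return NULL
    if tag == "chr":
        return EPS if r[1] == c else NULL
    if tag == "alt":
        return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == "cat":
        right = deriv(r[2], c) if nullable(r[1]) else NULL
        return alt(cat(deriv(r[1], c), r[2]), right)
    return cat(deriv(r[1], c), r)


def build_dfa(r, alphabet):
    states = {r: 0}                 # simplified regex -> state number; S0 is r
    transitions = {}
    worklist = [r]
    while worklist:
        q = worklist.pop()
        for c in alphabet:
            d = deriv(q, c)
            if d not in states:     # a new state only for a new expression
                states[d] = len(states)
                worklist.append(d)
            transitions[(states[q], c)] = states[d]
    accepting = {i for q, i in states.items() if nullable(q)}
    return states, transitions, accepting


R = star(alt(cat(lit("a"), lit("b")), cat(lit("a"), lit("c"))))
states, transitions, accepting = build_dfa(R, "abc")
```

On this input the sketch should find the same three states as the slides: the start state (ab | ac)*, the state (b | c)((ab | ac)*), and ∅, with only the start state accepting.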
