CS153: Intro to Lexing & Parsing

Presentation Transcript
Notes
  • Problem Set 0 due Friday at 5pm.
    • See Lucas or me if you need help!
    • I will be having office hours from 5:45-7:15 on Thu night in the Science Center.
  • Reading:
    • Relevant chapters on Lexing & Parsing in Appel
    • OCamlLex and OCamlYacc documentation
    • “Monadic Parsing in Haskell” by G. Hutton and E. Meijer.
Parsing
  • Two pieces conceptually:
    • Recognizing syntactically valid phrases.
    • Extracting semantic content from the syntax.
      • E.g., What is the subject of the sentence?
      • E.g., What is the verb phrase?
      • E.g., Is the syntax ambiguous? If so, which meaning do we take?
        • “Fruit flies like a banana”
        • “2 * 3 + 4”
        • “x ^ f y”
  • In practice, solve both problems at the same time.
Specifying Syntax

We use grammars to specify the syntax of a language.

exp → int | var | exp ‘+’ exp | exp ‘*’ exp |

‘let’ var ‘=’ exp ‘in’ exp ‘end’

int → ‘-’? digit+

var → alpha (alpha | digit)*

digit → ‘0’ | ‘1’ | ‘2’ | ‘3’ | ‘4’ | … | ‘9’

alpha → [a-zA-Z]
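
As a concrete target for the “extracting semantic content” half of the problem, here is a minimal sketch of an OCaml datatype that a parser for this grammar might build. The constructor names are my own, not the course’s:

(* Hypothetical abstract syntax for the exp grammar above: one constructor per production. *)
type exp =
  | Int of int                  (* int *)
  | Var of string               (* var *)
  | Plus of exp * exp           (* exp '+' exp *)
  | Times of exp * exp          (* exp '*' exp *)
  | Let of string * exp * exp   (* 'let' var '=' exp 'in' exp 'end' *)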

Naïve Matching

To see if a sentence is legal, start with the start non-terminal and keep expanding non-terminals until you can match against the sentence.

N ‘a’ | ‘(’ N ‘)’“((a))”

N ‘(’N ‘)’

‘(’‘(’N ‘)’‘)’

‘(’‘(’‘a’‘)’‘)’ = “((a))”

Alternatively

Start with the sentence, and replace phrases with corresponding non-terminals, repeating until you derive the start non-terminal.

N ‘a’ | ‘(‘ N ‘)’“((a))”

‘(’‘(‘‘a’‘)’‘)’

‘(‘‘(‘ N ‘)’‘)’

‘(‘ N ‘)’ N

Highly Non-Deterministic
  • For real grammars, automating this non-deterministic search is non-trivial.
    • As we’ll see, naïve implementations must do a lot of back-tracking in the search.
  • Ideally, given a grammar, we would like an efficient, deterministic algorithm to see if a string matches it.
    • There is a very general cubic time algorithm.
    • Only linguists use it.
    • (In part, we don’t because recognition is only half the problem.)
  • Certain classes of grammars have much more efficient implementations.
    • Essentially linear time with constant state (DFAs).
    • Or linear time with stack-like state (Pushdown Automata).
Tools in your Toolbox
  • Manual parsing (say, recursive descent).
    • Tedious, error prone, hard to maintain.
    • But fast & good error messages.
  • Parsing combinators
    • Encode grammars as (higher-order) functions.
      • Basically, functions that generate recursive-descent parsers.
      • Makes it easy to write & maintain grammars.
    • But can do a lot of back-tracking, and requires a limited form of grammar (e.g., no left-recursion).
  • Lex and Yacc
    • Domain-Specific-Languages that generate very efficient, table-driven parsers for general classes of grammars.
    • Learn about the theory in CS 121.
    • Need to know a bit here to understand how to effectively use these tools.
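
To make “encode grammars as (higher-order) functions” concrete, here is a minimal parsing-combinator sketch in OCaml; the names and representation are my own (see the Hutton/Meijer paper for the Haskell version). A parser takes the remaining input and returns every way it can succeed, which is exactly where the back-tracking cost comes from:

type 'a parser = char list -> ('a * char list) list

let return (x : 'a) : 'a parser = fun cs -> [ (x, cs) ]

let ( >>= ) (p : 'a parser) (f : 'a -> 'b parser) : 'b parser =
  fun cs -> List.concat_map (fun (x, rest) -> f x rest) (p cs)

(* Alternation: collect the successes of both parsers (back-tracking). *)
let ( <|> ) (p : 'a parser) (q : 'a parser) : 'a parser = fun cs -> p cs @ q cs

(* Match one specific character. *)
let char (c : char) : char parser = function
  | c' :: rest when c' = c -> [ (c, rest) ]
  | _ -> []

(* e.g., the earlier grammar  N → 'a' | '(' N ')'  : *)
let rec n : char parser =
  fun cs ->
    (char 'a'
     <|> (char '(' >>= fun _ -> n >>= fun x -> char ')' >>= fun _ -> return x))
      cs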
Regular Expressions
  • Non-recursive grammars
    • ∅ (matches no string)
    • ε (epsilon, matches the empty string)
    • Literals (‘a’, ‘b’, ‘2’, ‘+’, etc.) drawn from the alphabet
    • Concatenation (R1 R2)
    • Alternation (R1 | R2)
    • Kleene star (R*)
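
Written as an OCaml datatype (a sketch with my own constructor names), the six cases above are:

type regexp =
  | Empty                    (* ∅ : matches no string *)
  | Eps                      (* ε : matches only the empty string *)
  | Char of char             (* a literal drawn from the alphabet *)
  | Cat of regexp * regexp   (* concatenation R1 R2 *)
  | Alt of regexp * regexp   (* alternation R1 | R2 *)
  | Star of regexp           (* Kleene star R* *)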
Formally
  • A regular expression denotes a set of strings:
    • [[∅]] = { }
    • [[ε]] = { “” }
    • [[‘a’]] = { “a” }
    • [[R1 R2]] = { s | s = s1 ^ s2 & s1 in [[R1]] & s2 in [[R2]] }
    • [[R1 | R2]] = { s | s in [[R1]] or s in [[R2]] }
    • [[R*]] = [[ε | R R*]] = { s | s = “” or s = s1 ^ s2 & s1 in [[R]] & s2 in [[R*]] }
Example

We might recognize numbers as:

digit ::= [0-9]

number ::= ‘-’? digit+

[0-9] shorthand for ‘0’ | ‘1’ | … | ‘9’

‘-’? shorthand for (‘-’ | ε)

digit+ shorthand for (digit digit*)

So number ::= (‘-’ | ε) ((‘0’ | ‘1’ | … | ‘9’)(‘0’ | ‘1’ | … | ‘9’)*)
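
With the shorthands expanded, the same definition can also be written directly as a value of the regexp datatype sketched earlier (the type is repeated here so the snippet stands alone):

type regexp =
  | Empty | Eps | Char of char
  | Cat of regexp * regexp | Alt of regexp * regexp | Star of regexp

(* digit ::= '0' | '1' | ... | '9' *)
let digit =
  List.fold_left (fun acc c -> Alt (acc, Char c)) (Char '0')
    [ '1'; '2'; '3'; '4'; '5'; '6'; '7'; '8'; '9' ]

(* number ::= ('-' | ε) (digit digit*) *)
let number = Cat (Alt (Char '-', Eps), Cat (digit, Star digit))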

Graphical Representation

[NFA diagram for the regular expression a b (c | d)* b, with a start state, an accept state, and ε-transitions implementing the (c | d)* loop]

Non-Deterministic, Finite-State Automaton

[The same NFA diagram for a b (c | d)* b, repeated]

  • Formally:
    • an alphabet Σ
    • a set V of states
    • a distinguished start state
    • one or more accepting states
    • transition relation: δ : V * (Σ + ε) * V → bool


Translating RegExps

Each case of a regular expression becomes a small NFA fragment (the diagrams are not reproduced here):
  • Epsilon: a single ε-transition.
  • Literal ‘a’: a single edge labeled ‘a’.
  • R1 R2: the fragment for R1 connected to the fragment for R2 by an ε-transition.
  • R1 | R2: ε-transitions branch out to the fragments for R1 and R2 and then join back together.

Converting to Deterministic
  • Naively:
    • Give each state a unique ID (1,2,3,…)
    • Create super states
      • One super-state for each subset of all possible states in the original NDFA.
      • (e.g., {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3})
    • For each super-state (say {2,3}):
      • For each state s in the super-state and each character c:
        • Find the set of accessible states (say {1,2}), skipping over epsilons.
        • Add an edge labeled by c from the super-state to the corresponding super-state.
  • In practice, super-states are created lazily.
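
A rough sketch, in OCaml, of the two operations each step of this construction needs: the ε-closure of a set of states, and the super-state reached on a character c. The NFA representation is my own choice for illustration (a list of labeled edges, with None marking an ε edge), not what ocamllex actually uses:

module IntSet = Set.Make (Int)

(* An NFA edge is (source, label, target); a label of None is an epsilon edge. *)
type nfa = { trans : (int * char option * int) list }

(* All states reachable from [states] by following only epsilon edges. *)
let rec eclose (m : nfa) (states : IntSet.t) : IntSet.t =
  let bigger =
    List.fold_left
      (fun acc (s, lbl, t) ->
        if lbl = None && IntSet.mem s acc then IntSet.add t acc else acc)
      states m.trans
  in
  if IntSet.equal bigger states then states else eclose m bigger

(* The super-state reached from super-state [states] on character [c]. *)
let dfa_step (m : nfa) (states : IntSet.t) (c : char) : IntSet.t =
  List.fold_left
    (fun acc (s, lbl, t) ->
      if lbl = Some c && IntSet.mem s states then IntSet.add t acc else acc)
    IntSet.empty m.trans
  |> eclose m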
[Slides 18-21 step through this construction on the NFA for a b (c | d)* b, whose states are numbered 1 through 7 (with ε edges linking the (c | d)* loop). The resulting DFA has super-states 1, 2, {3,6}, {4,6,3}, {5,6,3}, and 7: an ‘a’ edge from 1 to 2, a ‘b’ edge from 2 to {3,6}, and, from each of {3,6}, {4,6,3}, and {5,6,3}, a ‘c’ edge to {4,6,3}, a ‘d’ edge to {5,6,3}, and a ‘b’ edge to 7.]
Once We have a DFA
  • Deterministic Finite State automata are easy to simulate:
    • For each state and character, there is at most one transition we can take.
  • Usually record the transition function as an array, indexed by states and characters.
  • Lexer starts off with a variable s initialized to the start state:
    • Reads a character, uses transition table to find next state.
    • Look at the output of Lex!
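
A minimal sketch of that simulation loop in OCaml, assuming a table encoding of my own (not ocamllex’s actual output format): states are ints, trans.(s).(Char.code c) gives the next state, and -1 means “no transition”:

(* Run a table-driven DFA over [input]; true iff we end in an accepting
   state without ever falling off the transition table. *)
let run_dfa (trans : int array array) (accepting : bool array)
    (start : int) (input : string) : bool =
  let state = ref start in
  (try
     String.iter
       (fun c ->
         let next = trans.(!state).(Char.code c) in
         if next < 0 then raise Exit else state := next)
       input
   with Exit -> state := -1);
  !state >= 0 && accepting.(!state)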
An Algebraic Approach
  • There’s a purely algebraic way to compute whether a string is in the set denoted by a regular expression.
  • It’s based on computing the symbolic derivative of a regular expression with respect to the characters in a string.
Some Algebraic Structure

Think of ε as one, ∅ as zero, concatenation as multiplication, and alternation as addition:

∅ | R = R | ∅ = R

∅ R = R ∅ = ∅

ε R = R ε = R

R | R = R

ε* = ε

R** = R*

R2 | R1 = R1 | R2
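
These identities are exactly the simplifications a derivative-based matcher bakes into “smart constructors” so that regular expressions stay in a normalized form. A sketch in OCaml, over the regexp datatype from the earlier sketch (repeated so this compiles on its own):

type regexp =
  | Empty | Eps | Char of char
  | Cat of regexp * regexp | Alt of regexp * regexp | Star of regexp

(* ∅ | R = R | ∅ = R,  and  R | R = R *)
let alt r1 r2 =
  match (r1, r2) with
  | Empty, r | r, Empty -> r
  | _ -> if r1 = r2 then r1 else Alt (r1, r2)

(* ∅ R = R ∅ = ∅,  and  ε R = R ε = R *)
let cat r1 r2 =
  match (r1, r2) with
  | Empty, _ | _, Empty -> Empty
  | Eps, r | r, Eps -> r
  | _ -> Cat (r1, r2)

(* ε* = ε (and ∅* = ε),  R** = R* *)
let star r =
  match r with
  | Empty | Eps -> Eps
  | Star _ -> r
  | _ -> Star r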

Derivatives
  • The derivative of a regular expression R with respect to a character ‘a’ is: [[deriv R a]] = { s | “a” ^ s in [[R]] }
  • So it’s the residual strings once we remove the ‘a’ from the front.
    • e.g., deriv (abab) ‘a’ = bab
    • e.g., deriv (abab | acde) ‘a’ = (bab | cde)
Symbolic Differentiation

deriv ∅ c = ∅

deriv ε c = ∅

deriv c c = ε

deriv c c’ = ∅   (when c’ ≠ c)

deriv (R1 | R2) c = (deriv R1 c) | (deriv R2 c)

deriv (R1 R2) c = (deriv R1 c) R2 | (empty R1) (deriv R2 c)

deriv (R*) c = (deriv R c) R*

Symbolic Empty

Think of ε as “true” and ∅ as “false”.

empty ∅ = ∅

empty ε = ε

empty c = ∅

empty (R1 | R2) = (empty R1) | (empty R2)

empty (R1 R2) = (empty R1) (empty R2)

empty (R*) = ε

Algebraic Recognition
  • Given a regular expression R and a string of characters c1, c2, …, cn.
  • We can test whether the string is in [[R]] by calculating:
    • deriv (… deriv (deriv R c1) c2 …) cn
  • And then test whether the resulting regular expression accepts the empty string.
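
Putting the last three slides together, here is a sketch of the whole recognizer in OCaml, again over the regexp datatype from earlier. In this version empty evaluates directly to ε or ∅, and deriv builds unsimplified terms, which is fine for recognition (though not for building a DFA, as discussed below):

type regexp =
  | Empty | Eps | Char of char
  | Cat of regexp * regexp | Alt of regexp * regexp | Star of regexp

(* empty r is Eps if r accepts the empty string, and Empty otherwise. *)
let rec empty (r : regexp) : regexp =
  match r with
  | Empty | Char _ -> Empty
  | Eps | Star _ -> Eps
  | Alt (r1, r2) ->
      (match (empty r1, empty r2) with Empty, Empty -> Empty | _ -> Eps)
  | Cat (r1, r2) ->
      (match (empty r1, empty r2) with Eps, Eps -> Eps | _ -> Empty)

(* The equations from the Symbolic Differentiation slide. *)
let rec deriv (r : regexp) (c : char) : regexp =
  match r with
  | Empty | Eps -> Empty
  | Char c' -> if c = c' then Eps else Empty
  | Alt (r1, r2) -> Alt (deriv r1 c, deriv r2 c)
  | Cat (r1, r2) -> Alt (Cat (deriv r1 c, r2), Cat (empty r1, deriv r2 c))
  | Star r1 -> Cat (deriv r1 c, Star r1)

(* Differentiate by c1, c2, ..., cn in turn, then test for the empty string. *)
let accepts (r : regexp) (s : string) : bool =
  empty (String.fold_left deriv r s) = Eps

(* e.g., for R = (ab | ac)*:
     let r = Star (Alt (Cat (Char 'a', Char 'b'), Cat (Char 'a', Char 'c')))
     accepts r "abac"  evaluates to true                                     *)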
Example:

Take R = (ab | ac)*

Take the string to be [‘a’ ; ‘b’; ‘a’; ‘c’]

deriv R ‘a’ = (deriv (ab | ac) ‘a’) R

((deriv (ab) ‘a’) | (deriv (ac) ‘a’)) R = (b | c) R

So deriv R ‘a’ = (b | c) R

Continuing

Now we must compute:

deriv ((b | c) R) ‘b’ =

(deriv (b | c) ‘b’) R | (empty (b | c)) (deriv R ‘b’)

Continuing

Now we must compute:

deriv ((b | c) R) ‘b’ =

(deriv (b | c) ‘b’) R | (empty (b | c)) (deriv R ‘b’)

empty (b | c) = (empty b) | (empty c) = ∅ | ∅ = ∅

Continuing

So this simplifies to:

deriv ((b | c) R) ‘b’ =

(deriv (b | c) ‘b’) R | (empty (b | c)) (deriv R ‘b’) =

(deriv (b | c) ‘b’) R

Continuing

(deriv (b | c) ‘b’) R =

(deriv b ‘b’ | deriv c ‘b’) R =

(ε | ∅) R = ε R = R

So starting with R = (ab | ac)* and calculating the derivative w.r.t. ‘a’ and then ‘b’, we are back where we started with R.

Building a DFA with Derivatives
  • Given your regular expression R
    • associate it with an initial state S0.
  • Calculate the derivative of R w.r.t. each character in the alphabet, and generate new states for each unique resulting regular expression.
Example:

Start state: (ab | ac)*

deriv w.r.t. ‘a’ gives the new state (b | c)((ab | ac)*); deriv w.r.t. ‘b’ and deriv w.r.t. ‘c’ both give ∅.

Simplified

[Diagram: the start state (ab | ac)* with an ‘a’ edge to the simplified state (b | c)((ab | ac)*); the ‘b’ and ‘c’ derivatives are ∅]

Then…
  • Then continue calculating the derivatives for each new state.
  • You stop when you don’t generate a new state.
  • (Important: must do algebraic simplifications or this process won’t stop.)
Only One Non-Zero State

[The same diagram: (ab | ac)* with an ‘a’ edge to (b | c)((ab | ac)*), the only non-∅ derivative so far]

Calculate its Derivatives

[From the state (b | c)((ab | ac)*): deriv w.r.t. ‘b’ = (ab | ac)*, deriv w.r.t. ‘c’ = (ab | ac)*, and deriv w.r.t. ‘a’ = ∅]

And then simplify – no new states

[Final DFA: (ab | ac)* has an ‘a’ edge to (b | c)((ab | ac)*), which has ‘b’ and ‘c’ edges back to (ab | ac)*; all other transitions lead to the ∅ (dead) state]

Accepting States
  • Then figure out the accepting states by seeing if they accept the empty string.
  • i.e., run empty on the associated regular expression and see if its simplified form is ε or ∅.
Which states accept empty?

[The final two-state DFA again: (ab | ac)* and (b | c)((ab | ac)*)]

Just this one (our start state):

[(ab | ac)* accepts the empty string, so the start state is the only accepting state; (b | c)((ab | ac)*) does not accept it]