
Lexical Analysis (2 Lectures)



Presentation Transcript


  1. Lexical Analysis (2 Lectures)

  2. Overview • Basic Concepts • Regular Expressions • Language • Lexical analysis by hand • Regular Languages Tools • NFA • DFA • Scanning tools • Lex / Flex / JFlex / ANTLR

  3. Scanning Perspective • Purpose • Transform a stream of symbols • Into a stream of tokens

  4. Lexical Analyzer Responsibilities • Lexical analyzer [Scanner] • Scan input • Remove white spaces • Remove comments • Manufacture tokens • Generate lexical errors • Pass token to parser

  5. Modular design • Rationale • Separate the two analyses (lexical and syntactic) • High cohesion / Low coupling • Improve efficiency • Improve portability / maintainability • Enable integration of third-party lexers • [lexer = lexical analysis tool]

  6. Terminology • Token • A classification for a common set of strings • Examples: Identifier, Integer, Float, Assign, LeftParen, RightParen,.... • Pattern • The rules that characterize the set of strings for a token • Examples: [0-9]+ • Lexeme • Actual sequence of characters that matches a pattern and has a given Token class. • Examples: • Identifier: Name,Data,x • Integer: 345,2,0,629,....
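The distinction between token, pattern, and lexeme can be sketched in a few lines of Java. This is only an illustration: the class name `TerminologyDemo` and the identifier pattern are assumptions, while `[0-9]+` is the integer pattern given on the slide.

```java
import java.util.regex.Pattern;

// Hypothetical illustration: each token class carries the pattern that defines it.
final class TerminologyDemo {
    enum TokenClass {
        INTEGER("[0-9]+"),                     // pattern taken from the slide
        IDENTIFIER("[A-Za-z][A-Za-z0-9]*");    // assumed pattern, not from the slides

        final Pattern pattern;

        TokenClass(String regex) {
            this.pattern = Pattern.compile(regex);
        }
    }

    // Returns the first token class whose pattern matches the whole lexeme,
    // or null when no pattern matches.
    static TokenClass classify(String lexeme) {
        for (TokenClass t : TokenClass.values()) {
            if (t.pattern.matcher(lexeme).matches()) {
                return t;
            }
        }
        return null;
    }
}
```

Here `345` and `Name` are lexemes, `INTEGER` and `IDENTIFIER` are tokens, and the regular expressions are the patterns.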

  7. Examples (slide content not captured in the transcript)

  8. Lexical Errors • Error handling is very localized, w.r.t. the input source • Example: fi(a==f(x)) …generates no lexical error in C (fi is scanned as a valid identifier; the mistake only surfaces later) • In what situations do errors occur? • Prefix of remaining input doesn’t match any defined token • Possible error recovery actions: • Deleting or inserting input characters • Replacing or transposing characters • Or, skip over to the next separator to ignore the problem

  9. Basic Scanning technique • Use 1 character of look-ahead • Obtain char with getc() • Do a case analysis • Based on lookahead char • Based on current lexeme • Outcome • If char can extend lexeme, all is well, go on. • If char cannot extend lexeme: • Figure out what the complete lexeme is and return its token • Put the lookahead back into the symbol stream
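A minimal Java sketch of the one-character lookahead loop, assuming the input is wrapped in a `java.io.PushbackReader` (standing in for `getc()` and the "put back" step). The class name `LookaheadScanner` and its methods are illustrative, and only integer lexemes are handled.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical scanner fragment: recognizes integer lexemes with 1 char of lookahead.
final class LookaheadScanner {
    private final PushbackReader in;   // default pushback buffer holds exactly one char

    LookaheadScanner(String source) {
        in = new PushbackReader(new StringReader(source));
    }

    // Counterpart of getc(): returns the next character, or -1 at end of input.
    int nextChar() {
        try {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Puts the lookahead back into the symbol stream (no-op at end of input).
    private void putBack(int c) {
        if (c == -1) return;
        try {
            in.unread(c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Case analysis on the lookahead: extend the lexeme or finish it.
    String scanInteger() {
        StringBuilder lexeme = new StringBuilder();
        while (true) {
            int la = nextChar();
            if (la >= '0' && la <= '9') {
                lexeme.append((char) la);   // char can extend the lexeme: go on
            } else {
                putBack(la);                // char cannot extend it: put it back
                return lexeme.toString();   // the lexeme is complete
            }
        }
    }
}
```

On the input `123+45`, the first call returns `123` and leaves `+` in the stream for the next token.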

  10. Language Concepts • A language, L, is simply any set of strings over a fixed alphabet.
      Alphabet                        Languages
      {0,1}                           {0,10,100,1000,10000,…}
                                      {0,1,100,000,111,…}
      {a,b,c}                         {abc,aabbcc,aaabbbccc,…}
      {A…Z}                           {TEE,FORE,BALL…}
                                      {FOR,WHILE,GOTO…}
      {A…Z,a…z,0…9,+,-,…,<,>,…}       {All legal PASCAL progs}
                                      {All grammatically correct English sentences}
      Special languages: ∅ – the EMPTY LANGUAGE • {ε} – the language containing only the empty string ε

  11. Formal Language Operations

  12. Examples

  13. Regular Languages • All examples above are • Quite expressive • Simple languages • But also... • Belong to a special class: regular languages • A regular expression is a set of rules / techniques for constructing sequences of symbols (strings) from an alphabet. • Let Σ be an alphabet and r a regular expression; then L(r) is the language characterized by the rules of r.

  14. Rules • Fix an alphabet Σ • ε is a regular expression denoting {ε} • If a is in Σ, a is a regular expression that denotes {a} • Let r and s be R.E. for L(r) and L(s). Then • (a) (r) | (s) is a regular expression denoting L(r) ∪ L(s) • (b) (r)(s) is a regular expression denoting L(r) L(s) • (c) (r)* is a regular expression denoting (L(r))* • (d) (r) is a regular expression denoting L(r) • All are left-associative. • Parentheses are dropped as allowed by precedence: * binds tightest, then concatenation, then |.

  15. Example revisited

  16. Algebraic Properties

  17. More Examples • All strings that start with “tab” or end with “bat”: tab{A,…,Z,a,…,z}* | {A,…,Z,a,…,z}*bat • All strings in which {1,2,3} exist in ascending order: {A,…,Z}*1{A,…,Z}*2{A,…,Z}*3{A,…,Z}*
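The two expressions above translate almost directly into `java.util.regex` syntax, with the set notation `{A,…,Z}` written as a character class. The class and method names below are illustrative only.

```java
import java.util.regex.Pattern;

// Hypothetical helper: the two slide examples as java.util.regex patterns.
final class MoreExamplesDemo {
    // tab{A..Z,a..z}* | {A..Z,a..z}*bat
    static final Pattern TAB_OR_BAT =
        Pattern.compile("tab[A-Za-z]*|[A-Za-z]*bat");

    // {A..Z}*1{A..Z}*2{A..Z}*3{A..Z}*
    static final Pattern ASCENDING_123 =
        Pattern.compile("[A-Z]*1[A-Z]*2[A-Z]*3[A-Z]*");

    static boolean startsWithTabOrEndsWithBat(String s) {
        return TAB_OR_BAT.matcher(s).matches();     // whole-string match
    }

    static boolean has123Ascending(String s) {
        return ASCENDING_123.matcher(s).matches();  // whole-string match
    }
}
```

Note that `|` has the lowest precedence, so the first pattern needs no parentheses around its two alternatives.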

  18. Tokens as R.E. (slide content not captured in the transcript; the surviving fragments mention the “+” and “?” operators)

  19. Tokens as Patterns • Patterns are ??? • Tokens are ???

  20. Throw Away Tokens • Fact • Some languages define certain tokens as insignificant • Example: C • Whitespace, tabs, carriage returns, and comments can be discarded without affecting the program’s meaning.
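A small Java sketch of how a scanner might discard such tokens before manufacturing the next real one. The name `skipIgnored` is hypothetical, only C-style `/* ... */` comments are handled, and an unterminated comment is simply skipped to the end of input (a real scanner would report a lexical error instead).

```java
// Hypothetical helper: skips whitespace and /* ... */ comments.
final class SkipDemo {
    // Returns the index of the first character at or after pos that is
    // neither whitespace nor part of a /* ... */ comment.
    static int skipIgnored(String src, int pos) {
        while (pos < src.length()) {
            char c = src.charAt(pos);
            if (Character.isWhitespace(c)) {
                pos++;                                   // discard blanks, tabs, newlines
            } else if (src.startsWith("/*", pos)) {
                int end = src.indexOf("*/", pos + 2);    // find the comment terminator
                pos = (end < 0) ? src.length() : end + 2;
            } else {
                break;                                   // start of a significant token
            }
        }
        return pos;
    }
}
```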

  21. Automaton • A tool to specify a token

  22. A More Complex Automaton

  23. Two More...

  24. What about keywords ? • Easy! • Use the “Identifier” token • After a match, lookup the keyword table • If found, return a token for the matched keyword • If not, return a token for the true identifier
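The keyword-table lookup can be sketched as follows. The keyword set and the names `KeywordDemo` and `Kind` are assumptions chosen for illustration; the lexeme is assumed to have already matched the identifier pattern.

```java
import java.util.Map;

// Hypothetical keyword table consulted after an identifier has been matched.
final class KeywordDemo {
    enum Kind { IDENTIFIER, IF, WHILE, RETURN }

    // Assumed keyword set; a real language would list all of its reserved words.
    private static final Map<String, Kind> KEYWORDS = Map.of(
        "if", Kind.IF,
        "while", Kind.WHILE,
        "return", Kind.RETURN);

    // If the lexeme is in the table, return the keyword token;
    // otherwise it is a true identifier.
    static Kind classify(String lexeme) {
        return KEYWORDS.getOrDefault(lexeme, Kind.IDENTIFIER);
    }
}
```

This keeps the automaton small: keywords need no states of their own, at the cost of one table lookup per identifier.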

  25. Yes... But how to scan? • Remember the algorithm? • Acquire 1 character of lookahead • Case analysis based • On lookahead • On state of automaton

  26. Scanner code
      class Scanner {
        InputStream _in;
        char _la;          // the lookahead character
        char[] _window;    // lexeme window
        int _state;        // current state of the automaton

        Token nextToken() {
          startLexeme();   // reset window at start
          while (true) {
            switch (_state) {
              case 0:
                _la = getChar();
                if (_la == '<') _state = 1;
                else if (_la == '=') _state = 5;
                else if (_la == '>') _state = 6;
                else failure(_state);
                break;
              // states 1 to 5 are not shown on the slide
              case 6:
                _la = getChar();
                if (_la == '=') _state = 7;
                else _state = 8;
                break;
              case 7:
                return new Token(GEQUAL);
              case 8:
                pushBack(_la);   // the lookahead is not part of this lexeme
                return new Token(GREATER);
            }
          }
        }
      }
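For reference, here is a self-contained, runnable variant of the same state-machine idea. It is a sketch under assumptions: states 1 and 5 are not shown on the slide and are filled in here following the usual relational-operator automaton, the token names other than GEQUAL and GREATER are invented, and failure is reported by returning null rather than by calling `failure`.

```java
// Hypothetical, runnable relational-operator scanner over a String.
final class RelopScanner {
    enum Token { LESS, LEQUAL, NOTEQUAL, EQUAL, GREATER, GEQUAL }

    private final String input;
    private int pos = 0;

    RelopScanner(String input) {
        this.input = input;
    }

    private int getChar() {
        return pos < input.length() ? input.charAt(pos++) : -1;
    }

    private void pushBack(int c) {
        if (c != -1) pos--;   // nothing to push back at end of input
    }

    // Returns the next token, or null if the automaton fails in state 0.
    Token nextToken() {
        int state = 0;
        while (true) {
            int la;
            switch (state) {
                case 0:
                    la = getChar();
                    if (la == '<') state = 1;
                    else if (la == '=') state = 5;
                    else if (la == '>') state = 6;
                    else { pushBack(la); return null; }   // failure
                    break;
                case 1:   // assumed state: seen '<'
                    la = getChar();
                    if (la == '=') return Token.LEQUAL;
                    if (la == '>') return Token.NOTEQUAL;
                    pushBack(la);
                    return Token.LESS;
                case 5:   // assumed state: seen '='
                    return Token.EQUAL;
                case 6:   // seen '>'
                    la = getChar();
                    if (la == '=') {
                        state = 7;
                    } else {
                        pushBack(la);   // lookahead is not part of the lexeme
                        state = 8;
                    }
                    break;
                case 7:
                    return Token.GEQUAL;
                case 8:
                    return Token.GREATER;
                default:
                    throw new IllegalStateException("unknown state " + state);
            }
        }
    }
}
```

Scanning `<=>=<>=` with repeated calls to `nextToken()` yields LEQUAL, GEQUAL, NOTEQUAL, EQUAL, and then null at end of input.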

  27. Handling Failures • Meaning • The automaton for this token failed • Solution • If another automaton is available • “Rewind” the input to the beginning of the last lexeme • Jump to the start state of the next automaton • Start recognizing again • If no other automaton is available • This is a true lexical error. • Discard the lexeme (or at least the first char of the lexeme) • Start from state 0 again

  28. Overview • Basic Concepts • Regular Expressions • Language • Lexical analysis by hand • Regular Languages Tools • NFA / DFA • Scanning with DFAs • Scanning tools • Lex / Flex / JFlex

  29. Automata & Language Theory • Terminology • FSA • A recognizer that takes an input string and determines whether it’s a valid string of the language. • Non-Deterministic FSA (NFA) • Has several alternative actions for the same input symbol • Deterministic FSA (DFA) • Has at most 1 action for any given input symbol • Bottom Line • expressive power(NFA) == expressive power(DFA) • Conversion can be automated

  30. NFA • An NFA is a mathematical model that consists of: • S, a set of states • Σ, the symbols of the input alphabet • move, a transition function • move(state, symbol) → set of states • move : S × (Σ ∪ {ε}) → Pow(S) • A state s0 ∈ S, the start state • F ⊆ S, a set of final or accepting states

  31. Representing NFA • Transition diagrams: numbered states (circles), arcs, final states, … • Transition tables: more suitable for representation within a computer • We’ll see examples of both!

  32. Example NFA • S = { 0, 1, 2, 3 } • s0 = 0 • F = { 3 } • Σ = { a, b } • [Transition diagram: start → 0; state 0 loops on a and b; 0 -a→ 1 -b→ 2 -b→ 3] • What language is defined? • What is the transition table?
      state    input a     input b
      0        { 0, 1 }    { 0 }
      1        --          { 2 }
      2        --          { 3 }
      • ε (null) moves are possible: an arc i -ε→ j switches state but does not use any input symbol

  33. Epsilon-Transitions • Given the regular expression : (a (b*c)) | (a (b | c+)?) • Find a transition diagram NFA that recognizes it. • Solution ?

  34. NFA Construction • Automatic construction example • a(b*c) • a(b|c+)? Build a Disjunction

  35. Resulting NFA

  36. Working NFA • Given an input string, we trace moves • If no more input & in final state, ACCEPT • EXAMPLE: Input: ababb • Trace 1: move(0, a) = 0, move(0, b) = 0, move(0, a) = 1, move(1, b) = 2, move(2, b) = 3: ACCEPT! • -OR- • Trace 2: move(0, a) = 1, move(1, b) = 2, move(2, a) = ? (undefined): REJECT!
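One way to avoid committing to a single trace is to track the set of all states the NFA could be in after each symbol. The Java sketch below does this for the example NFA (states 0 to 3, accepting state 3); the class name `NfaDemo` is illustrative, and the transition function is hard-coded from the example's transition table.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical simulator for the example NFA, tracking all reachable states.
final class NfaDemo {
    // move(state, symbol) from the example's transition table.
    static Set<Integer> move(int state, char symbol) {
        if (state == 0 && symbol == 'a') return Set.of(0, 1);
        if (state == 0 && symbol == 'b') return Set.of(0);
        if (state == 1 && symbol == 'b') return Set.of(2);
        if (state == 2 && symbol == 'b') return Set.of(3);
        return Set.of();   // undefined transition: no successor states
    }

    static boolean accepts(String input) {
        Set<Integer> current = new HashSet<>();
        current.add(0);                          // start state s0 = 0
        for (char c : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int s : current) {
                next.addAll(move(s, c));         // follow every possible move
            }
            current = next;
        }
        return current.contains(3);              // F = { 3 }
    }
}
```

With this approach `ababb` is accepted because at least one trace ends in state 3, regardless of the traces that get stuck.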

  37. Handling Undefined Transitions • We can handle undefined transitions by defining one more state, a “death” state, and redirecting every previously undefined transition to it. • [Diagram: the example NFA extended with an additional state 4]

  38. Worse still... • Not all paths result in acceptance! • aabb is accepted along the path 0 → 0 → 1 → 2 → 3 • BUT… it is not accepted along the equally valid path 0 → 0 → 0 → 0 → 0

  39. The NFA “Problem” • Two problems • Valid input may not be accepted • Non-deterministic behavior from run to run... • Solution?

  40. The DFA Saves the Day • A DFA is an NFA with a few restrictions • No epsilon transitions • For every state s, there is only one transition (s,x) from s for any symbol x in Σ • Corollaries • Easy to implement a DFA with an algorithm! • Deterministic behavior
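Because each (state, symbol) pair has exactly one successor, a DFA can be run with a plain table lookup per input character. The sketch below assumes a four-state DFA for the same language as the earlier example NFA (strings over {a, b} ending in abb); the table and the name `DfaDemo` are assumptions, not taken from the slides.

```java
// Hypothetical table-driven DFA for strings over {a, b} that end in "abb".
final class DfaDemo {
    // NEXT[state][0] is the successor on 'a', NEXT[state][1] the successor on 'b'.
    static final int[][] NEXT = {
        {1, 0},   // state 0: nothing useful seen yet
        {1, 2},   // state 1: seen "a"
        {1, 3},   // state 2: seen "ab"
        {1, 0},   // state 3: seen "abb" (accepting)
    };
    static final int ACCEPT = 3;

    static boolean accepts(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            if (c != 'a' && c != 'b') {
                return false;               // symbol outside the alphabet
            }
            state = NEXT[state][c - 'a'];   // exactly one move per symbol
        }
        return state == ACCEPT;
    }
}
```

Each input symbol costs one array access, and there is no choice to make, so repeated runs always behave the same way.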

  41. NFA vs. DFA • NFA • Smaller number of states |Qnfa| • Simulation requires on the order of |Qnfa| work for each input symbol • DFA • Larger number of states |Qdfa| • Simulation requires constant work for each input symbol • Caveat: the generic NFA ⇒ DFA construction can yield |Qdfa| ~ 2^|Qnfa| in the worst case • But: DFAs can be minimized (i.e., you can find the smallest possible |Qdfa|)

  42. One catch... • NFA-DFA comparison

  43. NFA to DFA Conversion • Idea • Look at the states reachable without consuming any input • Aggregate them into macro states

  44. Final Result • A macro state is final • IFF at least one of the NFA states it contains is final

  45. Preliminary Definitions • NFA N = ( S, Σ, s0, F, MOVE ) • ε-Closure(s) : s ∈ S • Set of states in S that are reachable from s via ε-moves of N alone (including s itself) • ε-Closure(T) : T ⊆ S • NFA states reachable from some t ∈ T on ε-moves only • move(T,a) : T ⊆ S, a ∈ Σ • Set of states to which there is a transition on input a from some t ∈ T

  46. Algorithm computing the ε-closure
      forall (t in T) push(t);
      initialize ε-closure(T) to T;
      while stack is not empty do begin
        t = pop();
        for each u ∈ S with edge t→u labeled ε
          if u is not in ε-closure(T) then begin
            add u to ε-closure(T);
            push u onto stack
          end
      end
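The stack-based algorithm maps closely onto Java collections. In this sketch the ε-edges are assumed to be given as a map from a state to its ε-successors; that representation and the name `EpsilonClosureDemo` are illustrative choices.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical ε-closure computation over an adjacency map of ε-edges.
final class EpsilonClosureDemo {
    static Set<Integer> epsilonClosure(Set<Integer> t,
                                       Map<Integer, List<Integer>> epsEdges) {
        Deque<Integer> stack = new ArrayDeque<>(t);   // push every t in T
        Set<Integer> closure = new HashSet<>(t);      // initialize closure to T
        while (!stack.isEmpty()) {
            int s = stack.pop();
            for (int u : epsEdges.getOrDefault(s, List.of())) {
                if (closure.add(u)) {                 // add() is true only for new states
                    stack.push(u);
                }
            }
        }
        return closure;
    }
}
```

Each state is pushed at most once, so the loop terminates even when the ε-edges form a cycle.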

  47. DFA construction • Computing the set of states and the transitions
      let Q = ε-closure({s0});
      D = { Q };
      enQueue(Q);
      while queue not empty do
        X = deQueue();
        for each a ∈ Σ do
          Y := ε-closure(move(X,a));
          T[X,a] := Y;
          if Y is not in D then
            D = D ∪ { Y };
            enQueue(Y);
          end
        end
      end
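The queue-based construction can be sketched in Java as follows. The NFA representation (nested maps), the class name `SubsetConstructionDemo`, and its method names are assumptions made for the example; macro states are represented as sets of NFA states, and the returned map plays the role of the table T.

```java
import java.util.*;

// Hypothetical subset construction: turns an NFA into a DFA transition table.
final class SubsetConstructionDemo {
    final Map<Integer, Map<Character, Set<Integer>>> move = new HashMap<>();
    final Map<Integer, Set<Integer>> eps = new HashMap<>();

    void addMove(int from, char symbol, int to) {
        move.computeIfAbsent(from, k -> new HashMap<>())
            .computeIfAbsent(symbol, k -> new HashSet<>())
            .add(to);
    }

    void addEps(int from, int to) {
        eps.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    // ε-closure(T): states reachable from T on ε-moves only.
    Set<Integer> closure(Set<Integer> t) {
        Deque<Integer> stack = new ArrayDeque<>(t);
        Set<Integer> result = new TreeSet<>(t);
        while (!stack.isEmpty()) {
            for (int u : eps.getOrDefault(stack.pop(), Set.of())) {
                if (result.add(u)) stack.push(u);
            }
        }
        return result;
    }

    // move(T, a): states reachable from some t in T on input a.
    Set<Integer> moveOn(Set<Integer> states, char a) {
        Set<Integer> result = new TreeSet<>();
        for (int s : states) {
            result.addAll(move.getOrDefault(s, Map.of()).getOrDefault(a, Set.of()));
        }
        return result;
    }

    // Returns T, keyed by macro state; the key set is D.
    Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(int start, char[] alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> table = new LinkedHashMap<>();
        Deque<Set<Integer>> queue = new ArrayDeque<>();
        Set<Integer> q = closure(Set.of(start));
        table.put(q, new HashMap<>());
        queue.add(q);
        while (!queue.isEmpty()) {
            Set<Integer> x = queue.remove();
            for (char a : alphabet) {
                Set<Integer> y = closure(moveOn(x, a));
                table.get(x).put(a, y);               // T[X, a] := Y
                if (!table.containsKey(y)) {          // Y not yet in D
                    table.put(y, new HashMap<>());
                    queue.add(y);
                }
            }
        }
        return table;
    }
}
```

Applied to the example NFA with states 0 to 3, this yields the four macro states {0}, {0,1}, {0,2}, and {0,3}. In this sketch an undefined move produces the empty set, which then acts as the dead state.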

  48. Summary • We can • Specify tokens with R.E. • Use DFA to scan an input and recognize token • Transform an NFA into a DFA automatically • What we are missing • A way to transform an R.E. into an NFA • Then, we will have a complete solution • Build a big R.E. • Turn the R.E. into an NFA • Turn the NFA into a DFA • Scan with the obtained DFA

  49. R.E. To NFA • Process • Inductive definition • Use the structure of the R.E. • Use atomic automata for atomic R.E. • Use composition rules for each R.E. expression • Recall • RE ::= ε • RE ::= s (s in Σ) • RE ::= r s • RE ::= r | s • RE ::= r*

  50. Epsilon Construction • RE ::= ε
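The transcript stops at the ε case, so the remaining composition rules are not available here. As a hedged illustration of what an inductive R.E. to NFA construction can look like, the sketch below builds fragments in the style of Thompson's construction; this is an assumed technique, not necessarily the one the lecture goes on to present, and all names (`ThompsonDemo`, `Frag`, `alt`, `star`, …) are hypothetical.

```java
import java.util.*;

// Hypothetical inductive R.E. -> NFA builder (Thompson-style fragments).
final class ThompsonDemo {
    // A fragment has exactly one start state and one accepting state.
    static final class Frag {
        final int start, accept;
        Frag(int start, int accept) { this.start = start; this.accept = accept; }
    }

    private int states = 0;
    final Map<Integer, Set<Integer>> eps = new HashMap<>();
    final Map<Integer, Map<Character, Integer>> move = new HashMap<>();

    private int fresh() { return states++; }

    private void addEps(int from, int to) {
        eps.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    Frag epsilon() {                         // RE ::= ε
        Frag f = new Frag(fresh(), fresh());
        addEps(f.start, f.accept);
        return f;
    }

    Frag symbol(char c) {                    // RE ::= s, s in Σ
        Frag f = new Frag(fresh(), fresh());
        move.computeIfAbsent(f.start, k -> new HashMap<>()).put(c, f.accept);
        return f;
    }

    Frag concat(Frag r, Frag s) {            // RE ::= r s
        addEps(r.accept, s.start);
        return new Frag(r.start, s.accept);
    }

    Frag alt(Frag r, Frag s) {               // RE ::= r | s
        Frag f = new Frag(fresh(), fresh());
        addEps(f.start, r.start);
        addEps(f.start, s.start);
        addEps(r.accept, f.accept);
        addEps(s.accept, f.accept);
        return f;
    }

    Frag star(Frag r) {                      // RE ::= r*
        Frag f = new Frag(fresh(), fresh());
        addEps(f.start, r.start);
        addEps(f.start, f.accept);           // zero repetitions
        addEps(r.accept, r.start);           // repeat
        addEps(r.accept, f.accept);
        return f;
    }

    private Set<Integer> closure(Set<Integer> t) {
        Deque<Integer> stack = new ArrayDeque<>(t);
        Set<Integer> result = new HashSet<>(t);
        while (!stack.isEmpty()) {
            for (int u : eps.getOrDefault(stack.pop(), Set.of())) {
                if (result.add(u)) stack.push(u);
            }
        }
        return result;
    }

    // Simulates the NFA fragment on the input by tracking sets of states.
    boolean matches(Frag f, String input) {
        Set<Integer> current = closure(Set.of(f.start));
        for (char c : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int s : current) {
                Integer to = move.getOrDefault(s, Map.of()).get(c);
                if (to != null) next.add(to);
            }
            current = closure(next);
        }
        return current.contains(f.accept);
    }
}
```

For instance, `(a|b)*abb` is built as `concat(concat(concat(star(alt(symbol('a'), symbol('b'))), symbol('a')), symbol('b')), symbol('b'))`, and the resulting fragment can be checked with `matches` or handed to an NFA to DFA conversion.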
