
Lexical Analyzer (Checker)

A lexical analyzer reads the source program character by character to produce tokens. Normally it does not return a list of tokens in one shot; it returns a token each time the parser asks for one.

brosh

Presentation Transcript


  1. Lexical Analyzer (Checker)

  2. Lexical Analyzer • A lexical analyzer reads the source program character by character to produce tokens. • Normally a lexical analyzer does not return a list of tokens in one shot; it returns a token each time the parser asks for one.

  3. Tokens, Lexemes, and Patterns • Tokens include keywords, operators, identifiers, constants, literal strings, and punctuation symbols • e.g.: identifier, number, addop, assgop • A lexeme is a sequence of characters in the source program representing a token • e.g.: newval, oldval • A pattern is a rule describing the set of lexemes that can represent a particular token • e.g.: the pattern for identifier describes the set of strings that start with a letter and continue with letters and digits

  4. Attributes • Since a token can represent more than one lexeme, attributes provide additional information about tokens • For simplicity, a token may have a single attribute. • For an identifier, the attribute is a pointer to the symbol table • Examples of some attributes: • <id,attr> where attr is a pointer to the symbol table • <assgop,_> no attribute is needed (there is only one assignment operator) • <num,val> where val is the actual value of the number. • A token together with its attribute uniquely identifies a lexeme.
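As a rough illustration of the <token, attribute> pairs above, a token could be represented in C as a kind plus a single attribute. This is only a sketch: the names `TokenKind`, `Token`, `make_id`, `make_num`, and `make_assgop` are hypothetical, and the symbol-table "pointer" is assumed here to be an index into a table rather than a raw pointer.

```c
#include <stddef.h>

/* Hypothetical token kinds, mirroring the examples on the slide. */
typedef enum { TOK_ID, TOK_NUM, TOK_ASSGOP } TokenKind;

/* A token is a kind plus at most one attribute. */
typedef struct {
    TokenKind kind;
    union {
        size_t symtab_index; /* TOK_ID: assumed index of the lexeme in the symbol table */
        double value;        /* TOK_NUM: the actual value of the number */
    } attr;                  /* not meaningful for TOK_ASSGOP */
} Token;

/* <id, attr> */
static Token make_id(size_t symtab_index) {
    Token t;
    t.kind = TOK_ID;
    t.attr.symtab_index = symtab_index;
    return t;
}

/* <num, val> */
static Token make_num(double value) {
    Token t;
    t.kind = TOK_NUM;
    t.attr.value = value;
    return t;
}

/* <assgop, _>: the attribute is zeroed only so the struct is fully initialized. */
static Token make_assgop(void) {
    Token t;
    t.kind = TOK_ASSGOP;
    t.attr.symtab_index = 0;
    return t;
}
```

A union keeps the token small, since only one attribute is ever in use for a given kind.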

  5. Strings and Languages • Alphabet – any finite set of symbols (e.g. ASCII, binary alphabet, or a set of tokens) • String – A finite sequence of symbols drawn from an alphabet • Language – A set of strings over a fixed alphabet

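  6. Operations on Languages • Union: L ∪ M = { s | s is in L or s is in M } • Concatenation: LM = { st | s is in L and t is in M } • Kleene closure: L* = L⁰ ∪ L¹ ∪ L² ∪ … • Zero or more concatenations of L • Positive closure: L+ = L¹ ∪ L² ∪ … • One or more concatenations of L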

  7. Regular Expressions • Can give “names” to regular expressions • Convention: names in boldface (to distinguish them from symbols)
          letter → A|B|…|Z|a|b|…|z
          digit → 0|1|…|9
          id → letter (letter | digit)*

  8. Notational Shorthands • One or more instances: r+ denotes rr* • Zero or one instance: r? denotes r|ε • Character classes: [a-z] denotes a|b|…|z
          digit → [0-9]
          digits → digit+
          optional_fraction → (. digits)?
          num → digits optional_fraction

  9. Limitations • Regular expressions cannot describe balanced or nested constructs • Example: the set of all strings of balanced parentheses • This can be done with a context-free grammar (CFG)

  10. Grammar Fragment (Pascal)
          stmt → if expr then stmt
               | if expr then stmt else stmt
               | ε
          expr → term relop term
               | term
          term → id
               | num

  11. Related Regular Expression Definitions
          if → if
          then → then
          else → else
          relop → < | <= | = | <> | > | >=
          id → letter ( letter | digit )*
          num → digit+ (. digit+ )?
          ws → delim+
          delim → blank | tab | newline

  12. Tokens and Attributes

  13. Transition Diagrams • A stylized flowchart • Transition diagrams consist of states connected by edges • Edges leaving a state s are labeled with input characters that may occur after reaching state s • Assumed to be deterministic • There is one start state and at least one accepting (final) state

  14. Transition Diagram for “relop”

  15. Identifiers and Keywords • Share a transition diagram • After reaching accepting state, code determines if lexeme is keyword or identifier
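One way to realize the shared diagram is to scan a word with the identifier pattern and only then look it up in a keyword table. The sketch below assumes the three keywords from the grammar fragment (if, then, else); the names `scan_word`, `classify_word`, and the `TOK_*` constants are hypothetical.

```c
#include <ctype.h>
#include <string.h>

typedef enum { TOK_IF, TOK_THEN, TOK_ELSE, TOK_ID, TOK_NONE } TokenKind;

/* Keyword table consulted after the identifier diagram has accepted. */
static const struct { const char *lexeme; TokenKind kind; } keywords[] = {
    { "if", TOK_IF }, { "then", TOK_THEN }, { "else", TOK_ELSE }
};

/* Scan letter (letter | digit)* at the start of s.
 * Returns the lexeme length, or 0 if s does not start with a letter. */
static size_t scan_word(const char *s) {
    size_t n = 0;
    if (!isalpha((unsigned char)s[0])) return 0;
    while (isalnum((unsigned char)s[n])) n++;
    return n;
}

/* Classify the word at the start of s and store its length in *len. */
static TokenKind classify_word(const char *s, size_t *len) {
    *len = scan_word(s);
    if (*len == 0) return TOK_NONE;
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        /* A keyword matches only if the whole lexeme equals it. */
        if (strlen(keywords[i].lexeme) == *len &&
            strncmp(s, keywords[i].lexeme, *len) == 0)
            return keywords[i].kind;
    }
    return TOK_ID;
}
```

Because the comparison uses the full lexeme length, a word such as iffy is classified as an identifier rather than as the keyword if followed by fy.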

  16. Numbers

  17. Finding the Next Token

          token nexttoken(void) {
              while (1) {
                  switch (state) {
                  case 0:
                      c = nextchar();
                      /* skip whitespace and move the start of the lexeme forward */
                      if (c == ' ' || c == '\t' || c == '\n') {
                          state = 0;
                          lexeme_beginning++;
                      }
                      else if (c == '<') state = 1;
                      else if (c == '=') state = 5;
                      else if (c == '>') state = 6;
                      else state = next_td();
                      break;
                  … /* other cases here */

  18. Trying Transition Diagrams

          int next_td(void) {
              switch (start) {
              case 0:  start = 9;  break;
              case 9:  start = 20; break;
              case 20: start = 25; break;
              case 25: recover();  break;
              default: error("invalid start state");
              }
              /* Possibly additional actions here */
              return start;
          }

  19. Finite Automata • Generalized transition diagrams that act as “recognizers” for a language • Can be nondeterministic (NFA) or deterministic (DFA) • NFAs can have ε-transitions, DFAs cannot • NFAs can have multiple edges with the same symbol leaving a state, DFAs cannot • Both recognize exactly the languages that regular expressions can denote

  20. NFAs • A set of states S • A set of input symbols Σ (input alphabet) • A transition function move that maps state, symbol pairs to a set of states • A single start state s0 • A set of accepting (or final) states F • An NFA accepts a string s if and only if there exists a path from the start state to an accepting state such that the edge labels spell out s

  21. NFA (Example) • 0 is the start state s0 • {2} is the set of final states F • Σ = {a,b} • S = {0,1,2} • [Figure: transition graph of the NFA over states 0, 1, 2 with edges labeled a and b] • The language recognized by this NFA is (a|b)*ab

  22. Transition Tables

  23. DFAs • No state has an ε-transition • For each state s and input symbol a, there is at most one edge labeled a leaving s

  24. Example: r = (a|b)*abb

  25. Functions ε-closure and move • ε-closure(s) is the set of NFA states reachable from NFA state s on ε-transitions alone • move(T,a) is the set of NFA states to which there is a transition on input a from any NFA state s in T

  26. Constructed DFA

  27. Simulating a DFA

          s := s0
          c := nextchar
          while c != eof do
              s := move(s, c)
              c := nextchar
          end
          if s is in F then return “yes”
          else return “no”
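The DFA simulation loop can be sketched in C with a transition table. The table below is an assumption: it is the usual four-state DFA for (a|b)*abb, not necessarily the DFA constructed on the slides, and the names `next_state`, `accepting`, and `dfa_accepts` are hypothetical. End of input is modeled by the terminating '\0' of the string.

```c
#include <stdbool.h>

/* Assumed DFA for (a|b)*abb with states 0..3; state 0 is the start state.
 * next_state[s][0] is the move on 'a', next_state[s][1] the move on 'b'. */
static const int next_state[4][2] = {
    { 1, 0 },   /* state 0 */
    { 1, 2 },   /* state 1: seen "a"   */
    { 1, 3 },   /* state 2: seen "ab"  */
    { 1, 0 }    /* state 3: seen "abb" */
};
static const bool accepting[4] = { false, false, false, true };

/* Returns true if the DFA ends in an accepting state after reading input. */
static bool dfa_accepts(const char *input) {
    int s = 0;                              /* s := s0 */
    for (const char *p = input; *p != '\0'; p++) {
        if (*p == 'a')      s = next_state[s][0];
        else if (*p == 'b') s = next_state[s][1];
        else return false;                  /* symbol outside the alphabet */
    }
    return accepting[s];                    /* is s in F? */
}
```

Each input character costs one table lookup, which is why a DFA is fast to simulate once it has been built.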

  28. Simulating an NFA

          S := ε-closure({s0})
          a := nextchar
          while a != eof do
              S := ε-closure(move(S, a))
              a := nextchar
          end
          if S ∩ F != Ø then return “yes”
          else return “no”
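The NFA simulation, together with ε-closure and move, can be sketched in C by storing each set of states as a bitmask. The NFA used here is a small hypothetical one for a*b, chosen because it has ε-transitions; it does not come from the slides. The names `StateSet`, `eps_closure`, `move_on`, and `nfa_accepts` are likewise assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSTATES 4

typedef uint32_t StateSet;   /* bit i set means state i is in the set */

/* Hypothetical NFA for a*b:
 *   0 -ε-> 1, 0 -ε-> 2, 1 -a-> 0, 2 -b-> 3; state 3 is accepting. */
static const StateSet eps[NSTATES]  = { 0x6, 0x0, 0x0, 0x0 };
static const StateSet on_a[NSTATES] = { 0x0, 0x1, 0x0, 0x0 };
static const StateSet on_b[NSTATES] = { 0x0, 0x0, 0x8, 0x0 };
static const StateSet accepting     = 0x8;

/* All states reachable from T using ε-transitions alone (including T). */
static StateSet eps_closure(StateSet T) {
    StateSet closure = T;
    StateSet pending = T;            /* states whose ε-edges are not yet followed */
    while (pending != 0) {
        StateSet next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (pending & (1u << s)) next |= eps[s];
        pending = next & ~closure;   /* keep only newly discovered states */
        closure |= next;
    }
    return closure;
}

/* All states reachable from some state in T on input symbol c. */
static StateSet move_on(StateSet T, char c) {
    const StateSet *table = (c == 'a') ? on_a : (c == 'b') ? on_b : NULL;
    StateSet out = 0;
    if (table == NULL) return 0;     /* symbol outside the alphabet */
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s)) out |= table[s];
    return out;
}

/* S := ε-closure({s0}); for each symbol a: S := ε-closure(move(S, a)). */
static bool nfa_accepts(const char *input) {
    StateSet S = eps_closure(1u << 0);
    for (const char *p = input; *p != '\0'; p++)
        S = eps_closure(move_on(S, *p));
    return (S & accepting) != 0;     /* is S ∩ F nonempty? */
}
```

A bitmask limits this sketch to 32 states, but it makes union and intersection single machine operations.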

  29. Space/Time Tradeoff (Worst Case)

  30. Simulating a Regular Expression • First use Thompson’s Construction to convert RE to NFA • Then there are two choices: • Use subset construction to convert NFA to DFA, then simulate the DFA • Simulate the NFA directly

  31. Some Other Issues in Lexical Analyzer • The lexical analyzer has to recognize the longest possible string. • Ex: for the identifier newval, the prefixes n, ne, new, newv, and newva also match the identifier pattern, but the token is newval • What is the end of a token? Is there any character which marks the end of a token?
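The longest-match rule usually means remembering the last position at which the pattern accepted, and falling back to it when a longer attempt fails. The sketch below shows this for the pattern num → digit+ (. digit+ )? from slide 11; the function name `longest_num` is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Length of the longest prefix of s matching digit+ (. digit+)?,
 * or 0 if s does not start with a num. */
static size_t longest_num(const char *s) {
    size_t i = 0;
    size_t last_accept = 0;   /* end of the longest match seen so far */

    while (isdigit((unsigned char)s[i])) i++;
    if (i == 0) return 0;     /* no leading digit: not a num */
    last_accept = i;          /* digit+ alone is already a valid num */

    if (s[i] == '.') {
        size_t j = i + 1;
        while (isdigit((unsigned char)s[j])) j++;
        if (j > i + 1) last_accept = j;   /* fraction present: extend the match */
        /* otherwise the '.' is not part of the token, so fall back */
    }
    return last_accept;
}
```

For the input 12.x the scanner reads past the dot, finds no digit, and falls back to the lexeme 12. No single character marks the end of a token; the token ends wherever the pattern can no longer be extended.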

  32. Some Other Issues in Lexical Analyzer (cont.) • Skipping comments • Normally we don’t return a comment as a token. • So comments are processed only by the lexical analyzer and don’t complicate the syntax of the language. • Symbol table interface • The symbol table holds information about tokens (at least the lexemes of identifiers) • How to implement the symbol table, and which operations to support: • hash table with open addressing or chaining • inserting into the hash table, and finding the position of a token from its lexeme. • Positions of the tokens in the file (for error handling).
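A symbol table built on a chained hash table could look roughly like the following sketch. The bucket count, the hash function, and the names `SymTab`, `Entry`, `symtab_lookup`, and `symtab_insert` are all assumptions made for illustration; a real table would store more than the lexeme.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211   /* assumed table size */

typedef struct Entry {
    char *lexeme;          /* owned copy of the identifier's lexeme */
    struct Entry *next;    /* next entry in the same bucket (chaining) */
} Entry;

typedef struct {
    Entry *buckets[NBUCKETS];
} SymTab;

/* Simple multiplicative string hash, reduced to a bucket index. */
static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31u + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Find the entry for lexeme, or NULL if it is not in the table. */
static Entry *symtab_lookup(const SymTab *t, const char *lexeme) {
    for (Entry *e = t->buckets[hash(lexeme)]; e != NULL; e = e->next)
        if (strcmp(e->lexeme, lexeme) == 0) return e;
    return NULL;
}

/* Return the existing entry for lexeme, or insert a new one.
 * Returns NULL only if memory allocation fails. */
static Entry *symtab_insert(SymTab *t, const char *lexeme) {
    Entry *e = symtab_lookup(t, lexeme);
    if (e != NULL) return e;
    e = malloc(sizeof *e);
    if (e == NULL) return NULL;
    e->lexeme = malloc(strlen(lexeme) + 1);
    if (e->lexeme == NULL) { free(e); return NULL; }
    strcpy(e->lexeme, lexeme);
    unsigned h = hash(lexeme);
    e->next = t->buckets[h];   /* push onto the front of the chain */
    t->buckets[h] = e;
    return e;
}
```

The returned `Entry *` can serve as the attribute of an id token, so that two occurrences of the same identifier share one table entry.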
