
Lexical Analysis



Presentation Transcript


  1. Lexical Analysis (C) Edmond Schonberg, New-York University

  2. The Input • Read string input • Might be a sequence of characters (Unix) • Might be a sequence of lines (VMS) • Character set: • ASCII • ISO Latin-1 • ISO 10646 (Unicode; 16-bit characters in Ada, Java) • Others (EBCDIC, JIS, etc.)

  3. The Output • A series of tokens: kind, location, name (if any) • Punctuation: ( ) ; , [ ] • Operators: + - ** := • Keywords: begin end if while try catch • Identifiers: Square_Root • String literals: "press Enter to continue" • Character literals: 'x' • Numeric literals • Integer: 123 • Floating point: 4_5.23e+2 • Based representation: 16#ac#

  4. Free form vs Fixed form • Free form languages (most modern ones) • White space does not matter. Ignore these: • Tabs, spaces, new lines, carriage returns • Only the ordering of tokens is important • Fixed format languages (historical) • Layout is critical • Fortran: label in cols 1-5, continuation mark in col 6 • COBOL: Area A and Area B • Lexical analyzer must know about layout to find tokens

  5. Keywords • Reserved identifiers • E.g. BEGIN, END in Pascal; if in C; catch in C++ • Each is returned as its own kind of token

  6. Identifiers • Rules differ • Length, allowed characters, separators • Need to build a names table (symbol table) • Single entry for all occurrences of Var1 • Language may be case insensitive: same entry for VAR1, vAr1, Var1 • Typical structure: hash table • Lexical analyzer returns token kind • And key (index) to table entry • Table entry includes location information
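The names table on slide 6 can be sketched as follows. This is a minimal illustration, not the course's implementation: the class name `NamesTable`, the method `intern`, and the `(line, column)` location format are all assumptions, and Python's `dict` stands in for the hash table.

```python
class NamesTable:
    """Maps each distinct identifier to a small integer key (hypothetical design)."""

    def __init__(self, case_sensitive=True):
        self.case_sensitive = case_sensitive
        self._keys = {}     # normalized spelling -> key (dict is a hash table)
        self.entries = []   # key -> {"spelling": first spelling seen, "locations": [...]}

    def intern(self, spelling, location):
        """Return the key for spelling, creating the entry on first occurrence."""
        norm = spelling if self.case_sensitive else spelling.lower()
        key = self._keys.get(norm)
        if key is None:
            key = len(self.entries)
            self._keys[norm] = key
            # Keep the user's own casing so diagnostics can quote it back.
            self.entries.append({"spelling": spelling, "locations": []})
        self.entries[key]["locations"].append(location)
        return key


table = NamesTable(case_sensitive=False)
k1 = table.intern("Var1", (1, 5))
k2 = table.intern("VAR1", (3, 9))    # same entry as Var1 in a case-insensitive language
k3 = table.intern("Other", (4, 1))
```

The scanner would return the token kind together with the key, so later phases compare integers rather than strings.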

  7. String Literals • Text must be stored • Actual characters are important • Not like identifiers: must preserve casing • Character set issues: uniform internal representation • Table needed • Lexical analyzer returns key into table • May or may not be worth hashing to avoid duplicates

  8. Handling Comments • Comments have no effect on program • Can be eliminated by scanner • But may need to be retrieved by tools • Error detection issues • E.g. unclosed comments • Scanner skips over comments and returns next meaningful token
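A rough sketch of the comment-skipping step on slide 8, including detection of an unclosed comment. The function name is hypothetical, and the comment syntax is an assumption chosen for illustration: Ada-style `--` line comments and C-style `/* ... */` block comments in the same input.

```python
def skip_blanks_and_comments(text, pos):
    """Return the index of the next meaningful character at or after pos."""
    n = len(text)
    while pos < n:
        if text[pos] in " \t\r\n":              # white space
            pos += 1
        elif text.startswith("--", pos):        # line comment: runs to end of line
            end = text.find("\n", pos)
            pos = n if end == -1 else end + 1
        elif text.startswith("/*", pos):        # block comment: must be closed
            end = text.find("*/", pos + 2)
            if end == -1:
                raise SyntaxError(f"unclosed comment starting at offset {pos}")
            pos = end + 2
        else:
            break                               # start of a real token
    return pos
```

A tool that needs to retrieve comments could record the skipped spans instead of discarding them.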

  9. Case Equivalence • Some languages are case-insensitive • Pascal, Ada • Some are not • C, Java • Lexical analyzer ignores case if needed • This_Routine = THIS_RouTine • Error analysis may need exact casing • Friendly diagnostics follow user’s conventions

  10. Performance Issues • Speed • Lexical analysis can become a bottleneck • Minimize processing per character • Skip blanks fast • I/O is also an issue (read large blocks) • We compile frequently • Compilation time is important • Especially during development • Communicate with parser through global variables

  11. General approach to writing a lexical analyzer • Define set of token kinds: • An enumeration type (tok_int, tok_if, tok_plus, tok_left_paren, tok_assign, etc.) • Or a series of integer definitions in more primitive languages… • Some tokens carry associated data • E.g. key for identifier table • May be useful to build tree node • For identifiers, literals, etc.
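The token-kind enumeration on slide 11 might look like this in Python. The member names mirror the slide's `tok_int`, `tok_if`, and so on; the `Token` record and its field names are assumptions made for the sketch.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class TokKind(Enum):
    INT = auto()
    IDENT = auto()
    IF = auto()
    PLUS = auto()
    LEFT_PAREN = auto()
    ASSIGN = auto()
    EOF = auto()


@dataclass(frozen=True)
class Token:
    kind: TokKind
    line: int
    column: int
    data: Optional[int] = None   # e.g. key into the names table, or a literal's value


tok = Token(TokKind.IDENT, line=3, column=7, data=42)   # identifier with table key 42
plus = Token(TokKind.PLUS, line=3, column=10)           # operator: no associated data
```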

  12. Interface to Lexical Analyzer • Either: Convert entire file to a file of tokens • Lexical analyzer is separate phase • Or: Parser calls lexical analyzer to supply next token • This approach avoids extra I/O • Parser builds tree incrementally, using successive tokens as tree nodes
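The second interface on slide 12 (the parser asks for one token at a time) maps naturally onto a Python generator. The sketch below is a toy under stated assumptions: the token format `(kind, value)`, the names `tokens` and `parse_sum`, and the tiny grammar `int { '+' int }` are all invented for illustration, and the parser sums the integers instead of building a tree.

```python
import re

# Optional white space, then either a run of digits or any single character.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|(.))")


def tokens(text):
    """Yield (kind, value) pairs lazily, ending with ('eof', None)."""
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        pos = m.end()
        if m.group(1) is not None:
            yield ("int", int(m.group(1)))
        else:
            yield ("op", m.group(2))
    yield ("eof", None)


def parse_sum(text):
    """Toy parser for  int { '+' int } ; pulls the next token only when needed."""
    stream = tokens(text)
    kind, value = next(stream)
    if kind != "int":
        raise SyntaxError("expected integer")
    total = value
    kind, value = next(stream)
    while (kind, value) == ("op", "+"):
        kind, value = next(stream)
        if kind != "int":
            raise SyntaxError("expected integer after '+'")
        total += value
        kind, value = next(stream)
    if kind != "eof":
        raise SyntaxError("unexpected token")
    return total
```

No intermediate token file is written: each token is produced when the parser requests it.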

  13. Relevant Formalisms • Type 3 (Regular) Grammars • Regular Expressions • Finite State Machines • Equivalent in expressive power • Useful for program construction, even if hand-written

  14. Regular Grammars • Regular grammars • Non-terminals (arbitrary names) • Terminals (characters) • Productions limited to the following: • Non-terminal ::= terminal • Non-terminal ::= terminal Non-terminal • Treat character class (e.g. digit) as terminal • Regular grammars cannot count: size limits on identifiers and literals are impractical to express in the grammar • Cannot express proper nesting (parentheses)

  15. Grammars – an example • Grammar for real literals with no exponent (start symbol is REAL):
        digit   ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
        REAL    ::= digit REAL1
        REAL1   ::= digit REAL1      (arbitrary size)
        REAL1   ::= . INTEGER
        INTEGER ::= digit INTEGER    (arbitrary size)
        INTEGER ::= digit

  16. Regular Expressions • Regular expressions (RE) are defined by an alphabet (terminal symbols) and three operations: • Alternation RE1 | RE2 • Concatenation RE1 RE2 • Repetition RE* (zero or more RE’s) • RE’s and regular grammars define the same class of languages • Regular expressions are more convenient for some applications

  17. Specifying RE’s in Unix Tools • Single characters: a b c d \x • Alternation: [bcd] [b-z] ab|cd • Any character: . (period) • Repetition: x* (zero or more), y+ (one or more) • Concatenation: abc[d-q] • Optional RE: [0-9]+(\.[0-9]*)?
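The operators on slide 17 can be tried out with Python's standard `re` module, which uses the same notation for the constructs shown there. The helper name `is_number` is an assumption.

```python
import re

# The slide's last pattern: digits, optionally followed by a point and more digits.
NUMBER = re.compile(r"[0-9]+(\.[0-9]*)?")


def is_number(text):
    """True if the whole of text matches the pattern."""
    return NUMBER.fullmatch(text) is not None


# Alternation between sequences, and a character class followed by repetition.
AB_OR_CD = re.compile(r"ab|cd")
CLASS_THEN_STAR = re.compile(r"abc[d-q]x*")
```

Note that the pattern accepts "3." because `[0-9]*` may match zero digits after the point.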

  18. Finite State Machines • A language defined by a grammar is a (possibly infinite) set of strings • An automaton is a computation that determines whether a given string belongs to a specified language • A finite state machine (FSM) is an automaton that recognizes regular languages (those described by regular expressions) • Simplest automaton: memory is a single number (state)

  19. Specifying an FSM • A set of labeled states • Directed arcs between states labeled with character • One or more states may be terminal (accepting) • A distinguished state is start • Automaton makes transition from state S1 to S2 • If and only if arc from S1 to S2 is labeled with next character in input • Token is legal if automaton stops on terminal state
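The ingredients listed on slide 19 (states, labeled arcs, a start state, accepting states) fit in a small dictionary. The machine below is an assumed example that recognizes integers and identifiers; the state names and the helper `char_class` are hypothetical.

```python
START = "S"
ACCEPTING = {"Int", "Id"}


def char_class(ch):
    """Group characters so that one arc can stand for a whole class."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    if ch == "_":
        return "underscore"
    return "other"


# (state, character class) -> next state; a missing entry means "no arc".
ARCS = {
    ("S", "digit"): "Int",
    ("S", "letter"): "Id",
    ("Int", "digit"): "Int",
    ("Id", "letter"): "Id",
    ("Id", "digit"): "Id",
    ("Id", "underscore"): "Id",
}


def classify(text):
    """Return the accepting state reached on text, or None if text is not a legal token."""
    state = START
    for ch in text:
        state = ARCS.get((state, char_class(ch)))
        if state is None:
            return None
    return state if state in ACCEPTING else None
```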

  20. Building FSM from Grammar • One state for each non-terminal • A rule of the form • Nt1 ::= terminal • Generates transition from S1 to final state • A rule of the form • Nt1 ::= terminal Nt2 • Generates transition from S1 to S2 on an arc labeled by the terminal
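Applying the construction on slide 20 to the REAL grammar from slide 15 gives one state per non-terminal plus a final state. Because INTEGER has two productions that start with a digit, the resulting machine is non-deterministic, so this sketch tracks the set of all states the machine could be in. The names `FINAL`, `PRODUCTIONS`, and `accepts` are assumptions.

```python
DIGITS = "0123456789"
FINAL = "FINAL"   # hypothetical name for the single accepting state

# (non-terminal, terminal class) -> set of successor states,
# one entry per group of productions of the REAL grammar.
PRODUCTIONS = {
    ("REAL", "digit"): {"REAL1"},              # REAL    ::= digit REAL1
    ("REAL1", "digit"): {"REAL1"},             # REAL1   ::= digit REAL1
    ("REAL1", "."): {"INTEGER"},               # REAL1   ::= . INTEGER
    ("INTEGER", "digit"): {"INTEGER", FINAL},  # INTEGER ::= digit INTEGER | digit
}


def accepts(text, start="REAL"):
    """Follow every possible transition; accept if FINAL is reachable at the end."""
    current = {start}
    for ch in text:
        symbol = "digit" if ch in DIGITS else ch
        current = set().union(*(PRODUCTIONS.get((s, symbol), set()) for s in current))
        if not current:
            return False     # no arc for this character from any possible state
    return FINAL in current
```

Tracking a set of states is one way to avoid backtracking when the machine is non-deterministic.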

  21. Graphic representation [Figure: state diagram with start state S and states Int and id; arcs are labeled digit, letter, and underscore]

  22. Building FSM’s from RE’s • Every RE corresponds to a grammar • For all regular expressions • A natural translation to FSM exists • Alternation often leads to non-deterministic machines

  23. Non-Deterministic FSM • A non-deterministic FSM • Has at least one state • With two arcs to two distinct states • Labeled with the same character • Example: from start state, a digit can begin an integer literal or a real literal • Direct implementation requires backtracking • Nasty

  24. Deterministic FSM • For all states S • For all characters C: • There is at most one arc from any state S that is labeled with C • Much easier to implement • No backtracking

  25. From NFSM to DFSM • There is an algorithm for converting a non-deterministic machine to a deterministic one • Result may have exponentially more states • Intuitively: need new states to express uncertainty about token: int or real • Algorithm is efficient in practice (e.g. grep) • Other algorithms for minimizing number of states of FSM, for showing equivalence, etc.
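The slide does not name the conversion algorithm; the standard one is the subset construction, sketched here for machines without empty (epsilon) transitions. The function name `determinize`, the dictionary encoding, and the example machine for "integer or real" are assumptions.

```python
from collections import deque


def determinize(nfa, start, accepting, alphabet):
    """Subset construction: each DFSM state is a frozenset of NFSM states."""
    start_set = frozenset([start])
    dfa = {}                      # (state set, symbol) -> state set
    seen = {start_set}
    work = deque([start_set])
    while work:
        current = work.popleft()
        for symbol in alphabet:
            target = frozenset(t for s in current for t in nfa.get((s, symbol), ()))
            if not target:
                continue          # no arc on this symbol
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                work.append(target)
    dfa_accepting = {s for s in seen if s & accepting}
    return dfa, start_set, dfa_accepting


# A digit from Start may begin an integer or a real: two arcs with the same label.
nfa = {
    ("Start", "digit"): {"Int", "RealInt"},
    ("Int", "digit"): {"Int"},
    ("RealInt", "digit"): {"RealInt"},
    ("RealInt", "."): {"Dot"},
    ("Dot", "digit"): {"Frac"},
    ("Frac", "digit"): {"Frac"},
}
dfa, dfa_start, dfa_accepting = determinize(nfa, "Start", {"Int", "Frac"}, ["digit", "."])
```

The combined state {Int, RealInt} is the "uncertainty about token" that the slide mentions: the machine has seen digits and does not yet know whether a point will follow.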

  26. Implementing the Scanner • Three methods • Hand-coded approach: • draw DFSM, then implement with loop and case statement • Hybrid approach: • define tokens using regular expressions, convert to NFSM, apply algorithm to obtain minimal DFSM • Hand-code resulting DFSM • Automated approach: • Give the token definitions (regular expressions) to a scanner generator (e.g. LEX)

  27. Hand-coding • Normal coding techniques • Scan over white space and comments until a non-blank character is found • Branch depending on first character: • If digit, scan numeric literal • If letter, scan identifier or keyword • If operator, check next character (++, etc.) • Need table to determine character type efficiently • Return token found • Write aggressively efficient code: goto’s, global variables
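The branching structure on slide 27 can be sketched as a single loop. This is an illustration under assumptions, not tuned code: the keyword set, the list of two-character operators, and the `(kind, lexeme)` token format are invented, and Python's `str` methods stand in for the character-type table.

```python
KEYWORDS = {"begin", "end", "if", "while"}
TWO_CHAR_OPS = {":=", "**", "<=", ">="}


def scan(text):
    """Return a list of (kind, lexeme) pairs for text."""
    tokens, pos, n = [], 0, len(text)
    while pos < n:
        ch = text[pos]
        if ch.isspace():                       # skip white space
            pos += 1
            continue
        start = pos
        if ch.isdigit():                       # numeric literal
            while pos < n and text[pos].isdigit():
                pos += 1
            kind = "int"
        elif ch.isalpha():                     # identifier or keyword
            while pos < n and (text[pos].isalnum() or text[pos] == "_"):
                pos += 1
            kind = "keyword" if text[start:pos].lower() in KEYWORDS else "ident"
        else:                                  # operator: check the next character too
            pos += 2 if text[pos:pos + 2] in TWO_CHAR_OPS else 1
            kind = "op"
        tokens.append((kind, text[start:pos]))
    return tokens
```

Keywords are recognized by scanning an identifier first and then looking it up, which keeps the per-character work small.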

  28. Using grammar and FSM • Start with regular grammar or RE • Typically found in the language reference • Example (Ada), Chapter 2, Lexical Elements:
        digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
        decimal-literal ::= integer [.integer] [exponent]
        integer ::= digit {[underline] digit}
        exponent ::= E [+] integer | E - integer

  29. Using grammar and FSM • Create one state for each non-terminal • Label edges according to productions in grammar • Each state becomes a label in the program • Code for each state is a switch on next character, corresponding to edges out of current state • If no possible transition on next character, then: • If state is accepting, return the corresponding token • If state is not accepting, report error

  30. Hand-coded version: • Each state is encoded as follows:
        <<state1>>
        case Next_Character is
           when 'a' => goto state3;
           when 'b' => goto state1;
           when others => End_of_token_processing;
        end case;
        <<state2>>
        …
      • No explicit mention of state of automaton: the current state is implied by the position in the code

  31. Translating from FSM to code • A variable holds the current state:
        loop
           case State is
              when state1 =>
                 case Next_Character is
                    when 'a' => State := state3;
                    when 'b' => State := state1;
                    when others => End_token_processing;
                 end case;
              when state2 … …
           end case;
        end loop;
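The loop-and-case scheme on slide 31 looks like this for a concrete machine. The machine (integer and real literals), the state names, and the function `scan_number` are assumptions; the sketch also records the last accepting position so that the token ends at the longest legal prefix when no further transition is possible.

```python
def scan_number(text, pos=0):
    """Return (kind, end) for the longest numeric literal starting at pos, or (None, pos)."""
    state = "start"
    last_kind, last_end = None, pos          # most recent accepting configuration
    i = pos
    while i < len(text):
        ch = text[i]
        if state == "start":
            if ch.isdigit():
                state = "int"
            else:
                break                        # no transition: end of token processing
        elif state == "int":
            if ch.isdigit():
                state = "int"
            elif ch == ".":
                state = "dot"
            else:
                break
        elif state == "dot":
            if ch.isdigit():
                state = "frac"
            else:
                break
        elif state == "frac":
            if ch.isdigit():
                state = "frac"
            else:
                break
        i += 1
        if state == "int":                   # accepting states: remember where we were
            last_kind, last_end = "integer", i
        elif state == "frac":
            last_kind, last_end = "real", i
    return last_kind, last_end
```

On input "7.x" the machine reaches the non-accepting "dot" state and then stops, so the result falls back to the integer "7".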

  32. Automatic scanner construction • LEX builds a transition table, indexed by state and by character • Code gets transition from table:
        Tab : array (State, Character) of State := …
        begin
           while More_Input loop
              Curstate := Tab (Curstate, Next_Char);
              if Curstate = Error_State then …
           end loop;
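A table-driven recognizer in the spirit of slide 32, written by hand here rather than generated. The two-state identifier machine, the 128-entry ASCII rows, and all names are assumptions for the sketch.

```python
ERROR = -1
START, IDENT = 0, 1
N_STATES = 2


def build_table():
    """TAB[state][code] is the next state on the ASCII character with that code."""
    tab = [[ERROR] * 128 for _ in range(N_STATES)]
    for code in range(128):
        ch = chr(code)
        if ch.isalpha():                     # a letter may start or continue an identifier
            tab[START][code] = IDENT
            tab[IDENT][code] = IDENT
        elif ch.isdigit() or ch == "_":      # digits and underscores may only continue one
            tab[IDENT][code] = IDENT
    return tab


TAB = build_table()


def match_identifier(text):
    """Drive the table over text; True if it ends in the accepting state."""
    state = START
    for ch in text:
        code = ord(ch)
        state = TAB[state][code] if code < 128 else ERROR
        if state == ERROR:
            return False
    return state == IDENT
```

The inner loop does one table lookup per character regardless of how many token kinds the table encodes.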

  33. Automatic FSM Generation • Our example, FLEX • See home page for manual in HTML • FLEX is given • A set of regular expressions • Actions associated with each RE • It builds a scanner • Which matches RE’s and executes actions

  34. An Example of a Flex scanner
        DIGIT    [0-9]
        ID       [a-z][a-z0-9]*
        %%
        {DIGIT}+               { printf ("an integer %s (%d)\n", yytext, atoi (yytext)); }
        {DIGIT}+"."{DIGIT}*    { printf ("a float %s (%g)\n", yytext, atof (yytext)); }
        if|then|begin|end|procedure|function   { printf ("a keyword: %s\n", yytext); }

  35. Flex Example (continued)
        {ID}                   printf ("an identifier %s\n", yytext);
        "+"|"-"|"*"|"/"        { printf ("an operator %s\n", yytext); }
        "--".*\n               /* eat Ada style comment */
        [ \t\n]+               /* eat white space */
        .                      printf ("unrecognized character");
        %%

  36. Assembling the flex program
        %{
        #include <stdlib.h>    /* for atoi and atof */
        %}
        <<flex text we gave goes here>>
        %%
        int main (int argc, char **argv)
        {
            yyin = fopen (argv[1], "r");   /* scan the file named on the command line */
            yylex();
            return 0;
        }
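For readers without Flex at hand, the rule set from slides 34 and 35 can be approximated with Python's `re` module. This is a rough analogue, not a translation: Flex picks the longest match among all rules, whereas Python's alternation takes the first alternative that matches, so the rules are reordered (float before integer, comment before operator) and the keyword rule gets a lookahead. The names `RULES`, `MASTER`, and `run_scanner` are assumptions.

```python
import re

RULES = [
    ("float", r"[0-9]+\.[0-9]*"),
    ("integer", r"[0-9]+"),
    # The lookahead keeps "iffy" from being split into keyword "if" plus "fy".
    ("keyword", r"(?:if|then|begin|end|procedure|function)(?![a-z0-9])"),
    ("identifier", r"[a-z][a-z0-9]*"),
    ("skip", r"--[^\n]*\n?|[ \t\n]+"),       # Ada style comment, or white space
    ("operator", r"[+\-*/]"),
    ("unrecognized", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES))


def run_scanner(text):
    """Return (rule name, lexeme) pairs, dropping comments and white space."""
    out = []
    for m in MASTER.finditer(text):
        if m.lastgroup != "skip":
            out.append((m.lastgroup, m.group()))
    return out
```

As in the Flex specification, identifiers are lower-case only, so an upper-case letter falls through to the catch-all rule.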

  37. Choice Between Methods? • Hand-written scanners • Typically much faster execution • Easy to write (standard structure) • Preferable for good error recovery • Flex approach • Simple to use • Easy to modify token language
