
SCANNING

Chuen-Liang Chen, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan



  1. SCANNING
     Chuen-Liang Chen
     Department of Computer Science and Information Engineering
     National Taiwan University
     Taipei, TAIWAN

  2. Scanner (lexical analyzer)
     • primary function -- grouping input characters into tokens
     • called by -- parser
     • return --
       1. token code
       2. attribute (optional)
     • theoretical bases -- regular expression, finite automata
     • implementation
       • dedicated program (hardwired)
       • table-driven
     • construction
       • hand-coded
       • by generator, which limits the effort of building a scanner to specifying the tokens it is to recognize
         • program [lex]
         • table + standard driver program [ScanGen]

  3. Regular expression (1/2)
     • being used to
       • specify simple sets of strings (regular sets)
       • specify tokens of a programming language
       • program a scanner generator
     • string -- catenation of characters in the vocabulary, denoted V
     • regular expression
       • meta-characters: ( ) ' * + ? |
         • have to be quoted when used as ordinary characters
       1. ∅ -- empty set
       2. λ -- set containing only the null string
       3. s -- { string s }
       4. A | B -- alternation of the corresponding regular sets
       5. A B -- catenation of the corresponding regular sets
       6. A* -- Kleene closure of the corresponding regular set
          • repeating zero or more times

  4. Regular expression (2/2)
     • other notations
       • A+ = A A*
       • A? = A | λ
       • Not(A) = V - A, for a set of characters A
       • Not(S) = V* - S, for a set of strings S
         • may be infinite but still regular
       • A^k = A A ... A (k times)
     • examples
       • comment of the form "-- anything Eol"
         Comment = - - ( Not(Eol) )* Eol
       • fixed decimal literal
         Lit = D+ . D+
       • identifier that begins with a letter, ends with a letter/digit, and has no consecutive underlines
         ID = L ( L | D )* ( _ ( L | D )+ )*
     • being able to represent all finite sets and many, but not all, infinite sets
       • QUIZ: counter example?
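The identifier expression ID = L ( L | D )* ( _ ( L | D )+ )* can be checked by hand with a short C function. This is only a sketch, not something from the slides: the name `is_identifier` is hypothetical, and L and D are assumed to be the letters and digits of the default C locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if s matches ID = L (L|D)* ( _ (L|D)+ )*,
 * assuming L = letter and D = digit in the default C locale. */
bool is_identifier(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    assert(s != NULL);
    if (!isalpha(*p))             /* must begin with a letter */
        return false;
    p++;
    while (isalnum(*p))           /* ( L | D )* */
        p++;
    while (*p == '_') {           /* ( _ ( L | D )+ )* */
        p++;
        if (!isalnum(*p))         /* an underline must be followed by a letter/digit */
            return false;
        while (isalnum(*p))
            p++;
    }
    return *p == '\0';            /* the whole string must be consumed */
}
```

Each loop corresponds to one starred group of the expression, so `a_b1` is accepted while `a__b`, `a_` and `1a` are rejected.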

  5. Finite automata
     • being used to recognize the tokens specified by a regular expression
     • consisting of
       • a finite set of states
       • a set of transitions labeled with characters in V
       • a start state
       • a set of final states
     • transition diagram
     • transition table
       • blank: error entry
     • deterministic finite automata (DFA)
       • unique transition for a given state and character
       • otherwise, nondeterministic finite automata (NFA)
     [Figure: transition diagram with states 1, 2, 3, 4 and edge labels -, -, Not(Eol), Eol]
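A transition table for such an automaton can be written down directly as a C array. The sketch below assumes the four-state diagram recognizes Comment = - - ( Not(Eol) )* Eol, with state 1 as the start state and state 4 as the only final state; the exact edges are reconstructed from that expression, Eol is taken to be '\n', and the names `char_class` and `is_comment` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { ERR = 0, S1 = 1, S2, S3, S4, NSTATES };   /* ERR plays the role of a blank entry */
enum { C_DASH, C_EOL, C_OTHER, NCLASSES };       /* character classes keep the table small */

static int char_class(int c)
{
    if (c == '-')  return C_DASH;
    if (c == '\n') return C_EOL;
    return C_OTHER;
}

/* T[state][class] = next state (assumed edges, see the lead-in). */
static const int T[NSTATES][NCLASSES] = {
    /* ERR */ { ERR, ERR, ERR },
    /* 1   */ { S2,  ERR, ERR },
    /* 2   */ { S3,  ERR, ERR },
    /* 3   */ { S3,  S4,  S3  },   /* Not(Eol) loops, Eol finishes */
    /* 4   */ { ERR, ERR, ERR },
};

/* True if the whole string s is one comment, i.e. the DFA ends in state 4. */
bool is_comment(const char *s)
{
    int state = S1;

    assert(s != NULL);
    for (; *s != '\0' && state != ERR; s++)
        state = T[state][char_class((unsigned char)*s)];
    return state == S4;
}
```

Because every (state, class) pair has exactly one entry, the table describes a DFA; an NFA would need a set of next states per entry.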

  6. From RE to NFA
     • rules
       • λ
       • Kleene closure
       • vocabulary
       • catenation
       • alternation
     [Figure: one NFA construction diagram per rule, built from the sub-automata "NFA for A" and "NFA for B" and joined by λ-transitions]
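One common way to realize these rules is a Thompson-style construction, where every rule returns a fragment with a single start and a single accept state. The sketch below is an assumption about how the lost diagrams might be coded, not the slide's exact construction; all names (`nfa_char`, `nfa_cat`, `nfa_alt`, `nfa_star`, `nfa_lambda`, `nfa_match`) are hypothetical, and each fragment is meant to be used in only one larger fragment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_NFA_STATES 64

/* Each state has at most one character edge and at most two λ-edges. */
static struct { int ch, to, lam[2]; } st[MAX_NFA_STATES];
static int nstates = 0;

struct frag { int start, accept; };

static int new_state(void)
{
    assert(nstates < MAX_NFA_STATES);
    st[nstates].ch = st[nstates].to = -1;
    st[nstates].lam[0] = st[nstates].lam[1] = -1;
    return nstates++;
}

static void add_lambda(int from, int to)
{
    int k = (st[from].lam[0] < 0) ? 0 : 1;
    assert(st[from].lam[k] < 0);              /* no more than two λ-edges per state */
    st[from].lam[k] = to;
}

static struct frag new_frag(void)
{
    struct frag f;
    f.start = new_state();
    f.accept = new_state();
    return f;
}

struct frag nfa_lambda(void)                  /* rule: λ */
{
    struct frag f = new_frag();
    add_lambda(f.start, f.accept);
    return f;
}

struct frag nfa_char(int c)                   /* rule: vocabulary character */
{
    struct frag f = new_frag();
    st[f.start].ch = (unsigned char)c;
    st[f.start].to = f.accept;
    return f;
}

struct frag nfa_cat(struct frag a, struct frag b)   /* rule: A B */
{
    struct frag f = { a.start, b.accept };
    add_lambda(a.accept, b.start);
    return f;
}

struct frag nfa_alt(struct frag a, struct frag b)   /* rule: A | B */
{
    struct frag f = new_frag();
    add_lambda(f.start, a.start);
    add_lambda(f.start, b.start);
    add_lambda(a.accept, f.accept);
    add_lambda(b.accept, f.accept);
    return f;
}

struct frag nfa_star(struct frag a)                 /* rule: A* */
{
    struct frag f = new_frag();
    add_lambda(f.start, a.start);
    add_lambda(f.start, f.accept);            /* zero repetitions */
    add_lambda(a.accept, a.start);            /* repeat */
    add_lambda(a.accept, f.accept);
    return f;
}

static uint64_t closure(uint64_t set)         /* follow λ-edges until nothing changes */
{
    for (;;) {
        uint64_t before = set;
        for (int s = 0; s < nstates; s++)
            if ((set >> s) & 1)
                for (int k = 0; k < 2; k++)
                    if (st[s].lam[k] >= 0)
                        set |= UINT64_C(1) << st[s].lam[k];
        if (set == before)
            return set;
    }
}

/* Simulate the NFA on s by tracking the set of states it could be in. */
bool nfa_match(struct frag f, const char *s)
{
    uint64_t cur = closure(UINT64_C(1) << f.start);
    for (; *s != '\0'; s++) {
        uint64_t next = 0;
        for (int i = 0; i < nstates; i++)
            if (((cur >> i) & 1) && st[i].ch == (unsigned char)*s)
                next |= UINT64_C(1) << st[i].to;
        cur = closure(next);
    }
    return (cur >> f.accept) & 1;
}
```

For example, `nfa_cat(nfa_star(nfa_alt(nfa_char('a'), nfa_char('b'))), nfa_char('a'))` builds an NFA for ( a | b )* a, which `nfa_match` can then run against an input string.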

  7. From NFA to DFA
     • major operation: λ-closure
     • example
       1. λ-closure( 1 ) = { 1, 2 }
       2. λ-closure( 3, 4, 5 ) = { 3, 4, 5 }
       3. λ-closure( 4, 5 ) = { 4, 5 }
       4. λ-closure( 5 ) = { 5 }
     [Figure: NFA with states 1 to 5 and edge labels λ, a, b, a | b, converted step by step into a DFA with states {1,2}, {3,4,5}, {4,5}, {5}]
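The λ-closure operation can be sketched with state sets stored as bitmasks and a worklist of states still to be expanded. The function name `lambda_closure` and the array `EXAMPLE_LAMBDA` are hypothetical; the example NFA is an assumption that only records the single λ-edge from state 1 to state 2 implied by λ-closure(1) = {1, 2}.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_STATES 32

/* lambda_edges[s] is the bitmask of states reachable from s by ONE λ-transition;
 * bit i stands for state i. Returns the smallest superset of `set` that is
 * closed under λ-transitions. */
uint32_t lambda_closure(uint32_t set, const uint32_t lambda_edges[], int nstates)
{
    uint32_t valid, closure, pending;

    assert(lambda_edges != NULL && nstates > 0 && nstates <= MAX_STATES);
    valid = (nstates == MAX_STATES) ? UINT32_MAX
                                    : ((UINT32_C(1) << nstates) - 1);
    closure = pending = set & valid;

    while (pending != 0) {
        uint32_t fresh;
        int s = 0;

        while (((pending >> s) & 1u) == 0)    /* pick the lowest pending state */
            s++;
        pending &= ~(UINT32_C(1) << s);

        fresh = lambda_edges[s] & valid & ~closure;   /* states not seen before */
        closure |= fresh;
        pending |= fresh;
    }
    return closure;
}

/* Assumed example: states 1..5 (index 0 unused), one λ-edge 1 -> 2. */
const uint32_t EXAMPLE_LAMBDA[6] = { 0, UINT32_C(1) << 2, 0, 0, 0, 0 };
```

In the subset construction, each DFA state is the λ-closure of the NFA states reached from the previous DFA state on one input character.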

  8. DFA optimization
     • major operation: partition states into equivalence classes according to
       • final / non-final states
       • transition functions
     • example
       ( A B C D E )
       ( A B C D ) ( E )
       ( A B C ) ( D ) ( E )
       ( A C ) ( B ) ( D ) ( E )
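The partitioning step can be sketched as repeated refinement: start from the final / non-final split and keep splitting classes whose members move to different classes on some symbol. The slide does not give the DFA's transition table, so the table below is an assumption: it is the familiar five-state DFA for ( a | b )* a b b, which happens to produce the same sequence of partitions. All identifiers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_STATES  16
#define MAX_SYMBOLS 4

/* Same class now, and every symbol leads to the same class? */
static bool same_behaviour(int s, int t, int nsymbols,
                           const int delta[][MAX_SYMBOLS], const int block[])
{
    if (block[s] != block[t])
        return false;
    for (int c = 0; c < nsymbols; c++)
        if (block[delta[s][c]] != block[delta[t][c]])
            return false;
    return true;
}

/* Fills block[s] with the class number of state s; returns the class count. */
int refine_partition(int nstates, int nsymbols,
                     const int delta[][MAX_SYMBOLS],
                     const bool is_final[], int block[])
{
    int next[MAX_STATES];
    int s, t, nblocks, previous;
    bool any_final = false, any_nonfinal = false;

    assert(nstates > 0 && nstates <= MAX_STATES);
    assert(nsymbols > 0 && nsymbols <= MAX_SYMBOLS);

    for (s = 0; s < nstates; s++) {           /* initial split: final / non-final */
        block[s] = is_final[s] ? 1 : 0;
        if (is_final[s]) any_final = true; else any_nonfinal = true;
    }
    nblocks = (any_final ? 1 : 0) + (any_nonfinal ? 1 : 0);

    do {                                      /* refine until no class is split */
        previous = nblocks;
        nblocks = 0;
        for (s = 0; s < nstates; s++) {
            for (t = 0; t < s; t++)
                if (same_behaviour(s, t, nsymbols, delta, block))
                    break;
            next[s] = (t < s) ? next[t] : nblocks++;
        }
        memcpy(block, next, (size_t)nstates * sizeof block[0]);
    } while (nblocks != previous);

    return nblocks;
}

/* Assumed example: states A..E are 0..4, symbols a and b are 0 and 1. */
const int  EX_DELTA[5][MAX_SYMBOLS] = { {1, 2}, {1, 3}, {1, 2}, {1, 4}, {1, 2} };
const bool EX_FINAL[5] = { false, false, false, false, true };
int ex_block[5];

int refine_example(void)
{
    return refine_partition(5, 2, EX_DELTA, EX_FINAL, ex_block);
}
```

With this table the classes pass through ( A B C D ) ( E ) and ( A B C ) ( D ) ( E ) before settling at ( A C ) ( B ) ( D ) ( E ), so A and C can be merged into one state.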

  9. From DFA to scanner (1/3)
     • dedicated program
     • example

           if (current_char == '-') {
               current_char = getchar();
               if (current_char == '-') {
                   /* Skip the comment body up to the end of line (or end of input). */
                   do
                       current_char = getchar();
                   while (current_char != '\n' && current_char != EOF);
               } else {
                   ungetc(current_char, stdin);
                   lexical_error(current_char);
               }
           } else
               lexical_error(current_char);
           /* Return or process valid token. */

     • ungetc() -- lookahead
     [Figure: transition diagram with states 1, 2, 3, 4 and edge labels -, -, Not(Eol), Eol]

  10. From DFA to scanner (2/3)
      • table-driven
        • transition table + return token code + character save/toss operation + process of valid token
      • example

            /*
             * Note: current_char is already set
             * to the current input character.
             */
            state = initial_state;
            while (TRUE) {
                next_state = T[state][current_char];
                if (next_state == ERROR)
                    break;
                state = next_state;
                if (current_char == EOF)
                    break;
                current_char = getchar();
            }
            if (is_final_state(state)) {
                /* Return or process valid token. */
            } else
                lexical_error(current_char);

      • QUIZ: where is "lookahead"?
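A self-contained variant of this driver loop, reading from a string instead of stdin, might look as follows. It is a sketch under assumptions: the two-token DFA (identifiers and integer literals), the token codes 1 and 2, and every identifier in the code are illustrative choices rather than part of the slides.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum { ERROR_STATE = 0, START, IN_ID, IN_INT, NSTATES };
enum { LETTER, DIGIT, OTHER, NCLASSES };
enum { TOK_NONE = 0, TOK_ID = 1, TOK_INT = 2 };   /* assumed token codes */

struct token { int code; size_t length; };

static const int T[NSTATES][NCLASSES] = {
    [ERROR_STATE] = { ERROR_STATE, ERROR_STATE, ERROR_STATE },
    [START]       = { IN_ID,       IN_INT,      ERROR_STATE },
    [IN_ID]       = { IN_ID,       IN_ID,       ERROR_STATE },
    [IN_INT]      = { ERROR_STATE, IN_INT,      ERROR_STATE },
};

static int char_class(int c)
{
    if (isalpha(c)) return LETTER;
    if (isdigit(c)) return DIGIT;
    return OTHER;                 /* includes the terminating '\0' */
}

/* Scan one token from the start of input; length counts the characters consumed. */
struct token scan_token(const char *input)
{
    struct token tok = { TOK_NONE, 0 };
    int state = START;

    assert(input != NULL);
    for (;;) {
        int c = (unsigned char)input[tok.length];
        int next_state = T[state][char_class(c)];

        if (next_state == ERROR_STATE)
            break;                /* c is examined but not consumed */
        state = next_state;
        tok.length++;
    }
    if (state == IN_ID)
        tok.code = TOK_ID;
    else if (state == IN_INT)
        tok.code = TOK_INT;
    return tok;
}
```

The character that stops the loop has been inspected but is not counted in `length`, so the next call can start from it; `scan_token("abc1+2")` therefore reports an identifier of length 4 and leaves `+` for the following token.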

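  11. From DFA to scanner (3/3)
      • toss operation
        • example -- ( " ( Not(") | " " )* " )
        • the input """Hi""" is saved as "Hi"
        • QUIZ: how to program?
      [Figure: transition diagram with edge labels ", Not("), T("), T("), where T(x) marks a character that is tossed instead of saved]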
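One possible way to program the save/toss distinction by hand is sketched below: the opening quote, the closing quote and the first quote of every doubled pair are tossed, and everything else is saved. The function name `scan_string` and its return convention are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Scan a literal matching " ( Not(") | "" )* " at the start of `in`.
 * Saved characters are written to `out` as a NUL-terminated string.
 * Returns the number of input characters consumed, or -1 if the input is
 * not a complete string literal or `out` is too small. */
long scan_string(const char *in, char *out, size_t outsize)
{
    size_t i = 0, n = 0;

    assert(in != NULL && out != NULL && outsize > 0);
    if (in[i] != '"')
        return -1;
    i++;                                  /* toss the opening quote */

    for (;;) {
        if (in[i] == '\0')
            return -1;                    /* input ended before the closing quote */
        if (in[i] == '"') {
            if (in[i + 1] == '"') {
                i++;                      /* toss the first quote of a "" pair */
            } else {
                i++;                      /* toss the closing quote */
                break;
            }
        }
        if (n + 1 >= outsize)
            return -1;                    /* no room left for the character and NUL */
        out[n++] = in[i++];               /* save */
    }
    out[n] = '\0';
    return (long)i;
}
```

Applied to the slide's input `"""Hi"""`, all eight characters are consumed and the four characters `"Hi"` are saved.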

  12. Reserved words
      • identifiers reserved for particular usage
      • approach 1
        • one regular expression per reserved word
      • approach 2
        • exceptions to ordinary identifiers
        • approach used in our simple example
      • QUIZ: comparison?
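Approach 2 can be sketched as a table lookup performed after an ordinary identifier has been scanned. The token codes reuse the values returned in the Lex input file later in the deck (identifier 1, begin 4, end 5, read 6, write 7); the function names and the case-insensitive comparison are assumptions made for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum { TOK_ID = 1, TOK_BEGIN = 4, TOK_END = 5, TOK_READ = 6, TOK_WRITE = 7 };

/* Compare two strings, ignoring letter case. */
static int equal_ignore_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Called with the text of a scanned identifier: returns the reserved-word
 * code if the text is in the exception table, otherwise TOK_ID. */
int classify_identifier(const char *lexeme)
{
    static const struct { const char *word; int code; } reserved[] = {
        { "begin", TOK_BEGIN },
        { "end",   TOK_END   },
        { "read",  TOK_READ  },
        { "write", TOK_WRITE },
    };
    size_t i;

    assert(lexeme != NULL);
    for (i = 0; i < sizeof reserved / sizeof reserved[0]; i++)
        if (equal_ignore_case(lexeme, reserved[i].word))
            return reserved[i].code;
    return TOK_ID;
}
```

With this approach the automaton only needs the identifier pattern, at the price of one lookup per identifier; approach 1 moves that work into the transition table.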

  13. Lexical error recovery
      • strategies
        • delete the characters read so far
        • delete the first character
      • handling of runaway string
        • QUIZ: why need special handling?
        • " ( Not("|Eol) | " " )* "
        • " ( Not("|Eol) | " " )* Eol
          • print out special error message
      • handling of runaway comment
        • { Not({|})* }
        • { ( Not({|})* { Not({|})* )+ }
          • warning
        • { Not(})* Eof
          • error
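The three comment patterns can be told apart in a single pass. The following sketch assumes the text starts at the opening `{` and that the end of the C string stands in for Eof; `check_comment` and the enum names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum comment_status {
    COMMENT_OK,        /* { Not({|})* }                   */
    COMMENT_WARNING,   /* { ( Not({|})* { Not({|})* )+ }  */
    COMMENT_ERROR      /* { Not(})* Eof                   */
};

/* Classify the comment that starts at s[0] == '{'. */
enum comment_status check_comment(const char *s)
{
    int saw_inner_brace = 0;

    assert(s != NULL && s[0] == '{');
    for (s++; *s != '\0'; s++) {
        if (*s == '}')
            return saw_inner_brace ? COMMENT_WARNING : COMMENT_OK;
        if (*s == '{')
            saw_inner_brace = 1;   /* a second '{' suggests an earlier '}' is missing */
    }
    return COMMENT_ERROR;          /* reached Eof without a closing '}' */
}
```

A `{` inside a comment is still legal input, so it only earns a warning, whereas a comment that runs to the end of the input cannot be closed and is reported as an error.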

  14. Lex (1/2)
      • input file --

            E            [Ee]
            OtherLetter  [A-DF-Za-df-z]
            Digit        [0-9]
            Letter       {E}|{OtherLetter}
            IntLit       {Digit}+
            %%
            [ \t\n]+                                   { /* delete */ }
            [Bb][Ee][Gg][Ii][Nn]                       { minor=0; return(4); }
            [Ee][Nn][Dd]                               { minor=0; return(5); }
            [Rr][Ee][Aa][Dd]                           { minor=0; return(6); }
            [Ww][Rr][Ii][Tt][Ee]                       { minor=0; return(7); }
            {Letter}({Letter}|{Digit}|_)*              { minor=0; return(1); }
            {IntLit}                                   { minor=1; return(2); }
            ({IntLit}[.]{IntLit})({E}[+-]?{IntLit})?   { minor=2; return(2); }
            \"([^\"\n]|\"\")*\"                        { stripquotes(); minor=3; return(2); }
            \"([^\"\n]|\"\")*\n                        { stripquotes(); minor=0; return(3); }
            "("                                        { minor=0; return(8); }
            ")"                                        { minor=0; return(9); }
            ";"                                        { minor=0; return(10); }
            ","                                        { minor=0; return(11); }
            ":="                                       { minor=0; return(12); }
            "+"                                        { minor=0; return(13); }
            "-"                                        { minor=0; return(14); }
            %%

      • annotations on the slide
        • class
        • to reduce table size
        • precedence
        • regular expression
        • executed when RE is matched

  15. Lex (2/2)
      • input file --

            /* Strip unwanted quotes from string in yytext; adjust yyleng. */
            void stripquotes(void)
            {
                int frompos, topos = 0, numquotes = 2;

                /* Start at 1 to drop the opening quote. */
                for (frompos = 1; frompos < yyleng; frompos++) {
                    yytext[topos++] = yytext[frompos];
                    /* A doubled quote stands for one quote: skip the second one. */
                    if (yytext[frompos] == '"' && yytext[frompos+1] == '"') {
                        frompos++;
                        numquotes++;
                    }
                }
                /* Shortening the text also cuts off the copied closing delimiter. */
                yyleng -= numquotes;
                yytext[yyleng] = '\0';
            }

      • output -- a program
      • interface --
        • int yylex( )
        • char yytext[ ];
        • int yyleng;
        • auxiliary routine(s)
