Review: Regular expression: How do we define it? Given an alphabet , Base case:

taryn

  1. Review: Regular expression
     • How do we define it? Given an alphabet Σ:
     • Base case:
       • ε is a regular expression that denotes {ε}, the set that contains the empty string.
       • For each a ∈ Σ, a is a regular expression that denotes {a}, the set containing the string a.
     • Induction case: if r and s are regular expressions denoting the languages (sets) L(r) and L(s), then
       • (r) | (s) is a regular expression denoting L(r) ∪ L(s)
       • (r)(s) is a regular expression denoting L(r)L(s)
       • (r)* is a regular expression denoting (L(r))*
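As a worked instance of the inductive definition, the following derivation computes the language of one expression step by step. The alphabet Σ = {a, b} and the expression r = ((a)|(b))*(a) are chosen here purely for illustration.

```latex
\begin{align*}
L(a) &= \{a\}, \qquad L(b) = \{b\} && \text{(base case)} \\
L\big((a)\,|\,(b)\big) &= L(a) \cup L(b) = \{a, b\} && \text{(union)} \\
L\big(((a)\,|\,(b))^{*}\big) &= \{a, b\}^{*} = \{\varepsilon, a, b, aa, ab, ba, bb, \dots\} && \text{(closure)} \\
L\big(((a)\,|\,(b))^{*}(a)\big) &= \{a, b\}^{*}\{a\} = \{\, wa \mid w \in \{a, b\}^{*} \,\} && \text{(concatenation)}
\end{align*}
```

So r denotes exactly the strings over {a, b} that end in a.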

  2. Lex -- a Lexical Analyzer Generator (by M. E. Lesk and Eric Schmidt)
     • Lex source program:
         {definition}
         %%
         {rules}
         %%
         {user subroutines}
     • Rules: <regular expression> <action>
     • Each regular expression specifies a token.
     • Default action for anything that is not matched: copy it to the output.
     • Action: a C source fragment specifying what to do when a token is recognized.

  3. lex program examples: ex1.l and ex2.l
     • 'lex ex1.l' produces the lex.yy.c file.
     • The int yylex() routine is the scanner that recognizes the tokens described by the regular expressions specified.
     • yylex() normally returns a non-zero value (usually a token id).
     • yylex() returns 0 when end of file is reached.
     • You need a driver to test the routine.
     • You need to have a yywrap() function in the lex file (return 1).
       • yylex() calls it at end of input; returning 1 means there is no further input. It exists so that a scanner can continue with multiple input files.
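A driver of the kind mentioned above can be sketched as follows. This is a minimal sketch, not the contents of ex1.l or ex2.l: `fake_yylex`, its token ids, and `drive` are hypothetical names introduced so the fragment compiles without lex.yy.c. In a real build the loop would call the generated `yylex()` directly.

```c
#include <stdio.h>

/* Stand-in for the generated scanner (assumption): a real build would
   link lex.yy.c and call yylex(). The token ids below are made up. */
static const int fake_tokens[] = {258, 259, 258};
static int fake_pos = 0;

static int fake_yylex(void)
{
    int n = (int)(sizeof fake_tokens / sizeof fake_tokens[0]);
    return fake_pos < n ? fake_tokens[fake_pos++] : 0;  /* 0 = end of file */
}

/* Called by a lex scanner at end of input; 1 means "no more input". */
int yywrap(void) { return 1; }

/* Driver loop: keep asking for tokens until the scanner returns 0.
   Returns the number of tokens seen. */
int drive(int (*scan)(void))
{
    int tok, count = 0;
    while ((tok = scan()) != 0) {
        printf("token id: %d\n", tok);
        count++;
    }
    return count;
}
```

A `main` that calls `drive(yylex)` and is linked against lex.yy.c would be the usual way to exercise the scanner.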

  4. Lex regular expression: contains text characters and operators.
     • Letters of the alphabet and digits are always text characters.
       • The regular expression integer matches the string "integer"
     • Operators: "\[]^-?.*+|()$/{}%<>
       • When these characters appear in a regular expression, they have special meanings

  5. Operators (characters that have special meanings): "\[]^-?.*+|()$/{}%<>
     • '*', '+', '|', '(', ')' -- used in regular expressions
     • '"' -- any character between quotes is a text character.
       • E.g.: "xyz++" == xyz"++"
     • '\' -- escape character
       • To get the operators back: "xyz++" == ??
       • To specify special characters: \40 == " "
     • '[' and ']' -- used to specify a set of characters
       • E.g.: [a-z], [a-zA-Z]
       • Every character inside the brackets except ^, - and \ is a text character
       • [-+0-9], [\40-\176]
     • '^' -- not, when used as the first character after the left bracket
       • E.g.: [^abc] -- everything except a, b or c.
       • [^a-zA-Z] -- ??

  6. Operators (characters that have special meanings): "\[]^-?.*+|()$/{}%<>
     • '.' -- any character except newline
     • '?' -- optional: ab?c matches 'ac' or 'abc'
     • '/' -- used for character lookahead:
       • E.g.: ab/cd -- matches ab only if it is followed by cd
     • '{' '}' -- enclose a regular definition
     • '%' -- has special meaning in lex
     • '$' -- matches the end of a line, '^' -- matches the beginning of a line
       • ab$ == ab/\n
     • '<' '>' -- start condition (more context sensitivity support, see the paper for details).

  7. Order of pattern matching:
     • Always matches the longest pattern.
     • When multiple patterns match the same longest string, the first pattern is used.
     • To override, add "REJECT" in the action.
         ...
         %%
         Ab      {printf("rule 1\n");}
         Abc     {printf("rule 2\n");}
         {letter}({letter}|{digit})*  {printf("rule 3\n");}
         %%
     • Input: Abc
     • What happens with '.*' as a pattern?
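The two disambiguation rules can be simulated by hand for the three rules on this slide. This is only a sketch of the selection logic, not how lex implements it (lex builds an automaton). The helper names are hypothetical, and {letter} and {digit} are assumed to mean ASCII letters and digits, since the slide does not define them.

```c
#include <ctype.h>
#include <string.h>

/* Length of the match of a literal pattern at the start of s, or 0. */
static int match_literal(const char *lit, const char *s)
{
    size_t n = strlen(lit);
    return strncmp(lit, s, n) == 0 ? (int)n : 0;
}

/* Length of the longest identifier (a letter followed by letters or
   digits) at the start of s, or 0. */
static int match_identifier(const char *s)
{
    int n = 0;
    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[n]))
        n++;
    return n;
}

/* Returns the 1-based number of the winning rule, or 0 if none matches.
   The strict '>' keeps the earliest rule when match lengths tie. */
int winning_rule(const char *s)
{
    int len[3], best = 0, rule = 0, i;
    len[0] = match_literal("Ab", s);    /* rule 1 */
    len[1] = match_literal("Abc", s);   /* rule 2 */
    len[2] = match_identifier(s);       /* rule 3 */
    for (i = 0; i < 3; i++)
        if (len[i] > best) { best = len[i]; rule = i + 1; }
    return rule;
}
```

For the input Abc, rules 2 and 3 both match three characters, so the earlier one (rule 2) is chosen; for Abcd only rule 3 matches all four characters.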

  8. Manipulate the lexeme and/or the input stream:
     • yytext -- a char pointer pointing to the matched string
     • yyleng -- the length of the matched string
     • I/O routines to manipulate the input stream:
       • input() -- get a character from the input stream; returns <= 0 when reaching the end of the input stream, the character otherwise
       • unput(c) -- put c back onto the input stream
     • Deal with comments: /* ... */
       • "/*".*"*/" ???
         %%
         …
         "/*"  { int c1, c2;
                 c2 = input();
                 if (c2 <= 0) { lex_error("unfinished comment" …); }
                 else {
                   c1 = c2; c2 = input();
                   while (((c1 != '*') || (c2 != '/')) && (c2 > 0)) { c1 = c2; c2 = input(); }
                   if (c2 <= 0) { lex_error( …. ); }
                 }
               }
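The comment-skipping loop can be tried outside lex with a stand-in for input(). In this sketch `next_char` and `skip_comment` are hypothetical names; `next_char` reads from a string and returns 0 at the end, mimicking the "<= 0 at end of input" convention described above.

```c
/* Hypothetical stand-in for lex's input(): reads from a string and
   returns 0 at the end of it. */
static const char *src;

static int next_char(void)
{
    return *src ? (unsigned char)*src++ : 0;
}

/* Call with the text that follows the comment opener.
   Returns 1 if the closing star-slash pair was found,
   0 if the input ended first (unfinished comment). */
int skip_comment(const char *rest)
{
    int c1, c2;

    src = rest;
    c2 = next_char();
    if (c2 <= 0)
        return 0;
    c1 = c2;
    c2 = next_char();
    /* Slide a two-character window until it holds '*' then '/'. */
    while ((c1 != '*' || c2 != '/') && c2 > 0) {
        c1 = c2;
        c2 = next_char();
    }
    return c2 > 0;
}
```

Because the loop consumes characters one at a time, it also handles comments that span several lines, which a single-line pattern built on '.' cannot.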

  9. Reporting errors:
     • What kind of errors? Not too many.
       • Characters that cannot lead to a token
       • Unterminated comments (can we do it in later phases?)
       • Unterminated string constants.
     • How to keep track of the current position (which line, which column)?
       • Use two global variables for this: yyline, yycolumn
         %{
         int yyline = 1, yycolumn = 1;
         %}
         ...
         %%
         [ \t\n]+   {/* do nothing */}
         If         {return (IFNumber);}
         "+"        {return (PLUSNumber);}
         {letter}({letter}|{digit})*  {yylval = idtable_insert(yytext); return (IDNumber);}
         ...
         %%
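One way to keep yyline and yycolumn current is to call a small helper from every action with the matched text. The helper name `advance_position` is hypothetical, and counting a tab as a single column is a simplifying assumption.

```c
int yyline = 1, yycolumn = 1;

/* Advance the position past one matched lexeme. In a lex action this
   would be called as advance_position(yytext, yyleng). */
void advance_position(const char *text, int len)
{
    int i;
    for (i = 0; i < len; i++) {
        if (text[i] == '\n') {
            yyline++;        /* a newline starts the next line ... */
            yycolumn = 1;    /* ... at its first column */
        } else {
            yycolumn++;      /* a tab counts as one column here */
        }
    }
}
```

With this in place, an error message can report yyline and yycolumn as they stood before the offending lexeme was consumed.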

  10. Reporting errors:
     • How to report an error character that cannot lead to a token?
     • How to deal with an unterminated comment?
     • How to deal with an unterminated string?

  11. Dealing with identifiers, string constants.
     • Data structures:
       • A string table that stores the lexeme values.
       • To avoid inserting the same lexeme multiple times, we maintain an id table that records all identifiers found. The id table holds pointers into the string table.
     • Implementation of the id table: hash table, linked list, tree, …
       • The hash table implementation is on pages 433-436.
     [Figure: id table entries cp, n, match, last, i, j pointing into the string table: c p '\0' n '\0' m a t c h '\0' l a s t '\0' i '\0' j '\0']

  12. Some code pieces for the id table:
         #define STRINGTABLELENGTH 20000
         #define PRIME 997
         struct HashItem {
           int index;
           struct HashItem *next;
         };
         struct HashItem *HashTable[PRIME];
         char StringTable[STRINGTABLELENGTH];
         int StringTableIndex = 0;
         int HashFunction(char *s);   /* copy from page 436 */
         int HashInsert(char *s);
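The declarations above can be filled in roughly as follows. This is a sketch under assumptions: the hash function is a simple stand-in, not the textbook function the slide refers to, and the return convention of HashInsert (string table offset, or -1 when out of space) is a guess at the intended interface.

```c
#include <stdlib.h>
#include <string.h>

#define STRINGTABLELENGTH 20000
#define PRIME 997

struct HashItem {
    int index;               /* offset of the lexeme in StringTable */
    struct HashItem *next;   /* next item in the same bucket */
};

struct HashItem *HashTable[PRIME];
char StringTable[STRINGTABLELENGTH];
int StringTableIndex = 0;

/* Stand-in hash (assumption), not the function from the textbook. */
int HashFunction(char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31u + (unsigned char)*s++;
    return (int)(h % PRIME);
}

/* Returns the StringTable offset of s, storing it on first sight.
   Returns -1 if the string table or memory is exhausted. */
int HashInsert(char *s)
{
    int b = HashFunction(s);
    size_t len = strlen(s) + 1;          /* include the '\0' */
    struct HashItem *p;

    /* Already recorded? Then reuse the existing copy. */
    for (p = HashTable[b]; p != NULL; p = p->next)
        if (strcmp(StringTable + p->index, s) == 0)
            return p->index;

    if (StringTableIndex + len > STRINGTABLELENGTH)
        return -1;
    p = malloc(sizeof *p);
    if (p == NULL)
        return -1;
    memcpy(StringTable + StringTableIndex, s, len);
    p->index = StringTableIndex;
    p->next = HashTable[b];              /* push onto the bucket's chain */
    HashTable[b] = p;
    StringTableIndex += (int)len;
    return p->index;
}
```

Inserting cp, n and match in that order places them at offsets 0, 3 and 5, and a second insert of n returns 3 again without growing the string table.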

  13. Internal representation of string constants:
     • Needs conversion for the special characters.
       • "abc" ==> 'a' 'b' 'c' '\0'
       • "abc\"def" ==> 'a' 'b' 'c' '"' 'd' 'e' 'f' '\0'
       • "abc\n" ==> 'a' 'b' 'c' '\n' '\0'
     • Recognizing constant strings with special characters
       • Assuming a string cannot cross a line boundary.
       • Use yymore()
         \"[^"\n]*  { int c;
                      c = input();
                      if (c != '"') { /* error */ }
                      else if (yytext[yyleng-1] == '\\') { unput(c); yymore(); }
                      else { /* found the whole string, normal processing */ }
                    }
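The conversion to internal form can be sketched as a small routine applied to the text between the quotes. The name `unescape` is hypothetical, and the set of escapes handled (\n, \t, and any other character standing for itself, which covers \" and \\) is an assumption; the slide only shows \" and \n.

```c
/* Convert the body of a string constant (without the surrounding
   quotes) to its internal form in dst. dst must have room for the
   result plus '\0'. Returns the number of characters stored. */
int unescape(const char *src, char *dst)
{
    int n = 0;

    while (*src) {
        if (*src == '\\' && src[1] != '\0') {
            src++;                        /* look at the escaped character */
            switch (*src) {
            case 'n': dst[n++] = '\n'; break;
            case 't': dst[n++] = '\t'; break;
            default:  dst[n++] = *src; break;   /* \" and \\ map to themselves */
            }
        } else {
            dst[n++] = *src;              /* ordinary character */
        }
        src++;
    }
    dst[n] = '\0';
    return n;
}
```

The result is never longer than the input, so a destination buffer of strlen(src) + 1 bytes is always large enough.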

  14. Put it all together
     • Check out the token.l program.
