Scanning & Regular Expressions

Scanning & Regular Expressions. CPSC 388 Ellen Walker Hiram College. Scanning. Input: characters from the source code Output: Tokens Keywords: IF, THEN, ELSE, FOR … Symbols: PLUS, LBRACE, SEMI … Variable tokens: ID, NUM Augment with string or numeric value. TokenType.

Presentation Transcript


  1. Scanning & Regular Expressions CPSC 388 Ellen Walker Hiram College

  2. Scanning • Input: characters from the source code • Output: Tokens • Keywords: IF, THEN, ELSE, FOR … • Symbols: PLUS, LBRACE, SEMI … • Variable tokens: ID, NUM • Augment with string or numeric value

  3. TokenType
     • Enumerated type (a C++ construct)
         typedef enum {IF, THEN, ELSE …} TokenType;
     • IF, THEN, ELSE (etc.) are now literals of type TokenType

  4. Using TokenType
         void someFun(TokenType tt){
           …
           switch (tt){
             case IF: … break;
             case THEN: … break;
             …
           }
         }
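As a minimal sketch of the enum-plus-switch pattern on the two slides above: the token list here is a shortened, hypothetical one (the slides elide the full set), and `tokenName` is an illustrative helper that does not appear in the course code.

```cpp
#include <string>

// Hypothetical, shortened token list; the slides do not give the full set.
enum TokenType { IF, THEN, ELSE, ID, NUM };

// Map a token type to a printable name with a switch, one case per literal.
std::string tokenName(TokenType tt) {
    switch (tt) {
        case IF:   return "IF";
        case THEN: return "THEN";
        case ELSE: return "ELSE";
        case ID:   return "ID";
        case NUM:  return "NUM";
    }
    return "UNKNOWN";  // reached only if tt holds a value outside the enum
}
```

Calling `tokenName(THEN)` yields the string "THEN"; a helper like this is mostly useful for debugging scanner output.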

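  5. Token Class (partial)
         class Token {
         public:
           TokenType tokenval;   // which kind of token this is
           string tokenchars;    // the characters that make up the token
           double numval;        // numeric value, for NUM tokens
         };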

  6. Interlude: References and Pointers • Java has primitives and references • Primitives are int, char, double, etc. • References “point to” objects • C++ has only primitives • But, one of the primitives is “address”, which serves the purpose of a reference.

  7. Interlude: References and Pointers
     • To declare a pointer, put * after the type
         char x;   // a character
         char *y;  // a pointer to a character
     • Using pointers:
         x = 'a';
         y = &x;    // y gets the address of x
         *y = 'b';  // thing pointed at by y becomes 'b'
                    // note that x is now also 'b'!

  8. Interlude: References and Pointers
     • Continuing the example…
         cout << x << endl;           // prints b
         cout << *y << endl;          // prints b
         cout << (void *)y << endl;   // prints a hex address
         cout << (void *)&x << endl;  // same as above
         cout << &y << endl;          // a different address - where the pointer is stored
     • Note: the (void *) casts are needed because cout prints a plain char * as a C string, not as an address
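The pointer example from the two interlude slides can be collected into one small function. This is a sketch only; `pointerAliasDemo` is a hypothetical name, and the return value is added here just to make the aliasing checkable.

```cpp
#include <iostream>

// Walks through the slide example and reports whether writing through the
// pointer changed the original variable.
bool pointerAliasDemo() {
    char x = 'a';
    char *y = &x;   // y holds the address of x
    *y = 'b';       // write through the pointer, so x changes too

    std::cout << x << ' ' << *y << '\n';           // both show the same character
    std::cout << static_cast<void *>(y) << '\n';   // address of x
    std::cout << static_cast<void *>(&y) << '\n';  // address of the pointer itself

    // x was modified via y, and y lives at a different address than x.
    return x == 'b' && static_cast<void *>(y) != static_cast<void *>(&y);
}
```

The casts to `void *` matter: streaming a `char *` directly makes `cout` treat it as a C string rather than printing the address.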

  9. GetToken(): A scanning function • Token *GetToken(istream &sin) • Read characters from sin until a complete token is extracted, return (a pointer to) the token • Usually called by the parser • Note: version in the book uses global variables and returns only the token type

  10. Using GetToken
         Token *myToken = GetToken(cin);
         while (myToken != NULL){
           // process the token
           switch (myToken->tokenval){
             // cases for each token type
           }
           myToken = GetToken(cin);
         }

  11. Result of GetToken

  12. Tokens and Languages • The set of valid tokens of a particular type is a Language (in the formal sense) • More specifically, it is a Regular Language

  13. Language Formalities • Language: set of strings • String: sequence of symbols • Alphabet: set of legal symbols for strings • Generally Σ is used to denote an alphabet

  14. Example Languages • L1 = {aa, ab, bb}, Σ = {a, b} • L2 = {ε, ab, abab, …}, Σ = {a, b} • L3 = {strings of N a’s where N is an odd integer}, Σ = {a} • L4 = {ε} (one string with no symbols) • L5 = { } (no strings at all) • L5 = Ø

  15. Denoting Languages • Expressions (regular languages only) • Grammars • Set of rewrite rules that express all and only the strings in the language • Automata • Machines that “accept” all and only the strings in the language
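To make the "automata" option concrete, here is a sketch of a two-state machine that accepts exactly L3 from the example languages (an odd number of a's over the alphabet {a}). The function name `acceptsOddAs` is hypothetical, and encoding the state as a single `bool` is a simplification chosen for brevity.

```cpp
#include <string>

// Two-state automaton for L3 = { strings of N a's, N odd } over {a}.
// State "odd" is the accepting state; the start state is "even".
bool acceptsOddAs(const std::string &s) {
    bool odd = false;                // start state: zero a's seen (even)
    for (char c : s) {
        if (c != 'a') return false;  // symbol outside the alphabet: reject
        odd = !odd;                  // each a flips between the two states
    }
    return odd;                      // accept only if we end in the odd state
}
```

The machine accepts "a" and "aaa" but rejects the empty string and "aa", which is what "all and only the strings in the language" means for L3.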

  16. Primitive Regular Expressions • Ø • L(Ø) = {} (no strings) • ε • L(ε) = {ε} (one string, no symbols) • a where a is a member of Σ • L(a) = {a} (one string, one symbol)

  17. Combining Regular Expressions • Choice: r | s (sometimes r+s) • L(r | s) = L(r) ∪ L(s) • Concatenation: rs • L(rs) = L(r)L(s) • All combinations of 1 from r and 1 from s • Repetition: r* • L(r*) = {ε} ∪ L(r) ∪ L(rr) ∪ L(rrr) ∪ … • 0 or more strings from r concatenated

  18. Precedence • Repetition before concatenation • Concatenation before choice • Use parentheses to override • aa* vs. (aa)* • ab|c vs. a(b|c)
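The two precedence pairs above can be checked with the standard `<regex>` library, whose default (ECMAScript) syntax uses the same `|`, `*`, and parentheses. This is a sketch; `inLanguage` is a hypothetical helper name, and `std::regex_match` is used because it requires the whole string to match.

```cpp
#include <regex>
#include <string>

// True if the entire string s belongs to the language of the pattern.
bool inLanguage(const std::string &pattern, const std::string &s) {
    return std::regex_match(s, std::regex(pattern));
}
```

For example, `inLanguage("aa*", "aaa")` is true while `inLanguage("(aa)*", "aaa")` is false, because the star binds only to the second a in the first pattern. Likewise, "c" is in the language of `ab|c` but not of `a(b|c)`.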

  19. Example Languages • L1 = {aa, ab, bb}, Σ = {a, b} • L2 = {ε, ab, abab, …}, Σ = {a, b} • L3 = {strings of N a’s where N is an odd integer}, Σ = {a} • L4 = {ε} (one string with no symbols) • L5 = { } (no strings at all) • L5 = Ø

  20. R.E.’s for Examples • L1 = aa | ab | bb • L1 = a(a|b) | bb • L1 = aa | (a|b) b • L2 = (ab)* not ab* ! • L3 = a(aa)*
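One way to convince yourself that the three expressions for L1 really denote the same language is to enumerate every short string over the alphabet and keep the ones that match. The sketch below does that by brute force; `language` is a hypothetical helper, and the length bound only makes the check finite, so it is evidence rather than a proof.

```cpp
#include <regex>
#include <set>
#include <string>
#include <vector>

// All strings over `alphabet` of length <= maxLen that fully match `pattern`.
std::set<std::string> language(const std::string &pattern,
                               const std::string &alphabet, int maxLen) {
    std::regex re(pattern);
    std::set<std::string> result;
    std::vector<std::string> frontier{""};  // all strings of the current length
    for (int len = 0; len <= maxLen; ++len) {
        std::vector<std::string> next;
        for (const std::string &s : frontier) {
            if (std::regex_match(s, re)) result.insert(s);
            for (char c : alphabet) next.push_back(s + c);  // extend by one symbol
        }
        frontier = next;
    }
    return result;
}
```

Up to length 4 over {a, b}, all three L1 expressions produce exactly {aa, ab, bb}. The same helper shows why `(ab)*` and `ab*` differ: the first yields the empty string, ab, and abab, while the second yields strings such as a and abb.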

  21. What are these languages? • a* | b* | c* • a*b*c* • (a*b*)* • a(a|b)*c • (a|b|c)*bab(a|b|c)*

  22. What are the RE’s? • In the alphabet {a,b,c}: • All strings that are in alphabetical order • All strings that have the first a before the first b, before the first c, e.g. ababbabca • All strings that contain “abc” • All strings that do not contain “abc”

  23. Extended Reg. Exp’s
     • Additional operations for convenience
         r+ = rr* (one or more reps)
         . (any character in the alphabet)
         .* = any possible string from the alphabet
         [a-z] = a|b|c|…|z
         [^aeiou] = b|c|d|f|g|h|j...
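The extended operators are shorthand, so each one should agree with its expansion. The sketch below compares two patterns on a list of sample strings using `std::regex`; `agreeOn` is a hypothetical helper, and agreement on samples is only a spot check. One caveat: in `std::regex` a class such as `[^aeiou]` matches any character that is not a listed vowel, including digits and punctuation, whereas the slide's expansion assumes the alphabet is the lowercase letters.

```cpp
#include <regex>
#include <string>
#include <vector>

// True if the two patterns give the same match / no-match answer
// on every sample string (whole-string matching).
bool agreeOn(const std::string &p1, const std::string &p2,
             const std::vector<std::string> &samples) {
    std::regex r1(p1), r2(p2);
    for (const std::string &s : samples) {
        if (std::regex_match(s, r1) != std::regex_match(s, r2)) return false;
    }
    return true;
}
```

For instance, `(ab)+` and `(ab)(ab)*` agree on the empty string, "ab", "abab", and "aba", and `[a-c]` agrees with `a|b|c` on single letters.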
