
Specification of tokens using regular expressions


zeki

Presentation Transcript


  1. Specification of tokens using regular expressions

  2. Strings and Languages
     • An alphabet is any finite set of symbols.
     • Examples of symbols are letters, digits, and punctuation.
     • The set {0,1} is the binary alphabet.
     • A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
     • "Sentence" and "word" are often used as synonyms for "string."
     • The empty string, denoted ε, is the string of length zero.
     • A language is any countable set of strings over some fixed alphabet.
     • The empty set ∅ and the set {ε}, which contains only the empty string, are both languages.
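The definitions on this slide can be tried out directly. The following is a minimal Python sketch, not part of the original slides; the names `BINARY`, `EPSILON`, `is_string_over` and `strings_up_to` are chosen here for illustration only.

```python
from itertools import product

# The binary alphabet from the slide: a finite set of symbols.
BINARY = {"0", "1"}

# The empty string has length zero.
EPSILON = ""

def is_string_over(s, alphabet):
    """Return True if every symbol of s is drawn from the alphabet."""
    return all(ch in alphabet for ch in s)

def strings_up_to(alphabet, max_len):
    """List all strings over the alphabet with length <= max_len, shortest first."""
    symbols = sorted(alphabet)
    result = []
    for n in range(max_len + 1):
        for symbols_tuple in product(symbols, repeat=n):
            result.append("".join(symbols_tuple))
    return result

print(is_string_over("0110", BINARY))  # True
print(is_string_over("012", BINARY))   # False
print(strings_up_to(BINARY, 2))        # ['', '0', '1', '00', '01', '10', '11']
```

The set of all strings over an alphabet is infinite, so the sketch only enumerates a finite prefix of it; any subset of that set is a language in the sense defined above.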

  3. Regular expression
     • Regular expressions are an important notation for specifying lexeme patterns. They cannot express every possible pattern, but they are effective at specifying the kinds of patterns needed for tokens.
     • For example, identifiers can be specified as: identifier = letter (letter | digit)*
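The identifier pattern can be tested with Python's `re` module. This is a sketch under two assumptions that the slide does not state: "letter" is taken to mean the ASCII letters and "digit" the characters 0 to 9. The helper name `is_identifier` is hypothetical.

```python
import re

# letter (letter | digit)* written with character classes.
# Assumption: letter = A-Z or a-z, digit = 0-9.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def is_identifier(text):
    """True only if the whole string matches the identifier pattern."""
    return IDENTIFIER.fullmatch(text) is not None

for candidate in ["count", "x1", "1x", "", "a_b"]:
    print(repr(candidate), is_identifier(candidate))
```

Only `count` and `x1` are accepted: `1x` starts with a digit, the empty string has no leading letter, and `a_b` contains an underscore, which this version of "letter" does not include.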

  4. Regular expressions are built recursively out of smaller regular expressions, using the rules described below.
     • Regular expression construction rules
     • ε is a regular expression denoting {ε}, that is, the language containing only the empty string.
     • If a is a symbol in Σ (the alphabet), then a is a regular expression denoting {a}, the language containing only the one-symbol string a.
     • If r and s are regular expressions denoting languages L(r) and L(s) respectively, then
        • (r)|(s) is a regular expression denoting L(r) ∪ L(s)
        • (r)(s) is a regular expression denoting the concatenation L(r)L(s)
        • (r)* is a regular expression denoting (L(r))*
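The construction rules can be read as operations on sets of strings. The sketch below models languages as Python sets; the function names `union`, `concat` and `star` are illustrative, and because (L(r))* is infinite in general, `star` takes an assumed `max_len` bound and returns only the strings up to that length.

```python
def union(l1, l2):
    """L(r) U L(s): strings that are in either language."""
    return l1 | l2

def concat(l1, l2):
    """L(r)L(s): a string from l1 followed by a string from l2."""
    return {x + y for x in l1 for y in l2}

def star(lang, max_len):
    """Strings of (L)* with length <= max_len."""
    result = {""}      # zero repetitions yields the empty string
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more string from lang,
        # keeping only unseen strings within the length bound.
        frontier = {
            x + y for x in frontier for y in lang if len(x + y) <= max_len
        } - result
        result |= frontier
    return result

A = {"a"}   # language denoted by the symbol a
B = {"b"}   # language denoted by the symbol b

print(sorted(union(A, B)))              # ['a', 'b']
print(sorted(concat(A, B)))             # ['ab']
print(sorted(star(A, 3)))               # ['', 'a', 'aa', 'aaa']
print(sorted(concat(union(A, B), B)))   # ['ab', 'bb']
```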

  5. Precedence of operations
     • The unary operator * has the highest precedence and is left associative.
     • Concatenation has the second highest precedence and is left associative.
     • | has the lowest precedence and is left associative.
     • For any regular expressions R, S and T, the following axioms hold:
        • R|S = S|R (| is commutative)
        • R|(S|T) = (R|S)|T (| is associative)
        • R(ST) = (RS)T (concatenation is associative)
        • R(S|T) = RS|RT (concatenation distributes over |)
        • εR = Rε = R (ε is the identity for concatenation)
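Python's `re` module uses the same precedence for `*`, concatenation and `|`, so the rules can be checked experimentally. The sketch below compares patterns on every string up to an assumed length bound over an assumed alphabet `abc`; agreement on a bounded set is evidence, not a proof, and the helper name `language` is hypothetical.

```python
import re
from itertools import product

def language(pattern, alphabet="abc", max_len=4):
    """Strings over the alphabet, up to max_len, that the pattern matches in full."""
    compiled = re.compile(pattern)
    candidates = (
        "".join(symbols)
        for n in range(max_len + 1)
        for symbols in product(alphabet, repeat=n)
    )
    return {s for s in candidates if compiled.fullmatch(s)}

# Precedence: a|bc* groups as a|(b(c*)), not as (a|b)c*.
same_grouping = language("a|bc*") == language("a|(b(c*))")
other_grouping = language("a|bc*") == language("(a|b)c*")

# Axioms, with R = a, S = b*, T = c (parenthesized so each acts as one unit).
R, S, T = "(a)", "(b*)", "(c)"
commutative = language(f"{R}|{S}") == language(f"{S}|{R}")
distributive = language(f"{R}({S}|{T})") == language(f"{R}{S}|{R}{T}")
identity = language(f"(){R}") == language(R) == language(f"{R}()")

print(same_grouping, other_grouping)          # True False
print(commutative, distributive, identity)    # True True True
```

The empty group `()` stands in for ε here, since it matches only the empty string.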

  6. The regular expression a|b denotes the language {a, b}.
     • (a|b)(a|b) denotes {aa, ab, ba, bb}, the set of all strings of length two over the alphabet {a, b}.
     • a* denotes the language consisting of all strings of zero or more a's, that is, {ε, a, aa, aaa, ...}.
     • (a|b)* denotes the set of all strings consisting of zero or more instances of a or b.
     • a|a*b denotes the language {a, b, ab, aab, aaab, ...}, that is, the string a together with all strings of zero or more a's followed by one b.
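These examples can be spot-checked with Python's `re.fullmatch`, which succeeds only when the entire string belongs to the language of the pattern. The dictionary names `members` and `non_members` are illustrative; the sample strings are taken from the sets listed above, plus a few counterexamples.

```python
import re

# Pattern -> sample strings that should be in its language ("" plays the role of ε).
members = {
    "a|b": ["a", "b"],
    "(a|b)(a|b)": ["aa", "ab", "ba", "bb"],
    "a*": ["", "a", "aa", "aaa"],
    "(a|b)*": ["", "a", "abba", "bbb"],
    "a|a*b": ["a", "b", "ab", "aab", "aaab"],
}

# Pattern -> sample strings that should NOT be in its language.
non_members = {
    "a|b": ["", "ab"],
    "(a|b)(a|b)": ["a", "aba"],
    "a*": ["b", "ab"],
    "a|a*b": ["aa", "ba", ""],
}

for pattern, samples in members.items():
    for s in samples:
        assert re.fullmatch(pattern, s), (pattern, s)

for pattern, samples in non_members.items():
    for s in samples:
        assert re.fullmatch(pattern, s) is None, (pattern, s)

print("all examples behave as described")
```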

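  7. Regular definition
     • We may wish to give names to certain regular expressions and use those names in subsequent expressions as if the names were themselves symbols.
     • Each definition has the form d_i -> r_i, where d_i is a new name and r_i is a regular expression over the alphabet and the previously defined names.
     • E.g. for the language of C identifiers:
        • letter_ -> A|B|…|Z|a|b|…|z|_
        • digit -> 0|1|…|9
        • id -> letter_(letter_|digit)*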
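A regular definition can be mimicked in Python by building each pattern from previously named ones. In this sketch the dictionary `definitions` is a hypothetical stand-in for the list of d_i -> r_i rules, and character classes replace the long alternations A|B|…|Z|a|b|…|z|_ and 0|1|…|9.

```python
import re

# Each name maps to a pattern that may refer to names defined earlier.
definitions = {}
definitions["letter_"] = "[A-Za-z_]"
definitions["digit"] = "[0-9]"
# id -> letter_ (letter_ | digit)*, expanded by substituting the earlier names.
definitions["id"] = "{letter_}({letter_}|{digit})*".format(**definitions)

ID = re.compile(definitions["id"])

print(definitions["id"])                      # [A-Za-z_]([A-Za-z_]|[0-9])*
print(ID.fullmatch("_tmp1") is not None)      # True
print(ID.fullmatch("9lives") is not None)     # False
```

Because each name is expanded before it is used, a definition can only refer to names that come earlier in the list, so no definition can refer to itself.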

  8. Extensions of regular expressions
     • + : one or more instances
     • * : zero or more instances
     • ? : zero or one instance
     • [ ] : character classes, e.g. [a-z]
     • ws -> (blank | tab | newline)+
     • When ws is recognized, the lexical analyzer does not return a token; it restarts scanning at the character following the whitespace.
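The treatment of ws can be sketched as a small scanner loop in Python. Only the ws pattern comes from the slide; the `id` and `number` token classes, the `TOKEN_SPEC` table and the `tokenize` function are assumptions added to make the example runnable. The `number` pattern also shows the `+`, `?` and `[ ]` extensions together.

```python
import re

# Assumed token classes for illustration; only ws is taken from the slide.
TOKEN_SPEC = [
    ("ws", r"[ \t\n]+"),                    # (blank | tab | newline)+
    ("id", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("number", r"[0-9]+(?:\.[0-9]+)?"),     # digits, then an optional fraction
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize(text):
    """Return (token_name, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        # For ws nothing is returned; scanning restarts after the whitespace.
        if match.lastgroup != "ws":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("x1 \t y\n42 3.5"))
# [('id', 'x1'), ('id', 'y'), ('number', '42'), ('number', '3.5')]
```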
