
The scanning process


donar

Presentation Transcript


  1. The scanning process
     • Main goal: recognize words/tokens.
     • Snapshot:
       • At any point in time, the scanner has read some input and is on the way to identifying what kind of token has been read (e.g. identifier, operator, integer literal).
       • Once the scanner identifies a token, it sends it off to the parser and starts over with the next word.
     • Some tokens need additional data to be carried along with them.
       • For example, an identifier token needs to have the identifier itself attached to it.
     • Alternatively, the scanner generates a file of tokens which is then input to the parser.
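
A token with attached data can be sketched as a small record. The following Python sketch is illustrative only: the `Token` class, the `scan_words` function, and the token kind names are assumptions, not part of the slides, and the "scanner" here naively splits on whitespace instead of recognizing tokens character by character.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical token record: a kind plus optional attached data."""
    kind: str                       # e.g. "IDENT", "INTEGER", "OPERATOR"
    value: Optional[object] = None  # the identifier text, the integer value, ...


def scan_words(text: str) -> Iterator[Token]:
    """Yield one token per whitespace-separated word (deliberately naive)."""
    for word in text.split():
        if word.isascii() and word.isdigit():
            yield Token("INTEGER", int(word))   # integer value travels with the token
        elif word.isidentifier():
            yield Token("IDENT", word)          # identifier text travels with the token
        else:
            yield Token("OPERATOR", word)


# A parser could pull tokens one at a time from the generator;
# materializing the list corresponds to the "file of tokens" alternative.
tokens = list(scan_words("count = 42"))
```

Using a generator mirrors the first arrangement on the slide (the parser asks for the next token when it needs one), while `list(...)` mirrors the second (all tokens produced up front).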

  2. The scanning process
     • A simple hand-written scanner would look a bit like this:

         …
         nextchar = getNextChar();
         switch (nextchar) {
           case '(':
             return LPAREN;              /* return LPAREN token */
           case '0': case '1': ... case '9':
             start the integer with the digit in nextchar
             nextchar = getNextChar();
             while (nextchar is a digit) {
               concat the digit to the integer being built
               nextchar = getNextChar();
             }
             putBack(nextchar);          /* this character belongs to the next token */
             make a new INTEGER token with the integer value attached
             return INTEGER;
           ...
         }
         …
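
For readers who want to run the idea, here is a minimal Python sketch of the same loop. It covers only the two cases the slide shows (left parenthesis and integer literal). `CharReader`, `is_digit`, and `next_token` are hypothetical names; `get_next_char` and `put_back` stand in for the slide's `getNextChar` and `putBack`.

```python
class CharReader:
    """Character source with push-back of the most recently read character."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def get_next_char(self):
        """Return the next character, or '' at end of input."""
        if self.pos >= len(self.text):
            return ""
        ch = self.text[self.pos]
        self.pos += 1
        return ch

    def put_back(self, ch):
        """Un-read the last character; a no-op for the end-of-input marker."""
        if ch != "":
            self.pos -= 1


def is_digit(ch):
    return len(ch) == 1 and "0" <= ch <= "9"


def next_token(reader):
    """Return a (kind, value) pair for the next token."""
    ch = reader.get_next_char()
    if ch == "":
        return ("EOF", None)
    if ch == "(":
        return ("LPAREN", None)
    if is_digit(ch):
        value = 0
        while is_digit(ch):
            value = value * 10 + int(ch)   # concat the digit to the integer
            ch = reader.get_next_char()
        reader.put_back(ch)                # first non-digit belongs to the next token
        return ("INTEGER", value)
    raise ValueError(f"unexpected character {ch!r}")
```

The push-back matters because the scanner only learns that the integer has ended by reading one character too many.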

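  3. The scanning process
     • Not always as simple as it seems
       • Example from old versions of FORTRAN: DO 5 I=1,10 vs. DO 5 I=1.10
       • Because blanks are not significant there, the first is the header of a DO loop, while the second assigns 1.10 to a variable named DO5I. The scanner cannot tell them apart until it reaches the comma or the period.
     • Hand-written vs. automated scanner
       • Instead of writing a scanner by hand, we can automate the process.
       • Specify what needs to be recognized and what to do when something is recognized.
       • Have a scanner generator create the scanner based on our specification.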
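
The FORTRAN pair can only be told apart by looking ahead past the `=`. The sketch below shows one way to do that in Python. It is a simplification under stated assumptions: `classify` and `has_top_level_comma` are hypothetical names, blanks are simply deleted, and string literals, continuation lines, and other statement forms are ignored.

```python
import re


def has_top_level_comma(text):
    """True if text contains a comma that is not nested inside parentheses."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            return True
    return False


def classify(statement):
    """Classify a statement as 'DO loop' or 'assignment' (simplified)."""
    compact = statement.replace(" ", "").upper()   # blanks are not significant
    # DO <label> <variable> = ...
    header = re.match(r"DO\d+[A-Z][A-Z0-9]*=", compact)
    if header and has_top_level_comma(compact[header.end():]):
        return "DO loop"
    return "assignment"
```

With this sketch, `classify("DO 5 I=1,10")` yields `"DO loop"` and `classify("DO 5 I=1.10")` yields `"assignment"`. The decision depends on a character that appears well after the keyword-like prefix `DO`.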

  4. The scanning process
     • Specify what needs to be recognized.
     • Some tokens are easy to identify
       • e.g. = is an assignment operator, ( is a parenthesis
     • Others are more complex
       • How would the scanner recognize an identifier? The set of possible identifiers is very large or even infinite (assuming no length restrictions)
     • SOLUTION: Recognize a pattern!
       • Example: An identifier is a sequence of letters or digits that starts with a letter.
     • We need a way to describe this pattern to our scanner generator.
     • Regular expressions come to the rescue!

  5. The scanning process
     • Definition: Regular expressions (over alphabet Σ)
       • ε is an RE denoting {ε}
       • If a ∈ Σ, then a is an RE denoting {a}
       • If r and s are REs, then
         • (r) is an RE denoting L(r)
         • r|s is an RE denoting L(r) ∪ L(s)
         • rs is an RE denoting L(r)L(s)
         • r* is an RE denoting the Kleene closure of L(r)
     • Property: REs are closed under many operations
       • This allows us to build complex REs.
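
To make the definition concrete, the languages can be modeled as Python sets of strings. This is only an illustration: the function names are assumptions, and because the Kleene closure is infinite, `kleene` takes a hypothetical `max_repeats` bound and returns a finite prefix of it.

```python
def union(l1, l2):
    """L(r|s): strings in either language."""
    return l1 | l2


def concat(l1, l2):
    """L(rs): every string of L(r) followed by every string of L(s)."""
    return {x + y for x in l1 for y in l2}


def kleene(lang, max_repeats):
    """Finite approximation of L(r*): zero up to max_repeats repetitions."""
    result = {""}          # zero repetitions gives the empty string
    layer = {""}
    for _ in range(max_repeats):
        layer = concat(layer, lang)
        result |= layer
    return result


A = {"a"}   # language of the RE a
B = {"b"}   # language of the RE b

a_or_b = union(A, B)                     # language of a|b
a_then_b = concat(A, B)                  # language of ab
a_star = kleene(A, 3)                    # a* truncated at 3 repetitions
a_or_b_star = kleene(union(A, B), 2)     # (a|b)* truncated at 2 repetitions
```

Because each operation takes languages and returns a language, the results can be fed back into the same operations, which is what makes it possible to build complex REs from simple ones.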

  6. Regular Definitions
     • A regular expression that describes digits is: 0|1|2|3|4|5|6|7|8|9
     • For convenience, we'd like to give it a name and then use the name in building more complex regular expressions: digit → 0|1|2|3|4|5|6|7|8|9
     • This is called a regular definition.
     • Example
       • letter → a|...|z|A|...|Z
       • ident → letter (letter | digit)*
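
The same regular definitions can be tried out with Python's `re` module, treating each named definition as a string that later definitions splice in. The variable names mirror the slide. The character classes `[0-9]` and `[a-zA-Z]` are used here as shorthand for the explicit alternations, and `is_identifier` is a hypothetical helper.

```python
import re

digit = "[0-9]"                           # digit  -> 0|1|...|9
letter = "[a-zA-Z]"                       # letter -> a|...|z|A|...|Z
ident = f"{letter}({letter}|{digit})*"    # ident  -> letter (letter | digit)*

IDENT_RE = re.compile(ident)


def is_identifier(text):
    """True if the whole string matches the ident definition."""
    return IDENT_RE.fullmatch(text) is not None
```

For example, `is_identifier("x1")` holds while `is_identifier("1x")` does not, since the pattern requires a leading letter.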

  7. What’s next
     • Given an input string, we need a “machine” that has a regular expression hard-coded in it and can tell whether the input string matches the pattern described by the regular expression or not.
     • For languages described by regular expressions, a machine that determines whether a given string belongs to the language is called a finite automaton.

  8. The scanning process
     • Definition: Deterministic Finite Automaton (DFA)
       • a five-tuple (Σ, S, δ, s0, F) where
         • Σ is the alphabet
         • S is the set of states
         • δ is the transition function (S × Σ → S)
         • s0 is the starting state
         • F is the set of final states (F ⊆ S)
     • Notation:
       • Use a transition diagram to describe a DFA
         • states are nodes, transitions are directed, labeled edges, some states are marked as final, one state is marked as starting
     • If the automaton stops at a final state on end of input, then the input string belongs to the language.
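
A DFA for the identifier pattern (a letter followed by letters or digits) can be simulated with a transition table. The sketch below is one possible encoding, not the only one. The state names are made up, the alphabet is collapsed into the character classes "letter", "digit", and "other" to keep the table small, and a missing table entry is treated as an implicit dead state.

```python
def char_class(ch):
    """Map a character to the symbol class used in the transition table."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"


START = "s0"          # starting state
IN_IDENT = "s1"       # at least one letter has been read

# Transition function as a table: (state, symbol class) -> next state.
DELTA = {
    (START, "letter"): IN_IDENT,
    (IN_IDENT, "letter"): IN_IDENT,
    (IN_IDENT, "digit"): IN_IDENT,
}

FINAL = {IN_IDENT}    # set of final states


def accepts(text):
    """Run the DFA over text; accept if it ends in a final state."""
    state = START
    for ch in text:
        state = DELTA.get((state, char_class(ch)))
        if state is None:      # no transition defined: reject
            return False
    return state in FINAL
```

The loop does a constant amount of work per input character, which is the main reason scanners are implemented as DFAs.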

  9. The scanning process
     • Goal: automate the process
     • Idea:
       • Start with an RE
       • Build a DFA
         • How?
           • We can build a non-deterministic finite automaton (Thompson's construction)
           • Convert that to a deterministic one (Subset construction)
           • Minimize the DFA (Hopcroft's algorithm)
       • Implement it
     • Existing scanner generator: flex
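
Of the three steps, the subset construction is the easiest to show in a few lines. The Python sketch below covers only that step and makes several assumptions: the NFA for `a*b` is written by hand in a Thompson-like shape rather than generated, `None` stands for an ε-transition, all function and variable names are hypothetical, and no minimization is performed.

```python
def epsilon_closure(states, nfa):
    """All NFA states reachable from `states` using only epsilon moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, None), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction(nfa, start, finals, alphabet):
    """Build a DFA whose states are sets of NFA states."""
    dfa_start = epsilon_closure({start}, nfa)
    dfa = {}                       # (DFA state, symbol) -> DFA state
    seen = {dfa_start}
    worklist = [dfa_start]
    while worklist:
        current = worklist.pop()
        for sym in alphabet:
            moved = set()
            for s in current:
                moved |= nfa.get((s, sym), set())
            if not moved:
                continue           # no transition: implicit dead state
            target = epsilon_closure(moved, nfa)
            dfa[(current, sym)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    dfa_finals = {state for state in seen if state & finals}
    return dfa, dfa_start, dfa_finals


def dfa_accepts(dfa, dfa_start, dfa_finals, text):
    state = dfa_start
    for ch in text:
        state = dfa.get((state, ch))
        if state is None:
            return False
    return state in dfa_finals


# Hand-written NFA for a*b; state 0 is the start, state 4 is final.
NFA = {
    (0, None): {1, 3},
    (1, "a"): {2},
    (2, None): {1, 3},
    (3, "b"): {4},
}

dfa, dfa_start, dfa_finals = subset_construction(NFA, 0, {4}, "ab")
```

Each DFA state is a `frozenset` of NFA states so that it can serve as a dictionary key. A DFA state is final when it contains at least one final NFA state.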
