
Chapter 2 Lexical Analysis



  1. Chapter 2 Lexical Analysis Dr. Frank Lee

  2. 2.1 File Processing and Tokens
  • The lexical analysis phase of compilation does exactly what you do unconsciously when you read: it breaks the text file (program) into smaller chunks.
  • These chunks are called tokens, and the lexical analysis phase of the compiler is often called tokenization.
  • One efficient but complex brute-force approach is to read the input character by character, checking at each step whether the characters read so far can still form a valid token.
  • Fig. 2.2 (p7) shows partial code for this brute-force lexical analyzer (see the next slide).
  • What we would like is a set of tools that lets us easily create and modify a lexical analyzer that has the same run-time efficiency as the brute-force method.
  • The first tool we use to attack this problem is the Deterministic Finite Automaton (DFA).

  3. Partial code for a brute-force lexical analyzer
     if ((c = nextchar()) == 'c') {          // parentheses needed: == binds tighter than =
         if ((c = nextchar()) == 'l') {
             // code to handle the rest of either "class" or any identifier
             // that starts with "cl"
         } else if (c == 'a') {
             // code to handle the rest of either "case" or any identifier
             // that starts with "ca"
         } else {
             // code to handle any identifier that starts with "c"
         }
     } else if ((c = ...)) {
         // many more nested ifs for the rest of the lexical analyzer
     }

  4. 2.2 Deterministic Finite Automata
  2.2.1 Informal Definition of a DFA
  • A DFA is a kind of specialized flowchart for text processing.
  • "Deterministic" means the next step is completely determined by the current state and the next symbol.
  • "Finite" means the number of states is finite.
  • "Automata" means machines; "automaton" is the singular form.

  5. 2.2.1 Informal Definition of a DFA
  • The purpose of a DFA is to recognize strings.
  • When a string is fed into a DFA, the DFA either accepts the string or rejects it.
  • A DFA is a collection of states and transitions.
  • Given the input string, the transitions tell us how to move among the states.
  • One of the states is designated as the initial state, and a subset of the states are final states.
  • We start in the initial state, move from state to state via the transitions, and after consuming every character of the string we check whether we have ended in a final state.
  • If we have, the string is accepted; otherwise it is rejected.
  • See Figures 2.3 to 2.6 (pp8-9).

  6. 2.2.2 Formal Definition of a DFA
  • A DFA is a quintuple M = (Q, ∑, δ, q0, F), where "quintuple" means the machine M has five components:
  • Q is a finite set of states
  • ∑ is a finite set called the alphabet
  • δ is a total function from (Q x ∑) to Q known as the transition function (a function that takes a state and a symbol as inputs and returns a state)
  • q0 ∈ Q is the start state, and
  • F ⊆ Q is the set of final states (see p9)
  • See page 10 for a detailed description of DFA; a sketch of the quintuple in code follows this slide.
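  The quintuple maps directly onto code. Below is a minimal Java sketch (our illustration, not from the text; all names are our own). Q and ∑ are left implicit in the transition table, and a missing entry in δ stands for a transition to an implicit dead state, so the string is rejected:

     import java.util.Map;
     import java.util.Set;

     class DFA {
         private final Map<Integer, Map<Character, Integer>> delta; // transition function δ
         private final int start;                                   // start state q0
         private final Set<Integer> finals;                         // final states F

         DFA(Map<Integer, Map<Character, Integer>> delta, int start, Set<Integer> finals) {
             this.delta = delta;
             this.start = start;
             this.finals = finals;
         }

         // Run the DFA over the string; accept iff we end in a final state.
         boolean accepts(String s) {
             int state = start;
             for (char c : s.toCharArray()) {
                 Map<Character, Integer> row = delta.get(state);
                 if (row == null || !row.containsKey(c)) return false; // no transition: reject
                 state = row.get(c);
             }
             return finals.contains(state);
         }
     }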

  7. 2.2.3 Representing DFA
  • Figure 2.7 (p11) defines a DFA that accepts all strings over {a, b} that start with "a" and end with "b" (a code version of this language follows this slide).
  • Fig. 2.8 shows a DFA equivalent to the formal definition of Fig. 2.7.
  • DFA Example:
  • Design a DFA for a newspaper vending machine if each copy of the newspaper costs 35 cents and the machine accepts only nickels, dimes and quarters.
  • Note that the machine gives no change: for example, if the customer puts in 40 cents, one copy of the newspaper comes out and no change is returned.
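  Using the illustrative DFA class from the sketch above, the language of Fig. 2.7 (starts with "a", ends with "b") could be written as follows; the state numbering here is ours, not the book's:

     Map<Integer, Map<Character, Integer>> delta = Map.of(
         0, Map.of('a', 1),               // must begin with 'a'
         1, Map.of('a', 1, 'b', 2),
         2, Map.of('a', 1, 'b', 2));      // in state 2 the last symbol read was 'b'
     DFA d = new DFA(delta, 0, Set.of(2));
     d.accepts("ab");   // true
     d.accepts("aab");  // true
     d.accepts("ba");   // false: does not start with 'a'
     d.accepts("a");    // false: does not end with 'b'

  The newspaper machine can be modeled the same way: one state per total amount inserted (0, 5, 10, ..., 30 cents), with a single accepting state for "35 cents or more".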

  8. 2.2.4 Combining DFA
  • Figure 2.9 (p11) shows a DFA that accepts the strings "new" and "null": states 0 through 6, with transitions 0 -n-> 1 -e-> 2 -w-> 3 and 1 -u-> 4 -l-> 5 -l-> 6, where states 3 and 6 are final (a code version follows this slide).
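  As a sketch, again using the illustrative DFA class from above, the combined machine of Fig. 2.9 becomes a small transition table:

     Map<Integer, Map<Character, Integer>> delta = Map.of(
         0, Map.of('n', 1),
         1, Map.of('e', 2, 'u', 4),   // branch: "ne..." toward "new", "nu..." toward "null"
         2, Map.of('w', 3),           // state 3 accepts "new"
         4, Map.of('l', 5),
         5, Map.of('l', 6));          // state 6 accepts "null"
     DFA newOrNull = new DFA(delta, 0, Set.of(3, 6));
     newOrNull.accepts("new");   // true
     newOrNull.accepts("null");  // true
     newOrNull.accepts("nul");   // false: ends in non-final state 5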

  9. 2.2.5 More Complicated DFA
  • Fig. 2.10 (p12): a DFA that accepts all legal simpleJava identifiers
  • Fig. 2.11 (p13): results of running various strings through the DFA of Fig. 2.10
  • Fig. 2.12 (p13): a DFA that accepts "new", "null" and all legal identifiers in simpleJava
  • How would you design a DFA for a Coke vending machine if each can of Coke costs 75 cents and the machine accepts only dimes and quarters? Note that this machine also gives no change.
  • See Appendix A (p318) for the simpleJava Reference Manual
  • See Appendix B (p332) for combining DFA in detail

  10. 2.3 Regular Expressions
  • Regular expressions (RE) allow us to describe sets of strings using two basic operators: │ (meaning OR, or union) and * (meaning zero or more repetitions).
  • ε stands for the empty string
  • Examples (a regex-tool version of these follows this slide):
  • (aab│bba)
  • a(ab│bc)
  • (a│b)(c│d)
  • (ab)* describes {ε, "ab", "abab", "ababab", …}
  • a(bc)*d
  • (ab│cd)*
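  These operators carry over almost verbatim to java.util.regex, which is a handy way to experiment (the book's │ is written |, and matches() tests the whole string). A small sketch of ours:

     import java.util.regex.Pattern;

     Pattern p = Pattern.compile("(ab)*");      // zero or more repetitions of "ab"
     p.matcher("").matches();        // true: zero repetitions, i.e. the empty string ε
     p.matcher("abab").matches();    // true
     p.matcher("aba").matches();     // false

     Pattern q = Pattern.compile("a(ab|bc)");   // "a" followed by "ab" or "bc"
     q.matcher("aab").matches();     // true
     q.matcher("abc").matches();     // true
     q.matcher("acd").matches();     // false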

  11. 2.3.1 Formal Languages and Strings
  • See Definitions 1 to 8 (pp14-15)
  • Def 1: Alphabet
  • Def 2: String
  • Def 3: String Length
  • Def 4: Empty String
  • Def 5: Formal Language
  • Def 6: Concatenation (i.e. AND): x○y (= xy)
  • Def 7: Language Concatenation: L ○ M (= LM)
  • Def 8: Kleene Closure: the Kleene closure of a formal language L is denoted L* = {ε} ∪ L ∪ L² ∪ L³ ∪ L⁴ ∪ …
  • Examples (see Definition 8 on p15):
  • If L = {a} then L* = {ε, a, aa, aaa, …}
  • If L = {bc} then L* = {ε, bc, bcbc, bcbcbc, …}
  • If L = {a, b} then L* = all strings over {a, b}, including ε
  • If L = {a, b, c} then L* = all strings over {a, b, c}, including ε

  12. 2.3.2 Regular Expression Definition
  • A regular expression is a way to describe a formal language over an alphabet ∑
  • For any regular expression α, we will use L[α] to denote the formal language described by α
  • Regular expressions are defined recursively on page 15
  • See Table 2.2 (p16) for some examples

  13. 2.3.3 Regular Expression Shorthand
  • (a○b) is written (ab)
  • ((a│b)│c) is written a│b│c
  • Is ab│c the same as (ab)│c or a(b│c)? This is settled by operator precedence, from highest to lowest:
  • Kleene closure *
  • Concatenation ○ (means AND)
  • Union │ (means OR)
  • So ab│c = (ab)│c, and ab*c│de = (a(b*)c)│(de)

  14. 2.3.3 Regular Expression Shorthand (cont.)
  • Use square brackets to choose a single character from a set.
  • For example:
  • [abcd] = (a│b│c│d)
  • [a-f] = (a│b│c│d│e│f)
  • [a-c0-3] = (a│b│c│0│1│2│3)
  • JavaCC uses ["a"-"c", "0"-"3"] to represent [a-c0-3]
  • A ^ as the first character means "not in the set"
  • For example: [^a-zA-Z] stands for any character that is not a letter
  • JavaCC uses ~ to represent "not"
  • For example: ~["a"-"z", "A"-"Z"] means [^a-zA-Z]

  15. 2.3.3 Regular Expression Shorthand (cont.)
  • For any regular expression α:
  • α? is equivalent to (α│ε)
  • For example: ab?c describes {abc, ac}
  • α+ is equivalent to αα*
  • For example: (ab)+c describes {abc, ababc, abababc, …}
  • See some shorthand regular expression examples in Table 2.3 (p17)

  16. 2.3.4 Intuitive Description of Regular Expressions
  • Concatenation means "is followed by"
  • │ is the same as "or"
  • * means zero or more occurrences
  • Examples:
  • abc(d│e)
  • (a│b)(c│d)
  • (a│b)(a│b)
  • a*b
  • (a│b)c*
  • (a│b)* (all strings made up of "a"s and "b"s)
  • The regular expression for an identifier in simpleJava is [a-zA-Z_][a-zA-Z_0-9]*, which describes {a, A, _, aa, aA, a_, a0, Av7, …} (see the sketch after this slide)
  • See Table 2.4 (p18) for some examples
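  The identifier rule can be checked directly with java.util.regex, since the bracket shorthand is the same. A minimal sketch of ours:

     import java.util.regex.Pattern;

     Pattern ident = Pattern.compile("[a-zA-Z_][a-zA-Z_0-9]*");
     ident.matcher("Av7").matches();   // true
     ident.matcher("_tmp").matches();  // true
     ident.matcher("7up").matches();   // false: identifiers cannot start with a digit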

  17. 2.4 JavaCC – A Lexical Analyzer and Parser Generator
  • JavaCC takes as input a set of regular expressions that describe tokens,
  • creates a DFA that recognizes the input set of tokens, and
  • then creates a Java class to implement that DFA
  • Table 2.4 (p18) shows some tokens and their regular expressions
  • JavaCC also creates a parser, which will be explored in Chapter 4

  18. 2.4.1 Structure of a JavaCC file
  • Input files to JavaCC have the extension ".jj"
  • A sample foo.jj file is shown on p18 (a minimal sketch follows this slide)
  • Several tokens can be defined in the same TOKEN block, with the rules separated by "│" (see Fig 2.13 on p19)
  • By convention, token names are in all UPPERCASE
  • When JavaCC runs on the file foo.jj, it creates a class fooTokenManager, which contains a method getNextToken()
  • The getNextToken() method implements the DFA that recognizes the tokens described by the regular expressions
  • Every time getNextToken() is called, it finds the rule whose regular expression matches the next sequence of characters in the input file, and returns the token for that rule
  • See the simple JavaCC input file simple.jj in Fig. 2.13 (p19)
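  Fig. 2.13 is not reproduced in this transcript; the following minimal sketch (our own rule names, in the spirit of the figure) shows the overall shape of such a .jj file:

     PARSER_BEGIN(simple)
     public class simple { }
     PARSER_END(simple)

     TOKEN :
     {
         < ELSE : "else" >
       | < FOR : "for" >
       | < INTEGER_LITERAL : (["0"-"9"])+ >
       | < IDENTIFIER : ["a"-"z","A"-"Z","_"] (["a"-"z","A"-"Z","_","0"-"9"])* >
     }

  Because ELSE is listed before IDENTIFIER, the tie-breaking rules on the next slide resolve the input "else" to the ELSE token.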

  19. 2.4.1 Structure of a JavaCC file (cont.)
  • When more than one rule matches the input, JavaCC uses the following strategy:
  • Always match the longest possible string
  • If two different rules match the same string, use the rule that appears first in the .jj file
  • For example, the string "else" matches both the ELSE and IDENTIFIER rules; since ELSE appears first, getNextToken() returns the ELSE token
  • But "else21" matches only the IDENTIFIER rule, since it is the longer match

  20. 2.4.2 Tokens in JavaCC
  • When we run JavaCC on the input file simple.jj, JavaCC creates several files to implement a lexical analyzer:
  • simpleTokenManager.java: implements the token manager class
  • simpleConstants.java: an interface that defines a set of constants (Fig 2.14, p21)
  • Token.java: describes the tokens returned by getNextToken()

  21. 2.4.3 Dealing with Whitespace – SKIP rules
  • If the input file contains "else ;", the first call to getNextToken() will return an "else" token
  • We then use SKIP rules to skip the whitespace between "else" and ";" rather than returning it as a token (a typical whitespace SKIP block follows this slide)
  • See Fig 2.15 (p22): SKIP rules inserted into the simple.jj of Fig 2.13
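  A typical whitespace SKIP block looks like the following; the exact character set is our assumption, since Fig. 2.15 itself is not reproduced here:

     SKIP :
     {
         " "      // space
       | "\t"     // tab
       | "\n"     // newline
       | "\r"     // carriage return
     }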

  22. 2.4.4 Skipping Comments – States in JavaCC
  • For single-line comments, use the SKIP rule below:
     SKIP : { < "//" (~["\n"])* "\n" > }
    i.e. start with //, followed by any characters except \n, and end with \n.
  • For multiple-line comments, when the input contains "/*", use the following SKIP rule:
     SKIP : { < "/*" > : IN_COMMENT }
    to recognize the opening /* of a multiple-line comment and enter the IN_COMMENT state

  23. 2.4.4 Skipping Comments – States in JavaCC (cont.)
  • In the IN_COMMENT state, use the following SKIP rule to skip the body of the comment:
     <IN_COMMENT> SKIP : { < "*/" > : DEFAULT | < ~[] > }
  • On the end-of-comment symbol "*/", switch from the IN_COMMENT state back to the DEFAULT state; otherwise keep skipping characters one at a time with ~[] (note: ~[] matches any single character, since it is the complement of the empty character set [])
  • See Fig 2.16 (p24)

  24. 2.4.4 Skipping Comments – States in JavaCC (cont.)
  • For example: else /* Comment */ 12/* Comment */else
  • At the beginning, in the DEFAULT state, return the ELSE token
  • Then read the /* and enter the IN_COMMENT state; skip each character after /* until */ is met, then skip the */ and switch back to the DEFAULT state
  • Next, read "12" and return an INTEGER_LITERAL token
  • Then read /* and enter the IN_COMMENT state again, skipping everything after /*
  • When */ is read, skip it and switch back to the DEFAULT state
  • Finally, read "else" and return another ELSE token

  25. 2.4.5 Adding Actions to SKIP and TOKEN Definitions
  • We need some way of keeping track of the number of comments that have been skipped
  • Fig 2.17 (p26) shows how to count the number of comments skipped:
  • it inserts a numcomments++; statement into both comment SKIP rules (single-line and multi-line comments), so the lexer can report how many comments were skipped (a sketch in this spirit follows this slide)
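  A sketch in the spirit of Fig. 2.17, which is not reproduced here; the variable name numcomments comes from the slide, the rest is our illustration:

     TOKEN_MGR_DECLS :
     {
         public static int numcomments = 0;
     }

     SKIP : { < "//" (~["\n"])* "\n" > { numcomments++; } }   // single-line comment

     SKIP : { < "/*" > : IN_COMMENT }                          // enter multi-line comment

     <IN_COMMENT> SKIP :
     {
         < "*/" > { numcomments++; } : DEFAULT                 // count it on the way out
       | < ~[] >
     }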

  26. 2.4.6 Using the Generated TokenManager in Java Code
  • When JavaCC is run on the file foo.jj, it generates the following files (p25):
  • foo.java
  • ParseException.java
  • Token.java
  • SimpleCharStream.java
  • fooTokenManager.java
  • fooConstants.java
  • TokenMgrError.java

  27. 2.4.6 Using the Generated TokenManager in Java Code (cont.)
  • To create a Java program that uses the lexical analyzer created by JavaCC, you need to instantiate a variable of type fooTokenManager, where foo is the name of your .jj file
  • The constructor for fooTokenManager requires an object of type SimpleCharStream
  • The constructor for SimpleCharStream requires a standard java.io.InputStream
  • Thus, you can use the generated lexical analyzer as shown on the next slide

  28. 2.4.6 Using the Generated TokenManager in Java Code (cont., p26)
     Token t;
     fooTokenManager tm;
     java.io.InputStream infile;
     infile = new java.io.FileInputStream("filename");        // open the input first
     tm = new fooTokenManager(new SimpleCharStream(infile));
     t = tm.getNextToken();
     while (t.kind != fooConstants.EOF) {
         /* Process t */
         t = tm.getNextToken();
     }

  29. 2.4.6 Using the Generated TokenManager in Java Code (cont.)
  • Any public variables declared in the TOKEN_MGR_DECLS block (p18) can be accessed through the token manager
  • For instance, the main program in RemoveComments.java (Fig. 2.18, p28) calls the lexical analyzer generated from remComments.jj (Fig. 2.17, p26) to strip all of the comments out of an input file (a sketch of such a driver follows this slide)
  • Assuming the files remComments.jj and RemoveComments.java are in the same directory, the comment stripper can be built with the following commands:
     javacc remComments.jj
     javac *.java
  • The code can be tested with the command:
     java RemoveComments <filename>
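  A hedged sketch of such a driver, following the foo-to-fooTokenManager naming convention from slide 27 and the numcomments counter from the sketch above; we have not reproduced Fig. 2.18 itself, and the class name CountComments is our own:

     import java.io.FileInputStream;

     public class CountComments {
         public static void main(String[] args) throws Exception {
             remCommentsTokenManager tm = new remCommentsTokenManager(
                 new SimpleCharStream(new FileInputStream(args[0])));
             Token t = tm.getNextToken();
             while (t.kind != remCommentsConstants.EOF) {
                 // a real RemoveComments would echo t.image here
                 t = tm.getNextToken();
             }
             System.out.println("Comments skipped: " + remCommentsTokenManager.numcomments);
         }
     }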

  30. 2.5 Creating a Lexical Analyzer
  2.5.1 Project Definition
  • Write a JavaCC file "sjava.jj" that creates a lexical analyzer for simpleJava tokens
  • A skeleton of sjava.jj can be found in Fig. 2.20 (p32) to get you started
  • Once you have built your lexical analyzer, it can be tested with the Java class TokenTest in Fig. 2.21 (p33)
  • Your lexical analyzer should recognize the following keywords and symbols (p30):
     and class do else false for if true while
     + - * / [ ] { } ( ) . , ; == != < > <= >= = && || !
  • It should also recognize integer literals and identifiers
  • It should skip single-line and multi-line comments
  • Refer to page 319 for the definitions of tokens and comments

  31. 2.5.2 Project Difficulty
  • The main challenge is learning the syntax of JavaCC
  • To minimize your frustration, add to the sjava.jj file a little at a time, and constantly re-run JavaCC on it to find bugs early
  • Provided files:
  • sjava.jj (Fig. 2.20, p32)
  • TokenTest.java (Fig. 2.21, p33)
  • test1.sjava to test4.sjava: sample test programs for your lexical analyzer
  • Once you have modified sjava.jj to implement a complete lexical analyzer, you can build it with the commands:
     javacc sjava.jj
     javac *.java
  • You can test your lexical analyzer with the command:
     java TokenTest <filename>

  32. Chapter 2 Homework
  • Due: 2/13/2013
  • Part A:
  • Design a DFA for a chewing gum vending machine if each pack of chewing gum costs 25 cents and the machine accepts only nickels, dimes and quarters.
  • Note that the machine gives no change. For example, if the customer puts in 30 cents, one pack of chewing gum comes out and no change is returned.
  • Part B:
  • Do the following exercises in Chapter 2:
  • 1. (a), (c), (e), (g)
  • 2. (a), (b), (c)
  • 3. (a), (b), (c)
  • 4. Give a DFA for each of the languages in question 1 (a), (c), (e), (g)

  33. Project 1
  • Due: 2/20/2013
  • Project Title: A Lexical Analyzer for simpleJava Tokens
  • Write a JavaCC file "sjava.jj" that creates a lexical analyzer for simpleJava tokens
  • Study Chapter 2 and JavaCC thoroughly to figure out how to design, implement and test your lexical analyzer
  • Hand in & email me:
  • Your sjava.jj program
  • Your compilation results (outputs) from the javacc and javac commands
  • Your testing results (outputs) from the java TokenTest <filename> command
  • Your input files for testing the TokenTest program
  • A README file describing the steps to run and test your programs
