This overview explores the concepts of regular grammars, finite state machines (FSM), and regular expressions (RE) essential for lexical analysis. Regular grammars, defined by terminals, non-terminals, and specific production rules, are highlighted alongside their limitations, such as the inability to express nesting or size restrictions. The relationship between regular expressions and their corresponding finite automata is examined, including the differences between deterministic (DFA) and non-deterministic (NFA) machines. Practical applications and implementations of these concepts in software development are also discussed.
Scanning, or Lexical Analysis
• Regular Grammars
  • Non-terminals (arbitrary names)
  • Terminals (characters)
  • Productions limited to the following:
    • Non-terminal ::= terminal
    • Non-terminal ::= terminal Non-terminal
  • Treat a character class (e.g. digit) as a terminal
• Regular grammars cannot count: they cannot express size limits on identifiers or literals
• They cannot express proper nesting (e.g. balanced parentheses)
Department of Software & Media Technology
Regular Grammars
• Grammar for real literals with no exponent:
  • digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  • REALVAL ::= digit REALVAL1
  • REALVAL1 ::= digit REALVAL1 (arbitrary size)
  • REALVAL1 ::= . INTEGERVAL
  • INTEGERVAL ::= digit INTEGERVAL (arbitrary size)
  • INTEGERVAL ::= digit
• Which non-terminal is the start symbol?
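The productions above can be read directly as a recognizer, with one function per non-terminal. The following is a minimal sketch, assuming REALVAL is taken as the start symbol (the slide leaves that as a question); the function names are illustrative and not part of the original material.

```python
DIGITS = set("0123456789")  # the character class "digit", treated as one terminal


def integerval(s: str) -> bool:
    # INTEGERVAL ::= digit INTEGERVAL | digit
    if not s or s[0] not in DIGITS:
        return False
    return len(s) == 1 or integerval(s[1:])


def realval1(s: str) -> bool:
    # REALVAL1 ::= digit REALVAL1 | . INTEGERVAL
    if not s:
        return False
    if s[0] in DIGITS:
        return realval1(s[1:])
    if s[0] == ".":
        return integerval(s[1:])
    return False


def realval(s: str) -> bool:
    # REALVAL ::= digit REALVAL1 (assumed start symbol)
    return bool(s) and s[0] in DIGITS and realval1(s[1:])
```

Under this reading, a literal needs at least one digit on each side of the point, so `realval("3.14")` holds while `realval("3.")` and `realval(".5")` do not.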
Regular Expressions
• REs are defined by an alphabet (terminal symbols) and three operations:
  • Alternation: RE1 | RE2
  • Concatenation: RE1 RE2
  • Repetition: RE* (zero or more occurrences of RE)
• The languages described by REs are exactly the languages generated by regular grammars
• Regular expressions are more convenient for some applications
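As an illustration, a real literal with no exponent can be written using only the three operations listed above. This sketch uses Python's standard `re` module; the names `DIGIT`, `REAL` and `is_real` are chosen here for illustration.

```python
import re

# Alternation: a digit is one of ten alternatives.
DIGIT = "(0|1|2|3|4|5|6|7|8|9)"

# Concatenation and repetition: digit digit* . digit digit*
REAL = DIGIT + DIGIT + "*" + r"\." + DIGIT + DIGIT + "*"

REAL_RE = re.compile(REAL)


def is_real(text: str) -> bool:
    # fullmatch requires the whole string to belong to the language
    return REAL_RE.fullmatch(text) is not None
```

In everyday use the same language is usually written with the shorthand `[0-9]+\.[0-9]+`; character classes and `+` are conveniences that add no expressive power.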
Finite State Machines or Finite Automata (FSM or FA)
• A language defined by a grammar is a (possibly infinite) set of strings
• An automaton is a computation that determines whether a given string belongs to a specified language
• A finite state machine (FSM) is an automaton that recognizes regular languages (those described by regular expressions)
• Simplest automaton: its memory is a single number (the state)
Specifying a Finite State Machine (FSM)
• A set of labeled states, with directed arcs between states, each arc labeled with a character
• One or more states may be terminal (accepting)
• Start is a distinguished state
• The automaton makes a transition from state S1 to S2 if and only if the arc from S1 to S2 is labeled with the next character in the input
• A token is legal if the automaton stops in a terminal state
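A specification of this kind maps naturally onto a transition table. The sketch below assumes a small identifier automaton (a letter followed by letters, digits or underscores); the state names `S` and `identifier` and the helper `char_class` are illustrative assumptions, not taken from the slides.

```python
import string


def char_class(ch: str) -> str:
    # Group characters that cause the same transition into one named class.
    if ch in string.ascii_letters:
        return "letter"
    if ch in string.digits:
        return "digit"
    if ch == "_":
        return "underscore"
    return "other"


# Directed arcs: (state, label) -> next state. Missing entries are errors.
TRANSITIONS = {
    ("S", "letter"): "identifier",
    ("identifier", "letter"): "identifier",
    ("identifier", "digit"): "identifier",
    ("identifier", "underscore"): "identifier",
}
START = "S"
ACCEPTING = {"identifier"}  # terminal states


def accepts(text: str) -> bool:
    state = START
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:  # no arc labeled with the next character
            return False
    return state in ACCEPTING  # legal only if we stop in a terminal state
```

Keeping the arcs in a dictionary makes the machine data rather than code, so a different token only needs a different table.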
FA from Grammar
• One state for each non-terminal
• A rule of the form Nt1 ::= terminal generates a transition from the state for Nt1 to a final state, on an arc labeled by the terminal
• A rule of the form Nt1 ::= terminal Nt2 generates a transition from the state for Nt1 to the state for Nt2, on an arc labeled by the terminal
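The two rules can be applied mechanically. The sketch below builds the arcs for the real-literal grammar (REALVAL, REALVAL1, INTEGERVAL) and runs the result; the tuple encoding of productions and the state name `FINAL` are assumptions made for this example. Because INTEGERVAL has two rules starting with a digit, the resulting machine is non-deterministic, so the runner tracks a set of possible states.

```python
from collections import defaultdict

FINAL = "FINAL"  # hypothetical name for the single final state

# (left-hand side, terminal, right-hand non-terminal or None)
PRODUCTIONS = [
    ("REALVAL", "digit", "REALVAL1"),
    ("REALVAL1", "digit", "REALVAL1"),
    ("REALVAL1", ".", "INTEGERVAL"),
    ("INTEGERVAL", "digit", "INTEGERVAL"),
    ("INTEGERVAL", "digit", None),
]


def build_fa(productions):
    # One state per non-terminal; each rule contributes one labeled arc.
    arcs = defaultdict(set)
    for lhs, terminal, rhs in productions:
        arcs[(lhs, terminal)].add(FINAL if rhs is None else rhs)
    return arcs


def accepts(arcs, start, text):
    current = {start}
    for ch in text:
        label = "digit" if ch in "0123456789" else ch
        # Follow every arc with this label from every possible state.
        current = set().union(*(arcs.get((s, label), set()) for s in current))
        if not current:
            return False
    return FINAL in current


arcs = build_fa(PRODUCTIONS)
```

With these arcs, `accepts(arcs, "REALVAL", "3.14")` succeeds, while `"3."` and `"12"` are rejected.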
Graphic representation of FA
[Figure: state diagram with states S and identifier; arcs labeled letter, digit and underscore]
FA from RE
• Each RE corresponds to a grammar
• For every RE, a natural translation to an FSM exists
• Alternation often leads to non-deterministic machines
Deterministic Finite Automata (DFA)
• For every state S and every character C, there is at most one arc leaving S that is labeled with C
• Easier to implement
• No backtracking
Conventions for DFA:
• Error transitions are not explicitly shown
• Input symbols that result in the same transition are grouped together (this set can even be given a name)
• Still not displayed: stopping conditions and actions
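The defining property can be checked directly on a list of arcs. This is a small sketch; the `(source, label, target)` triple encoding and the example state names are assumptions made here.

```python
def is_deterministic(arcs) -> bool:
    """arcs is an iterable of (source, label, target) triples."""
    seen = {}
    for source, label, target in arcs:
        key = (source, label)
        # A second arc with the same source and label but another target
        # violates the "at most one arc" condition.
        if key in seen and seen[key] != target:
            return False
        seen[key] = target
    return True


IDENTIFIER_ARCS = [
    ("start", "letter", "in_id"),
    ("in_id", "letter", "in_id"),
    ("in_id", "digit", "in_id"),
]

INT_OR_REAL_ARCS = [
    ("start", "digit", "in_int"),
    ("start", "digit", "in_real"),  # same state, same label, other target
]
```

Here `is_deterministic(IDENTIFIER_ARCS)` holds, while `INT_OR_REAL_ARCS` fails the check because a digit in the start state has two possible targets.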
Non-Deterministic Finite Automata (NFA)
• A non-deterministic FA has at least one state with two arcs, leading to two distinct states, that are labeled with the same character
• Example: from the start state, a digit can begin an integer literal or a real literal
• A direct implementation requires backtracking
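The integer-or-real example can be simulated by trying one arc and backing up when it leads nowhere. The state names and the shape of this NFA are assumptions chosen to match the example on the slide; it is a sketch of a backtracking recognizer, not the only way to run an NFA.

```python
# (state, label) -> list of possible next states
NFA = {
    ("start", "digit"): ["in_int", "in_real"],  # the non-deterministic choice
    ("in_int", "digit"): ["in_int"],
    ("in_real", "digit"): ["in_real"],
    ("in_real", "."): ["after_point"],
    ("after_point", "digit"): ["in_fraction"],
    ("in_fraction", "digit"): ["in_fraction"],
}
ACCEPTING = {"in_int": "int", "in_fraction": "real"}


def classify(text, state="start", pos=0):
    """Return "int", "real", or None if the text is not a legal literal."""
    if pos == len(text):
        return ACCEPTING.get(state)
    ch = text[pos]
    label = "digit" if ch in "0123456789" else ch
    for target in NFA.get((state, label), []):
        kind = classify(text, target, pos + 1)
        if kind is not None:
            return kind
        # This choice failed: fall through and try the next arc (backtrack).
    return None
```

For `"3.14"` the recognizer first guesses an integer, fails at the point, backs up, and succeeds along the real-literal path.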
Lookahead & Backtracking in NFA
[Figure: automaton with states start, in_id and finish; arcs labeled letter, digit and [other]; action "return id"]
Implementation of FA
[Figure: automaton with states start, in_id and finish; arcs labeled letter, digit and [other]; action "return id"]
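One common way to implement such a machine is a loop with an explicit state variable. The sketch below assumes the usual reading of the diagram labels: a letter moves from start to in_id, letters and digits stay in in_id, and the bracketed [other] is a lookahead character that ends the identifier without being consumed. The function name and return convention are illustrative.

```python
def scan_identifier(text: str, pos: int = 0):
    """Return (lexeme, next_pos), or (None, pos) if no identifier starts here."""
    state = "start"
    begin = pos
    while state != "finish":
        ch = text[pos] if pos < len(text) else ""  # "" stands for end of input
        if state == "start":
            if ch.isalpha():
                state = "in_id"
                pos += 1
            else:
                return None, begin  # error transition
        elif state == "in_id":
            if ch.isalpha() or ch.isdigit():
                pos += 1  # stay in in_id and consume the character
            else:
                state = "finish"  # [other]: lookahead only, not consumed
    return text[begin:pos], pos  # action: return id
```

Because the terminating character is left in the input, the caller can continue scanning the next token from `next_pos`.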
From RE to DFA & RE to NFA
[Figure: automaton with states start, in_id and finish; arcs labeled letter, digit and [other]; action "return id"]
NFA to DFA
• There is an algorithm for converting a non-deterministic machine to a deterministic one
• The result may have exponentially more states
• Intuitively: new states are needed to express uncertainty about the token (int or real)
• Other algorithms exist for minimizing the number of states of an FSM, for showing equivalence, etc.
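The standard conversion is the subset construction, in which each DFA state is a set of NFA states. The sketch below applies it to an assumed int-or-real NFA with no empty (epsilon) transitions; handling those would need an extra closure step that is omitted here. All state names are illustrative.

```python
from collections import deque

NFA = {
    ("start", "digit"): ["in_int", "in_real"],
    ("in_int", "digit"): ["in_int"],
    ("in_real", "digit"): ["in_real"],
    ("in_real", "."): ["after_point"],
    ("after_point", "digit"): ["in_fraction"],
    ("in_fraction", "digit"): ["in_fraction"],
}
NFA_ACCEPTING = {"in_int", "in_fraction"}


def subset_construction(nfa, start, accepting):
    start_set = frozenset([start])
    dfa, dfa_accepting = {}, set()
    seen, queue = {start_set}, deque([start_set])
    while queue:
        current = queue.popleft()
        if current & accepting:
            dfa_accepting.add(current)
        # Labels on arcs leaving any NFA state in the current set.
        labels = {label for (state, label) in nfa if state in current}
        for label in labels:
            target = frozenset(
                t for s in current for t in nfa.get((s, label), ())
            )
            dfa[(current, label)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa, start_set, dfa_accepting


def dfa_accepts(dfa, start, accepting, text):
    state = start
    for ch in text:
        label = "digit" if ch in "0123456789" else ch
        state = dfa.get((state, label))
        if state is None:
            return False
    return state in accepting


DFA, DFA_START, DFA_ACCEPTING = subset_construction(NFA, "start", NFA_ACCEPTING)
```

The combined state `{in_int, in_real}` is the "uncertain" state: after one or more digits the machine does not yet know whether the token is an int or a real.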
Example DFA
Another view of the same DFA
Yet another view of the same DFA
State Minimization in DFA
TINY DFA:
Lex for Scanner
• Lex Conventions for RE
• Format of a Lex Input File