Layered Combinator Parsers with a Unique State
This presentation describes a system of parser combinators that thread a unique (single-threaded) parser state through the parse. It first reviews conventional list-of-successes parser combinators and lists the requirements for the new combinators, then covers the system architecture, the new combinators, a separate scanner and parser, and error handling. The parser state holds the input, a position, error administration and a user-defined part such as a symbol table. The combinators are implemented with continuations for efficiency and use non-determinism only when needed, while context-sensitive parsing and arbitrary look-ahead remain available.
Pieter Koopman, Rinus Plasmeijer
Nijmegen, The Netherlands

Overview
• conventional parser combinators
• requirements for new combinators
• system architecture
• new parser combinators
• separate scanner and parser
• error handling

parser combinators
• non-deterministic, list of results
    :: Parser s r :== [s] -> [ParseResult s r]
    :: ParseResult s r :== ([s],r)
• fail & yield
    fail    = \ss = []
    yield r = \ss = [(ss,r)]
• recognize symbol
    satisfy :: (s->Bool) -> Parser s s
    satisfy f = p
    where p [s:ss] | f s = [(ss,s)]
          p _            = []
    symbol sym :== satisfy ((==) sym)

parser combinators 2
• sequence combinators
    (<&>) infixr 6 :: (Parser s r) (r->Parser s t) -> Parser s t
    (<&>) p1 p2 = \ss1 = [ tuple \\ (ss2,r1) <- p1 ss1
                                 , tuple <- p2 r1 ss2 ]
    (<+>) infixl 6 :: (Parser s (r->t)) (Parser s r) -> Parser s t
    (<+>) p1 p2 = \ss1 = [ (ss3,f r) \\ (ss2,f) <- p1 ss1
                                     , (ss3,r) <- p2 ss2 ]
• choice combinator
    (<||>) infixr 4 :: (Parser s r) (Parser s r) -> Parser s r
    (<||>) p1 p2 = \ss = p1 ss ++ p2 ss

parser combinators 3
• some useful abbreviations
    (@>) infixr 7
    (@>) f p :== yield f <+> p
    (<:>) infixl 6
    (<:>) p1 p2 :== (\h t = [h:t]) @> p1 <+> p2

parser combinators 4
• Kleene star
    star p = p <:> star p <||> yield []
    plus p = p <:> star p
• parsing an identifier (the parentheses are needed because @> binds tighter than <:>)
    identifier :: Parser Char String
    identifier = toString @> (satisfy isAlpha <:> star (satisfy isAlphanum))

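The definitions on the slides so far can be collected into one small module. The following is a minimal sketch rather than the authors' code: the module name `listparsers`, the explicit type signatures for `@>`, `<:>` and `star`, and the `Start` expression are assumptions added for illustration, and the macros from the slides are written as ordinary functions.

```clean
module listparsers

import StdEnv

// A parser maps the input to a list of (remaining input, result) pairs.
:: Parser s r :== [s] -> [([s], r)]

// Succeed with result r without consuming input.
yield :: r -> Parser s r
yield r = \ss -> [(ss, r)]

// Accept one symbol that satisfies the predicate f.
satisfy :: (s -> Bool) -> Parser s s
satisfy f = p
where
    p [s:ss] | f s = [(ss, s)]
    p _            = []

// Sequence: apply the function parsed by p1 to the value parsed by p2.
(<+>) infixl 6 :: (Parser s (r -> t)) (Parser s r) -> Parser s t
(<+>) p1 p2 = \ss1 -> [ (ss3, f r) \\ (ss2, f) <- p1 ss1, (ss3, r) <- p2 ss2 ]

// Choice: collect the results of both alternatives.
(<||>) infixr 4 :: (Parser s r) (Parser s r) -> Parser s r
(<||>) p1 p2 = \ss -> p1 ss ++ p2 ss

// Apply a function to the result of a parser.
(@>) infixr 7 :: (r -> t) (Parser s r) -> Parser s t
(@>) f p = yield f <+> p

// Cons the result of p1 onto the list produced by p2.
(<:>) infixl 6 :: (Parser s r) (Parser s [r]) -> Parser s [r]
(<:>) p1 p2 = (\h t -> [h:t]) @> p1 <+> p2

// Zero or more occurrences of p, longest match first.
star :: (Parser s r) -> Parser s [r]
star p = p <:> star p <||> yield []

identifier :: Parser Char String
identifier = toString @> (satisfy isAlpha <:> star (satisfy isAlphanum))

// Parse the character list x, 1, space, y.
Start = identifier ['x1 y']
```

Because the result is a list of successes, this sketch should produce two parses: the identifier "x1" with the two characters after it left over, followed by the shorter identifier "x". This is the non-determinism that the later slides restrict to the places where it is needed.
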
parser combinators 5
• context-sensitive parsers: twice the same character
    doubleChar = satisfy isAlpha <&> \c -> symbol c
• arbitrary look-ahead
    lookAhead =   symbol 'a' +> symbol 'b'
             <||> symbol 'a' +> symbol 'c'
             <||> star (satisfy isSpace) +> symbol 'a'
             <||> symbol 'x'

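A short module makes the two properties concrete. This is a sketch under assumptions: the slides use `+>` without defining it, so the definition below (run both parsers, keep the second result) is a guess at its meaning, and the module name `lookahead` and the `Start` expression are illustrative only.

```clean
module lookahead

import StdEnv

:: Parser s r :== [s] -> [([s], r)]

satisfy :: (s -> Bool) -> Parser s s
satisfy f = p
where
    p [s:ss] | f s = [(ss, s)]
    p _            = []

symbol :: s -> Parser s s | == s
symbol sym = satisfy ((==) sym)

// Monadic sequence: the second parser may depend on the first result.
(<&>) infixr 6 :: (Parser s r) (r -> Parser s t) -> Parser s t
(<&>) p1 p2 = \ss1 -> [ res \\ (ss2, r1) <- p1 ss1, res <- p2 r1 ss2 ]

// Assumed meaning of +>: sequence, keeping only the second result.
(+>) infixr 6 :: (Parser s r) (Parser s t) -> Parser s t
(+>) p1 p2 = p1 <&> \_ -> p2

(<||>) infixr 4 :: (Parser s r) (Parser s r) -> Parser s r
(<||>) p1 p2 = \ss -> p1 ss ++ p2 ss

// Context sensitive: the second symbol must equal the first one.
doubleChar :: Parser Char Char
doubleChar = satisfy isAlpha <&> \c -> symbol c

// Look-ahead: both alternatives start with 'a'; the list of results
// simply keeps whichever alternative matches the rest of the input.
lookAhead :: Parser Char Char
lookAhead = symbol 'a' +> symbol 'b' <||> symbol 'a' +> symbol 'c'

Start = (doubleChar ['aab'], doubleChar ['ab'], lookAhead ['ac'])
```

In this sketch `doubleChar` succeeds on the input a, a, b with one result and the b left over, fails (empty list) on a, b, and `lookAhead` accepts a, c through its second alternative with no input left over.
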
properties of combinators
+ concise and clear parsers
+ full power of the functional programming language (fpl) available
+ context sensitive
+ arbitrary look-ahead
+ can be efficient: continuations (IFL '98)
- no error handling (messages & recovery)
- no unique symbol tables
- a separate scanner causes problems: the entire file is scanned before the parser starts

Requirements
• parse state with
  • error file
  • notion of position
  • user-defined extension, e.g. symbol table
• possibility to add a separate scanner
• efficient implementation, continuations
• for programming languages we want a single result (deterministic grammar)

Uniqueness
• files and windows that should be single-threaded are unique in Clean
    fwritec :: Char *File -> *File
• data structures can be updated destructively when they are unique
• only unique arrays can be changed

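As an illustration of the last two bullets, the sketch below updates a unique array in place. The function `increment`, the module name `uniqdemo` and the sample array are assumptions made for this example; only the uniqueness attribute `*` and the idea of destructive update come from the slide.

```clean
module uniqdemo

import StdEnv

// Increment element i. The array argument is unique (*), so no other
// reference to it can exist and the update may be done destructively.
increment :: Int *{#Int} -> *{#Int}
increment i arr
    # (x, arr) = arr![i]        // unique selection hands the array back
    = {arr & [i] = x + 1}

Start = increment 1 {10, 20, 30}
```

The result is the array holding 10, 21 and 30. Using `arr` a second time after passing it to `increment` would be rejected by the type checker, which is what makes the in-place update safe.
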
System architecture
• replace the list of symbols by a structure containing
  • actual input
  • position
  • error administration
  • user-defined part of the state
• use a type constructor class to allow multiple levels

Type constructor class
• Reading a symbol
    class PSread ps s st :: (*ps s *st) -> (s, *ps s *st)
• Copying the state is not allowed, use functions to manipulate the input
    class PSsplit ps s st :: (s, *ps s *st) -> (s, *ps s *st)
    class PSback  ps s st :: (s, *ps s *st) -> (s, *ps s *st)
    class PSclear ps s st :: (s, *ps s *st) -> (s, *ps s *st)
• Minimal parser state (requires Clean 2.0)
    class ParserState ps symbol state
        | PSread, PSsplit, PSback, PSclear ps symbol state

New parser combinators
• Parsers have three arguments
  • success continuation: determines the action upon success
        SuccCont :== Item FailCont State -> (Result, State)
  • fail continuation: specifies what to do if the parser fails
        FailCont :== State -> (Result, State)
  • current input state
        State :== (Symbol, ParserState)

New parser combinators 2
• yield and fail, apply the appropriate continuation
    yield r  = \succ fail tuple = succ r fail tuple
    failComb = \succ fail tuple = fail tuple
• sequence of parsers, change the success continuation
    (<&>) p1 p2 = \sc fc t -> p1 (\a _ -> p2 a sc fc) fc t
• choice, change both continuations
    (<|>) p1 p2 = \succ fail tuple =
        p1 (\r f t = succ r fail (PSclear t))
           (\t2 = p2 succ fail (PSback t2))
           (PSsplit tuple)

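The continuation scheme can be tried out on a simplified state. The sketch below is not the authors' implementation: it uses a plain, non-unique list of symbols as the state, so the saved input can simply be shared and the roles of `PSsplit`, `PSback` and `PSclear` disappear. The names `Result`, `Parsed`, `NoParse`, `CParser`, `parse` and `abOrAc` are assumptions introduced for the example.

```clean
module contparsers

import StdEnv

:: Result r = Parsed r | NoParse

:: FailCont s t   :== [s] -> Result t
:: SuccCont s r t :== r -> (FailCont s t) -> [s] -> Result t
:: CParser s r t  :== (SuccCont s r t) -> (FailCont s t) -> [s] -> Result t

// yield and fail apply the appropriate continuation.
yield :: r -> CParser s r t
yield r = \sc fc ss -> sc r fc ss

failComb :: CParser s r t
failComb = \sc fc ss -> fc ss

satisfy :: (s -> Bool) -> CParser s s t
satisfy f = p
where
    p sc fc [s:ss] | f s = sc s fc ss
    p sc fc ss           = fc ss

symbol :: s -> CParser s s t | == s
symbol c = satisfy ((==) c)

// Sequence: p2 becomes part of the success continuation of p1.
(<&>) infixr 6 :: (CParser s a t) (a -> CParser s b t) -> CParser s b t
(<&>) p1 p2 = \sc fc ss -> p1 (\a _ ss1 -> p2 a sc fc ss1) fc ss

// Choice: if p1 fails, retry p2 on the saved input ss; if p1 succeeds,
// commit to it by continuing with the outer fail continuation.
(<|>) infixr 4 :: (CParser s r t) (CParser s r t) -> CParser s r t
(<|>) p1 p2 = \sc fc ss -> p1 (\r _ ss1 -> sc r fc ss1) (\_ -> p2 sc fc ss) ss

// Run a parser with the initial continuations.
parse :: (CParser s r r) [s] -> Result r
parse p input = p (\r _ _ -> Parsed r) (\_ -> NoParse) input

abOrAc :: CParser Char Char t
abOrAc = (symbol 'a' <&> \_ -> symbol 'b') <|> (symbol 'a' <&> \_ -> symbol 'c')

Start = (parse abOrAc ['ac'], parse abOrAc ['ad'])
```

In this sketch the first parse backtracks out of the failed first alternative and succeeds with 'c', and the second parse ends in `NoParse`. Each parse yields at most one result, in line with the requirement of a single result for deterministic grammars. With a unique parser state the input cannot be shared like `ss` is here, which is the reason the slides save and restore positions through the state itself.
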
string input
• a very simple instance of ParserState
    :: *StringInput symbol state =
        { si_string :: String      // string holds input
        , si_pos    :: Int         // index of current char
        , si_hist   :: [Int]       // to remember old positions
        , si_state  :: state       // user-defined extension
        , si_error  :: ErrorState
        }
    instance PSread StringInput Char state
    where PSread si=:{si_string,si_pos}
            = (si_string.[si_pos], {si & si_pos = si_pos+1})
    instance PSsplit StringInput Char state
    where PSsplit (c,si=:{si_pos,si_hist})
            = (c, {si & si_hist = [si_pos:si_hist]})
    instance PSback StringInput Char state
    where PSback (_,si=:{si_string,si_hist=[h:t]})
            = (si_string.[h-1], {si & si_pos = h, si_hist = t})

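The interplay of reading, remembering a position and going back can be sketched without the type constructor classes. The version below is a simplification under assumptions: the record drops the type parameters, the user-defined state and the error state, and the functions `readChar`, `mark` and `back` are hypothetical stand-ins for the `PSread`, `PSsplit` and `PSback` instances that work on the state alone rather than on a (symbol, state) pair.

```clean
module stringinput

import StdEnv

:: *StringInput =
    { si_string :: String   // string holds input
    , si_pos    :: Int      // index of current char
    , si_hist   :: [Int]    // remembered positions
    }

// Read the current character and advance the position.
readChar :: *StringInput -> (Char, *StringInput)
readChar si=:{si_string, si_pos}
    = (si_string.[si_pos], {si & si_pos = si_pos + 1})

// Remember the current position (the role of PSsplit).
mark :: *StringInput -> *StringInput
mark si=:{si_pos, si_hist} = {si & si_hist = [si_pos : si_hist]}

// Return to the most recently remembered position (the role of PSback).
back :: *StringInput -> *StringInput
back si=:{si_hist = [h:t]} = {si & si_pos = h, si_hist = t}

Start
    # si       = {si_string = "abc", si_pos = 0, si_hist = []}
    # (c1, si) = readChar (mark si)
    # (c2, si) = readChar si
    # (c3, _)  = readChar (back si)
    = (c1, c2, c3)
```

The three characters read are 'a', 'b' and again 'a': the state is never copied, yet the history list lets a failing alternative rewind the input.
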
Separate scanner and parser
• sometimes it is convenient to have a separate scanner, e.g. to implement the offside rule
• the tasks of scanner and parser are similar, so use the same combinators
• due to the type constructor class we can nest parser states

a simple scanner
• use of the combinators doesn't change
• produces tokens (algebraic datatype)
    scanner = skipSpace +>
        (   generateOffsideToken
        <|> satisfy isAlpha <:> star (satisfy isAlphanum) <@ testReserved o toString
        <|> plus (satisfy isDigit) <@ IntToken o to_number 0
        <|> symbol '=' <@ K EqualToken
        <|> symbol '(' <@ K OpenToken
        <|> symbol ')' <@ K CloseToken
        )

generating offside tokens
• use an ordinary parse function
    generateOffsideToken
        = pAcc getCol     <&> \col ->     // get current column
          pAcc getOffside <&> \os_col ->  // get offside position
          handleOS col os_col
    where
        handleOS col os_col
            | EndGroupGenerated os_col
                | col < os_col = pApp popOffside (yield EndOfGroupToken)
                               = pApp ClearEndGroup failComb
            | col <= os_col    = pApp SetEndGroup (yield EndOfDefToken)
                               = failComb

Parser state for nesting
• parser state contains the scanner and its state
    :: *NestedInput token state = E. .ps sym scanState:
        { ni_scanSt  :: (ps sym scanState)
        , ni_scanner :: (ps sym scanState) -> *(token, ps sym scanState)
        , ni_buffer  :: [token]
        , ni_history :: [[token]]
        , ni_state   :: state
        }
• can be nested to any depth
• we can, but do not have to, use this

Parser state for nesting 2
[Diagram of the nested parser state, with the labels: NestedInput, ScanState, *File, *ErrorState, *OffsideState, scanner, *HashTable]

Parser state for nesting 3
• apply the scanner to read a token
    instance PSread NestedInput token state
    where PSread ni=:{ni_scanner,ni_scanSt}
            # (tok,state) = ni_scanner ni_scanSt
            = (tok, {ni & ni_scanSt = state})
• here, we ignored the buffer
• define instances for the other functions in class ParserState

error handling
• general error correction is difficult
  • correct simple errors
  • skip to a new definition otherwise
• good error messages:
  • location: position in file
  • what are we parsing: stack of contexts
    Error [t.icl,20,[caseAlt,Expression]]: ) expected instead of =

error handling 2
• basic error generation
    parseError expected val
        = \succ fail (t,ps) =
            let msg = toString expected +++ " expected instead of " +++ toString t
            in  succ val fail (PSerror msg (PSread ps))
• useful primitives
    wantSymbol sym   = symbol sym <|> parseError sym sym
    want p msg value = p <|> parseError msg value
    skipToSymbol sym = symbol sym
                   <|> parseError sym sym +> star (satisfy ((<>) sym)) +> symbol sym

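The idea behind `parseError` and `wantSymbol` can be tried out on a simplified state: record a message, skip the offending symbol, and continue as if the expected value had been parsed. The sketch below rests on several assumptions: the state is a plain pair of collected messages and remaining characters instead of a unique `ParserState`, `PSerror` and `PSread` are replaced by list operations, `+>` is given an assumed definition, and the names `PState`, `Result`, `Parsed`, `NoParse`, `parse` and `bracketed` are illustrative.

```clean
module errorsketch

import StdEnv

:: PState :== ([String], [Char])   // (error messages, remaining input)
:: Result r = Parsed r [String] | NoParse

:: FailCont t   :== PState -> Result t
:: SuccCont r t :== r -> (FailCont t) -> PState -> Result t
:: CParser r t  :== (SuccCont r t) -> (FailCont t) -> PState -> Result t

symbol :: Char -> CParser Char t
symbol c = p
where
    p sc fc (es, [x:xs]) | x == c = sc x fc (es, xs)
    p sc fc st                    = fc st

// Committed choice: try p2 on the saved state only if p1 fails.
(<|>) infixr 4 :: (CParser r t) (CParser r t) -> CParser r t
(<|>) p1 p2 = \sc fc st -> p1 (\r _ st1 -> sc r fc st1) (\_ -> p2 sc fc st) st

// Assumed meaning of +>: sequence, keeping only the second result.
(+>) infixr 6 :: (CParser a t) (CParser b t) -> CParser b t
(+>) p1 p2 = \sc fc st -> p1 (\_ _ st1 -> p2 sc fc st1) fc st

// Record a message, skip one symbol and succeed with the value val.
parseError :: String r -> CParser r t
parseError expected val
    = \sc fc (es, input) -> sc val fc (es ++ [msg input], drop 1 input)
where
    msg input = expected +++ " expected instead of " +++ found input
    found [x:_] = toString x
    found []    = "end of input"

wantSymbol :: Char -> CParser Char t
wantSymbol c = symbol c <|> parseError (toString c) c

parse :: (CParser r r) [Char] -> Result r
parse p input = p (\r _ (es, _) -> Parsed r es) (\_ -> NoParse) ([], input)

bracketed :: CParser Char t
bracketed = symbol '(' +> symbol 'x' +> wantSymbol ')'

Start = parse bracketed ['(x=']
```

On the faulty input the parse still succeeds with the expected ')' and carries one message, ") expected instead of =", which matches the message format shown on the previous slide. A plain `symbol ')'` in the same position would have made the whole parse fail.
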
Parser
• Parsing expressions
    pExpression = "Expression" ::>
            BV @> match mBasicValue
        <|> pIdentifier
        <|> symbol CaseToken +> pDeterCase @> pCompoundExpression
                <+ wantSymbol OfToken <+> star pCaseAlt
                <+ skipToSymbol EndOfGroupToken
        <|> symbol OpenToken +> pCompoundExpression <+ wantSymbol CloseToken

identifiers in hashtable
• use a parse function
• the hashtable is the user-defined state in ParserState
    pIdentifier
        = match mIdentToken <&> \ident =
          pAccSt (putNameInHashTable ident) <@ \name =
          {app_symb = UnknownSymbol name, app_args = []}
• the function pAccSt applies a function to the user-defined state

limitations of this approach
• syntax is specified by parse functions
  • the grammar is not a data structure
• no detection of left recursion: runtime error instead of a nice message
• no automatic left-factoring: do it by hand, or accept runtime overhead
    p1 = p <&> q1 <|> p <&> q2
    p2 = p <&> (q1 <|> q2)

discussion
• old advantages
  • concise, fpl-power, arbitrary look-ahead, context sensitive
• new advantages
  • unique and extendable parser state
  • one or more layers
  • decent error handling, simple error correction can be added
  • still efficient, overhead less than a factor of 2
  • non-determinism only when needed