Lecture 4: Implementing Lexers and Parsers
Programming Languages Course
Aarne Ranta (aarne@chalmers.se)

%!target:html
%!postproc(html): #NEW
%!postproc(html): #HR
Book: 3.1, 3.3, 3.4, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9

#NEW

==Overview of this lecture==

- Implementing lexers by hand
- Regular expressions
- Lexer tools: Alex/JLex/Flex
- Implementing top-down (LL) parsers by hand
- LL parsing tables
- Bottom-up (LR) parsing methods
- LALR parsing tools
- Debugging and conflict solving


#NEW

==Before parsing==

Parsing is done in accordance with BNF rules such as
```
Stm ::= "if" "(" Exp ")" Stm "else" Stm ;
Exp ::= Exp "+" Exp ;
Exp ::= Ident ;
Exp ::= Integer ;
```
The terminals (in quotes) are **tokens**, and in the end all rules generate and recognize lists of tokens. Also ``Ident`` and ``Integer`` are special kinds of tokens.

The task of **lexical analysis** is to split the input string into tokens.

#NEW

==Spaces==

Tokens are sometimes separated by spaces (or newlines, or tabs), but not necessarily:
```
if (i>0) {i=i+27;} else {i++;}

if, (, i, >, 0, ), {, i, =, i, +, 27, ;, }, else, {, i, ++, ;, }
```
Good design: it should //always// be //legal// to have spaces, newlines, or tabs between tokens - even where they are not necessary.

**Quiz**: which programming languages violate this rule?

#NEW

==Lexical analysis in parser (don't do like this!)==

It //is// possible for the parser to read its input character by character; in that case, the grammar has to tell where spaces are possible or necessary.
```
Exp       ::= Exp Space "+" Space Exp ;
Space     ::= [SpaceChar] ;
SpaceChar ::= " " | "\n" | "\t" ;
```
But this really clutters the grammar!

Programming languages are usually designed in such a way that lexical analysis can be done before parsing, and parsing gets tokens as its input.

#NEW

==Classes of tokens==

The lexer //splits// the code into tokens. It also //classifies// each token. For example, the string
```
i2=i1+271
```
results in the following list of classified tokens:
```
identifier "i2"
symbol     "="
identifier "i1"
symbol     "+"
integer    "271"
```

#NEW

==The lexer program==

The lexer should read the source code character by character, and send tokens to the parser. After each token, it should use the next character ``c`` to decide what kind of token to read:
- if ``c`` is a digit, collect an integer as long as you read digits
- if ``c`` is a letter, collect an identifier as long as you read identifier characters (digits, letters, ``'``)
- if ``c`` is a double quote, collect a string literal as long as you read characters other than a double quote
- if ``c`` is a space character (i.e. space, newline, or tab), ignore it and read the next character


**Longest match**: read a token //as long as// you get characters that can belong to it.

#NEW

==Comments==

Since comments (``/* ... */``) are not saved by the lexer, it might seem easy to remove them in a separate, earlier pass. However, if we have string literals, we must leave comment signs inside them untouched:
```
"no comment /* here */ please!"
```
Thus comments, too, belong to the lexer:
- if you read ``/`` and the next character is ``*``, ignore characters until you encounter ``*`` immediately followed by ``/``


The lexer here needs a 1-character **lookahead**: it must look at one character beyond the one it is currently deciding on.

While lexing a string literal, the lexer knows that it is inside a string, and does not start reading a comment in the middle of it.

#NEW

==Reserved words and symbols==

Reserved symbols and words: the terminals in the BNF grammar. There is a finite number of reserved symbols and words (why?).

Reserved symbols: ``+ > ++ >> >=`` etc.

Longest match: if ``++`` is found, ``+`` is not returned.
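Here is a minimal Haskell sketch of longest match for reserved symbols: try longer symbols before shorter ones. The symbol list and the function name are made up for the illustration.
```
import Data.List (isPrefixOf, sortOn)
import Data.Ord (Down (..))

-- hypothetical symbol table, sorted by descending length,
-- so that e.g. "++" is tried before "+"
symbols :: [String]
symbols = sortOn (Down . length) ["++", ">=", ">>", "+", ">", "="]

-- return the longest reserved symbol that starts the input,
-- together with the rest of the input
matchSymbol :: String -> Maybe (String, String)
matchSymbol s = case [sym | sym <- symbols, sym `isPrefixOf` s] of
  sym : _ -> Just (sym, drop (length sym) s)
  []      -> Nothing
```
For example, ``matchSymbol "++i"`` yields ``Just ("++", "i")`` rather than stopping at ``"+"``.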
Longest match also explains a detail of C++ template syntax: a space is necessary in
```
vector<vector<int> >
```
since ``>>`` would otherwise be lexed as a single token.

Reserved words: usually similar to identifiers, but longer than the special symbols: ``if inline while`` etc.

#NEW

==Transition diagram for reading tokens and reserved words==

Book 3.4.1, 3.4.3

A picture will be shown

#NEW

==Implementing lexers (don't do like this!)==

Transition diagrams can be hand-coded by using ``case`` expressions. Book 3.4.4 gives a Java example; here is a Haskell one
```
import Data.Char (isAlpha, isDigit, isSpace)

data Tok = TS String | TI Integer | TId String
  deriving Show

lexCmm :: String -> [Tok]
lexCmm s = case s of
  c:cs   | isSpace c    -> lexCmm cs
  c:cs   | isAlpha c    -> getId s
  c:cs   | isDigit c    -> getInt s
  c:d:cs | isSymb [c,d] -> TS [c,d] : lexCmm cs
  c:cs   | isSymb [c]   -> TS [c]   : lexCmm cs
  _                     -> []  -- covers end of file and unknown characters
 where
  getId s  = lx i : lexCmm cs        where (i,cs) = span isIdChar s
  getInt s = TI (read i) : lexCmm cs where (i,cs) = span isDigit s
  isIdChar c = isAlpha c || isDigit c
  lx i = if isReservedWord i then TS i else TId i
  isSymb s = elem s $ words "++ -- == <= >= { } = , ; + * - ( ) < >"
  isReservedWord w = elem w $ words "else if int main printInt return while"
```
But the code is messy, and hard to read and maintain.

#NEW

==Regular expressions==

A good way of specifying and documenting the lexer is transition diagrams. More concisely, we can use **regular expressions**:
```
Tokens = Space (Token Space)*
Token  = TInt | TId | TKey | TSpec
TInt   = Digit Digit*
Digit  = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
TId    = Letter IdChar*
Letter = 'A' | ... | 'Z' | 'a' | ... | 'z'
IdChar = Letter | Digit
TKey   = 'i' 'f' | 'e' 'l' 's' 'e' | ...
TSpec  = '+' '+' | '+' | ...
Space  = (' ' | '\n' | '\t')*
```
What is more, these expressions can be compiled into a program that performs the lexing! The program implements a transition diagram - more precisely, a **finite state automaton**.

#NEW

==The syntax and semantics of regular expressions==

|| name | notation | semantics | verbally ||
| symbol | 'a' | {a} | the symbol 'a'
| sequence | A B | ``{xy | x : A, y : B}`` | A followed by B
| union | ``A | B`` | A U B | A or B
| closure | ``A*`` | ``{A^n | n = 0,1,2,..}`` | any number of A's
| empty | eps | {[]} | the empty string

This is BNFC notation. The semantics is a **regular language**: a set of strings of symbols. One often writes //L(E)// for the language denoted by the expression //E//.

There are variants of the notation, especially for symbols and empties. Quoting symbols is not so common, since it lengthens the expressions - but it is very good for clarity!

#NEW

==More notations for regular expressions==

|| name | notation | definition/semantics | verbally ||
| symbol range | ``['c'..'t']`` | ``'c' | ... | 't'`` | any symbol from 'c' to 't'
| symbol | ``char`` | ``{x | x is a character}`` | any character
| letter | ``letter`` | ``['A'..'Z'] | ['a'..'z']`` | any letter
| digit | ``digit`` | ``['0'..'9']`` | any digit
| option | ``A?`` | ``A | eps`` | optional A
| n-sequence | ``A^n`` | ``{x1...xn | xi : A}`` | n A's
| nonempty closure | ``A+`` | ``A A*`` | any positive number of A's
| difference | ``A - B`` | ``{x | x : A & not (x : B)}`` | the difference of A and B

The symbol range and n-sequence are not supported by BNFC. "Any character" in BNFC is ASCII character 0..255.

#NEW

==State-of-the-art lexer implementation==

Define each token type by using regular expressions.

Compile the regular expressions into code for a transition diagram (a finite-state automaton).

Compile the resulting code into a binary.
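As an illustration of the first step, here is roughly what such token definitions look like in Alex. This is only a sketch: the module name ``LexSketch`` and the token type are made up, not taken from the labs.
```
{
module LexSketch where
}

%wrapper "basic"

$digit  = 0-9
$letter = [a-zA-Z]

tokens :-
  $white+                       ;                       -- skip spaces, newlines, tabs
  $digit+                       { \s -> TI (read s) }   -- integer literals
  $letter [$letter $digit \_]*  { \s -> TId s }         -- identifiers
  "++" | "+"                    { \s -> TS s }          -- longest match picks "++"

{
data Tok = TI Integer | TId String | TS String
  deriving Show
}
```
Running ``alex`` on this file yields a Haskell module with a function ``alexScanTokens :: String -> [Tok]`` that implements the corresponding finite-state automaton with longest match.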
The whole pipeline, from a BNFC source file to a binary:
```
regexp in BNFC file
    |
    v  bnfc program
Flex/JLex/Alex file
    |
    v  flex/jlex/alex program
C/C++/Java/Haskell file
    |
    v  gcc/javac/ghc program
binary file
```

#NEW

==Pros and cons of hand-written lexer code==

**+** full control over the program (lookahead)

**+** usually a smaller program

**+** can be quick to create (if the lexer is simple)

**+** can be made more powerful than regular expressions

**-** difficult to get right (if the lexer is complex)

**-** much extra work to include position information etc.

**-** may compromise performance (lookahead)

**-** not self-documenting

**-** not portable

**-** lexer generators are state-of-the-art in compilers

**Thus the usual trade-offs between hand-written and generated code hold.** In this course, we use generated lexers in the labs.

#NEW

==A look at lexer generation tools==

You can take a look at the web pages, manuals, and bnfc output for
- Alex: Haskell
- JLex: Java
- Flex: C, C++


The [BNF Converter web page http://www.cs.chalmers.se/~markus/BNFC/] has links to all these tools.

#NEW

==Lexer definitions in BNFC==

Reserved words and symbols: the terminals in the rules.

Built-in types
```
Integer Double String Char
```
Comments
```
comment "/*" "*/" ;
comment "//" ;
```
Token
```
token Id (letter (letter | digit | '_')*) ;
```
Position token (for better error messages in later phases)
```
position token Id (letter (letter | digit | '_')*) ;
```

#NEW

==Built-in lexical types of BNFC==

|| type | definition ||
| ``Ident`` | ``letter (letter | digit | '_' | '\'')*``
| ``Integer`` | ``digit+``
| ``Double`` | ``digit+ '.' digit+ ('e' '-'? digit+)?``
| ``String`` | ``'"' ((char - ["\"\\"]) | ('\\' ["\"\\nt"]))* '"'``
| ``Char`` | ``'\'' ((char - ["'\\"]) | ('\\' ["'\\nt"])) '\''``

#NEW

==Parsing==

From token list to syntax tree.

Initially: to a parse tree. In practice, at the same time: to an abstract syntax tree.

Sometimes even: to a type-checked tree, to intermediate code... but we will postpone such **semantic actions** to **syntax-directed translation** (next lecture).

#NEW

==How to implement parsing==

Like lexing: use standard tools.
- Happy for Haskell
- CUP for Java
- Bison for C and C++
- (YACC = Yet Another Compiler Compiler - the classic for C)


Input to the tool: a BNF grammar (+ semantic actions).

Output of the tool: parser code in the target programming language.

As with lexers, it //is// possible to write a parser by hand - but this is tedious and error-prone. It is also easy to end up with inefficiency and nontermination.

#NEW

==Top-down (predictive, recursive descent) parsing==

Structure: for each category, write a function that constructs a tree, in a way that depends on the first token.

Example grammar:
```
SIf.    Stm ::= "if" "(" Exp ")" Stm ;
SWhile. Stm ::= "while" "(" Exp ")" Stm ;
SExp.   Stm ::= Exp ;
EInt.   Exp ::= Integer ;
```
Two functions:
```
Stm pStm():
  next == "if"    -> ... build tree with SIf
  next == "while" -> ... build tree with SWhile
  next is integer -> ... build tree with SExp

Exp pExp():
  next is integer k -> return EInt k
```
How do we fill the three dots?

#NEW

==Recursive descent parsing, completed==

Follow the structure of the grammar
```
SIf.    Stm ::= "if" "(" Exp ")" Stm ;
SWhile. Stm ::= "while" "(" Exp ")" Stm ;
SExp.   Stm ::= Exp ";" ;
```
We also need an auxiliary function for just ignoring tokens, and a global variable ``next`` for the next token.
```
void ignore(Token t):
  check next == t
  shift next to the token after
```
There is a parsing function for each category, calling recursively the other parsing functions and itself, as guided by the grammar.
```
Stm pStm():
  next == "if" ->
    ignore("if")
    ignore("(")
    Exp e = pExp()
    ignore(")")
    Stm s = pStm()
    return SIf(e,s)
  // next == "while": similar
  next is integer ->
    Exp e = pExp()
    ignore(";")
    return SExp(e)
```

#NEW

==Implementing recursive descent parsing==

The pseudocode on the previous page is directly translatable to both imperative and functional code.

In Java, see Book 2.4.2, 4.4.1.

In Haskell, you can use monads or parser combinators.

In either case, this is a quick way to implement small parsers. But it is quite limited, as we shall see.

#NEW

==Conflicts==

Example:
```
SIf.     Stm ::= "if" "(" Exp ")" Stm
SIfElse. Stm ::= "if" "(" Exp ")" Stm "else" Stm
```
Which one should we choose when we see ``"if"``?

This situation is called a **conflict**. It can be solved by **left factoring** - sharing the common left part of the rules (Book 4.3.4):
```
SIE.   Stm  ::= "if" "(" Exp ")" Stm Rest
RElse. Rest ::= "else" Stm
REmp.  Rest ::=
```
To get the original abstract syntax, we need a function that depends on ``Rest``.
```
f(SIE exp stm REmp)         = SIf exp stm
f(SIE exp stm (RElse stm2)) = SIfElse exp stm stm2
```

#NEW

==Left recursion==

What can we do with the grammar
```
Exp ::= Exp "+" Integer
Exp ::= Integer
```
To build an ``Exp``, the parser first tries to build an ``Exp``, and so on.

**Left recursion**: the value category is itself the leftmost item in a production.

Solution: translate it away as follows (4.3.3)
```
Exp  ::= Integer Rest
Rest ::= "+" Integer Rest
Rest ::=
```
The category ``Rest`` now has **right recursion**, which is harmless. Again, translation functions are needed to get the original abstract syntax.

Notice: left recursion can be **implicit** (see Book, Algorithm 4.19)
```
Exp ::= Val "+" Integer
Val ::= Exp
```

#NEW

==LL(1) tables==

Used for guiding recursive descent parsers
- rows: categories
- columns: tokens
- cells: rules


Example
```
SIf.    Stm ::= "if" "(" Exp ")" Stm ;
SWhile. Stm ::= "while" "(" Exp ")" Stm ;
SExp.   Stm ::= Exp ";" ;
EInt.   Exp ::= Integer ;
```

|| - | if | while | Int | ( | ) | ; | $ (END) ||
| Stm | SIf | SWhile | SExp | - | - | - | -
| Exp | - | - | EInt | - | - | - | -

LL(1) conflict: a cell contains more than one rule.

#NEW

==Derivations==

Leftmost derivation of ``while(1) if (0) 6 ;``
```
Stm --> while ( Exp ) Stm
    --> while ( 1 ) Stm
    --> while ( 1 ) if ( Exp ) Stm
    --> while ( 1 ) if ( 0 ) Stm
    --> while ( 1 ) if ( 0 ) Exp ;
    --> while ( 1 ) if ( 0 ) 6 ;
```
Rightmost derivation of ``while(1) if (0) 6 ;``
```
Stm --> while ( Exp ) Stm
    --> while ( Exp ) if ( Exp ) Stm
    --> while ( Exp ) if ( Exp ) Exp ;
    --> while ( Exp ) if ( Exp ) 6 ;
    --> while ( Exp ) if ( 0 ) 6 ;
    --> while ( 1 ) if ( 0 ) 6 ;
```

#NEW

==Top-down and bottom-up parsing==

**Top-down**: try to "fill" the tree from the root to the leaves. Example: LL(k) - Left-to-right, Leftmost derivation, k tokens lookahead. Task: to predict which production to use, after seeing k tokens.

**Bottom-up**: build larger and larger trees by combining leaves. Example: LR(k) - Left-to-right, Rightmost derivation, k tokens lookahead. Task: to decide what to do, from the tokens seen so far and k tokens after.
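To make the top-down method concrete, here is a minimal Haskell rendering of the recursive-descent pseudocode above. It is just a sketch: the token type and the crude ``error`` handling are assumptions made for the example.
```
data Tok = TIf | TWhile | TLPar | TRPar | TSemi | TInt Integer
  deriving (Eq, Show)

data Stm = SIf Exp Stm | SWhile Exp Stm | SExp Exp  deriving Show
data Exp = EInt Integer                             deriving Show

-- each function consumes a prefix of the token list and
-- returns a tree together with the remaining tokens
pStm :: [Tok] -> (Stm, [Tok])
pStm (TIf : ts) =
  let (e, ts1) = pExp (ignore TLPar ts)
      (s, ts2) = pStm (ignore TRPar ts1)
  in  (SIf e s, ts2)
pStm (TWhile : ts) =
  let (e, ts1) = pExp (ignore TLPar ts)
      (s, ts2) = pStm (ignore TRPar ts1)
  in  (SWhile e s, ts2)
pStm ts =
  let (e, ts1) = pExp ts
  in  (SExp e, ignore TSemi ts1)

pExp :: [Tok] -> (Exp, [Tok])
pExp (TInt k : ts) = (EInt k, ts)
pExp _             = error "expected integer"

-- check and skip an expected token
ignore :: Tok -> [Tok] -> [Tok]
ignore t (t' : ts) | t == t' = ts
ignore t _ = error ("expected " ++ show t)
```
For example, ``pStm [TIf, TLPar, TInt 1, TRPar, TInt 6, TSemi]`` returns ``(SIf (EInt 1) (SExp (EInt 6)), [])``.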
#NEW

==Actions and tables==

Book 4.6.3

The parser reads its input, and builds a stack of results. After every symbol, it chooses among the following actions:
- **shift**: read one more symbol
- **reduce**: pop elements from the stack and replace them by a value
- **accept**: return the single value on the stack when no input is left
- **goto**: jump to another state and act accordingly
- **reject**: report that there is input left but no move to take, or that the input is finished but the stack does not consist of a single value


State: position in a rule, marked by a dot, e.g.
```
Stm ::= "if" "(" . Exp ")" Stm
```
Actions are collected into a table, similar to the transition function of an NFA, with a column for each possible input symbol:
- rows: states
- columns: tokens
- cells: actions


#NEW

==Example==

Grammar
```
1. Exp  ::= Exp "+" Exp1
2. Exp  ::= Exp1
3. Exp1 ::= Exp1 "*" Integer
4. Exp1 ::= Integer
```
Parsing run

|| stack | input | action ||
|  | 1 + 2 * 3 | shift
| 1 | + 2 * 3 | reduce 4
| Exp1 | + 2 * 3 | reduce 2
| Exp | + 2 * 3 | shift
| Exp + | 2 * 3 | shift
| Exp + 2 | * 3 | reduce 4
| Exp + Exp1 | * 3 | shift
| Exp + Exp1 * | 3 | shift
| Exp + Exp1 * 3 |  | reduce 3
| Exp + Exp1 |  | reduce 1
| Exp |  | accept

#NEW

==LR(k), SLR, LALR(1)==

LR(0) is the simplest variant, but not very powerful.

LR(1) can produce very large tables.

LR(k) for k > 1 is too large in practice.

SLR (Simple LR): like LR(0), but reduce is conditioned on the follow set.

LALR(1): like LR(1), but merge states with identical items.

Expressivity:
- LR(0) < SLR < LALR(1) < LR(1) < LR(2) ...
- LL(k) < LR(k)
- none of these can handle ambiguous grammars


Standard tools (YACC, Bison, CUP, Happy) use LALR(1).

#NEW

==Constructing an LALR(1) table==

Simple example (the previous grammar; additional rule 0 for integer literals)
```
                                   +    *    $    int  Integer  Exp1  Exp

3   Integer -> L_integ .           r0   r0   r0   -
4   Exp1 -> Integer .              r4   r4   r4   -
5   Exp1 -> Exp1 . '*' Integer     -    s8   -    -
6   %start_pExp -> Exp .           s9   -    a    -
    Exp -> Exp . '+' Exp1
7   Exp -> Exp1 .                  r2   s8   r2   -
    Exp1 -> Exp1 . '*' Integer
8   Exp1 -> Exp1 '*' . Integer     -    -    -    s3   g11
9   Exp -> Exp '+' . Exp1          -    -    -    s3   g4       g10
10  Exp -> Exp '+' Exp1 .          r1   s8   r1
    Exp1 -> Exp1 . '*' Integer
11  Exp1 -> Exp1 '*' Integer .     r3   r3   r3
```

#NEW

==Conflicts==

The parser has several possible actions to take.

Shift-reduce conflict: between a shift and a reduce action.

Reduce-reduce conflict: between two (or more) reduce actions.

The latter are more harmful, but also easier to eliminate.

#NEW

==Reduce-reduce conflicts==

Plain ambiguities:
```
EVar.  Exp ::= Ident ;
ECons. Exp ::= Ident ;
```
Solution: wait until the type checker to distinguish constants from variables.

Implicit ambiguities (Lab1!):
```
SExp.  Stm  ::= Exp ;
SDecl. Stm  ::= Decl ;
DTyp.  Decl ::= Typ ;
EId.   Exp  ::= Ident ;
TId.   Typ  ::= Ident ;
```
Solution: actually, ``DTyp`` is only valid in function parameter lists - not as a statement!

#NEW

==Shift-reduce conflicts==

Classic example: dangling else
```
SIf.     Stm ::= "if" "(" Exp ")" Stm
SIfElse. Stm ::= "if" "(" Exp ")" Stm "else" Stm
```
What happens with this input at position (.)?
```
if (x > 0) if (y < 8) return y ; . else return x ;
```
Two possible outcomes:
```
shift:  if (x > 0) { if (y < 8) return y ; else return x ; }
reduce: if (x > 0) { if (y < 8) return y ; } else return x ;
```
Standard tools always choose shift rather than reduce. This could be avoided by rewriting the grammar (Book 4.3.2). But usually the conflict is tolerated as a "well-understood" one.
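To see shift and reduce in actual code, here is a toy Haskell engine for the example expression grammar above, with the table's decisions hard-coded as pattern matches and one token of lookahead. This is only an illustration of the actions - generated LALR parsers instead interpret a state table.
```
data Tok = TInt Integer | TPlus | TTimes
  deriving Show

data Exp  = EAdd Exp Exp1 | EExp1 Exp1        deriving Show
data Exp1 = EMul Exp1 Integer | EInt Integer  deriving Show

-- stack items; the top of the stack is the head of the list
data Item = ITok Tok | IExp Exp | IExp1 Exp1
  deriving Show

-- '*' in the lookahead blocks the reduces of rules 1 and 2
noStar :: [Tok] -> Bool
noStar (TTimes : _) = False
noStar _            = True

parse :: [Item] -> [Tok] -> Exp
-- reduce 3: Exp1 ::= Exp1 "*" Integer
parse (ITok (TInt n) : ITok TTimes : IExp1 e : st) ts = parse (IExp1 (EMul e n) : st) ts
-- reduce 4: Exp1 ::= Integer
parse (ITok (TInt n) : st) ts = parse (IExp1 (EInt n) : st) ts
-- reduce 1: Exp ::= Exp "+" Exp1 (unless '*' follows: then shift wins)
parse (IExp1 e1 : ITok TPlus : IExp e : st) ts
  | noStar ts = parse (IExp (EAdd e e1) : st) ts
-- reduce 2: Exp ::= Exp1 (again only if '*' does not follow)
parse (IExp1 e1 : st) ts
  | noStar ts = parse (IExp (EExp1 e1) : st) ts
-- accept: no input left, a single value on the stack
parse [IExp e] [] = e
-- shift: read one more token onto the stack
parse st (t : ts) = parse (ITok t : st) ts
-- reject
parse _ _ = error "parse error"
```
``parse [] [TInt 1, TPlus, TInt 2, TTimes, TInt 3]`` goes through exactly the shifts and reduces of the parsing run shown earlier.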
#NEW

==Parser tools==

File formats with a special syntax:
- Happy for Haskell as host language
- Bison for C/C++
- CUP for Java


What the files look like:
- BNF rules with **semantic actions** in the host language (next lecture)
```
Exp :: { Exp }
Exp : Exp '+' Exp1  { EAdd $1 $3 }
    | Exp1          { $1 }
```
- interface to the lexer via special rules, e.g.
```
L_integ { PT _ (TI $$) }

Integer :: { Integer }
Integer : L_integ { (read $1) :: Integer }
```

#NEW

==Info files==

An info file shows the LALR(1) table in a human-readable form.

Create an info file:
```
happy -i ParCPP.y
```
Check which rules are overshadowed in conflicts:
```
grep "(reduce" ParCPP.info
```
Interestingly, conflicts tend to concentrate on a few rules. If you have very many, do
```
grep "(reduce" ParCPP.info | sort | uniq
```
The conflicts are (usually) the same in all tools. Since the info file contains no Haskell, you can use Happy's info file even if you principally work with another tool.

#NEW

==Conflicts in C++ as shown in an info file==

```
State 34

    Const -> Ident .                            (rule 123)
    Const -> Ident . '<' ListType '>'           (rule 124)

    '>>'    reduce using rule 123
    '<'     shift, and enter state 176
            (reduce using rule 123)

State 176

    Const -> Ident '<' . ListType '>'           (rule 124)

State 243

    Stm -> 'if' '(' Exp ')' Stm .               (rule 49)
    Stm -> 'if' '(' Exp ')' Stm . 'else' Stm    (rule 50)

    'do'    reduce using rule 49
    'else'  shift, and enter state 249
            (reduce using rule 49)

State 249

    Stm -> 'if' '(' Exp ')' Stm 'else' . Stm    (rule 50)
```

#NEW

==Debug tools==

Create a debugging Happy parser:
```
happy -da ParCPP.y
```
With Bison, you can use ``gdb`` (the GNU Debugger), which traces the execution back to lines in the Bison source file.
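Finally, a sketch of how a generated parser is typically invoked from a main program. The module and function names below follow what BNFC's Haskell backend generates for a hypothetical grammar ``Cmm.cf`` (``myLexer``, ``pExp``, the ``Err`` monad), but the details vary between versions - treat them as assumptions.
```
import ParCmm (pExp, myLexer)  -- Happy-generated parser and lexer entry point
import ErrM   (Err (..))       -- BNFC's error monad (older versions)

main :: IO ()
main = do
  s <- getContents
  case pExp (myLexer s) of
    Ok  tree -> print tree
    Bad msg  -> putStrLn ("parse error: " ++ msg)
```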