Lecture 4: Implementing Lexers and Parsers

Programming Languages Course
Aarne Ranta (aarne@chalmers.se)

Book: 3.1, 3.3, 3.4, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9

Overview of this lecture

Implementing lexers by hand

Regular expressions

Lexer tools: Alex/JLex/Flex

Implementing top-down (LL) parsers by hand

LL parsing tables

Bottom-up (LR) parsing methods

LALR parsing tools

Debugging and conflict solving

Before parsing

Parsing is done in accordance with BNF rules such as

    Stm ::= "if" "(" Exp ")" Stm "else" Stm ;
    Exp ::= Exp "+" Exp ;
    Exp ::= Ident ;
    Exp ::= Integer ;

The terminals (in quotes) are tokens, and in the end all rules generate and recognize lists of tokens.

Ident and Integer are also special kinds of tokens.

The task of lexical analysis is to split the input string into tokens.

Spaces

Tokens are sometimes separated by spaces (or newlines or tabs), but not necessarily:

    if (i>0) {i=i+27;} else {i++;}
  
    if, (, i, >, 0, ), {, i, =, i, +, 27, ;, }, else, {, i, ++, ;, }

Good design: it should always be legal to have spaces, newlines, or tabs between tokens - even if not necessary.

Quiz: which programming languages violate this rule?

Lexical analysis in parser (don't do like this!)

It is possible for the parser to read input character by character; in this case, the grammar should tell where spaces are possible/necessary.

    Exp       ::= Exp Space "+" Space Exp ;
    Space     ::= [SpaceChar] ;
    SpaceChar ::= " " | "\n" | "\t" ;

But this really clutters the grammar!

Programming languages are usually designed in such a way that lexical analysis can be done before parsing, and parsing gets tokens as its input.

Classes of tokens

The lexer splits the code into tokens.

The lexer also assigns each token to a class. For example, the string

    i2=i1+271

results in the following list of classified tokens:

    identifier "i1"
    symbol "="
    identifier "i2"
    symbol "+"
    integer "271"

The lexer program

The lexer should read the source code character by character, and send tokens to the parser.

After each token, it should use the next character to decide what kind of token to read.

Longest match: read a token as long as you get characters that can belong to it.

Comments

Since comments (/* ... */) are not saved by the lexer, it might seem easiest to remove them in a separate, earlier pass.

However, if the language has string literals, we must not treat comment delimiters inside them as comments:

    "no comment /* here */ please!"

Thus comments, too, belong to the lexer.

The lexer here needs a 1-character lookahead: to decide whether / starts a comment, it must also inspect the character after it.

While reading a string literal, the lexer knows that it is inside a string, and does not start a comment in the middle of it.
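Here is a minimal Haskell sketch of this (with hypothetical names such as lexToks; not the lecture's code): the 1-character lookahead appears as the pattern '/':'*':cs, and a string literal is consumed in one piece so that comment delimiters inside it are left alone.

    -- A sketch: tokens are just strings here.
    lexToks :: String -> [String]
    lexToks ('/':'*':cs) = lexToks (skipComment cs)  -- 1-char lookahead at '*'
    lexToks ('"':cs)     = let (str, rest) = getString cs
                           in ('"' : str) : lexToks rest
    lexToks (c:cs)       = [c] : lexToks cs          -- everything else: 1-char tokens
    lexToks []           = []

    -- Skip everything up to and including the closing */.
    skipComment :: String -> String
    skipComment ('*':'/':cs) = cs
    skipComment (_:cs)       = skipComment cs
    skipComment []           = []                    -- unterminated comment

    -- Consume a string literal up to the closing quote; since all characters
    -- in between go into the token, a /* inside the string is never a comment.
    getString :: String -> (String, String)
    getString ('\\':c:cs) = let (t, rest) = getString cs in ('\\':c:t, rest)
    getString ('"':cs)    = ("\"", cs)
    getString (c:cs)      = let (t, rest) = getString cs in (c:t, rest)
    getString []          = ("", [])                 -- unterminated string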

Reserved words and symbols

Reserved symbols and words: the terminals in the BNF grammar.

There is a finite number of reserved symbols and words (why?)

Reserved symbols: + > ++ >> >= etc.

Longest match: if ++ is found, + is not returned.

This explains a detail of C++ template syntax: space is necessary in

    vector<vector<int> >

Reserved words: usually look like identifiers, and are longer than the special symbols: if, inline, while, etc.

Transition diagram for reading tokens and reserved words

Book 3.4.1, 3.4.3

A picture will be shown

Implementing lexers (don't do like this!)

Transition diagrams can be hand-coded by using case expressions.

Book 3.4.4 gives a Java example; here is a Haskell one

  import Data.Char (isAlpha, isDigit, isSpace)
  
  -- token type: reserved symbols/words, identifiers, integer literals
  data Tok = TS String | TId String | TI Integer
    deriving Show
  
  lexCmm :: String -> [Tok]
  lexCmm s = case s of
    c:cs   | isSpace c     -> lexCmm cs              -- skip whitespace
    c:_    | isAlpha c     -> getId s                -- identifier or reserved word
    c:_    | isDigit c     -> getInt s               -- integer literal
    c:d:cs | isSymb [c,d]  -> TS [c,d] : lexCmm cs   -- longest match: 2 chars first
    c:cs   | isSymb [c]    -> TS [c]   : lexCmm cs
    _                      -> []  -- covers end of file and unknown characters
   where
    getId s  = lx i        : lexCmm cs where (i,cs) = span isIdChar s
    getInt s = TI (read i) : lexCmm cs where (i,cs) = span isDigit s
    isIdChar c = isAlpha c || isDigit c
    lx i = if isReservedWord i then TS i else TId i
  
  isSymb :: String -> Bool
  isSymb s = elem s $ words "++ -- == <= >= { } = , ; + * - ( ) < >"
  
  isReservedWord :: String -> Bool
  isReservedWord w = elem w $ words "else if int main printInt return while"

But the code is messy, and hard to read and maintain.

Regular expressions

Transition diagrams are a good way of specifying and documenting a lexer.

More concisely, we can use regular expressions:

    Tokens = Space (Token Space)*
    Token  = TInt | TId | TKey | TSpec 
    TInt   = Digit Digit*
    Digit  = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
    TId    = Letter IdChar*
    Letter = 'A' | ... | 'Z' | 'a' | ... | 'z'
    IdChar = Letter | Digit
    TKey   = 'i' 'f' | 'e' 'l' 's' 'e' | ...
    TSpec  = '+''+' | '+' | ...
    Space  = (' ' | '\n' | '\t')*

What is more, these expressions can be compiled into a program that performs the lexing! The program implements a transition diagram - more precisely, a finite state automaton.

The syntax and semantics of regular expressions

    name      notation  semantics             verbally
    symbol    'a'       {a}                   the symbol 'a'
    sequence  A B       {xy | x : A, y : B}   A followed by B
    union     A | B     A U B                 A or B
    closure   A*        {A^n | n = 0,1,2,..}  any number of A's
    empty     eps       {[]}                  the empty string

This is BNFC notation.

The semantics is a regular language, a set of strings of symbols. One often writes L(E) for the language denoted by the expression E.

There are variants of the notation, especially for symbols and empties. Quoting symbols is not so common, since it lengthens the expressions - but it is very good for clarity!
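To make this semantics concrete, here is a minimal Haskell sketch (not part of BNFC or any lexer tool) that interprets regular expressions directly, using Brzozowski derivatives; a lexer generator instead compiles them into an automaton, but the recognized language is the same.

    -- Regular expressions, with an explicit empty language (needed for derivatives).
    data RE = Empty | Eps | Sym Char | Seq RE RE | Alt RE RE | Star RE

    -- Does L(e) contain the empty string?
    nullable :: RE -> Bool
    nullable Empty     = False
    nullable Eps       = True
    nullable (Sym _)   = False
    nullable (Seq a b) = nullable a && nullable b
    nullable (Alt a b) = nullable a || nullable b
    nullable (Star _)  = True

    -- deriv c e denotes { w | c:w is in L(e) }.
    deriv :: Char -> RE -> RE
    deriv _ Empty     = Empty
    deriv _ Eps       = Empty
    deriv c (Sym a)   = if c == a then Eps else Empty
    deriv c (Seq a b)
      | nullable a    = Alt (Seq (deriv c a) b) (deriv c b)
      | otherwise     = Seq (deriv c a) b
    deriv c (Alt a b) = Alt (deriv c a) (deriv c b)
    deriv c (Star a)  = Seq (deriv c a) (Star a)

    -- A string is in L(e) iff deriving by all its symbols leaves a nullable expression.
    matches :: RE -> String -> Bool
    matches e = nullable . foldl (flip deriv) e

The derivatives of an expression play the role of the states of the automaton.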

More notations for regular expressions

    name              notation    definition/semantics       verbally
    symbol range      ['c'..'t']  'c'|...|'t'                any symbol from 'c' to 't'
    symbol            char        {x | x is a character}     any character
    letter            letter      ['A'..'Z'] | ['a'..'z']    any letter
    digit             digit       ['0'..'9']                 any digit
    option            A?          A | eps                    optional A
    n-sequence        A^n         {x1...xn | xi : A}         n A's
    nonempty closure  A+          A A*                       any positive number of A's
    difference        A - B       {x | x : A & not (x : B)}  the difference of A and B

The symbol range and n-sequence are not supported by BNFC.

"Any character" in BNFC is ASCII character 0..255.

State-of-the-art lexer implementation

Define each token type by using regular expressions

Compile the regular expressions into a code for a transition diagram (finite-state automaton).

Compile the resulting code into a binary.

    regexp in BNFC file
  
             |
             v            bnfc program
  
    Flex/Jlex/Alex file
  
             |
             v            flex/jlex/alex program
  
    C/C++/Java/Haskell file
  
             |            
             v            gcc/javac/ghc program
  
    binary file
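For example, with a hypothetical grammar file Cmm.cf, the Haskell route of this diagram can be run as follows:

    bnfc -m --haskell Cmm.cf   # writes LexCmm.x (Alex), ParCmm.y (Happy), and a Makefile
    make                       # runs alex, happy, and ghc on the generated files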

Pros and cons of hand-written lexer code

+ full control over the program (lookahead)

+ usually smaller program

+ can be quick to create (if the lexer is simple)

+ can be made more powerful than regular expressions

- difficult to get right (if the lexer is complex)

- much extra work to include position information etc.

- may compromise performance (lookahead)

- not self-documenting

- not portable

- lexer generators are state-of-the-art in compilers

Thus the usual trade-offs between hand-made and generated code hold.

In this course, we use generated lexers in the labs.

A look at lexer generation tools

You can take a look at the web pages, manuals, and bnfc output for

Alex: Haskell

JLex: Java

Flex: C, C++

The BNF Converter web page has links to all these tools.

Lexer definitions in BNFC

Reserved words and symbols: the terminals in the rules.

Built-in types

    Integer Double String Char

Comments

    comment "/*" "*/" ;
    comment "//" ;

Token

    token Id (letter (letter | digit | '_')*) ;

Position token (for better error messages in later phases)

    position token Id (letter (letter | digit | '_')*) ;

Built-in lexical types of BNFC

    type     definition
    Ident    letter (letter | digit | '_' | '\'')*
    Integer  digit+
    Double   digit+ '.' digit+ ('e' '-'? digit+)?
    String   '"' ((char - ["\"\\"]) | ('\\' ["\"\\nt"]))* '"'
    Char     '\'' ((char - ["'\\"]) | ('\\' ["'\\nt"])) '\''

Parsing

From token list to syntax tree.

Conceptually: first to a parse tree.

In practice, at the same time: to an abstract syntax tree.

Sometimes even: to type-checked tree, to intermediate code... but we will postpone such semantic actions to syntax-directed translation (next lecture).

How to implement parsing

Like lexing: use standard tools.

Input to the tool: a BNF grammar (+ semantic actions).

Output of the tool: parser code in target programming language.

As with lexers, it is possible to write a parser by hand - but this is tedious and error-prone. It is also easy to end up with inefficiency and nontermination.

Top-down (predictive, recursive descent) parsing

Structure: for each category, write a function that constructs a tree in a way depending on the first token.

Example grammar:

    SIf.    Stm ::= "if" "(" Exp ")" Stm ;
    SWhile. Stm ::= "while" "(" Exp ")" Stm ;
    SExp.   Stm ::= Exp ;
    EInt.   Exp ::= Integer ;

Two functions:

    Stm pStm():
      next == "if"    -> ...  build tree with SIf
      next == "while" -> ...  build tree with SWhile
      next is integer -> ...  build tree with SExp
  
    Exp pExp():
      next is integer k -> return EInt k

How do we fill the three dots?

Recursive descent parsing, completed

Follow the structure of the grammar

    SIf.    Stm ::= "if" "(" Exp ")" Stm ;
    SWhile. Stm ::= "while" "(" Exp ")" Stm ;
    SExp.   Stm ::= Exp ";" ;

We also need an auxiliary function for checking and skipping expected tokens, and a global variable next holding the next token.

    void ignore(Token t):
      check next==t
      shift next to the token after

There is a parsing function for each category, calling recursively other parsing functions and itself, as guided by the grammar.

    Stm pStm():
      next == "if"    -> 
        ignore("if")
        ignore("(")
        Exp e = pExp()
        ignore(")")
        Stm s = pStm()
        return SIf(e,s)
  
    //next == "while" similar
  
      next is integer -> 
        Exp e = pExp()
        return SExp(e)

Implementing recursive descent parsing

The pseudocode on the previous page translates directly into both imperative and functional code.

In Java, see Book 2.4.2, 4.4.1

In Haskell, you can use monads or parser combinators.
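For instance, here is a sketch of the pseudocode in plain Haskell (assumed token and tree types, no monads), threading the remaining tokens explicitly:

    data Stm = SIf Exp Stm | SWhile Exp Stm | SExp Exp deriving Show
    data Exp = EInt Integer                            deriving Show

    -- the pseudocode's ignore: check and drop an expected token
    ignore :: String -> [String] -> [String]
    ignore t (t':ts) | t == t' = ts
    ignore t ts                = error ("expected " ++ t ++ " at " ++ show ts)

    pStm :: [String] -> (Stm, [String])
    pStm ("if":ts0) =
      let (e, ts1) = pExp (ignore "(" ts0)
          (s, ts2) = pStm (ignore ")" ts1)
      in  (SIf e s, ts2)
    pStm ("while":ts0) =
      let (e, ts1) = pExp (ignore "(" ts0)
          (s, ts2) = pStm (ignore ")" ts1)
      in  (SWhile e s, ts2)
    pStm ts0 =
      let (e, ts1) = pExp ts0
      in  (SExp e, ignore ";" ts1)

    pExp :: [String] -> (Exp, [String])
    pExp (t:ts) | not (null t) && all (`elem` ['0'..'9']) t = (EInt (read t), ts)
    pExp ts = error ("expected integer at " ++ show ts)

For example, pStm (words "while ( 1 ) 6 ;") evaluates to (SWhile (EInt 1) (SExp (EInt 6)), []).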

In either case, this is a quick way to implement small parsers.

But it is quite limited, as we shall see.

Conflicts

Example:

    SIf.     Stm ::= "if" "(" Exp ")" Stm
    SIfElse. Stm ::= "if" "(" Exp ")" Stm "else" Stm

Which one should we choose when we see "if"?

This situation is called a conflict.

It can be solved by left factoring - sharing the common left part of the rules (Book 4.3.4):

    SIE.   Stm  ::= "if" "(" Exp ")" Stm Rest
    RElse. Rest ::= "else" Stm
    REmp.  Rest ::= 

To get the original abstract syntax, we need a function that depends on Rest.

    f(SIE exp stm REmp)         = SIf     exp stm
    f(SIE exp stm (RElse stm2)) = SIfElse exp stm stm2
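In Haskell, f could look as follows (a sketch with assumed types, where the substatements are taken to be already in the original syntax):

    data Exp  = EInt Integer                       -- some expression type
    data Stm  = SIf Exp Stm | SIfElse Exp Stm Stm  -- original abstract syntax
    data Stm' = SIE Exp Stm Rest                   -- left-factored syntax
    data Rest = RElse Stm | REmp

    f :: Stm' -> Stm
    f (SIE e s REmp)       = SIf     e s
    f (SIE e s (RElse s2)) = SIfElse e s s2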

Left recursion

What can we do with the grammar

    Exp ::= Exp "+" Integer
    Exp ::= Integer

To build an Exp, the parser first tries to build an Exp, and so on - it loops forever without consuming a single token.

Left recursion: the value category is itself the leftmost item in a production.

Solution: translate it away as follows (Book 4.3.3)

    Exp  ::= Integer Rest
    Rest ::= "+" Integer Rest
    Rest ::=

The category Rest now has right recursion, which is harmless.

Again, translation functions are needed to get the original abstract syntax.
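Such a translation function can be a left fold: a sketch (with assumed constructor names) that turns Integer-plus-Rest back into a left-leaning Exp:

    data Exp  = EPlus Exp Integer | EInt Integer   -- original abstract syntax
    data Rest = RPlus Integer Rest | REmpty        -- from the rewritten grammar

    -- Exp ::= Integer Rest: start from the first integer, then fold leftwards.
    expOf :: Integer -> Rest -> Exp
    expOf i r = go (EInt i) r
      where
        go e REmpty       = e
        go e (RPlus j r') = go (EPlus e j) r'

For 1 + 2 + 3, the parser builds RPlus 2 (RPlus 3 REmpty) after the initial 1, and expOf returns EPlus (EPlus (EInt 1) 2) 3, restoring left associativity.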

Notice: left recursion can be implicit (see Book, Algorithm 4.19)

    Exp ::= Val "+" Integer
    Val ::= Exp

LL(1) tables

Used for guiding recursive descent parsers

Example

    SIf.    Stm ::= "if" "(" Exp ")" Stm ;
    SWhile. Stm ::= "while" "(" Exp ")" Stm ;
    SExp.   Stm ::= Exp ";" ;
    EInt.   Exp ::= Integer ;

         if   while   Int   (  )  ;  $ (END)
    Stm  SIf  SWhile  SExp  -  -  -  -
    Exp  -    -       EInt  -  -  -  -

LL(1) conflict: a cell contains more than one rule.
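As a sketch (assumed encoding, with tokens as plain strings), such a table is just a partial function from a category and the next token to the rule to use:

    data Rule = SIf | SWhile | SExp | EInt deriving Show
    data Cat  = Stm | Exp

    ll1 :: Cat -> String -> Maybe Rule   -- Nothing = empty cell = parse error
    ll1 Stm "if"           = Just SIf
    ll1 Stm "while"        = Just SWhile
    ll1 Stm t | isIntTok t = Just SExp
    ll1 Exp t | isIntTok t = Just EInt
    ll1 _   _              = Nothing

    isIntTok :: String -> Bool
    isIntTok t = not (null t) && all (`elem` ['0'..'9']) t

An LL(1) conflict would mean that some cell - some (category, token) pair - needs two different rules.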

Derivations

Leftmost derivation of while (1) if (0) 6 ;

    Stm --> while ( Exp ) Stm
        --> while (   1 ) Stm
        --> while (   1 ) if ( Exp ) Stm
        --> while (   1 ) if (   0 ) Stm
        --> while (   1 ) if (   0 ) Exp ;
        --> while (   1 ) if (   0 )   6 ;

Rightmost derivation of while (1) if (0) 6 ;

    Stm --> while ( Exp ) Stm
        --> while ( Exp ) if ( Exp ) Stm
        --> while ( Exp ) if ( Exp ) Exp ;
        --> while ( Exp ) if ( Exp )   6 ;
        --> while ( Exp ) if (   0 )   6 ;
        --> while (   1 ) if (   0 )   6 ;

Top-down and bottom-up parsing

Top-down: try to "fill" the tree from root to the leaves.

Example: LL(k) - Left-to-right, Leftmost derivation, k token lookahead.

Task: to predict what production to use, after seeing k tokens

Bottom-up: build larger and larger trees by combining leaves.

Example: LR(k) - Left-to-right, Rightmost derivation, k token lookahead.

Task: decide what to do, based on the tokens seen so far and k tokens of lookahead.

Actions and tables

Book 4.6.3

The parser reads its input, and builds a stack of results. After every symbol, it chooses among four actions:

shift: read the next token and push it on the stack.

reduce: pop the items matching the right-hand side of a rule from the stack, and push its value category instead.

accept: at the end of the input, return the tree built so far.

reject: report a parse error.

State: position in a rule, marked by a dot, e.g.

    Stm ::= "if" "(" . Exp ")" Stm

Actions are collected into a table, similar to the transition function of a finite automaton: it has a row for each state and a column for each possible input symbol.

Example

Grammar

    1. Exp  ::= Exp "+" Exp1
    2. Exp  ::= Exp1
    3. Exp1 ::= Exp1 "*" Integer
    4. Exp1 ::= Integer

Parsing run

    stack           input      action
                    1 + 2 * 3  shift
    1               + 2 * 3    reduce 4
    Exp1            + 2 * 3    reduce 2
    Exp             + 2 * 3    shift
    Exp +           2 * 3      shift
    Exp + 2         * 3        reduce 4
    Exp + Exp1      * 3        shift
    Exp + Exp1 *    3          shift
    Exp + Exp1 * 3             reduce 3
    Exp + Exp1                 reduce 1
    Exp                        accept

LR(k), SLR, LALR(1)

LR(0) is the simplest variant, but not very powerful.

LR(1) can produce very large tables.

LR(k) for k > 1 is too large in practice.

SLR (Simple LR): like LR(0), but a reduce is performed only if the next token is in the follow set of the value category.

LALR(1): like LR(1), but merge states with identical items.

Expressivity: LR(0) < SLR < LALR(1) < LR(1).

Standard tools (YACC, Bison, CUP, Happy) use LALR(1).

Constructing an LALR(1) table

Simple example (the previous grammar; additional rule 0 for integer literals)

                                 +     *     $    int   Integer  Exp1  Exp
   3 Integer -> L_integ .        r0    r0    r0   -
   4 Exp1 -> Integer .           r4    r4    r4   -
   5 Exp1 -> Exp1 . '*' Integer  -     s8    -    -
   6 %start_pExp -> Exp .        s9    -     a    - 
     Exp -> Exp . '+' Exp1
   7 Exp -> Exp1 .               r2    s8    r2   -                 
     Exp1 -> Exp1 . '*' Integer 
   8 Exp1 -> Exp1 '*' . Integer  -     -     -    s3    g11
   9 Exp -> Exp '+' . Exp1       -     -     -    s3    g4       g10
  10 Exp -> Exp '+' Exp1 .       r1    s8    r1                       
     Exp1 -> Exp1 . '*' Integer                  
  11 Exp1 -> Exp1 '*' Integer .  r3    r3    r3

Conflicts

A conflict means that the parser has several possible actions to take.

Shift-reduce conflict: between shift and reduce action.

Reduce-reduce conflict: between two (or more) reduce actions.

The latter are more harmful, but also easier to eliminate.

Reduce-reduce conflicts

Plain ambiguities:

    EVar.   Exp ::= Ident ;
    ECons.  Exp ::= Ident ;

Solution: wait until the type checker to distinguish constants from variables.

Implicit ambiguities (Lab1 !):

    SExp.   Stm  ::= Exp ;
    SDecl.  Stm  ::= Decl ;
    DTyp.   Decl ::= Typ ;
    EId.    Exp  ::= Ident ;
    TId.    Typ  ::= Ident ;

Solution: actually, DTyp is only valid in function parameter lists - not as statements!

Shift-reduce conflicts

Classic example: dangling else

    SIf.     Stm ::= "if" "(" Exp ")" Stm
    SIfElse. Stm ::= "if" "(" Exp ")" Stm "else" Stm

What happens with this input at position (.)?

    if (x > 0) if (y < 8) return y ; . else return x ;

Two possible outcomes:

    shift:   if (x > 0) { if (y < 8) return y ;  else return x ;}
    reduce:  if (x > 0) { if (y < 8) return y ;} else return x ;

Standard tools always choose shift rather than reduce.

This could be avoided by rewriting the grammar (Book 4.3.2). But usually the conflict is tolerated as a "well-understood" one.

Parser tools

Parser tools such as Happy (Haskell), CUP (Java), and Bison (C/C++) read grammar files in formats with special syntax; the BNFC output shows what these files look like.

Info files

An info file shows the LALR(1) table in a human-readable form.

Create an info file:

    happy -i ParCPP.y

Check which rules are overshadowed in conflicts:

    grep "(reduce" ParConf.info

Interestingly, conflicts tend to concentrate on a few rules. If you have very many, do

    grep "(reduce" ParConf.info | sort | uniq

The conflicts are (usually) the same in all tools.

Since the info file contains no Haskell, you can use Happy's info file even if you mainly work with another tool.

Conflicts in C++ as shown in an info file

  State 34
  	Const -> Ident .                                    (rule 123)
  	Const -> Ident . '<' ListType '>'                   (rule 124)
  
  	'>>'           reduce using rule 123
  	'<'            shift, and enter state 176
  			(reduce using rule 123)
  
  State 176
  	Const -> Ident '<' . ListType '>'                   (rule 124)
  
  State 243
  	Stm -> 'if' '(' Exp ')' Stm .                       (rule 49)
  	Stm -> 'if' '(' Exp ')' Stm . 'else' Stm            (rule 50)
  	'do'           reduce using rule 49
  	'else'         shift, and enter state 249
  			(reduce using rule 49)
  
  State 249
  	Stm -> 'if' '(' Exp ')' Stm 'else' . Stm            (rule 50)

Debug tools

Create a debugging Happy parser:

    happy -da ParCPP.y

With Bison, you can use gdb (GNU Debugger), which traces back the execution to lines in the Bison source file.