# Lecture 4: Implementing Lexers and Parsers

Programming Languages Course
Aarne Ranta (aarne@chalmers.se)

Book: 3.1, 3.3, 3.4, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9

## Overview of this lecture

• Implementing lexers by hand
• Regular expressions
• Lexer tools: Alex/JLex/Flex
• Implementing top-down (LL) parsers by hand
• LL parsing tables
• Bottom-up (LR) parsing methods
• LALR parsing tools
• Debugging and conflict solving

## Before parsing

Parsing is done in accordance with BNF rules such as

```
Stm ::= "if" "(" Exp ")" Stm "else" Stm ;
Exp ::= Exp "+" Exp ;
Exp ::= Ident ;
Exp ::= Integer ;
```

The terminals (in quotes) are tokens, and in the end all rules generate and recognize lists of tokens.

`Ident` and `Integer` are also special kinds of tokens.

The task of lexical analysis is to split the input string into tokens.

## Spaces

Tokens are sometimes separated by spaces (/newlines/tabs), but not necessarily:

```
if (i>0) {i=i+27;} else {i++;}

if, (, i, >, 0, ), {, i, =, i, +, 27, ;, }, else, {, i, ++, ;, }
```

Good design: it should always be legal to have spaces, newlines, or tabs between tokens - even if not necessary.

Quiz: which programming languages violate this rule?

## Lexical analysis in parser (don't do like this!)

It is possible for the parser to read input character by character; in this case, the grammar should tell where spaces are possible/necessary.

```
Exp       ::= Exp Space "+" Space Exp ;
Space     ::= [SpaceChar] ;
SpaceChar ::= " " | "\n" | "\t" ;
```

But this really clutters the grammar!

Programming languages are usually designed in such a way that lexical analysis can be done before parsing, and parsing gets tokens as its input.

## Classes of tokens

The lexer splits the code into tokens.

The lexer also assigns each token to a class. For example, the string

```
i2=i1+271
```

results in the following list of classified tokens:

```
identifier "i2"
symbol "="
identifier "i1"
symbol "+"
integer "271"
```

## The lexer program

The lexer should read the source code character by character, and send tokens to the parser.

After each token, it should use the next character `c` to decide what kind of token to read.

• if `c` is a digit, collect an integer as long as you read digits
• if `c` is a letter, collect an identifier as long as you read identifier characters (digit, letter, `'`)
• if `c` is a double quote, collect a string literal as long as you read characters other than a double quote
• if `c` is a space character (i.e. space, newline, or tab), ignore it and read the next character

Longest match: read a token as long as you get characters that can belong to it.

Since comments (`/* ... */`) are not passed on to the parser, an easy way to remove them is a separate, earlier pass.

However, if we have string literals, comment signs inside them must be left untouched:

```
"no comment /* here */ please!"
```

Thus also comments belong to the lexer:

• if you read `/` and the next character is `*`, ignore all characters until you encounter `*` immediately followed by `/`

The lexer here needs a 1-character lookahead: it must inspect one character beyond the current one.

While reading a string literal, the lexer knows it is inside a string, and does not start reading a comment in the middle of it.
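These two rules can be sketched as a separate pre-pass in Haskell. This is an illustration, not code from the book; `skipComments` is an invented name, and error handling for unterminated comments and strings is omitted:

```
-- Hypothetical pre-pass: strip /* ... */ comments from the source,
-- but leave string literals (and any comment signs inside them) intact.
skipComments :: String -> String
skipComments s = case s of
  '"':cs     -> '"' : inString cs      -- entering a string literal
  '/':'*':cs -> inComment cs           -- 1-character lookahead: '/' then '*'
  c:cs       -> c : skipComments cs
  []         -> []
  where
    inString ('\\':c:cs)   = '\\' : c : inString cs  -- keep escaped characters
    inString ('"':cs)      = '"' : skipComments cs   -- string literal ends
    inString (c:cs)        = c : inString cs
    inString []            = []                      -- unterminated string
    inComment ('*':'/':cs) = skipComments cs         -- comment ends
    inComment (_:cs)       = inComment cs
    inComment []           = []                      -- unterminated comment
```

On the example above, `skipComments` leaves `"no comment /* here */ please!"` unchanged, because the comment signs occur inside a string literal.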

## Reserved words and symbols

Reserved symbols and words: the terminals in the BNF grammar.

There is a finite number of reserved symbols and words (why?)

Reserved symbols: `+ > ++ >> >=` etc.

Longest match: if `++` is found, `+` is not returned.

This explains a detail of C++ template syntax: space is necessary in

```
vector<vector<int> >
```

Reserved words: usually similar to identifiers, longer than special symbols: `if inline while` etc

## Transition diagram for reading tokens and reserved words

Book 3.4.1, 3.4.3

A picture will be shown

## Implementing lexers (don't do like this!)

Transition diagrams can be hand-coded by using `case` expressions.

Book 3.4.4 gives a Java example; here is a Haskell one

```
import Data.Char (isAlpha, isDigit, isSpace)

data Tok = TS String | TI Integer | TId String   -- symbol, integer, identifier
  deriving (Eq, Show)

lexCmm :: String -> [Tok]
lexCmm s = case s of
  c:cs   | isSpace c    -> lexCmm cs
  c:_    | isAlpha c    -> getId s
  c:_    | isDigit c    -> getInt s
  c:d:cs | isSymb [c,d] -> TS [c,d] : lexCmm cs
  c:cs   | isSymb [c]   -> TS [c]   : lexCmm cs
  _                     -> []  -- covers end of file and unknown characters
 where
  getId  s' = lx i : lexCmm cs        where (i,cs) = span isIdChar s'
  getInt s' = TI (read i) : lexCmm cs where (i,cs) = span isDigit s'
  isIdChar c = isAlpha c || isDigit c
  lx i = if isReservedWord i then TS i else TId i

isSymb, isReservedWord :: String -> Bool
isSymb s = elem s $ words "++ -- == <= >= { } = , ; + * - ( ) < >"
isReservedWord w = elem w $ words "else if int main printInt return while"
```

But the code is messy, and hard to read and maintain.

## Regular expressions

A good way of specifying and documenting the lexer is transition diagrams.

More concisely, we can use regular expressions:

```
Tokens = Space (Token Space)*
Token  = TInt | TId | TKey | TSpec
TInt   = Digit Digit*
Digit  = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
TId    = Letter IdChar*
Letter = 'A' | ... | 'Z' | 'a' | ... | 'z'
IdChar = Letter | Digit
TKey   = 'i' 'f' | 'e' 'l' 's' 'e' | ...
TSpec  = '+' '+' | '+' | ...
Space  = (' ' | '\n' | '\t')*
```

What is more, these expressions can be compiled into a program that performs the lexing! The program implements a transition diagram - more precisely, a finite state automaton.
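To make this concrete, here is a minimal Haskell sketch of regular expressions with their semantics, using Brzozowski derivatives to decide matching. This illustrates the semantics above; it is not how Alex or Flex are actually implemented (they compile the expressions to a finite-state automaton up front):

```
-- Regular expressions as a data type, following the notation above.
data RE = None            -- the empty language (matches nothing)
        | Eps             -- the empty string
        | Sym Char        -- a single symbol
        | Seq RE RE       -- sequence
        | Alt RE RE       -- union
        | Star RE         -- closure

-- Does the expression accept the empty string?
nullable :: RE -> Bool
nullable None      = False
nullable Eps       = True
nullable (Sym _)   = False
nullable (Seq a b) = nullable a && nullable b
nullable (Alt a b) = nullable a || nullable b
nullable (Star _)  = True

-- The Brzozowski derivative: what remains to be matched after symbol c.
deriv :: Char -> RE -> RE
deriv _ None      = None
deriv _ Eps       = None
deriv c (Sym a)   = if c == a then Eps else None
deriv c (Seq a b) = if nullable a
                    then Alt (Seq (deriv c a) b) (deriv c b)
                    else Seq (deriv c a) b
deriv c (Alt a b) = Alt (deriv c a) (deriv c b)
deriv c (Star a)  = Seq (deriv c a) (Star a)

-- A string matches if, after deriving by all its symbols,
-- the remaining expression accepts the empty string.
match :: RE -> String -> Bool
match r = nullable . foldl (flip deriv) r

-- TInt = Digit Digit* from the definitions above
digit, tInt :: RE
digit = foldr1 Alt (map Sym "0123456789")
tInt  = Seq digit (Star digit)
```

For example, `match tInt "271"` is `True`, while `match tInt "27a"` and `match tInt ""` are `False`.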

## The syntax and semantics of regular expressions

```
name      notation  semantics              verbally
symbol    'a'       {a}                    the symbol 'a'
sequence  A B       {xy | x : A, y : B}    A followed by B
union     A | B     A U B                  A or B
closure   A*        {A^n | n = 0,1,2,...}  any number of A's
empty     eps       {[]}                   the empty string
```

This is BNFC notation.

The semantics is a regular language, a set of strings of symbols. One often writes L(E) for the language denoted by the expression E.

There are variants of the notation, especially for symbols and empties. Quoting symbols is not so common, since it lengthens the expressions - but it is very good for clarity!

## More notations for regular expressions

```
name              notation    definition/semantics        verbally
symbol range      ['c'..'t']  'c' | ... | 't'             any symbol from 'c' to 't'
symbol            char        {x | x is a character}      any character
letter            letter      ['A'..'Z'] | ['a'..'z']     any letter
digit             digit       ['0'..'9']                  any digit
option            A?          A | eps                     optional A
n-sequence        A^n         {x1...xn | xi : A}          n A's
nonempty closure  A+          A A*                        any positive number of A's
difference        A - B       {x | x : A & not (x : B)}   the difference of A and B
```

The symbol range and n-sequence are not supported by BNFC.

"Any character" in BNFC means any 8-bit character (codes 0..255).

## State-of-the-art lexer implementation

Define each token type by using regular expressions

Compile the regular expressions into a code for a transition diagram (finite-state automaton).

Compile the resulting code into a binary.

```
regexp in BNFC file
   |
   v   bnfc program
Flex/JLex/Alex file
   |
   v   flex/jlex/alex program
C/Java/Haskell lexer file
   |
   v   gcc/javac/ghc program
binary file
```

## Pros and cons of hand-written lexer code

+ full control over the program (e.g. lookahead)

+ usually smaller program

+ can be quick to create (if the lexer is simple)

+ can be made more powerful than regular expressions

- difficult to get right (if the lexer is complex)

- much extra work to include position information etc.

- not self-documenting

- not portable

- lexer generators are state-of-the-art in compilers

In this course, we use generated lexers in the labs.

## A look at lexer generation tools

You can take a look at the web pages, manuals, and bnfc output for

JLex: Java

Flex: C, C++

The BNF Converter web page has links to all these tools.

## Lexer definitions in BNFC

Reserved word and symbols: terminals in the rules.

Built-in types

```
Integer Double String Char
```

Comments

```
comment "/*" "*/" ;
comment "//" ;
```

Token

```
token Id (letter (letter | digit | '_')*)
```

Position token (for better error messages in later phases)

```
position token Id (letter (letter | digit | '_')*)
```

## Built-in lexical types of BNFC

```
type     definition
Ident    letter (letter | digit | '_' | '\'')*
Integer  digit+
Double   digit+ '.' digit+ ('e' '-'? digit+)?
String   '"' ((char - ["\"\\"]) | ('\\' ["\"\\nt"]))* '"'
Char     '\'' ((char - ["'\\"]) | ('\\' ["'\\nt"])) '\''
```

## Parsing

From token list to syntax tree.

Initially: to parse tree.

In practice, at the same time: to abstract syntax tree.

Sometimes even: to type-checked tree, to intermediate code... but we will postpone such semantic actions to syntax-directed translation (next lecture).

## How to implement parsing

Like lexing: use standard tools.

• CUP for Java
• Bison for C and C++
• (YACC = Yet Another Compiler Compiler - the classic for C)

Input to the tool: a BNF grammar (+ semantic actions).

Output of the tool: parser code in target programming language.

As with lexers, it is possible to write a parser by hand - but this is tedious and error-prone, and it is easy to end up with inefficiency and nontermination.

## Top-down (predictive, recursive descent) parsing

Structure: for each category, write a function that constructs a tree in a way depending on the first token.

Example grammar:

```
SIf.    Stm ::= "if" "(" Exp ")" Stm ;
SWhile. Stm ::= "while" "(" Exp ")" Stm ;
SExp.   Stm ::= Exp ;
EInt.   Exp ::= Integer ;
```

Two functions:

```
Stm pStm():
  next == "if"    -> ...  build tree with SIf
  next == "while" -> ...  build tree with SWhile
  next is integer -> ...  build tree with SExp

Exp pExp():
  next is integer k -> return EInt k
```

How do we fill the three dots?

## Recursive descent parsing, completed

Follow the structure of the grammar

```
SIf.    Stm ::= "if" "(" Exp ")" Stm ;
SWhile. Stm ::= "while" "(" Exp ")" Stm ;
SExp.   Stm ::= Exp ";" ;
```

We also need an auxiliary for just ignoring tokens, and a global variable `next` for the next token.

```
void ignore(Token t):
  check next == t
  shift next to the token after
```

There is a parsing function for each category, calling recursively other parsing functions and itself, as guided by the grammar.

```
Stm pStm():
  next == "if" ->
    ignore("if")
    ignore("(")
    Exp e = pExp()
    ignore(")")
    Stm s = pStm()
    return SIf(e,s)

  // next == "while" similar

  next is integer ->
    Exp e = pExp()
    ignore(";")
    return SExp(e)
```

## Implementing recursive descent parsing

The pseudocode on previous page is directly translatable to both imperative and functional code.

In Java, see Book 2.4.2, 4.4.1

In either case, this is a quick way to implement small parsers.

But it is quite limited, as we shall see.
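As a concrete sketch, here is one such translation to Haskell, for the grammar above with the `SExp` rule requiring a semicolon. The token and tree types are assumptions made for this example, and a real parser would report errors more gracefully:

```
-- Hand-written recursive descent parser for the small Stm/Exp grammar.
-- Token and tree types are invented for this sketch.
data Tok = TKw String | TInt Integer  deriving (Eq, Show)

data Stm = SIf Exp Stm | SWhile Exp Stm | SExp Exp  deriving (Eq, Show)
data Exp = EInt Integer                             deriving (Eq, Show)

-- Instead of a global variable 'next', each function returns
-- the tree together with the remaining tokens.
pStm :: [Tok] -> (Stm, [Tok])
pStm (TKw "if" : ts0) =
  let ts1      = ignore (TKw "(") ts0
      (e, ts2) = pExp ts1
      ts3      = ignore (TKw ")") ts2
      (s, ts4) = pStm ts3
  in  (SIf e s, ts4)
pStm (TKw "while" : ts0) =
  let ts1      = ignore (TKw "(") ts0
      (e, ts2) = pExp ts1
      ts3      = ignore (TKw ")") ts2
      (s, ts4) = pStm ts3
  in  (SWhile e s, ts4)
pStm ts0 =
  let (e, ts1) = pExp ts0
      ts2      = ignore (TKw ";") ts1
  in  (SExp e, ts2)

pExp :: [Tok] -> (Exp, [Tok])
pExp (TInt k : ts) = (EInt k, ts)
pExp ts            = error ("expected integer at " ++ show ts)

-- Check and drop an expected token, like ignore() in the pseudocode.
ignore :: Tok -> [Tok] -> [Tok]
ignore t (t' : ts) | t == t' = ts
ignore t ts = error ("expected " ++ show t ++ " at " ++ show ts)
```

Each branch of `pStm` mirrors one grammar rule, exactly as in the pseudocode.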

## Conflicts

Example:

```
SIf.     Stm ::= "if" "(" Exp ")" Stm
SIfElse. Stm ::= "if" "(" Exp ")" Stm "else" Stm
```

Which one should we choose when we see "if"?

This situation is called a conflict.

It can be solved by left factoring - sharing the common left part of the rules (Book 4.3.4):

```
SIE.   Stm  ::= "if" "(" Exp ")" Stm Rest
RElse. Rest ::= "else" Stm
REmp.  Rest ::=
```

To get the original abstract syntax, we need a function that depends on `Rest`.

```
f(SIE exp stm REmp)         = SIf     exp stm
f(SIE exp stm (RElse stm2)) = SIfElse exp stm stm2
```
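In Haskell, this translation function might be sketched as follows. All constructors here are assumptions made for the example: the left-factored form `SIE` is kept as an extra constructor of `Stm`, and `SSkip` stands in for any other statement:

```
-- Sketch: recovering the original abstract syntax after left factoring.
-- All types here are invented for illustration.
data Exp  = EDummy deriving (Eq, Show)          -- placeholder expressions
data Rest = REmp | RElse Stm deriving (Eq, Show)
data Stm  = SSkip                               -- some base statement
          | SIf Exp Stm | SIfElse Exp Stm Stm   -- original abstract syntax
          | SIE Exp Stm Rest                    -- left-factored form
          deriving (Eq, Show)

-- Translate the left-factored tree back to the original one.
f :: Stm -> Stm
f (SIE e s REmp)       = SIf e (f s)
f (SIE e s (RElse s2)) = SIfElse e (f s) (f s2)
f (SIf e s)            = SIf e (f s)
f (SIfElse e s1 s2)    = SIfElse e (f s1) (f s2)
f SSkip                = SSkip
```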

## Left recursion

What happens with the grammar

```
Exp ::= Exp "+" Integer
Exp ::= Integer
```

To build an `Exp`, the parser first tries to build an `Exp`, and so on: it never terminates.

Left recursion: the value category is itself the leftmost item in a production.

Solution: translate it away as follows (4.3.3)

```
Exp  ::= Integer Rest
Rest ::= "+" Integer Rest
Rest ::=
```

The category `Rest` now has right recursion, which is harmless.

Again, translation functions are needed to get the original abstract syntax.
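The transformed grammar can be implemented so that the original left-associative abstract syntax is recovered on the fly, by letting the function for `Rest` carry the tree built so far. A Haskell sketch, with token and tree types invented for the example:

```
-- Parser for the transformed grammar Exp ::= Integer Rest,
-- rebuilding the left-associative tree EAdd (EAdd ...) ... directly.
data Tok = TPlus | TInt Integer          deriving (Eq, Show)
data Exp = EInt Integer | EAdd Exp Exp   deriving (Eq, Show)

pExp :: [Tok] -> (Exp, [Tok])
pExp (TInt k : ts) = pRest (EInt k) ts
pExp ts            = error ("expected integer at " ++ show ts)

-- Rest ::= "+" Integer Rest | (empty)
-- The first argument accumulates the expression parsed so far.
pRest :: Exp -> [Tok] -> (Exp, [Tok])
pRest acc (TPlus : TInt k : ts) = pRest (EAdd acc (EInt k)) ts
pRest acc ts                    = (acc, ts)
```

Parsing `1 + 2 + 3` then yields `EAdd (EAdd (EInt 1) (EInt 2)) (EInt 3)`, as the original left-recursive grammar would.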

Notice: left recursion can be implicit (see Book, Algorithm 4.19)

```
Exp ::= Val "+" Integer
Val ::= Exp
```

## LL(1) tables

Used for guiding recursive descent parsers

• rows: categories
• columns: tokens
• cells: rules

Example

```
SIf.    Stm ::= "if" "(" Exp ")" Stm ;
SWhile. Stm ::= "while" "(" Exp ")" Stm ;
SExp.   Stm ::= Exp ";" ;
EInt.   Exp ::= Integer ;
```

```
      if    while   Int    (    )    ;    $ (END)
Stm   SIf   SWhile  SExp   -    -    -    -
Exp   -     -       EInt   -    -    -    -
```

LL(1) conflict: a cell contains more than one rule.

## Derivations

Leftmost derivation of `while(1) if (0) 6 ;`

```
Stm --> while ( Exp ) Stm
    --> while (   1 ) Stm
    --> while (   1 ) if ( Exp ) Stm
    --> while (   1 ) if (   0 ) Stm
    --> while (   1 ) if (   0 ) Exp ;
    --> while (   1 ) if (   0 )   6 ;
```

Rightmost derivation of `while(1) if (0) 6 ;`

```
Stm --> while ( Exp ) Stm
    --> while ( Exp ) if ( Exp ) Stm
    --> while ( Exp ) if ( Exp ) Exp ;
    --> while ( Exp ) if ( Exp )   6 ;
    --> while ( Exp ) if (   0 )   6 ;
    --> while (   1 ) if (   0 )   6 ;
```

## Top-down and bottom-up parsing

Top-down: try to "fill" the tree from root to the leaves.

Example: LL(k) - Left-to-right, Leftmost derivation, k item lookahead.

Task: to predict what production to use, after seeing k tokens

Bottom-up: build larger and larger trees by combining leaves.

Example: LR(k) - Left-to-right, Rightmost derivation, k item lookahead.

Task: decide what to do, based on the symbols seen so far and k tokens of lookahead.

## Actions and tables

Book 4.6.3

The parser reads its input, and builds a stack of results. After every symbol, it chooses among four actions:

• shift: read one more symbol
• reduce: pop elements from the stack and replace by a value
• accept: return the single value on the stack when no input is left
• reject: report failure, either because there is input left but no move to take, or because the input is finished but the stack does not contain exactly one value

State: position in a rule, marked by a dot, e.g.

```
Stm ::= "if" "(" . Exp ")" Stm
```

Actions are collected into a table. This table is similar to the transition function of an NFA: it has a column for each possible input symbol.

• rows: states
• columns: tokens
• cells: actions

## Example

Grammar

```
1. Exp  ::= Exp "+" Exp1
2. Exp  ::= Exp1
3. Exp1 ::= Exp1 "*" Integer
4. Exp1 ::= Integer
```

Parsing run

```
stack            input        action
                 1 + 2 * 3    shift
1                + 2 * 3      reduce 4
Exp1             + 2 * 3      reduce 2
Exp              + 2 * 3      shift
Exp +            2 * 3        shift
Exp + 2          * 3          reduce 4
Exp + Exp1       * 3          shift
Exp + Exp1 *     3            shift
Exp + Exp1 * 3                reduce 3
Exp + Exp1                    reduce 1
Exp                           accept
```
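The run above can be mimicked by a small hand-coded shift-reduce loop in Haskell. This is an illustration, not what a generated LALR parser looks like: the stack is a list with its top at the head, each reduce action is a pattern match, and a one-token lookahead on `*` plays the role of the precedence information in the parsing table. For brevity, the semantic value of each nonterminal is just the integer value of the expression:

```
-- Stack symbols: tokens and the nonterminals Exp, Exp1 with their values.
data Sym = SPlus | SStar | SInt Integer
         | SE Integer      -- Exp
         | SE1 Integer     -- Exp1
         deriving (Eq, Show)

-- Try to reduce the stack top, given one token of lookahead.
-- The stack top is the head, so patterns read the rules right to left.
tryReduce :: Maybe Sym -> [Sym] -> Maybe [Sym]
tryReduce _ (SInt n : SStar : SE1 m : r) = Just (SE1 (m * n) : r)  -- rule 3
tryReduce _ (SInt n : r)                 = Just (SE1 n : r)        -- rule 4
tryReduce la (SE1 n : SPlus : SE m : r)
  | la /= Just SStar                     = Just (SE (m + n) : r)   -- rule 1
tryReduce la [SE1 n]
  | la /= Just SStar                     = Just [SE n]             -- rule 2
tryReduce _ _                            = Nothing

-- Reduce while possible, otherwise shift; accept or reject at the end.
parse :: [Sym] -> Maybe Integer
parse = go []
  where
    go stack input = case tryReduce (headMay input) stack of
      Just stack' -> go stack' input            -- reduce
      Nothing     -> case input of
        t : ts -> go (t : stack) ts             -- shift
        []     -> case stack of
          [SE n] -> Just n                      -- accept
          _      -> Nothing                     -- reject
    headMay (x:_) = Just x
    headMay []    = Nothing
```

`parse [SInt 1, SPlus, SInt 2, SStar, SInt 3]` follows exactly the trace above and returns `Just 7`.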

## LR(k), SLR, LALR(1)

LR(0) is the simplest variant, but not very powerful.

LR(1) can produce very large tables.

LR(k) for k > 1 is too large in practice.

SLR (Simple LR): like LR(0), but a reduce action is performed only if the next token is in the follow set of the reduced category.

LALR(1): like LR(1), but merge states with identical items.

Expressivity:

• LR(0) < SLR < LALR(1) < LR(1) < LR(2) ...
• LL(k) < LR(k)
• none of these can be ambiguous

Standard tools (YACC, Bison, CUP, Happy) use LALR(1).

## Constructing an LALR(1) table

Simple example (the previous grammar; additional rule 0 for integer literals)

```
                               +    *    $    int  Integer  Exp1  Exp
3   Integer -> L_integ .       r0   r0   r0   -
4   Exp1 -> Integer .          r4   r4   r4   -
5   Exp1 -> Exp1 . '*' Integer -    s8   -    -
6   %start_pExp -> Exp .       s9   -    a    -
    Exp -> Exp . '+' Exp1
7   Exp -> Exp1 .              r2   s8   r2   -
    Exp1 -> Exp1 . '*' Integer
8   Exp1 -> Exp1 '*' . Integer -    -    -    s3   g11
9   Exp -> Exp '+' . Exp1      -    -    -    s3   g4       g10
10  Exp -> Exp '+' Exp1 .      r1   s8   r1
    Exp1 -> Exp1 . '*' Integer
11  Exp1 -> Exp1 '*' Integer . r3   r3   r3
```

## Conflicts

The parser has several possible actions to take.

Shift-reduce conflict: between shift and reduce action.

Reduce-reduce conflict: between two (or more) reduce actions.

The latter are more harmful, but also easier to eliminate.

## Reduce-reduce conflicts

Plain ambiguities:

```
EVar.   Exp ::= Ident ;
ECons.  Exp ::= Ident ;
```

Solution: wait until type checking to distinguish constants from variables.

Implicit ambiguities (Lab1 !):

```
SExp.   Stm  ::= Exp ;
SDecl.  Stm  ::= Decl ;
DTyp.   Decl ::= Typ ;
EId.    Exp  ::= Ident ;
TId.    Typ  ::= Ident ;
```

Solution: actually, `DTyp` is only valid in function parameter lists - not as statements!

## Shift-reduce conflicts

Classic example: dangling else

```
SIf.     Stm ::= "if" "(" Exp ")" Stm
SIfElse. Stm ::= "if" "(" Exp ")" Stm "else" Stm
```

What happens with this input at position (.)?

```
if (x > 0) if (y < 8) return y ; . else return x ;
```

Two possible outcomes:

```
shift:   if (x > 0) { if (y < 8) return y ;  else return x ;}
reduce:  if (x > 0) { if (y < 8) return y ;} else return x ;
```

Standard tools always choose shift rather than reduce.

This could be avoided by rewriting the grammar (Book 4.3.2). But usually the conflict is tolerated as a "well-understood" one.

## Parser tools

File formats with special syntax

• Happy for Haskell as host language
• Bison for C/C++
• CUP for Java

What the files look like:

• BNF rules with semantic actions in host language (next lecture)
```
Exp :: { Exp }
Exp : Exp '+' Exp1  { EAdd $1 $3 }
    | Exp1          { $1 }
```
• interface to lexer via special rules, e.g.
```
L_integ  { PT _ (TI $$) }
Integer :: { Integer } :
  L_integ  { (read $1) :: Integer }
```

## Info files

Shows the LALR(1) table in a human-readable form.

Create an info file:

```
happy -i ParCPP.y
```

Check which rules are overshadowed in conflicts:

```
grep "(reduce" ParConf.info
```

Interestingly, conflicts tend to concentrate on a few rules. If you have very many, do

```
grep "(reduce" ParConf.info | sort | uniq
```

The conflicts are (usually) the same in all tools.

Since the info file contains no Haskell, you can use Happy's info file even if you mainly work with another tool.

## Conflicts in C++ as shown in an info file

```
State 34

  Const -> Ident .                           (rule 123)
  Const -> Ident . '<' ListType '>'          (rule 124)

  '>>'   reduce using rule 123
  '<'    shift, and enter state 176
           (reduce using rule 123)

State 176

  Const -> Ident '<' . ListType '>'          (rule 124)

State 243

  Stm -> 'if' '(' Exp ')' Stm .              (rule 49)
  Stm -> 'if' '(' Exp ')' Stm . 'else' Stm   (rule 50)

  'do'     reduce using rule 49
  'else'   shift, and enter state 249
             (reduce using rule 49)

State 249

  Stm -> 'if' '(' Exp ')' Stm 'else' . Stm   (rule 50)
```

## Debug tools

Create a debugging Happy parser:

```
happy -da ParCPP.y
```

With Bison, you can use `gdb` (GNU Debugger), which traces back the execution to lines in the Bison source file.