Lecture 4: Implementing Lexers and Parsers
Programming Languages Course
Aarne Ranta (aarne@chalmers.se)

%!target:html
%!postproc(html): #NEW
%!postproc(html): #HR
Book: 3.1, 3.3, 3.4, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9

#NEW

==Overview of this lecture==

- Implementing lexers by hand
- Regular expressions
- Lexer tools: Alex/JLex/Flex
- Implementing top-down (LL) parsers by hand
- LL parsing tables
- Bottom-up (LR) parsing methods
- LALR parsing tools
- Debugging and conflict solving


#NEW

==Before parsing==

Parsing is done in accordance with BNF rules such as
```
Stm ::= "if" "(" Exp ")" Stm "else" Stm ;
Exp ::= Exp "+" Exp ;
Exp ::= Ident ;
Exp ::= Integer ;
```
The terminals (in quotes) are **tokens**, and in the end all rules generate and recognize lists of tokens. Also ``Ident`` and ``Integer`` are special kinds of tokens.

The task of **lexical analysis** is to split the input string into tokens.

#NEW

==Spaces==

Tokens are sometimes separated by spaces (or newlines, or tabs), but not necessarily:
```
if (i>0) {i=i+27;} else {i++;}

if, (, i, >, 0, ), {, i, =, i, +, 27, ;, }, else, {, i, ++, ;, }
```
Good design: it should //always// be //legal// to have spaces, newlines, or tabs between tokens - even where they are not necessary.

**Quiz**: which programming languages violate this rule?

#NEW

==Lexical analysis in parser (don't do like this!)==

It //is// possible for the parser to read its input character by character; in that case, the grammar has to tell where spaces are possible or necessary.
```
Exp       ::= Exp Space "+" Space Exp ;
Space     ::= [SpaceChar] ;
SpaceChar ::= " " | "\n" | "\t" ;
```
But this really clutters the grammar!

Programming languages are usually designed in such a way that lexical analysis can be done before parsing, and parsing gets tokens as its input.

#NEW

==Classes of tokens==

The lexer //splits// the code into tokens. It also //classifies// each token. For example, the string
```
i2=i1+271
```
results in the following list of classified tokens:
```
identifier "i2"
symbol     "="
identifier "i1"
symbol     "+"
integer    "271"
```

#NEW

==The lexer program==

The lexer should read the source code character by character, and send tokens to the parser. After each token, it should use the next character ``c`` to decide what kind of token to read:
- if ``c`` is a digit, collect an integer as long as you read digits
- if ``c`` is a letter, collect an identifier as long as you read identifier characters (digits, letters, ``'``)
- if ``c`` is a double quote, collect a string literal as long as you read characters other than a double quote
- if ``c`` is a space character (i.e. space, newline, or tab), ignore it and read the next character


**Longest match**: read a token //as long as// you get characters that can belong to it.

#NEW

==Comments==

Since comments (``/* ... */``) are not saved by the lexer, it might seem easy to remove them in a separate, earlier pass. However, if we have string literals, we must leave comment signs inside them untouched:
```
"no comment /* here */ please!"
```
Thus comments, too, belong to the lexer:
- if you read ``/`` and the next character is ``*``, ignore characters until you encounter ``*`` immediately followed by ``/``


The lexer here needs a 1-character **lookahead**: it must look at one character beyond the one it is currently deciding on.

While lexing a string literal, the lexer knows that it is inside a string, and does not start reading a comment in the middle of it.

#NEW

==Reserved words and symbols==

Reserved symbols and words: the terminals in the BNF grammar. There is a finite number of reserved symbols and words (why?).

Reserved symbols: ``+ > ++ >> >=`` etc.

Longest match: if ``++`` is found, ``+`` is not returned.
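Here is a minimal Haskell sketch of longest match for reserved symbols: try longer symbols before shorter ones. The symbol list and the function name are made up for the illustration.
```
import Data.List (isPrefixOf, sortOn)
import Data.Ord (Down (..))

-- hypothetical symbol table, sorted by descending length,
-- so that e.g. "++" is tried before "+"
symbols :: [String]
symbols = sortOn (Down . length) ["++", ">=", ">>", "+", ">", "="]

-- return the longest reserved symbol that starts the input,
-- together with the rest of the input
matchSymbol :: String -> Maybe (String, String)
matchSymbol s = case [sym | sym <- symbols, sym `isPrefixOf` s] of
  sym : _ -> Just (sym, drop (length sym) s)
  []      -> Nothing
```
For example, ``matchSymbol "++i"`` yields ``Just ("++", "i")`` rather than stopping at ``"+"``.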
Longest match also explains a detail of C++ template syntax: a space is necessary in
```
vector<vector<int> >
```
since ``>>`` would otherwise be lexed as a single token.

Reserved words: usually similar to identifiers, but longer than the special symbols: ``if inline while`` etc.

#NEW

==Transition diagram for reading tokens and reserved words==

Book 3.4.1, 3.4.3

A picture will be shown

#NEW

==Implementing lexers (don't do like this!)==

Transition diagrams can be hand-coded by using ``case`` expressions. Book 3.4.4 gives a Java example; here is a Haskell one
```
import Data.Char (isAlpha, isDigit, isSpace)

data Tok = TS String | TI Integer | TId String
  deriving Show

lexCmm :: String -> [Tok]
lexCmm s = case s of
  c:cs   | isSpace c    -> lexCmm cs
  c:cs   | isAlpha c    -> getId s
  c:cs   | isDigit c    -> getInt s
  c:d:cs | isSymb [c,d] -> TS [c,d] : lexCmm cs
  c:cs   | isSymb [c]   -> TS [c]   : lexCmm cs
  _                     -> []  -- covers end of file and unknown characters
 where
  getId s  = lx i : lexCmm cs        where (i,cs) = span isIdChar s
  getInt s = TI (read i) : lexCmm cs where (i,cs) = span isDigit s
  isIdChar c = isAlpha c || isDigit c
  lx i = if isReservedWord i then TS i else TId i
  isSymb s = elem s $ words "++ -- == <= >= { } = , ; + * - ( ) < >"
  isReservedWord w = elem w $ words "else if int main printInt return while"
```
But the code is messy, and hard to read and maintain.

#NEW

==Regular expressions==

A good way of specifying and documenting the lexer is transition diagrams. More concisely, we can use **regular expressions**:
```
Tokens = Space (Token Space)*
Token  = TInt | TId | TKey | TSpec
TInt   = Digit Digit*
Digit  = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
TId    = Letter IdChar*
Letter = 'A' | ... | 'Z' | 'a' | ... | 'z'
IdChar = Letter | Digit
TKey   = 'i' 'f' | 'e' 'l' 's' 'e' | ...
TSpec  = '+' '+' | '+' | ...
Space  = (' ' | '\n' | '\t')*
```
What is more, these expressions can be compiled into a program that performs the lexing! The program implements a transition diagram - more precisely, a **finite state automaton**.

#NEW

==The syntax and semantics of regular expressions==

|| name | notation | semantics | verbally ||
| symbol | 'a' | {a} | the symbol 'a'
| sequence | A B | ``{xy | x : A, y : B}`` | A followed by B
| union | ``A | B`` | A U B | A or B
| closure | ``A*`` | ``{A^n | n = 0,1,2,..}`` | any number of A's
| empty | eps | {[]} | the empty string

This is BNFC notation. The semantics is a **regular language**: a set of strings of symbols. One often writes //L(E)// for the language denoted by the expression //E//.

There are variants of the notation, especially for symbols and empties. Quoting symbols is not so common, since it lengthens the expressions - but it is very good for clarity!

#NEW

==More notations for regular expressions==

|| name | notation | definition/semantics | verbally ||
| symbol range | ``['c'..'t']`` | ``'c' | ... | 't'`` | any symbol from 'c' to 't'
| symbol | ``char`` | ``{x | x is a character}`` | any character
| letter | ``letter`` | ``['A'..'Z'] | ['a'..'z']`` | any letter
| digit | ``digit`` | ``['0'..'9']`` | any digit
| option | ``A?`` | ``A | eps`` | optional A
| n-sequence | ``A^n`` | ``{x1...xn | xi : A}`` | n A's
| nonempty closure | ``A+`` | ``A A*`` | any positive number of A's
| difference | ``A - B`` | ``{x | x : A & not (x : B)}`` | the difference of A and B

The symbol range and n-sequence are not supported by BNFC. "Any character" in BNFC is ASCII character 0..255.

#NEW

==State-of-the-art lexer implementation==

Define each token type by using regular expressions.

Compile the regular expressions into code for a transition diagram (a finite-state automaton).

Compile the resulting code into a binary.
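As an illustration of the first step, here is roughly what such token definitions look like in Alex. This is only a sketch: the module name ``LexSketch`` and the token type are made up, not taken from the labs.
```
{
module LexSketch where
}

%wrapper "basic"

$digit  = 0-9
$letter = [a-zA-Z]

tokens :-
  $white+                       ;                       -- skip spaces, newlines, tabs
  $digit+                       { \s -> TI (read s) }   -- integer literals
  $letter [$letter $digit \_]*  { \s -> TId s }         -- identifiers
  "++" | "+"                    { \s -> TS s }          -- longest match picks "++"

{
data Tok = TI Integer | TId String | TS String
  deriving Show
}
```
Running ``alex`` on this file yields a Haskell module with a function ``alexScanTokens :: String -> [Tok]`` that implements the corresponding finite-state automaton with longest match.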
The whole pipeline, from a BNFC source file to a binary:
```
regexp in BNFC file
    |
    v  bnfc program
Flex/JLex/Alex file
    |
    v  flex/jlex/alex program
C/C++/Java/Haskell file
    |
    v  gcc/javac/ghc program
binary file
```

#NEW

==Pros and cons of hand-written lexer code==

**+** full control over the program (lookahead)

**+** usually a smaller program

**+** can be quick to create (if the lexer is simple)

**+** can be made more powerful than regular expressions

**-** difficult to get right (if the lexer is complex)

**-** much extra work to include position information etc.

**-** may compromise performance (lookahead)

**-** not self-documenting

**-** not portable

**-** lexer generators are state-of-the-art in compilers

**Thus the usual trade-offs between hand-written and generated code hold.** In this course, we use generated lexers in the labs.

#NEW

==A look at lexer generation tools==

You can take a look at the web pages, manuals, and bnfc output for
- Alex: Haskell
- JLex: Java
- Flex: C, C++


The [BNF Converter web page http://www.cs.chalmers.se/~markus/BNFC/] has links to all these tools.

#NEW

==Lexer definitions in BNFC==

Reserved words and symbols: the terminals in the rules.

Built-in types
```
Integer Double String Char
```
Comments
```
comment "/*" "*/" ;
comment "//" ;
```
Token
```
token Id (letter (letter | digit | '_')*) ;
```
Position token (for better error messages in later phases)
```
position token Id (letter (letter | digit | '_')*) ;
```

#NEW

==Built-in lexical types of BNFC==

|| type | definition ||
| ``Ident`` | ``letter (letter | digit | '_' | '\'')*``
| ``Integer`` | ``digit+``
| ``Double`` | ``digit+ '.' digit+ ('e' '-'? digit+)?``
| ``String`` | ``'"' ((char - ["\"\\"]) | ('\\' ["\"\\nt"]))* '"'``
| ``Char`` | ``'\'' ((char - ["'\\"]) | ('\\' ["'\\nt"])) '\''``

#NEW

==Parsing==

From token list to syntax tree.

Initially: to a parse tree. In practice, at the same time: to an abstract syntax tree.

Sometimes even: to a type-checked tree, to intermediate code... but we will postpone such **semantic actions** to **syntax-directed translation** (next lecture).

#NEW

==How to implement parsing==

Like lexing: use standard tools.
- Happy for Haskell
- CUP for Java
- Bison for C and C++
- (YACC = Yet Another Compiler Compiler - the classic for C)


Input to the tool: a BNF grammar (+ semantic actions).

Output of the tool: parser code in the target programming language.

As with lexers, it //is// possible to write a parser by hand - but this is tedious and error-prone. It is also easy to end up with inefficiency and nontermination.

#NEW

==Top-down (predictive, recursive descent) parsing==

Structure: for each category, write a function that constructs a tree, in a way that depends on the first token.

Example grammar:
```
SIf.    Stm ::= "if" "(" Exp ")" Stm ;
SWhile. Stm ::= "while" "(" Exp ")" Stm ;
SExp.   Stm ::= Exp ;
EInt.   Exp ::= Integer ;
```
Two functions:
```
Stm pStm():
  next == "if"    -> ... build tree with SIf
  next == "while" -> ... build tree with SWhile
  next is integer -> ... build tree with SExp

Exp pExp():
  next is integer k -> return EInt k
```
How do we fill the three dots?

#NEW

==Recursive descent parsing, completed==

Follow the structure of the grammar
```
SIf.    Stm ::= "if" "(" Exp ")" Stm ;
SWhile. Stm ::= "while" "(" Exp ")" Stm ;
SExp.   Stm ::= Exp ";" ;
```
We also need an auxiliary function for just ignoring tokens, and a global variable ``next`` for the next token.
```
void ignore(Token t):
  check next == t
  shift next to the token after
```
There is a parsing function for each category, calling recursively the other parsing functions and itself, as guided by the grammar.
```
Stm pStm():
  next == "if" ->
    ignore("if")
    ignore("(")
    Exp e = pExp()
    ignore(")")
    Stm s = pStm()
    return SIf(e,s)
  // next == "while": similar
  next is integer ->
    Exp e = pExp()
    ignore(";")
    return SExp(e)
```

#NEW

==Implementing recursive descent parsing==

The pseudocode on the previous page is directly translatable to both imperative and functional code.

In Java, see Book 2.4.2, 4.4.1.

In Haskell, you can use monads or parser combinators.

In either case, this is a quick way to implement small parsers. But it is quite limited, as we shall see.

#NEW

==Conflicts==

Example:
```
SIf.     Stm ::= "if" "(" Exp ")" Stm
SIfElse. Stm ::= "if" "(" Exp ")" Stm "else" Stm
```
Which one should we choose when we see ``"if"``?

This situation is called a **conflict**. It can be solved by **left factoring** - sharing the common left part of the rules (Book 4.3.4):
```
SIE.   Stm  ::= "if" "(" Exp ")" Stm Rest
RElse. Rest ::= "else" Stm
REmp.  Rest ::=
```
To get the original abstract syntax, we need a function that depends on ``Rest``.
```
f(SIE exp stm REmp)         = SIf exp stm
f(SIE exp stm (RElse stm2)) = SIfElse exp stm stm2
```

#NEW

==Left recursion==

What can we do with the grammar
```
Exp ::= Exp "+" Integer
Exp ::= Integer
```
To build an ``Exp``, the parser first tries to build an ``Exp``, and so on.

**Left recursion**: the value category is itself the leftmost item in a production.

Solution: translate it away as follows (4.3.3)
```
Exp  ::= Integer Rest
Rest ::= "+" Integer Rest
Rest ::=
```
The category ``Rest`` now has **right recursion**, which is harmless. Again, translation functions are needed to get the original abstract syntax.

Notice: left recursion can be **implicit** (see Book, Algorithm 4.19)
```
Exp ::= Val "+" Integer
Val ::= Exp
```

#NEW

==LL(1) tables==

Used for guiding recursive descent parsers
- rows: categories
- columns: tokens
- cells: rules


Example
```
SIf.    Stm ::= "if" "(" Exp ")" Stm ;
SWhile. Stm ::= "while" "(" Exp ")" Stm ;
SExp.   Stm ::= Exp ";" ;
EInt.   Exp ::= Integer ;
```

|| - | if | while | Int | ( | ) | ; | $ (END) ||
| Stm | SIf | SWhile | SExp | - | - | - | -
| Exp | - | - | EInt | - | - | - | -

LL(1) conflict: a cell contains more than one rule.

#NEW

==Derivations==

Leftmost derivation of ``while(1) if (0) 6 ;``
```
Stm --> while ( Exp ) Stm
    --> while ( 1 ) Stm
    --> while ( 1 ) if ( Exp ) Stm
    --> while ( 1 ) if ( 0 ) Stm
    --> while ( 1 ) if ( 0 ) Exp ;
    --> while ( 1 ) if ( 0 ) 6 ;
```
Rightmost derivation of ``while(1) if (0) 6 ;``
```
Stm --> while ( Exp ) Stm
    --> while ( Exp ) if ( Exp ) Stm
    --> while ( Exp ) if ( Exp ) Exp ;
    --> while ( Exp ) if ( Exp ) 6 ;
    --> while ( Exp ) if ( 0 ) 6 ;
    --> while ( 1 ) if ( 0 ) 6 ;
```

#NEW

==Top-down and bottom-up parsing==

**Top-down**: try to "fill" the tree from the root to the leaves. Example: LL(k) - Left-to-right, Leftmost derivation, k tokens lookahead. Task: to predict which production to use, after seeing k tokens.

**Bottom-up**: build larger and larger trees by combining leaves. Example: LR(k) - Left-to-right, Rightmost derivation, k tokens lookahead. Task: to decide what to do, from the tokens seen so far and k tokens after.
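To make the top-down method concrete, here is a minimal Haskell rendering of the recursive-descent pseudocode above. It is just a sketch: the token type and the crude ``error`` handling are assumptions made for the example.
```
data Tok = TIf | TWhile | TLPar | TRPar | TSemi | TInt Integer
  deriving (Eq, Show)

data Stm = SIf Exp Stm | SWhile Exp Stm | SExp Exp  deriving Show
data Exp = EInt Integer                             deriving Show

-- each function consumes a prefix of the token list and
-- returns a tree together with the remaining tokens
pStm :: [Tok] -> (Stm, [Tok])
pStm (TIf : ts) =
  let (e, ts1) = pExp (ignore TLPar ts)
      (s, ts2) = pStm (ignore TRPar ts1)
  in  (SIf e s, ts2)
pStm (TWhile : ts) =
  let (e, ts1) = pExp (ignore TLPar ts)
      (s, ts2) = pStm (ignore TRPar ts1)
  in  (SWhile e s, ts2)
pStm ts =
  let (e, ts1) = pExp ts
  in  (SExp e, ignore TSemi ts1)

pExp :: [Tok] -> (Exp, [Tok])
pExp (TInt k : ts) = (EInt k, ts)
pExp _             = error "expected integer"

-- check and skip an expected token
ignore :: Tok -> [Tok] -> [Tok]
ignore t (t' : ts) | t == t' = ts
ignore t _ = error ("expected " ++ show t)
```
For example, ``pStm [TIf, TLPar, TInt 1, TRPar, TInt 6, TSemi]`` returns ``(SIf (EInt 1) (SExp (EInt 6)), [])``.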
#NEW

==Actions and tables==

Book 4.6.3

The parser reads its input, and builds a stack of results. After every symbol, it chooses among the following actions:
- **shift**: read one more symbol
- **reduce**: pop elements from the stack and replace them by a value
- **accept**: return the single value on the stack when no input is left
- **goto**: jump to another state and act accordingly
- **reject**: report that there is input left but no move to take, or that the input is finished but the stack does not consist of a single value


State: position in a rule, marked by a dot, e.g.
```
Stm ::= "if" "(" . Exp ")" Stm
```
Actions are collected into a table, similar to the transition function of an NFA, with a column for each possible input symbol:
- rows: states
- columns: tokens
- cells: actions


#NEW

==Example==

Grammar
```
1. Exp  ::= Exp "+" Exp1
2. Exp  ::= Exp1
3. Exp1 ::= Exp1 "*" Integer
4. Exp1 ::= Integer
```
Parsing run

|| stack | input | action ||
|  | 1 + 2 * 3 | shift
| 1 | + 2 * 3 | reduce 4
| Exp1 | + 2 * 3 | reduce 2
| Exp | + 2 * 3 | shift
| Exp + | 2 * 3 | shift
| Exp + 2 | * 3 | reduce 4
| Exp + Exp1 | * 3 | shift
| Exp + Exp1 * | 3 | shift
| Exp + Exp1 * 3 |  | reduce 3
| Exp + Exp1 |  | reduce 1
| Exp |  | accept

#NEW

==LR(k), SLR, LALR(1)==

LR(0) is the simplest variant, but not very powerful.

LR(1) can produce very large tables.

LR(k) for k > 1 is too large in practice.

SLR (Simple LR): like LR(0), but reduce is conditioned on the follow set.

LALR(1): like LR(1), but merge states with identical items.

Expressivity:
- LR(0) < SLR < LALR(1) < LR(1) < LR(2) ...
- LL(k) < LR(k)
- none of these can handle ambiguous grammars


Standard tools (YACC, Bison, CUP, Happy) use LALR(1).

#NEW

==Constructing an LALR(1) table==

Simple example (the previous grammar; additional rule 0 for integer literals)
```
                                   +    *    $    int  Integer  Exp1  Exp

3   Integer -> L_integ .           r0   r0   r0   -
4   Exp1 -> Integer .              r4   r4   r4   -
5   Exp1 -> Exp1 . '*' Integer     -    s8   -    -
6   %start_pExp -> Exp .           s9   -    a    -
    Exp -> Exp . '+' Exp1
7   Exp -> Exp1 .                  r2   s8   r2   -
    Exp1 -> Exp1 . '*' Integer
8   Exp1 -> Exp1 '*' . Integer     -    -    -    s3   g11
9   Exp -> Exp '+' . Exp1          -    -    -    s3   g4       g10
10  Exp -> Exp '+' Exp1 .          r1   s8   r1
    Exp1 -> Exp1 . '*' Integer
11  Exp1 -> Exp1 '*' Integer .     r3   r3   r3
```

#NEW

==Conflicts==

The parser has several possible actions to take.

Shift-reduce conflict: between a shift and a reduce action.

Reduce-reduce conflict: between two (or more) reduce actions.

The latter are more harmful, but also easier to eliminate.

#NEW

==Reduce-reduce conflicts==

Plain ambiguities:
```
EVar.  Exp ::= Ident ;
ECons. Exp ::= Ident ;
```
Solution: wait until the type checker to distinguish constants from variables.

Implicit ambiguities (Lab1!):
```
SExp.  Stm  ::= Exp ;
SDecl. Stm  ::= Decl ;
DTyp.  Decl ::= Typ ;
EId.   Exp  ::= Ident ;
TId.   Typ  ::= Ident ;
```
Solution: actually, ``DTyp`` is only valid in function parameter lists - not as a statement!

#NEW

==Shift-reduce conflicts==

Classic example: dangling else
```
SIf.     Stm ::= "if" "(" Exp ")" Stm
SIfElse. Stm ::= "if" "(" Exp ")" Stm "else" Stm
```
What happens with this input at position (.)?
```
if (x > 0) if (y < 8) return y ; . else return x ;
```
Two possible outcomes:
```
shift:  if (x > 0) { if (y < 8) return y ; else return x ; }
reduce: if (x > 0) { if (y < 8) return y ; } else return x ;
```
Standard tools always choose shift rather than reduce. This could be avoided by rewriting the grammar (Book 4.3.2). But usually the conflict is tolerated as a "well-understood" one.
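To see shift and reduce in actual code, here is a toy Haskell engine for the example expression grammar above, with the table's decisions hard-coded as pattern matches and one token of lookahead. This is only an illustration of the actions - generated LALR parsers instead interpret a state table.
```
data Tok = TInt Integer | TPlus | TTimes
  deriving Show

data Exp  = EAdd Exp Exp1 | EExp1 Exp1        deriving Show
data Exp1 = EMul Exp1 Integer | EInt Integer  deriving Show

-- stack items; the top of the stack is the head of the list
data Item = ITok Tok | IExp Exp | IExp1 Exp1
  deriving Show

-- '*' in the lookahead blocks the reduces of rules 1 and 2
noStar :: [Tok] -> Bool
noStar (TTimes : _) = False
noStar _            = True

parse :: [Item] -> [Tok] -> Exp
-- reduce 3: Exp1 ::= Exp1 "*" Integer
parse (ITok (TInt n) : ITok TTimes : IExp1 e : st) ts = parse (IExp1 (EMul e n) : st) ts
-- reduce 4: Exp1 ::= Integer
parse (ITok (TInt n) : st) ts = parse (IExp1 (EInt n) : st) ts
-- reduce 1: Exp ::= Exp "+" Exp1 (unless '*' follows: then shift wins)
parse (IExp1 e1 : ITok TPlus : IExp e : st) ts
  | noStar ts = parse (IExp (EAdd e e1) : st) ts
-- reduce 2: Exp ::= Exp1 (again only if '*' does not follow)
parse (IExp1 e1 : st) ts
  | noStar ts = parse (IExp (EExp1 e1) : st) ts
-- accept: no input left, a single value on the stack
parse [IExp e] [] = e
-- shift: read one more token onto the stack
parse st (t : ts) = parse (ITok t : st) ts
-- reject
parse _ _ = error "parse error"
```
``parse [] [TInt 1, TPlus, TInt 2, TTimes, TInt 3]`` goes through exactly the shifts and reduces of the parsing run shown earlier.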
#NEW

==Parser tools==

File formats with a special syntax:
- Happy for Haskell as host language
- Bison for C/C++
- CUP for Java


What the files look like:
- BNF rules with **semantic actions** in the host language (next lecture)
```
Exp :: { Exp }
Exp : Exp '+' Exp1  { EAdd $1 $3 }
    | Exp1          { $1 }
```
- interface to the lexer via special rules, e.g.
```
L_integ { PT _ (TI $$) }

Integer :: { Integer }
Integer : L_integ { (read $1) :: Integer }
```

#NEW

==Info files==

An info file shows the LALR(1) table in a human-readable form.

Create an info file:
```
happy -i ParCPP.y
```
Check which rules are overshadowed in conflicts:
```
grep "(reduce" ParCPP.info
```
Interestingly, conflicts tend to concentrate on a few rules. If you have very many, do
```
grep "(reduce" ParCPP.info | sort | uniq
```
The conflicts are (usually) the same in all tools. Since the info file contains no Haskell, you can use Happy's info file even if you principally work with another tool.

#NEW

==Conflicts in C++ as shown in an info file==

```
State 34

    Const -> Ident .                            (rule 123)
    Const -> Ident . '<' ListType '>'           (rule 124)

    '>>'    reduce using rule 123
    '<'     shift, and enter state 176
            (reduce using rule 123)

State 176

    Const -> Ident '<' . ListType '>'           (rule 124)

State 243

    Stm -> 'if' '(' Exp ')' Stm .               (rule 49)
    Stm -> 'if' '(' Exp ')' Stm . 'else' Stm    (rule 50)

    'do'    reduce using rule 49
    'else'  shift, and enter state 249
            (reduce using rule 49)

State 249

    Stm -> 'if' '(' Exp ')' Stm 'else' . Stm    (rule 50)
```

#NEW

==Debug tools==

Create a debugging Happy parser:
```
happy -da ParCPP.y
```
With Bison, you can use ``gdb`` (the GNU Debugger), which traces the execution back to lines in the Bison source file.
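Finally, a sketch of how a generated parser is typically invoked from a main program. The module and function names below follow what BNFC's Haskell backend generates for a hypothetical grammar ``Cmm.cf`` (``myLexer``, ``pExp``, the ``Err`` monad), but the details vary between versions - treat them as assumptions.
```
import ParCmm (pExp, myLexer)  -- Happy-generated parser and lexer entry point
import ErrM   (Err (..))       -- BNFC's error monad (older versions)

main :: IO ()
main = do
  s <- getContents
  case pExp (myLexer s) of
    Ok  tree -> print tree
    Bad msg  -> putStrLn ("parse error: " ++ msg)
```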