Lecture 5: Theory of Lexing and Parsing
Programming Languages Course
Aarne Ranta (aarne@chalmers.se)

%!target:html
%!postproc(html): #NEW
%!postproc(html): #HR
Book: 3.6, 3.7, 3.9, 4.1, 4.2, 4.3

#NEW

==Overview of this lecture==

- Regular expressions and finite-state automata
- Nondeterministic and deterministic automata
- Reasoning about automata and expressions
- Regular and context-free languages
- Decidability of context-free parsing
- Limits of context-free grammars


#NEW

==The syntax and semantics of regular expressions (repetition)==

|| name | notation | semantics | verbally ||
| symbol | 'a' | {a} | the symbol 'a'
| sequence | A B | ``{xy | x : A, y : B}`` | A followed by B
| union | ``A | B`` | A U B | A or B
| closure | ``A*`` | ``{A^n | n = 0,1,2,..}`` | any number of A's
| empty | eps | {[]} | the empty string


This is BNFC notation. The semantics is a **regular language**, a set of strings of symbols. One often writes //L(E)// for the language denoted by the expression //E//.

There are variants of the notation, especially for symbols and empties. Quoting of symbols is often omitted in theoretical examples!

#NEW

==Finite-state automata==

A **Nondeterministic Finite-state Automaton** (NFA) is a tuple (Q,S,q0,F,d) where
- Q is a finite set of states
- S is a finite set of input symbols
- q0 : Q is the start state
- F < Q is the set of final states
- d : Q x S' -> P(Q) is the transition function (S' = S U {eps})


A **Deterministic Finite-state Automaton** (DFA) is like an NFA, except that the value of the transition function d is a single state in Q (instead of a subset of Q), and there are no **epsilon transitions** (transitions that don't consume a symbol). In other words,
- d : Q x S -> Q is the transition function


It is customary to draw diagrams with circles for states and arrows for the transition function: you can find them in the course book, and we will draw them on the blackboard.

#NEW

==Compiling regular expressions into finite-state automata==

1) **NFA generation**. Compile the expression into an NFA, using transitions labelled with the epsilon symbol (Book 3.7.4). This translation is **compositional**: simply traverse the syntax tree of the regular expression. The size of the automaton is linear in the size of the expression: O(|r|). But NFAs are, in comparison to DFAs,
- more difficult to implement
- less efficient to run: O(|r| * n), where n is the input length.


2) **Determinization**. Transform the NFA into a DFA (Book 3.7.1). Here you go through the states and, for each symbol, create a state collecting those states that the symbol can lead to (the subset construction). The DFA can still be bigger than necessary, i.e. have more states than needed.

3) **Minimization**. Merge equivalent states (Book 3.9.6), i.e. states from which the automaton recognizes the same set of strings.

The resulting DFA permits analysis that is linear in the length of the input string: O(n). But its size is still, at worst, exponential in the length of the regular expression: O(2^|r|).

#NEW

==From regular expressions to NFA's==

Book 3.7.4. A typical example of **compilation schemes** (the automata are drawn on the blackboard):

```
NFA(eps)   =
NFA('c')   =
NFA(A B)   =
NFA(A | B) =
NFA(A*)    =
```

Notice: the compilation would be much simpler if the operators were only between symbols.
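The schemes become concrete if we implement them. Below is a minimal Haskell sketch of this construction (Thompson's construction); the ``RegExp`` and ``NFA`` types, the state numbering, and the name ``nfa`` are assumptions of the sketch, not notation from the book:

```
-- Regular expressions as a syntax tree.
data RegExp = Eps | Sym Char | Seq RegExp RegExp
            | Alt RegExp RegExp | Star RegExp

-- An NFA with Int states; Nothing labels an epsilon transition.
-- One final state suffices for automata built by this construction.
data NFA = NFA
  { start :: Int
  , final :: Int
  , trans :: [(Int, Maybe Char, Int)]
  }

-- nfa r n compiles r using fresh states n, n+1, ...
-- and returns the automaton together with the next unused state.
nfa :: RegExp -> Int -> (NFA, Int)
nfa Eps     n = (NFA n (n+1) [(n, Nothing, n+1)], n+2)
nfa (Sym c) n = (NFA n (n+1) [(n, Just c, n+1)], n+2)
nfa (Seq a b) n =
  let (na, n1) = nfa a n
      (nb, n2) = nfa b n1
  in (NFA (start na) (final nb)
          (trans na ++ trans nb ++ [(final na, Nothing, start nb)]), n2)
nfa (Alt a b) n =                      -- new start n, new final n2
  let (na, n1) = nfa a (n+1)
      (nb, n2) = nfa b n1
  in (NFA n n2
          (trans na ++ trans nb ++
           [(n, Nothing, start na), (n, Nothing, start nb),
            (final na, Nothing, n2), (final nb, Nothing, n2)]), n2+1)
nfa (Star a) n =                       -- new start n, new final n1
  let (na, n1) = nfa a (n+1)
  in (NFA n n1
          (trans na ++
           [(n, Nothing, start na), (n, Nothing, n1),
            (final na, Nothing, start na), (final na, Nothing, n1)]), n1+1)
```

Each case adds a constant number of new states and transitions, which is exactly why the resulting NFA has size O(|r|).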
Example:
```
'a' 'b' | 'a' 'c'
```

#NEW

==From NFA's to DFA's==

Book 3.7.1. The subset construction, with epsilon closure.

A typical example of back-end optimization: of time.

Example (again):
```
'a' 'b' | 'a' 'c'
```

#NEW

==Minimizing DFA's==

Book 3.9.6. Merging equivalent states.

A further example of back-end optimization: of space.

Example (once more):
```
'a' 'b' | 'a' 'c'
```

#NEW

==Equivalence between regular expressions and finite-state automata==

It is conversely possible to express any finite-state automaton by a regular expression. A decompilation algorithm is given e.g. in Hopcroft & Ullman, //Introduction to Automata Theory, Languages, and Computation// (1979).

Often it is easier to construct the expression than the automaton, sometimes vice versa.

**Example** (easy with automata): any sequence of 0's and 1's with an even number of 0's. Construct an automaton with two states, corresponding to the number of 0's read so far being even or odd.

**Example** (easy with expressions): algebraic reasoning about, and simplification of, regular expressions, with laws such as
```
A eps = A
A | A = A
A**   = A*
```

#NEW

==Example of reasoning on automata==

(Reminder: language = set of strings.)

**Proposition**. The language ``{a^n b^n | n = 0,1,2,...}`` is not regular.

**Proof** (Book 4.2.7). We show that any automaton recognizing the language needs at least //n+1// states for each //n//. It follows that no finite number of states can suffice, since //n// can be arbitrary.

Consider the states ``s_0, s_1, ..., s_n`` reached after reading ``0, 1, ..., n`` symbols ``a``. Each of these states must have a different follow-up, and hence be a distinct state: if we had ``s_i = s_j`` for some ``i /= j``, then, since reading ``b^j`` from ``s_j`` leads to a final state, the automaton would also accept ``a^i b^j``, which is not in the language.

#NEW

==Example of exponential DFA size==

A sequence of //a//'s and //b//'s in which the (//n//+1)'st symbol from the end is //a//:
```
(a|b)* a (a|b)^n
```
The number of states in the minimal DFA is exponential in ``n``. Show this for n=1.

#NEW

==Limitations of regular languages==

Java, C, and Haskell are not regular languages: we cannot write their parsers by using regular expressions. This is because these languages have rules of the form ``a^n X b^n``, e.g. for matching left and right parentheses. A machine with finitely many states cannot remember that exactly //n// left parentheses have been read, if //n// is arbitrary (cf. the proposition on ``a^n b^n`` above).

#NEW

==Regular and context-free languages==

Every regular language is also a context-free language (i.e. can be described by a context-free grammar). The converse is not true. It does hold for grammars with no recursion, and also for grammars with just **left recursion**
```
C ::= C X     ===>   C = Y X*
C ::= Y
```
or **right recursion**
```
C ::= X C     ===>   C = X* Y
C ::= Y
```
But it does not hold for grammars with **middle recursion**:
```
C ::= X C Z   ===>   C = X^n Y Z^n
C ::= Y
```

#NEW

==The definition of context-free grammars (repetition)==

A context-free grammar is a quadruple (//T,N,S,R//) where
- //T// and //N// are disjoint sets, called **terminals** and **nonterminals**, respectively
- //S// is a nonterminal, the **start category**
- //R// is a finite set of **rules**
- a rule is a pair (//C//, //t_1//...//t_n//) where
  - //C// is a nonterminal
  - each //t_i// is a terminal or a nonterminal
  - n >= 0


#NEW

==The Chomsky normal form==

All rules have one of these forms:
```
C ::= A B   -- two nonterminals
C ::= t     -- one terminal
```
This means that all parse trees are binary trees.

All context-free grammars can be brought to this form. For instance,
```
Exp ::= Exp "+" Exp   ===>   Exp     ::= Exp PlusExp
                             PlusExp ::= Plus Exp
                             Plus    ::= "+"
```
Thus the Chomsky normal form creates new categories and increases the number of rules. It can be complicated to compute (exercise 4.4.8).

#NEW

==A proof of decidability==

Lemma: a grammar can only generate a finite number of token sequences of a given length.

Method for deciding whether a string is in the language:
+ given the string, generate all sequences of the same length as it
+ check if the given string is among the generated ones


The lemma is easy to prove for grammars in the Chomsky normal form.

#NEW

==Recognition and parsing for Chomsky normal form==

Recognition: just check if a string is grammatical in G.
```
seqs(C,1)   = {t | (C ::= t) : G}
seqs(C,n+1) = {x y | (C ::= A B) : G, k : {1,...,n},
                     x : seqs(A,k), y : seqs(B,n+1-k)}
```
Parsing: construct all possible parse trees in G.
```
trees(C,1)   = {(t, f) | (f. C ::= t) : G}
trees(C,n+1) = {(x y, f v w) | (f. C ::= A B) : G, k : {1,...,n},
                               (x,v) : trees(A,k), (y,w) : trees(B,n+1-k)}
```
This is easy to implement, but not very efficient.
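To make the recursion concrete, here is a minimal Haskell sketch of the recognition function; the representation of a Chomsky normal form grammar as two rule lists is an assumption of the sketch, not the book's code:

```
type Cat = String

-- A grammar in Chomsky normal form:
-- termRules for C ::= t, branchRules for C ::= A B.
data CNF = CNF
  { termRules   :: [(Cat, String)]
  , branchRules :: [(Cat, Cat, Cat)]
  }

-- seqs g c n: all token sequences of length n derivable from c,
-- following the recursion above (with n+1 written as n, and k < n).
seqs :: CNF -> Cat -> Int -> [[String]]
seqs g c 1 = [ [t] | (c', t) <- termRules g, c' == c ]
seqs g c n = [ x ++ y
             | (c', a, b) <- branchRules g, c' == c
             , k <- [1 .. n-1]
             , x <- seqs g a k
             , y <- seqs g b (n-k) ]

-- Recognition: the string is grammatical iff it occurs among the
-- generated sequences of its own length.
recognize :: CNF -> Cat -> [String] -> Bool
recognize g s ws = ws `elem` seqs g s (length ws)
```

Restricting the search to substrings of the given input, and tabulating the results by category and span, is what turns this exponential enumeration into the cubic CYK algorithm of the next section.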
#NEW

==Better algorithms for context-free grammars==

CYK: for grammars in the Chomsky normal form.

Earley algorithm, chart parsing: for arbitrary grammars.

GLR (Generalized LR): for arbitrary grammars.

The worst-case complexity of these algorithms is cubic in the length of the input (recall: LALR(1) is linear!).

#NEW

==The theoretical limits==

Parsing with context-free grammars is decidable; the worst-case time complexity is cubic.

Programming language implementations strive for linear-time parsing - this is possible for a limited class of grammars.

A simple language that is not context-free: the **copy language**
```
{x x | x : ('a' | 'b')*}
```
(proof omitted). Observe: this is not the same as
```
S ::= S S | "a" S | "b" S |
```
which generates //all/ sequences of "a"'s and "b"'s.

An instance of the copy language in programming languages: checking that a variable is declared before it is used.

#NEW

==The copy language in GF==

It is easy to define the copy language if constructors and linearization functions are separated (cf. Lecture 2). This is how it is written in [GF http://grammaticalframework.org]:
```
-- abstract syntax
cat S ; W ;
fun s : W -> S ;
fun e : W ;
fun a : W -> W ;
fun b : W -> W ;

-- concrete syntax (linearization)
lin s w = w ++ w ;
lin e = "" ;
lin a w = "a" ++ w ;
lin b w = "b" ++ w ;
```
For instance, ``abbababbab`` has the tree ``s (a (b (b (a (b e)))))``.

GF corresponds to a class of grammars known as **parallel multiple context-free grammars**, which is useful in natural language description. Its parsing complexity is polynomial.
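To check the example, the linearization can be transcribed into Haskell (a sketch with its own names, mirroring the GF code above; GF itself is not needed to run it):

```
-- The abstract syntax of words, mirroring cat W and fun e, a, b.
data W = E | A W | B W

-- Linearization of words, mirroring lin e, a, b.
linW :: W -> String
linW E     = ""
linW (A w) = 'a' : linW w
linW (B w) = 'b' : linW w

-- Linearization of sentences, mirroring lin s w = w ++ w.
linS :: W -> String
linS w = linW w ++ linW w

-- linS (A (B (B (A (B E))))) == "abbababbab",
-- the tree s (a (b (b (a (b e))))) from the text.
```

The doubling ``w ++ w`` in the linearization is exactly the feature that takes the language beyond context-free grammars.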