Lecture 5: Theory of Lexing and Parsing
Programming Languages Course
Aarne Ranta (aarne@chalmers.se)
%!target:html
%!postproc(html): #NEW
%!postproc(html): #HR
Book: 3.6, 3.7, 3.9, 4.1, 4.2, 4.3
#NEW
==Overview of this lecture==
- Regular expressions and finite-state automata
- Nondeterministic and deterministic automata
- Reasoning about automata and expressions
- Regular and context-free languages
- Decidability of context-free parsing
- Limits of context-free grammars
#NEW
==The syntax and semantics of regular expressions (repetition)==
|| name | notation | semantics | verbally ||
| symbol | 'a' | {a} | the symbol 'a'
| sequence | A B | ``{xy | x : A, y : B}`` | A followed by B
| union | ``A | B`` | A U B | A or B
| closure | ``A*`` | ``{A^n | n = 0,1,2,..}`` | any number of A's
| empty | eps | {[]} | the empty string
This is BNFC notation.
The semantics is a **regular language**, a set of strings of symbols.
One often writes //L(E)// for the language denoted by the expression
//E//.
There are variants of the notation, especially for symbols and empties.
Quoting symbols is often omitted in theoretical examples!
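The table's semantics can be sketched directly in Python (an illustrative sketch; the function names are ad hoc): a regular language is a set of strings, and each construct denotes a set operation. Since ``A*`` denotes an infinite set, the sketch truncates it at ``k`` repetitions.

```python
def symbol(a):
    """'a'  ~>  {a}"""
    return {a}

def seq(A, B):
    """A B  ~>  {xy | x : A, y : B}"""
    return {x + y for x in A for y in B}

def union(A, B):
    """A | B  ~>  A U B"""
    return A | B

def eps():
    """eps  ~>  the set containing only the empty string"""
    return {""}

def closure(A, k):
    """A*, truncated at k repetitions: A^0 U A^1 U ... U A^k"""
    result, power = eps(), eps()
    for _ in range(k):
        power = seq(power, A)
        result |= power
    return result
```

For instance, ``seq(symbol('a'), union(symbol('b'), symbol('c')))`` computes the language {ab, ac}.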
#NEW
==Finite-state automata==
A **Nondeterministic Finite-state Automaton** (NFA) is a tuple
(Q,S,q0,F,d) where
- Q, a finite set of states
- S, a finite set of input symbols
- q0 : Q, the start state
- F < Q, the set of final states
- d : Q x S' -> P(Q), transition function (S' = S U {eps})
A **Deterministic Finite-state Automaton** (DFA) is like an NFA,
except that the value of the transition function d is a single state
in Q (instead of a subset of Q) and there are no **epsilon transitions**
(transitions that do not consume a symbol). In other words,
- d : Q x S -> Q, transition function
It is customary to draw diagrams with circles for states and arrows
for the transition function: you can find them in the course book,
and we will draw them on the blackboard.
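As a complement to the diagrams, a DFA can also be written down as data plus a run function; here is a minimal Python sketch (the dictionary encoding and names are my own, not from the book). A missing entry in the transition dictionary plays the role of an implicit dead state.

```python
def run_dfa(d, q0, final, inp):
    """Run the DFA from start state q0 over inp;
    accept iff the run ends in a final state."""
    q = q0
    for c in inp:
        if (q, c) not in d:
            return False          # no transition: implicit dead state
        q = d[(q, c)]
    return q in final

# Example: a three-state DFA accepting exactly the language {ab, ac}
d = {('q0', 'a'): 'q1',
     ('q1', 'b'): 'q2',
     ('q1', 'c'): 'q2'}
```

Here ``run_dfa(d, 'q0', {'q2'}, 'ab')`` accepts, while ``'abc'`` is rejected.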
#NEW
==Compiling regular expressions into finite-state automata==
1) **NFA generation**.
Compile the expression into an NFA, using transitions with the
epsilon symbol eps (Book 3.7.4). This translation is
**compositional**: simply traverse the syntax tree of the
regular expression. The size of the automaton is linear in the
size of the expression: O(|r|).
But NFAs are, in comparison to DFAs,
- more difficult to implement
- less efficient to run: O(|r| * n), where n is the input length.
2) **Determinization**. Transform the NFA into a DFA. (Book 3.7.1).
Here you go through the states and, for each symbol, create a new state
collecting all the NFA states that the symbol can lead to
(the **subset construction**).
The DFA can still be bigger than necessary, i.e. have more states
than needed.
3) **Minimization**.
Merge equivalent states (Book 3.9.6): states from which the
automaton accepts exactly the same set of strings.
The resulting DFA permits analysis that is linear in the length of
the input string: O(n). But its size can still, in the worst case, be
exponential in the size of the regular expression: O(2^|r|).
#NEW
==From regular expressions to NFA's==
Book 3.7.4.
A typical example of **compilation schemes**:
```
NFA(eps) =
NFA('c') =
NFA(A B) =
NFA(A | B) =
NFA(A*) =
```
Notice: the compilation would be much simpler if the operators
occurred only between single symbols.
Example:
```
'a' 'b' | 'a' 'c'
```
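The schemes can be sketched in Python as Thompson's construction (in the spirit of Book 3.7.4; the tuple encoding of automata and the function names are my own). Each case builds an NFA with one start and one final state, using EPS-labelled transitions:

```python
EPS = None          # label marking epsilon transitions
counter = 0         # source of fresh state names

def fresh():
    global counter
    counter += 1
    return counter

def nfa_eps():
    """NFA(eps): one epsilon transition from start to final."""
    s, f = fresh(), fresh()
    return (s, f, [(s, EPS, f)])

def nfa_sym(c):
    """NFA('c'): one transition consuming the symbol c."""
    s, f = fresh(), fresh()
    return (s, f, [(s, c, f)])

def nfa_seq(A, B):
    """NFA(A B): link A's final state to B's start state."""
    sa, fa, ta = A
    sb, fb, tb = B
    return (sa, fb, ta + tb + [(fa, EPS, sb)])

def nfa_union(A, B):
    """NFA(A | B): new start and final states branching to A and B."""
    sa, fa, ta = A
    sb, fb, tb = B
    s, f = fresh(), fresh()
    return (s, f, ta + tb + [(s, EPS, sa), (s, EPS, sb),
                             (fa, EPS, f), (fb, EPS, f)])

def nfa_star(A):
    """NFA(A*): loop back from A's final state, allow skipping A."""
    sa, fa, ta = A
    s, f = fresh(), fresh()
    return (s, f, ta + [(s, EPS, sa), (s, EPS, f),
                        (fa, EPS, sa), (fa, EPS, f)])

def accepts(nfa, inp):
    """Simulate the NFA, tracking the set of reachable states."""
    start, final, trans = nfa
    def eclose(states):
        states, changed = set(states), True
        while changed:
            changed = False
            for (p, c, q) in trans:
                if c is EPS and p in states and q not in states:
                    states.add(q)
                    changed = True
        return states
    current = eclose({start})
    for c in inp:
        current = eclose({q for (p, l, q) in trans
                          if l == c and p in current})
    return final in current
```

For the example, ``nfa_union(nfa_seq(nfa_sym('a'), nfa_sym('b')), nfa_seq(nfa_sym('a'), nfa_sym('c')))`` accepts exactly ``ab`` and ``ac``.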
#NEW
==From NFA's to DFA's==
Book 3.7.1
Subset construction with epsilon closure
A typical example of back-end optimization: saving time.
Example (again):
```
'a' 'b' | 'a' 'c'
```
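The subset construction can be sketched in Python on a hand-built NFA for the example (the state numbering and names are my own assumptions):

```python
EPS = None
# NFA for 'a' 'b' | 'a' 'c': start 0 branches by epsilon to the
# two sequences 1 -a-> 2 -b-> 6 and 3 -a-> 4 -c-> 6; final state 6
nfa_trans = [(0, EPS, 1), (0, EPS, 3),
             (1, 'a', 2), (2, 'b', 6),
             (3, 'a', 4), (4, 'c', 6)]
nfa_start, nfa_final = 0, 6

def eclose(states):
    """Close a set of NFA states under epsilon transitions."""
    states, changed = set(states), True
    while changed:
        changed = False
        for (p, c, q) in nfa_trans:
            if c is EPS and p in states and q not in states:
                states.add(q)
                changed = True
    return frozenset(states)

def determinize(symbols):
    """Subset construction: each DFA state is a set of NFA states."""
    start = eclose({nfa_start})
    dfa, seen, todo = {}, {start}, [start]
    while todo:
        S = todo.pop()
        for c in symbols:
            T = eclose({q for (p, l, q) in nfa_trans
                        if l == c and p in S})
            if T:
                dfa[(S, c)] = T
                if T not in seen:
                    seen.add(T)
                    todo.append(T)
    return start, dfa, {S for S in seen if nfa_final in S}
```

The resulting DFA has three states, {0,1,3}, {2,4} and {6}: the nondeterministic choice on ``a`` has disappeared.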
#NEW
==Minimizing DFA's==
Book 3.9.6
Merging equivalent states
A further example of back-end optimization: saving space.
Example (once more):
```
'a' 'b' | 'a' 'c'
```
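Minimization, too, can be sketched in Python (naive partition refinement, not the faster algorithm of the book; the example DFA is hand-built). The DFA below for the example has two final states 2 and 3, which accept the same continuations and therefore get merged:

```python
# Hand-built (unminimized) DFA for 'a' 'b' | 'a' 'c'
dfa = {(0, 'a'): 1, (1, 'b'): 2, (1, 'c'): 3}
states, finals, symbols = {0, 1, 2, 3}, {2, 3}, 'abc'

def minimize():
    """Refine the partition {final, non-final} until, within each
    block, all states agree on the blocks their transitions reach."""
    def block(partition, q):
        return next(B for B in partition if q in B)
    partition = [frozenset(finals), frozenset(states - finals)]
    changed = True
    while changed:
        changed = False
        for B in partition:
            def sig(q):
                # where each symbol leads, expressed as blocks
                return tuple(block(partition, dfa[(q, c)])
                             if (q, c) in dfa else None
                             for c in symbols)
            groups = {}
            for q in B:
                groups.setdefault(sig(q), set()).add(q)
            if len(groups) > 1:
                partition.remove(B)
                partition.extend(frozenset(g) for g in groups.values())
                changed = True
                break
    return partition
```

``minimize()`` returns three blocks: {0}, {1} and the merged {2, 3}.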
#NEW
==Equivalence between regular expressions and finite-state automata==
It is conversely possible to express any finite-state automaton by
a regular expression. A decompilation algorithm is given e.g.
in Hopcroft & Ullman, //Introduction to Automata Theory, Languages,
and Computation// (1979).
Often it is easier to construct the expression than the automaton,
sometimes vice versa.
**Example** (easy with automata):
any sequence of 0's and 1's with an even number of 0's.
Construct an automaton with two states, corresponding to the number of 0's
being even or odd.
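This automaton can be written out as a small Python sketch (the state names are mine): the current state records the parity of the 0's read so far, '0' flips it, and '1' leaves it unchanged.

```python
def even_zeros(s):
    """Accept strings of 0's and 1's with an even number of 0's."""
    state = 'even'                 # start state; also the final state
    for c in s:
        if c == '0':
            state = 'odd' if state == 'even' else 'even'
    return state == 'even'
```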
**Example** (easy with expressions):
algebraic reasoning (and simplification) of regular expressions,
with laws such as
```
A eps = A
A | A = A
A** = A*
```
#NEW
==Example of reasoning on automata==
(Reminder: language = set of strings.)
**Proposition**. The language ```{ a^n b^n | n = 0,1,2,...}```
is not regular.
**Proof** (Book 4.2.7). We show that any automaton accepting the strings
```a^n b^n``` needs at least //n+1// states. It follows that
no automaton with finitely many states can accept the language where
//n// can be arbitrary.
Consider the states ``s_0, s_1, ..., s_n`` reached after reading
each number of ``a``'s from 0 to //n//. Each of these states must have
a different follow-up, and hence be a distinct state: if we had
``s_i = s_j`` for distinct //i// and //j//, then the automaton would
accept ``a^i b^j``, which is not in the language.
#NEW
==Example of exponential DFA size==
A sequence of //a//'s and //b//'s where the n'th last symbol is //a//.
```
(a|b)* a (a|b)^(n-1)
```
The number of states in the minimal DFA is exponential in ``n``:
the automaton must remember the last //n// symbols read, which
requires 2^n states.
Show this for n = 2.
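The blow-up can be checked by brute force with a sketch of the Myhill-Nerode argument (the helper names are mine): a DFA must send distinguishable prefixes, i.e. prefixes with different sets of accepting extensions, to distinct states, so counting distinguishable prefixes bounds the number of states from below.

```python
from itertools import product

def in_lang(s, n):
    """The n'th last symbol of s is 'a'."""
    return len(s) >= n and s[-n] == 'a'

def num_states(n, depth=6):
    """Count prefixes (up to length depth) that are pairwise
    distinguishable by extensions up to length depth: a lower
    bound on the number of DFA states, exact for small n."""
    strings = [''.join(p) for k in range(depth + 1)
               for p in product('ab', repeat=k)]
    signatures = {tuple(in_lang(p + e, n) for e in strings)
                  for p in strings}
    return len(signatures)
```

``num_states(n)`` gives 2, 4, 8 for n = 1, 2, 3: the automaton must distinguish the last ``n`` symbols read.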
#NEW
==Limitations of regular languages==
Java, C, and Haskell are not regular languages.
We cannot write their parser by using regular expressions.
This is because the languages have rules of the form ``a^n X b^n``
e.g. for matching left and right parentheses.
A machine with finitely many states cannot remember that exactly
//n// left parentheses have been read, if //n// is arbitrary
(cf. the proposition on ``a^n b^n`` above).
#NEW
==Regular and context-free languages==
Every regular language is also a context-free language (i.e. can
be described by a context-free grammar).
The converse is not true. It does hold for grammars with no recursion, and
also for grammars with just **left recursion**
```
C ::= C X ===> C = Y X*
C ::= Y
```
or **right recursion**
```
C ::= X C ===> C = X* Y
C ::= Y
```
But it does not hold for grammars with **middle recursion**:
```
C ::= X C Z ===> C = X^n Y Z^n
C ::= Y
```
#NEW
==The definition of context-free grammars (repetition)==
A context-free grammar is a quadruple (//T,N,S,R//) where
- //T// and //N// are disjoint sets, called **terminals** and
**nonterminals**, respectively
- //S// is a nonterminal, the **start category**
- //R// is a finite set of **rules**
- a rule is a pair (//C//, //t_1//...//t_n//) where
- //C// is a nonterminal
- each //t_i// is a terminal or a nonterminal
- n >= 0
#NEW
==The Chomsky normal form==
All rules have one of these forms:
```
C ::= A B -- two non-terminals
C ::= t -- one terminal
```
This means that all parse trees are binary trees.
All context-free grammars can be brought to this form. For instance,
```
Exp ::= Exp "+" Exp ===> Exp ::= Exp PlusExp
PlusExp ::= Plus Exp
Plus ::= "+"
```
Thus the Chomsky normal form creates new categories and increases the number of rules.
It can be complicated to compute (exercise 4.4.8).
#NEW
==A proof of decidability==
Lemma: a grammar can generate only a finite number of
token sequences of any given length.
Method:
+ given a string, generate all token sequences of the same length
+ check whether the given string is among the generated ones
The lemma is easy to prove for grammars in the
Chomsky normal form.
#NEW
==Recognition and parsing for Chomsky normal form==
Recognition: just check if a string is grammatical in G.
```
seqs(C,1) = {t | (C ::= t) : G}
seqs(C,n+1) = {x y |
(C ::= A B) : G,
k : {1,...,n},
x : seqs(A,k), y : seqs(B,n+1-k)}
```
Parsing: construct all possible parse trees in G.
```
trees(C,1) = {(t,f) | (f. C ::= t) : G}
trees(C,n+1) = {(x y, f v w) |
(f. C ::= A B) : G,
k : {1,...,n},
(x,v) : trees(A,k), (y,w) : trees(B,n+1-k)}
```
This is easy to implement, but not very efficient.
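The equations translate almost literally into Python (an illustrative sketch; the grammar encoding is my own). The example grammar is in Chomsky normal form and generates the non-regular language ``a^n b^n`` (n >= 1):

```python
# CNF grammar for { a^n b^n | n >= 1 }:
#   S ::= A T | A B ;  T ::= S B ;  A ::= 'a' ;  B ::= 'b'
grammar = {
    'S': [('A', 'T'), ('A', 'B')],
    'T': [('S', 'B')],
    'A': ['a'],
    'B': ['b'],
}

def seqs(C, n):
    """All token sequences of length n derivable from nonterminal C."""
    if n == 1:
        return {t for t in grammar.get(C, []) if isinstance(t, str)}
    result = set()
    for rule in grammar.get(C, []):
        if isinstance(rule, tuple):
            A, B = rule
            for k in range(1, n):
                for x in seqs(A, k):
                    for y in seqs(B, n - k):
                        result.add(x + y)
    return result

def recognize(C, s):
    """Check whether s is derivable from C: s in seqs(C, |s|)."""
    return len(s) > 0 and s in seqs(C, len(s))
```

The recursion recomputes ``seqs`` for the same arguments over and over: this is exactly the inefficiency that tabulating algorithms such as CYK remove.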
#NEW
==Better algorithms for context-free grammars==
- CYK: for Chomsky normal form
- Earley algorithm, chart parsing: for arbitrary grammars
- GLR (Generalized LR): for arbitrary grammars
The worst-case complexity of these algorithms is cubic in the length of the input
(recall: LALR(1) is linear!)
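CYK can be sketched in Python (the table encoding is my own): ``T[(i, l)]`` collects the nonterminals deriving the substring of length ``l`` starting at position ``i``, filled bottom-up so that nothing is recomputed.

```python
# CNF grammar for { a^n b^n | n >= 1 }:
#   S ::= A T | A B ;  T ::= S B ;  A ::= 'a' ;  B ::= 'b'
grammar = {
    'S': [('A', 'T'), ('A', 'B')],
    'T': [('S', 'B')],
    'A': ['a'],
    'B': ['b'],
}

def cyk(s, start='S'):
    """Recognize s bottom-up; O(n^3) in the length of s."""
    n = len(s)
    if n == 0:
        return False
    # substrings of length 1: terminal rules
    T = {(i, 1): {C for C, rules in grammar.items() if s[i] in rules}
         for i in range(n)}
    for l in range(2, n + 1):
        for i in range(n - l + 1):
            T[(i, l)] = set()
            for k in range(1, l):              # split point
                for C, rules in grammar.items():
                    for rule in rules:
                        if (isinstance(rule, tuple)
                                and rule[0] in T[(i, k)]
                                and rule[1] in T[(i + k, l - k)]):
                            T[(i, l)].add(C)
    return start in T[(0, n)]
```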
#NEW
==The theoretical limits==
Parsing with context-free grammars is decidable; the worst-case
time complexity is cubic.
Programming language implementations strive for linear-time
parsing - this is possible for a limited class of grammars.
A simple language that is not context-free: the **copy language**
```
{x x | x : ('a' | 'b')*}
```
(proof omitted). Observe: this is not the same as
```
S ::= S S | "a" S | "b" S |
```
since this grammar generates //all// strings over ``a`` and ``b``.
An instance of the copy pattern in programming languages: checking that
a variable is declared before it is used (two occurrences of the same
identifier).
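The difference can be demonstrated by brute force in Python (the helper names are mine): the grammar derives every string over ``a`` and ``b``, since the rules ``"a" S`` and ``"b" S`` plus the empty production build any string symbol by symbol, while the copy language requires the two halves to be equal.

```python
from itertools import product

def in_copy_language(s):
    """s = x x for some x: even length, equal halves."""
    h = len(s) // 2
    return len(s) % 2 == 0 and s[:h] == s[h:]

def in_grammar(s):
    """S ::= S S | "a" S | "b" S | (empty) derives every
    string over {a, b}, one symbol at a time."""
    return all(c in 'ab' for c in s)

# e.g. S => a S => a b S => a b derives 'ab', which is not a copy
```

Of the 16 strings of length 4, only 4 are copies.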
#NEW
==The copy language in GF==
It is easy to define the copy language if constructors and linearization
functions are separated (cf. Lecture 2). This is how it is written in
[GF http://grammaticalframework.org]:
```
-- abstract syntax
cat S ; W ;
fun s : W -> S ;
fun e : W ;
fun a : W -> W ;
fun b : W -> W ;
-- concrete syntax
lin s w = w ++ w ;
lin e = "" ;
lin a w = "a" ++ w ;
lin b w = "b" ++ w ;
```
For instance, ``abbababbab`` has the tree ``s (a (b (b (a (b e)))))``.
GF corresponds to a class of grammars known as
**parallel multiple context-free grammars**, which is useful in
natural language description. Its parsing complexity is polynomial.
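The linearization can be mimicked in Python (a sketch: trees as nested tuples, and GF's token concatenation ``++`` simplified to string concatenation, since every token here is a single symbol):

```python
def lin(tree):
    """Linearize a tree of the copy-language grammar."""
    if tree == 'e':
        return ''                      # lin e = ""
    op, w = tree
    if op == 's':
        return lin(w) + lin(w)         # lin s w = w ++ w
    return op + lin(w)                 # lin a w = "a" ++ w; likewise b

tree = ('s', ('a', ('b', ('b', ('a', ('b', 'e'))))))
```

``lin(tree)`` yields the string of the example above.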