Lecture 5: Theory of Lexing and Parsing

Programming Languages Course
Aarne Ranta (aarne@chalmers.se)

Book: 3.6, 3.7, 3.9, 4.1, 4.2, 4.3

Overview of this lecture

Regular expressions and finite-state automata

Nondeterministic and deterministic automata

Reasoning about automata and expressions

Regular and context-free languages

Decidability of context-free parsing

Limits of context-free grammars

The syntax and semantics of regular expressions (repetition)

    name      notation  semantics                verbally
    symbol    'a'       {a}                      the symbol 'a'
    sequence  A B       {xy | x : A, y : B}      A followed by B
    union     A | B     A U B                    A or B
    closure   A*        {A^n | n = 0,1,2,...}    any number of A's
    empty     eps       {[]}                     the empty string

This is BNFC notation.

The semantics is a regular language, a set of strings of symbols. One often writes L(E) for the language denoted by the expression E.
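The semantics can be made concrete with a small sketch that enumerates the strings of L(E) up to a length bound. The tuple encoding of expressions and the name `lang` are assumptions made for this illustration; they are not part of BNFC.

```python
# Regular expressions as nested tuples (a hypothetical encoding):
#   ("sym", c), ("seq", A, B), ("alt", A, B), ("star", A), ("eps",)

def lang(e, max_len):
    """Return the strings of L(e) whose length is at most max_len (max_len >= 0)."""
    tag = e[0]
    if tag == "eps":
        return {""}
    if tag == "sym":
        return {e[1]} if max_len >= 1 else set()
    if tag == "alt":
        return lang(e[1], max_len) | lang(e[2], max_len)
    if tag == "seq":
        return {x + y
                for x in lang(e[1], max_len)
                for y in lang(e[2], max_len - len(x))}
    if tag == "star":
        # Start from the empty string and keep appending nonempty strings of A
        # until nothing new fits within the length bound.
        base = lang(e[1], max_len) - {""}
        result, frontier = {""}, {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown expression: {e!r}")

a, b, c = ("sym", "a"), ("sym", "b"), ("sym", "c")
example = ("alt", ("seq", a, b), ("seq", a, c))   # 'a' 'b' | 'a' 'c'
```

For instance, `lang(example, 5)` contains exactly the two strings ab and ac, while `lang(("star", a), 3)` contains the empty string, a, aa and aaa.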

There are variants of the notation, especially for symbols and empties.

Quoting symbols is often omitted in theoretical examples!

Finite-state automata

A Nondeterministic Finite-state Automaton (NFA) is a tuple (Q,S,q0,F,d) where

Q, a finite set of states
S, a finite set of input symbols
q0 : Q, the start state
F < Q, the set of final states
d : Q x S' -> P(Q), transition function (S' = S U {eps})

A Deterministic Finite-state Automaton (DFA) is like an NFA, except that the value of the transition function d is a single state in Q (instead of a subset of Q) and there are no epsilon transitions (transitions that don't consume a symbol). In other words,

d : Q x S -> Q, transition function

It is customary to draw diagrams with circles for states and arrows for the transition function: you can find them in the course book, and we will draw them on the blackboard.

Compiling regular expressions into finite-state automata

1) NFA generation. Compile the expression into an NFA, using transitions labelled with the epsilon symbol eps (Book 3.7.4). This translation is compositional: simply traverse the syntax tree of the regular expression. The size of the automaton is linear in the size of the expression: O(|r|).

But NFAs are, in comparison to DFAs,

- more difficult to implement
- less efficient to run: O(|r| * n), where n is the input length.

2) Determinization. Transform the NFA into a DFA. (Book 3.7.1). Here you go through states, and for each symbol create a state collecting those states that the symbol can lead to (subset construction).

The DFA can still be bigger than necessary, i.e. have more states than needed.

3) Minimization. Merge equivalent states (Book 3.9.6): states beginning from which the automaton recognizes the same set of strings.

The resulting DFA permits analysis that is linear in the length of the input string: O(n). But its size can, at worst, be exponential in the size of the regular expression: O(2^|r|).

From regular expressions to NFA's

Book 3.7.4.

A typical example of compilation schemes:

    NFA(eps) =
  
    NFA('c') =
  
    NFA(A B) = 
  
    NFA(A | B) = 
  
    NFA(A*) =

Notice: the compilation would be much simpler if the operators were applied only to single symbols rather than to arbitrary subexpressions.

Example:

    'a' 'b' | 'a' 'c'
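The compositional translation can be sketched as follows. This is one common variant (in the style of Thompson's construction), where every subexpression gets a fresh start and final state glued together with epsilon transitions; the exact shape of the fragments in the book may differ. The tuple encoding of expressions and the names `build`, `eps_closure` and `accepts` are assumptions of this sketch.

```python
from itertools import count

EPS = None  # label used for epsilon transitions in this sketch

def build(e, trans, fresh):
    """Append an NFA fragment for e to trans; return its (start, final) states."""
    start, final = next(fresh), next(fresh)
    tag = e[0]
    if tag == "eps":
        trans.append((start, EPS, final))
    elif tag == "sym":
        trans.append((start, e[1], final))
    elif tag == "seq":
        s1, f1 = build(e[1], trans, fresh)
        s2, f2 = build(e[2], trans, fresh)
        trans += [(start, EPS, s1), (f1, EPS, s2), (f2, EPS, final)]
    elif tag == "alt":
        s1, f1 = build(e[1], trans, fresh)
        s2, f2 = build(e[2], trans, fresh)
        trans += [(start, EPS, s1), (start, EPS, s2),
                  (f1, EPS, final), (f2, EPS, final)]
    elif tag == "star":
        s1, f1 = build(e[1], trans, fresh)
        trans += [(start, EPS, s1), (f1, EPS, s1),
                  (f1, EPS, final), (start, EPS, final)]
    else:
        raise ValueError(f"unknown expression: {e!r}")
    return start, final

def eps_closure(states, trans):
    """All states reachable from the given ones by epsilon transitions only."""
    closure, todo = set(states), list(states)
    while todo:
        q = todo.pop()
        for (p, label, r) in trans:
            if p == q and label is EPS and r not in closure:
                closure.add(r)
                todo.append(r)
    return closure

def accepts(e, word):
    """Compile e to an NFA and simulate it on word, tracking a set of states."""
    trans = []
    start, final = build(e, trans, count())
    current = eps_closure({start}, trans)
    for ch in word:
        step = {r for (p, label, r) in trans if p in current and label == ch}
        current = eps_closure(step, trans)
    return final in current

a, b, c = ("sym", "a"), ("sym", "b"), ("sym", "c")
example = ("alt", ("seq", a, b), ("seq", a, c))   # 'a' 'b' | 'a' 'c'
```

Each syntax tree node contributes two states and at most four transitions, which is where the linear size bound comes from. The simulation in `accepts` tracks a whole set of states per input symbol, which illustrates why running an NFA directly is more expensive than running a DFA.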

From NFA's to DFA's

Book 3.7.1

Subset construction with epsilon closure

Determinization is a typical example of a back-end optimization: it optimizes running time.

Example (again):

    'a' 'b' | 'a' 'c'
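A minimal sketch of the subset construction for this example follows. The NFA is written by hand (its state numbers are arbitrary and chosen for this illustration), and the resulting DFA is left partial, that is, no explicit dead state is created.

```python
EPS = None  # label used for epsilon transitions in this sketch

# A hand-written NFA for  'a' 'b' | 'a' 'c'
NFA_TRANS = {
    (0, EPS): {1, 4},
    (1, "a"): {2}, (2, "b"): {3},
    (4, "a"): {5}, (5, "c"): {6},
}
NFA_START, NFA_FINALS, ALPHABET = 0, {3, 6}, "abc"

def eps_closure(states):
    """All NFA states reachable from the given ones by epsilon transitions."""
    closure, todo = set(states), list(states)
    while todo:
        q = todo.pop()
        for r in NFA_TRANS.get((q, EPS), set()):
            if r not in closure:
                closure.add(r)
                todo.append(r)
    return frozenset(closure)

def determinize():
    """Return (start, transitions, finals, states) of the DFA.

    Every DFA state is a frozenset of NFA states.
    """
    start = eps_closure({NFA_START})
    dfa_trans, seen, todo = {}, {start}, [start]
    while todo:
        subset = todo.pop()
        for ch in ALPHABET:
            moved = set()
            for q in subset:
                moved |= NFA_TRANS.get((q, ch), set())
            target = eps_closure(moved)
            if not target:
                continue            # no transition on ch from this DFA state
            dfa_trans[(subset, ch)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    finals = {s for s in seen if s & NFA_FINALS}
    return start, dfa_trans, finals, seen
```

On this input the construction yields four DFA states: {0,1,4}, {2,5}, {3} and {6}. The two branches that both start with 'a' are merged into the single state {2,5}, so the DFA never has to guess which alternative it is in.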

Minimizing DFA's

Book 3.9.6

Merging equivalent states

Minimization is a further example of a back-end optimization: it optimizes space.

Example (once more):

    'a' 'b' | 'a' 'c'
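Merging equivalent states can be sketched with partition refinement (in the style of Moore's algorithm, which may differ from the presentation in the book). The DFA below is a hand-written four-state automaton for the example, with hypothetical state numbers. A missing transition is treated as leading to an implicit dead state, and removal of unreachable states is not covered.

```python
# A DFA for 'a' 'b' | 'a' 'c' with one final state per branch.
STATES, ALPHABET = [0, 1, 2, 3], "abc"
TRANS = {(0, "a"): 1, (1, "b"): 2, (1, "c"): 3}
FINALS = {2, 3}

def minimize():
    """Return a map from each state to the number of its equivalence class."""
    # Initial partition: final states versus non-final states.
    block = {q: int(q in FINALS) for q in STATES}
    while True:
        # Signature of a state: its own block and the blocks reached on each
        # symbol (None stands for the implicit dead state).
        sig = {q: (block[q],) + tuple(block.get(TRANS.get((q, ch)))
                                      for ch in ALPHABET)
               for q in STATES}
        ids = {}
        new_block = {q: ids.setdefault(sig[q], len(ids)) for q in STATES}
        if len(ids) == len(set(block.values())):
            return new_block        # no block was split: partition is stable
        block = new_block
```

Here `minimize()` puts states 2 and 3 into the same class, since both are final and have no outgoing transitions, so the minimized automaton has three states.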

Equivalence between regular expressions and finite-state automata

It is conversely possible to express any finite-state automaton by a regular expression. A decompilation algorithm is given e.g. in Hopcroft & Ullman, Introduction to Automata Theory, Languages, and Computation (1979).

Often it is easier to construct the expression than the automaton, sometimes vice versa.

Example (easy with automata): any sequence of 0's and 1's with an even number of 0's. Construct an automaton with two states, corresponding to the number of 0's being even or odd.
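The two-state automaton can be written down directly as a transition table. The state names "even" and "odd" and the function name `accepts` are chosen for this sketch.

```python
# States record the parity of the number of 0's read so far.
DELTA = {
    ("even", "0"): "odd",  ("even", "1"): "even",
    ("odd",  "0"): "even", ("odd",  "1"): "odd",
}
START, FINALS = "even", {"even"}

def accepts(word):
    """Run the DFA on a string of 0's and 1's."""
    state = START
    for symbol in word:
        state = DELTA[(state, symbol)]   # exactly one next state: deterministic
    return state in FINALS
```

The run takes one table lookup per input symbol, which is the O(n) behaviour expected of a DFA. A regular expression for the same language, such as 1* (0 1* 0 1*)*, takes more thought to get right.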

Example (easy with expressions): algebraic reasoning (and simplification) of regular expressions, with laws such as

    A eps = A
    A | A = A
    A**   = A*

Example of reasoning on automata

(Reminder: language = set of strings.)

Proposition. The language `{ a^n b^n | n = 0,1,2,...}` is not regular.

Proof (Book 4.2.7). We show that an automaton accepting the strings a^k b^k for all k up to n needs at least n+1 states. It follows that no automaton with a finite number of states can accept the language, where n can be arbitrary.

Consider the states s_0, s_1, ..., s_n reached after reading 0, 1, ..., n a's. Each of these states must have a different follow-up, and hence be a distinct state: if we had s_i = s_j for some i different from j, then the automaton, which accepts a^j b^j, would also accept a^i b^j, which is not in the language.

Example of exponential DFA size

A sequence of a's and b's where the (n+1)'th last symbol is a, i.e. there is an a followed by exactly n more symbols.

    (a|b)* a (a|b)^n

The number of states in the DFA is exponential in n.

Show this for n=1.
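The growth can be observed by running the subset construction on the obvious NFA for this expression and counting the reachable DFA states. The NFA layout (state 0 loops and guesses, states 1 to n+1 count the remaining symbols) and the name `dfa_size` are assumptions of this sketch.

```python
def dfa_size(n):
    """Count the reachable subset-construction states for (a|b)* a (a|b)^n."""
    # NFA: state 0 loops on a and b and may guess that the current 'a' is the
    # marked one; states 1..n+1 count the symbols after it; n+1 is final.
    def step(states, ch):
        nxt = set()
        for q in states:
            if q == 0:
                nxt.add(0)
                if ch == "a":
                    nxt.add(1)
            elif q <= n:
                nxt.add(q + 1)
        return frozenset(nxt)

    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        subset = todo.pop()
        for ch in "ab":
            target = step(subset, ch)
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return len(seen)
```

For n = 1 the reachable subsets are {0}, {0,1}, {0,2} and {0,1,2}: the DFA has to remember which of the last two symbols were a's. In general each subset encodes the last n+1 symbols, so the count doubles whenever n grows by one.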

Limitations of regular languages

Java, C, and Haskell are not regular languages.

We cannot write parsers for them by using regular expressions alone.

This is because the languages have rules of the form a^n X b^n e.g. for matching left and right parentheses. A machine with finitely many states cannot remember that exactly n left parentheses have been read, if n is arbitrary (cf. the proposition on a^n b^n above).

Regular and context-free languages

Every regular language is also a context-free language (i.e. can be described by a context-free grammar).

The converse is not true in general. It does hold for grammars with no recursion, and also for grammars with only left recursion

    C ::= C X      ===>  C = Y X*
    C ::= Y

or right recursion

    C ::= X C      ===>  C = X* Y
    C ::= Y

But it does not hold for grammars with middle recursion:

    C ::= X C Z    ===>  C = X^n Y Z^n
    C ::= Y

The definition of context-free grammars (repetition)

A context-free grammar is a quadruple (T,N,S,R) where

T and N are disjoint sets, called terminals and nonterminals, respectively
S is a nonterminal, the start category
R is a finite set of rules
a rule is a pair (C, t_1...t_n) where
- C is a nonterminal
- each t_i is a terminal or a nonterminal
- n >= 0

The Chomsky normal form

All rules have one of these forms:

    C ::= A B  -- two non-terminals
    C ::= t    -- one terminal

This means that all parse trees are binary trees.

All context-free grammars that do not generate the empty string can be brought to this form. For instance,

    Exp ::= Exp "+" Exp   ===>  Exp     ::= Exp PlusExp
                                PlusExp ::= Plus Exp
                                Plus    ::= "+"

Thus the Chomsky normal form creates new categories and increases the number of rules.

It can be complicated to compute (exercise 4.4.8).

A proof of decidability

Lemma: a grammar can only generate a finite number of token sequences of a given length.

Method:

given a string, generate all sequences of the same length that the grammar can produce
check if the given string is among the generated ones

The lemma is easy to prove for grammars in the Chomsky normal form.

Recognition and parsing for Chomsky normal form

Recognition: just check if a string is grammatical in G.

    seqs(C,1)   = {t | (C ::= t) : G}
    seqs(C,n+1) = {x y | 
      (C ::= A B) : G,
      k : {1,...,n}, 
      x : seqs(A,k), y : seqs(B,n+1-k)}

Parsing: construct all possible parse trees in G.

    trees(C,1)   = {(t,f) | (f. C ::= t) : G}
    trees(C,n+1) = {(x y, f v w) | 
      (f. C ::= A B) : G,
      k : {1,...,n}, 
      (x,v) : trees(A,k), (y,w) : trees(B,n+1-k)}

This is easy to implement, but not very efficient.
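The recognition scheme can be transcribed almost literally. The grammar below, for a^n b^n with n >= 1, is a hypothetical example in Chomsky normal form, and the dictionary encoding of rules is an assumption of this sketch. Memoization is added so that each seqs(C,n) is computed only once.

```python
from functools import lru_cache

# A grammar in Chomsky normal form for { a^n b^n | n >= 1 }:
#   S ::= A B | A T ;  T ::= S B ;  A ::= "a" ;  B ::= "b"
BINARY = {"S": [("A", "B"), ("A", "T")], "T": [("S", "B")]}
TERMINAL = {"A": ["a"], "B": ["b"]}

@lru_cache(maxsize=None)
def seqs(cat, n):
    """All token sequences of length n derivable from cat, as tuples."""
    if n == 1:
        return frozenset((t,) for t in TERMINAL.get(cat, []))
    return frozenset(
        x + y
        for (left, right) in BINARY.get(cat, [])
        for k in range(1, n)              # length of the left part
        for x in seqs(left, k)
        for y in seqs(right, n - k))

def recognize(tokens, start="S"):
    """Check whether the token sequence is among the generated ones."""
    return len(tokens) >= 1 and tuple(tokens) in seqs(start, len(tokens))
```

With this grammar, `recognize("aabb")` holds and `recognize("aab")` does not. The inefficiency is visible in the code: all sequences of the given length are generated before the input is even looked at.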

Better algorithms for context-free grammars

CYK: for Chomsky normal form

Earley algorithm, chart parsing: for arbitrary grammars

GLR (Generalized LR): for arbitrary grammars

The worst-case complexity for these algorithms is cubic in the length of the input (recall: LALR(1) is linear!)
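A minimal sketch of a CYK recognizer shows where the cubic bound comes from. The grammar (for a^n b^n with n >= 1) and the encoding of rules as tuples are assumptions of this illustration.

```python
# A grammar in Chomsky normal form for { a^n b^n | n >= 1 }:
#   S ::= A B | A T ;  T ::= S B ;  A ::= "a" ;  B ::= "b"
BINARY = [("S", "A", "B"), ("S", "A", "T"), ("T", "S", "B")]
TERMINAL = [("A", "a"), ("B", "b")]

def cyk(tokens, start="S"):
    """Return True if the tokens can be derived from the start category."""
    n = len(tokens)
    if n == 0:
        return False
    # table[i][l] = set of categories that derive tokens[i : i + l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][1] = {c for (c, t) in TERMINAL if t == tok}
    for length in range(2, n + 1):            # span length
        for i in range(n - length + 1):       # span start
            for k in range(1, length):        # split point
                for (c, left, right) in BINARY:
                    if left in table[i][k] and right in table[i + k][length - k]:
                        table[i][length].add(c)
    return start in table[0][n]
```

The three nested loops over span length, span start and split point give O(n^3) steps, each of which scans the rules of the grammar. Unlike the naive method, the table is driven by the input, so only categories that actually cover a span of the given string are ever recorded.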

The theoretical limits

Parsing with context-free grammars is decidable; the worst-case time complexity is cubic.

Programming language implementations strive for linear-time parsing - this is possible for a limited class of grammars.

A simple language that is not context-free: the copy language

    {x x | x : ('a' | 'b')*}

(proof omitted). Observe: this is not the same as

    S ::= S S | "a" S | "b" S |

An instance of this: check that a variable is declared before it is used.

The copy language in GF

It is easy to define the copy language if constructors and linearization functions are separated (cf. Lecture 2). This is how it is written in GF:

    -- abstract syntax
    cat S ; W ;
    fun s : W -> S ;
    fun e : W ;
    fun a : W -> W ;
    fun b : W -> W ;

    -- concrete syntax (linearization rules)
    lin s w = w ++ w ;      -- the copy: the same word twice
    lin e   = "" ;
    lin a w = "a" ++ w ;
    lin b w = "b" ++ w ;

For instance, abbababbab has the tree s (a (b (b (a (b e))))).
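The separation of trees and linearization can be imitated in a few lines of Python. This is only an analogy and not how GF works internally: the tuple encoding of trees and the names `lin` and `parse` are assumptions of this sketch, and plain string concatenation stands in for GF's token concatenation `++`.

```python
# Abstract syntax trees as nested tuples: ("s", w), ("e",), ("a", w), ("b", w).

def lin(tree):
    """Linearize a tree following the four lin rules of the grammar."""
    fun, args = tree[0], [lin(t) for t in tree[1:]]
    if fun == "s":
        return args[0] + args[0]          # lin s w = w ++ w
    if fun == "e":
        return ""                         # lin e = ""
    if fun in ("a", "b"):
        return fun + args[0]              # lin a w = "a" ++ w, same for b
    raise ValueError(f"unknown function: {fun!r}")

def parse(string):
    """Return the tree of a copy-language string, or None if there is none."""
    half, rest = divmod(len(string), 2)
    if rest or string[:half] != string[half:] or set(string) - {"a", "b"}:
        return None
    w = ("e",)
    for ch in reversed(string[:half]):    # build the word tree right to left
        w = (ch, w)
    return ("s", w)
```

The tree only describes one copy of the word; the duplication happens in the linearization of `s`. That is why the abstract syntax stays context-free even though the generated string language is not.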

GF corresponds to a class of grammars known as parallel multiple context-free grammars, which is useful in natural language description. Its parsing complexity is polynomial.