Book: 3.6, 3.7, 3.9, 4.1, 4.2, 4.3,
Regular expressions and finite-state automata
Nondeterministic and deterministic automata
Reasoning about automata and expressions
Regular and context-free languages
Decidability of context-free parsing
Limits of context-free grammars
name | notation | semantics | verbally
---|---|---|---
symbol | 'a' | {a} | the symbol 'a'
sequence | A B | {xy \| x : A, y : B} | A followed by B
union | A \| B | A U B | A or B
closure | A* | {A^n \| n = 0,1,2,..} | any number of A's
empty | eps | {[]} | the empty string
This is BNFC notation.
The semantics is a regular language, a set of strings of symbols. One often writes L(E) for the language denoted by the expression E.
There are variants of the notation, especially for symbols and empties.
Quoting symbols is often omitted in theoretical examples!
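A Nondeterministic Finite-state Automaton (NFA) is a tuple (Q,S,q0,F,d) where
- Q is a finite set of states
- S is a finite set of symbols (the alphabet)
- q0 is the initial state, an element of Q
- F is the set of final (accepting) states, a subset of Q
- d is the transition function, which maps a state and a symbol (or epsilon) to a subset of Q
A Deterministic Finite-state Automaton (DFA) is like an NFA, except that the value of the transition function d is a single state in Q (instead of a subset of Q) and there are no epsilon transitions (transitions that don't consume a symbol). In other words, d maps each pair of a state and a symbol to one state: d : Q x S -> Q.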
It is customary to draw diagrams with circles for states and arrows for the transition function: you can find them in the course book, and we will draw them on the blackboard.
1) NFA generation. Compile the expression into an NFA, using transitions with the epsilon symbol e. (Book 3.7.4). This translation is compositional: simply traverse the syntax tree of the regular expression. The size of the automaton is linear in the size of the expression: O(|r|).
But NFAs are, in comparison to DFAs,
- more difficult to implement
- less efficient to run: O(|r| * n), where n is the input length.
2) Determinization. Transform the NFA into a DFA. (Book 3.7.1). Here you go through states, and for each symbol create a state collecting those states that the symbol can lead to (subset construction).
The DFA can still be bigger than necessary, i.e. have more states than needed.
3) Minimization. Merge equivalent states (Book 3.9.6): states beginning from which the automaton recognizes the same set of strings.
The resulting DFA permits analysis that is linear in the length of the input string: O(n). But its size is still exponential, at worst, in the length of the regular expression: O(2^|r|).
Book 3.7.4.
A typical example of compilation schemes:
NFA(eps), NFA('c'), NFA(A B), NFA(A | B), NFA(A*): one automaton diagram per construct (the diagrams are not reproduced here; see Book 3.7.4).
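The compositional translation can be sketched in code. The following is a minimal sketch, not the book's exact construction: regular expressions are encoded as nested tuples (a hypothetical encoding chosen for this example), every syntax node gets a fresh start and final state, and the pieces are glued together with epsilon transitions. The names `compile_nfa`, `closure`, and `accepts` are assumptions made for illustration.

```python
from itertools import count

# Hypothetical encoding of regular expressions as nested tuples:
#   ("eps",)  ("sym", c)  ("seq", A, B)  ("alt", A, B)  ("star", A)
EPS = None  # label of epsilon transitions

def compile_nfa(regex):
    """Return (start, final, transitions), transitions being (source, label, target) triples."""
    fresh = count()
    trans = []

    def go(r):
        # Each syntax node contributes two fresh states, so the size is linear.
        start, final = next(fresh), next(fresh)
        tag = r[0]
        if tag == "eps":
            trans.append((start, EPS, final))
        elif tag == "sym":
            trans.append((start, r[1], final))
        elif tag == "seq":
            s1, f1 = go(r[1])
            s2, f2 = go(r[2])
            trans.extend([(start, EPS, s1), (f1, EPS, s2), (f2, EPS, final)])
        elif tag == "alt":
            s1, f1 = go(r[1])
            s2, f2 = go(r[2])
            trans.extend([(start, EPS, s1), (start, EPS, s2),
                          (f1, EPS, final), (f2, EPS, final)])
        elif tag == "star":
            s1, f1 = go(r[1])
            trans.extend([(start, EPS, final), (start, EPS, s1),
                          (f1, EPS, s1), (f1, EPS, final)])
        else:
            raise ValueError(f"unknown regex node: {tag!r}")
        return start, final

    start, final = go(regex)
    return start, final, trans

def closure(states, trans):
    """All states reachable from `states` by epsilon transitions only."""
    todo, seen = list(states), set(states)
    while todo:
        q = todo.pop()
        for (p, label, t) in trans:
            if p == q and label is EPS and t not in seen:
                seen.add(t)
                todo.append(t)
    return seen

def accepts(nfa, string):
    """Run the NFA directly by tracking the set of current states."""
    start, final, trans = nfa
    current = closure({start}, trans)
    for c in string:
        moved = {t for (p, label, t) in trans if p in current and label == c}
        current = closure(moved, trans)
    return final in current

# The running example: 'a' 'b' | 'a' 'c'
AB_OR_AC = ("alt", ("seq", ("sym", "a"), ("sym", "b")),
                   ("seq", ("sym", "a"), ("sym", "c")))
EXAMPLE_NFA = compile_nfa(AB_OR_AC)
```

Running the NFA this way visits a set of states per input symbol, which is where the O(|r| * n) cost of direct NFA simulation comes from.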
Notice: the compilation would be much simpler, if the operators were only between symbols.
Example:
'a' 'b' | 'a' 'c'
Book 3.7.1
Subset construction with epsilon closure
This is a typical example of a back-end optimization, here of running time.
Example (again):
'a' 'b' | 'a' 'c'
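As a rough illustration of subset construction with epsilon closure, here is a small sketch. The NFA for 'a' 'b' | 'a' 'c' is written out by hand, and its state numbering is an assumption made for this example, not the numbering in the book. Missing transitions are left out, so the resulting DFA is partial.

```python
EPS = None  # label of epsilon transitions

# Hand-written NFA for 'a' 'b' | 'a' 'c' (state numbers are illustrative).
NFA_TRANS = {
    (0, EPS): {1, 4},
    (1, "a"): {2},
    (2, "b"): {3},
    (4, "a"): {5},
    (5, "c"): {6},
    (3, EPS): {7},
    (6, EPS): {7},
}
NFA_START, NFA_FINALS = 0, {7}

def eps_closure(states, trans):
    """All NFA states reachable from `states` without consuming a symbol."""
    todo, seen = list(states), set(states)
    while todo:
        q = todo.pop()
        for t in trans.get((q, EPS), ()):
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return frozenset(seen)

def determinize(start, finals, trans):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    alphabet = sorted({label for (_, label) in trans if label is not EPS})
    d_start = eps_closure({start}, trans)
    d_trans, todo, seen = {}, [d_start], {d_start}
    while todo:
        subset = todo.pop()
        for c in alphabet:
            moved = set()
            for q in subset:
                moved |= trans.get((q, c), set())
            if not moved:
                continue  # no transition on c: the DFA stays partial
            target = eps_closure(moved, trans)
            d_trans[(subset, c)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    d_finals = {s for s in seen if s & finals}
    return d_start, d_finals, d_trans

def run(dfa, string):
    """Linear-time recognition with the DFA."""
    state, finals, trans = dfa
    for c in string:
        if (state, c) not in trans:
            return False
        state = trans[(state, c)]
    return state in finals

EXAMPLE_DFA = determinize(NFA_START, NFA_FINALS, NFA_TRANS)
```

For this NFA the construction yields four DFA states, two of which are final (one reached by "ab", one by "ac").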
Book 3.9.6
Merging equivalent states
This is a further example of a back-end optimization, here of space.
Example (once more):
'a' 'b' | 'a' 'c'
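Merging equivalent states can be sketched with partition refinement (a Moore-style algorithm, which may differ in detail from the one in the book). The DFA below is the four-state automaton one would expect for 'a' 'b' | 'a' 'c'. Its state numbers and the function name `minimize` are assumptions for this example, and unreachable states are not removed.

```python
def minimize(states, alphabet, trans, start, finals):
    """Merge equivalent states of a (possibly partial) DFA by partition refinement."""
    # Start from two blocks: final and non-final states.
    block = {q: int(q in finals) for q in states}
    while True:
        # Signature of a state: its own block plus the block reached on each
        # symbol (None stands for a missing transition, an implicit dead state).
        sig = {q: (block[q],) + tuple(block.get(trans.get((q, c))) for c in alphabet)
               for q in states}
        ids = {}
        refined = {q: ids.setdefault(sig[q], len(ids)) for q in states}
        if len(ids) == len(set(block.values())):
            break  # no block was split, so the partition is stable
        block = refined
    m_trans = {(block[q], c): block[t] for (q, c), t in trans.items()}
    m_finals = {block[q] for q in finals}
    return set(block.values()), m_trans, block[start], m_finals

# DFA for 'a' 'b' | 'a' 'c': 0 -a-> 1, 1 -b-> 2, 1 -c-> 3, finals 2 and 3.
STATES = [0, 1, 2, 3]
TRANS = {(0, "a"): 1, (1, "b"): 2, (1, "c"): 3}
FINALS = {2, 3}

M_STATES, M_TRANS, M_START, M_FINALS = minimize(STATES, "abc", TRANS, 0, FINALS)
```

The two final states accept the same set of continuations (only the empty string), so they are merged and three states remain. The result corresponds to the expression 'a' ('b' | 'c').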
It is conversely possible to express any finite-state automaton by a regular expression. A decompilation algorithm is given e.g. in Hopcroft & Ullman, Introduction to Automata Theory, Languages, and Computation (1979).
Often it is easier to construct the expression than the automaton, sometimes vice versa.
Example (easy with automata): any sequence of 0's and 1's with an even number of 0's. Construct an automaton with two states, corresponding to the number of 0's being even or odd.
Example (easy with expressions): algebraic reasoning about (and simplification of) regular expressions, with laws such as
  A eps = A
  A | A = A
  A** = A*
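The three laws can be applied mechanically as rewrite rules. The sketch below uses a hypothetical tuple encoding of regular expressions and a function `simplify` invented for this example. It applies only the laws listed above, bottom-up, and makes no claim to compute a smallest expression.

```python
# Hypothetical encoding: ("eps",)  ("sym", c)  ("seq", A, B)  ("alt", A, B)  ("star", A)
EPSILON = ("eps",)

def simplify(r):
    """Apply  A eps = A,  A | A = A,  A** = A*  bottom-up."""
    tag = r[0]
    if tag == "seq":
        a, b = simplify(r[1]), simplify(r[2])
        return a if b == EPSILON else ("seq", a, b)   # A eps = A
    if tag == "alt":
        a, b = simplify(r[1]), simplify(r[2])
        return a if a == b else ("alt", a, b)         # A | A = A
    if tag == "star":
        a = simplify(r[1])
        return a if a[0] == "star" else ("star", a)   # A** = A*
    return r  # symbols and eps are already simple

# ('a' | 'a' eps)**  simplifies to  'a'*
MESSY = ("star", ("star", ("alt", ("sym", "a"), ("seq", ("sym", "a"), EPSILON))))
TIDY = simplify(MESSY)
```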
(Reminder: language = set of strings.)
Proposition. The language `{ a^n b^n | n = 0,1,2,...}` is not regular.
Proof (Book 4.2.7). We show that an automaton that accepts `a^n b^n` for every n up to a bound N, and no other string of the form a^i b^j, needs at least N+1 states. It follows that no finite number of states is enough to accept the language when n can be arbitrary.
Consider the states s_0, s_1, ..., s_N reached after reading 0, 1, ..., N a's. Each of these states must have a different follow-up, and hence be a distinct state: if we had s_i = s_j for some i different from j, then the automaton, which accepts a^j b^j, would also accept a^i b^j, which is not in the language.
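The argument can be tried out on a concrete automaton. The sketch below defines a hypothetical DFA that counts a's only up to a fixed `LIMIT` and then searches for two different numbers of a's that lead to the same state. The names `step`, `accepts`, `find_collision`, and `LIMIT` are invented for this illustration.

```python
DEAD = "dead"
LIMIT = 3  # this automaton can count at most 3 a's

def step(state, symbol):
    """Transition function of a DFA that tries, and fails, to recognize a^n b^n."""
    if state == DEAD:
        return DEAD
    phase, i = state
    if phase == "a" and symbol == "a":
        return ("a", min(i + 1, LIMIT))  # the counter saturates at LIMIT
    if symbol == "b" and i >= 1:
        return ("b", i - 1)              # one fewer b left to read
    return DEAD

def accepts(string):
    state = ("a", 0)
    for c in string:
        state = step(state, c)
    return state in {("a", 0), ("b", 0)}

def find_collision(step, start, k):
    """For a DFA with at most k states, find i < j with s_i = s_j (pigeonhole)."""
    seen = {}
    state = start
    for i in range(k + 1):
        if state in seen:
            return seen[state], i
        seen[state] = i
        state = step(state, "a")
    raise AssertionError("unreachable if the DFA has at most k states")

# The DFA has at most 9 states: 4 "a" states, 4 "b" states, and the dead state.
I, J = find_collision(step, ("a", 0), 9)
```

Here the collision is between 3 and 4 a's, and indeed the automaton wrongly accepts a^4 b^3 because it accepts a^3 b^3. Any finite choice of `LIMIT` runs into the same problem.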
A sequence of a's and b's where the (n+1)'th last symbol is a:
  (a|b)* a (a|b)^n
The number of states in the DFA is exponential in n.
Show this for n=1.
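The growth can be observed by counting the subsets that subset construction reaches. This sketch hard-codes an NFA for (a|b)* a (a|b)^n: state 0 loops on both symbols and guesses the marked a, states 1 to n count the remaining symbols, and state n+1 is final. The function names are assumptions for this example.

```python
def nfa_step(q, c, n):
    """Transitions of the NFA for (a|b)* a (a|b)^n."""
    if q == 0:
        return {0, 1} if c == "a" else {0}  # loop, or guess that this a is the marked one
    if q <= n:
        return {q + 1}                      # count one of the n trailing symbols
    return set()                            # the final state n+1 has no successors

def dfa_size(n):
    """Number of DFA states reachable by subset construction."""
    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        subset = todo.pop()
        for c in "ab":
            target = frozenset(t for q in subset for t in nfa_step(q, c, n))
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return len(seen)
```

For n=1 this gives 4 states, and each increment of n doubles the count: the DFA has to remember which of the last n+1 symbols were a's.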
Java, C, and Haskell are not regular languages.
We cannot write their parser by using regular expressions.
This is because the languages have rules of the form a^n X b^n, e.g. for matching left and right parentheses.
A machine with finitely many states cannot remember that exactly n left parentheses have been read, if n is arbitrary (cf. the proposition on a^n b^n above).
Every regular language is also a context-free language (i.e. can be described by a context-free grammar).
The converse is not true in general. A context-free grammar does define a regular language if it has no recursion, or just left recursion
  C ::= C X    ===>   C = Y X*
  C ::= Y
or right recursion
  C ::= X C    ===>   C = X* Y
  C ::= Y
But this does not hold for grammars with middle recursion:
  C ::= X C Z  ===>   C = X^n Y Z^n
  C ::= Y
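A context-free grammar is a quadruple (T,N,S,R) where
- T is a set of terminals
- N is a set of nonterminals, disjoint from T
- S is the start symbol, an element of N
- R is a set of rules of the form C ::= t1 ... tn, where C is in N and each ti is in T or N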
In Chomsky normal form, all rules have one of these forms:
  C ::= A B   -- two non-terminals
  C ::= t     -- one terminal
This means that all parse trees are binary trees.
All context-free grammars can be brought to this form. For instance,
  Exp ::= Exp "+" Exp   ===>   Exp     ::= Exp PlusExp
                               PlusExp ::= Plus Exp
                               Plus    ::= "+"
Thus the Chomsky normal form creates new categories and increases the number of rules.
It can be complicated to compute (exercise 4.4.8).
Lemma: a grammar can only generate a finite number of token sequences of a given length.
Method:
The lemma is easy to prove for grammars in the Chomsky normal form.
Recognition: just check if a string is grammatical in G.
  seqs(C,1)   = {t | (C ::= t) : G}
  seqs(C,n+1) = {x y | (C ::= A B) : G, k : {1,...,n}, x : seqs(A,k), y : seqs(B,n+1-k)}
Parsing: construct all possible parse trees in G.
  trees(C,1)   = {(t,f) | (f. C ::= t) : G}
  trees(C,n+1) = {(x y, f v w) | (f. C ::= A B) : G, k : {1,...,n}, (x,v) : trees(A,k), (y,w) : trees(B,n+1-k)}
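The two definitions translate almost literally into code. The sketch below assumes a grammar in Chomsky normal form given as two rule lists. The rule labels (`EInt`, `EAdd`, ...) and the tiny expression grammar are made up for this example. Here `n` is the total length, so the split point `k` ranges over 1..n-1.

```python
# Hypothetical labelled CNF grammar for sums of 1's.
TERMINAL_RULES = [("EInt", "Exp", "1"), ("TPlus", "Plus", "+")]   # f. C ::= t
BINARY_RULES = [("EAdd", "Exp", "Exp", "PlusExp"),                # f. C ::= A B
                ("EPlus", "PlusExp", "Plus", "Exp")]

def seqs(cat, n):
    """All token sequences of length n derivable from cat."""
    if n == 1:
        return {(t,) for (_, c, t) in TERMINAL_RULES if c == cat}
    return {x + y
            for (_, c, a, b) in BINARY_RULES if c == cat
            for k in range(1, n)
            for x in seqs(a, k)
            for y in seqs(b, n - k)}

def trees(cat, n):
    """All (token sequence, parse tree) pairs of length n derivable from cat."""
    if n == 1:
        return [((t,), f) for (f, c, t) in TERMINAL_RULES if c == cat]
    return [(x + y, (f, v, w))
            for (f, c, a, b) in BINARY_RULES if c == cat
            for k in range(1, n)
            for (x, v) in trees(a, k)
            for (y, w) in trees(b, n - k)]

def recognize(cat, tokens):
    """Recognition: is the token list among the sequences of its own length?"""
    return tuple(tokens) in seqs(cat, len(tokens))

def parse(cat, tokens):
    """Parsing: all trees whose token sequence equals the input."""
    return [tree for (s, tree) in trees(cat, len(tokens)) if s == tuple(tokens)]
```

With this grammar, `parse("Exp", list("1+1+1"))` returns two trees, one for each way of bracketing the sum. The functions recompute the same sets many times, which is the inefficiency the more refined algorithms avoid.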
This is easy to implement, but not very efficient.
CYK: for Chomsky normal form
Earley algorithm, chart parsing: for arbitrary grammars
GLR (Generalized LR): for arbitrary grammars
The worst-case complexity for these algorithms is cubic in the length of the input (recall: LALR(1) is linear!)
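A compact sketch of CYK recognition shows where the cubic bound comes from: three nested loops over span length, start position, and split point. The grammar representation and the example grammar for a^n b^n (n >= 1) are assumptions made for this illustration.

```python
def cyk(tokens, start, terminal_rules, binary_rules):
    """CYK recognition for a grammar in Chomsky normal form."""
    n = len(tokens)
    if n == 0:
        return False  # CNF as used here derives no empty string
    # table[(i, l)] = set of categories that derive tokens[i : i + l]
    table = {}
    for i, t in enumerate(tokens):
        table[(i, 1)] = {c for (c, u) in terminal_rules if u == t}
    for length in range(2, n + 1):              # span length
        for i in range(n - length + 1):         # span start
            cell = set()
            for k in range(1, length):          # split point
                left = table[(i, k)]
                right = table[(i + k, length - k)]
                for (c, a, b) in binary_rules:
                    if a in left and b in right:
                        cell.add(c)
            table[(i, length)] = cell
    return start in table[(0, n)]

# Hypothetical CNF grammar for a^n b^n with n >= 1:
#   S ::= A B | A T ;  T ::= S B ;  A ::= "a" ;  B ::= "b"
AB_TERMINALS = [("A", "a"), ("B", "b")]
AB_BINARIES = [("S", "A", "B"), ("S", "A", "T"), ("T", "S", "B")]
```

Each table cell is computed once, in contrast to the naive enumeration of all sequences of a given length.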
Parsing with context-free grammars is decidable; the worst-case time complexity is cubic.
Programming language implementations strive for linear-time parsing - this is possible for a limited class of grammars.
A simple language that is not context-free: the copy language
{x x | x : ('a' | 'b')*}
(proof omitted). Observe: this is not the same as
S ::= S S | "a" S | "b" S |
An instance of this: check that a variable is declared before it is used.
It is easy to define the copy language if constructors and linearization functions are separated (cf. Lecture 2). This is how it is written in GF:
  -- abstract syntax
  cat S ; W ;
  fun s : W -> S ;
  fun e : W ;
  fun a : W -> W ;
  fun b : W -> W ;
  lin s w = w ++ w ;
  lin e = "" ;
  lin a w = "a" ++ w ;
  lin b w = "b" ++ w ;
For instance, abbababbab has the tree s (a (b (b (a (b e))))).
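To see how the separation of trees and linearization produces the copy language, the linearization rules can be mimicked in Python. This is only an imitation of the GF rules above, with trees encoded as nested tuples. The encoding and the function names `lin_w` and `lin_s` are assumptions for this sketch.

```python
# Trees as nested tuples: ("e",), ("a", w), ("b", w), and ("s", w) at the top.

def lin_w(tree):
    """Linearize a tree of category W to a list of tokens."""
    tag = tree[0]
    if tag == "e":
        return []                      # lin e = ""
    if tag in ("a", "b"):
        return [tag] + lin_w(tree[1])  # lin a w = "a" ++ w, lin b w = "b" ++ w
    raise ValueError(f"not a W tree: {tag!r}")

def lin_s(tree):
    """Linearize a tree of category S: the argument is used twice."""
    if tree[0] != "s":
        raise ValueError("not an S tree")
    w = lin_w(tree[1])
    return w + w                       # lin s w = w ++ w

# The tree s (a (b (b (a (b e)))))
EXAMPLE_TREE = ("s", ("a", ("b", ("b", ("a", ("b", ("e",)))))))
EXAMPLE_STRING = "".join(lin_s(EXAMPLE_TREE))
```

The duplication happens in `lin_s`, where the linearization of the argument is used twice. A context-free rule cannot express this reuse of a single subtree.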
GF corresponds to a class of grammars known as parallel multiple context-free grammars, which is useful in natural language description. Its parsing complexity is polynomial.