Lecture 5: Theory of Lexing and Parsing
Programming Languages Course
Aarne Ranta (aarne@chalmers.se)

%!target:html
%!postproc(html): #NEW
%!postproc(html): #HR
Book: 3.6, 3.7, 3.9, 4.1, 4.2, 4.3

#NEW

==Overview of this lecture==

- Regular expressions and finite-state automata
- Nondeterministic and deterministic automata
- Reasoning about automata and expressions
- Regular and context-free languages
- Decidability of context-free parsing
- Limits of context-free grammars


#NEW

==The syntax and semantics of regular expressions (repetition)==

|| name | notation | semantics | verbally ||
| symbol | 'a' | {a} | the symbol 'a'
| sequence | A B | ``{xy | x : A, y : B}`` | A followed by B
| union | ``A | B`` | A U B | A or B
| closure | ``A*`` | ``{A^n | n = 0,1,2,..}`` | any number of A's
| empty | eps | {[]} | the empty string


This is BNFC notation. The semantics is a **regular language**, a set of strings of symbols. One often writes //L(E)// for the language denoted by the expression //E//.

There are variants of the notation, especially for symbols and empties. Quoting of symbols is often omitted in theoretical examples!

#NEW

==Finite-state automata==

A **Nondeterministic Finite-state Automaton** (NFA) is a tuple (Q,S,q0,F,d) where
- Q is a finite set of states
- S is a finite set of input symbols
- q0 : Q is the start state
- F < Q is the set of final states
- d : Q x S' -> P(Q) is the transition function (S' = S U {eps})


A **Deterministic Finite-state Automaton** (DFA) is like an NFA, except that the value of the transition function d is a single state in Q (instead of a subset of Q), and there are no **epsilon transitions** (transitions that don't consume a symbol). In other words,
- d : Q x S -> Q is the transition function


It is customary to draw diagrams with circles for states and arrows for the transition function: you can find them in the course book, and we will draw them on the blackboard.

#NEW

==Compiling regular expressions into finite-state automata==

1) **NFA generation**. Compile the expression into an NFA, using transitions labelled with the epsilon symbol (Book 3.7.4). This translation is **compositional**: simply traverse the syntax tree of the regular expression. The size of the automaton is linear in the size of the expression: O(|r|). But NFAs are, in comparison to DFAs,
- more difficult to implement
- less efficient to run: O(|r| * n), where n is the input length.


2) **Determinization**. Transform the NFA into a DFA (Book 3.7.1). Here you go through the states and, for each symbol, create a state collecting those states that the symbol can lead to (the subset construction). The DFA can still be bigger than necessary, i.e. have more states than needed.

3) **Minimization**. Merge equivalent states (Book 3.9.6), i.e. states from which the automaton recognizes the same set of strings.

The resulting DFA permits analysis that is linear in the length of the input string: O(n). But its size is still, at worst, exponential in the length of the regular expression: O(2^|r|).

#NEW

==From regular expressions to NFA's==

Book 3.7.4. A typical example of **compilation schemes** (the automata are drawn on the blackboard):

```
NFA(eps)   =
NFA('c')   =
NFA(A B)   =
NFA(A | B) =
NFA(A*)    =
```

Notice: the compilation would be much simpler if the operators were only between symbols.
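The schemes become concrete if we implement them. Below is a minimal Haskell sketch of this construction (Thompson's construction); the ``RegExp`` and ``NFA`` types, the state numbering, and the name ``nfa`` are assumptions of the sketch, not notation from the book:

```
-- Regular expressions as a syntax tree.
data RegExp = Eps | Sym Char | Seq RegExp RegExp
            | Alt RegExp RegExp | Star RegExp

-- An NFA with Int states; Nothing labels an epsilon transition.
-- One final state suffices for automata built by this construction.
data NFA = NFA
  { start :: Int
  , final :: Int
  , trans :: [(Int, Maybe Char, Int)]
  }

-- nfa r n compiles r using fresh states n, n+1, ...
-- and returns the automaton together with the next unused state.
nfa :: RegExp -> Int -> (NFA, Int)
nfa Eps     n = (NFA n (n+1) [(n, Nothing, n+1)], n+2)
nfa (Sym c) n = (NFA n (n+1) [(n, Just c, n+1)], n+2)
nfa (Seq a b) n =
  let (na, n1) = nfa a n
      (nb, n2) = nfa b n1
  in (NFA (start na) (final nb)
          (trans na ++ trans nb ++ [(final na, Nothing, start nb)]), n2)
nfa (Alt a b) n =                      -- new start n, new final n2
  let (na, n1) = nfa a (n+1)
      (nb, n2) = nfa b n1
  in (NFA n n2
          (trans na ++ trans nb ++
           [(n, Nothing, start na), (n, Nothing, start nb),
            (final na, Nothing, n2), (final nb, Nothing, n2)]), n2+1)
nfa (Star a) n =                       -- new start n, new final n1
  let (na, n1) = nfa a (n+1)
  in (NFA n n1
          (trans na ++
           [(n, Nothing, start na), (n, Nothing, n1),
            (final na, Nothing, start na), (final na, Nothing, n1)]), n1+1)
```

Each case adds a constant number of new states and transitions, which is exactly why the resulting NFA has size O(|r|).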
Example:
```
'a' 'b' | 'a' 'c'
```

#NEW

==From NFA's to DFA's==

Book 3.7.1. The subset construction, with epsilon closure.

A typical example of back-end optimization: of time.

Example (again):
```
'a' 'b' | 'a' 'c'
```

#NEW

==Minimizing DFA's==

Book 3.9.6. Merging equivalent states.

A further example of back-end optimization: of space.

Example (once more):
```
'a' 'b' | 'a' 'c'
```

#NEW

==Equivalence between regular expressions and finite-state automata==

It is conversely possible to express any finite-state automaton by a regular expression. A decompilation algorithm is given e.g. in Hopcroft & Ullman, //Introduction to Automata Theory, Languages, and Computation// (1979).

Often it is easier to construct the expression than the automaton, sometimes vice versa.

**Example** (easy with automata): any sequence of 0's and 1's with an even number of 0's. Construct an automaton with two states, corresponding to the number of 0's read so far being even or odd.

**Example** (easy with expressions): algebraic reasoning about, and simplification of, regular expressions, with laws such as
```
A eps = A
A | A = A
A**   = A*
```

#NEW

==Example of reasoning on automata==

(Reminder: language = set of strings.)

**Proposition**. The language ``{a^n b^n | n = 0,1,2,...}`` is not regular.

**Proof** (Book 4.2.7). We show that any automaton recognizing the language needs at least //n+1// states for each //n//. It follows that no finite number of states can suffice, since //n// can be arbitrary.

Consider the states ``s_0, s_1, ..., s_n`` reached after reading ``0, 1, ..., n`` symbols ``a``. Each of these states must have a different follow-up, and hence be a distinct state: if we had ``s_i = s_j`` for some ``i /= j``, then, since reading ``b^j`` from ``s_j`` leads to a final state, the automaton would also accept ``a^i b^j``, which is not in the language.

#NEW

==Example of exponential DFA size==

A sequence of //a//'s and //b//'s in which the (//n//+1)'st symbol from the end is //a//:
```
(a|b)* a (a|b)^n
```
The number of states in the minimal DFA is exponential in ``n``. Show this for n=1.

#NEW

==Limitations of regular languages==

Java, C, and Haskell are not regular languages: we cannot write their parsers by using regular expressions. This is because these languages have rules of the form ``a^n X b^n``, e.g. for matching left and right parentheses. A machine with finitely many states cannot remember that exactly //n// left parentheses have been read, if //n// is arbitrary (cf. the proposition on ``a^n b^n`` above).

#NEW

==Regular and context-free languages==

Every regular language is also a context-free language (i.e. can be described by a context-free grammar). The converse is not true. It does hold for grammars with no recursion, and also for grammars with just **left recursion**
```
C ::= C X     ===>   C = Y X*
C ::= Y
```
or **right recursion**
```
C ::= X C     ===>   C = X* Y
C ::= Y
```
But it does not hold for grammars with **middle recursion**:
```
C ::= X C Z   ===>   C = X^n Y Z^n
C ::= Y
```

#NEW

==The definition of context-free grammars (repetition)==

A context-free grammar is a quadruple (//T,N,S,R//) where
- //T// and //N// are disjoint sets, called **terminals** and **nonterminals**, respectively
- //S// is a nonterminal, the **start category**
- //R// is a finite set of **rules**
- a rule is a pair (//C//, //t_1//...//t_n//) where
  - //C// is a nonterminal
  - each //t_i// is a terminal or a nonterminal
  - n >= 0


#NEW

==The Chomsky normal form==

All rules have one of these forms:
```
C ::= A B   -- two nonterminals
C ::= t     -- one terminal
```
This means that all parse trees are binary trees.

All context-free grammars can be brought to this form. For instance,
```
Exp ::= Exp "+" Exp   ===>   Exp     ::= Exp PlusExp
                             PlusExp ::= Plus Exp
                             Plus    ::= "+"
```
Thus the Chomsky normal form creates new categories and increases the number of rules. It can be complicated to compute (exercise 4.4.8).

#NEW

==A proof of decidability==

Lemma: a grammar can only generate a finite number of token sequences of a given length.

Method for deciding whether a string is in the language:
+ given the string, generate all sequences of the same length as it
+ check if the given string is among the generated ones


The lemma is easy to prove for grammars in the Chomsky normal form.

#NEW

==Recognition and parsing for Chomsky normal form==

Recognition: just check if a string is grammatical in G.
```
seqs(C,1)   = {t | (C ::= t) : G}
seqs(C,n+1) = {x y | (C ::= A B) : G, k : {1,...,n},
                     x : seqs(A,k), y : seqs(B,n+1-k)}
```
Parsing: construct all possible parse trees in G.
```
trees(C,1)   = {(t, f) | (f. C ::= t) : G}
trees(C,n+1) = {(x y, f v w) | (f. C ::= A B) : G, k : {1,...,n},
                               (x,v) : trees(A,k), (y,w) : trees(B,n+1-k)}
```
This is easy to implement, but not very efficient.
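To make the recursion concrete, here is a minimal Haskell sketch of the recognition function; the representation of a Chomsky normal form grammar as two rule lists is an assumption of the sketch, not the book's code:

```
type Cat = String

-- A grammar in Chomsky normal form:
-- termRules for C ::= t, branchRules for C ::= A B.
data CNF = CNF
  { termRules   :: [(Cat, String)]
  , branchRules :: [(Cat, Cat, Cat)]
  }

-- seqs g c n: all token sequences of length n derivable from c,
-- following the recursion above (with n+1 written as n, and k < n).
seqs :: CNF -> Cat -> Int -> [[String]]
seqs g c 1 = [ [t] | (c', t) <- termRules g, c' == c ]
seqs g c n = [ x ++ y
             | (c', a, b) <- branchRules g, c' == c
             , k <- [1 .. n-1]
             , x <- seqs g a k
             , y <- seqs g b (n-k) ]

-- Recognition: the string is grammatical iff it occurs among the
-- generated sequences of its own length.
recognize :: CNF -> Cat -> [String] -> Bool
recognize g s ws = ws `elem` seqs g s (length ws)
```

Restricting the search to substrings of the given input, and tabulating the results by category and span, is what turns this exponential enumeration into the cubic CYK algorithm of the next section.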
#NEW

==Better algorithms for context-free grammars==

CYK: for grammars in the Chomsky normal form.

Earley algorithm, chart parsing: for arbitrary grammars.

GLR (Generalized LR): for arbitrary grammars.

The worst-case complexity of these algorithms is cubic in the length of the input (recall: LALR(1) is linear!).

#NEW

==The theoretical limits==

Parsing with context-free grammars is decidable; the worst-case time complexity is cubic.

Programming language implementations strive for linear-time parsing - this is possible for a limited class of grammars.

A simple language that is not context-free: the **copy language**
```
{x x | x : ('a' | 'b')*}
```
(proof omitted). Observe: this is not the same as
```
S ::= S S | "a" S | "b" S |
```
which generates //all/ sequences of "a"'s and "b"'s.

An instance of the copy language in programming languages: checking that a variable is declared before it is used.

#NEW

==The copy language in GF==

It is easy to define the copy language if constructors and linearization functions are separated (cf. Lecture 2). This is how it is written in [GF http://grammaticalframework.org]:
```
-- abstract syntax
cat S ; W ;
fun s : W -> S ;
fun e : W ;
fun a : W -> W ;
fun b : W -> W ;

-- concrete syntax (linearization)
lin s w = w ++ w ;
lin e = "" ;
lin a w = "a" ++ w ;
lin b w = "b" ++ w ;
```
For instance, ``abbababbab`` has the tree ``s (a (b (b (a (b e)))))``.

GF corresponds to a class of grammars known as **parallel multiple context-free grammars**, which is useful in natural language description. Its parsing complexity is polynomial.
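To check the example, the linearization can be transcribed into Haskell (a sketch with its own names, mirroring the GF code above; GF itself is not needed to run it):

```
-- The abstract syntax of words, mirroring cat W and fun e, a, b.
data W = E | A W | B W

-- Linearization of words, mirroring lin e, a, b.
linW :: W -> String
linW E     = ""
linW (A w) = 'a' : linW w
linW (B w) = 'b' : linW w

-- Linearization of sentences, mirroring lin s w = w ++ w.
linS :: W -> String
linS w = linW w ++ linW w

-- linS (A (B (B (A (B E))))) == "abbababbab",
-- the tree s (a (b (b (a (b e))))) from the text.
```

The doubling ``w ++ w`` in the linearization is exactly the feature that takes the language beyond context-free grammars.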