Book: 3.1, 3.3, 3.4, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9
Implementing lexers by hand
Regular expressions
Lexer tools: Alex/JLex/Flex
Implementing top-down (LL) parsers by hand
LL parsing tables
Bottom-up (LR) parsing methods
LALR parsing tools
Debugging and conflict solving
Parsing is done in accordance with BNF rules such as
Stm ::= "if" "(" Exp ")" Stm "else" Stm ; Exp ::= Exp "+" Exp ; Exp ::= Ident ; Exp ::= Integer ;
The terminals (in quotes) are tokens, and in the end all rules generate and recognize lists of tokens.
Ident and Integer are also special kinds of tokens.
The task of lexical analysis is to split the input string into tokens.
Tokens are sometimes separated by spaces (/newlines/tabs), but not necessarily:
  if (i>0) {i=i+27;} else {i++;}

  if, (, i, >, 0, ), {, i, =, i, +, 27, ;, }, else, {, i, ++, ;, }
Good design: it should always be legal to have spaces, newlines, or tabs between tokens - even if not necessary.
Quiz: which programming languages violate this rule?
It is possible for the parser to read input character by character; in this case, the grammar should tell where spaces are possible/necessary.
Exp ::= Exp Space "+" Space Exp ; Space ::= [SpaceChar] ; SpaceChar ::= " " | "\n" | "\t" ;
But this really clutters the grammar!
Programming languages are usually designed in such a way that lexical analysis can be done before parsing, and parsing gets tokens as its input.
The lexer splits the code into tokens.
The lexer also classifies each token into a class. For example, the string

  i2=i1+271

results in the following list of classified tokens:

  identifier "i2", symbol "=", identifier "i1", symbol "+", integer "271"
The lexer should read the source code character by character, and send tokens to the parser.
After each token, it should use the next character c to decide what kind of token to read:

- if c is a digit, collect an integer as long as you read digits
- if c is a letter, collect an identifier as long as you read identifier characters (letters, digits, '_')
- if c is a double quote, collect a string literal as long as you read characters other than a double quote
- if c is a space character (i.e. space, newline, or tab), ignore it and read the next character
Longest match: read a token as long as you get characters that can belong to it.
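As an illustration (a Python sketch, not part of the course code; the token classes and the symbol set are assumptions), a hand-written lexer that follows these rules and applies longest match when reading reserved symbols:

```python
def lex(s):
    """Split s into (class, lexeme) pairs, using longest match."""
    symbols = ["++", ">=", ">", "+", "=", ";"]   # longer symbols first
    keywords = {"if", "else", "while"}
    toks, i = [], 0
    while i < len(s):
        c = s[i]
        if c.isspace():                          # ignore spaces/newlines/tabs
            i += 1
        elif c.isdigit():                        # collect digits: integer
            j = i
            while j < len(s) and s[j].isdigit():
                j += 1
            toks.append(("integer", s[i:j])); i = j
        elif c.isalpha():                        # collect identifier characters
            j = i
            while j < len(s) and (s[j].isalnum() or s[j] == "_"):
                j += 1
            word = s[i:j]
            toks.append(("keyword" if word in keywords else "identifier", word))
            i = j
        else:                                    # reserved symbols, longest first
            for sym in symbols:
                if s.startswith(sym, i):
                    toks.append(("symbol", sym)); i += len(sym)
                    break
            else:
                raise ValueError(f"unknown character {c!r}")
    return toks
```

For example, `lex("i2=i1+271")` yields the classified token list above, and because `"++"` is tried before `"+"`, `lex("i++;")` returns `++` as one symbol rather than two.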
Since comments (/* ... */
) are not saved by the lexer, an easy way
to remove them might be in a separate, earlier pass.
However, if we have string literals, we have to spare comment signs in them:
"no comment /* here */ please!"
Thus also comments belong to the lexer:
If c is '/' and the next character is '*', ignore characters until you encounter '*' followed by '/'.
The lexer here needs a 1-character lookahead: it must inspect one character beyond the next one.
When lexing, the lexer knows if it is reading a string, and does not start reading a comment in the middle of this.
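To make this concrete, here is a Python sketch (the function name and the escape handling are illustrative assumptions) of a separate comment-removing pass that respects string literals, using one character of lookahead to detect comment boundaries:

```python
def strip_comments(s):
    """Remove /* ... */ comments, but not inside string literals.
    One character of lookahead (s[i+1]) decides comment start and end."""
    out, i, n = [], 0, len(s)
    while i < n:
        c = s[i]
        if c == '"':                               # string literal: copy verbatim
            j = i + 1
            while j < n and s[j] != '"':
                j += 2 if s[j] == '\\' else 1      # skip escaped characters
            out.append(s[i:j+1]); i = j + 1
        elif c == '/' and i + 1 < n and s[i+1] == '*':
            i += 2                                 # inside a comment
            while i + 1 < n and not (s[i] == '*' and s[i+1] == '/'):
                i += 1
            i += 2                                 # skip the closing */
        else:
            out.append(c); i += 1
    return "".join(out)
```

Comment signs inside string literals survive: `strip_comments('"keep /* this */"')` returns the string unchanged.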
Reserved symbols and words: the terminals in the BNF grammar.
There is a finite number of reserved symbols and words (why?)
Reserved symbols: + > ++ >> >=
etc.
Longest match: if ++
is found, +
is not returned.
This explains a detail of C++ template syntax: space is necessary in
vector<vector<int> >
Reserved words: usually similar to identifiers, longer than special symbols:
if inline while
etc
Book 3.4.1, 3.4.3
A picture will be shown
Transition diagrams can be hand-coded by using case
expressions.
Book 3.4.4 gives a Java example; here is a Haskell one
  import Data.Char (isAlpha, isDigit, isSpace)

  data Tok = TS String | TI Integer | TId String

  lexCmm :: String -> [Tok]
  lexCmm s = case s of
    c:cs   | isSpace c    -> lexCmm cs
    c:cs   | isAlpha c    -> getId s
    c:cs   | isDigit c    -> getInt s
    c:d:cs | isSymb [c,d] -> TS [c,d] : lexCmm cs
    c:cs   | isSymb [c]   -> TS [c]   : lexCmm cs
    _                     -> []  -- covers end of file and unknown characters
   where
    getId s  = lx i : lexCmm cs where (i,cs) = span isIdChar s
    getInt s = TI (read i) : lexCmm cs where (i,cs) = span isDigit s
    isIdChar c = isAlpha c || isDigit c
    lx i = if isReservedWord i then TS i else TId i
    isSymb s = elem s $ words "++ -- == <= >= { } = , ; + * - ( ) < >"
    isReservedWord w = elem w $ words "else if int main printInt return while"
But the code is messy, and hard to read and maintain.
A good way of specifying and documenting the lexer is transition diagrams.
More concisely, we can use regular expressions:
  Tokens = Space (Token Space)*
  Token  = TInt | TId | TKey | TSpec
  TInt   = Digit Digit*
  Digit  = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
  TId    = Letter IdChar*
  Letter = 'A' | ... | 'Z' | 'a' | ... | 'z'
  IdChar = Letter | Digit
  TKey   = 'i' 'f' | 'e' 'l' 's' 'e' | ...
  TSpec  = '+' '+' | '+' | ...
  Space  = (' ' | '\n' | '\t')*
What is more, these expressions can be compiled into a program that performs the lexing! The program implements a transition diagram - more precisely, a finite state automaton.
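For instance (a Python sketch using the standard `re` module; the exact token set is an assumption), the regular expressions above can be turned into a working lexer. Note that Python's `re` tries alternatives in order rather than finding the longest match, so keywords are listed before identifiers and `++` before `+`:

```python
import re

# One named alternative per token class; order resolves overlaps.
TOKEN_RE = re.compile(r"""
    (?P<TKey>   \b(?:if|else|while)\b )
  | (?P<TId>    [A-Za-z][A-Za-z0-9]* )
  | (?P<TInt>   [0-9]+ )
  | (?P<TSpec>  \+\+ | [+*;=()] )
  | (?P<Space>  [ \n\t]+ )
""", re.VERBOSE)

def tokens(s):
    """Scan s from left to right, classifying each token by the
    name of the regular-expression group that matched it."""
    pos, out = 0, []
    while pos < len(s):
        m = TOKEN_RE.match(s, pos)
        if not m:
            raise SyntaxError(f"lexical error at position {pos}")
        if m.lastgroup != "Space":      # spaces only separate tokens
            out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out
```

Real lexer generators compile such expressions into a deterministic automaton, which is both faster and gives true longest match; this sketch only shows the specification-to-program idea.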
name | notation | semantics | verbally
---|---|---|---
symbol | 'a' | {a} | the symbol 'a'
sequence | A B | {xy \| x : A, y : B} | A followed by B
union | A \| B | A U B | A or B
closure | A* | {A^n \| n = 0,1,2,...} | any number of A's
empty | eps | {[]} | the empty string
This is BNFC notation.
The semantics is a regular language, a set of strings of symbols. One often writes L(E) for the language denoted by the expression E.
There are variants of the notation, especially for symbols and empties. Quoting symbols is not so common, since it lengthens the expressions - but it is very good for clarity!
name | notation | definition/semantics | verbally
---|---|---|---
symbol range | ['c'..'t'] | 'c' \| ... \| 't' | any symbol from 'c' to 't'
symbol | char | {x \| x is a character} | any character
letter | letter | ['A'..'Z'] \| ['a'..'z'] | any letter
digit | digit | ['0'..'9'] | any digit
option | A? | A \| eps | optional A
n-sequence | A^n | {x1...xn \| xi : A} | n A's
nonempty closure | A+ | A A* | any positive number of A's
difference | A - B | {x \| x : A & not (x : B)} | the difference of A and B
The symbol range and n-sequence are not supported by BNFC.
"Any character" in BNFC means any character with code 0..255 (extended ASCII).
Define each token type by using regular expressions
Compile the regular expressions into a code for a transition diagram (finite-state automaton).
Compile the resulting code into a binary.
  regexp in BNFC file
        | bnfc program
        v
  Flex/JLex/Alex file
        | flex/jlex/alex program
        v
  C/C++/Java/Haskell file
        | gcc/javac/ghc program
        v
  binary file
+ full control on the program (lookahead)
+ usually smaller program
+ can be quick to create (if the lexer is simple)
+ can be made more powerful than regular expressions
- difficult to get right (if the lexer is complex)
- much extra work to include position information etc.
- may compromise performance (lookahead)
- not self-documenting
- not portable
- lexer generators are state-of-the-art in compilers
Thus the usual trade-offs between hand-made and generated code hold.
In this course, we use generated lexers in the labs.
You can take a look at the web pages, manuals, and bnfc output for
Alex: Haskell
JLex: Java
Flex: C, C++
The BNF Converter web page has links to all these tools.
Reserved word and symbols: terminals in the rules.
Built-in types
Integer Double String Char
Comments
comment "/*" "*/" ; comment "//" ;
Token
token Id (letter (letter | digit | '_')*)
Position token (for better error messages in later phases)
position token Id (letter (letter | digit | '_')*)
type | definition
---|---
Ident | letter (letter \| digit \| '_' \| '\'')*
Integer | digit+
Double | digit+ '.' digit+ ('e' '-'? digit+)?
String | '"' ((char - ["\"\\"]) \| ('\\' ["\"\\nt"]))* '"'
Char | '\'' ((char - ["'\\"]) \| ('\\' ["'\\nt"])) '\''
From token list to syntax tree.
In principle: to a parse tree.
In practice, at the same time: to an abstract syntax tree.
Sometimes even: to type-checked tree, to intermediate code... but we will postpone such semantic actions to syntax-directed translation (next lecture).
Like lexing: use standard tools.
Input to the tool: a BNF grammar (+ semantic actions).
Output of the tool: parser code in target programming language.
Like lexers, it is possible to write a parser by hand - but this is tedious and error-prone. It is also easy to end up with inefficiency and nontermination.
Structure: for each category, write a function that constructs a tree in a way depending on the first token.
Example grammar:
  SIf.    Stm ::= "if" "(" Exp ")" Stm ;
  SWhile. Stm ::= "while" "(" Exp ")" Stm ;
  SExp.   Stm ::= Exp ;
  EInt.   Exp ::= Integer ;
Two functions:
  Stm pStm():
    next == "if"    -> ... build tree with SIf
    next == "while" -> ... build tree with SWhile
    next is integer -> ... build tree with SExp

  Exp pExp():
    next is integer k -> return EInt k
How do we fill the three dots?
Follow the structure of the grammar
  SIf.    Stm ::= "if" "(" Exp ")" Stm ;
  SWhile. Stm ::= "while" "(" Exp ")" Stm ;
  SExp.   Stm ::= Exp ";" ;
We also need an auxiliary function for ignoring expected tokens, and a global variable next holding the next token.

  void ignore(Token t):
    check next == t
    shift next to the token after
There is a parsing function for each category, calling recursively other parsing functions and itself, as guided by the grammar.
  Stm pStm():
    next == "if" ->
      ignore("if")
      ignore("(")
      Exp e = pExp()
      ignore(")")
      Stm s = pStm()
      return SIf(e,s)
    // next == "while" similar
    next is integer ->
      Exp e = pExp()
      ignore(";")
      return SExp(e)
The pseudocode on previous page is directly translatable to both imperative and functional code.
In Java, see Book 2.4.2, 4.4.1
In Haskell, you can use monads or parser combinators.
In either case, this is a quick way to implement small parsers.
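As a sketch of this translation (in Python, with trees encoded as nested tuples; the class layout is an illustrative choice, not the course's reference implementation):

```python
class Parser:
    """Recursive-descent parser for:
         SIf.    Stm ::= "if" "(" Exp ")" Stm
         SWhile. Stm ::= "while" "(" Exp ")" Stm
         SExp.   Stm ::= Exp ";"
         EInt.   Exp ::= Integer
    Trees are built as nested tuples, e.g. ("SIf", exp, stm)."""

    def __init__(self, toks):
        self.toks = toks + ["$"]      # end-of-input marker
        self.pos = 0

    @property
    def next(self):
        return self.toks[self.pos]    # the global 'next' token

    def ignore(self, t):              # check and skip an expected token
        if self.next != t:
            raise SyntaxError(f"expected {t!r}, got {self.next!r}")
        self.pos += 1

    def pStm(self):
        if self.next == "if":
            self.ignore("if"); self.ignore("(")
            e = self.pExp(); self.ignore(")")
            return ("SIf", e, self.pStm())
        if self.next == "while":
            self.ignore("while"); self.ignore("(")
            e = self.pExp(); self.ignore(")")
            return ("SWhile", e, self.pStm())
        e = self.pExp(); self.ignore(";")
        return ("SExp", e)

    def pExp(self):
        if self.next.isdigit():
            e = ("EInt", int(self.next)); self.pos += 1
            return e
        raise SyntaxError(f"expected integer, got {self.next!r}")
```

For example, `Parser(["while","(","1",")","6",";"]).pStm()` builds the tree `("SWhile", ("EInt", 1), ("SExp", ("EInt", 6)))`.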
But it is quite limited, as we shall see.
Example:
  SIf.     Stm ::= "if" "(" Exp ")" Stm
  SIfElse. Stm ::= "if" "(" Exp ")" Stm "else" Stm
which one should we choose when we see "if"?
This situation is called a conflict.
It can be solved by left factoring - sharing the common left part of the rules (Book 4.3.4):
  SIE.   Stm ::= "if" "(" Exp ")" Stm Rest
  RElse. Rest ::= "else" Stm
  REmp.  Rest ::=
To get the original abstract syntax, we need a function that depends on Rest.

  f(SIE exp stm REmp)         = SIf exp stm
  f(SIE exp stm (RElse stm2)) = SIfElse exp stm stm2
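In Python (a sketch with trees encoded as nested tuples, an assumed representation), such a translation function could look like:

```python
def f(tree):
    """Map a left-factored parse tree back to the original abstract syntax:
         SIE exp stm REmp         ~>  SIf exp stm
         SIE exp stm (RElse stm2) ~>  SIfElse exp stm stm2"""
    _, exp, stm, rest = tree          # tree is ("SIE", exp, stm, rest)
    if rest == ("REmp",):
        return ("SIf", exp, stm)
    _, stm2 = rest                    # rest is ("RElse", stm2)
    return ("SIfElse", exp, stm, stm2)
```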
What happens with the grammar

  Exp ::= Exp "+" Integer
  Exp ::= Integer
To build an Exp, the parser first tries to build an Exp, and so on: it never terminates.
Left recursion: the value category is itself the leftmost item in a production.
Solution: translate it away as follows (4.3.3)
  Exp  ::= Integer Rest
  Rest ::= "+" Integer Rest
  Rest ::=
The category Rest now has right recursion, which is harmless.
Again, translation functions are needed to get the original abstract syntax.
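A sketch of both steps in Python (with an assumed tuple encoding of trees): the right-recursive Rest becomes a simple loop, and folding the operands to the left restores the original left-associative abstract syntax:

```python
def pExp(toks):
    """Parse Exp ::= Integer Rest ; Rest ::= "+" Integer Rest | (empty),
    rebuilding the original left-associative EAdd tree on the way."""
    def pInteger(pos):
        return ("EInt", int(toks[pos])), pos + 1

    left, pos = pInteger(0)
    # Rest is right-recursive, so a loop suffices; folding to the left
    # keeps the tree shaped like the original Exp ::= Exp "+" Integer.
    while pos < len(toks) and toks[pos] == "+":
        right, pos = pInteger(pos + 1)
        left = ("EAdd", left, right)
    return left
```

For "1 + 2 + 3" this yields EAdd (EAdd 1 2) 3, i.e. the tree the left-recursive grammar would have produced.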
Notice: left recursion can be implicit (see Book, Algorithm 4.19)
  Exp ::= Val "+" Integer
  Val ::= Exp
Used for guiding recursive descent parsers
Example
  SIf.    Stm ::= "if" "(" Exp ")" Stm ;
  SWhile. Stm ::= "while" "(" Exp ")" Stm ;
  SExp.   Stm ::= Exp ";" ;
  EInt.   Exp ::= Integer ;
- | if | while | Int | ( | ) | ; | $ (END)
---|---|---|---|---|---|---|---
Stm | SIf | SWhile | SExp | - | - | - | -
Exp | - | - | EInt | - | - | - | -
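Such a table can drive a parser directly. A Python sketch (the dictionary encoding and the name `ll_parse` are assumptions) that only checks whether the input is derivable, without building trees:

```python
# LL(1) table as a dictionary: (category, lookahead class) -> production RHS.
# "Int" stands for any integer literal token.
LL_TABLE = {
    ("Stm", "if"):    ["if", "(", "Exp", ")", "Stm"],     # SIf
    ("Stm", "while"): ["while", "(", "Exp", ")", "Stm"],  # SWhile
    ("Stm", "Int"):   ["Exp", ";"],                       # SExp
    ("Exp", "Int"):   ["Int"],                            # EInt
}

def ll_parse(start, toks):
    """Table-driven LL(1) recognition with an explicit stack of
    expected grammar items; returns True iff toks derives from start."""
    stack, toks, pos = [start], toks + ["$"], 0
    while stack:
        item = stack.pop()
        look = "Int" if toks[pos].isdigit() else toks[pos]
        if item in ("Stm", "Exp"):          # nonterminal: consult the table
            rhs = LL_TABLE.get((item, look))
            if rhs is None:                 # empty cell: syntax error
                return False
            stack.extend(reversed(rhs))     # leftmost item ends up on top
        elif item == look:                  # terminal: must match the input
            pos += 1
        else:
            return False
    return toks[pos] == "$"                 # all input must be consumed
```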
LL(1) conflict: a cell contains more than one rule.
Leftmost derivation of while(1) if (0) 6 ;
  Stm
  --> while ( Exp ) Stm
  --> while ( 1 ) Stm
  --> while ( 1 ) if ( Exp ) Stm
  --> while ( 1 ) if ( 0 ) Stm
  --> while ( 1 ) if ( 0 ) Exp ;
  --> while ( 1 ) if ( 0 ) 6 ;
Rightmost derivation of while(1) if (0) 6 ;
  Stm
  --> while ( Exp ) Stm
  --> while ( Exp ) if ( Exp ) Stm
  --> while ( Exp ) if ( Exp ) Exp ;
  --> while ( Exp ) if ( Exp ) 6 ;
  --> while ( Exp ) if ( 0 ) 6 ;
  --> while ( 1 ) if ( 0 ) 6 ;
Top-down: try to "fill" the tree from root to the leaves.
Example: LL(k) - Left-to-right, Leftmost derivation, k token lookahead.
Task: to predict what production to use, after seeing k tokens
Bottom-up: build larger and larger trees by combining leaves.
Example: LR(k) - Left-to-right, Rightmost derivation, k token lookahead.
Task: to decide what to do, after seeing the tokens read so far and k tokens of lookahead.
Book 4.6.3
The parser reads its input, and builds a stack of results. After every symbol, it chooses among four actions:

- shift: read one more token and push it on the stack
- reduce: pop the right-hand side of a rule from the stack, and push its value category instead
- accept: parsing succeeded, return the result
- reject: parsing failed, report a syntax error
State: position in a rule, marked by a dot, e.g.
Stm ::= "if" "(" . Exp ")" Stm
Actions are collected to a table. This table is similar to the transition function of an NFA. It has a column for each possible input symbol.
Grammar
  1. Exp  ::= Exp "+" Exp1
  2. Exp  ::= Exp1
  3. Exp1 ::= Exp1 "*" Integer
  4. Exp1 ::= Integer
Parsing run
stack | input | action
---|---|---
 | 1 + 2 * 3 | shift
1 | + 2 * 3 | reduce 4
Exp1 | + 2 * 3 | reduce 2
Exp | + 2 * 3 | shift
Exp + | 2 * 3 | shift
Exp + 2 | * 3 | reduce 4
Exp + Exp1 | * 3 | shift
Exp + Exp1 * | 3 | shift
Exp + Exp1 * 3 | | reduce 3
Exp + Exp1 | | reduce 1
Exp | | accept
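The run above can be reproduced by a small Python sketch (the name `sr_parse` and the hard-coded reduce conditions are assumptions; a real LR parser derives them from its table). Reduce decisions use one token of lookahead: never reduce towards Exp when the next token is '*', so that '*' binds tighter than '+':

```python
def sr_parse(toks):
    """Shift-reduce recognition for:
         1. Exp  ::= Exp "+" Exp1      2. Exp  ::= Exp1
         3. Exp1 ::= Exp1 "*" Integer  4. Exp1 ::= Integer
    Integer literals are pushed on the stack as "Int"."""
    stack, toks = [], toks + ["$"]
    while True:
        look = toks[0]
        if stack[-3:] == ["Exp1", "*", "Int"]:
            stack[-3:] = ["Exp1"]                      # reduce 3
        elif stack[-1:] == ["Int"]:
            stack[-1:] = ["Exp1"]                      # reduce 4
        elif stack[-3:] == ["Exp", "+", "Exp1"] and look != "*":
            stack[-3:] = ["Exp"]                       # reduce 1
        elif stack == ["Exp1"] and look != "*":
            stack[:] = ["Exp"]                         # reduce 2
        elif look == "$":
            return stack == ["Exp"]                    # accept or reject
        else:
            t = toks.pop(0)
            stack.append("Int" if t.isdigit() else t)  # shift
```

Running `sr_parse(["1","+","2","*","3"])` performs exactly the shift and reduce steps of the trace above.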
LR(0) is the simplest variant, but not very powerful.
LR(1) can produce very large tables.
LR(k) for k > 1 is too large in practice.
SLR (Simple LR): like LR(0), but reduce conditioned on follow.
LALR(1): like LR(1), but merge states with identical items.
Expressivity: LR(0) < SLR < LALR(1) < LR(1).
Standard tools (YACC, Bison, CUP, Happy) use LALR(1).
Simple example (the previous grammar; additional rule 0 for integer literals)
  state  items                              +    *    $    int   Integer  Exp1  Exp
  3      Integer -> L_integ .               r0   r0   r0   -
  4      Exp1 -> Integer .                  r4   r4   r4   -
  5      Exp1 -> Exp1 . '*' Integer         -    s8   -    -
  6      %start_pExp -> Exp .               s9   -    a    -
         Exp -> Exp . '+' Exp1
  7      Exp -> Exp1 .                      r2   s8   r2   -
         Exp1 -> Exp1 . '*' Integer
  8      Exp1 -> Exp1 '*' . Integer         -    -    -    s3    g11
  9      Exp -> Exp '+' . Exp1              -    -    -    s3    g4       g10
  10     Exp -> Exp '+' Exp1 .              r1   s8   r1
         Exp1 -> Exp1 . '*' Integer
  11     Exp1 -> Exp1 '*' Integer .         r3   r3   r3
The parser has several possible actions to take.
Shift-reduce conflict: between shift and reduce action.
Reduce-reduce conflict: between two (or more) reduce actions.
The latter are more harmful, but also easier to eliminate.
Plain ambiguities:
  EVar.  Exp ::= Ident ;
  ECons. Exp ::= Ident ;
Solution: wait until the type checker to distinguish constants from variables.
Implicit ambiguities (Lab1 !):
  SExp.  Stm ::= Exp ;
  SDecl. Stm ::= Decl ;
  DTyp.  Decl ::= Typ ;
  EId.   Exp ::= Ident ;
  TId.   Typ ::= Ident ;
Solution: actually, DTyp is only valid in function parameter lists, not as a statement!
Classic example: dangling else
  SIf.     Stm ::= "if" "(" Exp ")" Stm
  SIfElse. Stm ::= "if" "(" Exp ")" Stm "else" Stm
What happens with this input at position (.)?
if (x > 0) if (y < 8) return y ; . else return x ;
Two possible outcomes:
  shift:  if (x > 0) { if (y < 8) return y ; else return x ; }
  reduce: if (x > 0) { if (y < 8) return y ; } else return x ;
Standard tools always choose shift rather than reduce.
This could be avoided by rewriting the grammar (Book 4.3.2). But usually the conflict is tolerated as a "well-understood" one.
File formats with special syntax
What the files look like:
  Exp :: { Exp }
  Exp : Exp '+' Exp1  { EAdd $1 $3 }
      | Exp1          { $1 }
  L_integ  { PT _ (TI $$) }

  Integer :: { Integer }
  Integer : L_integ  { (read $1) :: Integer }
Shows the LALR(1) table in a human-readable form.
Create an info file:
happy -i ParCPP.y
Check which rules are overshadowed in conflicts:
grep "(reduce" ParConf.info
Interestingly, conflicts tend to concentrate on a few rules. If you have very many, do
grep "(reduce" ParConf.info | sort | uniq
The conflicts are (usually) the same in all tools.
Since the info file contains no Haskell, you can use Happy's info file even if you principally work with another tool.
  State 34
    Const -> Ident .                           (rule 123)
    Const -> Ident . '<' ListType '>'          (rule 124)

    '>>'  reduce using rule 123
    '<'   shift, and enter state 176
          (reduce using rule 123)

  State 176
    Const -> Ident '<' . ListType '>'          (rule 124)

  State 243
    Stm -> 'if' '(' Exp ')' Stm .              (rule 49)
    Stm -> 'if' '(' Exp ')' Stm . 'else' Stm   (rule 50)

    'do'    reduce using rule 49
    'else'  shift, and enter state 249
            (reduce using rule 49)

  State 249
    Stm -> 'if' '(' Exp ')' Stm 'else' . Stm   (rule 50)
Create a debugging Happy parser:
happy -da ParCPP.y
With Bison, you can use gdb (GNU Debugger), which traces the execution back to lines in the Bison source file.