Back to main page of Programming Language Technology.
The objective of this lab is to write a parser for a fragment "C--" of the C++ programming language. The parser should return an abstract syntax tree at success, and report an error with a line number at failure.
Before the lab can be submitted, the parser has to pass some tests, which are given on the course web page via links later in this document.
The recommended implementation of the parser is via a BNF grammar processed by the BNF Converter (BNFC) tool. The approximate size of the parser source code should be around 50 rules.
With BNFC, just the grammar (.cf file) has to be written and submitted by the lab groups.
Build the grammar gradually, so that you can parse the 3 test files in the given order: as your first goal, parse the first test file; then the second, and finally the third.
After treating the 3 tests, run the test script to make your parser perfect.
You can use any code generator (Haskell, Java, C,...) from BNFC. It need not be the language you use in later labs.
Make sure your system can interpret shell scripts (e.g., Bash) and make files.
For building the parser with BNFC, an installation of a recent version (≥ 2.8) of BNFC is needed.
For automatic running of the testsuite, you'll need to generate and run Haskell code, requiring:
It is recommended to install these all at once via the Haskell platform.
In case you want to build your parser in Java, you need:
javac
) and interpreter (java
), and, for the parser generation,
See also the Software section on the course homepage.
If you use Haskell, you have the advantage of
info files, which tell you exactly where the conflicts are.
Assuming your grammar file is CMM.cf
and you have called bnfc -d CMM.cf
,
you can create an info file from the generated Happy file:
happy -i CMM/Par.y
The info file is now in CMM/Par.info
. Moreover, you can produce a
debugging parser by adding the -d
flag to Happy. Update your Makefile like this:
happy -gcad CMM/Par.y
And then run make
again.
Running this parser shows a trace of the LR actions performed during parsing.
You can also get some debugging output when using the Java backend by calling CUPS with the -dump_states
flag.
Edit your Makefile like this:
CUPFLAGS = -nopositions -expect 100 -dump_states
And then recompile your parser with:
make -B CMM/sym.class 2> parser.info
The parser.info
file will contain some debugging information.
Unfortunately this format is harder to read than the info file generated by Happy (above).
It is suitable to explain and advisable to implement the parser top-down, starting with the largest units and proceeding to smaller ones.
The specification differs in some places from the official C++ specification.
To help you get going, we have marked with a bold (n) those rules that are needed to parse the nth test file.
A program is a sequence of function definitions. (1)
A program may also contain comments and preprocessor directives, which are just ignored by our parser (see below). (1)
Remark:
An interpretable program must have a function main
of type int
that takes no arguments. It may or may not have a return
statement.
int main () { ... }
However, the existence and format of main
is not ensured by the parser, but at a later stage (type checker).
A function definition has a type, a name, an argument list, and a body. (1) Example:
int foo(double x, int y) { return y + 9 ; }
A type is one of the four built-in types: (1)
int double bool void
An argument list is a comma-separated list of argument declarations.
It is enclosed in parentheses (
and )
.
(1)
An argument declaration has a type and an identifier, for instance
int x
Notice that argument declarations with multiple variables
(int x, y
) are not included. A declaration that occurs
as a statement (as shown below), can also have more than one variable.
A function body is a list of statements enclosed
in curly brackets {
and }
.
(1)
Almost all statement forms belong to stage 2.
Variable declarations have one of the following formats:
int i ; double x, y, z ;
int i = 6 ;
Any expression followed by a semicolon ;
can be used as a statement.
(2)
Examples:
i = 3 ; printInt(i) ;
Remark: While any expression is allowed, only expressions with some side effect make really sense here. A side effect could be an assignment or a call to a function that makes some input or output action.
Statements returning an expression, (2) for example:
return i + 9 ;
Any list of statements (including empty list) between curly brackets. (2) For instance,
{ int i = 2 ; { int i = 3 ; { } } i++ ; }
While loops, with an expression in parentheses followed by a statement, (2) for example:
while (i < 10) ++i ; while (condition) { x = x * 2 ; condition = (x < 1024) ; }
if
with an expression in parentheses followed by
a statement, else
, and another statement.
(The else-branch cannot be omitted.)
(2)
Example:
if (x > 0) return x ; else { y = y + 2 ; return y ; }
Almost all expression forms belong to stage 3.
The following table gives the expression forms e
,
their precedence levels, and their associativity.
The associativity of operators is given as left, right, or none.
For binary operators, in general any of the three associativity is meaningful.
For pre-, post-, and mixfix operators, at most one of
left or right associativity makes sense, and the alternative is non-associative.
As they are bracketed,
the arguments in a function call x(e,...,e)
can be expressions of any level.
Otherwise, some subexpressions have to be one precedence level above of
the main expression to implement the required associativity.
Remark. Precedence and associativity departs in some cases from the
C++ standard,
further, in/decrement and assignment is restricted to identifiers x
.
level | expression forms | assoc | explanation | |
---|---|---|---|---|
6 | literal | - | literal (2) | |
6 | identifier x |
- | variable (3) | |
6 | x(e,...,e) |
none | function call (3) | |
6 | x++ , x-- |
none | post in/decrement (3) | |
6 | ++x , --x |
none | pre in/decrement (3) | |
5 | e*e , e/e |
left | mult, div (3) | |
4 | e+e , e-e |
left | add, sub (3) | |
3 | e<e , e>e , e>=e , e<=e |
none | comparison (3) | |
3 | e==e , e!=e |
none | (in)equality (3) | |
2 | e&&e |
left | conjunction (3) | |
1 | e||e |
left | disjunction (3) | |
0 | x=e |
right | assignment (3) |
The language features integer literals, floating point literals, and the
boolean literals, true
and false
.
(2)
An identifier is a letter followed by a list of letters, digits, and underscores. (1)
There are three kinds of comments: (1)
/*
and */
.
//
to the end of the line.
#
to the end of the line.
Block comments cannot be nested.
For incremental construction of the grammar, use these example files:
Finally, run your parser through the test suite.
Your grammar must pass the test suite and meet the specification in this document in all respects. The test suite contains the example programs, as well as a number of programs which your parser must reject.
The parser must not have any conflicts, neither shift/reduce nor reduce/reduce.
The solution must be written in an easily readable and maintainable way. In particular, tailoring it for the programs in the test suite is not maintainable!