Back to main page of Programming Language Technology.
The objective of this lab is to write an LR parser for a fragment of the C++ programming language. The parser should return an abstract syntax tree at success, and report an error with a line number at failure.
Before the lab can be submitted, the parser has to pass some tests, which are given on the course web page via links later in this document.
The recommended implementation of the parser is via a BNF grammar processed by the BNF Converter (BNFC) tool. The approximate size of the parser source code should be 100 rules.
With BNFC, just the grammar (.cf file) has to be written and submitted by the lab groups.
Build the grammar gradually, so that you can parse the six test files in the given order: as your first goal, parse the first test file; then the second, and so on.
When you have passed one level, just try to parse the next example with your current parser and examine the line at which it fails. In this way, you will find out what new grammar rules you need.
After treating the six tests, run the test suite to make your parser perfect.
You can use any BNFC backend that produces an LR parser. It need not be the language you use in later labs. The LR backends of BNFC are:
--haskell
, using the Happy parser generator.
--java
, using CUP.
--c
and --cpp
, using Bison.
--ocaml
, using ocamlyacc
.
--ocaml-menhir
, using Menhir.
The --java-antlr
backend generates an LL parser via ANTLR, so
please do not use this backend for lab 1 as your grammar might then end up not being LR.
Make sure your system can interpret shell scripts (e.g., Bash) and make files.
For building the parser with BNFC, an installation of a recent version (≥ 2.8) of BNFC is needed. The recommended version is 2.9.3.
For automatic running of the testsuite, you'll need to generate and run Haskell code, requiring:
It is recommended to install these via Stack or all at once via the Haskell platform.
In case you want to build your parser in Java, you need:
javac
) and interpreter (java
),
See also the Software section on the course homepage.
If you use Haskell, you have the advantage of
info files, which tell you exactly where the conflicts are.
Assuming your grammar file is CPP.cf
and you have called bnfc -dm CPP.cf
,
you can create an info file from the generated Happy file either by
make
(if the Makefile
was created by BNFC ≥ 2.8.3) or manually via
happy -i CPP/Par.y
The info file is now in CPP/Par.info
. Moreover, you can produce a
debugging parser by adding the -d
flag to Happy. Update your Makefile like this:
happy -gcaid CPP/Par.y
And then run make
again.
Running this parser shows a trace of the LR actions performed during parsing.
You can also get some debugging output when using the Java backend by calling CUPS with the -dump_states
flag.
Edit your Makefile like this:
CUPFLAGS = -nopositions -expect 100 -dump_states
And then recompile your parser with:
make -B CPP/sym.class 2> parser.info
The parser.info
file will contain some debugging information.
Unfortunately this format is harder to read than the info file generated by Happy (above).
It is suitable to explain and advisable to implement the parser top-down, starting with the largest units and proceeding to smaller ones.
The specification differs in some places from the official C++ specification.
To help you get going, we have marked with a bold (n) those rules that are needed to parse the nth test file.
A program is a sequence of definitions. (1)
A program may also contain comments and preprocessor directives, which are just ignored by the parser (see below). (1)
If you are using BNFC, the category for programs should be named
Program
, to comply with the expectations of the
test suite.
int foo(double x, int y) { return y + 9 ; }
typedef
statements (6)
using std::vector ;
(
and )
. (1)
int int x int x = 5 const int& xNotice that argument declarations with multiple variables (
int x, y
) are not included. Exception see below:
A declaration that occurs
as a statement can also have one or more variables.
{
and }
(1),
or an empty body consisting
of a semicolon (6). Example:
int foo(double x, int y) ;
;
can be used as a statement. (1)
int x; int x = 5, y, z = 3; const int x, y = 0; const int& x = y;
return i + 9 ;
while (i < 10) ++i ;
do ++i ; while (i < 10) ;
for (int i = 0 ; i != 10 ; ++i) k = k + i ;We do not require that any of the fields in parentheses may be empty.
if
with an expression in parentheses followed by
a statement and optionally by else
and a statement. (3) Examples:
if (x > 0) return x ; if (x > 0) return x ; else return y ;
{ int i = 2 ; { } i++ ; }
typedef vector_string Text ; typedef const int my_const_int ;
Note: semicolons are not used after curly brackets, but they are obligatory in all statements and definitions not ending with curly brackets.
The following table gives the expressions, their precedence levels, and their associativity. The associativity of operators is given as "left" or "right". (C++ operators do not use "non"-associativity.) For binary operators, in general any of these associativities is meaningful. Prefix operators cannot be left associative, and postfix operators cannot be right associative. The argument in an array index, the arguments in a function call, and the middle expression in the conditional can be expressions of any level, since they are bracketed. Otherwise, some subexpressions have to be one precedence level above of the main expression to implement the required the associativity.
Note. The precedences are not exactly the same as in the C++ standard, but very similar.
level | expression forms | assoc | explanation | test | |
---|---|---|---|---|---|
15 | literal | --- | atomic expressions | (1) | |
15 | x , x::x , x::x::x , ... |
--- | qualified constants | (1) | |
14 | e[e] |
left | indexing | (3) | |
14 | e(e,...,e) |
left | function call | (3) | |
14 | e.e , e->e |
left | structure projection | (3) | |
14 | e++ , e-- |
left | in/decrement | --- | |
13 | ++e , --e , *e , !e |
right | in/decrement, dereference, negation | (6) | |
12 | e*e , e/e , e%e |
left | multiplication, division, remainder | (3) | |
11 | e+e , e-e |
left | addition, subtraction | (3) | |
10 | e<<e , e>>e |
left | left and right shift | (1) | |
9 | e<e , e>e , e>=e , e<=e |
left | comparison | (6) | |
8 | e==e , e!=e |
left | (in)equality | (3) | |
4 | e&&e |
left | conjunction | (6) | |
3 | e||e |
left | disjunction | (6) | |
2 | e=e , e+=e , e-=e |
right | assignment | (3) | |
2 | e ? e : e |
right | conditional | (3) | |
1 | throw e |
right | exception | (4) |
Note that this grammar includes expressions that are meaningless. For instance, it permits increment, decrement, and assignment on any expression, but this makes only sense on so-called l-values such as variables and array positions. Meaningless expressions can be ruled out in later compilation phases, e.g., in the type-checking phase, and at that point, good error messages can be given (rather than just parse error).
Indexing _[e]
is treated as a postfix operator;
the expression e
is bracketed and can be of any level.
Indexing is associative, i.e., e[e1][e2]
is allowed and means
(e[e1])[e2]
. It is necessarily left associative, since
the other bracketing, e([e1][e2])
would be meaningless.
Function call is similar to indexing.
The conditional _?e:_
is treated as a binary operator
with a fixed "then"-expression e
.
Right associativity for the conditional has the usual meaning:
c1 ? t1 : c2 ? t2 : e2
is parsed as
c1 ? t1 : (c2 ? t2 : e2)
and not as
(c1 ? t1 : c2) ? t2 : e2
.
Note also that ?
and :
act like left and right parenthesis.
Thus, e
can be of any level and, for instance,
c1 ? c2 ? t1 : e1 : e2
, while being confusing, is unambiguous, meaning
c1 ? (c2 ? t1 : e1) : e2
.
Qualified constants (1) are identifiers separated by ::
such as std::cout
.
Qualified constants can simply implemented as nonempty lists separated by ::
.
The elements of the list are identifiers.
Single identifier expressions come out as a special case of these lists.
Note: in C++, identifiers in qualified constants can be followed by template instantiations of the form
ident < typelist >
where a typelist is a comma-separated list of types. Thus, possible qualified constants in C++ include
std::vector<t>::const_iterator std::map<int,vector<string>>
However, it is not trivial to integrate this into our LR grammar,
because the parser can confuse the initial part std::vector<t
with the expression
e1<e2
where e1
and e2
are the (qualified) constants std::vector
and t
.
Therefore, we leave template instantiations out of this assignment.
int
(1), bool
, char
, double
, void
.
string
.
const
(4), e.g. const int
.
&
is a
postfix-operator forming types from types, e.g. int &
.
Note that from C++ 11, double references like int &&
are also allowed.
It is sufficient to support them with spaces between the &
s, e.g. int & &
.
Otherwise, x && y;
would be ambiguous: it could be the declaration x & & y;
or a conjunction expression as statement.
It is sufficient to support references and double references in the grammar. However, it is also fine to allow arbitrary stacking of the reference operator, as for instance in char & & & &
.
0
, 00
, 20
, 151
. (1)
01.10
, 3.14
, 10.0e10
, 0.00e-00
, 1.1e-80
. (3)
'0'
, 'a'
, '"'
, '''
. (6)
"abc"
,
"This,\n is \"insane\"!"
, "LaTeX's \"\\newcommand\""
. (1)
A string literal may consist of many double-quoted character sequences
which will be concatenated. The parts can be divided over lines (3). E.g.,
"hello " "my little " "world"is an alternative way to say "hello my little world".
An identifier is a letter followed by a list of letters, digits, and underscores. (1)
There are three kinds of comments: (1)
/*
and */
.
//
to the end of the line.
#
to the end of the line.
Block comments cannot be nested.
Build the grammar gradually so that you can parse the following test files in the given order.
These programs are taken from the web page of the book Accelerated C++ (A. Koenig & B. Moo, Addison-Wesley, 2000). To comply to our grammar, we have removed template instantiations.
Finally, run your parser through the test suite.
Your grammar must pass the test suite and meet the specification in this document in all respects. The test suite contains the example programs, as well as a number of programs which your parser must reject.
Your grammar may not have any reduce/reduce conflicts. If your grammar has shift/reduce conflicts but the parser works, you may leave these conflict in the grammar as long as you document each of the conflicts and explain why it arises and why it is harmless. The documentation may be placed in suitable comments in the grammar file.
For instance, you may get one shift/reduce conflict from the "dangling
else
". It is fine to leave this conflict in as long as you
explain why the parser still behaves correctly (i.e. why shift is
always the correct action in the situations where a reduce would
also be possible).
The solution must be written in an easily readable and maintainable way. In particular, tailoring it for the programs in the test suite is not maintainable!
Assuming you know some C but not C++, here is a summary of the extra features of C++ involved in this lab.
Qualified names: s::n
, where s
is a name space or a class.
using
directives: license to use an unqualified name from a name space.
IO streams: cout
for output, cin
for input. Output is
produced by the left shift operator, input is read by the right shift.
Example from the second program:
// read the name std::string name; // define `name' std::cin >> name; // read into `name' // write a greeting std::cout << "Hello, " << name << "!" << std::endl;
Arguments passed by reference (&
),
with the possibility of forbidding modification (const
), e.g.
(from the sixth program)
gen_aux(const Grammar& g, const string& word, vector_string& ret)