Lab 1: Parser for a fragment of C++

Programming Language Technology, 2018

Summary

The objective of this lab is to write a parser for a fragment of the C++ programming language. The parser should return an abstract syntax tree on success, and report an error with a line number on failure.

Before the lab can be submitted, the parser has to pass some tests, which are given on the course web page via links later in this document.

The recommended implementation of the parser is via a BNF grammar processed by the BNF Converter (BNFC) tool. The grammar should come to approximately 100 rules.

With BNFC, just the grammar (.cf file) has to be written and submitted by the lab groups.

Method

Build the grammar gradually, so that you can parse the six test files in the given order: as your first goal, parse the first test file; then the second, and so on.

When you have passed one level, just try to parse the next example with your current parser and examine the line at which it fails. In this way, you will find out what new grammar rules you need.

After treating the six tests, run the test script to make your parser perfect.

You can use any code generator (Haskell, Java, C,...) from BNFC. It does not have to be the same language as you plan to use in later labs.

Debugging

If you use Haskell, you have the advantage of info files, which tell you exactly where the conflicts are. Assuming your grammar file is CPP.cf and you have called bnfc on it, you can create an info file from the generated Happy file:

  happy -i ParCPP.y

The info file is now in ParCPP.info. Moreover, you can produce a debugging parser by adding the -d flag to Happy. Update your Makefile like this:

  happy -gcad ParCPP.y

And then run make again. Running this parser shows a trace of the LR actions performed during parsing.

You can also get some debugging output when using the Java backend by calling CUP with the -dump_states flag. Edit your Makefile like this:

  CUPFLAGS = -nopositions -expect 100 -dump_states

And then recompile your parser with:

  make -B CPP/sym.class 2> parser.info

The parser.info file will contain some debugging information. Unfortunately this format is harder to read than the info file generated by Happy (above).

Language specification

The specification is presented top-down, starting with the largest units and proceeding to smaller ones. It is advisable to implement the parser in the same order.

The specification differs in some places from the official C++ specification.

To help you get going, we have marked with (n) those rules that are needed to parse the nth test file.

Programs

A program is a sequence of definitions. (1)

A program may also contain comments and preprocessor directives, which are just ignored by the parser (see below). (1)

Definitions

A function definition has a type, a name, an argument list, and a body. (1) Example:
```
    int foo(double x, int y)
    {
      return y + 9 ;
    }
```
Some statements can be used as top-level definitions:
- typedef statements (6)
- variable declarations and initializations
Finally, using definitions, which name a qualified constant, are allowed (2), e.g.
```
    using std::vector ;
```

Argument lists, declarations, and function bodies

An argument list is a comma-separated list of argument declarations. It is enclosed in parentheses: ( and ). (1)
An argument declaration always has a type. This type is optionally followed by an identifier or an identifier and an initialization, and optionally preceded by the specifier const. (4) The following are examples of argument declarations.
```
    int
    int x
    int x = 5
    const int& x
```
Notice that argument declarations with multiple variables (int x, y) are not included. The exception, described below, is a declaration that occurs as a statement, which can have one or more variables.
A function body is either a list of statements enclosed in curly brackets { and } (1), or an empty body consisting of a semicolon (6). Example:
```
    int foo(double x, int y) ;
```

Statements

Any expression followed by a semicolon ; can be used as a statement. (1)
Variable declarations followed by a semicolon. (2) Variable declarations are similar to argument declarations in functions, except that they need at least one variable, and can have more than one variable. Examples:
```
    int x;
    int x = 5, y, z = 3;
    const int x, y = 0;
    const int& x, y;
```
Statements returning an expression (1), for example
```
    return i + 9 ;
```
While loops, with an expression in parentheses followed by a statement. (3) For example:
```
    while (i < 10) ++i ;
```
Do-while loops, with an expression in parentheses after the loop body. (6) For example:
```
    do ++i ; while (i < 10) ;
```
For loops, with a declaration and two expressions in parentheses followed by a statement. (6) For example:
```
    for (int i = 0 ; i != 10 ; ++i) k = k + i ;
```
We do not require support for leaving any of the fields in parentheses empty.
Conditionals: if with an expression in parentheses followed by a statement and optionally by else and a statement. (3) Examples:
```
    if (x > 0) return x ;
  
    if (x > 0) return x ; else return y ;
```
Blocks: any list of statements (including the empty list) between curly brackets. (3) For instance,
```
    {
      int i = 2 ;
      {
      }
      i++ ;
    }
```
Type definitions: a type and a name for it. (3) Example:
```
    typedef vector_string Text ;
```

Note: semicolons are not used after curly brackets, but they are obligatory in all statements and definitions not ending with curly brackets.

Expressions

The following table gives the expressions, their precedence levels, and their associativity. The associativity of operators is given as "left" or "right". (C++ operators do not use "non"-associativity.) For binary operators, in general any of these associativities is meaningful. Prefix operators cannot be left associative, and postfix operators cannot be right associative. The argument in an array index, the arguments in a function call, and the middle expression in the conditional can be expressions of any level, since they are bracketed. Otherwise, some subexpressions have to be one precedence level above the main expression to implement the required associativity.

level	expression forms	assoc	explanation	test
15	literal	---	atomic expressions	(1)
15	`x`, `x::x`, `x::x::x`, ...	---	qualified constants	(1)
14	`e[e]`	left	indexing	(3)
14	`e(e,...,e)`	left	function call	(3)
14	`e.e`, `e->e`	left	structure projection	(3)
14	`e++`, `e--`	left	in/decrement	---
13	`++e`, `--e`, `*e`, `!e`	right	in/decrement, dereference, negation	(6)
12	`e*e`, `e/e`, `e%e`	left	multiplication, division, remainder	(3)
11	`e+e`, `e-e`	left	addition, subtraction	(3)
10	`e<<e`, `e>>e`	left	left and right shift	(1)
9	`e<e`, `e>e`, `e>=e`, `e<=e`	left	comparison	(6)
8	`e==e`, `e!=e`	left	(in)equality	(3)
4	`e&&e`	left	conjunction	(6)
3	`e||e`	left	disjunction	(6)
2	`e=e`, `e+=e`, `e-=e`	right	assignment	(3)
2	`e ? e : e`	right	conditional	(3)
1	`throw e`	right	exception	(4)
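
The "one level above" rule can be tried out in a few lines of C++. The sketch below is a hypothetical hand-written mini-parser, not BNFC output: it assumes single-letter names, no spaces, and only the operators `*`, `/`, `+`, `-` and `=`, and it returns the input with one pair of parentheses per operator. A recursive-descent parser cannot use left recursion directly, so the left-associative levels are written as loops; in an LR grammar the same grouping would come from a left-recursive rule whose right operand sits one level higher.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical mini-parser for illustration only; all names are made up.
struct MiniParser {
    std::string src;
    std::size_t pos = 0;

    bool at(char c) const { return pos < src.size() && src[pos] == c; }

    // Level 15: atoms (a single letter in this sketch).
    std::string atom() {
        if (pos < src.size() && std::isalpha(static_cast<unsigned char>(src[pos])))
            return std::string(1, src[pos++]);
        return "?";  // error handling is omitted
    }

    // Level 12, left associative: the right operand comes from the next
    // higher level present in this sketch (atoms); results fold to the left.
    std::string level12() {
        std::string left = atom();
        while (at('*') || at('/')) {
            char op = src[pos++];
            left = "(" + left + op + atom() + ")";
        }
        return left;
    }

    // Level 11, left associative: the right operand is parsed at level 12.
    std::string level11() {
        std::string left = level12();
        while (at('+') || at('-')) {
            char op = src[pos++];
            left = "(" + left + op + level12() + ")";
        }
        return left;
    }

    // Level 2, right associative: the left operand comes from a higher
    // level, the right operand is parsed at level 2 again.
    std::string level2() {
        std::string left = level11();
        if (at('=')) {
            ++pos;
            return "(" + left + "=" + level2() + ")";
        }
        return left;
    }
};

std::string parenthesize(const std::string& s) {
    MiniParser p;
    p.src = s;
    return p.level2();
}

// Example checks; call this from main() to run them.
void check_grouping() {
    assert(parenthesize("a-b-c") == "((a-b)-c)");  // left associative
    assert(parenthesize("a=b=c") == "(a=(b=c))");  // right associative
    assert(parenthesize("a+b*c") == "(a+(b*c))");  // level 12 binds tighter
}
```

Swapping which operand is parsed at the higher level is all that distinguishes the left-associative levels from the right-associative one.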

Note that this grammar includes expressions that are meaningless. For instance, it permits increment, decrement, and assignment on any expression, although these make sense only on so-called l-values such as variables and array positions. Meaningless expressions can be ruled out in later compilation phases, e.g., in the type-checking phase, and at that point, good error messages can be given (rather than just a parse error).

Indexing _[e] is treated as a postfix operator; the expression e is bracketed and can be of any level. Indexing is associative, i.e., e[e1][e2] is allowed and means (e[e1])[e2]. It is necessarily left associative, since the other bracketing, e([e1][e2]) would be meaningless. Function call is similar to indexing.

The conditional _?e:_ is treated as a binary operator with a fixed "then"-expression e. Right associativity for the conditional has the usual meaning: c1 ? t1 : c2 ? t2 : e2 is parsed as c1 ? t1 : (c2 ? t2 : e2) and not as (c1 ? t1 : c2) ? t2 : e2. Note also that ? and : act like left and right parentheses. Thus, e can be of any level and, for instance, c1 ? c2 ? t1 : e1 : e2, while being confusing, is unambiguous, meaning c1 ? (c2 ? t1 : e1) : e2.
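
A C++ compiler applies the same grouping, so the two readings can be compared directly. The following is a small sketch with made-up function names; the argument values are chosen so that the right-associative and the left-associative reading give different results.

```cpp
// All function names here are hypothetical, for illustration only.

// c1 ? t1 : c2 ? t2 : e2, written without brackets.
constexpr int unbracketed(bool c1, int t1, bool c2, int t2, int e2) {
    return c1 ? t1 : c2 ? t2 : e2;
}

// The right-associative reading: c1 ? t1 : (c2 ? t2 : e2).
constexpr int right_grouped(bool c1, int t1, bool c2, int t2, int e2) {
    return c1 ? t1 : (c2 ? t2 : e2);
}

// The left-associative reading: (c1 ? t1 : c2) ? t2 : e2.
// The inner result is compared with 0 explicitly to use it as a condition.
constexpr int left_grouped(bool c1, int t1, bool c2, int t2, int e2) {
    return ((c1 ? t1 : static_cast<int>(c2)) != 0) ? t2 : e2;
}

// A conditional in the middle position: c1 ? c2 ? t1 : e1 : e2.
constexpr int nested_then(bool c1, bool c2, int t1, int e1, int e2) {
    return c1 ? c2 ? t1 : e1 : e2;
}

static_assert(unbracketed(true, 1, false, 2, 3) == 1, "same as right grouping");
static_assert(right_grouped(true, 1, false, 2, 3) == 1, "right grouping");
static_assert(left_grouped(true, 1, false, 2, 3) == 2, "left grouping differs");
static_assert(nested_then(true, false, 1, 2, 3) == 2, "c1 ? (c2 ? t1 : e1) : e2");
```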

Qualified constants

Qualified constants (1) are identifiers separated by :: such as std::cout. Qualified constants can simply be implemented as nonempty lists separated by ::. The elements of the list are identifiers. Single identifier expressions come out as a special case of these lists.
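
To make the list view concrete, here is a minimal C++ sketch that splits a qualified constant at every ::. The helper name `split_qualified` is hypothetical, and the function does not check that each part is a valid identifier.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split a qualified constant such as "std::cout"
// at every "::". A plain identifier yields a list with one element,
// so the result is never empty.
std::vector<std::string> split_qualified(const std::string& name) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    for (;;) {
        std::size_t sep = name.find("::", start);
        if (sep == std::string::npos) {
            parts.push_back(name.substr(start));  // last (or only) element
            return parts;
        }
        parts.push_back(name.substr(start, sep - start));
        start = sep + 2;  // continue after the separator
    }
}

// Example checks; call this from main() to run them.
void check_split() {
    std::vector<std::string> q = split_qualified("std::cout");
    assert(q.size() == 2 && q[0] == "std" && q[1] == "cout");
    assert(split_qualified("x").size() == 1);  // single identifier
}
```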

Note: in C++, identifiers in qualified constants can be followed by template instantiations of the form

    ident < typelist >

where a typelist is a comma-separated list of types. Thus, possible qualified constants in C++ include

    std::vector<t>::const_iterator
    std::map<int,vector<string>>

However, it is not trivial to integrate this into our LR grammar, because the parser can confuse the initial part std::vector<t with the expression e1<e2 where e1 and e2 are the (qualified) constants std::vector and t. Therefore, we leave template instantiations out of this assignment.

Types

Built-in types: int (1), bool, char, double, void.
Qualified constants, including plain identifiers (1), e.g. string.
Type references. (4) The reference operator & is a postfix operator forming types from types, e.g. int &.
Note that since C++11, double references like int && are also allowed. It is sufficient to support them with spaces between the &s, e.g. int & &. Otherwise, x && y; would be ambiguous: it could be the declaration x & & y; or a conjunction expression as statement.
It is sufficient to support references and double references in the grammar. However, it is also fine to allow arbitrary stacking of the reference operator, as for instance in char & & & &.

Literals

Non-negative integer literals, e.g., 0, 00, 20, 151. (1)
Non-negative floating-point literals, e.g., 01.10, 3.14, 10.0e10, 0.00e-00, 1.1e-80. (3)
Single-quoted character literals, e.g. '0', 'a', '"', '''. (6)
Double-quoted string literals, e.g. "abc", "This,\n is \"insane\"!", "LaTeX's \"\\newcommand\"". (1) A string literal may consist of many double-quoted character sequences which will be concatenated. The parts can be divided over lines (3). E.g.,
```
    "hello " "my little "
    "world"
```
will be parsed as string literal "hello my little world".
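
This matches what a C++ compiler does with adjacent string literals, which can be checked with a short sketch (the function names are made up for the example):

```cpp
#include <cassert>
#include <string>

// Adjacent string literals, even when split over lines, form one literal.
std::string concatenated() {
    return "hello " "my little "
           "world";
}

// Example check; call this from main() to run it.
void check_concatenation() {
    assert(concatenated() == "hello my little world");
}
```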

Identifiers

An identifier is a letter followed by a list of letters, digits, and underscores. (1)
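
As a sketch of this rule, the following C++ function tests whether a string is an identifier in the sense above. The name `is_identifier` is hypothetical, and the sketch assumes that "letter" means an ASCII letter in the default "C" locale. Note that, unlike in full C++, a leading underscore is not accepted by this rule.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: a letter followed by letters, digits and underscores.
bool is_identifier(const std::string& s) {
    if (s.empty() || !std::isalpha(static_cast<unsigned char>(s[0])))
        return false;  // must start with a letter
    for (std::size_t i = 1; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (!std::isalnum(c) && c != '_')
            return false;  // only letters, digits and '_' may follow
    }
    return true;
}

// Example checks; call this from main() to run them.
void check_identifiers() {
    assert(is_identifier("vector_string"));
    assert(is_identifier("x1"));
    assert(!is_identifier("_x"));      // starts with an underscore
    assert(!is_identifier("9lives"));  // starts with a digit
}
```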

Comments

Preprocessor directive: anything from token # to the end of the line. (1)
Line comment: anything from token // to the end of the line. (1)
Block comment: anything between tokens /* and */. Block comments cannot be nested.
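
The three rules can be illustrated with a short C++ sketch that removes the ignored text from a program. The function name `strip_ignored` is hypothetical, and the sketch assumes that the input contains no string or character literals, so that #, // and /* always start ignorable text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: drop preprocessor directives, line comments and
// block comments. Assumes the input has no string or character literals.
std::string strip_ignored(const std::string& src) {
    std::string out;
    std::size_t i = 0;
    while (i < src.size()) {
        if (src[i] == '#' || src.compare(i, 2, "//") == 0) {
            // Skip to the end of the line; the newline itself is kept.
            while (i < src.size() && src[i] != '\n') ++i;
        } else if (src.compare(i, 2, "/*") == 0) {
            // Not nested: the comment ends at the first "*/".
            std::size_t end = src.find("*/", i + 2);
            i = (end == std::string::npos) ? src.size() : end + 2;
        } else {
            out += src[i++];
        }
    }
    return out;
}

// Example checks; call this from main() to run them.
void check_strip() {
    assert(strip_ignored("int x; // note\nint y;") == "int x; \nint y;");
    // The first "*/" closes the comment, so the second one is left over.
    assert(strip_ignored("a /* b /* c */ d */ e") == "a  d */ e");
}
```

The second check shows why an attempt at nesting block comments fails: the leftover `*/` is not part of any comment and reaches the parser.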

Test programs

If you have any problems getting the test program to run, or if you think that there is an error in the test suite, contact the teachers of the course via the mailing list.

These programs are taken from the web page of the book Accelerated C++ (A. Koenig & B. Moo, Addison-Wesley, 2000). To comply with our grammar, we have removed template instantiations.

First: Print "Hello, world!".
Second: Ask the user's name and say hello to him/her.
Third: Compute the grade of a student.
Fourth: A smarter way to compute the grade.
Fifth: Test if a string is a palindrome.
Sixth: Randomly generate English sentences.

Once again: build the grammar gradually so that you can parse these files in this order.

Success criteria

Your grammar must pass the test suite and meet the specification in this document in all respects. The test suite contains the example programs, as well as a number of programs which your parser must reject.

At most 10 shift/reduce conflicts are permitted. At least one is almost unavoidable (because of the "dangling else"). Reduce/reduce conflicts, however, are forbidden.

The solution must be written in an easily readable and maintainable way. In particular, tailoring it for the programs in the test suite is not maintainable!

Submission

Do not submit your solution before it passes the test suite, lest it be returned automatically.
Do not submit any generated files.
Submit your lab using Fire.

Appendix: C++ features

Assuming you know some C but not C++, here is a summary of the extra features of C++ involved in this lab.

Qualified names: s::n, where s is a name space or a class.

using directives: license to use an unqualified name from a name space.

IO streams: cout for output, cin for input. Output is produced by the left shift operator, input is read by the right shift. Example from the second program:

    // read the name
    std::string name;     // define `name'
    std::cin >> name;     // read into `name'
  
    // write a greeting
    std::cout << "Hello, " << name  << "!" << std::endl;

Arguments passed by reference (&), with the possibility of forbidding modification (const), e.g. (from the sixth program)

    gen_aux(const Grammar& g, const string& word, vector_string& ret)