Lab 1: Parser for a fragment "C--" of C++

Programming Language Technology, 2019

Back to main page of Programming Language Technology.

Summary

The objective of this lab is to write a parser for a fragment "C--" of the C++ programming language. The parser should return an abstract syntax tree at success, and report an error with a line number at failure.

Before the lab can be submitted, the parser has to pass some tests, which are given on the course web page via links later in this document.

The recommended implementation of the parser is via a BNF grammar processed by the BNF Converter (BNFC) tool. The approximate size of the parser source code should be around 50 rules.

With BNFC, just the grammar (.cf file) has to be written and submitted by the lab groups.

Method

Build the grammar gradually, so that you can parse the 3 test files in the given order: as your first goal, parse the first test file; then the second, and finally the third.

After treating the 3 tests, run the test script to make your parser perfect.

You can use any code generator (Haskell, Java, C,...) from BNFC. It need not be the language you use in later labs.

Software requirements

Make sure your system can interpret shell scripts (e.g., Bash) and make files.

For building the parser with BNFC, an installation of a recent version (≥ 2.8) of BNFC is needed.

For automatic running of the testsuite, you'll need to generate and run Haskell code, requiring:

It is recommended to install these all at once via the Haskell platform.

In case you want to build your parser in Java, you need:

See also the Software section on the course homepage.

Debugging

If you use Haskell, you have the advantage of info files, which tell you exactly where the conflicts are. Assuming your grammar file is CMM.cf and you have called bnfc -d CMM.cf, you can create an info file from the generated Happy file:

  happy -i CMM/Par.y

The info file is now in CMM/Par.info. Moreover, you can produce a debugging parser by adding the -d flag to Happy. Update your Makefile like this:

  happy -gcad CMM/Par.y

And then run make again. Running this parser shows a trace of the LR actions performed during parsing.

You can also get some debugging output when using the Java backend by calling CUPS with the -dump_states flag. Edit your Makefile like this:

  CUPFLAGS = -nopositions -expect 100 -dump_states

And then recompile your parser with:

  make -B CMM/sym.class 2> parser.info

The parser.info file will contain some debugging information. Unfortunately this format is harder to read than the info file generated by Happy (above).

Language specification

It is suitable to explain and advisable to implement the parser top-down, starting with the largest units and proceeding to smaller ones.

The specification differs in some places from the official C++ specification.

To help you get going, we have marked with a bold (n) those rules that are needed to parse the nth test file.

Programs

A program is a sequence of function definitions. (1)

A program may also contain comments and preprocessor directives, which are just ignored by our parser (see below). (1)

Remark: An interpretable program must have a function main of type int that takes no arguments. It may or may not have a return statement.

    int main () {
      ...
    }

However, the existence and format of main is not ensured by the parser, but at a later stage (type checker).

Definitions

A function definition has a type, a name, an argument list, and a body. (1) Example:

    int foo(double x, int y)
    {
      return y + 9 ;
    }

Types

A type is one of the four built-in types: (1)

    int
    double
    bool
    void

Argument lists, declarations, and function bodies

An argument list is a comma-separated list of argument declarations. It is enclosed in parentheses ( and ). (1)

An argument declaration has a type and an identifier, for instance

    int x

Notice that argument declarations with multiple variables (int x, y) are not included. A declaration that occurs as a statement (as shown below), can also have more than one variable.

A function body is a list of statements enclosed in curly brackets { and } . (1)

Statements

Almost all statement forms belong to stage 2.

Declaration

Variable declarations have one of the following formats:

Expression as Statement

Any expression followed by a semicolon ; can be used as a statement. (2) Examples:

    i = 3 ;
    printInt(i) ;

Remark: While any expression is allowed, only expressions with some side effect make really sense here. A side effect could be an assignment or a call to a function that makes some input or output action.

Return Statement

Statements returning an expression, (2) for example:

    return i + 9 ;

Block

Any list of statements (including empty list) between curly brackets. (2) For instance,

    {
      int i = 2 ;
      {
        int i = 3 ;
        {
        }
      }
      i++ ;
    }

While Loop

While loops, with an expression in parentheses followed by a statement, (2) for example:

    while (i < 10) ++i ;
    while (condition) { x = x * 2 ; condition = (x < 1024) ; }

Conditional

if with an expression in parentheses followed by a statement, else, and another statement. (The else-branch cannot be omitted.) (2) Example:

    if (x > 0) return x ; else { y = y + 2 ; return y ; }

Expressions

Almost all expression forms belong to stage 3.

The following table gives the expression forms e, their precedence levels, and their associativity. The associativity of operators is given as left, right, or none. For binary operators, in general any of the three associativity is meaningful. For pre-, post-, and mixfix operators, at most one of left or right associativity makes sense, and the alternative is non-associative. As they are bracketed, the arguments in a function call x(e,...,e) can be expressions of any level. Otherwise, some subexpressions have to be one precedence level above of the main expression to implement the required associativity.

Remark. Precedence and associativity departs in some cases from the C++ standard, further, in/decrement and assignment is restricted to identifiers x.

level expression forms assoc explanation
6 literal - literal (2)
6 identifier x - variable (3)
6 x(e,...,e) none function call (3)
6 x++, x-- none post in/decrement (3)
6 ++x, --x none pre in/decrement (3)
5 e*e, e/e left mult, div (3)
4 e+e, e-e left add, sub (3)
3 e<e, e>e, e>=e, e<=e none comparison (3)
3 e==e, e!=e none (in)equality (3)
2 e&&e left conjunction (3)
1 e||e left disjunction (3)
0 x=e right assignment (3)

Literals

The language features integer literals, floating point literals, and the boolean literals, true and false. (2)

Identifiers

An identifier is a letter followed by a list of letters, digits, and underscores. (1)

Comments

There are three kinds of comments: (1)

  1. Block comment: anything between tokens /* and */.
  2. Line comment: anything from token // to the end of the line.
  3. Preprocessor directive: anything from token # to the end of the line.

Block comments cannot be nested.

Test programs

For incremental construction of the grammar, use these example files:

  1. First: Empty function definitions.

  2. Second: Functions with statements, but only literal expressions.

  3. Third: Compute a visualization of the Mandelbrot set as portable gray map (PGM). Uses the full language (definitions, statements, expressions).

Finally, run your parser through the test suite.

Success criteria

Your grammar must pass the test suite and meet the specification in this document in all respects. The test suite contains the example programs, as well as a number of programs which your parser must reject.

The parser must not have any conflicts, neither shift/reduce nor reduce/reduce.

The solution must be written in an easily readable and maintainable way. In particular, tailoring it for the programs in the test suite is not maintainable!

Submission