Lab 2: Type checker and interpreter for C--

Programming Language Technology, 2019

Back to main page of Programming Language Technology.

Summary

The objective of this lab is to write a type checker and an interpreter for the fragment "C--" of the C++ programming language introduced in Laboration 1. The type checker should check the program and send it to the interpreter at success. The interpreter should run the program and correctly perform all its input and output actions. At type checking failure, a type error should be reported.

Before the lab can be submitted, the program has to pass some tests, which are given on the course web page via links later in this document.

The recommended implementation is via the Laboration 1 BNF grammar processed by the BNF Converter (BNFC) tool. The syntax tree created by the parser should then be processed further by a program using the skeleton generated by BNFC.

The type checker and the interpreter code will be roughly 300-800 lines together, depending on the programming language used.

All BNFC supported languages can be used, but guidance is guaranteed only for Haskell and Java.

The type system and the interpreter are partially characterized by formal rules in the PLT book, chapters 4 and 5.

Method

In the type checker, the recommended procedure is two passes:

build a symbol table with all function types, including the built-in functions;
type check the code by using this symbol table. (Note that there is no need to type annotate the tree in this lab. Annotations will become relevant first in the next lab.)

In the interpreter, do a similar thing:

build a symbol table that provides for every function its source code syntax tree; the built-in functions can be left out and treated separately in the rule for evaluating function calls;
interpret the program by evaluating the expression main().

The type checker

Types

Only the four built-in types

    int
    double
    bool
    void

are taken into account. Every expression has one of these types.

Subtyping

A value of type int can be implicitely coerced (cast) to a value of type double. This means that whenever an expression of type double is expected, an expression of type int is also acceptable.

Function types

Types of functions in the symbol table can be represented in any way that stores their argument and return types. For instance, the function header

    int f (double x, bool b)

will create a symbol table entry (not actual code)

    f ↦ ([double, bool], int)

that maps the name f to a pair whose first component is the list of argument types and the second component is the return type.

Programs

A program is a sequence of function definitions.

A program may also contain comments and preprocessor directives, which are just ignored by the parser (see below).

An interpretable program must have a function main of type int that takes no arguments. It may or may not have a return statement.

    int main () {
      ...
    }

Definitions

A function definition has a type, a name, an argument list, and a body. Example:

    int foo(double x, int y)
    {
      return y + 9 ;
    }

The language has four built-in functions dealing with input and output:

    void   printInt(int x)        // print an integer and a newline in standard output
    void   printDouble(double x)  // print a double and a newline in standard output
    int    readInt()              // read an integer from standard input
    double readDouble()           // read a double from standard input

(These functions are implemented in file prelude.cc. You can #include it in a test case if you want run the test through a C compiler.)

Typing rules

The same function name may be used in at most one function definition.

All return statements in a function body must return an expression whose type is the return type of the function.

You don't need to check that there actually is a return statement.

Argument lists, declarations, and function bodies

An argument list is a comma-separated list of argument declarations. It is enclosed in parentheses ( and ).

An argument declaration has a type and an identifier, for instance

    int x

Notice that argument declarations with multiple variables (int x, y) are not included. A declaration that occurs as a statement (as shown below), can also have more than one variable.

A function body is a list of statements enclosed in curly brackets { and } .

Typing rules

An argument list may not declare the same variable more than once. The type of a variable cannot be void. I.e., the following definition has two errors in the argument list.

  int error (int x, void y, bool x) { ... }

Statements

Declaration

Variable declarations have one of the following formats:

type and one or more variables
```
    int i ;
    double x, y, z ;
  
```
type and a single initialized variable
```
    int i = 6 ;
```

Typing. The initializing expression must have the declared type in the extended context. The type cannot be void. Note: Due to subtyping, a double variable may also be initialized with an int expression.

Expression as Statement

Any expression followed by a semicolon ; can be used as a statement.

Remark: While any expression is allowed, only expressions with some side effect make really sense here. A side effect could be an assignment or a call to a function that makes some input or output action.

Return Statement

Statements returning an expression, for example

    return i + 9 ;

Typing. The type of the returned expression must be the same as or a subtype of the return type of the function in which it occurs.

Block

Any list of statements (including empty list) between curly brackets. For instance,

    {
      int i = 2 ;
      int j = 5 ;
      {
        int i = 3 ;  // shadows the previous variable i
        {
        }
        i = i + j;   // the inner i becomes 8
      }
      i++ ;          // the outer i becomes 3
    }

Typing rules

A block introduces a new scope. A variable may only be declared once in a scope, i.e., on the same block level. The parameters of a function live in the same scope as the top-level block in the body. For example, the following function contains both violations.

    int scope_violation (int x) {
      int x = 1;    // already in scope as parameter
      {
        int y = 2;
        int y;      // already declared in this block
      }
      return x;
    }

While Loop

While loops, with an expression in parentheses followed by a statement, for example:

    while (i < 10) ++i ;

Typing. The expression must have type bool. Regardless whether the statement is a block or not, it lives in a new scope. For example, in

    while (i++ < 10) int i = 0 ;

the declaration int i = 0 creates a new variable i rather than setting the value of the existing variable i to 0. Thus, it needs to behave just as:

    while (i++ < 10) { int i = 0 ; }

Conditional

if with an expression in parentheses followed by a statement (the first branch), else, and another statement (the second branch). Example:

    if (x > 0) return x ; else return y ;

Typing. The expression must have type bool. Similar as for while, the two branches of an if statement live in new scopes.

Expressions

The following table gives the expression forms e, their precedence levels, and their associativity. The associativity of operators is given as left, right, or none. For binary operators, in general any of the three associativity is meaningful. For pre-, post-, and mixfix operators, at most one of left or right associativity makes sense, and the alternative is non-associative. As they are bracketed, the arguments in a function call x(e,...,e) can be expressions of any level. Otherwise, some subexpressions have to be one precedence level above of the main expression to implement the required associativity.

Remark. Precedence and associativity departs in some cases from the C++ standard, further, in/decrement and assignment is restricted to identifiers x.

level	expression forms	assoc	explanation	type
6	literal	-	literal	literal type
6	identifier `x`	-	variable	declared type
6	`x(e,...,e)`	none	function call	return type
6	`x++`, `x--`	none	in/decrement	operand type (int or double)
6	`++x`, `--x`	none	in/decrement	operand type (int or double)
5	`e*e`, `e/e`	left	mult, div	operand type (int or double)
4	`e+e`, `e-e`	left	add, sub	operand type (int or double)
3	`e<e`, `e>e`, `e>=e`, `e<=e`	none	comparison	bool
3	`e==e`, `e!=e`	none	(in)equality	bool
2	`e&&e`	left	conjunction	bool
1	`e\|\|e`	left	disjunction	bool
0	`x=e`	right	assignment	type of both sides

Typing rules

Integer, double, and boolean literals have their usual types.

Variables have the type declared in the nearest enclosing block. A variable must be declared before it is used in an expression.

The arguments of a function call must have types corresponding to the argument types of the called function. The number of arguments must be the same as in the function declaration (thus the C++ default argument rule is not applied). Notice that only identifiers are used as functions.

Increments and decrements only apply to int and double variables.

The arithmetical and comparison operations apply to two operands of the same type, which is int or double. (However, there ─as everywhere!─ can be implicit coercions from int to double.)

Equality and inequality apply to two operands of the same type, which is int, double, or bool.

Conjunction and disjunction apply to operands of type bool only.

In assignments, both sides must have the same type, and this is then also the type of the assignment expression. Notice that only variables are used as left hand sides.

Literals

The language features integer literals, floating point literals, and the boolean literals, true and false.

Identifiers

An identifier is a letter followed by a list of letters, digits, and underscores.

Comments

There are three kinds of comments:

Block comment: anything between tokens /* and */.
Line comment: anything from token // to the end of the line.
Preprocessor directive: anything from token # to the end of the line.

Block comments cannot be nested.

The interpreter

The top-level interpreter is run by evaluating the expression main(). The return value is ignored.

Values

There are four types of values:

integer values, e.g. -47,
double values, e.g. 3.14159,
boolean values, true and false,
a void value, which need never be shown.

A value is the result of evaluating an expression in an environment. Because of subtyping, integer values need sometimes be coerced to double values.

Environments

The purpose of an environment is to bind variables to their values. Our environment is structured into a stack of environment blocks. Whenever we enter a new scope, we push a new block on the stack which we remove again when we leave the scope.

For illustration, consider the evaluation of these statements.

      int i = 2 ;      // create a new variable i in current scope with value 2
      int j = 5 ;
      {                // push a new environment block
        int i = 3 ;    // create a new variable i in this block with value 3
        i = i + j ;    // update this variable
        printInt (i) ; // print its value
      }                // remove the top block
      printInt (i) ;   // print the (unchanged) value 2 of i from the current scope

To implement evaluation of recursive functions, we need to save and restore whole environments. For instance:

  int factorial (int n) {
    int r;
    if (n < 2) return 1;
    else {
      r = factorial (n - 1);
      return r * n;
    }
  }

The evaluation of factorial(2) will bind n to 2, and then call factorial(1), which has to bind n to 1. It returns 1 which is bound to r but then we need the old value of n to correctly compute r * n. Thus, we need a stack of environments to preserve the old bindings and access them again after we return from a function call.

Programs

A program is a sequence of function definitions. Each function has a parameter list and a body, which is a sequence of statements.

The evaluation of a function call starts by evaluating the arguments and building a new environment with a single block where the received values are assigned to the argument variables (a.k.a. parameters) of the function.

The statements in the body are then executed in the order defined by their textual order as altered by while loops and if conditions.

The function returns a value, which is obtained from the return statement. After encountering a return, all the following statements in the function body should be ignored (not evaluated). If the return type is void, no return statement is required.

After the function has returned, the environment of the caller is restored.

Statements

Atomic statements

A declaration, e.g.

    int i ;

adds a variable to the current environment. Its value is initialized if and only if the declaration includes an initializing expression, e.g.:

    int i = 9 ;

An expression statement, e.g.

    i++ ;

is evaluated, and its value is ignored.

A return statement is executed by evaluating its expression argument. The value is returned to the caller of the function, and no more statements in the function body are executed.

Blocks

A block of statements, e.g.

    {
      int i = 3 ;
      i++ ;
    }

is interpreted in an environment where a new block is pushed on the environment stack at entrance, and popped at exit.

While-loops

A while statement, e.g.

    while (i < 10) {
      i++ ;
      j-- ;
    }

is interpreted so that the condition expression is first evaluated. If the value is true, the body is interpreted in the resulting environment, and the while statement is executed again. If the value is false, the body is not interpreted.

The body of a while statements needs to be interpreted in a fresh context block even if it is just a single statement. Example:

    while (i > 0) int i = 0;

The i declared by the body is not identical with the i in the condition. Thus, if the condition i > 0 holds initially, this should be an infinite loop.

Conditionals

An if-else statement, e.g.

    if (x < 10) i++ ; else printInt(j) ;

is interpreted so that the condition expression is first evaluated. If the value is true, the then-branch (statement before else) is interpreted. If the value is false, the else-branch (statement after else) is interpreted.

Similar to while the branches of the if statement are fresh scopes and need to evaluated with new environment blocks. For example:

    int i = 0;
    if (i > 0) i++; else int i = 1;
    printInt(i);

This snipped should print 0 and not 1, since the variable declared in the else-branch is a fresh i not the i declared above.

Expressions

The interpretation of an expression, also called evaluation, returns a value whose type is determined by the type of the expression.

A literal, e.g.

    123
    3.14
    true

is not evaluated further but just converted to the corresponding value.

A variable, e.g.

is evaluated by looking up its value in the innermost context where it occurs. Concretely, the environment blocks are searched for a binding for x, where the last added block is searched first. If the variable is not in the environment, or has no value there, the interpreter terminates with an error message

    uninitialized variable x

Note that this error should be raised by the following snippet

    int x = 1;
    {
      int x;
      printInt(x);
    }

since the initialized x is shadowed by an uninitialized x, and even by

    int x = 1 + x;

since the variable x is already declared but still undefined at the point when the initializing expression 1 + x will be evaluated.

A function call, e.g.

    foo (x+y, i<0, i++)

is interpreted by first evaluating its arguments from left to right. The environment is then looked up to find out how the function is interpreted on the resulting values.

Calls of the four built-in functions can be hard-coded as special cases in the expression evaluation code.

A postincrement,

i++

has the same value as its body initially has (here i). The value of the variable i is then incremented by 1. i-- correspondingly decrements i by 1. If i is of type double, 1.0 is used instead.

A preincrement,

++i

has the same value as i plus 1. This incremented value replaces the old value of i. The decrement and double variants are analogous.

The arithmetic operations addition, subtraction, multiplication, and division,

    a + b
    a - b
    a * b
    a / b

are interpreted by evaluating their operands from left to right. The resulting values are then added, subtracted, etc., by using appropriate operations of the implementation language. We are not picky about the precision chosen, but suggest for simplicity that int should be int and double should be double.

Comparisons and equality tests,

    a <  b
    a >  b
    a >= b
    a <= b
    a == b
    a != b

are treated similarly to the arithmetic operations, using comparisons of the implementation language. The returned value is a boolean.

Note that arithmetic and comparison operations require conversion of int values to double values if one argument is an int and the other a double value.

Conjunction,

    a && b

is evaluated lazily: first a is evaluated. If the result is true, also b is evaluated, and the value of b is returned. However, if a evaluates to false, then false is returned without evaluating b.

Disjunction,

    a || b

is also evaluated lazily: first a is evaluated. If the result is false, also b is evaluated, and the value of b is returned. However, if a evaluates to true, then true is returned without evaluating b.

Assignment,

    x = a

is evaluated by first evaluating a. The resulting value is returned, but also the environment is changed by assigning this value to the innermost occurrence of x.

Lab format

Solution templates

For Haskell and Java there are stubs that can be extended to the full solution. Just download one of the following tar archives, unpack, write the code, test, pack the archive again, and submit to fire.

These packages stubs for the grammar, type checker and interpreter, and suitable makefiles. If you start from these stubs, you will likely match the requirements for the solution format as detailed in the following:

Input and output

The interpreter must be a program called lab2, which is executed by the command

    lab2 <SourceFile>

and prints its output to the standard output. The output at success must be just the output defined by the interpreter.

The output at failure is an INTERPRETER ERROR, a TYPE ERROR or, a SYNTAX ERROR, depending on the phase at which the error occurs. These error messages should also give some useful explanation, which we leave to your imagination.

The input can be read not only from user typing on the terminal, but also from standard input redirected from a file or by echo. For instance,

    ./lab2 fibonacci.cc < test-input
    echo 20 | ./lab2 fibonacci.cc

Compiling the interpreter

The interpreter is submitted as archive of source files that can be compiled by typing make.

If you want to write the interpreter in another language, the procedure is the same: send a tar package and make sure the interpreter can be compiled in a normal Unix enviroment by typing make.

Test programs

Run the testsuite before submitting the lab.

Example of success

Source file good.cc:

  int main ()
  {
    int i = readInt() ; //5
  
    printInt(i) ;   //5
    printInt(i++) ; //5
    printInt(i) ;   //6
    printInt(++i) ; //7
    printInt(i) ;   //7
  
  }

Running the interpreter:

    % echo 3 | ./lab2 good.cc
    3
    3
    4
    5
    5

Examples of failure

Type error

Source file bad.cc:

  int f (double c)
  {
    int n = 1 ;
    while(c) ++n ;
  }

Running the type checker

    % lab2 bad.cc
    TYPE ERROR
  
    condition c in while: expected bool, found double

Interpreter error

Source file bad.cc:

  int main ()
  {
    int i ;
  
    printInt(i) ;
    printInt(i++) ;
    printInt(i) ;
    printInt(++i) ;
    printInt(i) ;
  }

Running the interpreter

    % lab2 bad.cc
    INTERPRETER ERROR
  
    uninitialized variable i

Thus it is assumed that the type checker does not detect uninitialized variables.

Success criteria

Your interpreter must pass the test suite and meet the specification in this document in all respects. The test suite contains both good and bad programs. The good programs must be executed correctly, whereas the bad ones must fail in appropriate ways in either the type checker or the interpreter.

The solution must be written in an easily readable and maintainable way. In particular, tailoring it for the programs in the test suite is not maintainable!

Submission

Do not submit your solution before it passes the test suite, lest it be returned automatically.
Please include exactly all the files that are required for building your solution, including a Makefile. To verify this you can follow the following 7 steps (on Windows, the tools used here (bash, mktemp, tar, make) are provided by both Cygwin and MSYS).
1. Create an empty folder
```
  $ TMPDIR=`mktemp -d`
```
2. Copy your solution to the created folder
```
  $ cp lab2.tar.gz $TMPDIR/
```
3. Change to the created folder
```
  $ cd $TMPDIR
```
4. Unpack your solution
```
  $ tar --strip-components=1 -xzf lab2.tar.gz
```
5. Build your solution
```
  $ make -B
```
6. Make sure step 5 succeeds without errors.
7. Make sure your solution has been built
```
  $ ./lab2
```
Do not submit any generated files.
Submit your lab using Fire.