Javalette reference

Project summary

This document describes the compiler project that you will do as the main part of the examination for the Compiler Construction course. The project is done individually or in groups of two students (recommended). The project is split into three parts:

  1. Front end for the language Javalette, i.e. lexical analysis, parsing, building abstract syntax, type-checking and a few static checks. This part builds mainly on knowledge that you should have acquired previously, e.g. in the course Programming Language Technology.
  2. A back end that generates code for LLVM (the Low Level Virtual Machine). LLVM is described in more detail later on this page.
  3. Extensions to the base language. There are several optional extensions that you may choose to do for the third part of the project, as detailed below.

Submission deadlines

There are three submission deadlines, one for each part of the project.

In addition to these submissions, the examination includes a brief oral exam after submission C. Exact dates will be posted on the course homepage.

Extensions, credits and grades

The options for extending the project are to extend the source language with, for example, arrays and for loops, structures and pointers, or object-oriented features, or to generate native x86 code. There is also one essay project, which involves studying optimizations in the LLVM framework. You do not need to decide in advance how ambitious you want to be; instead, you should finish each stage before you attempt an extension.

In submission C, each of the seven tasks described in the “extensions” section gives one credit if implemented as described, with the exception of x86 code generation. Implementing a code generator gives two credits, but also requires you to implement some optimization for your code generator, such as register allocation or peephole optimization.

To pass the course and get grade 3 (or G, if you are a GU student), you need to submit working solutions in all submissions, implement at least one language extension in submission C, and pass the oral exam. To get grade 4, you must earn three credits; grade 5 (VG for GU students) requires five credits.

Note that if you are only looking to pass the course with a single credit, the essay project of studying an LLVM optimization is not enough. You must implement at least one language extension to Javalette in order to pass the course.

Part of the goal of a project course like this one is that you deliver working code on time. Thus, credits are awarded only for working extensions submitted before the deadline. We may allow resubmissions for minor bug fixes, but no partial credit is awarded for partial solutions.

Finally, we note that by using a virtual machine such as LLVM as target, rather than a real machine, we make major simplifications in a compiler project. This means that you can produce simple-minded code and rely on the target tools to do optimization and JIT compilation to machine code. The final lectures in the course will discuss these issues, but they will not be covered in depth. On the other hand, for most compilation purposes the road we take is the most effective: fine-tuning of optimization is left to the LLVM tools, allowing many source languages and front ends to profit from this effort.

Collaboration and academic honesty

As mentioned before, you work individually or in groups of two in this project. You must develop your own code, and you are not allowed to share your code with other students, or to get, or even look at, code developed by them. On the other hand, we encourage discussions among participants in the course about the project. As long as you follow the simple and absolute rule not to share code, we have no objections to questions asked and answered at a conceptual level.

If you do get significant help from some other participant, it is natural to acknowledge this in your documentation file.

Don’t be a cheater.

The language Javalette

Javalette is a simple imperative language. It is almost a subset of C (see below). It can also be easily translated to Java (see below).

Javalette is not a realistic language for production use. However, it is big enough to allow for a core compiler project that illustrates all phases in compilation. It also forms a basis for extensions in several directions.

The basic language has no heap-allocated data. However, the extensions involve (Java-like) arrays, structures and objects, all of which are allocated on the heap. The extended language is designed to be garbage-collected, but you will not implement garbage collection as part of your project.

The description in this document is intentionally a bit vague and based on examples; it is part of your task to define the language precisely. However, the language is also partly defined by a collection of test programs (see below), on which the behaviour of your compiler is specified.

Some small Javalette programs

Let’s start with a couple of small programs. First, here is how to say hello to the world:

// Hello world program

int main () {
  printString("Hello world!") ;
  return 0 ;
}

A program that prints the even numbers smaller than 10 is

int main () {
  int i = 0 ;
  while (i < 10) {
    if (i % 2 == 0) printInt(i) ;
    i++ ;
  }
  return 0 ;
}

Finally, we show the factorial function in both iterative and recursive style:

int main () {
  printInt(fact(7)) ;
  printInt(factr(7)) ;
  return 0 ;
}

// iterative factorial

int fact (int n) {
  int i,r ;
  i = 1 ;
  r = 1 ;
  while (i <= n) {
    r = r * i ;
    i++ ;
  }
  return r ;
}

// recursive factorial

int factr (int n) {
  if (n < 2)
    return 1 ;
  else
    return n * factr(n-1) ;
}

Program structure

A Javalette program is a sequence of function definitions.

A function definition has a return type, a name, a parameter list, and a body consisting of a block.

The names of the functions defined in a program must be different (i.e., there is no overloading).

One function must have the name main. Its return type must be int and its parameter list empty. Execution of a program consists of executing main.

A function whose return type is not void must return a value of its return type. The compiler must check that execution of the function cannot terminate without passing a return statement. This check may be conservative, i.e., it may reject certain functions that actually would always return a value. A typical case is to reject a function ending with an if-statement in which only one branch returns, without considering that the test expression might always select the returning branch. However, your check must decide the control flow correctly when the test expression is the literal true or the literal false. A function whose return type is void may, on the other hand, omit the return statement completely.
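For example, a conservative checker may reject the first function below, since the implicit else branch does not return, but it must accept the second one, where the test expression is the literal true:

int f (int x) {
  if (x > 0)
    return 1 ;
}

int g () {
  if (true)
    return 1 ;
}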

Functions can be mutually recursive, i.e., call each other. There is no prescribed order between function definitions (i.e., a call to a function may appear in the program before the function definition).

There are no modules or other separate compilation facilities; we consider only one-file programs.

Types

Basic Javalette types are int, double, boolean and void. Values of types int, double and boolean are denoted by literals (see below). void has no values and no literals.

No coercions (casts) are performed between types. Note this: it is NOT considered an improvement to your compiler to add implicit casts. In fact, some of the test programs check that you do not allow casts.

In the type checker, it is useful to have a notion of a function type, which is a pair consisting of the value type and the list of parameter types.
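For instance, if you implement your compiler in Haskell, a minimal sketch of such a representation might look as follows (the names are illustrative, not prescribed):

-- The basic Javalette types.
data Type = Int | Doub | Bool | Void
  deriving (Eq, Show)

-- A function type: the value (return) type paired with the parameter types.
data FunType = FunType Type [Type]
  deriving (Eq, Show)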

Statements

The forms of statements in Javalette are given in the provided grammar file. We indicate syntax using BNFC notation, where we use Ident, Exp and Stmt to indicate a variable, expression and statement, respectively; terminals are given within quotes. For simplicity, we sometimes deviate here from the actual grammar file.

Declarations may appear anywhere within a block, but a variable must be declared before it is used.

A variable declared in an outer scope may be redeclared in a block; the new declaration then shadows the previous declaration for the rest of the block.

A variable can only be declared once in a block.

If no initial value is given in a variable declaration, the value of the variable is initialized to 0 for type int, 0.0 for type double and false for type boolean. Note that this is different from Java, where local variables must be explicitly initialized.
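For example, after the declarations below, x, d and b hold the values 0, 0.0 and false, respectively:

int x ;
double d ;
boolean b ;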

Expressions

Expressions in Javalette have the forms given in the provided grammar file; they include literals, variables, function calls, and applications of arithmetic, comparison and boolean operators.

Lexical details

Some of the tokens in Javalette are identifiers, reserved words, and integer, double and boolean literals.

Comments in Javalette are enclosed between /* and */ or extend from // to the end of line, or from # to the end of line (to treat C preprocessor directives as comments).
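All three comment forms are illustrated below:

/* a comment
   spanning several lines */
// a comment extending to the end of the line
# another comment extending to the end of the line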

Primitive functions

For input and output, Javalette programs may use the following functions:

void printInt (int n)
void printDouble (double x)
void printString (String s)
int readInt ()
double readDouble ()

Note that there are no variables of type string in Javalette, so the only argument that can be given to printString is a string literal.

The print functions print their arguments terminated by newline and the read functions will only read one number per line. This is obviously rudimentary, but enough for our purposes.

These functions are not directly implemented in the virtual machines we use. We will provide them using other means, as detailed below.

Parameter passing

All parameters are passed by value, i.e., the value of the actual parameter is computed and copied into the formal parameter before the subroutine is executed. Parameters act as local variables within the subroutine, i.e., they can be assigned to.
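As an illustration of call-by-value (the function name is hypothetical):

void tryToModify (int n) {
  n = n + 1 ;        // changes only the local copy
}

int main () {
  int x = 5 ;
  tryToModify(x) ;
  printInt(x) ;      // prints 5: the actual parameter is unchanged
  return 0 ;
}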

Javalette, C and Java

Javalette programs can be compiled by a C compiler (gcc) if prefixed by suitable preprocessor directives and macro definitions, e.g.

#include <stdio.h>
#define printInt(k) printf("%d\n", k)
#define boolean int
#define true 1

In addition, function definitions must be reordered so that each definition precedes its uses, mutual recursion must be resolved by adding function prototypes, and variable declarations must be moved to the beginnings of blocks.

Javalette programs can be compiled by a Java compiler (javac) by wrapping all functions in a class as public static methods and adding one more main method that calls your main:

public static void main (String[] args) {
  main();
}

Using a C or Java compiler is a good way to understand what a program means even before you have written the full compiler. It can be useful to compare the output of programs produced by your compiler with that of the same programs compiled by the C and/or Java compiler.

The front end

Your first task is to implement a compiler front end for Javalette:

  1. Define suitable data types/classes for representing Javalette abstract syntax.
  2. Implement a lexer and parser that builds abstract syntax from strings.
  3. Implement a type checker that checks that programs are type-correct.
  4. Implement a main program that calls lexer, parser and type checker, and reports errors.

These tasks are very well understood; there is a well-developed theory and, for steps 1 and 2, convenient tools exist that do most of the work. You should be familiar with these theories and tools and we expect you to complete the front end during the first week of the course.

We recommend that you use the BNF converter and either Alex and Happy (if you decide to implement your compiler in Haskell) or JLex and Cup (if you use Java). We may also allow other implementation languages and tools, but we can not guarantee support, and you must discuss your choice with Magnus before you start. This is to make sure that we will be able to run your compiler and that you will not use inferior tools.

We provide a BNFC source file Javalette.cf that you may use. If you already have a BNFC file for a similar language that you want to reuse you may do so, but you must make sure that you modify it to pass the test suite for this course.

We will accept a small number of shift/reduce conflicts in your parser; your documentation must describe these and argue that they are harmless. Reduce/reduce conflicts are not allowed. The provided BNFC file has the standard dangling-else shift/reduce conflict.

One thing to note is that it may be useful to implement the type checker as a function that traverses the syntax tree and returns it unchanged if the program is type-correct. The reason is that you may later want to modify this function to decorate the syntax tree with more information during type checking, for use by the code generator. One example of such decoration is to annotate all subexpressions with type information; this will be useful during code generation.

To do this, you can add one further form of expression to your BNFC source, namely a type-annotated expression.
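In BNFC this can be done with an internal rule, which adds a constructor to the abstract syntax without affecting the parser. A sketch, assuming your grammar uses the categories Expr and Type (the label name is up to you):

internal ETyped. Expr ::= "(" Expr ":" Type ")" ;

The type checker can then wrap every subexpression in an ETyped node, and the code generator can read the type back off the tree.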

Extensions

This section describes optional extensions that you may implement to learn more, get credits and thus a higher final grade. You may choose different combinations of the extensions.

In this section we specify the requirements on the extensions. Some implementation hints are given in the section on extension hints below and in the lecture notes.

One-dimensional arrays and for loops

The basic Javalette language has no heap-allocated data, so memory management consists only of managing the run-time stack. In this extension you will add one-dimensional arrays to basic Javalette. To get the credit, you must implement this in the front end and in the respective back end.

Arrays are Java-like: variables of array type contain a reference to the actual array, which is allocated on the heap. Arrays are explicitly created using a new construct and variables of array type have an attribute, length, which is accessed using dot notation.

Some examples of array declarations in the extension are

int[] a ;
double[] b;

Creating an array may or may not be combined with the declaration:

a = new int[20];
int[] c = new int[30];

After the above code, a.length evaluates to 20 and a refers to an array of 20 integer values, indexed from 0 to 19 (indexing always starts at 0). It is not required to generate bounds-checking code.

Functions may have arrays as arguments and return arrays as results:

int[] sum (int[] a, int[] b) {
  int[] res = new int [a.length];
  int i = 0;
  while (i < a.length) {
    res[i] = a[i] + b[i];
    i++;
  }
  return res;
}

Array parameters are passed by value, that is, the reference is copied into the parameter.

One new form of expressions is added, namely indexing, as shown in the example. Indexed expressions may also occur as L-values, i.e., as left hand sides of assignment statements. An array can be filled with values by assigning each individual element, as in function sum. But one can also assign references as in C or Java:

c = a;

The extension also includes implementation of a simple form of foreach-loop to iterate over arrays. If expr is an expression of type t[], the following is a new form of statement:

for (t var : expr) stmt

The variable var of type t assumes the values expr[0], expr[1] and so on and the stmt is executed for each value. The scope of var is just stmt.

This form of loop is very convenient when you want to iterate over an array and access the elements, but it is not useful when you need to assign values to the elements. For this, we still have to rely on the while loop. The traditional for-loop would be attractive here, but we cannot implement everything.
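For example, using the new loop, summing the elements of an integer array could be written as follows:

int sum (int[] a) {
  int total = 0;
  for (int x : a)
    total = total + x;
  return total;
}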

Test files for this extension are in subdirectory extensions/arrays1.

Multidimensional arrays

In this extension you add arrays with an arbitrary number of indices. Just as in Java, an array of type int[][] is a one-dimensional array, each of whose elements is a one-dimensional array of integers. Declaration, creation and indexing is as expected:

int[][] matrix = new int[10][20];
int[][][] pixels;
...
matrix[i][j] =  2 * matrix[i][j];

You must specify the number of elements in each dimension when creating an array. For a two-dimensional rectangular array such as matrix, the number of elements in the two dimensions are matrix.length and matrix[0].length, respectively.
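As an illustration, summing all elements of a rectangular matrix could look like this:

int sumMatrix (int[][] m) {
  int total = 0;
  int i = 0;
  while (i < m.length) {
    int j = 0;
    while (j < m[i].length) {
      total = total + m[i][j];
      j++;
    }
    i++;
  }
  return total;
}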

Dynamic data structures

In this extension you will implement a simple form of dynamic data structures, which is enough to implement lists and trees. The source language extensions are the following, all illustrated by the example below:

  • Declarations of structs, and typedefs introducing names for pointer types to structs.
  • Creation of structs on the heap using new.
  • Reading and writing struct fields through pointers, using the -> operator.
  • The null pointer, written with a cast to a pointer type as in (list)null, and pointer comparisons using == and !=.

Here is an example of a complete program in the extended language:

typedef struct Node *list;

struct Node {
  int elem;
  list next;
};


int main () {
  printInt (length (fromTo (1, 100)));
  return 0;
}

list cons (int x, list xs) {
  list n;
  n = new Node;
  n->elem = x;
  n->next = xs;
  return n;
}

list fromTo (int m, int n) {
  if (m>n)
    return (list)null;
  else
    return cons (m, fromTo (m + 1, n));
}

int length (list xs) {
  int res = 0;
  while (xs != (list)null) {
    res++;
    xs = xs->next;
  }
  return res;
}

This and a few other test programs can be found in the extensions/pointers subdirectory of the test suite.

Object-orientation

This extension adds classes and objects to basic Javalette. From a language design point of view, it is not clear that you would want both this and the previous extension in the same language, but here we disregard this.

Here is a first simple program in the proposed extension:

class Counter {
  int val;

  void incr () {
    val++;
    return;
  }

  int value () {
    return val;
  }
}

int main () {
  Counter c;
  c = new Counter;
  c.incr ();
  c.incr ();
  c.incr ();
  int x = c.value ();
  printInt (x);
  return 0;
}

We define a class Counter, and in main create an object and call its methods a couple of times. The program writes 3 to stdout.

The source language extensions, from basic Javalette, are:

  • Class declarations, containing field declarations and method definitions.
  • Object creation using new, as in new Counter.
  • Method calls using dot notation, as in c.value().

Method override is not allowed in this extension; that restriction is lifted in the next extension.

Object orientation with dynamic dispatch

The restriction not to allow method override is of course severe. In this extension the restriction is removed, and subclassing with inheritance and method override is implemented. This requires a major change of implementation compared to the previous extension: it is no longer possible to decide statically which code to run when a message is sent to an object. Thus, each object at runtime must have a link to a class descriptor, a struct with pointers to the code of the methods of the class. These class descriptors are linked together in a list, where a class descriptor has a link to the descriptor of its superclass. This list is searched at runtime for the proper method to execute. All this is discussed more during the lectures.
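As a small illustration of the extension (assuming a Java-like extends syntax for subclassing; the concrete syntax is part of your design):

class A {
  int tell () {
    return 1;
  }
}

class B extends A {
  int tell () {          // overrides A.tell
    return 2;
  }
}

int main () {
  A a = new B;
  printInt (a.tell ());  // dynamic dispatch: prints 2
  return 0;
}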

Native x86 code generation

This extension is to produce native assembler code for a real machine, preferably x86. We may accept code generators for other architectures, but you need to think about how we can test your extension. Before you attempt to write a back end for another architecture, discuss your choice with Magnus and explain the testing procedure.

Note that this extension gives you two credits, but it is not enough to just implement a naïve code generator. You must also implement some sort of optimization, such as register allocation or peephole optimization. Talk to Magnus about which optimization(s) to implement before attempting the x86 code generator. The x86 code generation extension also acts as a kind of multiplier: implementing another extension, for example arrays, will give you two credits instead of one. This is fair because you need to generate code for both LLVM and x86.

Study of LLVM optimization

We offer one possibility to get a credit that does not involve implementing a Javalette extension: a more thorough study of the LLVM framework, written up in a report of 4-5 pages. More precisely, the task is as follows.

Look at the list of available optimization passes and choose at least three of these for further study. Mail Magnus to agree that your choice is suitable (do this before you start to work on the extension!).

For each pass, your report must explain what the pass does and demonstrate its effect on example programs.

We emphasize again that if you are only looking to pass the course with a single credit, this project is not enough. You have to implement at least one language extension to Javalette in order to pass the course.

Further possibilities

We are willing to give credits also to other extensions, which are not as well defined. If you want to do one of these and get credit, you must discuss it with Magnus in advance. Here are some possibilities:

Testing the project

Needless to say, you should test your project extensively. We provide a test suite of programs and will run your compiler on these. You can download the test suite from the course web site and run it locally or on a Chalmers machine (e.g., remote11 or a Linux lab machine). The test suite contains both correct programs (in subdirectory testsuite/good) and illegal programs (in subdirectory testsuite/bad). For the good programs the correct output is provided in files with suffix .output. The bad programs contain examples of lexical, syntactic and type errors.

Therefore, as soon as you have produced the parser you should write a main program and try to parse all the test programs. The same holds for the type checker, and so on. When you only have the parser, you will of course accept some bad programs, namely those that are syntactically correct but contain type errors.

Summarizing, your compiler must:

  • accept and compile all programs in testsuite/good, reporting OK on stderr, and
  • reject all programs in testsuite/bad, reporting ERROR on stderr.

Furthermore, for correct programs, your compiled programs must run and give correct output.

Automated testing

Before submission you must run the test driver to verify that your compiler behaves correctly. Our first action when we receive your submission is to run these tests. If this run fails, we will reject your submission without further checks, so you must make sure that this step works. Unfortunately, we cannot supply a working test driver for the Windows platform. If you have a Windows machine, you may do most of the development, including manual testing, on that machine, but for final testing you should transfer your project to our lab (or remote) machines and run the test driver.

The test driver runs each good program and compares its output with the corresponding .output file. If the program needs input, this is taken from the .input file. Note that the test driver handles this; your generated code should read from stdin and write to stdout.

The tests are of course not exhaustive. It is quite possible that the grader will discover bugs in your code even if it passes all tests.

The tester is provided as a gzipped tar ball, which can be downloaded from the course web site. You can use it to run the tests for your project. This archive contains a test driver Grade.hs with supporting files, and a subdirectory testsuite containing Javalette test programs.

Installation

The tester requires a Linux (or Mac OS X, or other Unix) environment and a recent version of the Haskell Platform. If you work on your own Windows machine, we cannot assist you in making the tester work. You should anyhow download the tester to get access to the test suite in directory testsuite. Before submitting, you must upload your project to a lab machine and verify the submission.

Running the tests

Assume that your submission directory is dir and that your compiler is called jlc. Assume also that dir/lib contains the runtime support file (runtime.bc for submission B, and possibly runtime.o for submission C).

The test driver takes a number of options:

Flag Effect
-s <name> Name of your compiler binary
-b <backend> Backend to use. One of LLVM, x86, x86_64 and custom
-l <ver> LLVM version to use with the LLVM backend (only LLVM-3.8 has guaranteed support).
-x <ext> Implemented extensions. May be given multiple times.
-t <dir> Directory in which the Javalette test programs are found.
-k Keep any temporary directories created by the tester.
-h Print detailed help and usage instructions.

In addition, it takes one mandatory argument: a directory or tarball in which to find your submission.

Thus, if you have placed your submission directory, dir, in the directory containing Grade.hs, you can test your compiler as follows:

runhaskell Grade.hs dir

The above command will compile all the basic Javalette programs. The tester will not attempt to run the good programs, so this is suitable for testing your compiler for submission A. Note that it is essential that your compiler writes one line to stderr, containing OK for correct programs and ERROR for incorrect programs.

To also run the good programs and test them for submission B:

runhaskell Grade.hs -b LLVM dir

The test driver will report its activities in compiling the test programs and running the good ones. If your compiler is correct, output will end as follows:

Summary:
 0 Compiling core programs (101/101)
 0 Running core programs (35/35)

Credits total: 0

All 101 test programs were compiled and gave correct indication OK or ERROR to stderr. The 35 correct programs were run and gave correct output. Note that the actual numbers can be different, since we may add tests.

To test the extensions for submission C, run the test suite with the -x flag for each extension you have implemented. The following command will run the tests for the two array extensions:

runghc Grade.hs -b LLVM -x arrays1 -x arrays2 dir

The following extensions are supported by the test suite: arrays1, arrays2, pointers, objects1, objects2. If you have implemented the x86 code generation extension, use -b x86 (or -b x86_64) to run all tests using that instead of the LLVM backend. If your x86 compiler has a different name than jlc, don’t forget to specify it using -s <compiler>.

Code generation: LLVM

LLVM (Low Level Virtual Machine) is both an intermediate representation language and a compiler infrastructure, i.e. a collection of software components for manipulating (e.g. optimizing) LLVM code, and back ends for various architectures. LLVM has a large user base and is actively developed. A lot of information and code to download can be found at the LLVM web site http://www.llvm.org. You must use the LLVM-3.8 version in this course; the test suite only has guaranteed support for this particular version.

LLVM code comes in two formats: a human-readable assembler format (stored in .ll files) and a binary bitcode format (stored in .bc files). Your compiler will produce the assembler format, and you will use the LLVM assembler llvm-as to produce binary files for execution.

In addition to the assembler, the LLVM infrastructure consists of a large number of tools for optimizing, linking, JIT-compiling and manipulating bitcode. One consequence is that a compiler writer may produce very simple-minded LLVM code and leave it to the LLVM tools to improve the code when needed. Of course, similar remarks apply to JVM code.

LLVM code

The LLVM virtual machine is a register machine, with an infinite supply of typed, virtual registers. The LLVM intermediate language is a version of three-address code with arithmetic instructions that take operands from two registers and place the result in a third register. LLVM code must be in SSA (static single assignment) form, i.e. each virtual register may only be assigned once in the program text.

The LLVM language is typed, and all instructions contain type information. This “high-level” information, together with the “low-level” nature of the virtual machine, gives LLVM a distinctive flavour.

The LLVM web site provides a wealth of information, including language references, tutorials, tool manuals etc. There will also be lectures focusing on code generation for LLVM.

The structure of an LLVM file

There is little overhead in an LLVM file. But, since the language is typed, we must inform the tools of the types of the primitive functions:

declare void @printInt(i32)
declare void @printDouble(double)
declare void @printString(i8*)
declare i32 @readInt()
declare double @readDouble()

Here i32 is the type of 32 bit integers and i8* is the type of a pointer to an 8 bit integer (i.e., to a character). Note that function names in LLVM always start with @.

Before running a compiled Javalette program, myfile.bc must be linked with runtime.bc, a file implementing the primitive functions, which we will provide. In fact, this file is produced by giving clang a simple C file with definitions such as

void printInt(int x) {
  printf("%d\n",x);
}
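A typical invocation to produce the bitcode file from such a C file (here assumed to be named runtime.c) is:

> clang -c -emit-llvm runtime.c -o runtime.bc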

An example

The following LLVM code demonstrates some of the language features of LLVM. It also serves as an example of the kind of code a Javalette compiler could generate for the fact function described above.

define i32 @main() {
entry:  %t0 = call i32 @fact(i32 7)             ; function call
        call void @printInt(i32 %t0)
        ret  i32 0

}

define i32 @fact(i32 %__p__n) {
entry:  %n = alloca i32                         ; allocate a variable on stack
        store i32 %__p__n , i32* %n             ; store parameter
        %i = alloca i32
        %r = alloca i32
        store i32 1 , i32* %i                   ; store initial values
        store i32 1 , i32* %r
        br label %lab0                          ; branch to lab0

lab0:   %t0 = load i32, i32* %i                 ; load i
        %t1 = load i32, i32* %n                 ; and n
        %t2 = icmp sle i32 %t0 , %t1            ; boolean %t2 will hold i <= n
        br i1 %t2 , label %lab1 , label %lab2   ; branch depending on %t2

lab1:   %t3 = load i32, i32* %r
        %t4 = load i32, i32* %i
        %t5 = mul i32 %t3 , %t4                 ; compute i * r
        store i32 %t5 , i32* %r                 ; store product
        %t6 = load i32, i32* %i                 ; fetch i,
        %t7 = add i32 %t6 , 1                   ; add 1
        store i32 %t7 , i32* %i                 ; and store
        br label %lab0

lab2:   %t8 = load i32, i32* %r
        ret  i32 %t8

}

We note several things:

  • Function names start with @, while local names (virtual registers and labels) start with %.
  • Every instruction is explicitly typed; for example, icmp sle i32 compares two 32-bit integers and yields a boolean of type i1.
  • Local variables are allocated on the stack with alloca and accessed with load and store; this way, each virtual register is assigned only once, as SSA form requires.
  • Control flow is expressed with labels and branch instructions (br), both conditional and unconditional.
  • Comments start with ; and extend to the end of the line.

LLVM tools

Your compiler will generate a text file with LLVM code, which is conventionally stored in files with suffix .ll. There are then several tools you might use:

  • llvm-as, the assembler, which translates a .ll file to a binary bitcode (.bc) file.
  • llvm-dis, the disassembler, which translates a bitcode file back to the textual format.
  • opt, the optimizer, which runs optimization passes over bitcode.
  • llvm-link, the linker, which links several bitcode files into one.
  • lli, the just-in-time compiler, which executes a bitcode file directly.
  • llc, the static compiler, which translates bitcode to native assembly code.

Note that some installations of LLVM require a version number after the tool name, for example llvm-as-3.8 instead of llvm-as.

Here are the steps you can use to produce an executable file from within your compiler:

  1. Write the generated LLVM code to a file, say myfile.ll.
  2. Assemble it with llvm-as to obtain myfile.bc.
  3. Link it with the runtime using llvm-link myfile.bc lib/runtime.bc -o a.out.bc.
  4. The resulting a.out.bc can be executed with the just-in-time compiler lli or compiled to native code with llc.

Optimizations

To whet your appetite, let us see how the LLVM code can be optimized:

> cat myfile.ll | llvm-as | opt -std-compile-opts | llvm-dis
declare void @printInt(i32)

define i32 @main() {
entry:
	tail call void @printInt(i32 5040)
	ret i32 0
}

define i32 @fact(i32 %__p__n) nounwind readnone {
entry:
	%t23 = icmp slt i32 %__p__n, 1
	br i1 %t23, label %lab2, label %lab1

lab1:
	%indvar = phi i32 [ 0, %entry ], [ %i.01, %lab1 ]
	%r.02 = phi i32 [ 1, %entry ], [ %t5, %lab1 ]
	%i.01 = add i32 %indvar, 1
	%t5 = mul i32 %r.02, %i.01
	%t7 = add i32 %indvar, 2
	%t2 = icmp sgt i32 %t7, %__p__n
	br i1 %t2, label %lab2, label %lab1

lab2:
	%r.0.lcssa = phi i32 [ 1, %entry ], [ %t5, %lab1 ]
	ret i32 %r.0.lcssa
}

The first line above is the Unix command to do the optimization. We cat the LLVM assembly code file and pipe it through the assembler, the optimizer and the disassembler. The result is an optimized file, where we observe:

  • The call fact(7) has been evaluated at compile time; main now simply prints the constant 5040.
  • In fact, the stack-allocated variables are gone entirely: the loop keeps its state in virtual registers, and phi instructions select the correct value depending on the predecessor block.

If we save the optimized code in myfileOpt.bc (without disassembling it), we can link it together with the runtime using:

> llvm-link myfileOpt.bc runtime.bc -o a.out.bc

If we disassemble the resulting file a.out.bc, we get (we have edited the file slightly in inessential ways):

@fstr = internal constant [4 x i8] c"%d\0A\00"

define i32 @main() nounwind {
entry:
        %t0 = getelementptr [4 x i8], [4 x i8]* @fstr, i32 0, i32 0
        %t1 = call i32 (i8*, ...) @printf(i8* %t0, i32 5040) nounwind
        ret i32 0
}

declare i32 @printf(i8*, ...) nounwind

What remains is a definition of the format string @fstr as a global constant (\0A is \n), the getelementptr instruction that returns a pointer to the beginning of the format string, and a call to printf with that pointer and the computed value. Note that the call to printInt has been inlined, i.e., replaced by a call to printf; linking thus includes optimizations across files.

We can now run a.out.bc using the just-in-time compiler lli. Or, if we prefer, we can produce native assembly code with llc. On an x86 machine, this gives

        .text
        .align  4,0x90
        .globl  _main
_main:
        subl    $12, %esp
        movl    $5040, 4(%esp)
        movl    $_fstr, (%esp)
        call    _printf
        xorl    %eax, %eax
        addl    $12, %esp
        ret
        .cstring
_fstr:                          ## fstr
        .asciz  "%d\n"

Hints for the extensions

One-dimensional arrays

To implement this extension, the expression new int[e] will need to allocate memory on the heap for the array itself and for the length attribute. Further, the array elements must be accessed by indexing.

LLVM provides support for built-in arrays, but these are not automatically heap-allocated. Instead, explicit pointers must be used. Thus, an array will have the LLVM type {i32, [0 x t]}*, where t is the LLVM type of the elements. The first i32 component holds the length; the second the array elements themselves. The number of elements in the array is here indicated to be 0; it is thus your responsibility to make sure to allocate enough memory. For memory allocation you should use the C function calloc, which initializes allocated memory to 0. You must add a type declaration for calloc, but you do not need to worry about it at link time; LLVM's linker includes stdlib.
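A sketch of the declaration and an allocation, assuming 32-bit sizes and integer elements (adjust the types to your design):

declare i8* @calloc(i32, i32)

; %n holds the number of elements; allocate n + 1 4-byte cells,
; one extra for the length field
%size = add i32 %n, 1
%raw  = call i8* @calloc(i32 %size, i32 4)
%arr  = bitcast i8* %raw to {i32, [0 x i32]}*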

Indexing uses the getelementptr instruction, which is discussed in detail in the lectures.

LLVM does not include a runtime system with garbage collection. Thus, this extension should really include some means for reclaiming heap memory that is no longer needed. The simplest would be to add a statement form free(a), where a is an array variable. This would be straightforward to implement, but is not necessary to get the credit.

More challenging would be to add automatic garbage collection. LLVM offers some support for this. If you are interested in doing this, we are willing to give further credits for that task.

Multidimensional arrays

This extension involves more work than the previous one. In particular, you must understand the getelementptr instruction fully, and you must generate code to iteratively allocate heap memory for subarrays.

Structures and object-orientation

Techniques to do these extensions are discussed in the lectures.

From an implementation point of view, we recommend that you start with the extension with pointers and structures. You can then reuse much of that machinery to implement the first OO extension. In fact, one attractive way to implement the object extension is by a source-to-source translation to Javalette with pointers and structures.

The full OO extension requires more sophisticated techniques, to properly deal with dynamic dispatch.

Native code generation

The starting point for this extension could be your LLVM code, but you could also start directly from the abstract syntax. Of course, within the scope of this course you will not be able to produce a code generator that can compete with llc, but it may nevertheless be rewarding to build this final piece of the compiler yourself.

One major addition here is to handle function calls properly. Unlike LLVM or the Java virtual machine (JVM), which provide some support for function calls, a real machine leaves you to handle all the machinery of activation records, calling conventions, and jumping to the proper code before and after the call.

There are several assemblers for x86 available, and even different syntax versions. We recommend that you use the NASM assembler and that you read Paul Carter's PC assembly tutorial before you start the project, unless you are already familiar with the x86 architecture. Beyond the one optimization required for the extension, we do not have strong requirements on code quality; straightforward code generation is acceptable. In particular, you do not need to implement register allocation, although skipping it has serious negative consequences for the performance of your code. Indeed, a preferable way to get native code is to use a framework like LLVM, which provides an extensive infrastructure for code manipulation.

An introduction to x86 assembler will be given in the lectures.

Submission format

  1. You prepare your submission by creating a new empty directory, subsequently called the root directory. In this directory you create three subdirectories: doc, lib and src.
  2. The root directory must, after building as below, contain the executable compiler jlc. The compiler is used as follows:
    • For submissions B and C, the command jlc myFile.jl produces the executable file a.out. Your compiler will generate a file myFile.ll, which is stored in the same directory as the .jl source file. Then your compiler will call llvm-as and llvm-link to assemble and link.
    • Optionally, if you have chosen to implement a native code generator, the executable is jlc_ARCH, where ARCH is a CPU architecture such as x86. The command jlc_ARCH myFile.jl should produce an assembly file myFile.s and an executable file a.out. The compiler may be a shell script that calls files in src, or a symbolic link.
  3. The subdirectory src must contain all necessary source code. In this directory, one should be able to build your compiler using make. Thus the directory contains:
    • The BNF Converter source file, if the tool has been used.
    • Alex/Lex/JLex and Happy/Yacc/Cup source files.
    • Program modules: abstract syntax, lexer, parser, type checker, code generator, top level program.
    • A Makefile for building the compiler from source. Note that the Makefile may also be located in the root directory. The Makefile should at least have these targets:
      • A default target (the one that is run when the command make is issued). This target should compile all source files in the compiler, and any runtime library files. It does not need to regenerate any source files generated by BNFC or by any of the parser and lexer generators. After running make in the source directory, the compiler in the root directory should work without further actions from the user.
      • A clean target that removes all files produced during building.

    Note that src should not contain obsolete files, such as backup files, bitcode files or other unnecessary files.

  4. In the lib directory you place the following files, as needed:
    • For submission B, you place the file runtime.ll here, and have the Makefile generate from it runtime.bc, an LLVM bitcode file with the same functions. The file runtime.ll can be downloaded from the course website.
    • If you choose to do a native code generator for submission C, you place a file runtime.c here and have the Makefile generate the corresponding runtime.o.
  5. The doc directory must contain one file in html, ps, plain ascii, or pdf format (proprietary formats not allowed), with the following content:
    • An explanation of how the compiler is used (what options, what output, etc)
    • A specification of the Javalette language (if produced by BNF converter, you may just refer to your BNFC source file).
    • A list of shift/reduce conficts in your parser, if you have such conflicts, and an analysis of them.
    • For submission C, an explicit list of extensions implemented.
    • If applicable, a list of features not implemented and the reason why.
  6. If your compiler jlc is a shell script, you should also place this script in the root directory before building the tar ball.
  7. When you have prepared everything, you create a compressed tar ball:

    > tar -czf partA-1.tar.gz doc lib src

    This produces the file partA-1.tar.gz that you upload to Fire. We suggest the naming scheme partX-Y.tar.gz where X is A, B or C and Y is your version number (Y=1 the first time you submit, and if your submission is rejected and you must resubmit, the next has Y=2, etc).

    If you prefer, you may compress with bzip2 instead of gzip.

Testing your submission

Your submission must be structured as specified above. We suggest that, after having prepared your tarball, you run

> runghc Grade.hs /tmp/partA-1.tar.gz

The grading program will extract the contents and build your compiler before running the test suite. This is how we test your submission, so you can check that building succeeds before you submit. Don't forget to run the tester with the appropriate backend and extension flags for submissions B and C.
