Lecture 9: Code Generation
Programming Languages Course
Aarne Ranta (aarne@chalmers.se)

%!target:html
%!postproc(html): #NEW
%!postproc(html): #HR
%!postproc(html): #sub1 1
%!postproc(html): #subn1 n-1
%!postproc(html): #subn n

Book: 6.2, 6.4, 6.6, 7.1, 7.4, 7.5, 7.6

#NEW

==Plan==

The Semantic Gap

Compilation schemes

JVM code generation

Stack and heap memory, garbage collection

Back-end optimization

Native code generation for Intel x86

#NEW

==The goals of the lecture==

JVM: to learn the basic techniques of compilation
- even in practice: [Exercise 5 ../exercises/06-exx.html]

Native code: to get an idea of how programs are made to work on "bare silicon"
- in practice: the Compiler Construction course (Period 4)

#NEW

==The compiler is a translator==

Reminder: a compiler **translates** the code into another format - **source code** into **target code**.

Example: C++ source code.
```
int i = 1 ;
int j = i + 2 ;
printInt(i + j) ;
```
Translated to JVM target code:
```
bipush 1
dup
istore 0
bipush 2
iadd
istore 1
iload 0
iload 1
iadd
invokestatic runtime/iprint(I)V
```

#NEW

==Semantic gap==

Different kinds of constructions: **semantic gap** between source and target code.

|| high-level code   | machine code             ||
|  statement         | instructions             |
|  expression        | instructions             |
|  variable          | memory address, register |
|  value             | bit vector               |
|  type              | memory layout            |
|  control structure | jump                     |
|  function          | subroutine               |
|  tree structure    | linear structure         |

Typically, one statement/expression translates to many instructions.
```
x + 3   ==>   iload 0
              bipush 3
              iadd
```

#NEW

==Compilation schemes==

Syntax-directed translation rules that generate target code from source code
```
// integer addition
compile (exp1 + exp2) =
  compile exp1
  compile exp2
  emit iadd

// integer multiplication
compile (exp1 * exp2) =
  compile exp1
  compile exp2
  emit imul

// integer literal, one signed byte
compile i =
  emit (bipush i)
```
The rules go recursively to the subtrees, and call ``emit`` to generate new instructions.

But how do we know that we should call ``iadd`` and not ``dadd`` (double addition)?

#NEW

==Type annotations==

Code generation must know the types of arithmetic operations, variables, etc.

The simplest way to guarantee this is to use an annotating type inference:
```
Exp annotExp (Env env, Exp exp) =
  typ := inferExp env exp
  return typed(typ, exp)
```
Then we can generate code as follows:
```
// addition
compile typed(typ, exp1 + exp2) =
  compile exp1
  compile exp2
  if typ == int
    emit iadd
  else
    emit dadd
```
An alternative is to compute the types of expressions in the code generator - but this duplicates the work done in the type checker.

#NEW

==Machine code constructs that depend on types==

- Arithmetic operations: ``iadd, imul, isub, idiv`` vs. ``dadd, dmul, dsub, ddiv``.
- Numeric constants: ``bipush 7`` vs. ``ldc2_w 7.0``.
- Load and store instructions: ``iload, istore`` vs. ``dload, dstore``.
- Memory usage: integers take one slot on the stack, doubles take two.
- Comparisons, return statements, ...

We will simplify the presentation by only considering integers. Booleans are treated similarly to integers.

#NEW

==Variable addresses==

The compiler needs an environment to keep track of the addresses of variables, which are needed in load and store instructions.

The addresses are integers starting from ``0``. Each declaration reserves the next available address.

The environment has a block structure, because variables declared in different blocks are stored at different addresses.
```
int i ;     // i -> 0
int j ;     // j -> 1
{
  int k ;   // k -> 2
  int j ;   // j -> 3
}
int m ;     // m -> 2
```
Exit from a block frees the addresses reserved in the block.
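For concreteness, here is a minimal sketch in Haskell of one possible representation of such an environment. The names and the representation are made up for illustration, and only ``int`` variables (one slot each) are handled; ``exitBlock`` plays the role of the ``discard`` operation used in the schemes of the next section.
```
type Address = Int

-- A stack of blocks (innermost block first), each mapping variables to
-- addresses, together with the next free address.
data Env = Env { blocks :: [[(String, Address)]], nextAddr :: Address }

emptyEnv :: Env
emptyEnv = Env [[]] 0

-- Each declaration reserves the next free address (one slot per int variable).
addVar :: String -> Env -> Env
addVar x (Env (b : bs) a) = Env (((x, a) : b) : bs) (a + 1)

-- Innermost declarations are found first, so inner variables shadow outer ones.
lookupVar :: String -> Env -> Maybe Address
lookupVar x (Env bs _) = lookup x (concat bs)

newBlock :: Env -> Env
newBlock (Env bs a) = Env ([] : bs) a

-- Exiting a block discards its variables and frees their addresses.
exitBlock :: Env -> Env
exitBlock (Env (b : bs) a) = Env bs (a - length b)
```
Running the declarations of the example above through ``addVar``, ``newBlock`` and ``exitBlock`` gives exactly the addresses shown in the comments.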
#NEW

==Compiling declarations and statements==

We assume an environment where we can
- ``lookup`` - the address of a variable
- ``addVar`` - add a variable at the next available address
- ``discard`` - a number of variables at exit from a block

```
// declarations
compile (int x ;) =
  addVar x

// variable expressions
compile (x) =
  addr := lookup x
  emit (iload addr)

// assignments
compile (x = exp) =
  compile exp
  addr := lookup x
  emit (istore addr)

// blocks
compile ({ stms }) =
  for all stm in stms:
    compile stm
  discard (number of variables declared in stms)
```

#NEW

==Labels and jumps==

A **label** is an identifier marking a position in the code. A **jump** is the instruction ``goto LABEL``, which makes the interpreter continue from that position, with the same stack as before.

In small-step semantics (``P`` = program counter, ``V`` = variable store, ``S`` = stack, ``P(L)`` = the position of label ``L``):
```
goto L : <P, V, S>  -->  <P(L), V, S>
```
Example (my first BASIC program):
```
BEGIN:
  bipush 66
  invokestatic runtime/iprint(I)V
  goto BEGIN
```
Notice:
- indentation is just a convention, not obligatory
- labels are not instructions in the bytecode, just indices in the code array calculated by the assembler
- all labels in the code must be unique
- ``invokestatic`` calls a function (static method), indicating its class (``runtime``), name (``iprint``), and type (``(I)V``)

#NEW

==Compiling while loops==

Idea:
```
                   TEST:
while (exp) ===>     evaluate exp
  stm                if (exp == 0) goto END
                     execute stm
                     goto TEST
                   END:
```
Compilation scheme:
```
compile (while (exp) stm) =
  TEST := getLabel
  END  := getLabel
  emit TEST
  compile exp
  emit ifeq END
  compile stm
  emit goto TEST
  emit END
```
The code uses the **conditional jump** ``ifeq LABEL``, with the semantics
```
ifeq L : <P, V, S.0>  -->  <P(L), V, S>
ifeq L : <P, V, S.v>  -->  <P+1, V, S>    if v != 0
```
Notice that the jump can go to an earlier position in the code.

#NEW

==Another way to compile while loops==
```
                     goto TEST
while (exp) ===>   LOOP:
  stm                execute stm
                   TEST:
                     evaluate exp
                     if (exp != 0) goto LOOP
```

#NEW

==Compiling conditionals==

Idea:
```
if (exp)   ===>   evaluate exp
  stm1            if (exp == 0) goto FALSE
else              execute stm1
  stm2            goto TRUE
                FALSE:
                  execute stm2
                TRUE:
```
Compilation scheme:
```
compile (if (exp) stm1 else stm2) =
  TRUE  := getLabel
  FALSE := getLabel
  compile exp
  emit ifeq FALSE
  compile stm1
  emit goto TRUE
  emit FALSE
  compile stm2
  emit TRUE
```

#NEW

==Compiling comparisons==

JVM has no comparison operations that return a value, and no booleans.

If we want to get the value of ``exp1 < exp2``, we execute code corresponding to
```
if (exp1 < exp2) 1 ; else 0 ;
```
We use the conditional jump ``if_icmplt LABEL``, which compares the two topmost elements of the stack and jumps if the second-last is less than the last:
```
if_icmplt L : <P, V, S.a.b>  -->  <P(L), V, S>   if a < b
if_icmplt L : <P, V, S.a.b>  -->  <P+1, V, S>    otherwise
```
We can use code that first pushes 1 on the stack. This is overwritten by 0 if the comparison does not succeed.
```
   bipush 1
   exp1
   exp2
   if_icmplt TRUE
   pop
   bipush 0
TRUE:
```

#NEW

==Comparisons in while loop conditions==

Putting together the compilation of comparisons and ``while`` loops gives terrible spaghetti code:
```
while (x < 9) stm
  ===>
TEST:
   bipush 1
   iload 0
   bipush 9
   if_icmplt TRUE
   pop
   bipush 0
TRUE:
   ifeq END
   stm
   goto TEST
END:
```
Couldn't we use the ``if_icmplt`` comparison directly in the ``while`` jump?

#NEW

==Optimizing labels and jumps==

Yes - by using the negation of ``if_icmplt``: recall that ``!(a < b) == (a >= b)``.
```
while (x < 9) stm
  ===>
TEST:
   iload 0
   bipush 9
   if_icmpge END
   stm
   goto TEST
END:
```
The problem is: how can we get this code by using the compilation schemes?
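As a concrete illustration of the machinery the schemes above assume - ``emit`` collecting instructions and ``getLabel`` producing fresh labels - here is a minimal sketch in Haskell, together with one possible answer to the question: a scheme that matches on a deeper pattern, discussed in the next section. The names ``CodeGen``, ``St`` and ``Instr`` are made up for this sketch; ``SWhile`` and ``ELt`` are the constructor names used in the next section.
```
import Control.Monad.State

type Instr = String
type Label = String

data St = St { output :: [Instr], labelCount :: Int }

type CodeGen = State St

emit :: Instr -> CodeGen ()
emit i = modify (\st -> st { output = output st ++ [i] })

getLabel :: CodeGen Label
getLabel = do
  st <- get
  put st { labelCount = labelCount st + 1 }
  return ("L" ++ show (labelCount st))

-- A fragment of abstract syntax.
data Exp = ELt Exp Exp | EVar String | EInt Integer
data Stm = SWhile Exp Stm | SExp Exp

-- The compositional scheme for while loops, as given above.
compileStm :: Stm -> CodeGen ()
compileStm (SWhile cond body) = do
  test <- getLabel
  end  <- getLabel
  emit (test ++ ":")
  compileExp cond
  emit ("ifeq " ++ end)
  compileStm body
  emit ("goto " ++ test)
  emit (end ++ ":")
compileStm (SExp e) = compileExp e

-- A noncompositional variant: the deeper pattern compiles the comparison
-- and the jump together, producing the optimized code above.
compileStmOpt :: Stm -> CodeGen ()
compileStmOpt (SWhile (ELt exp1 exp2) body) = do
  test <- getLabel
  end  <- getLabel
  emit (test ++ ":")
  compileExp exp1
  compileExp exp2
  emit ("if_icmpge " ++ end)
  compileStmOpt body
  emit ("goto " ++ test)
  emit (end ++ ":")
compileStmOpt stm = compileStm stm

compileExp :: Exp -> CodeGen ()
compileExp (EInt i)    = emit ("bipush " ++ show i)
compileExp (EVar x)    = emit ("iload " ++ x)   -- address lookup omitted
compileExp (ELt e1 e2) = do                     -- the general comparison code
  true <- getLabel
  emit "bipush 1"
  compileExp e1
  compileExp e2
  emit ("if_icmplt " ++ true)
  emit "pop"
  emit "bipush 0"
  emit (true ++ ":")
```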
#NEW

==Compositionality==

Syntax-directed translation is **compositional** if the value returned for a tree is a function of the values for its immediate subtrees:
```
T (C a#sub1 ... a#subn) = ... T(a#sub1) ... T(a#subn) ...
```
In programming, this means that
- in Haskell, pattern matching does not need patterns deeper than 1
- in Java, one visitor definition per class and function is enough

In Haskell, it would be easy to use **noncompositional** compilation schemes, by deeper patterns:
```
compile (SWhile (ELt exp1 exp2) stm) = ...
```
In Java, another visitor must be written to define what can happen depending on the condition part of ``while``.

Another approach: **back-end optimization** of the generated code: run through the code and look for code fragments that can be improved.

#NEW

==References and objects==

An integer is one word, a double is two - but how big is a linked list? Or a ``TypeChecker``?

How does JVM (or any machine) deal with objects whose size is not known?

Even worse: the size of objects may change at runtime (e.g. elements can be added to a list).

The stack can of course be arbitrarily large - but the memory needed by a variable must be known when it is allocated on the stack. This is //before// we know what values it will contain!

Solution: use **indirect addressing**. On the stack, store a **reference** to an object that itself is stored in the **heap**. The heap is a separate part of the memory, which is not ordered like a stack.

#NEW

==Stack vs. heap allocation==

The reference gives the address of the object in the heap.

In the heap, the object can be stored discontinuously, divided into many parts. Each part then includes the address of the next part; when the list ends, the address is ``null``.

A linked list is a clear example.
```
  STACK                            HEAP

  ..
  [x1, x2, x3] : addr to x1 -----> x1
  ..                               addr to x2 -----> x2
                                                     addr to x3 -----> x3
                                                                       null
```
Graphics convention: the stack grows downwards, the heap grows upwards.

#NEW

==How to free memory from the stack and the heap==

Objects on the stack are removed automatically when they are no longer used:
- the arguments of functions and arithmetic operations are popped when the value is returned
- the local variables of a function are popped when the function returns

Objects on the heap are not removed in this way: only the references to them from the stack are.

Example: a function ``prime``, which returns the //n//th prime by building a list of the //n// first primes,
```
int prime(int n)
{
  LinkedList primes ;
  //...
}
```
When returning from the function, how do you know which parts of the heap contain parts of ``primes`` and are no longer needed?

#NEW

==Garbage collection==

In C (and, to a lesser extent, C++), heap-allocated memory should be freed by using the function ``free``.

In Java, memory is freed automatically by **garbage collection**.

Garbage = values in memory no longer used by the program.

A garbage collection routine is a part of the interpreter. It can be part of a compiler, too: in addition to the compiled native code, the compiler provides it as a part of a **runtime system**.

C and C++ compilers do not generally provide one. Haskell compilers do.

#NEW

==Mark-and-sweep garbage collection==

Book 7.6.1

To understand how garbage collection works, let us consider a simple program that does it.
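The following is a minimal sketch of such a program in Haskell, over a made-up heap representation: the heap is a map from addresses to blocks, each block carrying a mark bit and the addresses of the blocks it points to. A real collector works on raw memory, but the structure is the same.
```
import qualified Data.Map as M

type Addr  = Int
data Block = Block { marked :: Bool, pointers :: [Addr] }
type Heap  = M.Map Addr Block
type Stack = [Addr]               -- only the heap pointers found on the stack

-- roots: find the heap pointers in the stack
roots :: Stack -> [Addr]
roots = id                        -- here the stack already contains just pointers

-- mark: recursively mark all blocks reachable from the given addresses
mark :: [Addr] -> Heap -> Heap
mark []       heap = heap
mark (a : as) heap = case M.lookup a heap of
  Just b | not (marked b) ->
    mark (pointers b ++ as) (M.insert a (b { marked = True }) heap)
  _ -> mark as heap

-- sweep: free unmarked memory blocks and unmark the marked ones
sweep :: Heap -> Heap
sweep = M.map (\b -> b { marked = False }) . M.filter marked

gc :: Stack -> Heap -> Heap
gc stack heap = sweep (mark (roots stack) heap)
```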
The algorithm consists of three functions:
- **roots**: find the heap pointers in the stack
- **mark**: recursively mark as true all blocks that are addressed
- **sweep**: free unmarked memory blocks and unmark marked blocks

Problems:
- when and how often to run it?
- how to prevent the fragmentation of memory?

There are more developed garbage collection algorithms that address these issues. Modern garbage collectors are difficult to beat with hand-written memory management.

#NEW

==Peephole optimization==

Improve the generated code by looking at chunks of e.g. 3 instructions (this is called the peephole window).

Reduce the number of instructions e.g. by **constant folding**,
```
bipush 6
bipush 9   ===>   bipush 15
iadd
```
or change them to cheaper ones, e.g.
```
bipush 15   ===>   bipush 15
bipush 15          dup
```
Iteration gives the best effect:
```
bipush 3
bipush 6          bipush 3
bipush 9   ===>   bipush 15   ===>   bipush 45
iadd              imul
imul
```
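A sketch of such a pass in Haskell, with a made-up instruction type and only the rewrite rules shown above; the pass is iterated until nothing changes any more.
```
data Instr = Bipush Integer | Dup | IAdd | IMul
  deriving (Eq, Show)

-- One pass with a window of at most three instructions.
peephole :: [Instr] -> [Instr]
peephole (Bipush a : Bipush b : IAdd : is) = peephole (Bipush (a + b) : is)
peephole (Bipush a : Bipush b : IMul : is) = peephole (Bipush (a * b) : is)
peephole (Bipush a : Bipush b : is)
  | a == b                                 = peephole (Bipush a : Dup : is)
peephole (i : is)                          = i : peephole is
peephole []                                = []

-- Iterate to a fixpoint, so that one folding step can enable the next.
optimize :: [Instr] -> [Instr]
optimize is = let is' = peephole is in if is' == is then is else optimize is'

-- optimize [Bipush 3, Bipush 6, Bipush 9, IAdd, IMul]  ==  [Bipush 45]
```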
#NEW

==Assembling JVM==

The notation we have used in this lecture is from the [Jasmin http://jasmin.sourceforge.net/] JVM assembler.

Things done by the assembler:
- change **opcodes** and their arguments to bytes
- change labels to positions in the byte array
- put constants into a **global constant pool**

We have created a simple script [``jass`` ../exercises/jvm/jass], which reads the code for a ``main`` function and embeds it into a class called ``Main``. Then it calls ``jasmin``.

In Exercise 6, you can build a compiler by using the compilation schemes of this lecture. You can then ``jass`` the output of the compiler to produce a file ``Main.class``. This file can be run with
```
java Main
```

#NEW

==Compiling for Intel X86==

The most common processor family in the world. Originally used in the "IBM PC"; historically:
- 8088, 8086. Real mode: any program may access any memory address
- 80286 ("AT"). Main novelty: protected mode
- 80386. Main novelty: 32-bit registers
- 80486DX, Pentium, Pentium II, III, IV, ... Main novelty: integrated math coprocessor with eight 80-bit registers

The machines are backward compatible with older code, but operating systems will forbid certain things, such as accessing the memory of other processes.

#NEW

==Registers vs. memory==

Registers are places for data inside the CPU.
- + up to 10 times faster access than to main memory
- + fewer instructions needed
- - expensive; typically just 32 of them in a 32-bit CPU
- - difficult to use optimally
- - efficient cache memory is making them less central in compiler optimizations

Arithmetic operations must have their operands and return their values in registers.

All modern native code assembly languages use registers.

#NEW

==From assembler to machine language==

Machine language is just bytes. For instance, in x86 we have an instruction
```
00000011 11000011
```
This is usually given in hex code:
```
03 C3
```
This can be generated from the assembly instruction
```
add eax, ebx
```
In source code notation, this would be written
```
eax = eax + ebx
```
The variables ``eax`` and ``ebx`` are names of integer registers.

#NEW

===More examples===

Addition and subtraction: two-address instructions with two registers.
```
add eax, ebx        ; eax = eax + ebx
sub eax, ebx        ; eax = eax - ebx
```
Immediate values (integers) and memory addresses are also possible as source (but not as target):
```
add eax, 5          ; eax = eax + 5
add eax, [ebp - 8]  ; eax = eax + [ebp - 8]
```
The memory address ``[ebp - 8]`` is at an offset of -8 from the address stored in the register ``ebp``.

Multiplication and division are more complicated, e.g.
```
div a               ; eax = (edx:eax) / a
                    ; edx = (edx:eax) % a
```
Move instructions can also have a memory address as target, but not as both source and target.
```
mov [ebp - 12], eax
```
Labels and jumps are as in JVM, e.g.
```
cmp eax, ebx        ; compare eax with ebx
jge Label           ; if eax >= ebx, jump to Label
```

#NEW

===Example===

Source code
```
int i = 3 ;
while(i < 17)
  {
    i = i + 2 ;
  }
int j = i ;
```
Compiled code
```
      mov [ebp - 8], 3    ; i -> [ebp - 8], i = 3
      mov eax, [ebp - 8]  ; eax = i
Test:
      mov ebx, 17         ; ebx = 17
      cmp eax, ebx        ; compare i with 17
      jge End
      add eax, 2          ; eax = i = i + 2
      jmp Test
End:
      mov [ebp - 12], eax ; j -> [ebp - 12], j = i
      mov [ebp - 8], eax  ; save i in memory
```
Notice the optimization: during the loop, ``i`` is kept only in a register; it is saved back to memory at the end.

#NEW

===Compiling to native code===

Compilation schemes are similar to those for JVM. Typically one does not compile directly to native code, but to **intermediate code** with infinitely many **virtual registers**.

**2-address code** is similar to x86: one target and one source register. The scheme returns the register where the value is found.
```
compile (exp1 + exp2) =
  r1 := compile exp1
  r2 := compile exp2
  emit (r1 += r2)
  return r1
```
But often the intermediate language is **3-address code**, with two source registers and a separate target register.
```
compile (exp1 + exp2) =
  r0 := newReg
  r1 := compile exp1
  r2 := compile exp2
  emit (r0 = r1 + r2)
  return r0
```

#NEW

==Register allocation==

The step from infinitely many virtual registers to the limited number of registers in a real machine is called register allocation.

Main constraint: variables that are **live** at the same time cannot be kept in the same register.

Example: how many registers are needed for ``x, y, z``?
```
int x = 1 ;       // x alive
int y = 2 ;       // x, y alive
printInt(y) ;     // x, y alive
int z = 3 ;       // x, z alive
printInt(x + z) ; // x, z alive
```
Answer: two, because ``y`` and ``z`` can be kept in the same register.

#NEW

==Assembling for Intel X86==

Excellent free book: [A Tutorial on PC assembly http://www.drpaulcarter.com/pcasm/] by Paul Carter. The book uses NASM, the Netwide Assembler, which is also available through the book's web page.

The result of compilation is first assembled and then **linked**.
```
compile Foo.cc                       -- generate assembler
nasm -f elf Foo.asm                  -- assemble to object file
gcc -o Foo driver.o Foo.o asm_io.o   -- link to executable
```
The linker combines object files and resolves the names in them.

**Dynamic linking**: some names are only linked at runtime. This makes it possible to have library code in one location, instead of it being copied to all applications.
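Returning to the register allocation example above, here is a small sketch in Haskell of the liveness constraint and a greedy allocation. The representation is made up for illustration: each variable is listed with the lines on which it is live.
```
type Var = String
type LiveRange = (Var, [Int])     -- the lines on which the variable is live

-- The x, y, z example from the register allocation section.
example :: [LiveRange]
example = [("x", [1,2,3,4,5]), ("y", [2,3]), ("z", [4,5])]

-- Two variables interfere if they are live on a common line.
interfere :: LiveRange -> LiveRange -> Bool
interfere (_, ls1) (_, ls2) = any (`elem` ls2) ls1

-- Greedily give each variable the lowest register not taken by an
-- interfering variable that has already been allocated.
allocate :: [LiveRange] -> [(Var, Int)]
allocate = go []
  where
    go done []       = [(x, r) | ((x, _), r) <- done]
    go done (v : vs) =
      let used = [r | (w, r) <- done, interfere v w]
          reg  = head [r | r <- [0 ..], r `notElem` used]
      in  go (done ++ [(v, reg)]) vs

-- allocate example  ==  [("x",0),("y",1),("z",1)]   -- two registers suffice
```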