|                                                                                                                                                                                                | CHALMERS                                                                                                                                                                                                               | This lecture                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |  |  |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|
| Compiler construction<br>Lecture 6: Code generation for x86<br>Alex Gerdes<br>Vårtermin 2017<br>Chalmers University of Technology – Gothenburg University                                      |                                                                                                                                                                                                                        | <ul> <li>x86 architecture</li> <li>Calling conventions</li> <li>Some x86 instructions</li> <li>From LLVM to assembler <ul> <li>Instruction selection</li> <li>(Instruction scheduling)</li> <li>Register allocation</li> </ul> </li> </ul>                                                                                                                                                                                                                                                                                                                                 |  |  |
| x86 architectu                                                                                                                                                                                 | 'e                                                                                                                                                                                                                     | x86: assembly for a real machine         High-level view of x86         • Not a stack machine; no direct correspondence to operand stacks         • Arithmetics, etc. is done with values in registers         • Much more limited support for function calls; you need to handle return addresses, jumps, allocation of stack frames, etc. yourself         • Your code is assembled and run; no further optimization         • CISC architecture usually has few registers; straightforward code will run slowly                                                         |  |  |
| <pre>6 assembler, a first exampl JAVALETTE (or C) &gt; cat ex1.jl int f (int x, int y) {     int z = x + y;     return z;   } This might be compiled to the assembler code to the right.</pre> | e<br>NASM assembly code<br>segment .text<br>global f<br>f:<br>push dword ebp<br>mov ebp, esp<br>sub esp, 4<br>mov eax, [ebp+12]<br>add eax, [ebp+8]<br>mov [ebp-4], eax<br>mov eax, [ebp-4]<br>mov esp, ebp<br>pop ebp | Example explained         segment .text       ; code area         global f       ; f has external scope         f:       ; entry point for f         push dword ebp       ; save caller's fp         mov ebp, esp       ; set our fp         sub esp, 4       ; allocate space for z         mov eax, [ebp+12]       ; move y to eax         add eax, [ebp+8]       ; add x to eax         mov esp, ebp       ; restore caller's fp         mov esp, ebp       ; restore caller's fp         pop ebp       ; restore caller's fp         ret       ; pop return addr, jump |  |  |

| Intel x86 architectures                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | Which version should you target?                                                                                                                                                                                                                                        |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Long history<br>8086 1978. First IBM PCs, 16 bit registers, real mode<br>80286 1982. AT, Windows, protected mode<br>80386 1985. 32 bit registers, virtual memory<br>80486 (Pentium, Pentium II, III, IV) 1989 – 2003.<br>Math coprocessor, pipelining, caches, SSE,<br>Intel Core 2 2006. Multi-core<br>Core i3/i5/i7 2009 —<br>Backwards compatibility important; leading to a large set of<br>opcodes.<br>Not only Intel offer x86 processors, also AMD is in the market.                                                                                      | <b>x86</b><br>When speaking of the x86 architecture, one generally means<br>register/instruction set for the 80386 (with floating-point<br>operations).<br>You can compile code which would run on a 386 – or you may use<br>SSE2 operations for a more recent version. |
| x86 registers         General purpose registers (32-bits)         EAX, EBX, ECX, EDX, EBP, ESP, ESI, EDI.         Conventional use:         EBP and ESP for frame pointer and stack pointer.         Segment registers         Legacy from old segmented addressing architecture.         Can be ignored in JAVALETTE compilers.         Floating-point registers         Eight 80-bit registers ST0 - ST7 organised as a stack.         Flag registers         Status registers with bits for results of comparisons, etc.         We will discuss these later. | Calling convention                                                                                                                                                                                                                                                      |
| <section-header><section-header><section-header><section-header><list-item><list-item><list-item></list-item></list-item></list-item></section-header></section-header></section-header></section-header>                                                                                                                                                                                                                                                                                                                                                        | <ul> <li>Calling convention</li> <li>Caller, before call         <ul> <li>Push params (in reverse order)</li> <li>Push return address</li> <li>Jump to callee entry             push dword paramn</li></ul></li></ul>                                                   |

| Caller, before call                                                                                                                                                                               | Callee, on entry                                                                                                                                                                                                               | Caller, before call                                                                                                                                                                                           | Callee, on entry                                                                                                                                                                                              |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <ul> <li>Push params (in reverse order)</li> <li>Push return address</li> <li>Jump to callee entry push dword param<sub>n</sub></li> <li></li> <li>push dword param<sub>1</sub></li> </ul>        | <ul> <li>Push caller's base pointer</li> <li>Update current base pointer</li> <li>Allocate space for locals         <pre>push dword ebp         mov ebp, esp         sub esp, localbytes</pre> </li> </ul>                     | <ul> <li>Push params (in reverse order)</li> <li>Push return address</li> <li>Jump to callee entry         <pre>push dword paramn</pre> <pre>             push dword param1</pre> </li> </ul>                 | <ul> <li>Push caller's base pointer</li> <li>Update current base pointer</li> <li>Allocate space for locals <ul> <li>push dword ebp</li> <li>mov ebp, esp</li> <li>sub esp, localbytes</li> </ul> </li> </ul> |
| call f                                                                                                                                                                                            |                                                                                                                                                                                                                                | call f                                                                                                                                                                                                        | Callee, on exit                                                                                                                                                                                               |
|                                                                                                                                                                                                   |                                                                                                                                                                                                                                |                                                                                                                                                                                                               | <ul> <li>Restore base and stack ptr</li> <li>Pop return address and jump<br/>mov esp, ebp<br/>pop ebp<br/>ret</li> </ul>                                                                                      |
| ng convention                                                                                                                                                                                     | GAMMER                                                                                                                                                                                                                         | Calling convention                                                                                                                                                                                            |                                                                                                                                                                                                               |
|                                                                                                                                                                                                   | Callee, on entry                                                                                                                                                                                                               | Calling convention<br>Caller, before call                                                                                                                                                                     | Callee, on entry                                                                                                                                                                                              |
| ing convention<br>Caller, before call<br>• Push params (in reverse<br>order)<br>• Push return address                                                                                             | Callee, on entry <ul> <li>Push caller's base pointer</li> <li>Update current base pointer</li> <li>Allocate space for locals</li> </ul>                                                                                        |                                                                                                                                                                                                               | Callee, on entry <ul> <li>Push caller's base pointer</li> <li>Update current base pointer</li> <li>Allocate space for locals</li> </ul>                                                                       |
| <ul> <li>Caller, before call</li> <li>Push params (in reverse order)</li> </ul>                                                                                                                   | <ul><li>Push caller's base pointer</li><li>Update current base pointer</li></ul>                                                                                                                                               | Caller, before call <ul> <li>Push params (in reverse order)</li> </ul>                                                                                                                                        | <ul> <li>Push caller's base pointer</li> <li>Update current base pointer</li> </ul>                                                                                                                           |
| <ul> <li>Caller, before call</li> <li>Push params (in reverse order)</li> <li>Push return address</li> <li>Jump to callee entry push dword param<sub>n</sub></li> <li></li> </ul>                 | <ul> <li>Push caller's base pointer</li> <li>Update current base pointer</li> <li>Allocate space for locals</li> <li>push dword ebp</li> </ul>                                                                                 | Caller, before call <ul> <li>Push params (in reverse order)</li> <li>Push return address</li> <li>Jump to callee entry <ul> <li>push dword paramn</li> <li></li> </ul> </li> </ul>                            | <ul> <li>Push caller's base pointer</li> <li>Update current base pointer</li> <li>Allocate space for locals</li> </ul>                                                                                        |
| <ul> <li>Caller, before call</li> <li>Push params (in reverse order)</li> <li>Push return address</li> <li>Jump to callee entry</li> </ul>                                                        | <ul> <li>Push caller's base pointer</li> <li>Update current base pointer</li> <li>Allocate space for locals         <pre>push dword ebp         mov ebp, esp         sub esp, localbytes</pre> Callee, on exit     </li> </ul> | <ul> <li>Caller, before call</li> <li>Push params (in reverse order)</li> <li>Push return address</li> <li>Jump to callee entry push dword paramn</li> </ul>                                                  | <ul> <li>Push caller's base pointer</li> <li>Update current base pointer</li> <li>Allocate space for locals<br/>enter localbytes, 0</li> </ul>                                                                |
| <ul> <li>Caller, before call</li> <li>Push params (in reverse order)</li> <li>Push return address</li> <li>Jump to callee entry push dword paramn</li> <li></li> <li>push dword param1</li> </ul> | <ul> <li>Push caller's base pointer</li> <li>Update current base pointer</li> <li>Allocate space for locals <ul> <li>push dword ebp</li> <li>mov ebp, esp</li> <li>sub esp, localbytes</li> </ul> </li> </ul>                  | Caller, before call <ul> <li>Push params (in reverse order)</li> <li>Push return address</li> <li>Jump to callee entry <ul> <li>push dword paramn</li> <li></li> <li>push dword param1</li> </ul> </li> </ul> | <ul> <li>Push caller's base pointer</li> <li>Update current base pointer</li> <li>Allocate space for locals<br/>enter localbytes, 0</li> </ul> Callee, on exit <ul> <li>Restore base and stack ptr</li> </ul> |

# Parameters

- In the callee code, integer parameter 1 has address EBP+8, parameter 2 EBP+12, etc.
- Parameter values accessed with indirect addressing: [EBP+8], etc.
- Double parameters require 8 bytes
- Here EBP+n means "(address stored in EBP) + n"

#### Local variables

- First local var is at address EBP-4, etc.
- Local vars are conventionally addressed relative to EBP, not ESP
- Again, refer to vars by indirect addressing: [EBP-4], etc.

#### **Return values**

Integer and boolean values are returned in EAX, doubles in STO

# Scratch registers (caller save)

EAX, ECX and EDX must be saved by caller before call, if used; can be freely used by callee.

#### **Callee save register** EBX, ESI, EDI, EBP, ESP.

For EBP and ESP, this is handled in the code patterns.

#### Note

- What we have described is one common calling convention for 32-bit x86, called  $\underline{cdecl}$
- Other conventions exist, but we omit them

# Assemblers for x86

# Several alternatives

- · Several assemblers for x86 exist, with different syntax
- We will use NASM, the Netwide Assembler, which is available for several platforms
- We also recommend Paul Carter's book and examples; follow link from course website
- Some syntax differences to the GNU assembler:
  - GNU uses %eax etc. as register names
  - For two-argument instructions, the operands have opposite order!
  - Different syntax for indirect addressing
  - If you use gcc -S ex.c, you will get GNU syntax

# **Example: GNU syntax**

9:

a:

#### First example, revisited

| > gcc -c ex1<br>> objdump -c |    | 1.0 |    |                     |                |
|------------------------------|----|-----|----|---------------------|----------------|
| ex1.o: f<br>Disassembly      |    |     |    | elf32-i38<br>.text: | 36             |
| 00000000 <f></f>             | ·: |     |    |                     |                |
| 0:                           | 55 |     |    | push                | %ebp           |
| 1:                           | 89 | e5  |    | mov                 | %esp,%ebp      |
| 3:                           | 8b | 45  | 0c | mov                 | Oxc(%ebp),%eax |
| 6:                           | 03 | 45  | 80 | add                 | 0x8(%ebp),%eax |

# Integer arithmetic; two-adress code

c9

c3

#### Addition, subtraction and multiplication

add dest, src ; dest := dest + src sub dest, src ; dest := dest - src imul dest, src ; dest := dest \* src

Operands can be values in registers or in memory; src also a literal.

leave

ret

#### Division - one-address code

idiv denom (eax, edx) := ((edx:eax) / denom, (edx:eax) % denom)

- The numerator is the 64-bit value EDX:EAX (no other choices)
- Both div and mod are performed; results in EAX resp. EDX
- EDX must be zeroed before division

# Example

#### **JAVALETTE program**

```
int main () {
 printString "Input a number: "; mov ebp, esp
 int n = readInt();
 printInt(2 * n);
 return 0;
}
```

Assembler

The above code could be translated as follows (slightly optimized to fit on slide).

# Code for main push dword ebp push str1 call printString add esp, 4 call readInt imul eax, 2 push eax call printInt add esp, 4 mov eax, 0 leave ret

# Example, continued

| Complete file                                          | Comments                                                   |  |
|--------------------------------------------------------|------------------------------------------------------------|--|
| <pre>extern printString, printInt extern readInt</pre> | <ul> <li>IO functions we will come</li> </ul>              |  |
| segment .data<br>str1 db "Input a number: "            | <ul> <li>The .data se<br/>contains con<br/>str1</li> </ul> |  |
| segment .text                                          | • The .text se<br>contains cod                             |  |
| global main<br>main:                                   | <ul> <li>The global d<br/>gives main ex</li> </ul>         |  |

#### ; code from previous slide

- are external; e back to that
- egment nstants such as
- egment de
- declaration xternal scope (can be called from code outside this file)

# Floating-point arithmetic in x86

# Moving numbers (selection)

fld src Pushes value in src on fp stack

- fild  $\operatorname{src}$  Pushes integer value in  $\operatorname{src}$  on fp stack
- ${\tt fstp}\ {\tt dest}\ {\tt Stores}\ {\tt top}\ {\tt of}\ {\tt fp}\ {\tt stack}\ {\tt in}\ {\tt dest}\ {\tt and}\ {\tt pops}$

Both src and dest can be fp register or memory reference.

#### Arithmetic (selection)

Integer comparisons

+  $v_1 - v_2$  is computed and bits

• ZF is set iff value is zero

• OF is set iff result overflows

in the flag register are set:

• SF is set iff result is

negative

• cmp  $v_1 v_2$ 

**Control flow** 

fadd src Adds src to STO fadd to dest Adds STO to dest

faddp dest Adds STO to dest, then pop

Similar variants for fsub, fmul and fdiv.

# Floating-point arithmetic in SSE2

#### **New registers**

- 128-bit registers XMM0-XMM7 (later also XMM8-XMM15)
- Each can hold two double precision floats or four single-precision floats
- SIMD operations for arithmetic

### **Arithmetic instructions**

- Two-address code, ADDSD, MULSD, etc.
- SSE2 fp code similar to integer arithmetic

# **Control flow**

#### Integer comparisons

# • cmp $v_1 v_2$

- v<sub>1</sub> v<sub>2</sub> is computed and bits in the flag register are set:
  - ZF is set iff value is zero
  - OF is set iff result overflows SF is set iff result is
    - negative

#### Branch instructions (selection)

- JZ lab branches if ZF is set
- JL lab branches if SF is setSimilarly for the other
- relations between  $v_1$  and  $v_2$   $\bullet$  fcomi src compares STO
- and src and sets flags; can be followed by branching as above

### One more example

| JAVALETTE (or C)                                                           | Naive as | ssembler                                                                      |
|----------------------------------------------------------------------------|----------|-------------------------------------------------------------------------------|
| <pre>int sum(int n) {     int res = 0;     int i = 0;     int i = 0;</pre> | sum:     | mov [ebp-4], 0<br>mov [ebp-8], 0                                              |
| <pre>while (i &lt; n) {     res = res + i;     i++; }</pre>                | L3:      | <pre>jmp L2 mov eax, [ebp-8] add [ebp-4], eax inc [ebp-8]</pre>               |
| return res;<br>}                                                           | L2:      | <pre>mov eax, [ebp-8] cmp eax, [ebp+8] jl L3 mov eax, [ebp-4] leave ret</pre> |

### How to do an x86 backend

# Starting point

#### Two alternatives:

- From LLVM code (requires your basic backend to generate LLVM code as a data structure, not directly as strings); will generate many local vars
- From AST's generated by the frontend (means a lot of code common with LLVM backend)

#### Variables

In either case, your code will contain a lot of variables/virtual registers. Possible approaches:

- Treat these as local vars, storing to and fetching from stack at each access; gives really slow code
- Do register allocation<sup>1</sup>; much better code

<sup>1</sup>Future lecture

# Input and output A simple proposal Define printInt, readInt, etc. in C. Then link this file together with your object files using gcc. Alternative: Compile runtime.ll with llvm-as and llc to get runtime.s; this can be given to gcc as below. Linux building From LLVM to assembler To assemble a NASM file to file.o: nasm -f elf file.asm To link. gcc file.o runtime.s Result is executable a.out More info Paul Carter's book (link on course web site) gives more info.

#### Several stages

From LLVM to assembler

- Instruction selection
- Instruction scheduling
- SSA-based optimizations
- Register allocation
- Prolog/epilog code (AR management)
- Code emission

### Target-independent generation

Also much of this is done in target-independent ways and using general algorithms operating on target descriptions.

# Native code generation, revisited

# More complications

So far, we have ignored some important concerns in code generation:

- The instruction set in real-world processors typically offer many different ways to achieve the same effect. Thus, when translating an IR program to native code we must do instruction selection, i.e., choose between available alternatives.
- Often an instruction sequence contain independent parts that can be executed in arbitrary order. Different orders may take very different time; thus a code generator should do instruction scheduling.

Both these task are complex and interact with register allocation.

In LLVM, these tasks are done by the native code generator llc and the JIT compiler in 11i.

# Instruction selection

#### Further observations

Instruction selection

- Instruction selection for RISC machines generally simpler than for CISC machines
- The number of translation possibilities grow (combinatorially) as one considers larger chunks of IR code for translation

# **Pattern matching**

The IR code can be seen as a pattern matching problem: The native instructions are seen as patterns; instruction selection is the problem to cover the IR code by patterns.

#### **Further observations**

- Instruction selection for RISC machines generally simpler than for CISC machines
- The number of translation possibilities grow (combinatorially) as one considers larger chunks of IR code for translation

# Pattern matching

The IR code can be seen as a pattern matching problem: The native instructions are seen as patterns; instruction selection is the problem to cover the IR code by patterns.

## **Two approaches**

- Tree pattern matching: think of IR code as tree
- Peephole matching: think of IR code as sequence



| Instruction scheduling, example                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | Instruction scheduling, example                                                                                                                                                                      |                                                                                                                    |  |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------|--|
| Example (from Cooper)<br>w = w * 2 * x * y * z<br>• Memory op takes 3 cycles, mult 2 cycles, add one cycle<br>• One instruction can be issued each cycle, if data available<br>Schedule 1                                                                                                                                                                                                                                                                                                                                      | Example (from Cooper)<br>w = w * 2 * x * y * z<br>• Memory op takes 3 cycles, mult 2 cycles, add one cycle<br>• One instruction can be issued each cycle, if data available<br>Schedule 1 Schedule 2 |                                                                                                                    |  |
| r1 <- M [fp + @w]<br>r1 <- r1 + r1<br>r2 <- M [fp + @x]<br>r1 <- r1 * r2<br>r2 <- M [fp + @y]<br>r1 <- r1 * r2<br>r2 <- M [fp + @z]<br>r1 <- r1 * r2<br>M [fp + @w] <- r1<br>Instruction scheduling                                                                                                                                                                                                                                                                                                                            | r1 <- M [fp + @w]<br>r1 <- r1 + r1<br>r2 <- M [fp + @x]<br>r1 <- r1 * r2<br>r2 <- M [fp + @y]<br>r1 <- r1 * r2<br>r2 <- M [fp + @z]<br>r1 <- r1 * r2<br>M [fp + @w] <- r1<br>x86 backend extension   | r1 <- M [fp + @w]<br>r2 <- M [fp + @x]<br>r3 <- M [fp + @y]<br>r1 <- r1 + r1<br>r1 <- r1 * r2<br>r2 <- M [fp + @z] |  |
| <ul> <li>Comments</li> <li>Problem is NP-complete for realistic architectures</li> <li>Common technique is <u>list scheduling</u>: greedy algorithm for scheduling a basic block</li> <li>Builds graph describing data dependencies between instructions and schedules instructions from ready list of instructions with available operands</li> <li>Interaction</li> <li>Despite interaction between selection, scheduling and register allocation, these are typically handled independently (and in this order).</li> </ul> | <b>Comments</b><br>• Two credits<br>• Need to implement at lo<br>• Acts as a 'multiplier' for                                                                                                        | east one optimization pass<br>r other extensions                                                                   |  |