Aarne Ranta (aarne@chalmers.se)

Book: 6.3, 6.5

To define what the program should do.

- e.g. read an array of integers and return a double

To guarantee that the program is meaningful.

- that it does not add a string to an integer
- that variables are declared before they are used

To document the programmer's intentions.

- better than comments, which are not checked by the compiler

To optimize the use of hardware.

- reserve the minimal amount of memory, but not more
- use the most appropriate machine instructions

Depending on language, the type checker can prevent

- application of a function to wrong number of arguments,
- application of integer functions to floats,
- use of undeclared variables in expressions,
- functions that do not return values,
- division by zero,
- array indices out of bounds,
- nonterminating recursion,
- sorting algorithms that don't sort...

Languages differ greatly in how strict their static semantics is: none of the things above is checked by all programming languages!

In general, the more static checking there is in the compiler, the less need there is for manual debugging.

These formats are independent of implementation language.

- Lexer: regular expressions
- Parser: BNF grammars
- **Type checker: typing rules**
- Interpreter: operational semantic rules
- Code generator: compilation schemes

Typing rules concern **judgements** of the form

E => e : T

where *E* is an **environment**, which contains e.g. typings of identifiers.
The judgement says

- in the environment *E*, expression *e* has type *T*

Judgements are used in **typing rules** of the form

    J1  J2  ...  Jn
    ---------------  C
           J

(n >= 0) which says

- from the judgements *J1, J2, ..., Jn* you may conclude *J*, if condition *C* holds.

The judgements above the line in a rule are the **premisses**.

The judgement under the line is the **conclusion**.

The condition `C` beside the line is a **side condition**, typically not expressible as a judgement and therefore not a premiss.

The judgements are written in a formal language, whereas side conditions can be written in natural language.

Typing rules for arithmetic expressions

    E => e1 : int   E => e2 : int        E => e1 : int   E => e2 : int
    -----------------------------        -----------------------------
         E => e1 + e2 : int                   E => e1 * e2 : int

    ----------  x : T is in E            ------------  i is an integer literal
    E => x : T                           E => i : int

Derivation of judgement `x : int, y : int => x + 12 * y : int`

                               x:int, y:int => 12 : int   x:int, y:int => y : int
                               ---------------------------------------------------
    x:int, y:int => x : int            x:int, y:int => 12 * y : int
    ------------------------------------------------------------------------------
                          x:int, y:int => x + 12 * y : int
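The rules for arithmetic expressions can be read as a recursive function from expressions to types, and the derivation above is then the trace of its recursive calls. The following is a minimal Java sketch of that reading. The class names (`ArithRules`, `Var`, `IntLit`, `BinOp`) and the representation of types as strings are assumptions made for illustration; they are not part of the course code.

```java
import java.util.HashMap;
import java.util.Map;

class ArithRules {
    // Hypothetical mini-AST: variables, integer literals, and e1 + e2 / e1 * e2.
    static abstract class Expr {}

    static class Var extends Expr {
        final String name;
        Var(String name) { this.name = name; }
    }

    static class IntLit extends Expr {
        final int value;
        IntLit(int value) { this.value = value; }
    }

    static class BinOp extends Expr {
        final char op;                 // '+' or '*'
        final Expr left, right;
        BinOp(char op, Expr left, Expr right) {
            this.op = op; this.left = left; this.right = right;
        }
    }

    // Returns the type T for which E => e : T is derivable, or throws.
    static String infer(Map<String, String> env, Expr e) {
        if (e instanceof IntLit) {
            return "int";                              // rule for integer literals
        }
        if (e instanceof Var) {
            String name = ((Var) e).name;
            String t = env.get(name);                  // side condition: x : T is in E
            if (t == null) throw new IllegalStateException("unknown variable " + name);
            return t;
        }
        BinOp b = (BinOp) e;
        // Both premisses of the + and * rules must yield int.
        if (!infer(env, b.left).equals("int") || !infer(env, b.right).equals("int")) {
            throw new IllegalStateException("operands of " + b.op + " must be int");
        }
        return "int";
    }

    public static void main(String[] args) {
        Map<String, String> env = new HashMap<>();
        env.put("x", "int");
        env.put("y", "int");
        // x + 12 * y, the expression from the derivation
        Expr e = new BinOp('+', new Var("x"), new BinOp('*', new IntLit(12), new Var("y")));
        System.out.println(infer(env, e));
    }
}
```

Each `return` corresponds to the conclusion of one rule, and each exception to a rule that cannot be applied.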

We generalize the type checking context to an **environment** with
two parts:

- **signature**, which shows the types of functions
- **context**, which shows the types of variables.

In the course of type checking, the signature remains the same throughout a program module, whereas the context changes all the time.

No expression in the language of Lab 2 has **function types**,
because functions are never returned as values or used as arguments.

However, the compiler needs internally a data structure for function types, to hold the types of the parameters and the return type. E.g. for a function

bool between (int x, double a, double b) {...}

we write

between : (int, double, double) -> bool

to express this internal representation in typing rules.

Dividing the environment into signature `F` and context `G`, we write judgements as

F,G => J

Example: typing rule for a variable expression

    -------------  x : T is in G
    F,G => x : T

Example: typing rule for one-place function application

     F,G => e : A
    ---------------  f : (A) -> T is in F
    F,G => f(e) : T

Expressions have types, but statements do not.

However, statements are also checked by the type checker.

We need a new judgement form, saying that a statement `S` is valid:

F,G => S valid

Example: typing rule for an assignment

        F,G => e : T
    --------------------  x : T is in G
    F,G => x = e ; valid

Example: typing rule for while loops

    F,G => e : bool   F,G => S valid
    --------------------------------
        F,G => while (e) S valid

Contexts can be **extended** with new variables. The notation we use is

(G, x : T)

This corresponds to **variable binding** constructs, e.g. declarations.

Example: typing rule for a declaration; `SS` is a sequence of statements

     F,(G, x:t) => SS valid
    -------------------------  x not in G
      F,G => t x ; SS valid

The rule says: if `SS` is a valid sequence of statements in context `(G, x : t)`, then `t x ; SS` is valid in `G`.

We prove that `int x ; x = x + 5 ;` is valid in the empty context `()`.

    x : int => x : int   x : int => 5 : int
    ---------------------------------------
            x : int => x + 5 : int
         ----------------------------
         x : int => x = x + 5 ; valid
       -------------------------------
       () => int x ; x = x + 5 ; valid

The signature is omitted for simplicity.

The validity of a function definition.

       F,(G, x1:A1,...,xn:An) => SS valid
    ------------------------------------------  f not in F
    F,G => T f (A1 x1,...,An xn) { SS } valid

More conditions:

- that `x1 ... xn` are distinct
- that `A1 ... An, T` are types (can be guaranteed by syntax)
- that there is a `return e` with `e : T` (not always checked)

The typing rule for function applications is the following

    F,G => f : (A1,...,An) -> T   F,G => e1 : A1, ..., en : An
    ----------------------------------------------------------
                   F,G => f(e1,...,en) : T

In C, C++, Java, Haskell, etc, variables on the same level must be distinct (e.g. in function parameter lists).

However, variables in an **inner block** are no longer on the same level
and can hence **overshadow** outer variables:

    {
      int x = 1 ;
      bool b ;
      x = x + 2 ;          // x : int
      {
        double x = 2.0 ;
        x = x + 1.0 ;      // x : double
        b = true ;         // b : bool
      }
      x = x + 5 ;          // x : int
      b = b && b ;         // b : bool
    }

Variables declared in a block are discarded at exit from the block.

There is no limit on the number of block levels.

Implementation 1: with markers

x : int, b : bool, MARK, x : double

- entering a block: add a marker
- leaving a block: delete variables after last marker, and the marker itself
- update: after the last variable
- lookup: the latest occurrence of the variable

Implementation 2: with stacks of contexts (separated by ".", stack top is rightmost)

(x : int, b : bool).(x : double)

- entering a block: push an empty context on the stack
- leaving a block: pop the topmost context
- update: after the last variable in the topmost context
- lookup: the occurrence of the variable in the nearest enclosing context, searching from the top of the stack

Implementation 2 can be done with lists of lists: the top of the stack is the head of the list.
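As an illustration of Implementation 2, here is a minimal Java sketch of a stack of contexts. The class name `ContextStack`, its method names, and the use of strings for types are assumptions for this example only; a `Deque` of maps stands in for the list of lists.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class ContextStack {
    // push/pop/peek all work on the head of the deque, which is the stack top.
    private final Deque<Map<String, String>> contexts = new ArrayDeque<>();

    ContextStack() {
        contexts.push(new HashMap<String, String>());   // outermost context
    }

    // entering a block: push an empty context
    void enterBlock() { contexts.push(new HashMap<String, String>()); }

    // leaving a block: pop the topmost context, discarding its variables
    void leaveBlock() { contexts.pop(); }

    // update: only the topmost context is inspected and extended
    void declare(String x, String type) {
        Map<String, String> top = contexts.peek();
        if (top.containsKey(x)) {
            throw new IllegalStateException("variable " + x + " already declared in this block");
        }
        top.put(x, type);
    }

    // lookup: search from the topmost context outwards
    String lookup(String x) {
        for (Map<String, String> ctx : contexts) {      // iteration starts at the stack top
            String t = ctx.get(x);
            if (t != null) return t;
        }
        throw new IllegalStateException("variable " + x + " not found");
    }

    public static void main(String[] args) {
        ContextStack env = new ContextStack();
        env.declare("x", "int");
        env.declare("b", "bool");
        env.enterBlock();
        env.declare("x", "double");            // overshadows the outer x
        System.out.println(env.lookup("x"));   // inner declaration
        System.out.println(env.lookup("b"));   // found in the outer context
        env.leaveBlock();
        System.out.println(env.lookup("x"));   // outer declaration again
    }
}
```

Because `declare` looks only at the topmost map, redeclaring a variable is rejected within one block but allowed in an inner block.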

The rules are expressed as checking if a **list of statements** `SS` is valid.
Instead of a single context, we have a **stack of contexts** `Gs.G`, where we denote by `G` the topmost context.

Declarations.

      Gs.(G,x:t) => SS valid
    --------------------------  x not in G
      Gs.G => t x ; SS valid

Assignments.

      Gs => e : t   Gs => SS valid
    -------------------------------  x : t in Gs
        Gs => x = e ; SS valid

Blocks.

    Gs.() => SS valid   Gs => SSS valid
    -------------------------------------
           Gs => { SS } SSS valid

The typing rule for variable expressions is now

    --------------  x : T is the closest entry for x in Gs.G
    Gs.G => x : T

**Type checking**: given a judgement `G => e : T`, find out whether it can be derived by the typing rules. The **derivation** is a tree of rule applications with the judgement as the last line.

**Type inference**: given an expression `e` and a context `G`, find a type `T` such that `G => e : T` can be derived by the typing rules.

For Java, C, and C++, we can mostly do with just type checking, because types are marked explicitly.

Haskell has type inference as well: if you don't give the type,
the compiler can usually find *the most general type*.

We can classify checkers in terms of what they return:

- A *rude checker*, which only says `True` or `False`, and may even crash (for instance, when variable lookup just gives an `error` if the variable is not found).
- An *error-reporting checker*, which returns `OK` or a message saying where the error is.
- An *annotating checker*, which returns a syntax tree annotated with more type information.

To build a compiler back end, we need the third.

In Lab 2, we build the second.

Pass 1:

- start with empty signature
- for each function `f`, update signature with `f : T`

Pass 2:

- for each `f`, check the function body of `f` with respect to the type `T`
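The point of the two passes is that a function body may call functions defined later in the file. The following Java sketch shows only that aspect. `TwoPass` and `Def` are hypothetical names, and a function body is reduced to the list of function names it calls, which is a strong simplification of real body checking.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TwoPass {
    // Hypothetical, heavily simplified function definition: its name, its type
    // written as a string, and the names of the functions called in its body.
    static class Def {
        final String name, type;
        final List<String> calls;
        Def(String name, String type, String... calls) {
            this.name = name; this.type = type; this.calls = Arrays.asList(calls);
        }
    }

    static List<String> check(List<Def> program) {
        List<String> errors = new ArrayList<>();

        // Pass 1: start with an empty signature and add f : T for each function.
        Map<String, String> signature = new HashMap<>();
        for (Def d : program) {
            if (signature.containsKey(d.name)) errors.add("duplicate function " + d.name);
            else signature.put(d.name, d.type);
        }

        // Pass 2: check each body against the complete signature.
        for (Def d : program) {
            for (String callee : d.calls) {
                if (!signature.containsKey(callee)) {
                    errors.add("in " + d.name + ": unknown function " + callee);
                }
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        List<Def> program = Arrays.asList(
            new Def("main", "() -> int", "helper"),          // helper is defined later
            new Def("helper", "(int) -> bool", "missing"));  // missing is defined nowhere
        System.out.println(check(program));
    }
}
```

With a single pass, the call from `main` to `helper` would be reported as an error, because `helper` would not yet be in the signature.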

The expression checker consists of functions:

    check (Exp e, Type t) returns void
    infer (Exp e)         returns Type

These functions are defined by mutual recursion, by cases on the expression.

We also need to check function definitions and sequences of statements.

    check (Def d)   returns void
    check (Stms ss) returns void

All functions use an environment (= signature and stack of contexts).

The method is syntax-directed translation.

We show syntax-directed translation in pseudocode.

    infer x =                    // variable x
      t := lookup(x)
      return t

    infer i =                    // integer literal i
      return int

    infer f(a1, ..., an) =
      T := lookup(f)
      if T = (A1, ..., An) -> B
        check a1 : A1
        ...
        check an : An
        return B
      else
        failure

Basic idea: from rule

    J1 ... Jn
    ----------  C
        J

generate the code "upside down"

    check J =
      check J1
      ...
      check Jn
      check_condition C

Example:

    Env => exp1 : bool   Env => exp2 : bool
    -----------------------------------------
           Env => exp1 && exp2 : bool

    check Env => exp1 && exp2 : bool =
      check Env => exp1 : bool
      check Env => exp2 : bool

Judgements are easy: recursive calls to check.

    Env => exp : bool   Env => stm valid
    ------------------------------------
        Env => while (exp) stm valid

    check Env => while (exp) stm valid =
      check Env => exp : bool
      check Env => stm valid

Side conditions can be arbitrary code, so you have to think harder.

    ----------------  var : typ is in Env
    Env => var : typ

    check Env => var : typ =
      check_condition lookup(var,Env) == typ

It is `lookup` and such conditions that in the end generate the error messages.

    lookup(var,Env) =
      message ("variable " var " not found")     // if var is not in Env

    check_condition x == y =
      message ("expected " y " but found " x)    // if not equal

There is a grammar rule saying that expressions can be used as statements:

Stm ::= Exp ";"

How do we check that such statements are valid?

      Env => exp : ?
    ------------------
    Env => exp ; valid

The problem is that we have no type `typ` to check `exp : typ`.

Solution 1: check `exp` with each of the four types

    check Env => exp ; valid =
      try each typ in [bool,double,int,void]: check exp : typ

This is inefficient, and does not scale up to infinitely many types.

Solution 2: do type inference with `exp`. If it succeeds, the statement is valid, because expressions of any type can be used as statements.

The general scheme is a rule where the conclusion has a type depending in some way on the premisses and the condition:

    J1 ... Jn
    ---------------------------------  C
    Env => exp : typ(J1, ..., Jn, C)

We should then use recursive calls of `check` and `infer` so that

- everything we need for constructing the type is inferred
- everything else is just checked

Often the type is independent of the premisses (which still have to be checked of course!):

    Env => exp1 : bool   Env => exp2 : bool
    ------------------------------------------
           Env => exp1 && exp2 : bool

    infer Env (exp1 && exp2) =
      check Env => exp1 : bool
      check Env => exp2 : bool
      return bool

It can also come from the condition:

    ----------------  var : typ is in Env
    Env => var : typ

    infer Env var =
      return lookup(var,Env)

Arithmetic operations in most languages are **overloaded**.

This means that they apply to many types.

The general rule for `+ - * /` is: both operands have the same type as the value, which must be `int` or `double`.

    Env => exp1 : typ   Env => exp2 : typ
    --------------------------------------  typ is int or double
          Env => exp1 + exp2 : typ

What we do is infer the type of the first operand and check the second.

    infer Env (exp1 + exp2) =
      typ := infer Env exp1
      check_condition typ == int or typ == double
      check Env => exp2 : typ
      return typ
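The pseudocode for overloaded addition translates fairly directly into Java. The sketch below is an assumption-laden illustration: the names `Overloading`, `Ty`, `Lit` and `Add` are invented for this example, the environment is left out, and a literal simply carries its own type.

```java
class Overloading {
    enum Ty { INT, DOUBLE, BOOL }

    static abstract class Exp {}

    // Hypothetical leaf expression that already knows its type.
    static class Lit extends Exp {
        final Ty type;
        Lit(Ty type) { this.type = type; }
    }

    static class Add extends Exp {
        final Exp left, right;
        Add(Exp left, Exp right) { this.left = left; this.right = right; }
    }

    static Ty infer(Exp e) {
        if (e instanceof Lit) return ((Lit) e).type;
        Add a = (Add) e;
        Ty t = infer(a.left);                      // infer the first operand
        if (t != Ty.INT && t != Ty.DOUBLE) {       // side condition: typ is int or double
            throw new IllegalStateException("operand of + has type " + t);
        }
        check(a.right, t);                         // check the second operand against it
        return t;
    }

    // check is defined through infer plus an equality test on types.
    static void check(Exp e, Ty expected) {
        Ty actual = infer(e);
        if (actual != expected) {
            throw new IllegalStateException("expected " + expected + " but found " + actual);
        }
    }

    public static void main(String[] args) {
        Exp i = new Lit(Ty.INT);
        Exp d = new Lit(Ty.DOUBLE);
        System.out.println(infer(new Add(i, i)));
        System.out.println(infer(new Add(d, new Add(d, d))));
    }
}
```

Note that `int + double` is rejected here, since the rule requires both operands to have the same type; a language with implicit conversions would need a different rule.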

Also the comparison operators are overloaded, but the return type is of course `bool`.

Now we can check expression statements:

    check Env => exp ; valid =
      infer Env exp

If `infer` fails, we get any error message it generates.

If `infer` succeeds, we discard the type.

In the same way, we only need to write `infer` for expressions.
Then we define `check` uniformly,

    check Env => exp : typ =
      typ2 := infer Env exp
      check_condition typ2 == typ

The `check_condition` call usually returns a message at failure, e.g.

TYPE ERROR type of exp: expected typ, inferred typ2

To check the whole program,

- collect the types of each function into the signature
- check that function names are unique
- check each function definition using the signature

To check a function definition

- check that argument variables are unique
- initialize the topmost context with the argument variables
- check the body in this context
- check that there is a `return`, with an expression that has the expected return type of the function (or just a `return` if the type is `void`)

To check a sequence of statements

- check the validity of the first statement and update the environment if appropriate
- check the remaining sequence in the new environment
- an empty sequence is always valid
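The three steps for a sequence of statements amount to a recursion that threads the environment through the list. Here is a small Java sketch of that recursion. `StmSequence` and `Stm` are hypothetical names; statements are reduced to declarations and assignments, an assignment records only the already inferred type of its right-hand side, and a single flat context is used instead of a stack.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StmSequence {
    // Hypothetical statement forms: "type x ;" or "x = e ;" with e of a known type.
    static class Stm {
        final boolean isDecl;
        final String var, type;
        private Stm(boolean isDecl, String var, String type) {
            this.isDecl = isDecl; this.var = var; this.type = type;
        }
        static Stm decl(String type, String var) { return new Stm(true, var, type); }
        static Stm assign(String var, String rhsType) { return new Stm(false, var, rhsType); }
    }

    // Checks one statement and returns the environment for the statements after it.
    static Map<String, String> checkStm(Map<String, String> env, Stm s) {
        if (s.isDecl) {
            if (env.containsKey(s.var)) {
                throw new IllegalStateException("variable " + s.var + " already declared");
            }
            Map<String, String> extended = new HashMap<>(env);  // old environment stays untouched
            extended.put(s.var, s.type);
            return extended;
        }
        String declared = env.get(s.var);
        if (declared == null) {
            throw new IllegalStateException("variable " + s.var + " not found");
        }
        if (!declared.equals(s.type)) {
            throw new IllegalStateException("expected " + declared + " but found " + s.type);
        }
        return env;                                            // assignments do not change it
    }

    static void checkStms(Map<String, String> env, List<Stm> stms) {
        if (stms.isEmpty()) return;                            // an empty sequence is always valid
        Map<String, String> newEnv = checkStm(env, stms.get(0));
        checkStms(newEnv, stms.subList(1, stms.size()));       // the rest, in the new environment
    }

    public static void main(String[] args) {
        // int x ; x = <int expression> ;
        List<Stm> program = Arrays.asList(Stm.decl("int", "x"), Stm.assign("x", "int"));
        checkStms(new HashMap<String, String>(), program);
        System.out.println("OK");
    }
}
```

Swapping the two statements makes the check fail, because the assignment would then be checked in an environment that does not yet contain `x`.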

You can copy the contents of `laborations/lab2/haskell/`:

    CPP.cf          -- grammar
    lab2.hs         -- main module
    Makefile
    TypeChecker.hs  -- type checking module

You only have to modify `CPP.cf` and `TypeChecker.hs`.

But you can already compile them: just type

make

and run the type checker with

./lab2 <File>

The rest is "debugging the empty file"!

You don't have to write this - just copy the file `laborations/lab2/haskell/lab2.hs`.

This file shows how compiler phases are linked together.

    check :: String -> IO ()
    check s = case pProgram (myLexer s) of
      Bad err -> do
        putStrLn "SYNTAX ERROR"
        putStrLn err
        exitFailure
      Ok tree -> case typecheck tree of
        Bad err -> do
          putStrLn "TYPE ERROR"
          putStrLn err
          exitFailure
        Ok _ -> putStrLn "OK"

In other words: call the parser; if it succeeds, call the type checker.

Notice the use of the **error type**,

data Err a = Ok a | Bad String

The value is either `Ok` of the expected type or `Bad` with an error message.

The `Err` type is generated by BNFC. One could also use Haskell's standard type `Either String a`.

The `Err` type

data Err a = Ok a | Bad String

is a **monad**: a type of actions returning `a` but also doing other things (in this case: exceptions).

Monad actions can be **sequence**d: if

inferExp :: Env -> Exp -> Err Type

then you can make several inferences one after the other by using `do`

    do inferExp env exp1
       inferExp env exp2

You can **bind** variables returned from actions, and **return**
values.

    do typ1 <- inferExp env exp1
       typ2 <- inferExp env exp2
       return TBool

If you are only interested in side effects, use the dummy value type `()` (corresponds to `void` in C and Java).

Environment type

    type Env     = (Sig,[Context])        -- signature and stack of contexts
    type Sig     = [(Id,([Type],Type))]   -- or Map Id ([Type],Type)
    type Context = [(Id,Type)]            -- or Map Id Type

Auxiliary operations on the environment

    lookVar   :: Env -> Id -> Err Type
    lookFun   :: Env -> Id -> Err ([Type],Type)
    updateVar :: Env -> Id -> Type -> Err Env
    updateFun :: Env -> Id -> ([Type],Type) -> Err Env
    newBlock  :: Env -> Err Env
    emptyEnv  :: Env

Keep the datatypes abstract, i.e. use them only via these operations. Then you can switch to another implementation if needed (more efficient, more stuff in the environment).

The environment datatypes and operations.

Type signatures of the checking methods

    typecheck :: Program -> Err ()               -- required function in lab2
    checkDef  :: Env -> Def -> Err ()            -- check a function definition
    checkStms :: Env -> Type -> [Stm] -> Err ()
    checkStm  :: Env -> Type -> Stm -> Err Env
    checkExp  :: Env -> Type -> Exp -> Err ()
    inferExp  :: Env -> Exp -> Err Type

Some other auxiliaries.

    checkUnique    :: (Ord a, Print a) => [a] -> Err ()
    checkCondition :: Bool -> Err ()

    checkStm :: Env -> Type -> Stm -> Err Env
    checkStm env val x = case x of
      SExp exp -> do
        inferExp env exp
        return env
      SDecl type' x ->
        updateVar env x type'   -- also check that x is not in context already
      SWhile exp stm -> do
        checkExp env Type_bool exp
        checkStm env val stm

    checkExp :: Env -> Type -> Exp -> Err ()
    checkExp env typ exp = do
      typ2 <- inferExp env exp
      if typ2 == typ
        then return ()
        else fail $ "type of " ++ printTree exp -- ...

    inferExp :: Env -> Exp -> Err Type
    inferExp env x = case x of
      ETrue           -> return Type_bool
      EInt n          -> return Type_int
      EId id          -> lookVar env id
      EPIncr exp      -> inferNumeric env exp
      ETimes exp0 exp -> inferNumericBin env exp0 exp

    inferNumeric :: Env -> Exp -> Err Type
    inferNumeric env exp = do
      typ <- inferExp env exp
      if elem typ [Type_int, Type_double]
        then return typ
        else fail $ "type of expression " ++ printTree exp -- ...

    inferNumericBin :: Env -> Exp -> Exp -> Err Type

You can copy the contents of `laborations/lab2/java/`:

    CPP.cf              -- grammar
    lab2                -- script running the type checker
    lab2.java           -- main program
    Makefile
    TypeChecker.java    -- type checker class
    TypeException.java  -- exceptions for type checking

You only have to modify `CPP.cf` and `TypeChecker.java`.

But you can already compile them: just type

make

and run the type checker with

./lab2 <File>

The rest is "debugging the empty file"!

Before `make`, you may have to set your class path so that it finds java_cup and JLex, as well as the current directory.

export CLASSPATH=.:<path-to-JLex>:<path-to-CUP>:$CLASSPATH

This is given in `laborations/lab2/java/lab2.java`, hence you don't have to write this.

It shows how compiler phases are linked together.

    try {
      l = new Yylex(new FileReader(args[0]));
      parser p = new parser(l);
      CPP.Absyn.Program parse_tree = p.pProgram();
      new TypeChecker().typecheck(parse_tree);
    } catch (TypeException e) {
      System.out.println("TYPE ERROR");
      System.err.println(e.toString());
      System.exit(1);
    } catch (IOException e) {
      System.err.println(e.toString());
      System.exit(1);
    } catch (Throwable e) {
      System.out.println("SYNTAX ERROR");
      System.out.println("At line " + String.valueOf(l.line_num())
                         + ", near \"" + l.buff() + "\" :");
      System.out.println(" " + e.getMessage());
      System.exit(1);
    }

Environment types

    public static class FunType {
      public LinkedList<Type> args ;
      public Type val ;
    }

    public static class Env {
      public Map<String,FunType> signature ;
      public LinkedList<Map<String,Type>> contexts ;   // stack of contexts
      // instance methods, since they read and update the fields above
      public Type lookVar(String id) { ... }
      public FunType lookFun(String id) { ... }
      public void updateVar (String id, Type ty) { ... }
      // ...
    }

The environment datatypes and operations.

An enumeration of codes for types.

public static enum TypeCode { CodeInt, CodeDouble, CodeBool, CodeVoid } ;

Notice that `TypeCode` is not the same as the class `Type`, which is the syntactic category of source-language types.
We need `TypeCode` to be able to compare types for equality, and this happens when we compare an expected type with an inferred type.

Type signatures of the checking methods

    public void typecheck(Program p) {
    }

    public static class CheckStm implements Stm.Visitor<Env,Env> {
      public Env visit(SDecl p, Env env) {
      }
      public Env visit(SExp p, Env env) {
      }
      // ...
    }

    public static class InferExp implements Exp.Visitor<Type,Env> {
      public Type visit(EInt p, Env env) {
      }
      public Type visit(EAdd p, Env env) {
      }
      // ...
    }

    public static class CheckStm implements Stm.Visitor<Env,Env> {
      public Env visit(SDecl p, Env env) {
        env.updateVar(p.id_,p.type_) ;
        return env ;
      }
      //...
    }

    public static class InferExpType implements Exp.Visitor<Type,Env> {
      public Type visit(demo.Absyn.EPlus p, Env env) {
        Type t1 = p.exp_1.accept(this, env);
        Type t2 = p.exp_2.accept(this, env);
        if (typeCode(t1) == TypeCode.CodeInt && typeCode(t2) == TypeCode.CodeInt)
          return TInt;
        else if (typeCode(t1) == TypeCode.CodeDouble && typeCode(t2) == TypeCode.CodeDouble)
          return TDouble;
        else
          throw new TypeException("Operands to + must be int or double.");
      }
      //...
    }

The function `typeCode` converts source language types to their codes:

public static TypeCode typeCode (Type ty) ...

It can be implemented by using a visitor or the `instanceof` operator.
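A sketch of the `instanceof` variant might look as follows. The classes `TInt`, `TDouble`, `TBool` and `TVoid` are hypothetical stand-ins for the subclasses of `Type` that BNFC generates from the grammar (the real names depend on the labels in `CPP.cf`), and the wrapper class `TypeCodes` exists only to make the example self-contained.

```java
class TypeCodes {
    // Hypothetical stand-ins for the generated abstract syntax of types.
    static abstract class Type {}
    static class TInt extends Type {}
    static class TDouble extends Type {}
    static class TBool extends Type {}
    static class TVoid extends Type {}

    enum TypeCode { CodeInt, CodeDouble, CodeBool, CodeVoid }

    // Maps a syntax tree for a type to its code, one instanceof test per case.
    static TypeCode typeCode(Type ty) {
        if (ty instanceof TInt) return TypeCode.CodeInt;
        if (ty instanceof TDouble) return TypeCode.CodeDouble;
        if (ty instanceof TBool) return TypeCode.CodeBool;
        if (ty instanceof TVoid) return TypeCode.CodeVoid;
        throw new IllegalArgumentException("unknown type " + ty);
    }

    public static void main(String[] args) {
        Type expected = new TInt();
        Type inferred = new TInt();
        // Two separately built syntax trees are different objects ...
        System.out.println(expected == inferred);
        // ... but their codes are enum constants and can be compared with ==.
        System.out.println(typeCode(expected) == typeCode(inferred));
    }
}
```

The comparison of codes is what the checker uses when it tests an expected type against an inferred one.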

You don't need to debug completely empty files:

- for the grammars, you can pick rules from your Lab 1 (as indicated in Lab 2 PM)
- for the type checker, you can start from the "mini" implementation, in `laborations/mini`

We will read through the Lab PM.

Preparation and exercise: write typing rules for Lab PM constructs.