Lecture 2: Abstract and Concrete Syntax
Programming Languages Course
Aarne Ranta (aarne@chalmers.se)
%!target:html
%!postproc(html): #NEW
%!postproc(html): #HR
Book: 2.8.2, 4.1 - 4.3
#NEW
==The central role of abstract syntax==
Although lexing is the first compiler phase, we don't start from it.
Instead, we start from the middle, abstract syntax, which is
- the goal of lexing + parsing
- the starting point of code generation
- the domain of type checking and many optimizations
- the hub between the front end and the back end
This lecture: what syntax rules look like.
Next lecture: how to write a grammar to generate a compiler front end.
#NEW
==Abstract and concrete syntax==
**Abstract syntax**: what are the significant parts of the expression?
Example: a sum expression has its two operand expressions as its significant parts
[abstract.png]
**Concrete syntax**: what does the expression look like?
Example: //the same// sum expression can be written in different ways:
```
2 + 3 -- infix
(+ 2 3) -- prefix
(2 3 +) -- postfix
bipush 2 -- JVM
bipush 3
iadd
the sum of 2 and 3 -- English
```
#NEW
==Parse trees and abstract syntax trees==
Parse tree (left): shows the concrete syntax (how tokens are grouped together)
- the tree initially constructed by the parser
Abstract tree (right): shows the semantically significant structure
- the tree returned by the parser and manipulated by the type checker
| [parse.png] [abstract.png]
#NEW
==The structure of trees==
Parse tree:
- nodes: **nonterminals** (= **syntactic categories**)
- leaves: **terminals** (= **tokens**)
Abstract tree:
- nodes: **constructor functions**
- leaves: atoms (= zero-place constructor functions)
| [parse.png] [abstract.png]
#NEW
==The definition and construction of trees==
Parse trees
are defined by **context-free grammars**
```
Exp ::= Exp "+" Exp ;
Exp ::= "2" ;
Exp ::= "3" ;
```
Abstract trees
are defined by **constructor type signatures**
```
plus : (Exp, Exp) -> Exp
2 : Exp
3 : Exp
```
#NEW
==From concrete to abstract syntax==
1. Give a name (= label) to each rule:
```
Exp ::= Exp "+" Exp ==> plus. Exp ::= Exp "+" Exp
```
2. Ignore terminals (= tokens, in quotes):
```
plus. Exp ::= Exp "+" Exp ==> plus. Exp ::= Exp Exp
```
3. Treat label as constructor name,
LHS (left-hand-side) as value type, RHS as argument types:
```
plus. Exp ::= Exp Exp ==> plus : (Exp, Exp) -> Exp
```
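The three steps can be sketched in Haskell. This is only an illustration: the types ``Item`` and ``Rule`` and the functions ``argTypes`` and ``signature`` are names made up here, not part of any tool.

```
import Data.List (intercalate)

-- An item on the right-hand side: a terminal (quoted token) or a nonterminal.
data Item = Terminal String | Nonterminal String

-- Step 1: a rule carries a label in addition to its LHS and RHS.
data Rule = Rule String String [Item]   -- label, LHS category, RHS items

-- Step 2: ignore terminals, keeping only the nonterminals of the RHS.
argTypes :: Rule -> [String]
argTypes (Rule _ _ items) = [c | Nonterminal c <- items]

-- Step 3: label = constructor name, LHS = value type, RHS = argument types.
signature :: Rule -> String
signature r@(Rule label cat _) = case argTypes r of
  []   -> label ++ " : " ++ cat
  args -> label ++ " : (" ++ intercalate ", " args ++ ") -> " ++ cat

-- The labelled rule  plus. Exp ::= Exp "+" Exp
plusRule :: Rule
plusRule = Rule "plus" "Exp" [Nonterminal "Exp", Terminal "+", Nonterminal "Exp"]
```

Under these assumptions, ``signature plusRule`` yields the string ``plus : (Exp, Exp) -> Exp``.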
#NEW
==One abstract, many concrete==
One abstract syntax tree can have infinitely many
concrete syntax representations.
```
2 + 3 -- infix
(+ 2 3) -- prefix
(2 3 +) -- postfix
bipush 2 -- JVM
bipush 3
iadd
the sum of 2 and 3 -- English
```
Remember: terminals don't matter.
#NEW
==Separating abstract from concrete syntax==
One could give a separate abstract syntax rule
```
fun plus : Exp -> Exp -> Exp
```
and functions computing the concrete syntax as **linearization**:
```
lin plus x y = x ++ "+" ++ y -- infix
lin plus x y = "(" ++ "+" ++ x ++ y ++ ")" -- prefix
lin plus x y = "(" ++ x ++ y ++ "+" ++ ")" -- postfix
lin plus x y = x ++ y ++ "iadd" -- JVM
lin plus x y = "the sum of" ++ x ++ "and" ++ y -- English
```
This leads to more expressive grammars, definable in
[GF http://grammaticalframework.org] (Grammatical Framework).
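The same separation can be imitated in plain Haskell, as a rough sketch rather than actual GF code: one datatype plays the role of the abstract syntax, and each linearization is an ordinary function from trees to strings. The function names and the ``EInt`` constructor are assumptions made for this example.

```
-- Abstract syntax: one tree type shared by all concrete syntaxes.
data Exp = EPlus Exp Exp | EInt Integer

-- Each linearization maps the same tree to a different string.
-- (The infix version inserts no parentheses, so it is only
-- faithful for trees that do not need them.)
infixLin, prefixLin, postfixLin, jvmLin, englishLin :: Exp -> String

infixLin (EInt n)    = show n
infixLin (EPlus x y) = infixLin x ++ " + " ++ infixLin y

prefixLin (EInt n)    = show n
prefixLin (EPlus x y) = "(+ " ++ prefixLin x ++ " " ++ prefixLin y ++ ")"

postfixLin (EInt n)    = show n
postfixLin (EPlus x y) = "(" ++ postfixLin x ++ " " ++ postfixLin y ++ " +)"

-- One JVM instruction per line.
jvmLin (EInt n)    = "bipush " ++ show n
jvmLin (EPlus x y) = jvmLin x ++ "\n" ++ jvmLin y ++ "\niadd"

englishLin (EInt n)    = show n
englishLin (EPlus x y) = "the sum of " ++ englishLin x ++ " and " ++ englishLin y

-- The tree for the sum of 2 and 3.
example :: Exp
example = EPlus (EInt 2) (EInt 3)
```

Applying the five functions to ``example`` reproduces the five concrete forms of the sum listed earlier.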
#NEW
==(For fun) using GF==
Concrete syntaxes can be different languages.
GF has tools for visualizing trees and word alignment:
[``tournesol.cs.chalmers.se:41296`` http://tournesol.cs.chalmers.se:41296]
The fridge magnet interface:
[``tournesol.cs.chalmers.se:41296/fridge`` http://tournesol.cs.chalmers.se:41296/fridge]
#NEW
==The main idea of compilation==
To compile:
- **parse** with the concrete syntax of source language
- **linearize** the resulting tree into the target language
```
+ bipush 2
2 + 3 -------> / \ -----------> bipush 3
parse 2 3 linearize iadd
```
This is the idea of **syntax-directed translation**.
Notice: both the parser and the linearizer are generated automatically from a grammar.
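To make the picture concrete, here is a hand-written miniature of the pipeline in Haskell. It is a sketch under strong assumptions: the source language consists only of whitespace-separated sums of non-negative integer literals, and the names ``parse``, ``linearize`` and ``compile`` are chosen for this example. A real front end would be generated from a grammar instead.

```
import Data.Char (isDigit)

data Exp = EPlus Exp Exp | EInt Integer

-- Parse a sum such as "2 + 3", grouping to the left.
-- Returns Nothing for any input outside this tiny language.
parse :: String -> Maybe Exp
parse s = case words s of
    []       -> Nothing
    (t : ts) -> atom t >>= \e -> go e ts
  where
    atom t
      | not (null t) && all isDigit t = Just (EInt (read t))
      | otherwise                     = Nothing
    go e []               = Just e
    go e ("+" : t : rest) = atom t >>= \e' -> go (EPlus e e') rest
    go _ _                = Nothing

-- Linearize a tree into a list of JVM instructions.
-- (bipush only fits small constants; enough for this sketch.)
linearize :: Exp -> [String]
linearize (EInt n)    = ["bipush " ++ show n]
linearize (EPlus x y) = linearize x ++ linearize y ++ ["iadd"]

-- Compilation = parsing followed by linearization.
compile :: String -> Maybe [String]
compile = fmap linearize . parse
```

With these definitions, ``compile "2 + 3"`` gives ``Just ["bipush 2","bipush 3","iadd"]``, and malformed input such as ``"2 +"`` gives ``Nothing``.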
#NEW
==Algebraic datatypes==
Abstract syntax can be expressed as **algebraic datatypes**.
They have direct support in Haskell, as ``data`` types.
You just have to follow the syntax conventions: constructors
begin with capital letters.
```
data Exp = Eplus Exp Exp
| E2
| E3
```
This is one reason why Haskell is so well suited for
compiler construction.
But: we will show later how algebraic datatypes are
encoded in Java.
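As a small illustration of why such datatypes are convenient, a function over trees can be written with one equation per constructor. The function ``eval`` below is an example added here and assumes the three-constructor datatype shown above.

```
data Exp = Eplus Exp Exp
         | E2
         | E3

-- Interpret a tree by pattern matching on its constructors.
eval :: Exp -> Integer
eval (Eplus x y) = eval x + eval y
eval E2          = 2
eval E3          = 3
```

For instance, ``eval (Eplus E2 E3)`` computes 5.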
#NEW
==Context-free grammars==
Concrete syntax is described by **context-free grammars**.
Context-free grammar = **BNF grammar** (Backus-Naur form).
The mathematical definition is simple:
A context-free grammar is a quadruple (//T,N,S,R//) where
- //T// and //N// are disjoint sets, called **terminals** and
**nonterminals**, respectively
- //S// is a nonterminal, the **start category**
- //R// is a finite set of **rules**
- a rule is a pair (//C//, //t_1//...//t_n//) where
- //C// is a nonterminal
- each //t_i// is a terminal or a nonterminal
- //n// >= 0
Example (the one above):
- //T// = {"+", "2", "3"}
- //N// = {Exp}
- //S// = Exp
- //R// = {(Exp, Exp "+" Exp), (Exp, "2"), (Exp, "3")}
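The quadruple can be written down directly as a Haskell record. This is a sketch with simplifying assumptions: sets are represented as lists, symbols as plain strings, and the names ``Grammar``, ``expGrammar`` and ``wellFormed`` are invented for the example.

```
type Symbol = String

-- A context-free grammar as a quadruple (T, N, S, R).
data Grammar = Grammar
  { terminals    :: [Symbol]
  , nonterminals :: [Symbol]
  , start        :: Symbol
  , rules        :: [(Symbol, [Symbol])]   -- pairs (C, t_1 ... t_n)
  }

-- The example grammar for sums of 2 and 3.
expGrammar :: Grammar
expGrammar = Grammar
  { terminals    = ["+", "2", "3"]
  , nonterminals = ["Exp"]
  , start        = "Exp"
  , rules        = [("Exp", ["Exp", "+", "Exp"]), ("Exp", ["2"]), ("Exp", ["3"])]
  }

-- Check the side conditions of the definition.
wellFormed :: Grammar -> Bool
wellFormed g =
     all (`notElem` nonterminals g) (terminals g)   -- T and N are disjoint
  && start g `elem` nonterminals g                  -- S is a nonterminal
  && all okRule (rules g)
  where
    -- C is a nonterminal; each t_i is a terminal or a nonterminal
    okRule (c, ts) = c `elem` nonterminals g && all known ts
    known t        = t `elem` terminals g || t `elem` nonterminals g
```

Here ``wellFormed expGrammar`` holds, while a grammar whose start category is not in //N// is rejected.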
#NEW
==The BNF Converter==
We will follow the notation of BNF Converter (= BNFC), where
- rules have the form ``C ::= ... ;``
- terminals are quoted strings e.g. ``"+"``
- nonterminals are unquoted identifiers e.g. ``Exp``
- each rule is preceded by a label and a dot, e.g. ``EPlus.``
From a BNF grammar, the program automatically generates
- abstract syntax definition
- parser
- linearizer
- lexer
- all this in C, C++, C#, Haskell, Java, OCaml
Today, we will look at how BNF grammars are written, independently of BNFC.
#NEW
==Abstract and concrete syntax of BNF==
In a sense, the mathematical definition of BNF is its abstract syntax!
Concrete syntaxes vary: for instance, in linguistics, the common form is
```
Exp -> Exp + Exp
```
In the ANSI C specification (Kernighan and Ritchie), the form is
//Exp: Exp// ``+`` //Exp//
It is also common to group the rules with common LHS:
```
Exp ::= Exp "+" Exp | "2" | "3"
```
This is often called **extended BNF** (which also has other abbreviations).
#NEW
==Example: BNF in BNF==
This is a subset of the
[full definition http://www.cs.chalmers.se/~markus/BNFC/BNF.cf]
```
Gr. Grammar ::= ListRule ;
Rul. Rule ::= Ident "." Ident "::=" ListItem ;
ITerm. Item ::= String ;
INonterm. Item ::= Ident ;
NilRule. ListRule ::= ;
ConsRule. ListRule ::= Rule ";" ListRule ;
NilItem. ListItem ::= ;
ConsItem. ListItem ::= Item ListItem ;
```
The grammar uses two primitive (i.e. internally defined) nonterminals:
- ``String``: quoted string
- ``Ident`` : letter optionally followed by letters, digits, and ``_ '``
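Reading the labels as constructors, the abstract syntax of this grammar can be sketched as Haskell datatypes, with the list categories represented as Haskell lists. The printer ``printRule`` and the example value ``plusRule`` are additions for illustration only.

```
newtype Ident = Ident String

data Grammar = Gr [Rule]
data Rule    = Rul Ident Ident [Item]          -- label, category, items
data Item    = ITerm String | INonterm Ident

-- The rule  EPlus. Exp ::= Exp "+" Exp  as an abstract syntax tree.
plusRule :: Rule
plusRule = Rul (Ident "EPlus") (Ident "Exp")
               [INonterm (Ident "Exp"), ITerm "+", INonterm (Ident "Exp")]

-- Linearize a rule back into the concrete syntax of the grammar.
printRule :: Rule -> String
printRule (Rul (Ident l) (Ident c) items) =
  unwords ([l ++ ".", c, "::="] ++ map printItem items)

printItem :: Item -> String
printItem (ITerm s)            = show s   -- show adds the quotes
printItem (INonterm (Ident c)) = c
```

So a BNF rule is itself a tree, and ``printRule plusRule`` gives back ``EPlus. Exp ::= Exp "+" Exp``.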
#NEW
==Notations for abstract syntax trees==
Graphically:
| [abstract0.png]
Haskell notation: label followed by subtrees, in parentheses if necessary:
```
EPlus 2 (ETimes 3 4)
```
Lisp notation: the same, but always in parentheses:
```
(EPlus 2 (ETimes 3 4))
```
(The original plan was to give Lisp a separate concrete syntax later!)
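The difference between the two notations is just a matter of where parentheses are placed, as the following sketch shows. The datatype and the names ``haskellNotation`` and ``lispNotation`` are assumptions for this example; integer atoms are printed as bare numerals, as in the trees above.

```
data Exp = EPlus Exp Exp | ETimes Exp Exp | EInt Integer

-- Haskell notation: parentheses only around nested applications.
haskellNotation :: Exp -> String
haskellNotation = go False
  where
    go _      (EInt n)     = show n
    go nested (EPlus x y)  = paren nested (unwords ["EPlus", go True x, go True y])
    go nested (ETimes x y) = paren nested (unwords ["ETimes", go True x, go True y])
    paren True  s = "(" ++ s ++ ")"
    paren False s = s

-- Lisp notation: parentheses around every application.
lispNotation :: Exp -> String
lispNotation (EInt n)     = show n
lispNotation (EPlus x y)  = "(" ++ unwords ["EPlus", lispNotation x, lispNotation y] ++ ")"
lispNotation (ETimes x y) = "(" ++ unwords ["ETimes", lispNotation x, lispNotation y] ++ ")"

example :: Exp
example = EPlus (EInt 2) (ETimes (EInt 3) (EInt 4))
```

Applied to ``example``, the two functions produce exactly the two strings shown above.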
#NEW
==Writing a grammar: the main programming language structures==
Usually it is good to proceed top-down: start from the largest program
structures and proceed to the smallest.
In many languages, the levels are roughly:
- modules
- definitions
- statements
- expressions
- atoms
Functional languages skip the statement level.
#NEW
==Example: a simple imperative language==
Each file contains a module, which is a sequence of function definitions.
A definition has a header and a sequence of statements.
Statements are built from other statements and expressions.
Expressions are built from other expressions and atoms.
```
int plus (int x, int y) // definition
{
return x + y ; // statement
}
int test() // definition
{
int x ; // statements
x = readInt() ;
int y ;
y = readInt() ;
return plus(x,y) ;
}
```
#NEW
==Rules for modules and function definitions==
A module is a list of definitions
```
Mod. Module ::= ListDef ;
NilDef. ListDef ::= ; -- empty list
ConsDef. ListDef ::= Def ListDef ; -- add one more
```
A function definition has a header and a list of statements in curly brackets
```
Fun. Def ::= Header "{" ListStm "}" ;
NilStm. ListStm ::= ;
ConsStm. ListStm ::= Stm ";" ListStm ;
```
A header has a type, a name (an identifier), and a parameter declaration list
```
Head. Header ::= Type Ident "(" ListDecl ")" ;
```
#NEW
==Rule formats for sequences==
Sequences of different kinds are very common.
BNFC has a special notation for list categories: ``[C]``.
There is also a special notation for list rules ("Nil" and "Cons"):
```
terminator Stm ";" ;
```
abbreviates the two rules
```
NilStm. ListStm ::= ;
ConsStm. ListStm ::= Stm ";" ListStm ;
```
It says: "statements in a statement list are terminated by a semicolon".
There is also the form
```
separator Decl "," ;
```
which says: "declarations in a declaration list are separated by a comma".
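The difference between the two forms is easiest to see on the printing side. The following sketch uses two helper functions, ``terminated`` and ``separated``, which are names invented here and not BNFC functions.

```
import Data.List (intercalate)

-- terminator: every element is followed by the token.
terminated :: String -> [String] -> String
terminated tok = concatMap (++ tok)

-- separator: the token appears only between elements.
separated :: String -> [String] -> String
separated = intercalate
```

For a two-element list, ``terminated ";" ["x", "y"]`` gives ``x;y;`` whereas ``separated "," ["x", "y"]`` gives ``x,y``. Both give the empty string for the empty list.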
#NEW
==Rules for modules and function definitions: Version 2==
A module is a list of definitions
```
Mod. Module ::= [Def] ;
terminator Def "" ;
Fun. Def ::= Header "{" [Stm] "}" ;
terminator Stm ";" ;
Head. Header ::= Type Ident "(" [Decl] ")" ;
```
#NEW
==Rules for declarations, statements and expressions==
A declaration has a type and an identifier:
```
DTyp. Decl ::= Type Ident ;
```
A statement is a declaration, an expression, or a return of an expression:
```
SDecl. Stm ::= Decl ;
SExp. Stm ::= Exp ;
SRet. Stm ::= "return" Exp ;
```
An expression is an identifier, a function call, an assignment, or a sum:
```
EId. Exp ::= Ident ;
ECall. Exp ::= Ident "(" [Exp] ")" ;
EAss. Exp ::= Ident "=" Exp ;
EPlus. Exp ::= Exp "+" Exp ;
```
Expressions in a list are separated by commas:
```
separator Exp "," ;
```
#NEW
==Atoms==
There is a type of integers:
```
TInt. Type ::= "int" ;
```
The category ``Ident`` is defined internally in BNFC, and hence needs
no rule. (Actually, rules for internally defined categories are illegal.)
More on atoms in next week's lectures:
- other internally defined types
- how to define your own token types
#NEW
==The syntax tree of the example==
```
Mod [
Fun
(Head
TInt
(Ident "plus")
[DTyp TInt (Ident "x"), DTyp TInt (Ident "y")])
[SRet (EPlus (EId (Ident "x")) (EId (Ident "y")))],
Fun
(Head
TInt
(Ident "test")
[])
[SDecl (DTyp TInt (Ident "x")),
SExp (EAss (Ident "x") (ECall (Ident "readInt") [])),
SDecl (DTyp TInt (Ident "y")),
SExp (EAss (Ident "y") (ECall (Ident "readInt") [])),
SRet (ECall (Ident "plus") [EId (Ident "x"),EId (Ident "y")])]]
```
#NEW
==Abstract syntax representation in Haskell==
Each category is a datatype. Lists are represented as Haskell lists.
```
data Module =
Mod [Def]
data Def =
Fun Header [Stm]
data Header =
Head Type Ident [Decl]
data Stm =
SDecl Decl
| SExp Exp
| SRet Exp
```
Ident is also a datatype.
```
newtype Ident = Ident String
```
Syntax-directed translation is implemented as pattern matching (later lecture).
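As a preview, here is a hedged sketch of a tree traversal by pattern matching. It fills in the datatypes not shown above (``Decl``, ``Exp``, ``Type``) by reading them off the grammar rules. The function ``declared``, which collects the variable names declared in a function definition, is a hypothetical example and not part of the generated code.

```
newtype Ident = Ident String

data Type   = TInt
data Decl   = DTyp Type Ident
data Exp    = EId Ident | ECall Ident [Exp] | EAss Ident Exp | EPlus Exp Exp
data Stm    = SDecl Decl | SExp Exp | SRet Exp
data Header = Head Type Ident [Decl]
data Def    = Fun Header [Stm]

-- Names declared as parameters or by declaration statements.
declared :: Def -> [String]
declared (Fun (Head _ _ params) stms) =
     [x | DTyp _ (Ident x) <- params]
  ++ [x | SDecl (DTyp _ (Ident x)) <- stms]

-- The definition of test() from the example program.
testDef :: Def
testDef = Fun (Head TInt (Ident "test") [])
  [ SDecl (DTyp TInt (Ident "x"))
  , SExp (EAss (Ident "x") (ECall (Ident "readInt") []))
  , SDecl (DTyp TInt (Ident "y"))
  , SExp (EAss (Ident "y") (ECall (Ident "readInt") []))
  , SRet (ECall (Ident "plus") [EId (Ident "x"), EId (Ident "y")])
  ]
```

Here ``declared testDef`` evaluates to ``["x","y"]``.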
#NEW
==Abstract syntax representation in Java 1.5==
Each category is an abstract class.
```
public abstract class Stm {}
public abstract class Exp {}
```
Each syntax constructor is a subclass of its value type class.
```
public class SDecl extends Stm {
public final Decl decl_;
}
public class SExp extends Stm {
public final Exp exp_;
}
public class EPlus extends Exp {
public final Exp exp_1, exp_2;
}
```
Lists are treated using Java's linked lists.
```
public class ListDef extends java.util.LinkedList<Def> {
}
```
The classes have more methods: constructors, equality, visitors.
Syntax-directed translation is implemented with visitors (later lecture).
#NEW
==Ambiguity==
What is the syntax tree for this string?
```
2 + 3 + 4
```
Two possibilities are permitted by the grammar:
```
+ +
/ \ / \
+ 4 2 +
/ \ / \
2 3 3 4
```
Now, the arithmetic value is the same for both, so one
might think it doesn't matter. But just add minus to the
grammar, and consider
```
4 - 3 - 2
```
The above grammar is **ambiguous**: it gives more than
one parse result for some strings.
Sometimes the ambiguity does not affect the meaning, sometimes
it does.
Programming languages in general avoid ambiguity.
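The effect on meaning can be checked by evaluating both trees for ``4 - 3 - 2``. The datatype and the evaluator below are a sketch written for this purpose, with ``EMinus`` assumed as the constructor for subtraction.

```
data Exp = EMinus Exp Exp | EInt Integer

eval :: Exp -> Integer
eval (EInt n)     = n
eval (EMinus x y) = eval x - eval y

-- The two parse results for  4 - 3 - 2
leftTree, rightTree :: Exp
leftTree  = EMinus (EMinus (EInt 4) (EInt 3)) (EInt 2)   -- (4 - 3) - 2
rightTree = EMinus (EInt 4) (EMinus (EInt 3) (EInt 2))   -- 4 - (3 - 2)
```

The left tree evaluates to -1 and the right tree to 3, so here the choice of tree changes the result.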
#NEW
==Parentheses, associativity and precedence==
One way to avoid ambiguity in a language is to use parentheses:
```
Exp ::= "(" Exp "-" Exp ")"
```
Then the programmer is forced to write either of
```
(4 - (3 - 2)) ((4 - 3) - 2)
```
However, it is much more convenient to have conventions on parsing (and
evaluation order):
- **left associativity** - group as far to the left as possible:
```
4 - 3 - 2 == (4 - 3) - 2
```
- **precedence** - e.g. times "binds stronger" than plus:
```
4 + 3 * 2 == 4 + (3 * 2)
```
Parentheses always have priority over precedence and associativity:
```
(4 + 3) * 2 != 4 + 3 * 2
```
#NEW
==Precedence levels in BNFC==
Having precedence and associativity rules in addition to the grammar can be messy.
But they can simply be encoded in a BNF grammar by using **precedence levels**:
numeric subscripts on categories. Convention: ``Exp = Exp0``.
```
EInt. Exp3 ::= Integer ;
ETimes. Exp2 ::= Exp2 "*" Exp3 ;
EPlus. Exp1 ::= Exp1 "+" Exp2 ;
```
The following rules implement **coercions** between precedence levels.
```
_. Exp3 ::= "(" Exp ")" ;
_. Exp2 ::= Exp3 ;
_. Exp1 ::= Exp2 ;
_. Exp ::= Exp1 ;
```
Coercions are **semantic dummies**: they add nothing to the abstract
syntax tree. The symbol ``_`` on the constructor place is used to indicate this.
A shorthand for such groups of coercions is available:
```
coercions Exp 3 ;
```
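Seen from the linearization side, the levels tell a printer where parentheses are needed. The following is a hand-written sketch, not the printer that BNFC generates. In ``prt n e``, the number ``n`` is the level of the category expected at that position, and parentheses (the first coercion rule) are inserted when the expression belongs to a lower level.

```
data Exp = EInt Integer | ETimes Exp Exp | EPlus Exp Exp

-- prt n e: print e in a position where category Exp<n> is expected.
prt :: Int -> Exp -> String
prt _ (EInt i)     = show i                                          -- Exp3
prt n (ETimes x y) = paren (n > 2) (prt 2 x ++ " * " ++ prt 3 y)     -- Exp2 ::= Exp2 "*" Exp3
prt n (EPlus x y)  = paren (n > 1) (prt 1 x ++ " + " ++ prt 2 y)     -- Exp1 ::= Exp1 "+" Exp2

-- Wrap in parentheses when the condition holds.
paren :: Bool -> String -> String
paren b s = if b then "(" ++ s ++ ")" else s
```

For example, ``prt 0 (ETimes (EPlus (EInt 4) (EInt 3)) (EInt 2))`` gives ``(4 + 3) * 2``, while ``prt 0 (EPlus (EInt 4) (ETimes (EInt 3) (EInt 2)))`` gives ``4 + 3 * 2`` without parentheses.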
#NEW
==Precedence levels and trees==
By definition, parse trees show coercions but abstract trees don't.
| [parse0.png] [abstract0.png]
#NEW
==A complete grammar Imp.cf==
```
Mod. Module ::= [Def] ;
Fun. Def ::= Header "{" [Stm] "}" ;
terminator Def "" ;
Head. Header ::= Type Ident "(" [Decl] ")" ;
SDecl. Stm ::= Decl ;
SExp. Stm ::= Exp ;
SRet. Stm ::= "return" Exp ;
terminator Stm ";" ;
DTyp. Decl ::= Type Ident ;
separator Decl "," ;
EId. Exp2 ::= Ident ;
ECall. Exp1 ::= Ident "(" [Exp] ")" ;
EAss. Exp1 ::= Ident "=" Exp1 ;
EPlus. Exp ::= Exp "+" Exp1 ;
coercions Exp 2 ;
separator Exp "," ;
TInt. Type ::= "int" ;
```
#NEW
==A quick run with BNFC==
Install BNFC: see
```
http://digitalgrammars.com/bnfc/
```
Generate parser and other files from the grammar
```
bnfc -m Imp.cf
```
Compile a test program
```
make
```
Run the test program on file [``koe.imp`` exx/koe.imp]
```
./TestImp koe.imp
```
#NEW
==How to approach lab 1==
You can proceed by modifying [``Imp.cf`` exx/Imp.cf].
We will make a quick tour of [Lab 1 PM ../laborations/lab1/lab1.html]
Have a tight modify-compile-test loop!
More practical details in next Tuesday's lecture.