Lecture 2: Abstract and Concrete Syntax Programming Languages Course Aarne Ranta (aarne@chalmers.se) %!target:html %!postproc(html): #NEW %!postproc(html): #HR
Book: 2.8.2, 4.1 - 4.3 #NEW ==The central role of abstract syntax== Although lexing is the first compiler phase, we don't start from it. Instead, we start from the middle, abstract syntax, which is - the goal of lexing + parsing - the starting point of code generation - the domain of type checking and many optimizations - the hub between the front end and the back end This lecture: how syntax rules look like. Next lecture: how to write a grammar to generate a compiler front end. #NEW ==Abstract and concrete syntax== **Abstract syntax**: what are the significant parts of the expression? Example: a sum expression has its two operand expressions as its significant parts [abstract.png] **Concrete syntaz**: what does the expression look like? Example: //the same// sum expression can look in different ways: ``` 2 + 3 -- infix (+ 2 3) -- prefix (2 3 +) -- postfix bipush 2 -- JVM bipush 3 iadd the sum of 2 and 3 -- English ``` #NEW ==Parse trees and abstract syntax trees== Parse tree (left): show the concrete syntax (how tokens are grouped together) - the tree initially constructed by the parser Abstract tree (right): show the semantically significant structure - the tree returned by the parser and manipulated by type checker | [parse.png] [abstract.png] #NEW ==The structure of trees== Parse tree: - nodes: **nonterminals** (= **syntactic categories**) - leaves: **terminals** (= **tokens**) Abstract tree: - nodes: **constructor functions** - leaves: atoms (= zero-place constructor functions) | [parse.png] [abstract.png] #NEW ==The definition and construction of trees== Parse trees are defined by **context-free grammars** ``` Exp ::= Exp "+" Exp ; Exp ::= "2" ; Exp ::= "3" ; ``` Abstract trees are defined by **constructor type signatures** ``` plus : (Exp, Exp) -> Exp 2 : Exp 3 : Exp ``` #NEW ==From concrete to abstract syntax== 1. Give a name (= label) to each rule: ``` Exp ::= Exp "+" Exp ==> plus. Exp ::= Exp "+" Exp ``` 2. Ignore terminals (= tokens, in quotes): ``` plus .Exp ::= Exp "+" Exp ==> plus. Exp ::= Exp Exp ``` 3. Treat label as constructor name, LHS (left-hand-side) as value type, RHS as argument types: ``` plus. Exp ::= Exp Exp ==> plus : (Exp, Exp) -> Exp ``` #NEW ==One abstract, many concrete== One abstract syntax tree can have infinitely many concrete syntax representations. ``` 2 + 3 -- infix (+ 2 3) -- prefix (2 3 +) -- postfix bipush 2 -- JVM bipush 3 iadd the sum of 2 and 3 -- English ``` Remember: terminals don't matter. #NEW ==Separating abstract from concrete syntax== One could give a separate abstract syntax rule ``` fun plus : Exp -> Exp -> Exp ``` and functions computing the concrete syntax as **linearization**: ``` lin plus x y = x ++ "+" ++ y -- infix lin plus x y = "(" ++ "+" ++ x ++ y ++ ")" -- prefix lin plus x y = "(" ++ x ++ y ++ "+" ++ ")" -- postfix lin plus x y = x ++ y ++ "iadd" -- JVM lin plus x y = "the sum of" ++ x ++ "and" ++ y -- English ``` This leads to more expressive grammars, definable in [GF http://grammaticalframework.org] (Grammatical Framework). #NEW ==(For fun) using GF== Concrete syntaxes can be different languages. GF has tools for visualizing trees and word alignment: [``tournesol.cs.chalmers.se:41296`` http://tournesol.cs.chalmers.se:41296] The fridge magnet interface: [``tournesol.cs.chalmers.se:41296/fridge`` http://tournesol.cs.chalmers.se:41296/fridge] #NEW ==The main idea of compilation== To compile: - **parse** with the concrete syntax of source language - **linearize** the resulting tree into the target language ``` + bipush 2 2 + 3 -------> / \ -----------> bipush 3 parse 2 3 linearize iadd ``` This is the idea of **syntax-directed translation**. Notice: from a grammar, both parsing and linearization are created automatically. #NEW ==Algebraic datatypes== Abstract syntax can be expressed as **algebraic datatypes**. They have a direct support in Haskell, as ``data`` types. You just have to follow the syntax convetions: constructor begin with capital letters. ``` data Exp = Eplus Exp Exp | E2 | E3 ``` This is one reason why Haskell is so well suited for compiler construction. But: we will show later how algebraic datatypes are encoded in Java. #NEW ==Context-free grammars== Concrete syntax is described by **context-free grammars**. Context-free grammar = **BNF grammar** (Backus-Naur form). The mathematical definition is simple: A context-free grammar is a quadruple (//T,N,S,R//) where - //T// and //N// are disjoint sets, called **terminals** and **nonterminals**, respectively - //S// is a nonterminal, the **start category** - //R// is a finite set of **rules** - a rule is a pair (//C//, //t_1//...//t_n//) where - //C// is a nonterminal - each //t_1// is a terminal or a nonterminal - n >= 0 Example (the one above): - //T// = {"+", "2", "3"} - //N// = {Exp} - //S// = Exp - //R// = ((Exp, Exp "+" Exp), (Exp, "2"), (Exp, "3")) #NEW ==The BNF Converter== We will follow the notation of BNF Converter (= BNFC), where - rules have the form ``C ::= ... ;`` - terminals are quoted strings e.g. ``"+"`` - nonterminals are unquoted identifiers e.g. ``Exp`` - each rule is preceded by a label and a dot, e.g. ``EPlus.`` From a BNF grammar, the program automatically generates - abstract syntax definition - parser - linearizer - lexer - all this in C, C++, C#, Haskell, Java, OCaml Today, we will look at how BNF grammars are written, independently of BNFC. #NEW ==Abstract and concrete syntax of BNF== In a sense, the mathematical definition of BNF is its abstract syntax! Concrete syntaxes vary: for instance, in linguistics, the common form is ``` Exp -> Exp + Exp ``` In Ansi C specification (Kernighan and Ritchie), //Exp: Exp// ``+`` //Exp// It is also common to group the rules with common LHS: ``` Exp ::= Exp "+" Exp | "2" | "3" ``` This is often called **extended BNF** (which also has other abbreviations). #NEW ==Example: BNF in BNF== This is a subset of the [full definition http://www.cs.chalmers.se/~markus/BNFC/BNF.cf] ``` Gr. Grammar ::= ListRule ; Rul. Rule ::= Ident "." Ident "::=" ListItem ; ITerm. Item ::= String ; INonterm. Item ::= Ident ; NilRule. ListRule ::= ; ConsRule. ListRule ::= Rule ";" ListRule ; NilItem. ListItem ::= ; ConsItem. ListItem ::= Item ListItem ; ``` The grammar uses two primitive (i.e. internally defined) nonterminals: - ``String``: quoted string - ``Ident`` : letter optionally followed by letters, digits, and ``_ '`` #NEW ==Notations for abstract syntax trees== Graphically: | [abstract0.png] Haskell notation: label followed by subtrees, in parentheses if necessary: ``` EPlus 2 (ETimes 3 4) ``` Lisp notation: the same, but always in parentheses: ``` (EPlus 2 (ETimes 3 4)) ``` (The original plan was to give Lisp a separate concrete syntax later!) #NEW ==Writing a grammar: the main programming language structures== Usually it is good to proceed top-down: start from the largest program structures and proceed to the smallest. In many languages, the levels are roughly: - modules - definitions - statements - expressions - atoms Functional languages skip the statement level. #NEW ==Example: a simple imperative language== Each file contains a module, which is a sequence of function definitions. A definition has a header and a sequence of statemens. Statements are built from other statements and expressions. Expressions are built from other expressions and atoms. ``` int plus (int x, int y) // definition { return x + y ; // statement } int test() // definition { int x ; // statements x = readInt() ; int y ; y = readInt() ; return plus(x,y) ; } ``` #NEW ==Rules for modules and function definitions== A module is a list of definitions ``` Mod. Module ::= ListDef ; NilDef. ListDef ::= ; -- empty list ConsDef. ListDef ::= Def ListDef ; -- add one more ``` A function definition has a header and a list of statements in curly brackets ``` Fun. Def ::= Header "{" ListStm "}" ; NilStm. ListStm ::= ; ConsStm. ListStm ::= Stm ";" ListStm ; ``` A header has a type, a name (an identifier), and a parameter declaration list ``` Head. Header ::= Type Ident "(" ListDecl ")" ; ``` #NEW ==Rule formats for sequences== Sequences of different kinds are very common. BNFC has a special notation for list categories: ``[C]``. There is also a special notation for list rule ("Nil" and "Cons"): ``` terminator Stm ";" ; ``` abbreviates the two rules ``` NilStm. ListStm ::= ; ConsStm. ListStm ::= Stm ";" ListStm ; ``` It says: "statements in a statement list are terminated by a semicolon". There is also the form ``` separator Decl "," ; ``` which says: "declarations in a declaration list are separated by a comma". #NEW ==Rules for modules and function definitions: Version 2== A module is a list of definitions ``` Mod. Module ::= [Def] ; terminator Def "" ; Fun. Def ::= Header "{" [Stm] "}" ; terminator Stm ";" ; Head. Header ::= Type Ident "(" [Decl] ")" ; ``` #NEW ==Rules for declarations, statements and expressions== A declaration has a type and an identifier: ``` DTyp. Decl ::= Type Ident ; ``` A statement is a declaration, an expression, or a return of an expression: ``` SDecl. Stm ::= Decl ; SExp. Stm ::= Exp ; SRet. Stm ::= "return" Exp ; ``` An expression is an identifier, a function call, an assignment, or a sum: ``` EId. Exp ::= Ident ; ECall. Exp ::= Ident "(" [Exp] ")" ; EAss. Exp ::= Ident "=" Exp ; EPlus. Exp ::= Exp "+" Exp ; ``` Expressions in a list are separated by commas: ``` separator Exp "," ; ``` #NEW ==Atoms== There is a type of integers: ``` TInt. Type ::= "int" ; ``` The category ``Ident`` is defined internally in BNFC, and needs hence no rule. (Actually, rules for internally defined categories are illegal.) More on atoms in next week's lectures: - other internally defined types - how to define your own token types #NEW ==The syntax tree of the example== ``` Mod [ Fun (Head TInt (Ident "plus") [DTyp TInt (Ident "x"), DTyp TInt (Ident "y")]) [SRet (EPlus (EId (Ident "x")) (EId (Ident "y")))], Fun (Head TInt (Ident "test") []) [SDecl (DTyp TInt (Ident "x")), SExp (EAss (Ident "x") (ECall (Ident "readInt") [])), SDecl (DTyp TInt (Ident "y")), SExp (EAss (Ident "y") (ECall (Ident "readInt") [])), SRet (ECall (Ident "plus") [EId (Ident "x"),EId (Ident "y")])]] ``` #NEW ==Abstract syntax representation in Haskell== Each category is a datatype. Lists are represented as Haskell lists. ``` data Module = Mod [Def] data Def = Fun Header [Stm] data Header = Head Type Ident [Decl] data Stm = SDecl Decl | SExp Exp | SRet Exp ``` Ident is also a datatype. ``` newtype Ident = Ident String ``` Syntax-directed translation is implemented as pattern matching (later lecture). #NEW ==Abstract syntax representation in Java 1.5== Each category is an abstract class. ``` public abstract class Stm public abstract class Exp ``` Each syntax constructor is a subclass of its value type class. ``` public class SDecl extends Stm { public final Decl decl_; } public class SExp extends Stm { public final Exp exp_; } public class EPlus extends Exp { public final Exp exp_1, exp_2; } ``` Lists are treated using Java's linked lists. ``` public class ListDef extends java.util.LinkedList { } ``` The classes have more methods: constructors, equality, visitors. Syntax-directed translation is implemented with visitors (later lecture). #NEW ==Ambiguity== What is the syntax tree for this string? ``` 2 + 3 + 4 ``` Two possibilities are permitted by the grammar: ``` + + / \ / \ + 4 2 + / \ / \ 2 3 3 4 ``` Now, the arithmetic value is the same for both, so one might think it doesn't matter. But just add minus to the grammar, and consider ``` 4 - 3 - 2 ``` The above grammar is **ambiguous**: it gives more than one parse result for some strings. Sometimes the ambiguity does not affect the meaning, sometimes it does. Programming languages in general avoid ambiguity. #NEW ==Parentheses, ssociativity and precedence== One way to avoid ambiguity in a language is to use parentheses: ``` Exp ::= "(" Exp "-" Exp ")" ``` Then the programmer is forced to write either of ``` (4 - (3 -2)) ((4 - 3) - 2) ``` However, it is much convenient to have conventions on parsing (and evaluation order): - **left associativity** - group as long left as possible: ``` 4 - 3 - 2 == (4 - 3) - 2 ``` - **precedence** - e.g. times "binds stronger" than plus: ``` 4 + 3 * 2 == 4 + (3 * 2) ``` Parentheses always have priority over precedence and associativity: ``` (4 + 3) * 2 != 4 + 3 * 2 ``` #NEW ==Precedence levels in BNFC== Having precedence and associativity rules in addition to grammar can be messy. But they can simply be encoded in a BNF grammar by using **precedence levels**: numeral subscripts to categories. Convention: ``Exp = Exp0``. ``` EInt. Exp3 ::= Integer ETimes. Exp2 ::= Exp2 "*" Exp3 EPlus. Exp1 ::= Exp1 "+" Exp2 ``` The following rules implement **coercions** between precedence levels. ``` _. Exp3 ::= "(" Exp ")" _. Exp2 ::= Exp3 _. Exp1 ::= Exp2 _. Exp ::= Exp1 ``` Coercions are **semantic dummies**: they add nothing to the abstract syntax tree. The symbol ``_`` on the constructor place is used to indicate this. A shorthand for such coercions groups is available: ``` coercions Exp 3 ; ``` #NEW ==Precedence levels and trees== By definition, parse trees show coercions but abstract trees don't. | [parse0.png] [abstract0.png] #NEW ==A complete grammar Imp.cf== ``` Mod. Module ::= [Def] ; Fun. Def ::= Header "{" [Stm] "}" ; terminator Def "" ; Head. Header ::= Type Ident "(" [Decl] ")" ; SDecl. Stm ::= Decl ; SExp. Stm ::= Exp ; SRet. Stm ::= "return" Exp ; terminator Stm ";" ; DTyp. Decl ::= Type Ident ; separator Decl "," ; EId. Exp2 ::= Ident ; ECall. Exp1 ::= Ident "(" [Exp] ")" ; EAss. Exp1 ::= Ident "=" Exp1 ; EPlus. Exp ::= Exp "+" Exp1 ; coercions Exp 2 ; separator Exp "," ; TInt. Type ::= "int" ; ``` #NEW ==A quick run with BNFC== Install BNFC: see ``` http://digitalgrammars.com/bnfc/ ``` Generate parser and other files from the grammar ``` bnfc -m Imp.cf ``` Compile a test program ``` make ``` Run the test program on file [``koe.imp`` exx/koe.imp] ``` ./TestImp koe.imp ``` #NEW ==How to approach lab 1== You can proceed by modifying [``Imp.cf`` exx/Imp.cf] We will make a quick tour of [Lab 1 PM ../laborations/lab1/lab1.html] Have a tight modify-compile-test loop! More practical details in next Tuesday's lecture.