Lecture 12: Design and Evolution of Programming Languages
Programming Languages Course
Aarne Ranta (aarne@chalmers.se)
%!target:html
%!postproc(html): #NEW
%!postproc(html): #HR
%!postproc(html): #sub1 1
%!postproc(html): #subn1 n-1
%!postproc(html): #subn n
Book: 1.3, 1.5, 1.6
[Minilanguages http://www.catb.org/~esr/writings/taoup/html/minilanguageschapter.html],
from Eric S. Raymond,
[The Art of Unix Programming http://www.catb.org/~esr/writings/taoup/]
[The Hundred-Year Language http://www.paulgraham.com/hundred.html]
from Paul Graham,
[Hackers & Painters http://www.paulgraham.com/hackpaint.html]
[The Retrocomputing Museum http://www.catb.org/~esr/retro/]
#NEW
==Plan==
How simple can a programming language be?
Turing-completeness
Some programming language history
General and special purpose languages
Case study: the evolution of BNFC
Data formats, XML
#NEW
==Turing completeness==
Before electronic computers were built, mathematical models of computation
were developed. All of these were proven equivalent:
- The **Turing Machine** (Alan Turing), similar to imperative programming.
- **Lambda Calculus** (Alonzo Church), similar to functional programming.
- **Post machines** (Emil Post)
- **Recursive functions**
- **Rewriting systems**
Any programming language equivalent to one of these is said to be
**Turing-complete**.
All usual general-purpose languages are Turing-complete.
#NEW
==Lambda calculus as a programming language==
(Material from [Wikipedia http://en.wikipedia.org/wiki/Lambda_calculus])
Recall how simple lambda calculus is:
```
Exp ::= Ident | "(" Exp Exp ")" | "\" Ident "->" Exp
```
We don't even need integers, because they can be defined as follows:
```
0 = \f -> \x -> x
1 = \f -> \x -> f x
2 = \f -> \x -> f (f x)
3 = \f -> \x -> f (f (f x))
...
```
In other words: number //n// is a higher-order function that applies
any function //n// times to any given argument.
These functions are known as **Church numerals**.
#NEW
==Arithmetic in lambda calculus==
Addition of Church numerals is defined as follows:
```
PLUS = \m -> \n -> \f -> \x -> n f (m f x)
```
Example:
```
PLUS 2 3
= (\m -> \n -> \f -> \x -> n f (m f x))
(\f -> \x -> f (f x)) (\f -> \x -> f (f (f x)))
= \f -> \x -> (\f -> \x -> f (f (f x))) f ((\f -> \x -> f (f x)) f x)
= \f -> \x -> (\f -> \x -> f (f (f x))) f (f (f x))
= \f -> \x -> f (f (f (f (f x))))
= 5
```
Multiplication:
```
MULT = \m -> \n -> m (PLUS n) 0
```
Idea: add //n// to 0 //m// times.
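The definitions above can be tried out almost verbatim in Haskell. The following is a minimal sketch, not part of the original material: the type synonym ``Church`` and the helper ``toInt`` are names assumed here for illustration. Note that ``mult`` needs its first argument at a higher type, because it iterates a function on numerals.
```
-- A Church numeral takes a function and an argument and applies
-- the function n times. The synonym is an assumption of this sketch.
type Church a = (a -> a) -> a -> a

zero, one, two, three :: Church a
zero  = \f -> \x -> x
one   = \f -> \x -> f x
two   = \f -> \x -> f (f x)
three = \f -> \x -> f (f (f x))

-- PLUS = \m -> \n -> \f -> \x -> n f (m f x)
plus :: Church a -> Church a -> Church a
plus = \m -> \n -> \f -> \x -> n f (m f x)

-- MULT = \m -> \n -> m (PLUS n) 0
-- Here m iterates the function (plus n), so it is used at type
-- Church (Church a).
mult :: Church (Church a) -> Church a -> Church a
mult = \m -> \n -> m (plus n) zero

-- Hypothetical helper: convert to an ordinary Int by applying (+ 1)
-- to 0 as many times as the numeral says.
toInt :: Church Int -> Int
toInt n = n (+ 1) 0
```
In GHCi, ``toInt (plus two three)`` evaluates to 5 and ``toInt (mult two three)`` to 6.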
#NEW
==Control structures in lambda calculus==
Booleans (**Church booleans**) and conditions
```
TRUE = \x -> \y -> x
FALSE = \x -> \y -> y
IFTHENELSE = \b -> \x -> \y -> b x y
AND = \a -> \b -> IFTHENELSE a b FALSE
OR = \a -> \b -> IFTHENELSE a TRUE b
```
Recursion, via the **fix-point combinator**
```
Y = \g -> (\x -> g (x x)) (\x -> g (x x))
```
This has the property
```
Y g = g (Y g)
```
which means iterating ``g`` infinitely many times.
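The property ``Y g = g (Y g)`` can be illustrated in Haskell. This is a sketch under an assumption: the self-application ``x x`` in the definition of ``Y`` is rejected by Haskell's type checker, so the combinator is defined here directly by its characteristic equation. The names ``fix``, ``factStep`` and ``factorial`` are chosen for this example only.
```
-- Fix-point combinator defined by the equation  Y g = g (Y g).
-- Lazy evaluation unfolds it only as far as needed.
fix :: (a -> a) -> a
fix g = g (fix g)

-- A non-recursive step function: the argument self stands for
-- the recursive call.
factStep :: (Integer -> Integer) -> Integer -> Integer
factStep self n = if n <= 0 then 1 else n * self (n - 1)

-- Recursion is obtained by taking the fix-point of the step function.
factorial :: Integer -> Integer
factorial = fix factStep
```
For example, ``factorial 5`` evaluates to 120, although neither ``factStep`` nor ``factorial`` refers to itself.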
#NEW
==Lambda calculus as a programming language?==
Writing programs in lambda calculus is possible!
But it is inconvenient and inefficient.
However, it is a good starting point for a language
to have a very small **core language**.
The implementation (compiler, interpreter) is then
done for the core language with syntactic sugar
and possibly optimizations.
Lisp is built from lambda calculus with a few additions,
such as a primitive notion of lists.
Haskell has a small core language based on lambda
calculus with algebraic datatypes and pattern matching,
and primitive number types.
#NEW
==Brainf*ck: another Turing-complete language==
The following material is from
[this website http://www.muppetlabs.com/~breadbox/bf/].
The language was designed by Urban Müller, whose goal was a Turing-complete
language for which one could write the smallest compiler ever. His compiler was 240 bytes in size.
A Brainfuck program has an implicit byte pointer, called "the pointer",
which is free to move around within an array of 30000 bytes, initially
all set to zero. The pointer itself is initialized to point to the
beginning of this array.
The Brainfuck programming language consists of eight commands,
each of which is represented as a single character.
```
> Increment the pointer.
< Decrement the pointer.
+ Increment the byte at the pointer.
- Decrement the byte at the pointer.
. Output the byte at the pointer.
, Input a byte and store it in the byte at the pointer.
[ Jump forward past the matching ] if the byte at the pointer is zero.
] Jump backward to the matching [ unless the byte at the pointer is zero.
```
#NEW
==Semantics of BF via translation to C==
```
> becomes ++p;
< becomes --p;
+ becomes ++*p;
- becomes --*p;
. becomes putchar(*p);
, becomes *p = getchar();
[ becomes while (*p) {
] becomes }
```
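The translation table above can also be read as an interpreter. The following Haskell sketch is one possible rendering, with some assumptions that are not in the original: the tape grows on demand instead of having 30000 cells, bytes wrap around modulo 256, reading at end of input stores 0, and brackets are assumed to be balanced. All names (``Cmd``, ``parse``, ``run``, ``runBF``) are hypothetical.
```
-- Abstract syntax: loops are nested, all other characters are comments.
data Cmd = Next | Prev | Inc | Dec | Output | Input | Loop [Cmd]

-- Parse a program; assumes balanced brackets (a stray ']' ends parsing).
parse :: String -> [Cmd]
parse = fst . go
  where
    go :: String -> ([Cmd], String)
    go []         = ([], [])
    go (']' : cs) = ([], cs)
    go ('[' : cs) = (Loop body : more, rest')
      where (body, rest)  = go cs
            (more, rest') = go rest
    go (c : cs)   = (maybe more (: more) (lookup c simple), rest)
      where (more, rest) = go cs
    simple = [('>', Next), ('<', Prev), ('+', Inc), ('-', Dec),
              ('.', Output), (',', Input)]

-- Tape: cells to the left (nearest first), current cell, cells to the right.
type Tape  = ([Int], Int, [Int])
-- State: tape, remaining input, output so far (in reverse order).
type State = (Tape, String, String)

run :: [Cmd] -> State -> State
run [] s = s
run (c : cs) s@((l, x, r), i, o) = case c of
  Next -> case r of
    y : r' -> run cs ((x : l, y, r'), i, o)
    []     -> run cs ((x : l, 0, []), i, o)      -- fresh zero cell
  Prev -> case l of
    y : l' -> run cs ((l', y, x : r), i, o)
    []     -> run cs (([], 0, x : r), i, o)      -- fresh zero cell
  Inc    -> run cs ((l, (x + 1) `mod` 256, r), i, o)
  Dec    -> run cs ((l, (x - 1) `mod` 256, r), i, o)
  Output -> run cs ((l, x, r), i, toEnum x : o)
  Input  -> case i of
    ch : i' -> run cs ((l, fromEnum ch `mod` 256, r), i', o)
    []      -> run cs ((l, 0, r), i, o)          -- end of input: store 0
  Loop body
    | x == 0    -> run cs s                      -- skip the loop
    | otherwise -> run (c : cs) (run body s)     -- run body, test again

-- Run a program on an input string and return its output.
runBF :: String -> String -> String
runBF prog input = reverse o
  where (_, _, o) = run (parse prog) (([], 0, []), input, [])
```
For example, ``runBF ",+." "a"`` reads a byte, increments it and prints it, giving ``"b"``.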
#NEW
==Example BF program: char.bf==
Display the ASCII character set (Jeffry Johnston 2001)
```
.+[.+]
```
#NEW
==Example BF program: hello.bf==
Print "Hello World!" (from [Wikipedia http://en.wikipedia.org/wiki/Brainfuck])
```
++++++++++
[>+++++++>++++++++++>+++>+<<<<-] To set up useful values in the array
>++. Print 'H'
>+. Print 'e'
+++++++. Print 'l'
. Print 'l'
+++. Print 'o'
>++. Print ' '
<<+++++++++++++++. Print 'W'
>. Print 'o'
+++. Print 'r'
------. Print 'l'
--------. Print 'd'
>+. Print '!'
>. Print newline
```
#NEW
==Criteria for a good programming language==
Certainly Turing completeness is not enough!
(lambda calculus, Brainf*ck, the corresponding fragment of C...)
More or less obvious criteria:
- succinctness
- efficiency
- clarity
- safety
- orthogonality
These criteria are not always compatible: there are trade-offs.
In practice, different languages are good for different applications.
(And there are languages which are no good for any applications.)
What is Brainf*ck good for? Reasoning about computability!
#NEW
==History of programming languages==
We spend some time looking at this
[poster http://www.oreilly.com/pub/a/oreilly/news/languageposter_0504.html]
with a time chart of languages.
#NEW
==Some trends==
Toward more **structure** in programs (from GOTOs to ``while`` loops to
recursion)
Toward more **static typing** (from bit strings to numeric types to
structures to algebraic data types to dependent types)
Toward more **abstraction** (from character arrays to strings, from
arrays to vectors and lists, from unlimited access to abstract data types)
Toward more **genericity** (from cut-and-paste to functions to
polymorphic functions to first-class modules)
Toward more **streamlined syntax** (from positions and line numbers,
keywords used as identifiers, ``begin`` and ``end`` markers,
limited-size identifiers, etc,
to a "C-like" syntax that can be processed with standard tools)
In general, toward more **high-level languages**, hence
farther away from the machine. This creates more work for
machines (and compiler writers!) but relieves the burden
on language users.
#NEW
==Special-purpose languages==
Also called **minilanguages**, or **domain (specific) languages**
An "ultimate solution" to a certain class of problems
Examples:
- Lex for lexers, Yacc for parsers
- BNFC for compiler front ends
- XML for structuring documents
- ``make`` for predefining compilations
- ``bash`` for working on files and directories
- PostScript for printing documents
- JavaScript for dynamic web pages
The last two are actually Turing-complete!
#NEW
==Design decisions for (mini)languages==
Imperative or declarative?
Interpreted or compiled?
Portable or platform-dependent?
Statically or dynamically checked?
Turing-complete or limited?
Language or library?
#NEW
==Embedded languages==
Minilanguage that is a fragment of a larger **host language**.
Really the same as a library in the host language.
Advantages
- inherit the implementation of the host language
- very little extra training for those who know the host language
- unlimited access to "language extensions" via using the host language
Disadvantages
- difficult to reason about independently of host language implementation
- access to host language can compromise safety, efficiency, etc
- may be difficult to interface with other languages
#NEW
==Case study: BNFC==
The starting point was the course
[Kompilatorkonstruktion http://www.cs.chalmers.se/Cs/Grundutb/Kurser/komp/2002/]
in 2002. The web page gives a link:
- 1/2 The BNF Converter: a small toy that helps to create abstract syntax
and parsers. Also available on the course account komp.
**Warning**: extremely experimental!
The first version was written by Aarne Ranta and Markus Forsberg for
Haskell/Happy/Alex
In 2004, ported to Java, C, and C++ by Michael Pellauer
In 2005, ported to Java 1.5 by Björn Bringert
In 2006, ported to OCaml by Kristofer Johannisson and C# by Johan Broberg
#NEW
==Motivation==
To implement exactly the idea that a parser returns an
abstract syntax tree.
- save writing code
- keep different compiler modules in sync
- use grammar as reliable documentation
- guarantee sameness of parser and pretty printer
(for e.g. source code optimizations)
- be sure of the complexity limit of processing
"The number of bugs per line is independent of programming language"
(Eric S. Raymond)
Code size for language implementation in Lab 2.
|| format | CPP.cf | Haskell | Java 1.5 | C++ | raw C++ ||
| files | 1 | 9 | 55 | 12 | 12
| lines | 63 | 999 | 3353 | 5382 | 9424
| chars | 1548 | 28516 | 92947 | 96587 | 203659
| lines src/tgt | 100% | 6% | 2% | 1% | 0.5%
#NEW
==Design decisions for BNFC==
Imperative or declarative? Declarative.
Interpreted or compiled? Compiled.
Portable or platform-dependent? Portable.
Statically checked? Yes, but should be more.
Turing-complete or limited? Limited.
Language or library? Language.
#NEW
==The very first implementation==
WSCC - World's Smallest Compiler Compiler. 114 lines of Haskell.
Functionality:
+ read a labelled BNF grammar
+ build a parser using **parser combinators**
+ using the parser, print out AST for input
#NEW
==Parsing labelled BNF grammars==
```
import Data.Char (isSpace) -- needed by isRule below
-- grammar parser: one rule/line, format F. C ::= (C | "s")* ";"
getCF :: String -> CF
getCF = concat . map (getcf . init . words) . filter isRule . lines where
getcf (fun : cat : "::=" : its) = return (init fun, (cat, map mkIt its))
getcf ww = []
mkIt ('"':w@(_:_)) = Right (init w)
mkIt w = Left w
isRule line = not (all isSpace line || take 2 line == "--")
-- the type of context-free grammars
type CF = [Rule]
type Rule = (Fun, (Cat, [Either Cat Tok]))
type Cat = String
type Tok = String
type Fun = String
type Str = [Tok]
```
#NEW
==Parser combinators==
```
-- a complete set of parser combinators à la Wadler and Hutton
type Parser a b = [a] -> [(b,[a])]
parseResults :: Parser a b -> [a] -> [b]
parseResults p s = [x | (x,r) <- p s, null r]
(...) :: Parser a b -> Parser a c -> Parser a (b,c)
(p ... q) s = [((x,y),r) | (x,t) <- p s, (y,r) <- q t]
(|||) :: Parser a b -> Parser a b -> Parser a b
(p ||| q) s = p s ++ q s
lit :: (Eq a) => a -> Parser a a
lit x (c:cs) = [(x,cs) | x == c]
lit _ _ = []
(***) :: Parser a b -> (b -> c) -> Parser a c
(p *** f) s = [(f x,r) | (x,r) <- p s]
succeed :: b -> Parser a b
succeed v s = [(v,s)]
fails :: Parser a b
fails s = []
```
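To see how the combinators fit together, here is a small self-contained usage sketch. The combinator definitions are repeated from above so that the block can be loaded on its own; the example parser ``nested`` and its grammar are assumptions made for illustration and do not come from WSCC.
```
type Parser a b = [a] -> [(b,[a])]

-- keep only the results that consume the whole input
parseResults :: Parser a b -> [a] -> [b]
parseResults p s = [x | (x,r) <- p s, null r]

(...) :: Parser a b -> Parser a c -> Parser a (b,c)
(p ... q) s = [((x,y),r) | (x,t) <- p s, (y,r) <- q t]

(|||) :: Parser a b -> Parser a b -> Parser a b
(p ||| q) s = p s ++ q s

lit :: (Eq a) => a -> Parser a a
lit x (c:cs) = [(x,cs) | x == c]
lit _ _ = []

(***) :: Parser a b -> (b -> c) -> Parser a c
(p *** f) s = [(f x,r) | (x,r) <- p s]

succeed :: b -> Parser a b
succeed v s = [(v,s)]

-- Hypothetical example grammar:  S ::= "a" S "b" | (empty)
-- The semantic value is the nesting depth.
-- Without fixity declarations, (...) associates to the left, so the
-- sequence yields a value of the form ((_, n), _).
nested :: Parser Char Int
nested = ((lit 'a' ... nested ... lit 'b') *** (\((_, n), _) -> n + 1))
     ||| succeed 0
```
Here ``parseResults nested "aabb"`` returns ``[2]``, while ``parseResults nested "aab"`` returns the empty list, since no parse consumes the whole input.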
#NEW
==The parsing method==
```
-- parser that works for non-left-recursive grammars
-- generalization of LL(1) to ambiguous grammars
pTree :: CF -> Cat -> Parser Tok Tree
pTree cf cat = foldr (|||) fails (map pRule (rulesForCat cf cat))
where
pRule (fun, (_,its)) = pIts its *** (\trees -> Tree (fun,trees))
pIts (Left c : ts) = (pTree cf c ... pIts ts) *** (uncurry (:))
pIts (Right s : ts) = (lit s ... pIts ts) *** snd
pIts [] = succeed []
-- the type of syntax trees
newtype Tree = Tree (Fun,[Tree])
```
#NEW
==Using WSCC==
The whole implementation: file [WSCC.hs wscc/WSCC.hs]
Example grammar: file [Mini.cf wscc/Mini.cf]
Example run (interactive; indentation of parse result added):
```
$ runghc WSCC.hs Mini.cf
15 rules
Program>
Prog NilStm
Program> int i ; { i = 1 ; int i ; }
Prog
(ConsStm (SDecl TInt Id_i)
(ConsStm (SBlock
(ConsStm (SAss Id_i (EInt Int_1))
(ConsStm (SDecl TInt Id_i)
NilStm)))
NilStm))
```
#NEW
==Limitations of the first version==
No built-in literals or identifiers
No built-in precedence
No documentation, pretty-printer, skeleton
No connection to standard compiler tools (Happy, Alex)
Only Haskell
No treatment of left recursive grammars
Unpredictable complexity due to backtracking
Undocumented grammar syntax
#NEW
==The next version: BNFC 1.0==
Built-in literals added
Built-in precedence via indexed categories
Documentation in LaTeX, pretty-printer and skeleton in Haskell
Code generated for standard compiler tools (Happy, Alex)
Only Haskell, still
Left recursion is a virtue in LALR parsing
Predictably linear complexity
BNF grammar syntax implemented in BNFC, therefore documented
#NEW
==Lessons from BNFC==
Compilation to standard tools made it really useful.
The language is declarative and therefore portable and predictable.
- using Happy, CUP, or Bison directly would not be, because any
host language code can be inserted in the semantic actions
Static checking as close to source as possible
- for instance, that list-defining rules are consistent
- unfortunately, BNFC doesn't check e.g. that ``token`` definitions
do not get overshadowed
- using the same identifier both as category and constructor is
legal in Haskell but not in Java. This gives errors, about which
only a warning is issued.
- using the same identifier for the grammar name confuses Java completely.
There is no warning for this.
The source code of BNFC has become a terrible mess and is hard to
maintain.
#NEW
==Using BNFC for implementing (mini)languages==
Easy to get started: a prototype is ready to run in 10 minutes.
Implementation language can be changed, e.g. fast prototype in
Haskell and production system in C++.
Many implementation languages can
be combined, because they can communicate via parser and pretty printer.
The language document can be handed to users.
Restrictions (we call them conditions for "well-behaved languages"):
- lexing strictly finite state, parsing strictly LALR(1)
- white space cannot be given any meaning
#NEW
==BNFC and XML==
XML = Extensible Markup Language
- a format for structuring documents and data
- example application: XHTML
Algebraic datatypes can be encoded in XML
DTD = Document Type Definition, tells what combinations are valid
BNFC can generate a DTD and encode syntax tree in XML with the
option ``-xml`` (Haskell only). Try this for the grammar
[``Mini.cf`` ../laborations/mini/Mini.cf]:
```
bnfc -xmlt -m Mini.cf
```
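The idea that algebraic datatypes can be encoded in XML can be sketched in a few lines of Haskell, reusing the ``Tree`` type of WSCC. The element-per-constructor layout below is an assumption made for illustration and is not claimed to be the format that BNFC produces; ``toXML`` and ``example`` are hypothetical names.
```
type Fun = String

-- the type of syntax trees, as in WSCC
newtype Tree = Tree (Fun,[Tree])

-- Each constructor becomes an element named after it and its
-- subtrees become child elements. (Assumed layout, not BNFC's.)
toXML :: Tree -> String
toXML (Tree (fun, []))    = "<" ++ fun ++ "/>"
toXML (Tree (fun, trees)) =
  "<" ++ fun ++ ">" ++ concatMap toXML trees ++ "</" ++ fun ++ ">"

-- the assignment  i = 1  from the Mini example run
example :: Tree
example = Tree ("SAss", [Tree ("Id_i", []), Tree ("EInt", [Tree ("Int_1", [])])])
```
With these definitions, ``toXML example`` gives ``<SAss><Id_i/><EInt><Int_1/></EInt></SAss>``.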
#NEW
==DTD for Mini.cf==
```
]>
```
#NEW
==XML for a Mini program==
```
./TestMini ex.mini
ex.mini
Parse Successful!
[Linearized tree]
int x ;
x = 6 ;
int y ;
y = x + 7 ;
print y ;
{
int y ;
y = 4 ;
print y ;
x = y ;
print x ;
}
print x ;
print y ;
[XML]
```
#NEW
==XML or BNFC==
The question is not an exclusive "or": you can get both!
BNFC can be used for defining a datafile format, which
is more compact than XML but portable between many languages,
since many languages can print and parse these objects.
If this is not enough, the format can be converted automatically
to an XML representation.