Lecture 12: Design and Evolution of Programming Languages

Programming Languages Course
Aarne Ranta (aarne@chalmers.se)

Book: 1.3, 1.5, 1.6

Minilanguages, from Eric S. Raymond, The Art of Unix Programming

The Hundred-Year Language from Paul Graham, Hackers & Painters

The Retrocomputing Museum

Plan

How simple can a programming language be?

Turing-completeness

Some programming language history

General and special purpose languages

Case study: the evolution of BNFC

Data formats, XML

Turing completeness

Before electronic computers were built, mathematical models of computation were developed. All of these were proven equivalent:

The Turing Machine (Alan Turing), similar to imperative programming.
Lambda Calculus (Alonzo Church), similar to functional programming.
Post machines (Emil Post)
Recursive functions
Rewriting systems

Any programming language equivalent to one of these is said to be Turing-complete.

All usual general-purpose languages are Turing-complete.

Lambda calculus as a programming language

(Material from Wikipedia)

Recall how simple lambda calculus is:

    Exp ::= Ident | "(" Exp Exp ")" | "\" Ident "->" Exp

We don't even need integers, because they can be defined as follows:

    0 = \f -> \x -> x
    1 = \f -> \x -> f x
    2 = \f -> \x -> f (f x)
    3 = \f -> \x -> f (f (f x))
    ...

In other words: number n is a higher-order function that applies any function n times to any given argument.

These functions are known as Church numerals.

Arithmetic in lambda calculus

Addition of Church numerals is defined as follows:

    PLUS = \m -> \n -> \f -> \x -> n f (m f x)

Example:

    PLUS 2 3
    = (\m -> \n -> \f -> \x -> n f (m f x)) 
         (\f -> \x -> f (f x)) (\f -> \x -> f (f (f x)))
    = \f -> \x -> (\f -> \x -> f (f (f x))) f ((\f -> \x -> f (f x)) f x)
    = \f -> \x -> (\f -> \x -> f (f (f x))) f (f (f x))
    = \f -> \x -> f (f (f (f (f x))))
    = 5

Multiplication:

    MULT = \m -> \n -> m (PLUS n) 0

Idea: add n to 0 m times.
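The definitions above can be tried out directly in Haskell. The following is a minimal sketch, not part of the original material: the type synonym Church and the helper toInt are names introduced here for illustration, and the numerals are given ordinary Haskell types rather than being untyped lambda terms.

```haskell
-- A Church numeral applies a function f some number of times to x.
type Church a = (a -> a) -> a -> a

zero, two, three :: Church a
zero  _ x = x
two   f x = f (f x)
three f x = f (f (f x))

-- PLUS = \m -> \n -> \f -> \x -> n f (m f x)
plus :: Church a -> Church a -> Church a
plus m n f x = n f (m f x)

-- MULT = \m -> \n -> m (PLUS n) 0
-- Here m iterates a function on numerals, so it is a numeral over numerals.
mult :: Church (Church a) -> Church a -> Church a
mult m n = m (plus n) zero

-- Illustration helper: count the applications by using (+ 1) and 0.
toInt :: Church Int -> Int
toInt n = n (+ 1) 0
```

In GHCi, toInt (plus two three) evaluates to 5 and toInt (mult two three) to 6.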

Control structures in lambda calculus

Booleans (Church booleans) and conditions

    TRUE  = \x -> \y -> x
    FALSE = \x -> \y -> y
  
    IFTHENELSE = \b -> \x -> \y -> b x y
  
    AND = \a -> \b -> IFTHENELSE a b FALSE
    OR  = \a -> \b -> IFTHENELSE a TRUE b
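The Church booleans can be sketched in Haskell in the same way. The names CBool and toBool are assumptions made for this illustration; and' and or' are primed only to avoid clashing with the Prelude.

```haskell
-- A Church boolean selects one of two arguments.
type CBool a = a -> a -> a

true, false :: CBool a
true  x _ = x
false _ y = y

ifThenElse :: CBool a -> a -> a -> a
ifThenElse b x y = b x y

-- The first argument chooses between two booleans,
-- so it is a boolean over booleans.
and', or' :: CBool (CBool a) -> CBool a -> CBool a
and' a b = ifThenElse a b false
or'  a b = ifThenElse a true b

-- Illustration helper: convert to a built-in Bool.
toBool :: CBool Bool -> Bool
toBool b = b True False
```

For instance, toBool (and' true false) is False and toBool (or' false true) is True.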

Recursion, via the fix-point combinator

    Y = \g -> (\x -> g (x x)) (\x -> g (x x))

This has the property

    Y g = g (Y g)

which means iterating g infinitely many times.
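The term Y as written cannot be given a type in Haskell, because the self-application x x would need a recursive type. The sketch below therefore takes the property Y g = g (Y g) itself as the definition, which works thanks to lazy evaluation. The names fix', factStep and factorial are chosen here for illustration (the base library offers the same function as Data.Function.fix).

```haskell
-- Defined directly by the fix-point property: fix' g = g (fix' g).
fix' :: (a -> a) -> a
fix' g = g (fix' g)

-- One unfolding of factorial; self stands for the recursive call.
factStep :: (Integer -> Integer) -> Integer -> Integer
factStep self n = if n <= 0 then 1 else n * self (n - 1)

-- Recursion obtained without factorial referring to itself.
factorial :: Integer -> Integer
factorial = fix' factStep
```

Here factorial 5 evaluates to 120: g is unfolded only as many times as the argument requires.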

Lambda calculus as a programming language?

It is possible to write programs in lambda calculus!

But it is inconvenient and inefficient.

However, it is a good starting point for a language to have a very small core language.

The implementation (compiler, interpreter) is then done for the core language with syntactic sugar and possibly optimizations.

Lisp is built from lambda calculus with a few additions, such as a primitive notion of lists.

Haskell has a small core language based on lambda calculus with algebraic datatypes and pattern matching, and primitive number types.

Brainf*ck: another Turing-complete language

The following material is from this website.

By Urban Müller, with the goal: create a Turing-complete language for which one could write the smallest compiler ever. His compiler was 240 bytes in size.

A Brainfuck program has an implicit byte pointer, called "the pointer", which is free to move around within an array of 30000 bytes, initially all set to zero. The pointer itself is initialized to point to the beginning of this array.

The Brainfuck programming language consists of eight commands, each of which is represented as a single character.

  > 	Increment the pointer.
  < 	Decrement the pointer.
  + 	Increment the byte at the pointer.
  - 	Decrement the byte at the pointer.
  . 	Output the byte at the pointer.
  , 	Input a byte and store it in the byte at the pointer.
  [ 	Jump forward past the matching ] if the byte at the pointer is zero.
  ] 	Jump backward to the matching [ unless the byte at the pointer is zero.

Semantics of BF via translation to C

  > 	becomes 	++p;
  < 	becomes 	--p;
  + 	becomes 	++*p;
  - 	becomes 	--*p;
  . 	becomes 	putchar(*p);
  , 	becomes 	*p = getchar();
  [ 	becomes 	while (*p) {
  ] 	becomes 	}
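The same semantics can also be sketched as a small interpreter in Haskell. This is an illustration under stated assumptions, not Müller's implementation: all names (Cmd, Tape, parse, step, runBF) are invented here, the tape is unbounded to the right rather than 30000 cells, cells wrap around modulo 256, and moving left of the first cell or reading at end of input does nothing.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Commands after parsing; a loop carries its body.
data Cmd = MoveR | MoveL | Incr | Decr | Output | Input | Loop [Cmd]

-- Parse up to end of input or a closing ']'; return commands and the rest.
-- Any other character is treated as a comment and skipped.
parse :: String -> ([Cmd], String)
parse [] = ([], [])
parse (c : cs) = case c of
  ']' -> ([], cs)
  '[' -> let (body, afterLoop) = parse cs
             (cmds, rest)      = parse afterLoop
         in  (Loop body : cmds, rest)
  _   -> let (cmds, rest) = parse cs
         in  (maybe cmds (: cmds) (lookup c simple), rest)
  where
    simple = [ ('>', MoveR), ('<', MoveL), ('+', Incr)
             , ('-', Decr), ('.', Output), (',', Input) ]

-- Tape as a zipper: cells left of the pointer (nearest first),
-- the current cell, and the cells to the right.
data Tape = Tape [Word8] Word8 [Word8]

-- (tape, remaining input, output so far in reverse)
type St = (Tape, String, String)

step :: St -> Cmd -> St
step (Tape ls x (r : rs), i, o) MoveR  = (Tape (x : ls) r rs, i, o)
step (Tape (l : ls) x rs, i, o) MoveL  = (Tape ls l (x : rs), i, o)
step (Tape ls x rs, i, o)       Incr   = (Tape ls (x + 1) rs, i, o)
step (Tape ls x rs, i, o)       Decr   = (Tape ls (x - 1) rs, i, o)
step (Tape ls x rs, i, o)       Output = (Tape ls x rs, i, chr (fromIntegral x) : o)
step (Tape ls _ rs, c : i, o)   Input  = (Tape ls (fromIntegral (ord c)) rs, i, o)
step st@(Tape _ x _, _, _) (Loop body)
  | x == 0    = st                                    -- like: while (*p) {
  | otherwise = step (foldl step st body) (Loop body)
step st _ = st  -- assumption: left of cell 0 / end of input is a no-op

-- Run a program on the given input and return its output.
runBF :: String -> String -> String
runBF prog input = reverse out
  where
    (_, _, out) = foldl step (Tape [] 0 (repeat 0), input, "") (fst (parse prog))
```

For example, runBF "++++++++[>++++++++<-]>+." "" returns "A" (8 times 8 plus 1 is 65), and runBF ",." "x" echoes "x".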

Example BF program: char.bf

Display the ASCII character set (Jeffry Johnston 2001)

    .+[.+]

Example BF program: hello.bf

Print "Hello World!" (from Wikipedia)

  ++++++++++
  [>+++++++>++++++++++>+++>+<<<<-] To set up useful values in the array
  >++.                             Print 'H'
  >+.                              Print 'e'
  +++++++.                         Print 'l'
  .                                Print 'l'
  +++.                             Print 'o'
  >++.                             Print ' '
  <<+++++++++++++++.               Print 'W'
  >.                               Print 'o'
  +++.                             Print 'r'
  ------.                          Print 'l'
  --------.                        Print 'd'
  >+.                              Print '!'
  >.                               Print newline

Criteria for a good programming language

Certainly Turing completeness is not enough! (lambda calculus, Brainf*ck, the corresponding fragment of C...)

More or less obvious criteria:

succinctness
efficiency
clarity
safety
orthogonality

These criteria are not always compatible: there are trade-offs.

In practice, different languages are good for different applications.

(And there are languages which are no good for any applications.)

What is Brainf*ck good for? Reasoning about computability!

History of programming languages

We spend some time looking at this poster with a time chart of languages.

Some trends

Toward more structure in programs (from GOTOs to while loops to recursion)

Toward more static typing (from bit strings to numeric types to structures to algebraic data types to dependent types)

Toward more abstraction (from character arrays to strings, from arrays to vectors and lists, from unlimited access to abstract data types)

Toward more genericity (from cut-and-paste to functions to polymorphic functions to first-class modules)

Toward more streamlined syntax (from positions and line numbers, keywords used as identifiers, begin and end markers, limited-size identifiers, etc., to a "C-like" syntax that can be processed with standard tools)

In general, toward more high-level languages, hence farther away from the machine. This creates more work for machines (and compiler writers!) but eases the burden on language users.

Special-purpose languages

Also called minilanguages, or domain (specific) languages

An "ultimate solution" to a certain class of problems

Examples:

Lex for lexers, Yacc for parsers
BNFC for compiler front ends
XML for structuring documents
make for predefining compilations
bash for working on files and directories
PostScript for printing documents
JavaScript for dynamic web pages

The latter two are actually Turing-complete!

Design decisions for (mini)languages

Imperative or declarative?

Interpreted or compiled?

Portable or platform-dependent?

Statically or dynamically checked?

Turing-complete or limited?

Language or library?

Embedded languages

Minilanguage that is a fragment of a larger host language.

Really the same as a library in the host language.

Advantages

inherit the implementation of the host language
very little extra training for those who know the host language
unlimited access to "language extensions" via using the host language

Disadvantages

difficult to reason about independently of host language implementation
access to host language can compromise safety, efficiency, etc
may be difficult to interface with other languages

Case study: BNFC

The starting point was the course Kompilatorkonstruktion in 2002. The web page gives a link:

1/2 The BNF Converter: a small toy that helps to create abstract syntax and parsers. Also available on the course account komp. Warning: extremely experimental!

The first version was written by Aarne Ranta and Markus Forsberg for Haskell/Happy/Alex

In 2004, ported to Java, C, and C++ by Michael Pellauer

In 2005, ported to Java 1.5 by Björn Bringert

In 2006, ported to OCaml by Kristofer Johannisson and C# by Johan Broberg

Motivation

To implement exactly the idea that a parser returns an abstract syntax tree.

save writing code
keep different compiler modules in sync
use grammar as reliable documentation
guarantee sameness of parser and pretty printer (for e.g. source code optimizations)
be sure of the complexity limit of processing

"The number of bugs per line is independent of programming language" (Eric S. Raymond)

Code size for language implementation in Lab 2.

format          CPP.cf   Haskell   Java 1.5   C++     raw C++
files           1        9         55         12      12
lines           63       999       3353       5382    9424
chars           1548     28516     92947      96587   203659
lines src/tgt   100%     6%        2%         1%      0.5%

Design decisions for BNFC

Imperative or declarative? Declarative.

Interpreted or compiled? Compiled.

Portable or platform-dependent? Portable.

Statically checked? Yes, but should be more.

Turing-complete or limited? Limited.

Language or library? Language.

The very first implementation

WSCC - World's Smallest Compiler Compiler. 114 lines of Haskell.

Functionality:

read a labelled BNF grammar
build a parser using parser combinators
using the parser, print out AST for input

Parsing labelled BNF grammars

  import Data.Char (isSpace)
  
  -- grammar parser: one rule/line, format F. C ::= (C | "s")* ";"
  
  getCF :: String -> CF
  getCF = concatMap (getcf . init . words) . filter isRule . lines where
    getcf (fun : cat : "::=" : its) = return (init fun, (cat, map mkIt its))
    getcf _ = []                          -- not a rule: contributes nothing
    mkIt ('"':w@(_:_)) = Right (init w)   -- quoted token: strip the quotes
    mkIt w             = Left  w          -- anything else is a category
    -- skip blank lines and comment lines
    isRule line = not (all isSpace line || take 2 line == "--")
  
  -- the type of context-free grammars
  
  type CF   = [Rule]
  type Rule = (Fun, (Cat, [Either Cat Tok]))
  
  type Cat = String
  type Tok = String
  type Fun = String
  type Str = [Tok]
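Since the grammar syntax is otherwise undocumented, it may help to see what getCF returns on a small input. The sketch below repeats getCF (lightly reformatted) so that it can be loaded on its own. The grammar toyGrammar is a hypothetical example made up here; it is not the Mini.cf of the lecture.

```haskell
import Data.Char (isSpace)

type Cat  = String
type Tok  = String
type Fun  = String
type Rule = (Fun, (Cat, [Either Cat Tok]))
type CF   = [Rule]

-- getCF as in the lecture, lightly reformatted.
getCF :: String -> CF
getCF = concatMap (getcf . init . words) . filter isRule . lines
  where
    -- "Label. Cat ::= items": drop the dot from the label
    getcf (fun : cat : "::=" : its) = [(init fun, (cat, map mkIt its))]
    getcf _                         = []
    mkIt ('"' : w@(_ : _)) = Right (init w)  -- quoted token
    mkIt w                 = Left w          -- category
    isRule line = not (all isSpace line || take 2 line == "--")

-- Hypothetical grammar in the one-rule-per-line format.
toyGrammar :: String
toyGrammar = unlines
  [ "-- a toy grammar"
  , "EAdd.  Exp  ::= Atom \"+\" Exp ;"
  , "EAtom. Exp  ::= Atom ;"
  , "One.   Atom ::= \"1\" ;"
  ]
```

Evaluating getCF toyGrammar yields three rules, the first being ("EAdd", ("Exp", [Left "Atom", Right "+", Left "Exp"])). Note that items must be separated by spaces and every rule must end with a separate ";", because the line is split with words.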

Parser combinators

  -- a complete set of parser combinators à la Wadler and Hutton
  
  type Parser a b = [a] -> [(b,[a])]
  
  parseResults :: Parser a b -> [a] -> [b]
  parseResults p s = [x | (x,r) <- p s, null r]
  
  (...) :: Parser a b -> Parser a c -> Parser a (b,c)
  (p ... q) s = [((x,y),r) | (x,t) <- p s, (y,r) <- q t]
  
  (|||) :: Parser a b -> Parser a b -> Parser a b
  (p ||| q) s = p s ++ q s
  
  lit :: (Eq a) => a -> Parser a a
  lit x (c:cs) = [(x,cs) | x == c]
  lit _ _ = []
  
  (***) :: Parser a b -> (b -> c) -> Parser a c
  (p *** f) s = [(f x,r) | (x,r) <- p s]
  
  succeed :: b -> Parser a b
  succeed v s = [(v,s)]
  
  fails :: Parser a b
  fails s = []

The parsing method

  -- parser that works for non-left-recursive grammars
  -- generalization of LL(1) to ambiguous grammars
  
  pTree :: CF -> Cat -> Parser Tok Tree
  pTree cf cat = foldr (|||) fails (map pRule (rulesForCat cf cat))
    where
      pRule (fun, (_,its)) = pIts its *** (\trees -> Tree (fun,trees))
      pIts (Left  c : ts) = (pTree cf c ... pIts ts) *** (uncurry (:))
      pIts (Right s : ts) = (lit s      ... pIts ts) *** snd
      pIts [] = succeed []
  
  -- the type of syntax trees
  
  newtype Tree = Tree (Fun,[Tree])
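To see the parsing method at work, the pieces can be assembled into one loadable file. This is a sketch under assumptions: rulesForCat is not shown in these notes, so the definition below is a guess at its obvious meaning; the grammar toy and the helpers showTree and parseToy are hypothetical and exist only for this illustration.

```haskell
type Cat  = String
type Tok  = String
type Fun  = String
type Rule = (Fun, (Cat, [Either Cat Tok]))
type CF   = [Rule]

newtype Tree = Tree (Fun, [Tree])

type Parser a b = [a] -> [(b, [a])]

(...) :: Parser a b -> Parser a c -> Parser a (b, c)
(p ... q) s = [((x, y), r) | (x, t) <- p s, (y, r) <- q t]

(|||) :: Parser a b -> Parser a b -> Parser a b
(p ||| q) s = p s ++ q s

(***) :: Parser a b -> (b -> c) -> Parser a c
(p *** f) s = [(f x, r) | (x, r) <- p s]

lit :: Eq a => a -> Parser a a
lit x (c : cs) = [(x, cs) | x == c]
lit _ _        = []

succeed :: b -> Parser a b
succeed v s = [(v, s)]

fails :: Parser a b
fails _ = []

-- Assumed helper: all rules whose left-hand side is the given category.
rulesForCat :: CF -> Cat -> [Rule]
rulesForCat cf cat = [r | r@(_, (c, _)) <- cf, c == cat]

-- pTree as in the lecture.
pTree :: CF -> Cat -> Parser Tok Tree
pTree cf cat = foldr (|||) fails (map pRule (rulesForCat cf cat))
  where
    pRule (fun, (_, its)) = pIts its *** (\trees -> Tree (fun, trees))
    pIts (Left  c : ts) = (pTree cf c ... pIts ts) *** uncurry (:)
    pIts (Right s : ts) = (lit s      ... pIts ts) *** snd
    pIts []             = succeed []

-- Hypothetical right-recursive toy grammar, given directly as a CF value.
toy :: CF
toy =
  [ ("EAdd",  ("Exp",  [Left "Atom", Right "+", Left "Exp"]))
  , ("EAtom", ("Exp",  [Left "Atom"]))
  , ("One",   ("Atom", [Right "1"]))
  , ("Two",   ("Atom", [Right "2"]))
  ]

showTree :: Tree -> String
showTree (Tree (f, [])) = f
showTree (Tree (f, ts)) = "(" ++ unwords (f : map showTree ts) ++ ")"

-- All complete parses of a space-separated input as category Exp.
parseToy :: String -> [String]
parseToy s = [showTree t | (t, []) <- pTree toy "Exp" (words s)]
```

With this grammar, parseToy "1 + 2" returns ["(EAdd One (EAtom Two))"], and parseToy "1 +" returns the empty list. Writing the first rule left-recursively (Exp ::= Exp "+" Atom) would make pTree loop, which is why the method is restricted to non-left-recursive grammars.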

Using WSCC

The whole implementation: file WSCC.hs

Example grammar: file Mini.cf

Example run (interactive; indentation of parse result added):

    $ runghc WSCC.hs Mini.cf
    15 rules
    Program>
    Prog NilStm
  
    Program> int i ; { i = 1 ; int i ; }
    Prog 
      (ConsStm (SDecl TInt Id_i) 
      (ConsStm (SBlock 
         (ConsStm (SAss Id_i (EInt Int_1)) 
         (ConsStm (SDecl TInt Id_i) 
         NilStm))) 
      NilStm))

Limitations of the first version

No built-in literals or identifiers

No built-in precedence

No documentation, pretty-printer, skeleton

No connection to standard compiler tools (Happy, Alex)

Only Haskell

No treatment of left recursive grammars

Unpredictable complexity due to backtracking

Undocumented grammar syntax

The next version: BNFC 1.0

Built-in literals added

Built-in precedence via indexed categories

Documentation in LaTeX, pretty-printer and skeleton in Haskell

Code generated for standard compiler tools (Happy, Alex)

Only Haskell, still

Left recursion is a virtue in LALR parsing

Predictably linear complexity

BNF grammar syntax implemented in BNFC, therefore documented

Lessons from BNFC

Compilation to standard tools made it really useful.

The language is declarative and therefore portable and predictable.

using Happy, CUP, or Bison directly would not give this, because arbitrary host-language code can be inserted in the semantic actions

Static checking as close to source as possible

for instance, that list-defining rules are consistent
unfortunately, BNFC doesn't check e.g. that token definitions do not get overshadowed
using the same identifier both as category and constructor is legal in Haskell but not in Java. This gives errors, about which only a warning is issued.
using the same identifier for the grammar name confuses Java completely. There is no warning for this.

The source code of BNFC has become a terrible mess and is hard to maintain.

Using BNFC for implementing (mini)languages

Easy to get started: a prototype is ready to run in 10 minutes.

Implementation language can be changed, e.g. fast prototype in Haskell and production system in C++.

Many implementation languages can be combined, because they can communicate via parser and pretty printer.

The language document can be handed to users.

Restrictions (we call them conditions for "well-behaved languages"):

lexing strictly finite state, parsing strictly LALR(1)
white space cannot be given any meaning

BNFC and XML

XML = Extensible Markup Language

a format for structuring documents and data
example application: XHTML

Algebraic datatypes can be encoded in XML
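As a rough illustration of such an encoding, here is a hand-written sketch in Haskell. It is modelled on the XML shown for Mini.cf in these notes (an element for the category, an empty element for the constructor, a value attribute for literals), but it is not the code that BNFC generates. The names Exp, node, leaf and expXML are assumptions made here, everything is printed on one line, and special characters in attribute values are not escaped.

```haskell
-- A small algebraic datatype of expressions.
data Exp = EVar String | EInt Integer | EAdd Exp Exp

-- Element for a category, whose first child is an empty constructor tag.
node :: String -> String -> [String] -> String
node cat con kids =
  "<" ++ cat ++ "> <" ++ con ++ "/> " ++ concatMap (++ " ") kids ++ "</" ++ cat ++ ">"

-- Empty element carrying a literal in its value attribute (no escaping).
leaf :: String -> String -> String
leaf cat v = "<" ++ cat ++ " value = \"" ++ v ++ "\" />"

-- One case per constructor; subtrees are encoded recursively.
expXML :: Exp -> String
expXML (EVar x)   = node "Exp" "EVar" [leaf "Ident" x]
expXML (EInt n)   = node "Exp" "EInt" [leaf "Integer" (show n)]
expXML (EAdd a b) = node "Exp" "EAdd" [expXML a, expXML b]
```

For example, expXML (EInt 6) gives <Exp> <EInt/> <Integer value = "6" /> </Exp>.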

DTD = Document Type Definition, tells what combinations are valid

BNFC can generate a DTD and encode syntax tree in XML with the option -xml (Haskell only). Try this for the grammar Mini.cf:

    bnfc -xmlt -m Mini.cf

DTD for Mini.cf

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE Mini [
  <!ELEMENT Integer EMPTY>
  <!ATTLIST Integer value CDATA #REQUIRED>
  <!ELEMENT Double EMPTY>
  <!ATTLIST Double value CDATA #REQUIRED>
  <!ELEMENT String EMPTY>
  <!ATTLIST String value CDATA #REQUIRED>
  <!ELEMENT Ident EMPTY>
  <!ATTLIST Ident value CDATA #REQUIRED>
  
  <!ELEMENT Program ((Prog, Stm*))>
  <!ELEMENT Prog EMPTY>
  
  <!ELEMENT Stm ((SDecl, Type, Ident) | 
    (SAss, Ident, Exp) | (SBlock, Stm*) | (SPrint, Exp))>
  <!ELEMENT SDecl EMPTY>
  <!ELEMENT SAss EMPTY>
  <!ELEMENT SBlock EMPTY>
  <!ELEMENT SPrint EMPTY>
  
  <!ELEMENT Exp ((EVar, Ident) | (EInt, Integer) | 
     (EDouble, Double) | (EAdd, Exp, Exp))>
  <!ELEMENT EVar EMPTY>
  <!ELEMENT EInt EMPTY>
  <!ELEMENT EDouble EMPTY>
  <!ELEMENT EAdd EMPTY>
  
  <!ELEMENT Type ((TInt) | (TDouble))>
  <!ELEMENT TInt EMPTY>
  <!ELEMENT TDouble EMPTY>
  ]>

XML for a Mini program

  ./TestMini ex.mini
  ex.mini
  
  Parse Successful!
  
  [Linearized tree]
  
  int x ;
  x = 6 ;
  int y ;
  y = x + 7 ;
  print y ;
  {
    int y ;
    y = 4 ;
    print y ;
    x = y ;
    print x ;
    }
  print x ;
  print y ;
  
  
  [XML]
  
  <Program> <Prog/>
    <Stm> <SDecl/>
      <Type> <TInt/>
      </Type>
      <Ident value = "x" />
    </Stm>
    <Stm> <SAss/>
      <Ident value = "x" />
      <Exp> <EInt/>
        <Integer value = "6" />
      </Exp>
    </Stm>
    <Stm> <SDecl/>
      <Type> <TInt/>
      </Type>
      <Ident value = "y" />
    </Stm>
    <Stm> <SAss/>
      <Ident value = "y" />
      <Exp> <EAdd/>
        <Exp> <EVar/>
          <Ident value = "x" />
        </Exp>
        <Exp> <EInt/>
          <Integer value = "7" />
        </Exp>
      </Exp>
    </Stm>
    <Stm> <SPrint/>
      <Exp> <EVar/>
        <Ident value = "y" />
      </Exp>
    </Stm>
    <Stm> <SBlock/>
      <Stm> <SDecl/>
        <Type> <TInt/>
        </Type>
        <Ident value = "y" />
      </Stm>
      <Stm> <SAss/>
        <Ident value = "y" />
        <Exp> <EInt/>
          <Integer value = "4" />
        </Exp>
      </Stm>
      <Stm> <SPrint/>
        <Exp> <EVar/>
          <Ident value = "y" />
        </Exp>
      </Stm>
      <Stm> <SAss/>
        <Ident value = "x" />
        <Exp> <EVar/>
          <Ident value = "y" />
        </Exp>
      </Stm>
      <Stm> <SPrint/>
        <Exp> <EVar/>
          <Ident value = "x" />
        </Exp>
      </Stm>
    </Stm>
    <Stm> <SPrint/>
      <Exp> <EVar/>
        <Ident value = "x" />
      </Exp>
    </Stm>
    <Stm> <SPrint/>
      <Exp> <EVar/>
        <Ident value = "y" />
      </Exp>
    </Stm>
  </Program>

XML or BNFC

The question is not an exclusive "or": you can get both!

BNFC can be used for defining a datafile format, which is more compact than XML but portable between many languages, since many languages can print and parse these objects.

If this is not enough, the format can be converted automatically to an XML representation.