Lecture 12: Design and Evolution of Programming Languages
Programming Languages Course
Aarne Ranta (aarne@chalmers.se)

%!target:html
%!postproc(html): #NEW
%!postproc(html): #HR
%!postproc(html): #sub1 1
%!postproc(html): #subn1 n-1
%!postproc(html): #subn n

Book: 1.3, 1.5, 1.6

[Minilanguages http://www.catb.org/~esr/writings/taoup/html/minilanguageschapter.html], from Eric S. Raymond, [The Art of Unix Programming http://www.catb.org/~esr/writings/taoup/]

[The Hundred-Year Language http://www.paulgraham.com/hundred.html] from Paul Graham, [Hackers & Painters http://www.paulgraham.com/hackpaint.html]

[The Retrocomputing Museum http://www.catb.org/~esr/retro/]

#NEW

==Plan==

How simple can a programming language be? Turing-completeness

Some programming language history

General and special purpose languages

Case study: the evolution of BNFC

Data formats, XML

#NEW

==Turing completeness==

Before electronic computers were built, mathematical models of computation were developed. All of these were proven equivalent:
- The **Turing Machine** (Alan Turing), similar to imperative programming.
- **Lambda Calculus** (Alonzo Church), similar to functional programming.
- **Post machines** (Emil Post)
- **Recursive functions**
- **Rewriting systems**


Any programming language equivalent to one of these is said to be **Turing-complete**.

All usual general-purpose languages are Turing-complete.

#NEW

==Lambda calculus as a programming language==

(Material from [Wikipedia http://en.wikipedia.org/wiki/Lambda_calculus])

Recall how simple lambda calculus is:
```
  Exp ::= Ident | "(" Exp Exp ")" | "\" Ident "->" Exp
```
We don't even need integers, because they can be defined as follows:
```
  0 = \f -> \x -> x
  1 = \f -> \x -> f x
  2 = \f -> \x -> f (f x)
  3 = \f -> \x -> f (f (f x))
  ...
```
In other words: number //n// is a higher-order function that applies any function //n// times to any given argument.

These functions are known as **Church numerals**.
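The Church numerals above can be tried out directly in Haskell, whose lambda syntax is nearly the same. The following is a minimal sketch, not part of the lecture material: the type synonym ``Church`` and the helpers ``toInt`` and ``fromInt`` are names introduced here for illustration only.
```
-- Church numerals written as ordinary Haskell functions.
-- The synonym Church is our own shorthand, not standard library code.
type Church a = (a -> a) -> a -> a

zero, one, two, three :: Church a
zero  = \f -> \x -> x
one   = \f -> \x -> f x
two   = \f -> \x -> f (f x)
three = \f -> \x -> f (f (f x))

-- Read a numeral back as an Int: apply (+ 1) n times to 0.
toInt :: Church Int -> Int
toInt n = n (+ 1) 0

-- Build the numeral for a non-negative Int (negative input is treated as 0).
fromInt :: Int -> Church a
fromInt k f x
  | k <= 0    = x
  | otherwise = f (fromInt (k - 1) f x)
```
For example, ``toInt three`` evaluates to ``3``, and ``two ('a' :) ""`` evaluates to ``"aa"``, which shows that a numeral works with any function, not only with arithmetic.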
#NEW

==Arithmetic in lambda calculus==

Addition of Church numerals is defined as follows:
```
  PLUS = \m -> \n -> \f -> \x -> n f (m f x)
```
Example:
```
  PLUS 2 3
    = (\m -> \n -> \f -> \x -> n f (m f x)) (\f -> \x -> f (f x)) (\f -> \x -> f (f (f x)))
    = \f -> \x -> (\f -> \x -> f (f (f x))) f ((\f -> \x -> f (f x)) f x)
    = \f -> \x -> (\f -> \x -> f (f (f x))) f (f (f x))
    = \f -> \x -> f (f (f (f (f x))))
    = 5
```
Multiplication:
```
  MULT = \m -> \n -> m (PLUS n) 0
```
Idea: add //n// to 0 //m// times.

#NEW

==Control structures in lambda calculus==

Booleans (**Church booleans**) and conditions
```
  TRUE  = \x -> \y -> x
  FALSE = \x -> \y -> y

  IFTHENELSE = \b -> \x -> \y -> b x y

  AND = \a -> \b -> IFTHENELSE a b FALSE
  OR  = \a -> \b -> IFTHENELSE a TRUE b
```
Recursion, via the **fix-point combinator**
```
  Y = \g -> (\x -> g (x x)) (\x -> g (x x))
```
This has the property
```
  Y g = g (Y g)
```
which means iterating ``g`` infinitely many times.

#NEW

==Lambda calculus as a programming language?==

To write programs in lambda calculus is possible! But it is inconvenient and inefficient.

However, it is a good starting point for a language to have a very small **core language**.

The implementation (compiler, interpreter) is then done for the core language with syntactic sugar and possibly optimizations.

Lisp is built from lambda calculus with a few additions, such as a primitive notion of lists.

Haskell has a small core language based on lambda calculus with algebraic datatypes and pattern matching, and primitive number types.

#NEW

==Brainf*ck: another Turing-complete language==

The following material is from [this website http://www.muppetlabs.com/~breadbox/bf/].

By Urban Müller, with the goal: create a Turing-complete language for which one could write the smallest compiler ever. His compiler was 240 bytes in size.

A Brainfuck program has an implicit byte pointer, called "the pointer", which is free to move around within an array of 30000 bytes, initially all set to zero.
The pointer itself is initialized to point to the beginning of this array.

The Brainfuck programming language consists of eight commands, each of which is represented as a single character.
```
  >  Increment the pointer.
  <  Decrement the pointer.
  +  Increment the byte at the pointer.
  -  Decrement the byte at the pointer.
  .  Output the byte at the pointer.
  ,  Input a byte and store it in the byte at the pointer.
  [  Jump forward past the matching ] if the byte at the pointer is zero.
  ]  Jump backward to the matching [ unless the byte at the pointer is zero.
```

#NEW

==Semantics of BF via translation to C==

```
  >  becomes  ++p;
  <  becomes  --p;
  +  becomes  ++*p;
  -  becomes  --*p;
  .  becomes  putchar(*p);
  ,  becomes  *p = getchar();
  [  becomes  while (*p) {
  ]  becomes  }
```

#NEW

==Example BF program: char.bf==

Display the ASCII character set (Jeffry Johnston 2001)
```
.+[.+]
```

#NEW

==Example BF program: hello.bf==

Print "Hello World!" (from [Wikipedia http://en.wikipedia.org/wiki/Brainfuck])
```
++++++++++
[>+++++++>++++++++++>+++>+<<<<-] To set up useful values in the array
>++.                             Print 'H'
>+.                              Print 'e'
+++++++.                         Print 'l'
.                                Print 'l'
+++.                             Print 'o'
>++.                             Print ' '
<<+++++++++++++++.               Print 'W'
>.                               Print 'o'
+++.                             Print 'r'
------.                          Print 'l'
--------.                        Print 'd'
>+.                              Print '!'
>.                               Print newline
```

#NEW

==Criteria for a good programming language==

Certainly Turing completeness is not enough! (lambda calculus, Brainf*ck, the corresponding fragment of C...)

More or less obvious criteria:
- succinctness
- efficiency
- clarity
- safety
- orthogonality


These criteria are not always compatible: there are trade-offs.

In practice, different languages are good for different applications.

(And there are languages which are no good for any applications.)

What is Brainf*ck good for? Reasoning about computability!

#NEW

==History of programming languages==

We spend some time looking at this [poster http://www.oreilly.com/pub/a/oreilly/news/languageposter_0504.html] with a time chart of languages.
#NEW

==Some trends==

Toward more **structure** in programs (from GOTOs to ``while`` loops to recursion)

Toward more **static typing** (from bit strings to numeric types to structures to algebraic data types to dependent types)

Toward more **abstraction** (from character arrays to strings, from arrays to vectors and lists, from unlimited access to abstract data types)

Toward more **genericity** (from cut-and-paste to functions to polymorphic functions to first-class modules)

Toward more **streamlined syntax** (from positions and line numbers, keywords used as identifiers, ``begin`` and ``end`` markers, limited-size identifiers, etc, to a "C-like" syntax that can be processed with standard tools)

In general, toward more **high-level languages**, hence farther away from the machine.

This creates more work for machines (and compiler writers!) but lightens the burden on language users.

#NEW

==Special-purpose languages==

Also called **minilanguages**, or **domain (specific) languages**

An "ultimate solution" to a certain class of problems

Examples:
- Lex for lexers, Yacc for parsers
- BNFC for compiler front ends
- XML for structuring documents
- ``make`` for predefining compilations
- ``bash`` for working on files and directories
- PostScript for printing documents
- JavaScript for dynamic web pages


The latter two are actually Turing-complete!

#NEW

==Design decisions for (mini)languages==

Imperative or declarative?

Interpreted or compiled?

Portable or platform-dependent?

Statically or dynamically checked?

Turing-complete or limited?

Language or library?

#NEW

==Embedded languages==

A minilanguage that is a fragment of a larger **host language**.

Really the same as a library in the host language.
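To make the idea of an embedded language concrete, here is a minimal sketch of a tiny expression language embedded in Haskell. It is a hypothetical example, not taken from the lecture: the names ``Expr``, ``Env``, ``eval`` and ``example`` are all invented for illustration. The ``Num`` instance lets "programs" of the minilanguage reuse the host language's literals and operators.
```
-- A tiny expression minilanguage embedded in Haskell (illustrative only).
data Expr
  = Lit Integer
  | Var String
  | Add Expr Expr
  | Mul Expr Expr
  deriving (Eq, Show)

-- Reuse host-language syntax: numeric literals, + and * build Expr values.
instance Num Expr where
  fromInteger = Lit
  (+)         = Add
  (*)         = Mul
  negate e    = Mul (Lit (-1)) e
  abs         = error "abs is not part of the minilanguage"
  signum      = error "signum is not part of the minilanguage"

-- Variable bindings as a simple association list.
type Env = [(String, Integer)]

-- The interpreter; an unbound variable yields Nothing.
eval :: Env -> Expr -> Maybe Integer
eval _   (Lit n)   = Just n
eval env (Var x)   = lookup x env
eval env (Add a b) = (+) <$> eval env a <*> eval env b
eval env (Mul a b) = (*) <$> eval env a <*> eval env b

-- A "program" in the embedded language, written with host-language syntax.
example :: Expr
example = 2 * Var "x" + 1
```
Here ``eval [("x", 3)] example`` gives ``Just 7``. Note that nothing prevents a user from building an ``Expr`` with arbitrary Haskell code, which illustrates both the convenience and the loss of control that come with embedding.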

Advantages
- inherit the implementation of the host language
- very little extra training for those who know the host language
- unlimited access to "language extensions" via using the host language


Disadvantages
- difficult to reason about independently of host language implementation
- access to host language can compromise safety, efficiency, etc
- may be difficult to interface with other languages


#NEW

==Case study: BNFC==

The starting point was the course [Kompilatorkonstruktion http://www.cs.chalmers.se/Cs/Grundutb/Kurser/komp/2002/] in 2002.

The web page gives a link:
- 1/2 The BNF Converter: a small toy that helps in creating abstract syntax and parsers. Also available on the course account komp. **Warning**: extremely experimental!


The first version was written by Aarne Ranta and Markus Forsberg for Haskell/Happy/Alex

In 2004, ported to Java, C, and C++ by Michael Pellauer

In 2005, ported to Java 1.5 by Björn Bringert

In 2006, ported to OCaml by Kristofer Johannisson and C# by Johan Broberg

#NEW

==Motivation==

To implement exactly the idea that a parser returns an abstract syntax tree.
- save writing code
- keep different compiler modules in sync
- use grammar as reliable documentation
- guarantee consistency of parser and pretty printer (for e.g. source code optimizations)
- be sure of the complexity limit of processing


"The number of bugs per line is independent of programming language" (Eric S. Raymond)

Code size for language implementation in Lab 2.

 || format | CPP.cf | Haskell | Java 1.5 | C++ | raw C++ ||
  | files | 1 | 9 | 55 | 12 | 12
  | lines | 63 | 999 | 3353 | 5382 | 9424
  | chars | 1548 | 28516 | 92947 | 96587 | 203659
  | lines src/tgt | 100% | 6% | 2% | 1% | 0.5%

#NEW

==Design decisions for BNFC==

Imperative or declarative? Declarative.

Interpreted or compiled? Compiled.

Portable or platform-dependent? Portable.

Statically checked? Yes, but should be more.

Turing-complete or limited? Limited.

Language or library? Language.
#NEW

==The very first implementation==

WSCC - World's Smallest Compiler Compiler. 114 lines of Haskell.

Functionality:
+ read a labelled BNF grammar
+ build a parser using **parser combinators**
+ using the parser, print out AST for input


#NEW

==Parsing labelled BNF grammars==

```
import Data.Char (isSpace)  -- needed by isRule below

-- grammar parser: one rule/line, format F. C ::= (C | "s")* ";"

getCF :: String -> CF
getCF = concat . map (getcf . init . words) . filter isRule . lines where
  getcf (fun : cat : "::=" : its) = return (init fun, (cat, map mkIt its))
  getcf ww = []
  mkIt ('"':w@(_:_)) = Right (init w)
  mkIt w = Left w
  isRule line = not (all isSpace line || take 2 line == "--")

-- the type of context-free grammars

type CF   = [Rule]
type Rule = (Fun, (Cat, [Either Cat Tok]))
type Cat  = String
type Tok  = String
type Fun  = String
type Str  = [Tok]
```

#NEW

==Parser combinators==

```
-- a complete set of parser combinators à la Wadler and Hutton

type Parser a b = [a] -> [(b,[a])]

parseResults :: Parser a b -> [a] -> [b]
parseResults p s = [x | (x,r) <- p s, null r]

(...) :: Parser a b -> Parser a c -> Parser a (b,c)
(p ... q) s = [((x,y),r) | (x,t) <- p s, (y,r) <- q t]

(|||) :: Parser a b -> Parser a b -> Parser a b
(p ||| q) s = p s ++ q s

lit :: (Eq a) => a -> Parser a a
lit x (c:cs) = [(x,cs) | x == c]
lit _ _ = []

(***) :: Parser a b -> (b -> c) -> Parser a c
(p *** f) s = [(f x,r) | (x,r) <- p s]

succeed :: b -> Parser a b
succeed v s = [(v,s)]

fails :: Parser a b
fails s = []
```

#NEW

==The parsing method==

```
-- parser that works for non-left-recursive grammars
-- generalization of LL(1) to ambiguous grammars

pTree :: CF -> Cat -> Parser Tok Tree
pTree cf cat = foldr (|||) fails (map pRule (rulesForCat cf cat))
  where
    pRule (fun, (_,its)) = pIts its *** (\trees -> Tree (fun,trees))
    pIts (Left c  : ts) = (pTree cf c ... pIts ts) *** (uncurry (:))
    pIts (Right s : ts) = (lit s ... pIts ts) *** snd
    pIts []             = succeed []

-- the type of syntax trees

newtype Tree = Tree (Fun,[Tree])
```

#NEW

==Using WSCC==

The whole implementation: file [WSCC.hs wscc/WSCC.hs]

Example grammar: file [Mini.cf wscc/Mini.cf]

Example run (interactive; indentation of parse result added):
```
  $ runghc WSCC.hs Mini.cf
  15 rules

  Program>
  Prog NilStm

  Program> int i ; { i = 1 ; int i ; }
  Prog
    (ConsStm (SDecl TInt Id_i)
    (ConsStm (SBlock
       (ConsStm (SAss Id_i (EInt Int_1))
       (ConsStm (SDecl TInt Id_i)
        NilStm)))
     NilStm))
```

#NEW

==Limitations of the first version==

No built-in literals or identifiers

No built-in precedence

No documentation, pretty-printer, skeleton

No connection to standard compiler tools (Happy, Alex)

Only Haskell

No treatment of left recursive grammars

Unpredictable complexity due to backtracking

Undocumented grammar syntax

#NEW

==The next version: BNFC 1.0==

Built-in literals added

Built-in precedence via indexed categories

Documentation in LaTeX, pretty-printer and skeleton in Haskell

Code generated for standard compiler tools (Happy, Alex)

Only Haskell, still

Left recursion is a virtue in LALR parsing

Predictably linear complexity

BNF grammar syntax implemented in BNFC, therefore documented

#NEW

==Lessons from BNFC==

Compilation to standard tools made it really useful.

The language is declarative and therefore portable and predictable.
- using Happy, CUP, or Bison directly would not be, because any host language code can be inserted in the semantic actions


Static checking as close to source as possible
- for instance, that list-defining rules are consistent
- unfortunately, BNFC doesn't check e.g. that ``token`` definitions do not get overshadowed
- using the same identifier both as category and constructor is legal in Haskell but not in Java. This gives errors, about which only a warning is issued.
- using the same identifier for the grammar name confuses Java completely. There is no warning for this.


The source code of BNFC has become a terrible mess and is hard to maintain.

#NEW

==Using BNFC for implementing (mini)languages==

Easy to get started: a prototype is ready to run in 10 minutes.

Implementation language can be changed, e.g. fast prototype in Haskell and production system in C++.

Many implementation languages can be combined, because they can communicate via parser and pretty printer.

The language document can be handed to users.

Restrictions (we call them conditions for "well-behaved languages"):
- lexing strictly finite state, parsing strictly LALR(1)
- white space cannot be given any meaning


#NEW

==BNFC and XML==

XML = Extensible Markup Language
- a format for structuring documents and data
- example application: XHTML


Algebraic datatypes can be encoded in XML

DTD = Document Type Definition, tells what combinations are valid

BNFC can generate a DTD and encode syntax trees in XML with the option ``-xml`` (Haskell only).

Try this for the grammar [``Mini.cf`` ../laborations/mini/Mini.cf]:
```
  bnfc -xmlt -m Mini.cf
```

#NEW

==DTD for Mini.cf==

```
]>
```

#NEW

==XML for a Mini program==

```
  ./TestMini ex.mini
  ex.mini

  Parse Successful!

  [Linearized tree]

  int x ;
  x = 6 ;
  int y ;
  y = x + 7 ;
  print y ;
  {
    int y ;
    y = 4 ;
    print y ;
    x = y ;
    print x ;
  }
  print x ;
  print y ;

  [XML]
```

#NEW

==XML or BNFC==

The question is not an exclusive "or": you can get both!

BNFC can be used for defining a datafile format, which is more compact than XML but portable between many languages, since many languages can print and parse these objects.

If this is not enough, the format can be converted automatically to an XML representation.
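
The claim that algebraic datatypes can be encoded in XML can be illustrated with the generic ``Tree`` type from the WSCC code. The following is a minimal sketch of one possible encoding, in which each constructor becomes an element; it is an assumption made for illustration and not necessarily the format that BNFC itself generates. The function ``toXML`` and the value ``exampleTree`` are names invented here, and constructor names are assumed to be valid XML element names, so no escaping is done.
```
-- Generic syntax trees, same shape as in the WSCC code.
type Fun = String
newtype Tree = Tree (Fun, [Tree])

-- Encode a tree as XML: one element per constructor,
-- an empty-element tag for constructors without arguments.
toXML :: Tree -> String
toXML (Tree (fun, []))    = "<" ++ fun ++ "/>"
toXML (Tree (fun, trees)) =
  "<" ++ fun ++ ">" ++ concatMap toXML trees ++ "</" ++ fun ++ ">"

-- The declaration "int i ;" as parsed in the WSCC example run.
exampleTree :: Tree
exampleTree = Tree ("SDecl", [Tree ("TInt", []), Tree ("Id_i", [])])
```
With these definitions, ``toXML exampleTree`` yields ``<SDecl><TInt/><Id_i/></SDecl>``.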