Simple and robust layout rules for parsing programming language code

Modern programming languages like Haskell and Python use indentation to structure code. Haskell’s layout rules give quite a bit of freedom to the programmer to arrange her code, at the expense that programs are not stable under some simple operations such as renaming an identifier. Consider the Haskell snippet:

  f arg = do thing1
             thing2

Renaming arg to argument will produce the text

  f argument = do thing1
             thing2

which is no longer valid Haskell code.

Haskell has so-called layout keywords such as do which start a new block. The column of the token after the layout keyword is the reference for grouping declarations into the block: the block closes if the first token on a line is indented less than the reference token. This makes the first snippet parse as

  f arg = do { thing1
             ; thing2 }

whereas the second is parsed as

  f argument = do { thing1 }
             thing2

The following rules make layout robust with regard to renaming identifiers (such as arg to argument):

  1. The layout keyword must be the last token on the line, or

  2. the layout keyword that starts a new block spanning several lines can only be preceded by other keywords or punctuation characters on the same line.

The task of this Master’s thesis is to design a framework for layout-sensitive grammars and a parser generator. Traditionally, layout is handled on the level of lexical analysis, but this thesis should investigate a handling on the level of parsing.

Prerequisites:

References: