Exercise: Extending a Parsing Library

The lecture on Monads and monad transformers introduced a simple parsing library.implemented using monad transformers. In this exercise, you are to extend the library to support a separate lexical analyser.

First download and unpack this file archive. It contains three files:

Lexical and Syntax Analysis

Real parsers normally use a separate lexical analyser, which takes care of white space, recognising tokens so that identifiers such as "beginning" are not parsed as keyword "begin" followed by "ning", and possibly removing comments. Happy assumes a separate lexical analyser is provided, and of course a hand-written lexer can also be used with this parsing library. But since a lexer is just a (special kind of) parser, we can use the same combinators to define both lexer and parser.

A lexer produces a list of tokens, usually classified into a small set of token types. For the arithmetic expression parser, the token types might be

data TokenType = NumToken | PlusToken | TimesToken | BraToken | KetToken
  deriving (Eq, Show)

Let us represent a token by a pair of a token type and a string:

type Token = (TokenType, String)

Our plan will be to construct a lexical analyser, which produces a list of tokens, and a syntax analyser which parses such a list, both using the parsing library. Since we are now using the library to make two different kinds of parsers, we have an opportunity to extend our DSEL with operators specifically tailored for one or the other.

Common tasks in a lexical analyser are:

Common tasks in a syntax analyser are:

Design suitable combinators for these tasks and add them to the parsing library. Use them to build an expression parser with separate lexical and syntax analysers.

Replacing the Underlying Monad

The little parsing library is overloaded on an underlying monad, which in the expression parser was just the identity monad Id. But sometimes parsers need to invoke other kinds of actions during parsing. For example,

These cases can be handled by replacing the underlying monad with IO and a State monad respectively.

Add an "include" expression to the arithmetic expression evaluator.

include file

should parse and evaluate an expression from the named file. Make sure that if parsing the included file fails, then the parse of the "include file" construction also fails (gracefully, that is, using the failure mechanism of the parsing monad -- not by a run-time error!)

Footnote: Real Parsing Libraries

This parsing library is a simplification of those used in practice, and there is a bit more work to be done to make a truly useful tool. Apart from adding more combinators to extend its functionality, there are two quite fundamental matters to address.

Firstly, this library constructs backtracking parsers, which suffer from excessive space use for large inputs. The problem and solutions are well-known, but beyond the scope of this course. For a brief discussion, read section 5.11 of Monads for Functional Programming (Phil Wadler's Marktoberdorf lecture notes), which introduces the "guarantee" operator as a solution.

Wadler introduces guarantee for a particular monad, but the idea can be generalised. Can you define a class GuaranteeMonad, with guarantee as a method, and give instances for the monad transformers we have considered? (This is an optional and extended exercise, which I haven't worked through myself! You will certainly need Haskell's lazy patterns of the form ~pat. Matching against one of these succeeds immediately, without evaluating the matched value; the real work of matching, and any possible failure, happens only when one the variables bound in the pattern is used).

Secondly, this library provides no indication of why parsing fails: we get no error messages. Real libraries keep track of the parsing position, and distinguish between "failure -- please try an alternative parse", and "failure -- there is a syntax error". I use a library which reports the "high water mark" of parsing: that is, the position of the first symbol which could not be accepted by any alternative parse. I find this works well, but there are many other approaches too.