The GF Resource Grammar Library

Fourth Version, 2 November 2005.
Third Version, 22 May 2005. Completed 1 July.
Second Version, 1 March 2005.
First Draft, 7 February 2005.

Aarne Ranta

aarne@cs.chalmers.se

GF = Grammatical Framework

GF is a grammar formalism based on functional programming and type theory.

GF was designed to be easy for ordinary programmers to use, that is, programmers without training in linguistics.

The mission of GF is to make natural-language applications available to ordinary programmers.

Thus GF is not primarily another theoretical framework for linguists.

Multilingual grammars

A GF grammar consists of an abstract syntax and a set of concrete syntaxes.

Abstract syntax: language-independent representation

  cat Prop ; Nat ;
  fun Even : Nat -> Prop ;
  fun NInt : Int -> Nat ;

Concrete syntax: mapping from abstract syntax trees to strings in a language (English, French, German, Swedish,...)

  lin Even x = {s = x.s ++ "is" ++ "even"} ;    -- English

  lin Even x = {s = x.s ++ "est" ++ "pair"} ;   -- French

  lin Even x = {s = x.s ++ "ist" ++ "gerade"} ; -- German

  lin Even x = {s = x.s ++ "är" ++ "jämnt"} ;   -- Swedish

We can translate between languages via the abstract syntax:

  4 is even                  4 ist gerade
             \              /
               Even (NInt 4)
             /              \
  4 est pair                  4 är jämnt
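The translation diagram can be mimicked outside GF. The following is a minimal Python sketch, not GF itself and not how GF is implemented: trees are nested tuples, each concrete syntax is reduced to the two words it uses for Even, and the language keys (Eng, Fre, Ger, Swe) and function names are assumptions made for illustration.

```python
# Toy model of the multilingual grammar above (not GF itself).
# Abstract syntax trees are nested tuples: ("Even", ("NInt", 4)).

# One concrete syntax per language: the words used to linearize Even.
CONCRETE = {
    "Eng": ("is", "even"),
    "Fre": ("est", "pair"),
    "Ger": ("ist", "gerade"),
    "Swe": ("är", "jämnt"),
}

def linearize(tree, lang):
    """Map an abstract syntax tree to a string in the given language."""
    fun, arg = tree
    if fun == "NInt":
        return str(arg)
    if fun == "Even":
        copula, adjective = CONCRETE[lang]
        return " ".join([linearize(arg, lang), copula, adjective])
    raise ValueError(f"unknown function: {fun}")

def parse(sentence, lang):
    """Invert linearize for the single pattern '<int> <copula> <adjective>'."""
    number, copula, adjective = sentence.split()
    if (copula, adjective) != CONCRETE[lang]:
        raise ValueError(f"not a {lang} sentence: {sentence!r}")
    return ("Even", ("NInt", int(number)))

def translate(sentence, source, target):
    """Translate by parsing to the shared tree and linearizing it again."""
    return linearize(parse(sentence, source), target)

print(translate("4 is even", "Eng", "Swe"))
```

The point of the design is that no language pair is ever handled directly: every translation goes through the shared tree, so adding a language means adding one entry, not one entry per pair.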

But is it really so simple?

Difficulties with concrete syntax

Most languages have rules of inflection, agreement, and word order, which have to be obeyed when putting together expressions.

The previous multilingual grammar breaks these rules in many situations:

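  2 and 3 is even
  la somme de 3 et de 5 est pair
  wenn 2 ist gerade, dann 2+2 ist gerade
  om 2 är jämnt, 2+2 är jämnt

All these sentences are grammatically incorrect: the English one breaks number agreement, the French one gender agreement, and the German and Swedish ones word order.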
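Why the first of these sentences comes out wrong can be seen in a small Python sketch of a string-only grammar. It is a toy model under stated assumptions: the conjunction rule and_nat is hypothetical (the grammar shown earlier has no such rule), and phrases are plain strings with no number feature.

```python
# Toy string-only grammar: every phrase is just a string, with no number feature.

def n_int(i):
    return str(i)

def and_nat(x, y):
    # Hypothetical conjunction rule, not part of the grammar shown earlier.
    return f"{x} and {y}"

def even(x):
    # The copula is hard-coded, as in: lin Even x = {s = x.s ++ "is" ++ "even"}
    return f"{x} is even"

print(even(n_int(4)))
print(even(and_nat(n_int(2), n_int(3))))
```

Since a string carries no information about number, even cannot choose between "is" and "are"; the subject would have to be a richer value than a string.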

Solving the difficulties

GF has tools for expressing the linguistic rules that are needed to produce correct translations in different languages.

Instead of just strings, we need parameters, tables, and record types. For instance, in French:

  param Mod = Ind | Subj ;
  param Gen = Masc | Fem ;

  lincat Nat = {s : Str ; g : Gen} ;
  lincat Prop = {s : Mod => Str} ;

  lin Even x = {s =
      table {
        m => x.s ++
             case m   of {Ind  => "est" ;  Subj => "soit"} ++
             case x.g of {Masc => "pair" ; Fem  => "paire"}
      }
    } ;
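For readers more used to mainstream languages, the French rule can be approximated in Python. This is a toy model, not GF: records become dicts, the table over Mod becomes a dict keyed by mood, and the two example arguments are hypothetical values that do not occur in the grammar itself.

```python
# Toy model of the French rule above (not GF itself).
# A Nat is a record with a string and an inherent gender;
# a Prop is a table (here: a dict) from mood to string.

COPULA = {"Ind": "est", "Subj": "soit"}
ADJECTIVE = {"Masc": "pair", "Fem": "paire"}

def even(x):
    """Model of lin Even: build the mood table, agreeing with the gender of x."""
    return {"s": {mood: f"{x['s']} {COPULA[mood]} {ADJECTIVE[x['g']]}"
                  for mood in ("Ind", "Subj")}}

# Hypothetical example arguments; these Nat records are not in the source grammar.
quatre = {"s": "4", "g": "Masc"}
la_somme = {"s": "la somme de 3 et de 5", "g": "Fem"}

print(even(quatre)["s"]["Ind"])
print(even(la_somme)["s"]["Ind"])
print(even(la_somme)["s"]["Subj"])
```

The gender is stored once in the argument and read by the predicate, while the mood is left open in the result so that a larger construction can select it later.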
To learn more about these constructs, consult the GF documentation, e.g. the New Grammarian's Tutorial. In what follows, however, we will show how to avoid learning them and still write linguistically correct grammars.

Language + Libraries

Writing natural language grammars still requires theoretical knowledge about the language.

Which kind of programmer is easier to find: one with this linguistic knowledge, or one without it?

In mainstream programming, sorting algorithms are not written by hand but taken from libraries.

In the same way, we want to create grammar libraries that encapsulate basic linguistic facts.

Compare the Java success story: the language is only half of the success; the libraries are the other half.

Example of library-based grammar writing

To define a Swedish expression of a mathematical predicate from scratch:

  Even x =
    let jämn = case <x.n,x.g> of {
      <Sg,Utr>   => "jämn" ;
      <Sg,Neutr> => "jämnt" ;
      <Pl,_>     => "jämna"
      }
    in
    {s = table {
      Main => x.s ! Nom ++ "är" ++ jämn ;
      Inv  => "är" ++ x.s ! Nom ++ jämn ;
      Sub  => x.s ! Nom ++ "är" ++ jämn
      }
    } ;

To use library functions for syntax and morphology:

  Even = predA (regA "jämn") ;

For the French version, we write

  Even = predA (regA "pair") ;
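What the library call buys can be sketched in Python. This is a toy model and not the code of the GF resource library: reg_a and pred_a are hypothetical stand-ins for regA and predA, the regular paradigm is simplified to the three forms used in the from-scratch rule, and the two noun phrase records are invented for illustration.

```python
# Toy model of library-based grammar writing (not the real GF library code).

def reg_a(stem):
    """Hypothetical stand-in for regA: a regular Swedish adjective table."""
    return {("Sg", "Utr"): stem,
            ("Sg", "Neutr"): stem + "t",
            ("Pl", "Utr"): stem + "a",
            ("Pl", "Neutr"): stem + "a"}

def pred_a(adjective):
    """Hypothetical stand-in for predA: adjectival predication with word order."""
    def predicate(x):
        form = adjective[(x["n"], x["g"])]   # agreement in number and gender
        subject = x["s"]["Nom"]
        return {"s": {"Main": f"{subject} är {form}",
                      "Inv": f"är {subject} {form}",
                      "Sub": f"{subject} är {form}"}}
    return predicate

even = pred_a(reg_a("jämn"))  # mirrors: Even = predA (regA "jämn") ;

# Hypothetical noun phrase records for illustration.
talet = {"s": {"Nom": "talet"}, "n": "Sg", "g": "Neutr"}
talen = {"s": {"Nom": "talen"}, "n": "Pl", "g": "Neutr"}

print(even(talet)["s"]["Main"])
print(even(talen)["s"]["Inv"])
```

The application programmer writes only the last definition of even; agreement and word order stay inside the two library functions.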

Questions in grammar library design

What should there be in the library?

  • morphology, lexicon, syntax, semantics,...

How do we organize and present the library?

  • division into modules, level of granularity
  • "school grammar" vs. sophisticated linguistic concepts

Where do we get the data from?

  • automatic extraction or hand-writing?
  • reuse of existing resources?

Extra constraint: we want open-source free software and hence cannot use existing proprietary resources.

Answers to questions in grammar library design

The current GF resource grammar library has made the following decisions.

The library has, for each language,

  • complete morphology, some lexicon (500 words), a representative fragment of syntax, and very little semantics.

Organization and presentation:

  • division into top-level (API) modules and internal modules (of interest only to resource implementors)
  • the API is, as far as possible, common to the different languages
  • we favour "school grammar" concepts rather than innovative linguistic theory

Where do we get the data from?

  • morphology and syntax are hand-written
  • the 500-word lexicon is hand-written, but a tool is provided for automatic lexicon extraction
  • we have not reused existing resources

The resource grammar library is entirely open-source free software (under the GNU GPL).

The scope of a resource grammar library for a language

All morphological paradigms

Basic lexicon of structural, common, and irregular words

Basic syntactic structures

Currently,

  • no semantics,
  • no language-specific structures unless they are necessary for expressivity.

Success criteria

Grammatical correctness

Semantic coverage: you can express whatever you want.

Usability as a library for non-linguists.

(Bonus for linguists:) nice generalizations w.r.t. language families, using the module system of GF.

These are not our success criteria

Language coverage: to be able to parse all expressions.
Example: the French passé simple tense, although covered by the morphology, is not used in the language-independent API; only the passé composé is. However, an application accessing the French-specific (or Romance-specific) modules can use the passé simple.

Semantic correctness: only to produce meaningful expressions.
Example: the following sentences can be generated

  colourless green ideas sleep furiously

  the time is seventy past forty-two

However, an application grammar can use a domain-specific semantics to guarantee semantic well-formedness.

(Warning for linguists:) theoretical innovation in syntax is not among the goals (and it would be hidden from users anyway!).

So where is semantics?

GF incorporates a Logical Framework and is therefore capable of expressing logical semantics à la Montague or any other flavour, including anaphora and discourse.

But we do not try to give semantics once and for all for the whole language.

Instead, we expect semantics to be given in application grammars built on semantic models of different domains.

Example application: number theory

  fun Even : Nat -> Prop ;         -- a mathematical predicate

  lin Even = predA (regA "even") ; -- English translation
  lin Even = predA (regA "pair") ; -- French translation
  lin Even = predA (regA "jämn") ; -- Swedish translation

How could the resource predict that just these translations are correct in this domain?

Application grammars are built by experts in these domains who, thanks to resource grammars, no longer need to be experts in linguistics.

Languages

The current GF Resource Project covers ten languages. The first three letters of each language name (Dan etc.) are used in grammar module names.

Library structure 1: language-independent API

  • Lang is the top module collecting all of the following.

  • syntactic Categories (parts of speech, word classes), e.g.
      V ; NP ; CN ; Det ;  -- verb, noun phrase, common noun, determiner

  • Rules for combining words and phrases, e.g.
      DetNP : Det -> CN -> NP ; -- combine Det and CN into NP

  • the most common Structural words (determiners, conjunctions, pronouns) (now 83), e.g.
      and_Conj : Conj ;

  • Numerals, number words from 1 to 999,999 with their inflections, e.g.
      n8 : Digit ;

  • Basic lexicon of (now 218) frequent everyday words, e.g.
      man_N : N ;

In addition, and not included in Lang, there is

  • SwadeshLex, a lexicon of (now 206) words from the Swadesh list, e.g.
      squeeze_V : V ;

Of course, there is some overlap between SwadeshLex and the other modules.

Library structure 2: language-dependent modules

  • morphological Paradigms, e.g. Swedish
      mkN : Str -> Str -> Str -> Str -> Gender -> N ; -- worst-case nouns
      mkN : Str -> N ;                                -- regular nouns

  • (in some languages) irregular Verbs, e.g.
      angripa_V = irregV "angripa" "angrep" "angripit" ;

  • (not yet available) Extended syntax with language-specific rules, e.g.
      PassBli : V2 -> NP -> VP ;  -- bli överkörd av ngn ("be run over by someone")

How much can be language-independent?

For the ten languages we have considered, it is possible to implement the current API.

Reservations: