GF Resources for Swedish

Språkdata Seminar, Gothenburg, 1 March 2005

Aarne Ranta

aarne@cs.chalmers.se

Plan

Introduction to resource grammars

Swedish morphology and lexicon in GF

Syntax case study: Swedish determiners

Syntax case study: Swedish sentence structure

Danish and Norwegian through parametrization

GF = Grammatical Framework

A grammar formalism based on functional programming and type theory.

Designed to be nice for ordinary programmers to use.

Mission: to make natural-language applications available for ordinary programmers, in tasks like

Thus not primarily another theoretical framework for linguists.

Multilingual grammars

Abstract syntax: language-independent representation
  cat Prop ; Nat ;
  fun Even : Nat -> Prop ;
Concrete syntax: mapping from abstract syntax trees to strings in a language (English, French, German, Swedish,...)
  lin Even x = {s = x.s ++ "is" ++ "even"} ;     -- English
  lin Even x = {s = x.s ++ "est" ++ "pair"} ;    -- French
  lin Even x = {s = x.s ++ "ist" ++ "gerade"} ;  -- German
  lin Even x = {s = x.s ++ "är" ++ "jämnt"} ;    -- Swedish
We can translate between languages via the abstract syntax.
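
Translation via abstract syntax can be modelled outside GF as well. The following is a minimal Python sketch, not GF code: the tuple encoding of trees, the `Num` constructor, and the function names `linearize`, `parse`, and `translate` are assumptions made for illustration.

```python
# Concrete syntax for the predicate Even: one string per language.
EVEN = {
    "Eng": "is even",
    "Fre": "est pair",
    "Ger": "ist gerade",
    "Swe": "är jämnt",
}

def linearize(tree, lang):
    """Map an abstract syntax tree to a string in the given language."""
    fun, arg = tree
    if fun == "Num":            # hypothetical constructor for numerals
        return str(arg)
    if fun == "Even":
        return f"{linearize(arg, lang)} {EVEN[lang]}"
    raise ValueError(f"unknown function: {fun}")

def parse(sentence, lang):
    """Recover the abstract tree of a sentence of the form '<number> <Even>'."""
    suffix = " " + EVEN[lang]
    if not sentence.endswith(suffix):
        raise ValueError(f"cannot parse: {sentence}")
    return ("Even", ("Num", int(sentence[: -len(suffix)])))

def translate(sentence, source, target):
    """Translate by parsing in one language and linearizing in another."""
    return linearize(parse(sentence, source), target)

print(translate("2 is even", "Eng", "Swe"))
```

The abstract tree `("Even", ("Num", 2))` is the only thing the four languages share; each language contributes just its own string.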

Is it really so simple?

Difficulties with concrete syntax

Most languages have rules of inflection, agreement, and word order, which have to be obeyed when putting together expressions.

The previous multilingual grammar breaks these rules in many situations:

2 and 3 is even
la somme de 3 et de 5 est pair
wenn 2 ist gerade, dann 2+2 ist gerade
om 2 är jämnt, 2+2 är jämnt

Solving the difficulties

GF has tools for expressing the linguistic rules that are needed to produce correct translations in different languages.

Instead of just strings, we need parameters, tables, and record types. For instance, French:

  param Mod = Ind | Subj ;
  param Gen = Masc | Fem ;

  lincat Nat = {s : Str ; g : Gen} ;
  lincat Prop = {s : Mod => Str} ;

  lin Even x = {s =
      table {
        m => x.s ++
             case m   of {Ind  => "est" ;  Subj => "soit"} ++
             case x.g of {Masc => "pair" ; Fem  => "paire"}
      }
    } ;
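
The same idea can be traced in a small Python model (a sketch, not GF): a record becomes a dict, and a table becomes a dict keyed by parameter values. The example records `two` and `somme`, and the choice to treat the numeral as masculine, are assumptions for illustration.

```python
MOODS = ("Ind", "Subj")  # models: param Mod = Ind | Subj

def lin_even(x):
    """Model of 'lin Even x': x is a record {s : Str ; g : Gen},
    the result is a record {s : Mod => Str}."""
    copula = {"Ind": "est", "Subj": "soit"}
    adjective = {"Masc": "pair", "Fem": "paire"}
    return {"s": {m: f"{x['s']} {copula[m]} {adjective[x['g']]}" for m in MOODS}}

# Hypothetical argument records carrying their inherent gender.
two = {"s": "2", "g": "Masc"}
somme = {"s": "la somme de 3 et de 5", "g": "Fem"}

print(lin_even(two)["s"]["Ind"])
print(lin_even(somme)["s"]["Ind"])
print(lin_even(somme)["s"]["Subj"])
```

Gender is stored once in the argument and read by the predicate, whereas mood stays open in the result until the surrounding context selects it.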

Language + Libraries

Writing natural language grammars still requires theoretical knowledge about the language.

Which kind of a programmer is easier to find?

In mainstream programming, sorting algorithms are not written by hand but taken from libraries.

In the same way, we want to create grammar libraries that encapsulate basic linguistic facts.

Cf. the Java success story: the language is only half of the success; the libraries are the other half.

Example of library-based grammar writing

To define a Swedish expression of a mathematical predicate from scratch:
  lin Even x =
    let jämn = case <x.n,x.g> of {
      <Sg,Utr>   => "jämn" ;
      <Sg,Neutr> => "jämnt" ;
      <Pl,_>     => "jämna"
      }
    in
    {s = table {
      Main => x.s ! Nom ++ "är" ++ jämn ;  -- main clause order
      Inv  => "är" ++ x.s ! Nom ++ jämn ;  -- inverted order
      Sub  => x.s ! Nom ++ "är" ++ jämn    -- subordinate clause order
      }
    } ;
To use library functions for syntax and morphology:
  lin Even = predA (regA "jämn") ;
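
To see what such library functions have to encapsulate, here is a Python model (a sketch under assumptions, not the library's actual implementation). The names `reg_a` and `pred_a` merely mirror `regA` and `predA`; the suffix rule covers only adjectives that inflect like "jämn", and the noun record `talet` is a hypothetical example.

```python
def reg_a(stem):
    """Assumed regular adjective paradigm: stem, stem + 't', stem + 'a'."""
    return {
        ("Sg", "Utr"): stem,
        ("Sg", "Neutr"): stem + "t",
        ("Pl", "Utr"): stem + "a",
        ("Pl", "Neutr"): stem + "a",
    }

def pred_a(adj):
    """Build a predicate from an adjective table: agreement plus word order."""
    def lin(x):
        subj = x["s"]["Nom"]
        form = adj[(x["n"], x["g"])]   # agree with the subject's number and gender
        return {"s": {
            "Main": f"{subj} är {form}",
            "Inv": f"är {subj} {form}",
            "Sub": f"{subj} är {form}",
        }}
    return lin

even = pred_a(reg_a("jämn"))

# Hypothetical subject: "talet" (the number), singular neuter.
talet = {"s": {"Nom": "talet", "Gen": "talets"}, "n": "Sg", "g": "Neutr"}

print(even(talet)["s"]["Main"])
print(even(talet)["s"]["Inv"])
```

The application programmer writes only the last definition of `even`; inflection, agreement, and word order stay inside the two library functions.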

Questions in grammar library design

What should there be in the library?

  • morphology, lexicon, syntax, semantics,...

How do we organize and present the library?

  • division into modules, level of granularity
  • "school grammar" vs. sophisticated linguistic concepts

Where do we get the data from?

  • automatic extraction or hand-writing?
  • reuse of existing resources?

Extra constraint: we want open-source free software.

The scope of the resource grammar library

All morphological paradigms

Basic lexicon of structural, common, and irregular words

Basic syntactic structures

Currently,

  • no semantics,
  • no language-specific structures if not necessary for expressivity.

Success criteria

Grammatical correctness

Semantic coverage: you can express whatever you want.

Usability as library for non-linguists.

(Bonus for linguists:) nice generalizations w.r.t. language families, using the module system of GF.

These are not our success criteria

Language coverage: you can parse all expressions. Example: the French passé simple tense, although covered by the morphology, is not used in the language-independent API; only the passé composé is.

Semantic correctness

  colourless green ideas sleep furiously

  the time is seventy past forty-two

(Warning for linguists:) theoretical innovation in syntax (and it will all be hidden anyway!)

So where is semantics?

GF incorporates a Logical Framework and is therefore capable of expressing logical semantics à la Montague or any other flavour, including anaphora and discourse.

But we do not try to give semantics once and for all for the whole language.

Instead, we expect semantics to be given in application grammars built on semantic models of different domains.

Example application: number theory

  fun Even : Nat -> Prop ;         -- a mathematical predicate

  lin Even = predA (regA "even") ; -- English translation
  lin Even = predA (regA "pair") ; -- French translation
  lin Even = predA (regA "jämn") ; -- Swedish translation

How could the resource predict that just these translations are correct in this domain?

Languages

The current GF Resource Project covers ten languages. The first three letters of each language name (Dan etc.) are used in grammar module names.

Library structure 1: language-independent API

  • syntactic Categories (parts of speech, word classes)

  • Rules for combining words and phrases, e.g.

  • the most common Structural words (determiners, conjunctions, pronouns), e.g.

Library structure 2: language-dependent modules

  • morphological Paradigms, e.g.

  • Lexicon of frequent words

  • Extended syntax with language-specific rules

How much can be language-independent?

For the ten languages we have considered, it is possible to implement the current API.

Reservations: