GF Resource Grammar Library v. 1.1

Author: Aarne Ranta <aarne (at) cs.chalmers.se>
Last update: Thu Apr 19 23:35:29 2007

The GF Resource Grammar Library defines the basic grammar of ten languages: Danish, English, Finnish, French, German, Italian, Norwegian, Russian, Spanish, and Swedish. A still-incomplete implementation of Arabic is also included.

New in Version 1.1

Simpler APIs using overloading:
- Constructors: almost all trees in a category C can be built by the function mkC.
- Combinators: grammatical functions that cut across categories: predication, application, modification, coordination.
- Symbolic: noun phrases with mathematical symbols.

An example of use is logic. The API of version 1.0 remains valid and can be used in combination with the new APIs.

Some new functions.
Bug fixes.

Authors

Inger Andersson and Therese Soderberg (Spanish morphology), Nicolas Barth and Sylvain Pogodalla (French verb list), Ali El Dada (Arabic modules), Janna Khegai (Russian modules), Bjorn Bringert (many Swadesh lexica), Carlos Gonzalía (Spanish cardinals), Harald Hammarström (German morphology), Patrik Jansson (Swedish cardinals), Andreas Priesnitz (German lexicon), Aarne Ranta.

We are grateful for contributions and comments to several other people who have used this and the previous versions of the resource library, including Ludmilla Bogavac, Ana Bove, David Burke, Lauri Carlson, Gloria Casanellas, Karin Cavallin, Robin Cooper, Hans-Joachim Daniels, Elisabet Engdahl, Markus Forsberg, Kristofer Johannisson, Anni Laine, Peter Ljunglöf, Saara Myllyntausta, Wanjiku Ng'ang'a, Jordi Saludes.

License

The GF Resource Grammar Library is open-source software licensed under the GNU General Public License. See the file LICENSE for more details.

Scope

Coverage, for each language:

complete morphology
lexicon of the ca. 100 most important structural words
test lexicon of ca. 300 content words (rough equivalents in each language)
list of irregular verbs (separately for each language)
representative fragment of syntax (cf. CLE (Core Language Engine))
rather flat semantics (cf. Quasi-Logical Form of CLE)

Organization:

top-level (API) modules
Ground API + special-purpose APIs
"school grammar" concepts rather than advanced linguistic theory

Presentation:

tool gfdoc for generating HTML from grammars
example collections

Quick start

Go to the main directory, compile the grammars, and run a test.

    cd GF/lib/resource-1.0
    make
    make test

This will take quite some time. An alternative is to use the precompiled grammar package compiled.tgz, which provides the necessary gfc and gfr files in the following directories under GF/lib:

    GF/lib/alltenses
    GF/lib/mathematical
    GF/lib/multimodal
    GF/lib/present

For instance:

    cd GF/lib/
    gf
      > i -path=present:prelude present/LangEng.gfc
      > gr -cat=S -number=3 -cf | tb

For more examples, see the Overview slides. The make procedure does not build Arabic, but it can be compiled in the same way as the other languages.

Encoding

Finnish, German, Romance, and Scandinavian languages are in ISO Latin-1 (ISO-8859-1).

Arabic and Russian are in UTF-8.

English is in pure ASCII.
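
When Latin-1 output has to be viewed or processed in a UTF-8 environment, an explicit conversion may be needed. The following is a minimal sketch using the standard iconv tool; the file names and the sample word are hypothetical and are not part of the library.

```shell
# Write a hypothetical Latin-1 sample: the byte \344 (octal) is "ä" in ISO-8859-1.
printf 'b\344r\n' > sample-latin1.txt

# Convert from ISO-8859-1 to UTF-8 with the standard iconv tool.
iconv -f ISO-8859-1 -t UTF-8 sample-latin1.txt > sample-utf8.txt

# Show the resulting bytes in hex; "ä" is now the two-byte sequence c3 a4.
od -An -tx1 sample-utf8.txt
```

The English files need no conversion, since pure ASCII is valid in both encodings.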

The language independent ground API

This API is accessible from both present and alltenses. It is divided into a number of abstract modules. The following figure gives the dependencies of these modules.

The documentation of the individual modules:

Common: abstract notions with language-independent implementations
Cat: the category system
Noun: construction of nouns and noun phrases
Adjective: construction of adjectival phrases
Verb: construction of verb phrases
Adverb: construction of adverbial phrases
Numeral: construction of cardinal and ordinal numerals
Sentence: construction of sentences and imperatives
Question: construction of questions
Relative: construction of relative clauses
Conjunction: coordination of phrases
Phrase: construction of the major units of text and speech
Text: construction of texts from phrases, using punctuation
Idiom: idiomatic phrases, such as existentials
Structural: a lexicon of structural words
Lexicon: a lexicon of other common words, for test purposes
Grammar: the main module comprising all but Lexicon
Lang: the main module comprising both Grammar and Lexicon

The language-dependent APIs

ParadigmsDan: Danish lexical paradigms
ParadigmsEng: English lexical paradigms
ParadigmsFin: Finnish lexical paradigms
ParadigmsFre: French lexical paradigms
ParadigmsIta: Italian lexical paradigms
ParadigmsGer: German lexical paradigms
ParadigmsNor: Norwegian lexical paradigms
ParadigmsRus: Russian lexical paradigms
ParadigmsSpa: Spanish lexical paradigms
ParadigmsSwe: Swedish lexical paradigms

IrregDan: Danish irregular verbs (very incomplete)
IrregEng: English irregular verbs
IrregFre: French irregular verbs
IrregGer: German irregular verbs
IrregNor: Norwegian irregular verbs (very incomplete)
IrregSpa: Spanish irregular verbs
IrregSwe: Swedish irregular verbs

This is the structure of each language-dependent top module.

Extra: extra constructs implemented in some languages
ExtraDan: extra constructs in Danish
ExtraEng: extra constructs in English
ExtraFin: extra constructs in Finnish
ExtraFre: extra constructs in French
ExtraIta: extra constructs in Italian
ExtraNor: extra constructs in Norwegian
ExtraRus: extra constructs in Russian
ExtraScand: extra constructs in Scandinavian
ExtraSpa: extra constructs in Spanish
ExtraSwe: extra constructs in Swedish

Danish: Danish with all extras
English: English with all extras
Finnish: Finnish with all extras
French: French with all extras
German: German with all extras
Italian: Italian with all extras
Norwegian: Norwegian with all extras
Russian: Russian with all extras
Spanish: Spanish with all extras
Swedish: Swedish with all extras

Special-purpose APIs

Present

The API is the same as for the full ground API, but the compiler has ignored all verb and sentence tenses except the present. Lines ignored in the source files are marked by --# notpresent. The result is a smaller and more efficient grammar, which is still sufficient for many applications.
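
The effect of the marker can be approximated with a plain line filter. This is only a sketch of the idea, not necessarily how the library build does it: the file Demo.gf and its three lines are made up for illustration, and only the --# notpresent marker is taken from the text above.

```shell
# Create a small hypothetical source file; only the marker comes from the library.
cat > Demo.gf <<'EOF'
TPres : Tense ;
TPast : Tense ; --# notpresent
TFut : Tense ; --# notpresent
EOF

# Drop every line carrying the marker. The bare '--' ends option parsing,
# so the pattern is not mistaken for a grep option.
grep -v -- '--# notpresent' Demo.gf > DemoPresent.gf

cat DemoPresent.gf
# prints: TPres : Tense ;
```

Running the same grep without -v lists the lines that the present-only version leaves out.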

Multimodal

The API is the same as for the full ground API, but with modified linearization types of NP and Adv, and of all other categories depending on them: an extra field is added to hold a demonstrative pointing gesture. Some functions for constructing demonstratives are provided.

Multi: main module for multimodal dialogue systems

Mathematical

Mathematical: main module for mathematical language
Predication: predication with verbs, adjectives, etc
Symbol: symbols and numbers in text

Using the library

The compiled version

The simplest way to get the library is to install the precompiled version lib/compiled.tgz. Just do

    cd GF/lib
    tar xvfz compiled.tgz

There is no need to link application grammars to the source directories of the library. Use one (or several) of the following packages instead:

lib/alltenses: the complete ground-API library with all forms
lib/present: a pruned ground-API library with present tense only
lib/mathematical: special-purpose API for mathematical applications
lib/multimodal: the complete ground-API with demonstratives for multimodal dialogue applications

Linking applications to libraries

Typically, open one of the following, where X is a language code such as Eng:

GrammarX for just syntax
LangX for both syntax and a small lexicon
the full language name (e.g. English) for syntax, lexicon, and language-dependent extensions

Usually you also need your own lexicon, and hence have to open

ParadigmsX for lexicon-building functions

It is advisable to use the bare package names in paths pointing to the libraries. Here is an example, from examples/dialogue/LightsEng.gf:

    --# -path=.:alltenses:multimodal:prelude

To reach these directories from anywhere, set the environment variable GF_LIB_PATH to point to the directory GF/lib/. For instance, I have the following line in my .bashrc file:

    export GF_LIB_PATH=/home/aarne/GF/lib

The mathematical API shares modules with present. It is therefore not a good idea to use it in combination with alltenses.
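
Before compiling an application, it can be useful to confirm that GF_LIB_PATH points at a directory containing the packages named in this document. The following is a minimal sketch: the function name check_gf_packages and the fallback location $HOME/GF/lib are assumptions made for illustration, not part of GF.

```shell
# check_gf_packages DIR: report which of the package directories named in
# this document exist under DIR (DIR is what GF_LIB_PATH would point to).
check_gf_packages() {
  dir="$1"
  for pkg in alltenses present mathematical multimodal prelude; do
    if [ -d "$dir/$pkg" ]; then
      echo "found   $pkg"
    else
      echo "missing $pkg"
    fi
  done
}

# Fall back to $HOME/GF/lib (an assumed location) when the variable is unset.
check_gf_packages "${GF_LIB_PATH:-$HOME/GF/lib}"
```

A package reported as missing has to be compiled or unpacked before it can be used as a bare name in a -path flag.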

Using the libraries as top-level grammars

If you have done make in lib/resource-1.0, you will have a file langs.gfcm. This file can be used with fast startup for tasks such as treebank generation:

    > i -nocf langs.gfcm
    > gr -cat=S -cf -number=10 | tb

The -nocf flag saves startup time and memory by preventing the creation of context-free parse grammars. The resource grammar libraries do not support parsing very well. While it is theoretically possible to parse with any GF grammar, the resource grammars are so abstract and complex that building the actual parser in memory may simply require too many resources to succeed.

An exception is LangEng. It is actually feasible to parse with both alltenses/LangEng and present/LangEng - the latter being much faster than the former. The -fcfg flag (fast multiple context-free grammar) must be used:

    p -lang=LangEng -fcfg "this man is old"

Parsing with the -fcfg flag takes a few extra seconds the first time in each session, but is faster on later runs. From GF 2.6 on, fcfg is the default parser of GF and the flag is not needed.

It is also possible to parse in Scandinavian languages (Danish, Norwegian, Swedish) and, with enough memory (gf +RTS -K512M), German.

Example applications

These applications are meant to serve as starting points for new applications, showing how the libraries can be used in typical situations.

Bronzeage

The examples/bronzeage grammar set implements a language fragment based on the Swadesh list of 200 words. It is useful for things like language training.

Dialogue

The examples/dialogue grammar set implements the user grammars of some multimodal dialogue system. Its purpose is to serve as a prototype for applications in the TALK project.

Animals

The examples/animal grammar set implements some queries about animals. Its purpose is to serve as a prototype for example-based grammar writing.

Known bugs and missing components

Danish

the lexicon and chosen inflections are only partially verified

English

Finnish

wrong cases in some passive constructions

French

multiple clitics (with V3) not always right
third person pronominal questions with inverted word order have wrong forms if "t" is required (e.g. "comment fera-t-il" becomes "comment fera il")

German

Italian

multiple clitics (with V3) not always right

Norwegian

the lexicon and chosen inflections are only partially verified

Russian

some functions missing
some regular paradigms are missing

Spanish

multiple clitics (with V3) not always right
missing contractions with imperatives and clitics

Swedish