Fourth Version, 2 November 2005.
Third Version, 22 May 2005. Completed 1 July.
Second Version, 1 March 2005
First Draft, 7 February 2005
Aarne Ranta
aarne@cs.chalmers.se
GF was designed to be easy for ordinary programmers to use, that is, programmers without training in linguistics.
The mission of GF is to make natural-language applications available for ordinary programmers, in tasks like
Abstract syntax: language-independent representation
    cat Prop ; Nat ;
    fun Even : Nat -> Prop ;
    fun NInt : Int -> Nat ;
Concrete syntax: mapping from abstract syntax trees to strings in a language (English, French, German, Swedish,...)
    lin Even x = {s = x.s ++ "is" ++ "even"} ;      -- English
    lin Even x = {s = x.s ++ "est" ++ "pair"} ;     -- French
    lin Even x = {s = x.s ++ "ist" ++ "gerade"} ;   -- German
    lin Even x = {s = x.s ++ "är" ++ "jämnt"} ;     -- Swedish
We can translate between languages via the abstract syntax:
    4 is even           4 ist gerade
              \       /
            Even (NInt 4)
              /       \
    4 est pair          4 är jämnt
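The fragments above can be assembled into a complete multilingual grammar. The following is a minimal sketch, not the author's actual files: the module names Arith, ArithEng and ArithFre are assumptions, each module would live in its own .gf file, and all linearization types are plain string records.

```gf
-- Arith.gf (hypothetical module name)
abstract Arith = {
  cat Prop ; Nat ;
  fun Even : Nat -> Prop ;
  fun NInt : Int -> Nat ;
}

-- ArithEng.gf (hypothetical module name)
concrete ArithEng of Arith = {
  lincat Prop = {s : Str} ;
  lincat Nat  = {s : Str} ;
  lin Even x = {s = x.s ++ "is" ++ "even"} ;
  lin NInt n = {s = n.s} ;   -- the predefined Int is linearized as a record with a string field s
}

-- ArithFre.gf (hypothetical module name)
concrete ArithFre of Arith = {
  lincat Prop = {s : Str} ;
  lincat Nat  = {s : Str} ;
  lin Even x = {s = x.s ++ "est" ++ "pair"} ;
  lin NInt n = {s = n.s} ;
}
```

Because both concrete syntaxes share the abstract syntax Arith, the tree Even (NInt 4) linearizes to "4 is even" in one and "4 est pair" in the other. Translation is parsing with one concrete syntax followed by linearization with the other.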
But is it really so simple?
The previous multilingual grammar breaks the grammatical rules of these languages in many situations:
2 and 3 is even
la somme de 3 et de 5 est pair
wenn 2 ist gerade, dann 2+2 ist gerade
om 2 är jämnt, 2+2 är jämnt
All these sentences are grammatically incorrect.
Instead of just strings, we need parameters, tables, and record types. For instance, French:
    param Mod = Ind | Subj ;
    param Gen = Masc | Fem ;
    lincat Nat = {s : Str ; g : Gen} ;
    lincat Prop = {s : Mod => Str} ;
    lin Even x = {s = table {
      m => x.s ++
           case m of {Ind => "est" ; Subj => "soit"} ++
           case x.g of {Masc => "pair" ; Fem => "paire"}
      }
    } ;
To learn more about these constructs, consult GF documentation, e.g. the New Grammarian's Tutorial. However, in what follows we will show how to avoid learning them and still write linguistically correct grammars.
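To see the gender agreement at work, the French rule needs at least one feminine Nat. The sketch below is an illustration under assumptions: the module names ArithAgr and ArithAgrFre and the function Sum are hypothetical additions, chosen so that the faulty sentence "la somme de 3 et de 5 est pair" from above comes out correctly.

```gf
-- ArithAgr.gf (hypothetical module name)
abstract ArithAgr = {
  cat Prop ; Nat ;
  fun Even : Nat -> Prop ;
  fun NInt : Int -> Nat ;
  fun Sum  : Nat -> Nat -> Nat ;   -- hypothetical: "the sum of x and y"
}

-- ArithAgrFre.gf (hypothetical module name)
concrete ArithAgrFre of ArithAgr = {
  param Mod = Ind | Subj ;
  param Gen = Masc | Fem ;
  lincat Nat  = {s : Str ; g : Gen} ;
  lincat Prop = {s : Mod => Str} ;
  -- numerals are masculine
  lin NInt n = {s = n.s ; g = Masc} ;
  -- "la somme" is feminine, whatever the gender of its arguments
  lin Sum x y = {
    s = "la" ++ "somme" ++ "de" ++ x.s ++ "et" ++ "de" ++ y.s ;
    g = Fem
  } ;
  -- the adjective agrees with the inherent gender of the argument
  lin Even x = {s = table {
    m => x.s ++
         case m of {Ind => "est" ; Subj => "soit"} ++
         case x.g of {Masc => "pair" ; Fem => "paire"}
    }
  } ;
}
```

For the tree Even (Sum (NInt 3) (NInt 5)), the Ind form is "la somme de 3 et de 5 est paire". The gender comes from the g field of Sum, so the adjective takes its feminine form.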
Which kind of programmer is easier to find?
In mainstream programming, sorting algorithms are not written by hand but taken from libraries.
In the same way, we want to create grammar libraries that encapsulate basic linguistic facts.
Cf. the Java success story: the language is only half of the success; the libraries are the other half.
    Even x =
      let jämn = case <x.n,x.g> of {
            <Sg,Utr>   => "jämn" ;
            <Sg,Neutr> => "jämnt" ;
            <Pl,_>     => "jämna"
          }
      in {s = table {
            Main => x.s ! Nom ++ "är" ++ jämn ;
            Inv  => "är" ++ x.s ! Nom ++ jämn ;
            Sub  => x.s ! Nom ++ "är" ++ jämn
          }
      }
To use library functions for syntax and morphology:
    Even = predA (regA "jämn") ;
For the French version, we write
Even = predA (regA "pair") ;
How do we organize and present the library?
Where do we get the data from?
The library has, for each language
Organization and presentation:
Where do we get the data from?
Basic lexicon of structural, common, and irregular words
Basic syntactic structures
Currently,
Semantic coverage: you can express whatever you want.
Usability as library for non-linguists.
(Bonus for linguists:) nice generalizations w.r.t. language families, using the module system of GF.
Semantic correctness: only to produce meaningful expressions.
Example: the following sentences can be generated
    colourless green ideas sleep furiously
    the time is seventy past forty-two
However, an application grammar can use a domain-specific semantics to guarantee semantic well-formedness.
(Warning for linguists:) theoretical innovation in syntax is not among the goals (and it would be hidden from users anyway!).
But we do not try to give semantics once and for all for the whole language.
Instead, we expect semantics to be given in application grammars built on semantic models of different domains.
Example application: number theory
    fun Even : Nat -> Prop ;            -- a mathematical predicate
    lin Even = predA (regA "even") ;    -- English translation
    lin Even = predA (regA "pair") ;    -- French translation
    lin Even = predA (regA "jämn") ;    -- Swedish translation
How could the resource predict that just these translations are correct in this domain?
Application grammars are built by experts in these domains who, thanks to resource grammars, no longer need to be experts in linguistics.
V ; NP ; CN ; Det ; -- verb, noun phrase, common noun, determiner
DetNP : Det -> CN -> NP ; -- combine Det and CN into NP
    and_Conj : Conj ;
Numerals, number words from 1 to 999,999 with their inflections, e.g.
    n8 : Digit ;
Basic lexicon of (now 218) frequent everyday words
man_N : N ;
In addition, and not included in Lang, there is
    squeeze_V : V ;
Of course, there is some overlap between SwadeshLex and the other modules.
    mkN : Str -> Str -> Str -> Str -> Gender -> N ;  -- worst-case nouns
    mkN : Str -> N ;                                 -- regular nouns
angripa_V = irregV "angripa" "angrep" "angripit" ;
PassBli : V2 -> NP -> VP ; -- bli överkörd av ngn
Reservations:
Two alternative views on sentence formation by predication: Clause, Verbphrase
Finnish paradigms
example use of Finnish paradigms
French paradigms
example use of French paradigms
French verbs
Italian paradigms
example use of Italian paradigms
Italian verb conjugations
Norwegian paradigms
example use of Norwegian paradigms
Norwegian verbs
Spanish paradigms
example use of Spanish paradigms
Spanish verb conjugations
Swedish paradigms
example use of Swedish paradigms
Swedish verbs
    i english/LangEng.gf
    i swedish/LangSwe.gf
Alternatively, you can make a precompiled package of all the languages by using lib/resource/Makefile:
    make
    gf langs.gfcm
Then you can test with translation, random generation, morphological analysis...
    > p -lang=LangEng "I have loved her." | l -lang=LangFre
    Je l' ai aimée.

    > gr -cat=NP | l -multi
    The sock
    Strumpan
    Strømpen
    La media
    La calza
    La chaussette
    Sukka
    i french/VerbsFre.gf
    mq -cat=V
Morpho quiz with phrases (e.g. Swedish clauses):
    i swedish/LangSwe.gf
    mq -cat=Cl
Translation quiz with sentences (e.g. sentences from English to Swedish):
    i english/LangEng.gf
    i swedish/LangSwe.gf
    tq -cat=S LangEng LangSwe
    concrete AppNor of App = open LangNor, ParadigmsNor in {...}
(Note for the users of GF 2.1 and older: the dummy reuse modules and their bulky .gfr versions are no longer needed!)
If you need to convert resource records to strings, and don't want to know the concrete type (as you never should), you can use
    Predef.toStr : (L : Type) -> L -> Str ;
L must be a linearization type. For instance,
toStr LangNor.CN (ModAP (PositADeg old_ADeg) (UseN car_N)) ---> "gammel bil"
Using the -v option shows if the parser fails because of unknown words.
    > p -cat=S -v -lexer=words "jag ska åka till Chalmers"
    unknown tokens [TS "åka",TS "Chalmers"]
Then try to select words that LangX recognizes:
    > p -cat=S "jag ska springa till Danmark"
    UseCl (PosTP TFuture ASimul)
      (AdvCl (SPredV i_NP run_V)
             (AdvPP (PrepNP to_Prep (UsePN (PNCountry Denmark)))))
Use these API structures and extend the vocabulary to match your needs.
    åka_V = lexV "åker" ;
    Chalmers = regPN "Chalmers" neutrum ;
    gfeditor LangEng.gf LangFre.gf
opens the editor with English and French views. The Editor User Manual gives more information on the use of the editor.
A restriction of the editor is that it does not give access to ParadigmsX modules. An IDE environment extending the editor to a grammar programming tool is work in progress.
    Who chases mice ?
    Whom does the lion chase ?
    The dog chases cats.
We build the abstract syntax in two phases:
The concrete syntax of English is built in three phases:
The concrete syntax of Swedish is built upon QuestionsI in a similar way, with the modules QuestionsSwe and AnimalsSwe.
The concrete syntax of French consists similarly of the modules QuestionsFre and AnimalsFre.
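The abstract syntax modules themselves are not reproduced in this text. The following is a hypothetical sketch of what the two phases could look like: a general question module extended by a domain lexicon. The module names Questions and Animals, the category names, and the function names are all assumptions. They were chosen to fit the three example sentences and the argument order of the rule lin Who love_V2 man_N quoted later in this document.

```gf
-- Questions.gf (hypothetical): general question and answer forms
abstract Questions = {
  cat Phrase ; Entity ; Action ;
  fun Who    : Action -> Entity -> Phrase ;            -- who chases mice ?
  fun Whom   : Entity -> Action -> Phrase ;            -- whom does the lion chase ?
  fun Answer : Entity -> Action -> Entity -> Phrase ;  -- the dog chases cats
}

-- Animals.gf (hypothetical): domain lexicon extending Questions
abstract Animals = Questions ** {
  fun Dog, Cat, Mouse, Lion : Entity ;
  fun Chase : Action ;
}
```

Splitting the grammar this way keeps the question forms reusable: another domain only has to supply a different extension of Questions.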
To produce an end-user multilingual grammar animals.gfcm, write the sequence of compilation commands in a gfs (GF script) file, say mkAnimals.gfs, and then call GF with
    gf <mkAnimals.gfs
To try out the grammar,
    > i animals.gfcm
    > gr | l -multi
    vem jagar hundar ?
    qui chasse des chiens ?
    who chases dogs ?
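The contents of mkAnimals.gfs are not shown in this text. A script along the following lines would fit the description. It is a sketch under assumptions: the file names AnimalsFre.gf and AnimalsSwe.gf are inferred from the module names above, and the last line assumes the GF 2 shell commands pm (print multilingual grammar) and wf (write file).

```gf
i -ex AnimalsEng.gf
i AnimalsFre.gf
i AnimalsSwe.gf
pm | wf animals.gfcm
```

The first three lines import the three concrete syntaxes, with -ex on the first import only. The last line writes the combined grammar to animals.gfcm.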
You can use the resource grammar as a parser on a special file format, .gfe ("GF examples"). Here is the real source, QuestionsI.gfe, which generated QuestionsI.gf when you executed the GF command
    i -ex AnimalsEng.gf
Since QuestionsI is an incomplete module ("functor"), it need only be built once. This is why only the first command in mkAnimals.gfs needs the flag -ex.
Of course, the grammar of any language can be created by parsing any language, as long as they have a common resource API. The use of the English resource is generally recommended, because it is smaller and faster to parse than the other languages.
The rule
    lin Who love_V2 man_N = in Phr "who loves men ?" ;
uses as argument variables constants for words that can be found in the lexicon. It is due to this that the example can be parsed. When the resulting rule,
    lin Who love_V2 man_N =
      QuestPhrase (UseQCl (PosTP TPresent ASimul)
        (QPredV2 who8one_IP love_V2 (IndefNumNP NoNum (UseN man_N)))) ;
is read by the GF compiler, the identifiers love_V2 and man_N are not treated as constants, but, following the normal binding rules of functional languages, as bound variables. This is what gives the example method the generality that is needed.
To write linearization rules by examples one thus has to know at least one abstract syntax constant for each category for which one needs a variable.
    lin Pope = in NP "the man" {man_N = regN "pope"} ;
The resulting linearization rule is initially
    lin Pope = DefOneNP (UseN man_N) ;
but the substitution changes this to
    lin Pope = DefOneNP (UseN (regN "pope")) ;
In this way, you do not have to extend the resource lexicon, but you need to open the Paradigms module to compile the resulting term.
Of course, the substituted expressions may come from another language than the main language of the example:
    lin Pope = in NP "the man" {man_N = regN "pape" masculine} ;
If many substitutions are needed, semicolons are used as separators:
    {man_N = regN "pope" ; walk_V = regV "pray"} ;
Each of the API implementations uses the following auxiliary resource modules:
(Much more to be written...)
For this, you need
A useful command to test opers:
    i -retain MorphoRot.gf
    cc regNoun "foo"
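MorphoRot.gf is not shown here. To try the commands on something small, a resource module like the following can stand in for it. This is a hypothetical sketch: the module name MorphoDemo, the parameter type Number and the English-style plural rule are assumptions made for illustration only.

```gf
-- MorphoDemo.gf (hypothetical stand-in for a real morphology module)
resource MorphoDemo = {
  param Number = Sg | Pl ;

  -- a noun is an inflection table from number to string
  oper Noun : Type = {s : Number => Str} ;

  -- regular nouns: the plural glues "s" onto the singular form
  oper regNoun : Str -> Noun = \w ->
    {s = table {Sg => w ; Pl => w + "s"}} ;
}
```

After i -retain MorphoDemo.gf, the command cc regNoun "foo" computes the record for the stem "foo", with "foo" as the Sg form and "foos" as the Pl form. The -retain flag keeps the oper definitions available to cc.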
Language  | v0.6 | v0.9 API | Paradigms | Basic lex | Verbs |
Danish    | -    | X        | -         | -         | -     |
English   | X    | X        | X         | X         | X     |
Finnish   | X    | +        | X         | X         | 0     |
French    | X    | X        | X         | X         | X     |
German    | X    | -        | *         | -         | -     |
Italian   | X    | X        | X         | X         | X     |
Norwegian | -    | X        | X         | X         | X     |
Russian   | X    | *        | *         | -         | -     |
Spanish   | -    | X        | X         | X         | X     |
Swedish   | X    | X        | X         | X         | X     |
Danish
English: missing uncontracted negations.
Finnish: compiling the heuristic paradigms is slow; possessive and interrogative suffixes have no proper lexer.
French: no inverted questions; some verbs in Basic should be reflexive
German
Italian: no omission of unstressed subject pronouns; some verbs in Basic should be reflexive; bad forms of reflexive infinitives
Norwegian: possessives of type bilen min not included
Russian
Spanish: no omission of unstressed subject pronouns; no switch to dative case for human objects; some verbs in Basic should be reflexive; bad forms of reflexive infinitives; spurious parameter for verb auxiliary inherited from Romance
Swedish:
The very latest version of GF and its libraries is in Snapshots.