Current funding
Previous funding
Main applications
Staff contributions to grammar libraries:
Student projects on grammar libraries:
Technology, also:
Various grammar library contributions from the multilingual Chalmers community:
Resource library patches and suggestions from the WebALT staff:
Libraries are the main device for the division of labour in programming.
Instead of writing a sorting algorithm over and over again, programmers take it from a library. You write (in Haskell)
Data.List.sort xs
instead of a lot of code actually implementing sorting.
Practical advantages:
Libraries promote abstraction: you abstract away from details.
The use of libraries is therefore a good programming style.
It is also scientifically interesting to create libraries: you have to think about abstractions on your domain of expertise.
Notice: libraries can bring abstraction to almost any language, as long as it supports functions or macros.
Example: we want to create a GUI (Graphical User Interface) button that says yes, and localize it to different languages:
Yes Ja Kyllä Oui Ja Sì
Possible ways to do this:
yesButton english = button "Yes"
yesButton swedish = button "Ja"
yesButton finnish = button "Kyllä"
3. Use a library Text such that you can write
yesButton lang = button (Text.render lang Text.Yes)
The library has an API (Application Programmer's Interface) with:
Yes : Text
No : Text
render : Language -> Text -> String
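A minimal Haskell sketch of such a Text library could look as follows. It assumes that Language is a plain enumeration and flattens the qualified names Text.render and Text.Yes into one module. The button function is a hypothetical stand-in for a GUI toolkit, and the Swedish "Nej" is added for illustration.

```haskell
-- Sketch of the Text library; the Language type and the module layout
-- are assumptions made for this example.
data Language = English | Swedish | Finnish
  deriving (Show, Eq)

data Text = Yes | No
  deriving (Show, Eq)

-- render picks the canned string for a text in a given language.
render :: Language -> Text -> String
render English Yes = "Yes"
render English No  = "No"
render Swedish Yes = "Ja"
render Swedish No  = "Nej"
render Finnish Yes = "Kyllä"
render Finnish No  = "Ei"

-- Hypothetical stand-in for a GUI toolkit's button constructor.
button :: String -> String
button label = "[" ++ label ++ "]"

-- The application code no longer mentions any concrete language.
yesButton :: Language -> String
yesButton lang = button (render lang Yes)
```

The application programmer only touches yesButton; adding a language means adding equations to render, not changing the application.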
This is what you often see as a feedback from a program:
You have 1 messages.
Or perhaps with a little more thought:
You have 1 message(s).
The code that should be written is of course
mess n = "You have" +++ show n +++ messages ++ "."
  where
    messages = if n == 1 then "message" else "messages"
    x +++ y  = x ++ " " ++ y   -- join two strings with a space
(E.g. VoiceXML supports this.)
The same as with "Yes": you have to know the words "you", "have", "message".
Moreover, you have to know the inflection of the equivalent of "message":
if n == 1 then "meddelande" else "meddelanden"
Moreover, you have to know how the noun agrees with different numbers (e.g. in Arabic):
if n == 1 then "risAlaö"
else if n == 2 then "risAlatAn"
else if n < 11 then "rasA'il"
else "risAlaö"
You also have to know the case required by the verb "have", e.g. in Finnish:
1 viesti -- nominative
4 viestiä -- partitive
Moreover, you have to know what is the proper way to politely address the user:
Du har 3 meddelanden / Ni har 3 meddelanden
Vous avez 3 messages / Tu as 3 messages
(This can also depend on country and the kind of program.)
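To make the amount of required knowledge concrete, here is a sketch in Haskell that collects the noun forms quoted above into one function. The Language type and the name messageForm are assumptions of this example. The Finnish rule generalizes from the two forms shown (nominative after 1, partitive otherwise), and the Arabic forms use the same transliteration as the text.

```haskell
-- Hypothetical enumeration of the languages discussed in the text.
data Language = English | Swedish | Finnish | Arabic
  deriving (Show, Eq)

-- Form of the noun "message" required after the number n.
messageForm :: Language -> Int -> String
messageForm English n = if n == 1 then "message" else "messages"
messageForm Swedish n = if n == 1 then "meddelande" else "meddelanden"
-- Assumption: nominative after 1, partitive after every other number.
messageForm Finnish n = if n == 1 then "viesti" else "viestiä"
-- Same four-way rule as in the text.
messageForm Arabic n
  | n == 1    = "risAlaö"
  | n == 2    = "risAlatAn"
  | n < 11    = "rasA'il"
  | otherwise = "risAlaö"
```

Even this covers only one noun and ignores word order, the verb, and politeness, which is the point of the following slides.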
In analogy with the "Yes" case, you write
mess lang n = render lang (Text.YouHaveMessages n)
Hmm, is this so smart? What about if you want to say
You have 4 documents.
You have 5 jewels.
I have 7 surprises.
It is time to move from canned text to a grammar.
You may want to write
mess lang n = render lang (Have PolYou (Num n Message))
sword lang n = render lang (Have FamYou (Num n Jewel))
surpr lang n = render lang (Have I (Num n Surprise))
For this purpose, you need a library with the API
Have : NounPhrase -> NounPhrase -> Sentence
PolYou : NounPhrase
FamYou : NounPhrase
Num : Int -> Noun -> NounPhrase
Message : Noun
Jewel : Noun
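One way to read this API is as a set of Haskell data constructors plus a renderer. The sketch below is an illustration only: it renders English alone (renderEng is a hypothetical name, where the library function would take a Language), it adds the I and Surprise constructors used in the examples, and its plural rule just appends "s".

```haskell
import Data.Char (toUpper)

-- The API constructors as Haskell data types (a sketch, not the library).
data Noun = Message | Jewel | Surprise

data NounPhrase = PolYou | FamYou | I | Num Int Noun

data Sentence = Have NounPhrase NounPhrase

-- Noun with number agreement; naive "s" plural, enough for these nouns.
noun :: Noun -> Int -> String
noun n k = stem ++ (if k == 1 then "" else "s")
  where
    stem = case n of
      Message  -> "message"
      Jewel    -> "jewel"
      Surprise -> "surprise"

nounPhrase :: NounPhrase -> String
nounPhrase PolYou    = "you"   -- English does not mark politeness
nounPhrase FamYou    = "you"
nounPhrase I         = "I"
nounPhrase (Num k n) = show k ++ " " ++ noun n k

-- The verb agrees with the subject: "1 message has", otherwise "have".
haveForm :: NounPhrase -> String
haveForm (Num 1 _) = "has"
haveForm _         = "have"

renderEng :: Sentence -> String
renderEng (Have subj obj) =
  capitalize (nounPhrase subj) ++ " " ++ haveForm subj ++ " " ++ nounPhrase obj ++ "."
  where
    capitalize []       = []
    capitalize (c : cs) = toUpper c : cs
```

For instance, renderEng (Have I (Num 7 Surprise)) yields "I have 7 surprises." without the application code handling any inflection.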
The library API for language will certainly grow big and become difficult to use. Why couldn't I just write
mess lang n = render lang (parse english "you have n messages")
To this end, the API should provide the top-level function
parse : Language -> String -> Sentence
The library that we will present actually has this as well!
The only complication is that parse does not always return just one sentence. There may be zero:
"you have n mesaggse"
or many:
"you have n messages"
Have PolYou (Num n Message)
Have FamYou (Num n Message)
Have PlurYou (Num n Message)
Thus some amount of interaction is needed.
The library has construction functions like
Have : NounPhrase -> NounPhrase -> Sentence
PolYou : NounPhrase
These functions build grammatical structures, which can have different realizations in different languages.
Therefore we also need realization functions,
render : Language -> Sentence -> String
parse : Language -> String -> [Sentence]
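The list-valued result type of parse can be illustrated with a toy parser. This is a sketch under strong assumptions: parseEng is a hypothetical English-only function that recognizes exactly one sentence pattern, with a concrete number in place of n. It returns no trees for unknown input and three trees for the ambiguous "you".

```haskell
import Text.Read (readMaybe)

data Noun = Message
  deriving (Show, Eq)

data NounPhrase = PolYou | FamYou | PlurYou | Num Int Noun
  deriving (Show, Eq)

data Sentence = Have NounPhrase NounPhrase
  deriving (Show, Eq)

-- Toy parser for the single pattern "you have <number> message(s)".
-- Zero results: the input is not covered. Several results: ambiguity.
parseEng :: String -> [Sentence]
parseEng s = case words s of
  ["you", "have", k, w]
    | Just n <- readMaybe k, w == nounForm n ->
        [Have you (Num n Message) | you <- [PolYou, FamYou, PlurYou]]
  _ -> []
  where
    nounForm n = if n == 1 then "message" else "messages"
```

An application would show the alternatives to the user, or pick one by a default such as always choosing the polite form.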
Both of them require linguistic expertise to write, but once this is done, they can be used with very little linguistic knowledge by application programmers!
GF = Grammatical Framework
Those who know GF have already seen the introduction as a seduction argument leading to GF.
In GF,
Simplest possible example:
abstract Text = {
  cat Text ;
  fun Yes : Text ;
  fun No : Text ;
}
concrete TextEng of Text = {
  lin Yes = ss "yes" ;
  lin No = ss "no" ;
}
concrete TextFin of Text = {
  lin Yes = ss "kyllä" ;
  lin No = ss "ei" ;
}
The realization function is, for each language, implemented by linearization rules (lin).
The linearization rules directly give the render method:
render english x = TextEng.lin x
The GF formalism moreover has the property of reversibility:
multilingual grammar = abstract syntax + concrete syntaxes
Examples of the idea:
An abstract syntax has other names:
The concrete syntax defines how the ontology is represented in a language.
The following requirements are made:
Benefit: translation via semantic model of domain can reach high quality.
Problem: the expertise of both a linguist and a domain expert is required.
Arithmetic of natural numbers: abstract syntax
cat Prop ; Nat ;
fun Even : Nat -> Prop ;
Concrete syntax: mapping from abstract syntax trees to strings in a language (English, French, German, Swedish,...)
lin Even x = {s = x.s ++ "is" ++ "even"} ; -- English
lin Even x = {s = x.s ++ "est" ++ "pair"} ; -- French
lin Even x = {s = x.s ++ "ist" ++ "gerade"} ; -- German
lin Even x = {s = x.s ++ "är" ++ "jämnt"} ; -- Swedish
We can translate using the abstract syntax as interlingua:
4 is even      4 ist gerade
         \    /
      Even (NInt 4)
         /    \
4 est pair     4 är jämnt
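Translation through the abstract syntax can be sketched in Haskell as parsing in one language followed by linearization in another. The names evenWords, linearize and translate are assumptions of this example, and the parser only covers sentences of the form "number copula adjective"; GF derives both directions from the lin rules instead.

```haskell
import Text.Read (readMaybe)

data Language = English | French | German | Swedish
  deriving (Show, Eq)

-- Abstract syntax: cat Nat, Prop; fun Even : Nat -> Prop.
newtype Nat = NInt Int
  deriving (Show, Eq)

data Prop = Even Nat
  deriving (Show, Eq)

-- Copula and adjective per language, taken from the lin rules.
evenWords :: Language -> (String, String)
evenWords English = ("is", "even")
evenWords French  = ("est", "pair")
evenWords German  = ("ist", "gerade")
evenWords Swedish = ("är", "jämnt")

-- Abstract tree to string.
linearize :: Language -> Prop -> String
linearize lang (Even (NInt n)) = unwords [show n, cop, adj]
  where
    (cop, adj) = evenWords lang

-- String to abstract trees (empty list if the input is not covered).
parse :: Language -> String -> [Prop]
parse lang s = case words s of
  [k, cop, adj]
    | (cop, adj) == evenWords lang, Just n <- readMaybe k -> [Even (NInt n)]
  _ -> []

-- Translation uses the abstract syntax as interlingua.
translate :: Language -> Language -> String -> [String]
translate from to = map (linearize to) . parse from
```

With four languages this gives all twelve translation directions from four sets of rules, since no language pair is treated separately.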
This idea is used e.g. in the WebALT project to generate mathematical teaching material in 7 languages.
But is it really so simple?
The previous multilingual grammar breaks these rules in many situations:
2 and 3 is even
la somme de 3 et de 5 est pair
wenn 2 ist gerade, dann 2+2 ist gerade
om x är jämnt, summan av x och 2 är jämnt
All these sentences are grammatically incorrect.
GF can express the linguistic rules that are needed to produce correct translations:
In addition to strings, we use parameters, tables, and record types. For instance, French:
param Mod = Ind | Subj ;
param Gen = Masc | Fem ;
lincat Nat = {s : Str ; g : Gen} ;
lincat Prop = {s : Mod => Str} ;
lin Even x = {s = table {
  m => x.s ++
       case m of {Ind => "est" ; Subj => "soit"} ++
       case x.g of {Masc => "pair" ; Fem => "paire"}
  }
} ;
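For readers more familiar with Haskell than with GF, the French rule can be mirrored roughly as follows: parameters become enumerations, the record type of Nat becomes a record, and the table over Mod becomes a function. The field names natS, natG and propS and the two example numbers are assumptions of this sketch.

```haskell
-- param Mod = Ind | Subj ; param Gen = Masc | Fem
data Mod = Ind | Subj
  deriving (Show, Eq)

data Gen = Masc | Fem
  deriving (Show, Eq)

-- lincat Nat = {s : Str ; g : Gen}: a string with an inherent gender.
data Nat = Nat { natS :: String, natG :: Gen }

-- lincat Prop = {s : Mod => Str}: a table over mood, here a function.
newtype Prop = Prop { propS :: Mod -> String }

-- lin Even: the copula varies with the mood, the adjective agrees
-- with the gender of the argument.
linEven :: Nat -> Prop
linEven x = Prop $ \m -> unwords [natS x, copula m, adjective (natG x)]
  where
    copula Ind  = "est"
    copula Subj = "soit"
    adjective Masc = "pair"
    adjective Fem  = "paire"

-- Example arguments (hypothetical): a numeral and a feminine noun phrase.
four, theSum :: Nat
four   = Nat "4" Masc
theSum = Nat "la somme de 3 et de 5" Fem
```

Here propS (linEven theSum) Ind gives "la somme de 3 et de 5 est paire", the corrected form of the French sentence listed among the incorrect examples.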
Linguistic knowledge dominates in the size of this grammar.
Application grammar ("semantic grammar")
Resource grammar ("syntactic grammar")
The expressive power is between TAG and HPSG.
The language is more high-level: a modern, typed functional programming language.
It enables linguistic generalizations and abstractions.
But we don't want to bother application grammarians with these details.
We have built a module system that can hide details.
Assume the following API
cat S ; NP ; A ;
fun predA : A -> NP -> S ;
oper regA : Str -> A ;
Now implement Even for four languages
lincat Prop = S ; Nat = NP ;
lin Even = predA (regA "even") ; -- English
    Even = predA (regA "jämn") ; -- Swedish
    Even = predA (regA "pair") ; -- French
    Even = predA (regA "gerade") ; -- German
Notice: the choice of adjective is domain expert knowledge.
What should there be in the library?
How do we organize and present the library?
Where to get the data from?
Extra constraint: we want open-source free software and hence cannot use existing proprietary resources.
Coverage, for each language:
Organization:
Presentation:
gfdoc for generating HTML from grammars
Where do we get the data from?
The resource grammar library is entirely open-source free software (under the GNU GPL).
Grammatical correctness of everything generated.
Semantic coverage: you can express whatever you want.
Usability as library for non-linguists.
Evaluation: tested in third-party projects.
Tools for regression testing (treebank generation and comparison)
Language coverage:
Semantic correctness:
colourless green ideas sleep furiously
the time is seventy past forty-two
Linguistic innovation in syntax:
Application grammars use domain-specific semantics to guarantee semantic well-formedness.
GF incorporates a Logical Framework and can express
Language-independent API is a rough semantic model.
But we do not try to give semantics once and for all for the whole language.
Grammar composition: any grammar can serve as resource to another one.
No fixed set of representation levels; here are some examples for
2 is even
2 är jämnt
In Arithm
Even 2
In Predication (high level resource API)
predA (IntNP 2) (regA "even")
predA (IntNP 2) (regA "jämn")
In Lang (ground level resource API)
UseCl TPres ASimul PPos (PredVP (UsePN (IntPN 2)) (UseComp (CompAP (PositA (regA "even")))))
UseCl TPres ASimul PPos (PredVP (UsePN (IntPN 2)) (UseComp (CompAP (PositA (regA "jämn")))))
The current GF Resource Project covers ten languages:
Danish
English
Finnish
French
German
Italian
Norwegian (bokmål)
Russian
Spanish
Swedish
Implementation of API v 1.0 projected for the end of February.
In addition, we have parts (morphology) of Arabic, Estonian, Latin, and Urdu
Cf. "matrix" in BLARK, LinGo
ParadigmsSwe
mkN : (man,mannen,män,männen : Str) -> N ; -- worst-case nouns
regV : (leker : Str) -> V ; -- regular verbs
IrregSwe
angripa_V = irregV "angripa" "angrep" "angripit" ;
ExtNor
PostPoss : CN -> Pron -> NP ; -- bilen min
English: negation and auxiliary vs. non-auxiliary verbs
Finnish: object case
German: double infinitives
Romance: clitic pronouns
Scandinavian: determiners
In particular: how to make the grammars efficient
For the ten languages we have considered, it is possible to implement the current API.
Reservations:
Simplest case: use the API in the same way for all languages.
In practice: use the API in different ways for different languages
-- Eng: x's name is y
Name x y = predNP (GenCN x (regN "name")) (StringNP y)
-- Swe: x heter y
Name x y = predV2 x heta_V2 (StringNP y)
This amounts to compile-time transfer.
Surprisingly, writing an application grammar requires more native-speaker knowledge than writing a resource grammar!
We can go even further than sharing an abstract API: we can share implementations among related languages.
Exploited in two families:
The declarations of Scandinavian syntax differences
We cannot anticipate all vocabulary needed in application grammars.
Therefore we provide high-level paradigms to add new words.
Example heuristic, from ParadigmsSwe:
regV : (leker : Str) -> V ;
regV leker = case leker of {
  lek + ("a" | "ar") => conj1 (lek + "a") ;
  lek + "er" => conj2 (lek + "a") ;
  bo + "r" => conj3 bo
}
decl2Noun : Str -> N = \bil ->
  let
    bb : Str * Str = case bil of {
      pojk + "e" => <pojk + "ar", bil + "n"> ;
      nyck + "e" + l@("l" | "r") => <nyck + l + "ar",bil + "n"> ;
      sock + "e" + "n" => <sock + "nar", sock + "nen"> ;
      _ => <bil + "ar", bil + "en">
    } ;
  in mkN bil bb.p2 bb.p1 (bb.p1 + "na") ;
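The suffix matching in the regV heuristic can be imitated in Haskell with an explicit suffix test. This sketch does not build a verb. It only reports which conjugation the heuristic would pick and the string passed to it, and it returns Nothing where no pattern matches. The Conjugation type and stripSuffix helper are assumptions of the example.

```haskell
import Data.List (stripPrefix)

-- Which of the paradigms conj1, conj2, conj3 the heuristic selects.
data Conjugation = Conj1 | Conj2 | Conj3
  deriving (Show, Eq)

-- Data.List has no stripSuffix, so it is defined via stripPrefix.
stripSuffix :: String -> String -> Maybe String
stripSuffix suf s = reverse <$> stripPrefix (reverse suf) (reverse s)

-- Guess the conjugation from one verb form; cases are tried in order,
-- as in the GF case expression.
regV :: String -> Maybe (Conjugation, String)
regV leker
  | Just lek <- stripSuffix "ar" leker = Just (Conj1, lek ++ "a")
  | Just lek <- stripSuffix "a"  leker = Just (Conj1, lek ++ "a")
  | Just lek <- stripSuffix "er" leker = Just (Conj2, lek ++ "a")
  | Just bo  <- stripSuffix "r"  leker = Just (Conj3, bo)
  | otherwise                          = Nothing
```

The order of the cases matters: a form ending in "ar" or "er" also ends in "r", so the more specific suffixes have to be tried first.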
-printer=lbnf            BNF Converter, thereby C/Bison, Java/JavaCup
-printer=fullform        full-form lexicon, short format
-printer=xml             XML: DTD for the pg command, object for st
-printer=gsl             Nuance GSL speech recognition grammar
-printer=jsgf            Java Speech Grammar Format
-printer=srgs_xml        SRGS XML format
-printer=srgs_xml_prob   SRGS XML format, with weights
-printer=slf             a finite automaton in the HTK SLF format
-printer=regular         a regular grammar in a simple BNF
-printer=gfc-prolog      gfc in prolog format (also pg)
Haskell, Java, Prolog
Parsing, generation, translation
Push-button creation of spoken language translators (using Nuance)
Can we use the libraries outside domain-specific fragments?
We seem to be approaching full coverage from below.
The resource API is not good for heavy-duty parsing (too abstract and therefore too inefficient).
Two ideas:
The most general format is multilingual treebank generation:
> gr -tr | l -multi
UseCl TCond AAnter PNeg (PredVP (DetCN (DetSg DefSg NoOrd) (AdjCN (PositA young_A) (UseN woman_N))) (ComplV2 love_V2 (UsePron he_Pron)))
The young woman wouldn't have loved him
Den unga kvinnan skulle inte ha älskat honom
Den unge kvinna ville ikke ha elska ham
La joven mujer no lo habría amado
La giovane donna non lo avrebbe amato
La jeune femme ne l' aurait pas aimé
Nuori nainen ei olisi rakastanut häntä
This is either exhaustive or random, possibly with probability weights attached to constructors.
A special case is corpus generation: just keep one language.
Can this be useful? Cf. Rebecca Jonson this afternoon.
CLE = Core Language Engine
LinGo Matrix project (HPSG)
Parsing detached from grammar (Nivre) - grammar detached from parsing
Stoneage grammar, based on the Swadesh word list.
Implemented as application on top of the resource grammar.
Illustrates generation and spoken-language parsing.