The MULTEXT-East multilingual language resources

Tomaž Erjavec
Department of Knowledge Technologies
Jožef Stefan Institute, Ljubljana
tomaz.erjavec@ijs.si, http://nl.ijs.si/et/
Overview
1. Introduction to Language Resources
2. MULTEXT-East: morphosyntactic resources for East-European languages
3. A tour of Slovene language resources
4. Conclusions
Graz Uni
January 27 2006
Tomaž Erjavec
Dept. of Knowledge Technologies, Jožef Stefan Institute
Introduction to Language Resources

LRs comprise two types of data:
– corpora: mono- or multilingual, reference or
specialised, …, /variously annotated/
– lexica: vocabularies, morphosyntactic, syntactic,
semantic (ontologies)

LRs, esp. corpora, are used for empirical language research:
– linguistic research:
(annotated) corpus + (sophisticated) search engine
– human language technology R&D:
testing and training dataset
Characteristics of LRs

Separate development for each language

Costly to produce, so should be widely available, but:
great variation in availability between languages
– “monopoly protection”
– problems of copyright
– lack of encoding standardisation

Good side:
– text is becoming increasingly easy to acquire (WWW)
– un- & semi-supervised ML methods give increasingly good results

Ideal:
lots of different, large, high-quality, standardised, freely available, and supported LRs for all languages, multilingual and multimodal
History of LRs


70s: Chomskyan paradigm – no LRs
85-95: renaissance of empiricism (LR-based):
– became accepted in academic circles:
corpus linguistics / (statistical) machine learning
– advances in standardisation: TEI, EAGLES
– large EU funded HLT/LR projects: EAGLES, MULTEXT,…
– EU Copernicus (1995,’97): MULTEXT-East, TELRI,…
– LR brokers: LDC (1992), ELRA (1995)

95-05: established field ~ old hat
– LREC: biennial conferences (1998-), LRE journal (2005)
– XML based standards: TEI, ISO, W3C
– national initiatives
– no more EU funding for LR collection or HLT R&D
– EU funding for component multimodal / multilingual technologies, standardisation and research infrastructures
MULTEXT-East resources

MULTEXT-East: Copernicus Joint Project COP 106 (1995-1997), Multilingual Texts and Corpora for Eastern and Central European Languages

Based on the results of EU MULTEXT (~West)
To produce a harmonised BLARK (Basic Language Resource Kit) for six languages:
– corpus encoding standardisation (TEI / CES)
– multilingual parallel, comparable, speech corpora
– morphosyntactic specifications (EAGLES / MULTEXT)
– (inflectional) lexicon
– annotated corpus
– language processing tools
History of MULTEXT-East resources

First release 1998 on TELRI CD-ROM Vol II: already extended with new languages
Resources since 1998 available on the Web: http://nl.ijs.si/ME/
Second release 2002 in scope of EU CONCEDE: re-encoding in XML/TEI, harmonisation
Third release 2004: merge of first two releases, further languages
Work (indirectly) supported by: TELRI, CONCEDE, NSF grant, bi-lateral projects
The Languages of MULTEXT-East




Germanic: English
Romance: Romanian
Baltic:
– Latvian
– Lithuanian
Finno-Ugric:
– Estonian
– Hungarian
Slavic:
– Russian (East Slavic)
– Czech (West Slavic)
– Slovene (South West Slavic)
– Resian (Slovene dialect)
– Croatian (South West Slavic)
– Serbian (South West Slavic)
– Bulgarian (South East Slavic)
In progress:
– Macedonian
– Persian
Version 3



Available on http://nl.ijs.si/ME/V3/
Some parts completely free, others free for research under a licence
The web pages give:
– extensive documentation
– bibliography list
– web licence form
– resources
The MULTEXT morphosyntactic trinity
1. MULTEXT-East morphosyntactic specifications
2. MULTEXT-East morphosyntactic lexica
3. MULTEXT-East morphosyntactically annotated "1984" corpus
1. Morphosyntactic specifications



Based on EAGLES / MULTEXT
Define PoS, their attributes and values
The specs are a document containing:
– introduction
– common tables
– language particular sections


Written in LaTeX, converted to PDF & HTML
Derived XML/TEI encoding as feature structures
Example common table
Example language-specific table
Complexity
2. The lexica




Medium-size morphosyntactic lexica
Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian.
~ all word-forms of approx. 15,000 lemmas
A lexical entry is composed of three fields:
– the word-form: the inflected form of the word
– the lemma: the base-form of the word
– the morphosyntactic description (MSD)
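
The three-field entry can be read with a few lines of Python. This is only a minimal sketch: it assumes one entry per line with tab-separated fields, and that `=` in the lemma field means the lemma is identical to the word-form (as the Slovene example suggests). The names `LexEntry`, `parse_entry` and `index_by_form` are illustrative and not part of the MULTEXT-East distribution.

```python
from collections import defaultdict
from typing import NamedTuple


class LexEntry(NamedTuple):
    form: str   # inflected word-form
    lemma: str  # base form
    msd: str    # morphosyntactic description


def parse_entry(line: str) -> LexEntry:
    """Parse one 'word-form<TAB>lemma<TAB>MSD' line (assumed layout)."""
    form, lemma, msd = line.rstrip("\n").split("\t")
    # Assumption: '=' abbreviates "lemma is the same as the word-form".
    if lemma == "=":
        lemma = form
    return LexEntry(form, lemma, msd)


def index_by_form(lines):
    """Map each word-form to all (lemma, MSD) readings it can have."""
    index = defaultdict(list)
    for line in lines:
        if line.strip():
            entry = parse_entry(line)
            index[entry.form].append((entry.lemma, entry.msd))
    return dict(index)


sample = [
    "abeceda\t=\tNcfsn",
    "abecede\tabeceda\tNcfpa",
    "abecede\tabeceda\tNcfpn",
    "abecede\tabeceda\tNcfsg",
]
readings = index_by_form(sample)
print(readings["abecede"])
```

Indexing by word-form makes the ambiguity visible: a form such as "abecede" has several possible MSDs, which is why the corpus annotation has to be disambiguated in context.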
Example: Slovene lexicon
word-form   lemma     MSD
abeced      abeceda   Ncfdg
abeced      abeceda   Ncfpg
abeceda     =         Ncfsn
abecedah    abeceda   Ncfdl
abecedah    abeceda   Ncfpl
abecedam    abeceda   Ncfpd
abecedama   abeceda   Ncfdd
abecedama   abeceda   Ncfdi
abecedami   abeceda   Ncfpi
abecede     abeceda   Ncfpa
abecede     abeceda   Ncfpn
abecede     abeceda   Ncfsg
abecedi     abeceda   Ncfda
abecedi     abeceda   Ncfdn
…
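
An MSD is a positional code: the first character gives the PoS and each later position holds the value of one attribute. The sketch below decodes noun MSDs like those in this example. The attribute order and value codes in `NOUN_ATTRS` are an assumed, partial reading of the Slovene noun table, so they should be checked against the specifications before being relied on.

```python
# Assumed, partial attribute table for Slovene nouns (illustrative only).
NOUN_ATTRS = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_msd(msd: str) -> dict:
    """Expand a noun MSD such as 'Ncfsn' into attribute-value pairs."""
    if not msd.startswith("N"):
        raise ValueError(f"not a noun MSD: {msd!r}")
    features = {"PoS": "Noun"}
    for code, (attr, values) in zip(msd[1:], NOUN_ATTRS):
        if code == "-":  # '-' marks an attribute that does not apply
            continue
        features[attr] = values.get(code, code)  # keep unknown codes as-is
    return features


print(decode_noun_msd("Ncfsn"))
```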
Lexicon sizes
The specification as TEI FS
<fLib type="Noun">
<f id="N0." select="en ro sl cs bg et hu hr sr sl-rozaj" name="PoS">
<sym value="Noun" />
</f>
<f id="N1.c" select="en ro sl cs bg et hu hr sr sl-rozaj" name="Type">
<sym value="common" />
</f>
<f id="N1.p" select="en ro sl cs bg et hu hr sr sl-rozaj" name="Type">
<sym value="proper" />
</f>
…
<fsLib type="Noun">
<fs id="Nc" select="en et sr" feats="N0. N1.c" />
<fs id="Nc---n" select="ro" feats="N0. N1.c N5.n" />
<fs id="Nc--g" select="sr" feats="N0. N1.c N4.g" />
<fs id="Nc-p" select="cs en" feats="N0. N1.c N3.p" />
<fs id="Nc-p1" select="et" feats="N0. N1.c N3.p N4.1" />
…
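
The `id` of each `<fs>` appears to be derivable from its `feats`: a feature id such as `N4.g` names the category (`N`), the attribute position (`4`) and the value code (`g`), and unused positions are filled with `-`. The following sketch checks that reading against the entries shown above. The function name `feats_to_msd` is hypothetical, and the rule is inferred from this fragment rather than quoted from the specifications.

```python
import re

# Feature id = category letter, attribute position, '.', value code.
FEAT_ID = re.compile(r"^([A-Z])(\d+)\.(.*)$")


def feats_to_msd(feats: str) -> str:
    """Build an MSD string from a space-separated list of feature ids."""
    category = None
    codes = {}  # attribute position -> value code
    for feat in feats.split():
        match = FEAT_ID.match(feat)
        if match is None:
            raise ValueError(f"unrecognised feature id: {feat!r}")
        cat, pos, code = match.group(1), int(match.group(2)), match.group(3)
        if category is None:
            category = cat
        elif cat != category:
            raise ValueError("feature ids mix several categories")
        if pos > 0:  # position 0 is the PoS itself and carries no code
            codes[pos] = code
    if category is None:
        raise ValueError("empty feature list")
    last = max(codes, default=0)
    return category + "".join(codes.get(i, "-") for i in range(1, last + 1))


# Entries copied from the fsLib fragment: (id, feats)
examples = [
    ("Nc", "N0. N1.c"),
    ("Nc---n", "N0. N1.c N5.n"),
    ("Nc--g", "N0. N1.c N4.g"),
    ("Nc-p", "N0. N1.c N3.p"),
    ("Nc-p1", "N0. N1.c N3.p N4.1"),
]
for msd, feats in examples:
    print(msd, feats_to_msd(feats) == msd)
```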
3. The “1984” corpus





Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr,…))
Structurally annotated
Sentence aligned with English
Words annotated with lemma and MSD
Encoded in TEI P4 (XML)
Example linguistic encoding
Context-disambiguated lemmas and MSDs:
<text id="Osl." lang="sl">
<body>
<div type="part" id="Osl.1">
<div type="chapter" id="Osl.1.2">
<p id="Osl.1.2.2">
<s id="Osl.1.2.2.1">
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
<w lemma="aprilski" ana="Aopmsn">aprilski</w>
<w lemma="dan" ana="Ncmsn">dan</w>
<w lemma="in" ana="Ccs">in</w>
<w lemma="ura" ana="Ncfpn">ure</w>
<w lemma="biti" ana="Vcip3p--n">so</w>
<w lemma="biti" ana="Vmps-pfa">bile</w>
<w lemma="trinajst" ana="Mcnpnl">trinajst</w>
<c>.</c>
</s>
…
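
Because the annotation is plain XML, the tokens can be read with any XML parser. Below is a minimal sketch using Python's standard `xml.etree.ElementTree`, run on a self-contained, shortened copy of the sentence above. The real corpus files have further enclosing elements and a header, which this sketch does not handle, and `read_tokens` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the example sentence.
SENTENCE = """
<s id="Osl.1.2.2.1">
  <w lemma="biti" ana="Vcps-sma">Bil</w>
  <w lemma="biti" ana="Vcip3s--n">je</w>
  <w lemma="jasen" ana="Afpmsnn">jasen</w>
  <c>,</c>
  <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
  <w lemma="aprilski" ana="Aopmsn">aprilski</w>
  <w lemma="dan" ana="Ncmsn">dan</w>
</s>
"""


def read_tokens(xml_text: str):
    """Return (token, lemma, MSD) triples; punctuation gets None for both."""
    sentence = ET.fromstring(xml_text.strip())
    tokens = []
    for element in sentence:
        if element.tag == "w":    # word with lemma and MSD attributes
            tokens.append((element.text, element.get("lemma"), element.get("ana")))
        elif element.tag == "c":  # punctuation character, no annotation
            tokens.append((element.text, None, None))
    return tokens


for token, lemma, msd in read_tokens(SENTENCE):
    print(token, lemma, msd, sep="\t")
```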
Quantifying the corpus
Utility of MULTEXT-East LRs



Specifications became, for some, the “national” standard
Training/testing dataset for HLT development:
PoS taggers, lemmatizers, lexicon extractors, ILP
A base dataset for further annotation and experiments:
– Word-sense disambiguation
– WordNet development and evaluation
– Syntactic parser induction



Teaching aid in HLT courses
~ 100 registered users
As a BLARK “best practice” for new languages:
Resian, Croatian, Macedonian, Persian
LRs @ JSI
Also ours: VAYNA, GORE, sloWNet
Contributors to: FIDA, DSI, FDV, JRC-ACQUIS
Contractors for: Inxight
Nice try: EU CULTACT
JSI know-how in corpus compilation
1. Encoding standardisation: XML, TEI, ISO
2. Up-conversion: character set, structure, meta-data
3. Linguistic annotation: token, lemma, MSD, alignment
4. Distribution via nl.ijs.si: concordancing, browsing, download
& teaching in these areas: ESSLLI, JSIPS, FF, NG
Slovene LRs @ SDJT
Conclusions




Introduced language resources, MULTEXT-East
and Slovene LRs
Useful basis for empirical studies of the
(Slovene) language
Of course, more resources are needed, but we
are working on it:
SDT, sloWNet, jaSlo, ACQUIS, MULTEXT-East
Further collaborations welcome…
Thank you!