The MULTEXT-East multilingual language resources

Tomaž Erjavec
Department of Knowledge Technologies
Jožef Stefan Institute, Ljubljana
tomaz.erjavec@ijs.si, http://nl.ijs.si/et/

Overview

1. Introduction to Language Resources
2. MULTEXT-East: morphosyntactic resources for East-European languages
3. A tour of Slovene language resources
4. Conclusions

Graz Uni January 27 2006 Tomaž Erjavec Dept. of Knowledge Technologies, Jozef Stefan Institute

Introduction to Language Resources

Language resources (LRs) comprise two types of data:
– corpora: mono- or multilingual, reference or specialised, …, variously annotated
– lexica: vocabularies, morphosyntactic, syntactic, semantic (ontologies)
LRs, esp. corpora, are used for empirical language research:
– linguistic research: (annotated) corpus + (sophisticated) search engine
– human language technology (HLT) R&D: testing and training datasets

Characteristics of LRs

Separate development for each language
Costly to produce, so should be widely available, but there is great variation in availability between languages:
– “monopoly protection”
– problems of copyright
– lack of encoding standardisation
Good side:
– text is becoming increasingly easy to acquire (WWW)
– un- & semi-supervised ML methods give increasingly good results
Ideal: lots of different, large, high-quality, standardised, freely available, and supported LRs for all languages, multilingual and multimodal

History of LRs

70s: Chomskyan paradigm, no LRs
85-95: renaissance of empiricism (LR-based):
– became accepted in academic circles: corpus linguistics / (statistical) machine learning
– advances in standardisation: TEI, EAGLES
– large EU-funded HLT/LR projects: EAGLES, MULTEXT, …
– EU Copernicus (1995, ’97): MULTEXT-East, TELRI, …
– LR brokers: LDC (1992), ELRA (1995)
95-05: established field ~ old hat
– LREC: biennial conferences (1998-), LRE journal (2005)
– XML-based standards: TEI, ISO, W3C
– national initiatives
– no more EU funding for LR collection or HLT R&D
– EU funding for component multimodal / multilingual technologies, standardisation and research infrastructures

MULTEXT-East resources

MULTEXT-East: Copernicus Joint Project COP 106 (1995-1997), “Multilingual Texts and Corpora for Eastern and Central European Languages”
Based on the results of EU MULTEXT (~West)
Goal: to produce a harmonised BLARK (Basic Language Resource Kit) for six languages:
– corpus encoding standardisation (TEI / CES)
– multilingual parallel, comparable, speech corpora
– morphosyntactic specifications (EAGLES / MULTEXT)
– (inflectional) lexicon
– annotated corpus
– language processing tools

History of MULTEXT-East resources

First release 1998 on TELRI CD-ROM Vol II: already extended with new languages
Resources available on the Web since 1998: http://nl.ijs.si/ME/
Second release 2002 in the scope of EU CONCEDE: re-encoding in XML/TEI, harmonisation
Third release 2004: merge of the first two releases, further languages
Work (indirectly) supported by: TELRI, CONCEDE, NSF grant, bi-lateral projects

The Languages of MULTEXT-East

Germanic: English
Romance: Romanian
Baltic:
– Latvian
– Lithuanian
Finno-Ugric:
– Estonian
– Hungarian
Slavic:
– Russian (East Slavic)
– Czech (West Slavic)
– Slovene (South West Slavic)
– Resian (Slovene dialect)
– Croatian (South West Slavic)
– Serbian (South West Slavic)
– Bulgarian (South East Slavic)
In progress:
– Macedonian
– Persian

Version 3

Available at http://nl.ijs.si/ME/V3/
Some parts are completely free, others are free under a research licence
The web pages give:
– extensive documentation
– bibliography list
– web licence form
– resources

The MULTEXT morphosyntactic trinity

1. MULTEXT-East morphosyntactic specifications
2. MULTEXT-East morphosyntactic lexica
3. MULTEXT-East morphosyntactically annotated "1984" corpus

1. Morphosyntactic specifications

Based on EAGLES / MULTEXT
Define the parts of speech (PoS), their attributes and values
The specs are a document containing:
– introduction
– common tables
– language-particular sections
Written in LaTeX, rendered to PDF & HTML
Derived XML/TEI encoding as feature structures

Example common table
[slide content not preserved]

Example language-specific table
[slide content not preserved]

Complexity
[slide content not preserved]

2. The lexica

Medium-size morphosyntactic lexica
Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian
Roughly all word-forms of about 15,000 lemmas
A lexical entry is composed of three fields:
– the word-form: the inflected form of the word
– the lemma: the base form of the word
– the morphosyntactic description (MSD)

Example: Slovene lexicon

word-form   lemma     MSD
abeced      abeceda   Ncfdg
abeced      abeceda   Ncfpg
abeceda     =         Ncfsn
abecedah    abeceda   Ncfdl
abecedah    abeceda   Ncfpl
abecedam    abeceda   Ncfpd
abecedama   abeceda   Ncfdd
abecedama   abeceda   Ncfdi
abecedami   abeceda   Ncfpi
abecede     abeceda   Ncfpa
abecede     abeceda   Ncfpn
abecede     abeceda   Ncfsg
abecedi     abeceda   Ncfda
abecedi     abeceda   Ncfdn
…

Lexicon sizes
[slide content not preserved]

The specification as TEI feature structures (FS)

<fLib type="Noun">
  <f id="N0." select="en ro sl cs bg et hu hr sr sl-rozaj" name="PoS"> <sym value="Noun" /> </f>
  <f id="N1.c" select="en ro sl cs bg et hu hr sr sl-rozaj" name="Type"> <sym value="common" /> </f>
  <f id="N1.p" select="en ro sl cs bg et hu hr sr sl-rozaj" name="Type"> <sym value="proper" /> </f>
  …
<fsLib type="Noun">
  <fs id="Nc" select="en et sr" feats="N0. N1.c" />
  <fs id="Nc---n" select="ro" feats="N0. N1.c N5.n" />
  <fs id="Nc--g" select="sr" feats="N0. N1.c N4.g" />
  <fs id="Nc-p" select="cs en" feats="N0. N1.c N3.p" />
  <fs id="Nc-p1" select="et" feats="N0. N1.c N3.p N4.1" />
  …

3. The “1984” corpus

Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr,…))
Structurally annotated
Sentence-aligned with English
Words annotated with lemma and MSD
Encoded in TEI P4 (XML)

Example linguistic encoding

Context-disambiguated lemmas and MSDs:

<text id="Osl." lang="sl">
 <body>
  <div type="part" id="Osl.1">
   <div type="chapter" id="Osl.1.2">
    <p id="Osl.1.2.2">
     <s id="Osl.1.2.2.1">
      <w lemma="biti" ana="Vcps-sma">Bil</w>
      <w lemma="biti" ana="Vcip3s--n">je</w>
      <w lemma="jasen" ana="Afpmsnn">jasen</w>
      <c>,</c>
      <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
      <w lemma="aprilski" ana="Aopmsn">aprilski</w>
      <w lemma="dan" ana="Ncmsn">dan</w>
      <w lemma="in" ana="Ccs">in</w>
      <w lemma="ura" ana="Ncfpn">ure</w>
      <w lemma="biti" ana="Vcip3p--n">so</w>
      <w lemma="biti" ana="Vmps-pfa">bile</w>
      <w lemma="trinajst" ana="Mcnpnl">trinajst</w>
      <c>.</c>
     </s>
     …

Quantifying the corpus
[slide content not preserved]

Utility of MULTEXT-East LRs

Specifications became, for some, the “national” standard
Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
A base dataset for further annotation and experiments:
– word-sense disambiguation
– WordNet development and evaluation
– syntactic parser induction
Teaching aid in HLT courses
~ 100 registered users
As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian

LRs @ JSI

Also ours: VAYNA, GORE, sloWNet
Contributors to: FIDA, DSI, FDV, JRC-ACQUIS
Contractors for: Inxight
Nice try: EU CULTACT

JSI know-how in corpus compilation

1. Encoding standardisation: XML, TEI, ISO
2. Up-conversion: character set, structure, meta-data
3. Linguistic annotation: token, lemma, MSD, alignment
4. Distribution via nl.ijs.si: concordancing, browsing, download
& teaching in these areas: ESSLLI, JSIPS, FF, NG

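To make the token / lemma / MSD annotation concrete, here is a minimal Python sketch. It is not part of the MULTEXT-East distribution. It reads a three-field lexicon entry, expands a noun MSD position by position, and pulls (word-form, lemma, MSD) triples out of a TEI-encoded sentence. The function names and the small noun attribute table are illustrative assumptions and should be checked against the morphosyntactic specifications. The "=" in the lemma column is read here as "lemma identical to the word-form". The sample sentence is an abbreviated copy of the "1984" excerpt.

```python
import xml.etree.ElementTree as ET

# Assumed, partial attribute table for nouns: string position 1..4 of the MSD.
# Illustrative only; the authoritative tables are in the specifications.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_msd(msd):
    """Expand a noun MSD such as 'Ncfsn' into attribute-value pairs."""
    if not msd.startswith("N"):
        raise ValueError(f"not a noun MSD: {msd!r}")
    features = {"PoS": "Noun"}
    # Positions beyond the table above are ignored in this sketch.
    for code, (name, values) in zip(msd[1:], NOUN_ATTRIBUTES):
        if code != "-":  # '-' marks an attribute that is not applicable
            features[name] = values.get(code, code)  # keep unknown codes as-is
    return features


def parse_lexicon_line(line):
    """Split a 'word-form lemma MSD' entry; '=' stands for the word-form."""
    form, lemma, msd = line.split()
    return form, (form if lemma == "=" else lemma), msd


def tokens(sentence_xml):
    """Yield (word-form, lemma, MSD) for each <w> element; <c> is skipped."""
    for w in ET.fromstring(sentence_xml).iter("w"):
        yield w.text, w.get("lemma"), w.get("ana")


# Abbreviated copy of the annotated sentence from the corpus example.
SENTENCE = """<s id="Osl.1.2.2.1">
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
<w lemma="aprilski" ana="Aopmsn">aprilski</w>
<w lemma="dan" ana="Ncmsn">dan</w>
</s>"""

if __name__ == "__main__":
    print(parse_lexicon_line("abeceda = Ncfsn"))
    for form, lemma, msd in tokens(SENTENCE):
        if msd.startswith("N"):
            print(form, lemma, decode_noun_msd(msd))
```

The decoder relies on the MSDs being positional: the character at string position k encodes attribute k, which matches the feature identifiers in the TEI specification (e.g. "Nc-p" pairs with N0., N1.c and N3.p). A fuller version would load the attribute tables from the fLib / fsLib libraries rather than hard-coding them.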
Slovene LRs @ SDJT
[slide content not preserved]

Conclusions

Introduced language resources, MULTEXT-East and Slovene LRs
Useful basis for empirical studies of the (Slovene) language
Of course, more resources are needed, but we are working on it: SDT, sloWNet, jaSlo, ACQUIS, MULTEXT-East
Further collaborations welcome…

Thank you!