UNIVERSITÀ DEGLI STUDI DI MACERATA
Dipartimento di Studi Umanistici – Lingue, Mediazione, Storia, Lettere, Filosofia
Corso di Laurea Magistrale in Lingue Moderne per la Comunicazione e la Cooperazione Internazionale (Classe LM-38)
TPCI English - Module B: Tools and technologies for specialised translation - academic year 2016/2017

PART 2: Corpora and Translation
Sara Castagnoli
sara.castagnoli@unimc.it

What is a corpus? Some (authoritative) definitions
• “a collection of naturally-occurring language text, chosen to characterize a state or variety of a language” (Sinclair, 1991:171)
• “a collection of texts assumed to be representative of a given language, dialect, or other subset of a language, to be used for linguistic analysis” (Francis, 1992:7)
• “a closed set of texts in machine-readable form established for general or specific purposes by previously defined criteria” (Engwall, 1992:167)
• “a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration” (McEnery & Wilson, 1996:23)
• “a collection of (1) machine-readable (2) authentic texts […] which is (3) sampled to be (4) representative of a particular language or language variety” (McEnery et al., 2006:5)

What is / is not a corpus…?
• A newspaper archive on CD-ROM?
• An online glossary?
• A digital library (e.g. Project Gutenberg)?
• All RAI 1 programmes (e.g. for spoken TV language)?
The answer is always “NO” (see definition)

Corpora vs. web
• Corpora:
  – Usually stable
    • searches can be replicated
  – Control over contents
    • we can select the texts to be included, or have control over selection strategies
  – Ad-hoc linguistically-aware software to investigate them
    • concordancers can sort / organise concordance lines
• Web (as accessed via Google or other search engines):
  – Very unstable
    • results can change at any time for reasons beyond our control
  – No control over contents
    • what/how many texts are indexed by Google’s robots?
  – Limited control over search results
    • cannot sort or organise hits meaningfully; their order follows the search engine's own ranking, not linguistic criteria

What types of corpora exist? A brief overview
• A corpus is a principled collection of naturally occurring electronic texts designed to be a representative sample of language in actual use
• Some of the main features and criteria used to describe and classify corpora:
  – general / specialised
  – written / spoken (transcribed) / multimodal (audio/video)
  – balanced (sample) / opportunistic
  – monolingual / bi- or multilingual (parallel, comparable)
  – closed, finite / open-ended (monitor)
  – raw (pre-corpus) / marked-up, POS-tagged, annotated (augmented)
  – synchronic / diachronic
  – static / dynamic

An example of planned balance: the British National Corpus
• 100 million words of contemporary spoken and written British English
• Representative of British English “as a whole”
• Designed to be appropriate for a variety of uses: lexicography, education, research, commercial applications (computational tools)
• Balanced with regard to genre, subject matter and style
• Sampling and representativeness are very difficult to ensure

BNC
NEW: Spoken BNC to be released in 2017!
• 4,124 texts: 90% written, 10% spoken
• Largest collection of spoken English ever collected (10 million words), but reflects the typical imbalance in favour of written text (for understandable practical reasons)
• Written portion: 75% informative, 25% imaginative

BNC written material
Sources:
• 60% books
• 25% periodicals
• 5% brochures and other ephemera
  – e.g. bus tickets, produce containers, junk mail
• 5% unpublished letters, essays, minutes
• 5% plays, speeches (written to be spoken)
Register levels:
• 30% literary or technical “high”
• 45% “middle”
• 25% informal “low”

BNC Subject coverage
• Planned to reflect the pattern of book publishing in the UK over the last 20 years

Subject             Number of texts   % of total written
Imaginative         625               22
World affairs       453               18
Social science      510               15
Leisure             374               11
Applied science     364               8
Commerce            284               8
Arts                259               8
Natural science     144               4
Belief & thought    146               3
Unclassified        50                3

BNC Spoken corpus: context-governed material
• Lectures, tutorials, classrooms
• News reports
• Product demonstrations, consultations, interviews
• Sermons, political speeches, public meetings, parliamentary debates
• Sports commentaries, phone-ins, chat shows
• Samples from 12 different regions

BNC Spoken corpus: ordinary conversation
• 2000 hrs from 124 volunteers, 38 different regions
• Four different socio-economic groupings
• Equal numbers of male and female speakers, age range 15 to 60+
• All conversations over a 2-day period recorded
• No secret recording; participants were allowed to erase recordings
• Systematic details kept of time, location, participants (sex, age, race, occupation, education, social group), topic, etc.
• Transcription issues:
  – include false starts, hesitations, etc.
  – some paralinguistic features (shouting, whispering)
  – use of dialect words/grammar
  – but no phonetic information

Dynamic (Monitor) vs static (Finite)
• A static corpus gives a snapshot of language use at a given time
  – Easier to control balance of content
  – May limit usefulness, esp. as time passes
• A dynamic corpus is ever-changing
  – Called a “monitor” corpus because it allows us to monitor language change over time

Key concepts and technical notions in corpus-based translation studies
• Wordlist, frequency list, keyword list
• Types, tokens, type/token ratio (lexical variation)
• Function/grammatical words vs. content/lexical words (lexical density)

“Type” and “token”
• “Token” means an individual occurrence of a word
• “Type” means a distinct word, counted once however many times it occurs
• The man saw the girl with the telescope
  – 8 tokens, 6 types
• “Type” may refer to a lexeme or to an individual word form
  – run, runs, ran, running: 1 or 4 types?

Key concepts and technical notions (continued)
• Concordance (concordancing software)
  – KWIC (keyword in context)
  – Nodeword
  – Sorting

[Figure: Concordance for nodeword “eyes” (sorted 1L), generated from the BNC]

Key concepts and technical notions (continued)
• Collocation (collocates)
• Lemmatisation (morphological analysis)
• (POS-)Tagging (grammatical analysis)
• Parsing (syntactic analysis)

A translation-relevant corpus typology
Corpora
• general / reference: monolingual (usually), (normally ready)
• specialised
  – monolingual
  – multilingual
    • parallel
    • comparable

General / reference monolingual corpora
• monolingual (usually), (normally ready), very big (>100M words)
• Examples:
  – British National Corpus (BrEng)
  – COCA (AmEng)
  – La Repubblica (?)
  – CORIS (Ita), WaCky corpora, PAISÀ
  – Leeds Internet corpora
  – Mannheim corpora

Example: BNC concordances for nodewords “eyes” and “eye”
[Figure: Concordance for nodeword “eyes” (sorted 1L), generated from the BNC]
[Figure: Concordance for nodeword “eye” (sorted 1L), generated from the BNC]

General / reference monolingual corpora (of English)
www.nature.com/nature/journal/v455/n7215/full/455835b.html
Last week, tens of thousands of researchers took to the streets to register their opposition to a proposed bill designed to control civil-service spending.

Took to the streets
• http://corpus.leeds.ac.uk/internet.html
  – English
• Let’s try to understand:
  – Meaning
  – Extended (sentential) co-text, preferential co-selections
  – Context(s) of use
  – Semantic preference
  – Semantic prosody

Using general / reference monolingual corpora (from/on the Web): Leeds Internet corpora
* http://corpus.leeds.ac.uk/internet.html
Let’s explore internal variation - Examples of (possible) useful queries
• Any other forms of the verb take? (colligational constraints)
• Plural/singular of the noun street? (colligational constraints)
• Other verbs? (collocational flexibility)
• Other nouns? (collocational flexibility)
• Select “CQP syntax only” * (automatic POS-tagging!)
  – http://cwb.sourceforge.net/files/CQP_Tutorial/
  – Look at the examples on the following slides for guidance and adapt those models to your searches
  – Try out a number of different options to familiarise yourself with the search syntax, and understand what kinds of searches it can support

Examples of (possible) useful queries
• Any other forms of the verb take? (colligational constraints)
  Plural/singular of the noun street? (colligational constraints)
• [lemma="take"] "to" "the" [lemma="street"]
  – Lemmatised search: finds all possible forms of verb and noun
• [pos="V.*"] "to" "the" [lemma="street"]
  – Lemmatised and POS-specific search: as above but finds all verbs
• [lemma="take"] "to" "the" [pos="N.*"]
  – Lemmatised and POS-specific search: as above but finds all nouns
• Click on the link to the left of the concordance line for context

Now the translation into Italian of “took to the streets”
• Verb?
  – andare
  – scendere
  – …?
• Preposition?
  – in
  – nella/nelle
  – per la/per le?
  – …?
• Noun?
  – strada/strade
  – piazza/piazze
  – …?
Which queries do we need? How many are necessary?

Now the translation into Italian of “take to the street”
• [pos="V.*"] [] [lemma="strada"]
• [pos="V.*"] [] [lemma="piazza"]
  – very general (slower)
  – NB: [] means ‘any word in that position’
• [pos="V.*"] [word="(in|nella|nelle)"] [lemma="strada"]
• [pos="V.*"] [word="(in|nella|nelle)"] [lemma="piazza"]
  – more specific/restrictive
  – NB: | is called ‘pipe’ and lists alternatives
• [lemma="scendere"] [word="(in|nella|nelle)"] [lemma="strada"]
• [lemma="andare"] [word="(in|nella|nelle)"] [lemma="piazza"]
  – very specific/restrictive

Last week, tens of thousands of researchers took to the streets to register their opposition to a proposed bill designed to control civil-service spending.

REGISTER ONE’S OPPOSITION
• Now search the BNC for this expression.
• What does it mean?
• Which “feelings” are usually “registered”?
  – interest
  – concern
  – support
  – dismay
  – frustrations
  – dissatisfaction
  – disapproval
  – protest
  – commitment
  – …

Monolingual general / reference corpora available online (at least partially, i.e. as demos)
• British National Corpus (BNC, British English)
  – www.natcorp.ox.ac.uk
• COCA (American English)
  – http://corpus.byu.edu/coca/
• The CORIS corpus (Italian)
  – http://corpora.dslo.unibo.it/coris_ita.html
• Leeds Internet corpora
  – English, Chinese, Arabic, French, German, Italian, Japanese, Polish, Portuguese, Russian, Spanish: http://corpus.leeds.ac.uk/internet.html
• Mannheim corpora (German)
  – http://corpora.ids-mannheim.de/ccdb
• Corpus del Español (Spanish)
  – www.corpusdelespanol.org
• CREA (Spanish)
  – http://corpus.rae.es/creanet.html

Explore the Web to see what other corpora are available!
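
Worked sketch: a KWIC concordance sorted 1L
The concordance notions introduced earlier (nodeword, KWIC display, sorting on the first word to the left) can be illustrated with a few lines of Python. This is only a rough sketch under stated assumptions: the sample sentences are invented (they are not BNC data), the function names are hypothetical, and the tokeniser simply discards punctuation, so context windows may run across sentence boundaries. Real concordancers handle all of this far more carefully.

```python
import re

def tokenize(text):
    """Lowercase word tokeniser; punctuation is discarded (a simplification)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def kwic(tokens, node, window=4):
    """Return (left context, nodeword, right context) for each hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append((left, tok, right))
    return lines

def sort_1l(lines):
    """Sort concordance lines on the first word to the left of the node (1L)."""
    return sorted(lines, key=lambda line: line[0][-1] if line[0] else "")

def format_line(line, width=30):
    """Right-align the left context so that the nodewords line up in a column."""
    left, node, right = line
    return f"{' '.join(left):>{width}}  {node}  {' '.join(right)}"

# Invented sample text, used only for illustration (not taken from the BNC).
sample = (
    "She closed her eyes and listened. His eyes were fixed on the road. "
    "The child rubbed her eyes. He kept his eyes shut."
)

concordance = sort_1l(kwic(tokenize(sample), "eyes"))
for line in concordance:
    print(format_line(line))
```

Sorting on 1L groups the lines by what immediately precedes the node (here the possessives "her" and "his"), which is how recurrent patterns to the left of a nodeword become visible at a glance.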
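
Worked sketch: types, tokens and type/token ratio
The type/token example from the key-concepts slides ("The man saw the girl with the telescope") can be checked with a short Python sketch. It assumes case-insensitive counting and a deliberately naive tokeniser; the small LEMMAS table is a hypothetical stand-in for a real lemmatiser and covers only the forms of "run" mentioned on the slide.

```python
import re
from collections import Counter

def tokenize(text):
    """Naive tokeniser: lowercase alphabetic strings only (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())

sentence = "The man saw the girl with the telescope"
tokens = tokenize(sentence)

# Frequency list: each type paired with its number of occurrences.
freq = Counter(tokens)

n_tokens = len(tokens)    # every running word counts
n_types = len(freq)       # each distinct word form counts once
ttr = n_types / n_tokens  # type/token ratio (lexical variation)

print(freq.most_common())
print(n_tokens, "tokens,", n_types, "types, TTR =", ttr)

# "Type" as word form vs. "type" as lexeme: run, runs, ran, running.
# LEMMAS is a hypothetical hand-made table, not a real lemmatiser.
LEMMAS = {"runs": "run", "ran": "run", "running": "run"}
forms = ["run", "runs", "ran", "running"]
form_types = len(set(forms))                           # counted as word forms
lemma_types = len({LEMMAS.get(f, f) for f in forms})   # counted as lexemes

print(form_types, "word-form types,", lemma_types, "lexeme type")
```

Whether the four forms of "run" count as one type or four depends entirely on whether the corpus has been lemmatised, which is why type counts from different tools are not always comparable.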