UNIVERSITÀ DEGLI STUDI DI MACERATA
Dipartimento di Studi Umanistici – Lingue, Mediazione, Storia, Lettere, Filosofia
Corso di Laurea Magistrale in Lingue Moderne per la Comunicazione e la Cooperazione Internazionale (Classe LM-38)
TPCI inglese - mod. B
Strumenti e tecnologie per la traduzione specialistica - a.a. 2016/2017

PART 2.2: Parallel and Comparable Corpora
Sara Castagnoli
sara.castagnoli@unimc.it

A translation-relevant corpus typology
Corpora
• general / reference: monolingual
• specialised: monolingual, multilingual (usually)
• comparable: comparable texts in terms of genre/text type or topic. Usually rather small, created ad hoc for specific tasks («DIY»), «disposable»
• parallel: original texts aligned to corresponding translations. Typically available and precompiled (as for general, reference monolingual corpora)

Parallel vs. comparable multilingual corpora
Parallel (translational) corpora
• contain translationally “equivalent” texts: STs and their corresponding TTs
• need to be aligned, usually at the sentence level, i.e. SL sentence X matched to TL sentence X’
• context is provided to account for “equivalence” and “translation shifts” between ST and TT
• translation direction needs to be clear, i.e. which are SL and TL components of the corpus
Comparable corpora
• texts originally produced (not translated) in the respective languages
• consist of independent texts which are “similar” according to some pre-determined criteria
• the various language components share a set of common features, e.g. text type, genre, publication span, domain, topic
• parameters defining this similarity vary widely

Parallel and comparable multilingual corpora
Analogies with dictionary usage
Monolingual source and target corpora can be compared to monolingual source and target language dictionaries. While dictionaries favor a synthetic approach to lexical meaning (via a definition), corpora offer an analytic approach (via multiple contexts).
Translators can use target monolingual corpora to check the meaning and usage of translation candidates in the target contexts. Like source language dictionaries, source language corpora can be consulted for source text analysis and understanding. Large reference corpora can function as general dictionaries, while smaller, specialized and bilingual comparable corpora can be seen as analogous to specialized monolingual dictionaries.
(Zanettin 2002 “Corpora in Translation Practice”)

Parallel corpora can be compared to bilingual dictionaries, with a few important differences: bilingual dictionaries are repertoires of lexical equivalents (general dictionaries) or terms (specialized dictionaries and terminologies) established by dictionary makers, which are offered as translation candidates. Parallel corpora are repertoires of strategies deployed by past translators, as well as repertoires of translation equivalents.
In selecting a translation equivalent from a general bilingual dictionary, a translator has to assess the appropriateness of the candidate to the new context by starting from a definition and a few usage examples. A parallel corpus will offer a repertoire of translation strategies past translators have resorted to when confronted with problems similar to the ones that have prompted a search in a parallel corpus.
Parallel corpora can provide information that bilingual dictionaries do not usually contain. They can offer not only equivalence at the word level, but also non-equivalence, i.e. cases where there is no easy equivalent for words, terms or phrases across languages. A parallel corpus can provide evidence of how actual translators have dealt with this lack of direct equivalence at word level.
(Zanettin 2002 “Corpora in Translation Practice”)

Bilingual parallel corpora
What is needed:
• Pairs of texts (ST/TT)
• Aligner / alignment
• Parallel concordancer

Sentence-level alignment (new line delimited)
1:1 alignment
EN: She is the author of numerous articles regarding learning disabilities and she speaks often before parent and teacher groups concerning learning and behavior problems.
IT: È autrice di numerosi articoli riguardanti le disabilità di apprendimento e ha tenuto spesso conferenze davanti a gruppi di genitori e insegnanti sui problemi del comportamento e dell’apprendimento.
2:1 alignment
EN: All expectations need to be direct and explicit. Don't require this child to 'read between the lines' to glean your intentions.
IT: Esplicitare chiaramente tutte le aspettative, in modo da non richiedere al bambino di “leggere tra le righe” per cogliere le intenzioni.
1:1 alignment
EN: Obviously, the child with nonverbal learning disorders would not be expected to be the 'scribe' in a cooperative grouping - her contribution should be in the verbal arena.
IT: Ovviamente non ci si deve aspettare che sia lo “scriba” del gruppo cooperativo, il suo contributo deve essere inserito nell’arena verbale.

[Figure: sentence-level alignment (XML format)]

Parallel corpora cannot represent the full range of linguistic possibilities of the target language and may reflect the stylistic idiosyncrasies of the source language and of individual translators. However, the comparison between large numbers of texts and their acknowledged translations can show how equivalence has been established by translators under certain circumstances and provide examples of translation strategies. If such corpora are sufficiently varied and large, looking at recurring linguistic choices made by translators allows general patterns to be perceived.
Learners can thus notice "preferred ways of putting things" (Kennedy 1992), and generalize from the aggregation of sets of individual instances.
(Zanettin 1998 “Bilingual Comparable corpora and the Training of Translators”)

Bilingual parallel corpora on the web
• OPUS corpus, opus.lingfil.uu.se
  • A variety of multilingual parallel corpora
    • European Parliament debates (EuroParl corpus)
    • European Central Bank corpus
    • UN documents
    • Subtitles (open subtitle project)
    • Software manuals (PHP, OO)
    • …
  • With linguistic annotation
  • Online interface based on CWB/CQP syntax
  • Corpora can also be downloaded for local use
• COMPARA (EN-PT)
• OSLO Multilingual Corpus

EuroParl v7 search interface (http://opus.lingfil.uu.se/)
[Screenshot callouts: help; choose SL; query; choose TL(s); other useful functions; sort + launch the query]
Example query: [word="a|an|the"] [tnt="JJ.*"] "issue"
[Screenshots of the results, showing among other things wrong alignments and a bad translation]

OPUS multilingual search interface > Europarl (http://opus.lingfil.uu.se/)
[Screenshot callouts: query; launch the query; choose TL(s); format of search results]

Parallel corpora (English / Italian)
www.nature.com/nature/journal/v455/n7215/full/455835b.html
Last week, tens of thousands of researchers took to the streets to register their opposition to a proposed bill designed to control civil-service spending.
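What a parallel concordancer does with sentence-aligned text can be sketched in a few lines of Python. This is only a minimal illustration, not the way OPUS or any of the tools named in these slides is implemented: `parallel_concordance` is a hypothetical helper name, and the aligned pairs are the English/Italian examples from the sentence-level alignment slide above.

```python
import re

# Aligned units (English source, Italian target), copied from the
# sentence-level alignment example in these slides. The second unit is a
# 2:1 alignment: two English sentences matched to one Italian sentence.
ALIGNED_PAIRS = [
    ("She is the author of numerous articles regarding learning disabilities "
     "and she speaks often before parent and teacher groups concerning "
     "learning and behavior problems.",
     "È autrice di numerosi articoli riguardanti le disabilità di "
     "apprendimento e ha tenuto spesso conferenze davanti a gruppi di "
     "genitori e insegnanti sui problemi del comportamento e "
     "dell’apprendimento."),
    ("All expectations need to be direct and explicit. Don't require this "
     "child to 'read between the lines' to glean your intentions.",
     "Esplicitare chiaramente tutte le aspettative, in modo da non "
     "richiedere al bambino di “leggere tra le righe” per cogliere le "
     "intenzioni."),
    ("Obviously, the child with nonverbal learning disorders would not be "
     "expected to be the 'scribe' in a cooperative grouping - her "
     "contribution should be in the verbal arena.",
     "Ovviamente non ci si deve aspettare che sia lo “scriba” del gruppo "
     "cooperativo, il suo contributo deve essere inserito nell’arena "
     "verbale."),
]


def parallel_concordance(pairs, pattern):
    """Return every (source, target) pair whose source side matches pattern.

    Hypothetical helper: the search runs on the source language only, and
    the aligned target text is returned alongside each hit.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if regex.search(src)]


if __name__ == "__main__":
    for src, tgt in parallel_concordance(ALIGNED_PAIRS, r"\blearning\b"):
        print("EN:", src)
        print("IT:", tgt)
        print()
```

The search is done on one language and the aligned segment in the other language is simply displayed next to each hit; finding the equivalent inside that segment is still left to the translator.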
Exercise: “control civil-service spending”
• Going parallel: the OPUS corpus…
• opus.lingfil.uu.se
• Select:
  • Under “Search & Browse”: OPUS multilingual search interface
  • Under “Corpus”: Europarl3
  • Under “Languages”: En
  • Under “Alignments”: check it
• Focus on two separate parts (as basic / initial queries):
  • control spending
  • civil service

• Europarl corpus = EU parliamentary debates (proceedings)
• "control" "spending"
• [lem="control"] [] "spending"
• What equivalents do we find in Italian texts? (set “show max --hits” to 100)
• For example
  • contenimento della spesa (pubblica)
  • controllare la spesa (pubblica)
  • controllo (valido ed efficace) sulla spesa
  • controllo della spesa (pubblica)
  • rigore della spesa

• Now we just focus on the Italian version (without aligned English)
• Which (other) verbs accompany spesa (pubblica) and mean “control”?
• [pos="V.*"] []? "spesa" "pubblica"
  NB: []? means the slot may be empty
• (controllare), tagliare, ridurre, limitare, contrarre, riorganizzare
• But… why only [pos="V.*"]? There is no need for this! Other possible queries (involving nouns or prepositions)
• [pos="NOM"] []? "spesa" "pubblica"
• [pos="PRE"] "spesa" "pubblica"
• riduzione, controllo, contenimento, tagli, restrizioni, limitazione

• ST: civil-service spending [vs. public spending]
• Let’s look for civil service in the aligned Eng/Ita texts of the European Parliament with this query: "civil" "service"
• What Italian equivalents do we find?
• funzione pubblica
• amministrazione pubblica
• pubblica amministrazione
• settore pubblico
• pubblico impiego

Translation into Italian of “control civil-service spending” (double-checking context of use in the Italian EU texts)
• Verb / noun: controllare/controllo; ridurre/riduzione; tagliare/tagli; …
• Object: della/alla/la spesa; dei/ai/i costi
• Modifier: della funzione pubblica; dell’amministrazione pubblica; della pubblica amministrazione; del settore pubblico
• Annotations on the candidates:
  • more neutral, various contexts of use, perhaps the best candidate
  • technical term (legal/administrative field)
  • rather negative connotations
  • used mainly in bureaucratic contexts

One possible solution
Last week, tens of thousands of researchers took to the streets to register their opposition to a proposed bill designed to control civil-service spending.
La settimana scorsa decine di migliaia di ricercatori sono scesi in piazza per protestare contro una proposta di legge che prevede tagli al settore pubblico.

Tools and resources for offline parallel corpora
• Parallel concordancers
  • Paraconc
    • www.athel.com/para.html
    • Demo freely available: the number of hits is restricted to 150 and the results cannot be saved or printed.
  • AntPConc
    • http://www.laurenceanthony.net/software.html
    • Still being developed
  • WordSmith Tools
    • (older) version 4.0 now freely available: http://lexically.net/wordsmith/version4/index.htm
    • Viewer and Aligner program
  • Multiconcord
    • http://artsweb.bham.ac.uk/pKing/multiconc/l_text.htm
    • old(ish)
• Aligners
  • Alinea
  • InterText
  • hunalign
  • YouAlign

Parallel corpora: homework
Given the following extract, from an administrative text,
La Provincia è chiamata ad assumere un ruolo propositivo e di coordinamento per quanto concerne lo sviluppo socioeconomico dell'area provinciale nel suo complesso e dei singoli ambiti territoriali.
how would you translate the word propositivo into English, in a similar context?
• Open the Europarl corpus
• Find the most common equivalents for propositivo, as well as their collocates
• Take note of other translation solutions
• Any odd one?
• Using an English monolingual reference corpus (e.g. on the Leeds internet corpora website), can you explain why they are not appropriate?

A translation-relevant corpus typology
Corpora
• general / reference: monolingual
• specialised: monolingual, multilingual (usually)
• comparable: comparable texts in terms of genre/text type or topic. Normally relatively small, created ad hoc for specific translation assignments («DIY»), «disposable», for texts belonging to specialised domains
• parallel

Using comparable corpora for translation
• Learn something about a specific domain/topic
• Understand the source text
• Choose the “right” TL term/word/collocation
• Identify and reproduce the features of the specific genre/register in the TL
• Look for equivalents, definitions and contexts of use in both the source and target language

Text selection
• Parameters that should be taken into account when manually selecting texts for the corpus:
  • Topic (relevance)
  • Source (authoritativeness, reliability)
  • Target audience/reader (who it was written for)
  • Text function
  • Informativeness (information richness)
  • Language (accurate, native)
  • Date/Time of production (recent, up-to-date)
  • Technical/practical issues
    • Formatting
    • Filetype
    • Conversion

Source text
P. R. O. Bally (1959) “Monadenium arborescens”. Candollea 17:25-26.
Coming from Tanzania, this is a robust growing species and is a semi woody succulent, forming a lightly branched shrub/tree up to 4.25 metres high. The stems can grow to 10 cm. thick, are five angled and may be slightly spirally twisted. They are erect and may be solitary or in twos. If branched, the branches are quite slender, grow erect, and are some 30 – 60 cm. apart. They are smooth, and covered in a green bloom. Leaf scars, which are 10 mm. in diameter, are borne 4 –7 cm.
apart and below each leaf scar is a small tubercle which on older plants has a small reddish/brown spine, but a more robust one up to 2 cm. long on is produced on younger plants. The leaves are crowded terminally around the ends of the stems and are produced from the angles of the stems. They are obovate, pointed and heart shaped, 7 – 19 cm long and 5.6 – 11 cm. wide. Flowering takes place from an eye situated directly above the leaf scar and several cymes may be produced near the apex of the branches with peduncles 6 –7 cm. long and 5 – 6 mm. thick. The colour of the inflorescence is red. This species is not in general cultivation due to its rapid growth and size.

The process for manual corpus construction
• We want to build a bilingual specialised comparable corpus for the translation task (English => Italian)
• Two stages:
  a) Source language corpus component (English)
  b) Target language corpus component (Italian)

Searching for similar SL (English) texts for the corpus
• We look for:
  • web pages in English, as similar to our ST as possible
  • e.g. searching for ‘monadenium’ on google.co.uk
• We find, e.g.:
  • en.wikipedia.org/wiki/Monadenium_arborescens
  • www.sdcss.com/monadenium.html
  • davesgarden.com/guides/pf/go/65135/
  • www.gardening.eu/plants/Succulent-Plants/Monadenium-guentheri/3708/
• You can add to the search string: monadenium filetype:pdf
  In general, pages in pdf format tend to be more informative and authoritative

Assessing the pages found (screenshots):
• Uninformative, different genre
• Very informative, authoritative (source: San Diego Cactus Society), similar genre (journal article)
• Uninformative, little connected text, different function (promotional) and genre
• Low quality, unreliable (language)

Searching for TL texts
• We look for “monadenium” in Italian (reliable) webpages, e.g.:
  • http://www.giardinaggio.it/grasse/singolegrasse/Monadenium/Monadenium.asp
• We make a list of (candidate) key words, e.g.
  • monadenium, caudice, ciazi, caudex, infiorescenze, ritchiei…
• More google searches
• Google Similar pages = related:digilander.libero.it/cacti/p06/PL6052.htm

Practical considerations: file types
• Corpus files must be downloaded/saved in this format:
  • Simple/pure text (.txt)
  • save as “text only”
• Common formats of online texts
  • HTML
    • File => save as => xxx.txt
    • (just modify the file extension)
  • Microsoft Word: these must be converted into (saved as) .txt format
    • Save as => xxx.txt
    • File type => plain text “.txt” => ok (ignore any error message)
  • pdf
    • image/“dead pdf” (not good) vs. searchable pdf (OK)
    • edit => select all => copy => paste in a new text file => save
• Plan separate folders for each corpus (sub-)component
  • e.g. SL/TL, but also more/less authoritative, different genres etc.

Practical considerations: corpus query tools
Now that we have built the corpus, what concordancing tools are available?
• AntConc – user-friendly, many functionalities, and you can download it (for free and legally!!) from this URL:
  • www.laurenceanthony.net/software.html
• TextStat – free, includes an interesting web-spider which downloads as many pages as you want from a particular website (good if you have identified a reliable website)
  • http://neon.niederlandistik.fu-berlin.de/en/textstat/
• WordSmith Tools – commercial tool
  • (older) version 4.0 now freely available
  • http://lexically.net/wordsmith/version4/index.htm
And what can we do with them?

AntConc (screenshot callouts)
• “File” menu => “Open Dir…”
• Names of the files in the corpus
• Frequency list
• Clusters / N-grams
• Clusters / N-grams size
• Min. frequency
• Upload reference corpus
• Keyword list

Find or investigate collocates from a comparable Eng/Ita corpus on botany (with AntConc)
• search the English corpus for collocates of:
  • roots
  • in a span (i.e. distance/space) of 2L/1L
  • with min. frequency of collocate = 2
• then search a comparable Italian corpus for collocates of:
  • radici
  • in a span (i.e. distance/space) of 1R/2R
  • with min.
frequency of collocate = 2

Corpus query tool: AntConc
Study concordances in a comparable Eng/Ita corpus on botany (with AntConc)
• first search the English corpus for leaves are
• sort concordance results according to the 2nd word to the right (2R)
• then search the Italian corpus for foglie sono
• sort concordance results in the same way

Using a comparable corpus: understand (parts of) the source text
• Example: “robust”
Coming from Tanzania, this is a robust growing species and is a semi woody succulent, forming a lightly branched shrub/tree up to 4.25 metres high. The stems can grow to 10 cm. thick, are five angled and may be slightly spirally twisted.

• Write ‘robust’ in the “search term” box
• Make sure the box next to “Words” is ticked
• Click on ‘Start’
[Screenshot callouts: search term; the file in which the term was found; tick this option to use regular expressions; search term in context (4-5 words to the left and right)]

• Ordering concordance lines: “Kwic Sort” => tick “Level 1”
• Scroll down to 1R (1 word to the right)
• Click on “Sort”
• The words immediately to the right of the search term are now red and have been grouped together (robust growing/growth)

Using a comparable corpus: now turn to the TL to produce the actual translation
• When we are sure that we understand the ST, we make hypotheses as to possible translation equivalents for “robust growing species”
• We close the SL corpus (in English in our case)
  • menu file => “close all files”
• We open the TL corpus (in Italian in our case)
  • menu file => “open dir”
• We repeat the steps followed for English, searching for evidence supporting our hypotheses
• The comparable corpus helps us to replicate the style of domain experts in the TL (terms, phraseology...)
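The two AntConc operations used in the slides above, collocates within a span and a KWIC sort on the first word to the right (1R), can be sketched in Python. This is an illustrative sketch only and does not reproduce how AntConc works internally: the function names `collocates` and `kwic_sorted_1r`, the very simple tokenizer, and the toy sentences in `SAMPLE` are all assumptions made up for the example (the sentences are not taken from a real corpus).

```python
import re
from collections import Counter

# Toy text invented for illustration; a real corpus would be read from the
# .txt files saved in the corpus folders.
SAMPLE = (
    "The fleshy roots store water. "
    "Thick fleshy roots develop slowly. "
    "The tuberous roots are fragile."
)


def tokenize(text):
    """Lowercase the text and split it into word tokens (simplistic)."""
    return re.findall(r"\w+", text.lower())


def collocates(tokens, node, left=2, right=0, min_freq=2):
    """Count words within `left` tokens before and `right` tokens after
    each occurrence of `node`; keep those seen at least `min_freq` times."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - left):i])
            counts.update(tokens[i + 1:i + 1 + right])
    return {word: n for word, n in counts.items() if n >= min_freq}


def kwic_sorted_1r(tokens, node, width=4):
    """Build (left context, node, right context) lines for `node` and sort
    them alphabetically by the right context, i.e. by the 1R word first."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left_ctx = " ".join(tokens[max(0, i - width):i])
            right_ctx = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left_ctx, tok, right_ctx))
    return sorted(lines, key=lambda line: line[2])


if __name__ == "__main__":
    toks = tokenize(SAMPLE)
    # Span 2L/1L with minimum collocate frequency 2, as in the exercise.
    print(collocates(toks, "roots", left=2, right=0, min_freq=2))
    for left_ctx, node, right_ctx in kwic_sorted_1r(toks, "roots", width=2):
        print(f"{left_ctx:>20} | {node} | {right_ctx}")
```

For the Italian side of the exercise (radici, span 1R/2R), the same hypothetical `collocates` function would be called with `left=0, right=2`, which reflects the fact that modifiers tend to follow the noun in Italian and precede it in English.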
Comparable corpora: Homework
• Build a small SL+TL comparable corpus for the Monadenium translation task
• Searching the corpus with AntConc, try to find candidate Italian translations for (some of) the underlined expressions:
Coming from Tanzania, this is a robust growing species and is a semi woody succulent, forming a lightly branched shrub/tree up to 4.25 metres high. The stems can grow to 10 cm. thick, are five angled and may be slightly spirally twisted.

Summing up: corpus use in translation
Main uses:
• Test/generate hypotheses as to interpretation of the source text, and as to appropriate translations
  • helpful when you’re dealing with little known text-types / domains
  • helpful when you’re dealing with a little known language
• Improve quality – capture subtleties of source text, produce translations which read like native speaker texts
More precisely,
• Reference corpora provide insights on phraseological regularities in discourse
• Comparable corpora (automatic and manual) can be used for (contrastive) specialised/genre-controlled text analysis
• Parallel corpora provide equivalents in context/evidence of translation strategies (and are more versatile than TMs)

Main sources
• Bowker, L. & J. Pearson (2002) Working with Specialized Language: a Practical Guide to Using Corpora. London and New York, Routledge: This book is recommended as a particularly accessible and comprehensive introduction to the use of corpora
• Austermühl, F. (2001) Electronic Tools for Translators. Manchester, St. Jerome Publishing: Chapter 8 “Corpora as Translation Tools” (pages 124-133).
• Bowker, L. (2002) Computer-Aided Translation Technology. A Practical Introduction. Ottawa, University of Ottawa Press: Chapter 2 “Capturing Data in Electronic Form” (pages 22-42) & Chapter 3 “Corpora and Corpus-Analysis Tools” (pages 43-76)
• Laviosa, S. (2003) “Corpora and the translator”: Chapter 7 in H. Somers (ed.) Computers and Translation. A Translator’s Guide.
Amsterdam and Philadelphia, John Benjamins (pages 105-117)

Further (optional) reading
These additional books provide more technical and in-depth discussions of how corpora can be built (e.g. from the Web) and used, for students who are interested:
• Baroni, M. & S. Bernardini (eds) (2006) Wacky! Working Papers on the Web as Corpus. Bologna: GEDIT. Freely downloadable from http://wackybook.sslmit.unibo.it
• Gatto, M. (2009) From Body to Web: An Introduction to the Web as Corpus. Bari/Roma: Università di Bari/Editori Laterza. Available for download.
• McEnery, T., R. Xiao & Y. Tono (2006) Corpus-based Language Studies: an Advanced Resource Book. London and New York: Routledge
• Zanettin, F. (2012) Translation-Driven Corpora: Corpus Resources for Descriptive and Applied Translation Studies. Manchester: St. Jerome