Parallel corpora

UNIVERSITÀ DEGLI STUDI DI MACERATA
Dipartimento di Studi Umanistici – Lingue, Mediazione, Storia, Lettere, Filosofia
Corso di Laurea Magistrale in Lingue Moderne
per la Comunicazione e la Cooperazione Internazionale (Classe LM-38)
TPCI English - module B
Tools and technologies for specialised translation - academic year 2016/2017
PART 2.2:
Parallel and Comparable Corpora
Sara Castagnoli
sara.castagnoli@unimc.it
A translation-relevant corpus typology
Corpora
general / reference
monolingual
specialised
monolingual
multilingual
(usually)
Comparable texts in terms of
genre/text type or topic.
Usually rather small, created
ad-hoc for specific tasks
(«DIY»), «disposable»
comparable
parallel
Original texts aligned to
corresponding translations.
Typically available and precompiled (as for general,
reference monolingual corpora)
Parallel vs. comparable multilingual corpora
Parallel (translational) corpora
• contain translationally
“equivalent” texts: STs and their
corresponding TTs
• need to be aligned, usually at
the sentence level, i.e. SL
sentence X matched to TL
sentence X’
• context is provided to account
for “equivalence” and
“translation shifts” between ST
and TT
• translation direction needs to
be clear, i.e. which are SL and TL
components of the corpus
Comparable corpora
• texts originally produced (not
translated) in the respective
languages
• consist of independent texts
which are “similar” according to
some pre-determined criteria
• the various language
components share a set of
common features, e.g. text type,
genre, publication span, domain,
topic
• parameters defining this
similarity vary widely
Parallel and comparable multilingual corpora
Analogies with dictionary usage
Monolingual source and target corpora can be compared to
monolingual source and target language dictionaries. While
dictionaries favor a synthetic approach to lexical meaning (via a
definition), corpora offer an analytic approach (via multiple
contexts).
Translators can use target monolingual corpora to check the
meaning and usage of translation candidates in the target
contexts. Like source language dictionaries, source language
corpora can be consulted for source text analysis and
understanding. Large reference corpora can function as general
dictionaries, while smaller, specialized and bilingual comparable
corpora can be seen as analogous to specialized monolingual
dictionaries.
(Zanettin 2002 “Corpora in Translation Practice”)
Parallel and comparable multilingual corpora
Parallel corpora can be compared to bilingual dictionaries, with a few important
differences: bilingual dictionaries are repertories of lexical equivalents (general
dictionaries) or terms (specialized dictionaries and terminologies) established by
dictionary makers which are offered as translation candidates. Parallel corpora
are repertoires of strategies deployed by past translators, as well as repertoires
of translation equivalents.
In selecting a translation equivalent from a general bilingual dictionary a
translator has to assess the appropriateness of the candidate to the new context
by starting from a definition and a few usage examples. A parallel corpus will
offer a repertoire of translation strategies past translators have resorted to when
confronted with similar problems to the ones that have prompted a search in a
parallel corpus. Parallel corpora can provide information that bilingual
dictionaries do not usually contain. They can not only offer equivalence at the
word level, but also non-equivalence, i.e. cases where there is no easy
equivalent for words, terms or phrases across languages. A parallel corpus can
provide evidence of how actual translators have dealt with this lack of direct
equivalence at word level.
(Zanettin 2002 “Corpora in Translation Practice”)
Bilingual parallel corpora
What is needed:
• Pairs of texts (ST/TT)
• Aligner / alignment
• Parallel concordancer
She is the author of numerous articles regarding
learning disabilities and she speaks often before
parent and teacher groups concerning learning and
behavior problems.
È autrice di numerosi articoli riguardanti le
disabilità di apprendimento e ha tenuto spesso
conferenze davanti a gruppi di genitori e insegnanti
sui problemi del comportamento e dell’apprendimento.
All expectations need to be direct and explicit. Don't
require this child to 'read between the lines' to
glean your intentions.
Esplicitare chiaramente tutte le aspettative, in modo
da non richiedere al bambino di “leggere tra le righe”
per cogliere le intenzioni.
Obviously, the child with nonverbal learning disorders
would not be expected to be the 'scribe' in a
cooperative grouping - her contribution should be in
the verbal arena.
Ovviamente non ci si deve aspettare che sia lo “scriba”
del gruppo cooperativo, il suo contributo deve essere
inserito nell’arena verbale.
Sentence-level alignment
(new line delimited)
1:1 alignment
2:1 alignment
Sentence-level alignment (XML format)
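A minimal Python sketch of the new-line-delimited format, assuming (hypothetically) that source and target segments are stored one per line, so that line n of the English text corresponds to line n of the Italian text. In a 2:1 alignment both source sentences simply sit on the same line. The function name pair_segments is illustrative and does not belong to any aligner.

```python
def pair_segments(source_text, target_text):
    """Pair the n-th non-empty line of the source with the n-th line of the target."""
    src = [line.strip() for line in source_text.splitlines() if line.strip()]
    tgt = [line.strip() for line in target_text.splitlines() if line.strip()]
    if len(src) != len(tgt):
        # A mismatch means the two files are not aligned line by line.
        raise ValueError(f"segment counts differ: {len(src)} source vs {len(tgt)} target")
    return list(zip(src, tgt))


english = (
    "She is the author of numerous articles regarding learning disabilities "
    "and she speaks often before parent and teacher groups concerning "
    "learning and behavior problems.\n"
    # 2:1 alignment: two English sentences on one line, one Italian sentence.
    "All expectations need to be direct and explicit. Don't require this "
    "child to 'read between the lines' to glean your intentions.\n"
)
italian = (
    "È autrice di numerosi articoli riguardanti le disabilità di apprendimento "
    "e ha tenuto spesso conferenze davanti a gruppi di genitori e insegnanti "
    "sui problemi del comportamento e dell’apprendimento.\n"
    "Esplicitare chiaramente tutte le aspettative, in modo da non richiedere "
    "al bambino di “leggere tra le righe” per cogliere le intenzioni.\n"
)

pairs = pair_segments(english, italian)
for src, tgt in pairs:
    print(src, "|||", tgt)
```

A parallel concordancer builds on exactly this kind of pairing: a hit in one member of a pair is displayed together with the other member.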
Bilingual parallel corpora
Parallel corpora cannot represent the full range of linguistic possibilities of
the target language and may reflect the stylistic idiosyncrasies of the
source language and of individual translators.
However, the comparison between large numbers of texts and their
acknowledged translations can show how equivalence has been
established by translators under certain circumstances and provide
examples of translation strategies. If such corpora are sufficiently varied
and large, looking at recurring linguistic choices made by translators
allows general patterns to be perceived. Learners can thus notice
"preferred ways of putting things" (Kennedy 1992), and generalize from
the aggregation of sets of individual instances.
(Zanettin 1998 “Bilingual Comparable corpora and the Training of Translators”)
Bilingual parallel corpora on the web
• OPUS corpus, opus.lingfil.uu.se
• A variety of multilingual parallel corpora
• European Parliament debates (EuroParl corpus)
• European Central Bank corpus
• UN documents
• Subtitles (open subtitle project)
• Software manuals (PHP, OO)
• …
• With linguistic annotation
• Online interface based on CWB/CQP syntax
• Corpora can also be downloaded for local use
• COMPARA (EN-PT)
• OSLO Multilingual Corpus
http://opus.lingfil.uu.se/ → EuroParl v7 search interface
Search form: help; choose SL; query; choose TL(s); other useful functions; sort; launch the query
Example query: [word="a|an|the"] [tnt="JJ.*"] "issue"
Callouts on the results: wrong alignments; bad translation!
http://opus.lingfil.uu.se/ → OPUS multilingual search interface > Europarl
Search form: query; launch the query; choose TL(s); format of search results
www.nature.com/nature/journal/v455/n7215/full/455835b.html
Parallel corpora (English / Italian)
Last week, tens of thousands
of researchers took to the
streets to register their
opposition to a proposed bill
designed to control civil-service spending.
Exercise: “control civil-service spending”
• Going parallel: the OPUS corpus…
• opus.lingfil.uu.se
• Select:
• Under “Search & Browse”: OPUS multilingual search interface
• Under “Corpus”: Europarl3
• Under “Languages”: En
• Under “Alignments”: check it
• Focus on two separate parts (as basic / initial queries):
• control spending
• civil service
Exercise: “control civil-service spending”
• Europarl corpus = EU parliamentary debates (proceedings)
• “control” “spending”
• [lem="control"] [] "spending"
• What equivalents do we find in Italian texts? (set “show max --hits” to 100)
• For example
• contenimento della spesa (pubblica)
• controllare la spesa (pubblica)
• controllo (valido ed efficace) sulla spesa
• controllo della spesa (pubblica)
• rigore della spesa
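The query [lem="control"] [] "spending" can be read as: a token whose lemma is control, then exactly one token of any kind, then the word spending. A rough Python sketch of that logic over sentence-aligned pairs follows. It does not use CWB/CQP; the lemma table and the sentence pairs are invented for illustration, whereas a real corpus such as Europarl carries its own lemma annotation.

```python
import re

# Hypothetical mini lemma table standing in for real lemma annotation.
LEMMAS = {"controls": "control", "controlling": "control", "controlled": "control"}


def tokens(sentence):
    """Lower-cased word tokens, punctuation dropped."""
    return re.findall(r"\w+", sentence.lower())


def matches(sentence):
    """True if lemma 'control' is followed by exactly one token and then 'spending'."""
    toks = tokens(sentence)
    for i in range(len(toks) - 2):
        if LEMMAS.get(toks[i], toks[i]) == "control" and toks[i + 2] == "spending":
            return True
    return False


def parallel_hits(pairs):
    """Return the (English, Italian) pairs whose English side matches the pattern."""
    return [(en, it) for en, it in pairs if matches(en)]


# Invented aligned pairs, for illustration only.
sample_pairs = [
    ("We must control public spending.", "Dobbiamo controllare la spesa pubblica."),
    ("Parliament is controlling the spending of the agencies.",
     "Il Parlamento controlla la spesa delle agenzie."),
    ("We control spending.", "Controlliamo la spesa."),  # no token in the [] slot
    ("Spending has increased.", "La spesa è aumentata."),
]

hits = parallel_hits(sample_pairs)
for en, it in hits:
    print(en, "|||", it)
```

Reading down the Italian side of the hits is what produces lists of equivalents such as the ones above.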
Exercise: “control civil-service spending”
• Now we just focus on the Italian version (without aligned English)
• Which (other) verbs accompany spesa (pubblica) and mean “control”?
• [pos="V.*"] []? "spesa" "pubblica" NB: []? means the slot may be empty
• (controllare), tagliare, ridurre, limitare, contrarre, riorganizzare
• But… why only [pos="V.*"] ? There is no need for this!
Other possible queries (involving nouns or prepositions)
• [pos="NOM"] []? "spesa" "pubblica"
• [pos="PRE"] "spesa" "pubblica"
• riduzione, controllo, contenimento, tagli, restrizioni, limitazione
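The optional slot []? in these queries means "zero or one token of any kind". A small Python sketch of how [pos="V.*"] []? "spesa" "pubblica" could be evaluated over a POS-tagged token list is given below. The tagged tokens are invented and the tag names are only illustrative; this is not the CQP engine.

```python
import re


def find_before_spesa_pubblica(tagged, pos_pattern):
    """Return words whose tag matches pos_pattern and which are followed,
    after zero or one intervening token, by 'spesa' 'pubblica'."""
    hits = []
    for i, (word, pos) in enumerate(tagged):
        if not re.fullmatch(pos_pattern, pos):
            continue
        for gap in (0, 1):  # []? : the slot may be empty or hold one token
            j = i + 1 + gap
            if (j + 1 < len(tagged)
                    and tagged[j][0] == "spesa"
                    and tagged[j + 1][0] == "pubblica"):
                hits.append(word)
                break
    return hits


# Invented, hand-tagged tokens (illustrative tags only).
tagged_tokens = [
    ("occorre", "VER:pres"), ("ridurre", "VER:infi"), ("la", "DET:def"),
    ("spesa", "NOM"), ("pubblica", "ADJ"),
    ("il", "DET:def"), ("contenimento", "NOM"), ("della", "PRE:det"),
    ("spesa", "NOM"), ("pubblica", "ADJ"),
]

verbs = find_before_spesa_pubblica(tagged_tokens, r"V.*")
nouns = find_before_spesa_pubblica(tagged_tokens, r"NOM")
print(verbs, nouns)
```

Changing only the tag pattern, as in the second call, mirrors the move from the verb query to the noun query.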
Exercise: “control civil-service spending”
• ST: civil-service spending [vs. public spending]
• Let’s look for civil service in the aligned Eng/Ita texts of the
European Parliament with this query:
"civil" "service"
• What Italian equivalents do we find?
• funzione pubblica
• amministrazione pubblica
• pubblica amministrazione
• settore pubblico
• pubblico impiego
Translation into Italian of “control civil-service spending”
(double-checking context of use in the Italian EU texts)
Verb / noun: controllare/controllo, ridurre/riduzione, tagliare/tagli, …
Object: della/alla/la spesa, dei/ai/i costi
Modifier: della funzione pubblica, dell’amministrazione pubblica, della pubblica amministrazione, del settore pubblico
Notes on the candidates (slide callouts):
• more neutral, various contexts of use, perhaps the best candidate
• technical term (legal/administrative domain)
• rather negative connotations
• used mostly in bureaucratic contexts
One possible solution
Last week, tens of
thousands of
researchers took to the
streets to register their
opposition to a
proposed bill designed
to control civil-service
spending.
La settimana scorsa
decine di migliaia di
ricercatori sono scesi in
piazza per protestare
contro una proposta di
legge che prevede tagli
al settore pubblico.
Tools and resources for offline parallel corpora
• Parallel concordancers
• Paraconc
• www.athel.com/para.html
• Demo freely available: the number of hits is restricted to 150 and the results
cannot be saved or printed.
• AntPConc
• http://www.laurenceanthony.net/software.html
• Still being developed
• WordSmith Tools
• (older) version 4.0 now freely available
http://lexically.net/wordsmith/version4/index.htm
• Viewer and Aligner program
• Multiconcord
• http://artsweb.bham.ac.uk/pKing/multiconc/l_text.htm
• old(ish)
• Aligners
• Alinea
• InterText
• hunalign
• YouAlign
Parallel corpora: homework
Given the following extract, from an administrative text,
La Provincia è chiamata ad assumere un ruolo propositivo e di
coordinamento per quanto concerne lo sviluppo socioeconomico
dell'area provinciale nel suo complesso e dei singoli ambiti territoriali.
how would you translate the word propositivo into English, in
a similar context?
• Open the Europarl corpus
• Find the most common equivalents for propositivo, as well as their
collocates
• Take note of other translation solutions
• Any odd one?
• Using an English monolingual reference corpus (e.g. on Leeds internet
corpora website), can you explain why they are not appropriate?
A translation-relevant corpus typology
Corpora
general / reference
monolingual
specialised
monolingual
multilingual
(usually)
comparable
Comparable texts in terms of genre/text type or topic.
Normally relatively small, created ad-hoc for specific
translation assignments («DIY»), «disposable», for texts
belonging to specialised domains
parallel
Using comparable corpora for translation
• Learn something about a specific domain/topic
• Understand the source text
• Choose the “right” TL term/word/collocation
• Identify and reproduce the features of the specific
genre/register in the TL
• Look for equivalents, definitions and contexts of use in
both the source and target language
Text selection
• Parameters that should be taken into account when
manually selecting texts for the corpus:
• Topic (relevance)
• Source (authoritativeness, reliability)
• Target audience/reader (who it was written for)
• Text function
• Informativeness (information richness)
• Language (accurate, native)
• Date/Time of production (recent, up-to-date)
• Technical/practical issues
• Formatting
• Filetype
• Conversion
Source text
P. R. O. Bally (1959) “Monadenium arborescens”. Candollea 17:25-26.
Coming from Tanzania, this is a robust growing species and is
a semi woody succulent, forming a lightly branched shrub/tree
up to 4.25 metres high. The stems can grow to 10 cm. thick, are
five angled and may be slightly spirally twisted.
They are erect and may be solitary or in twos. If branched, the branches are
quite slender, grow erect, and are some 30 – 60 cm. apart. They are smooth,
and covered in a green bloom. Leaf scars, which are 10 mm. in diameter, are
borne 4 –7 cm. apart and below each leaf scar is a small tubercle which on
older plants has a small reddish/brown spine, but a more robust one up to 2
cm. long on is produced on younger plants. The leaves are crowded
terminally around the ends of the stems and are produced from the angles of
the stems. They are obovate, pointed and heart shaped, 7 – 19 cm long and
5.6 – 11 cm. wide. Flowering takes place from an eye situated directly above
the leaf scar and several cymes may be produced near the apex of the
branches with peduncles 6 –7 cm. long and 5 – 6 mm. thick. The colour of the
inflorescence is red. This species is not in general cultivation due to its
rapid growth and size.
The process for manual corpus construction
• We want to build a bilingual specialised comparable corpus
for the translation task (English → Italian)
• Two stages:
a) Source language corpus component (English)
b) Target language corpus component (Italian)
Searching for similar SL (English) texts for the corpus
• We look for:
• web pages in English, as similar to our ST as possible
• e.g. searching for ‘monadenium’ on google.co.uk
• We find, e.g.:
en.wikipedia.org/wiki/Monadenium_arborescens
www.sdcss.com/monadenium.html
davesgarden.com/guides/pf/go/65135/
www.gardening.eu/plants/Succulent-Plants/Monadenium-guentheri/3708/
• You can add to the search string: monadenium filetype:pdf
In general pages in pdf format tend to be more informative and authoritative
Uninformative, different genre
Very informative, authoritative (source: San Diego Cactus Society), similar genre (journal article)
Uninformative, little connected text, different function (promotional) and genre
Low quality, unreliable (language)
Searching for TL texts
• We look for “monadenium” in Italian (reliable)
webpages, e.g.:
• http://www.giardinaggio.it/grasse/singolegrasse/Monadenium/Monadenium.asp
• We make a list of (candidate) key words, e.g. :
• monadenium, caudice, ciazi, caudex, infiorescenze, ritchiei…
• More google searches
• Google Similar pages =
related:digilander.libero.it/cacti/p06/PL6052.htm
Practical considerations: file types
• Corpus files must be downloaded/saved in this format:
• Simple/pure text (.txt)
• save as “text only”
• Common formats of online texts (these must be converted into, i.e. saved as, .txt format)
• HTML
• File → save as → xxx.txt
• (just modify the file extension)
• Microsoft Word
• Save as → xxx.txt
• File type → plain text “.txt” → ok (ignore any error message)
• pdf
• image/“dead pdf” (not good) vs. searchable pdf (OK)
• edit → select all → copy → paste in a new text file → save
• Plan separate folders for each corpus (sub-)component
• e.g. SL/TL, but also more/less authoritative, different genres etc.
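When many pages have to be converted, the HTML-to-text step can also be scripted. Below is a minimal sketch using only the Python standard library. The class name, the function names and the sample HTML are illustrative assumptions, and real web pages usually need extra clean-up (menus, footers, cookie banners) before the text is fit for a corpus.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "tr"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # keep block elements on separate lines

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def save_as_txt(html, path):
    """Write the extracted text to a UTF-8 plain-text file."""
    Path(path).write_text(html_to_text(html), encoding="utf-8")


sample_html = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<p>The stems are <b>erect</b>.</p><script>var x = 1;</script>"
    "<p>The leaves are obovate.</p></body></html>"
)
print(html_to_text(sample_html))
```

Saving explicitly as UTF-8 keeps accented characters intact, which matters for the Italian component of the corpus.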
Practical considerations: corpus query tools
Now that we have built the corpus, what concordancing tools are
available?
• AntConc – user-friendly, many functionalities, and you can
download it (for free and legally!!) from this URL:
• www.laurenceanthony.net/software.html
• TextStat – free, includes an interesting web-spider which
downloads as many pages as you want from a particular website
(good if you have identified a reliable website)
• http://neon.niederlandistik.fu-berlin.de/en/textstat/
• WordSmith Tools – commercial tool
• (older) version 4.0 now freely available
• http://lexically.net/wordsmith/version4/index.htm
And what can we do with them?
AntConc
“File” menu => “Open Dir…”
Names of the files in the corpus
Frequency list
Clusters / N-grams
Clusters / N-grams size
Min. frequency
Upload reference corpus
Keyword list
Find or investigate collocates from a
comparable Eng/Ita corpus on botany (with AntConc)
• search the English corpus for collocates of:
• roots
• in a span (i.e. distance/space) of 2L/1L
• with min. fq of collocate=2
• then search a comparable Italian corpus for collocates of:
• radici
• in a span (i.e. distance/space) of 1R/2R
• with min. fq of collocate=2
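What a collocate search with a span and a minimum frequency does can be sketched in a few lines of Python. This is only an approximation of the idea, not AntConc's implementation: the tokenisation is naive, no association statistic is computed, and the sample sentences are invented.

```python
import re
from collections import Counter


def collocates(text, node, left=0, right=0, min_freq=2):
    """Count words occurring within `left` words before and `right` words
    after each occurrence of `node`; keep those seen at least `min_freq` times."""
    toks = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, tok in enumerate(toks):
        if tok != node:
            continue
        window = toks[max(0, i - left):i] + toks[i + 1:i + 1 + right]
        counts.update(window)
    return [(word, freq) for word, freq in counts.most_common() if freq >= min_freq]


# Invented mini-corpus, for illustration only.
english_sample = (
    "The fibrous roots are shallow. Tuberous roots store water. "
    "The fibrous roots spread widely. Thick tuberous roots develop slowly."
)

# Span 2L/1L: the two words to the left of the node, nothing to the right.
left_collocates = collocates(english_sample, "roots", left=2, right=0, min_freq=2)
print(left_collocates)
```

For the Italian component the same function would be called with left=0, right=2 and the node radici, since Italian adjectives tend to follow the noun.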
Corpus query tool: AntConc
Study concordances in a
comparable Eng/Ita corpus on botany (with AntConc)
• first search the English corpus for leaves are
• sort concordance results according to the 2nd word to
the right (2R)
• then search the Italian corpus for foglie sono
• sort concordance results in the same way
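The logic of a sorted concordance (KWIC) can be sketched in Python as follows. The assumptions are that 1R, 2R and so on are counted from the end of the whole search phrase, that tokenisation is naive, and that the sample sentences are invented; AntConc's own behaviour may differ in detail.

```python
import re


def kwic(text, phrase, width=4):
    """Return (left context, node, right context) token lists for each hit."""
    toks = re.findall(r"\w+", text.lower())
    node = phrase.lower().split()
    n = len(node)
    lines = []
    for i in range(len(toks) - n + 1):
        if toks[i:i + n] == node:
            lines.append((toks[max(0, i - width):i], node, toks[i + n:i + n + width]))
    return lines


def sort_by_right(lines, position):
    """Sort hits by the position-th word to the right of the node (2 = 2R)."""
    def key(line):
        right = line[2]
        return right[position - 1] if len(right) >= position else ""
    return sorted(lines, key=key)


# Invented mini-corpus, for illustration only.
sample = (
    "The leaves are often pointed. Its leaves are dark green. "
    "The leaves are usually obovate."
)

sorted_lines = sort_by_right(kwic(sample, "leaves are"), position=2)
for left, node, right in sorted_lines:
    print(" ".join(left).rjust(25), "|", " ".join(node), "|", " ".join(right))
```

Sorting on 2R groups together the lines that share the same word in that slot, which is what makes recurrent patterns after leaves are (or foglie sono) easy to spot.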
Using a comparable corpus:
understand (parts of) the source text
• Example: “robust”
Coming from Tanzania, this is a robust
growing species and is a semi woody
succulent, forming a lightly branched
shrub/tree up to 4.25 metres high. The stems
can grow to 10 cm. thick, are five angled
and may be slightly spirally twisted.
• Write ‘robust’ in the “search term” box
• Make sure the box next to “Words” is ticked
• Click on ‘Start’
Search term
The file in which the term was found
Tick this option to use regular expressions
Search term in context (4-5 words to the left and right)
• Ordering concordance lines: “Kwic Sort” => tick “Level 1”
• Scroll down to 1R (1 word to the right)
• Click on “Sort”
• The words immediately to the right of the search term are now red
and have been grouped together (robust growing/growth)
Using a comparable corpus:
now turn to the TL to produce the actual translation
• When we are sure that we understand the ST, we make
hypotheses as to possible translation equivalents for “robust
growing species”
• We close the SL corpus (in English in our case)
• menu file=> “close all files”
• We open the TL corpus (in Italian in our case)
• menu file=> “open dir”
• We repeat the steps followed for English, searching for
evidence supporting our hypotheses
• The comparable corpus helps us to replicate the style of
domain experts in the TL (terms, phraseology...)
Comparable corpora: Homework
• Build a small SL+TL comparable corpus for the
Monadenium translation task
• Searching the corpus with AntConc, try to find candidate
Italian translations for (some of) the underlined
expressions:
Coming from Tanzania, this is a robust growing species
and is a semi woody succulent, forming a lightly
branched shrub/tree up to 4.25 metres high. The stems
can grow to 10 cm. thick, are five angled and may be
slightly spirally twisted.
Summing up: corpus use in translation
Main uses:
• Test/generate hypotheses as to interpretation of the source text, and
as to appropriate translations
• helpful when you’re dealing with little known text-types / domains
• helpful when you’re dealing with a little known language
• Improve quality – capture subtleties of source text, produce
translations which read like native speaker texts
More precisely,
• Reference corpora provide insights on phraseological regularities in
discourse
• Comparable corpora (automatic and manual) can be used for
(contrastive) specialised/genre-controlled text analysis
• Parallel corpora provide equivalents in context/evidence of translation
strategies (and are more versatile than TMs)
Main sources
• Bowker, L. & J. Pearson (2002) Working with Specialized Language: a
Practical Guide to Using Corpora. London and New York, Routledge:
This book is recommended as a particularly accessible and
comprehensive introduction to the use of corpora
• Austermühl, F. (2001) Electronic Tools for Translators. Manchester,
St. Jerome Publishing: Chapter 8 “Corpora as Translation Tools”
(pages 124-133).
• Bowker, L. (2002) Computer-Aided Translation Technology.
A Practical Introduction. Ottawa, University of Ottawa Press:
Chapter 2 “Capturing Data in Electronic Form” (pages 22-42) &
Chapter 3 “Corpora and Corpus-Analysis Tools” (pages 43-76)
• Laviosa, S. (2003) “Corpora and the translator”: Chapter 7 in
H. Somers (ed.) Computers and Translation. A Translator’s Guide.
Amsterdam and Philadelphia, John Benjamins (pages 105-117)
Further (optional) reading
These additional books provide more technical and in-depth discussions
of how corpora can be built (e.g. from the Web) and used, for students
who are interested:
• Baroni, M. & S. Bernardini (eds) (2006) Wacky! Working Papers on
the Web as Corpus. Bologna: GEDIT. Freely downloadable from
http://wackybook.sslmit.unibo.it
• Gatto, M. (2009) From Body to Web: An Introduction to the Web as
Corpus. Bari/Roma: Università di Bari/Editori Laterza. Downloadable
here.
• McEnery, T., R. Xiao & Y. Tono (2006) Corpus-based Language
Studies: an Advanced Resource Book. London and New York:
Routledge
• Zanettin, F. (2012) Translation-Driven Corpora: Corpus Resources for
Descriptive and Applied Translation Studies. Manchester: St. Jerome