Collocations in translated text: issues, insights, implications

Collocations in translated text: issues, insights, implications • Silvia Bernardini • University of Bologna, Italy • silvia.bernardini@unibo.it • Aston Corpus symposium, 23 May 2008

Talk outline • Collocations • Corpus Linguistics • Corpus-Based Translation Studies • Research questions, methodology, results • Fiction • Open source software • Implications • Descriptive and applied • Methodological follow up • Future work

Background: Collocations in CL • “Phraseology-oriented” approaches • E.g. (Howarth 1996:47) [Restricted collocations are] combinations in which one component is used in its literal meaning, while the other is used in a specialised sense. The specialised meaning of one element can be figurative, delexical or in some way technical and is an important determinant of limited collocability at the other. These combinations are, however, fully motivated.

Background: Collocations in CL • “Parameters” of collocation within phraseology approaches • Motivation/arbitrariness • Commutability • Non-literalness • Transparency • Unpredictability

Background: Collocations in CL • “Frequency-oriented” approaches • “Automatisation” is the result of repetition • British school of linguistics (Firth) • The statistical tendency of words to co-occur (Hunston 2002: 12) • “Significant” collocation is regular collocation between items, such that they occur more often than their respective frequencies and the length of the text in which they occur would predict (Jones and Sinclair 1974:19)

Searching for collocations in text • “Keyword” method • Starting from a (set of) keyword(s) and looking left and right • E.g. Sinclair 1998, Stubbs 2001, Danielsson 2001 • “Sequence” method • Selecting all sequences of N words (or lemmas, or POS tags) that recur a certain number of times • E.g. Kjellmer 1994, Biber et al. 1999, Johansson 1993
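The two search strategies on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in any of the cited studies: the function names, the default span of 4 and the minimum frequency of 2 are all assumptions made for the example.

```python
from collections import Counter

def keyword_collocates(tokens, keyword, span=4):
    """'Keyword' method: count words found within `span` tokens
    to the left or right of each occurrence of `keyword`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts

def recurrent_sequences(tokens, n=2, min_freq=2):
    """'Sequence' method: keep every contiguous n-token sequence
    that recurs at least `min_freq` times."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: freq for gram, freq in grams.items() if freq >= min_freq}

# Toy input (hypothetical sentence, whitespace-tokenised)
tokens = "the web site links to a web page and the web site map".split()
print(keyword_collocates(tokens, "web", span=1))
print(recurrent_sequences(tokens, n=2, min_freq=2))
```

The same functions would work on lemmas or POS tags instead of word forms, since they only compare list items for equality.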

Statistics • MI, t-score, z-score, log-likelihood… • P. Baker (2006), McEnery et al (2006) • Bare frequency • Krenn and Evert (2001) • A mixture of both • MI * log fq • Kilgarriff and Tugwell (2001) • frequency-based cut-offs • Krenn (2000)
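As a rough illustration of how some of these measures relate, the sketch below computes MI, t-score and the "MI * log fq" mixture from the same four counts. The formulas are common textbook formulations and may differ in detail from those used in the works cited on the slide; in particular, the natural logarithm in the mixed measure is an assumption, and all counts in the usage line are hypothetical.

```python
import math

def expected(f_x, f_y, n):
    """Co-occurrence frequency expected by chance, given the word frequencies."""
    return f_x * f_y / n

def mi(f_xy, f_x, f_y, n):
    """Mutual information: log2 of observed over expected frequency."""
    return math.log2(f_xy / expected(f_x, f_y, n))

def t_score(f_xy, f_x, f_y, n):
    """t-score: (observed - expected) / sqrt(observed)."""
    return (f_xy - expected(f_x, f_y, n)) / math.sqrt(f_xy)

def mi_log_fq(f_xy, f_x, f_y, n):
    """MI weighted by log frequency (natural log assumed here)."""
    return mi(f_xy, f_x, f_y, n) * math.log(f_xy)

# Hypothetical counts: pair seen 8 times, words 64 and 128 times, 2**20 tokens
counts = (8, 64, 128, 2 ** 20)
print(mi(*counts), t_score(*counts), mi_log_fq(*counts))
```

MI rewards rare but exclusive pairings, while the t-score and the frequency-weighted variant favour pairs that are also frequent, which is one reason mixtures and frequency cut-offs are used.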

NN in ukWaC (bare fq, top 10)
175642  web site
81127   case study
70514   search engine
66693   application form
65198   credit card
60626   web page
56721   car park
48833   health care
47655   climate change
46643   email address

Collocations in CBTS: applied perspectives • Bahumaid (2006) • Arab university lecturers translating sentences containing collocations (make a noise, domino effect) into English and into Arabic with any reference tools available • Fewer than 50% “correct” answers even when translating into their L1 • Paraphrase most common strategy (40-48%)

Collocations in CBTS: applied perspectives • Hatim and Mason (1997:205) • Collocations should in general be neither less unexpected (i.e. more banal) nor more unexpected (i.e. demanding greater processing effort) than in the ST • Baker (1992: 56ff) • Engrossing effect of source text patterning • Tension between accuracy and naturalness • “The use of established patterns of collocation […] helps to distinguish between a smooth translation, one that reads like an original, and a clumsy translation which sounds ‘foreign’.”

Issues in descriptive CBTS • Translation “norms” or “universals” • Corpus research in TS should focus on the identification of “features which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems”. (Baker 1993:243) • E.g.: explicitation/explicitness, simplification, disambiguation, levelling out (homogeneity), preference for conventional grammar, avoidance of repetition, exaggeration of features of the target language, normalisation/sanitisation…

Collocations in CBTS: descriptive perspectives • Anecdotal evidence by Øverås (1998) • ST: Arket i skrivemaskinen var like skinnende nyfødt blankt som da hun satte det inn i valsen for en time siden. (newborn blank) • TT: The sheet of paper in her typewriter was as pristinely white as when she had inserted it over an hour ago. • Confirms Toury’s (1995) hypothesis that translators often produce repertoremes in place of textemes, i.e. they “produce ready-made, cliché structures”.

Collocations in CBTS: descriptive perspectives • Kenny (2001) • Normalisation/sanitisation in the translation of creative lexical combinations • Danielsson (2001) • Automatic identification of collocations (keyword-based) in ST corpus and analysis of renderings in TT corpus • Dayrell (2007) • Range of collocations employed in original vs. translated language (monolingual comparable comparison) • 10 nouns with frequency >200 and their collocates in a span ±4, fq4, MI4

Limits • Kenny (2001) • Habitual collocations not covered; method not scalable • Danielsson (2001) • Plagued by data sparseness • Only 2 units of meaning (of the ~12K identified in a large monolingual corpus) occur 5 times in an 800K-word parallel corpus • Dayrell (2007) • Main issue investigated is lexical repetitiveness at the collocational level • Selective focus: collocations of frequent words only • No cross-check with source texts • Uncontrolled variable makes results difficult to interpret

An alternative approach: research questions • Are translated texts more/less collocational than original texts in the same language? • i.e., are their collocation types overall more/less frequently attested and/or significant? • If so, is this a consequence of the translation process? • i.e., can we identify shifts that could account for the observed overall differences?

An alternative approach: corpus resources • Literary and specialised texts, English/Italian • Monolingual comparable corpora (MCC) • Originals in Language A and comparable translations into Language A • Parallel corpora • Originals in Language A and their translations into Language B, usually combined with reference corpora • Reference corpora of English (BNC) and Italian (Repubblica)

An alternative approach: corpus resources • Literary texts • 8 English STs → Italian TTs (samples) • 7 Italian STs → English TTs (samples) • ~150K words per component • Specialised texts • Open-source software documentation • 10 English STs → Italian TTs (full texts) • 6 Italian originals (full texts) (→ 1 English translation) • ~250K words per component

Fiction texts sampled

OSS texts sampled • S. Frampton, Linux administration made easy • L. Wirzenius, The Linux System Administrator’s Guide • M. Cooper, The Advanced Bash-Scripting Guide • G. Beekmans, Linux from scratch • G. Short, 3-button mouse HOWTO • D. Jarvis, 3D Graphics Modelling and Rendering mini HOWTO • J. Tranter, Linux Amateur Radio AX.25 HOWTO • E. Raymond, The DocBook Demystification HOWTO • P. Gortmaker, Linux Ethernet HOWTO • R. Russell, Linux IPCHAINS HOWTO • A. Madesani, IDE e SoundBlaster 32 creative – HOWTO • L. Pulici, Adaptec AVA 1505 mini-HOWTO • G. Paolone, LDR Linux Domande e Risposte • D. Medri, Linux facile • G. Giusti, Programmare in PHP • D. Giacomini, Appunti di informatica libera

Extracting collocations • Target sequences • Lexical collocations • Made of two words • Contiguous • POS-based extraction from study corpora • JN, NN, VN, V * N, N * * N (types) • Collection of token frequencies from reference corpora (BNC and Repubblica)
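POS-based extraction of the patterns listed on this slide could look roughly like the sketch below. It assumes a hypothetical one-letter tagset (J, N, V, plus others) and treats each "*" as matching any single token; the tagger, tagset and wildcard constraints actually used in the study are not given on the slide.

```python
# Pattern names follow the slide; None stands for the "*" wildcard slot.
PATTERNS = {
    "JN": ["J", "N"],
    "NN": ["N", "N"],
    "VN": ["V", "N"],
    "V * N": ["V", None, "N"],
    "N * * N": ["N", None, None, "N"],
}

def extract_pairs(tagged, pattern):
    """Return (first word, last word) for every window of `tagged`
    whose tags match `pattern`; None in the pattern matches any tag."""
    size = len(pattern)
    pairs = []
    for i in range(len(tagged) - size + 1):
        window = tagged[i:i + size]
        if all(want is None or want == tag
               for want, (_, tag) in zip(pattern, window)):
            pairs.append((window[0][0], window[-1][0]))
    return pairs

# Hypothetical tagged input: (word, tag) pairs
tagged = [("make", "V"), ("a", "D"), ("noise", "N"),
          ("about", "P"), ("climate", "N"), ("change", "N")]
for name, pattern in PATTERNS.items():
    print(name, extract_pairs(tagged, pattern))
```

The extracted pair types would then be looked up in the reference corpus to collect the token frequencies needed for the association scores.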

Extracting collocations • Calculate Mutual Information (MI) • Rank sequences • Take top • Arbitrary cut-off point: MI>2 and fq>1 • Calculate significance of difference between original and translated • Mann-Whitney significance tests
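The ranking and cut-off step can be sketched as a simple filter and sort. The function name and the tuple layout are assumptions made for this example; the first two rows reuse values shown elsewhere in these slides, and the last two are invented to show what the cut-off removes.

```python
def select_collocations(candidates, min_mi=2.0, min_fq=1):
    """Keep candidates with MI > min_mi and fq > min_fq, ranked by MI (highest first).
    Each candidate is a (collocation, mi, fq) tuple."""
    kept = [c for c in candidates if c[1] > min_mi and c[2] > min_fq]
    return sorted(kept, key=lambda c: c[1], reverse=True)

candidates = [
    ("pearl necklace", 5.3046, 14),   # from the fiction rankings slide
    ("barbed wire", 5.5479, 193),     # from the fiction rankings slide
    ("red car", 0.8, 40),             # invented: fails MI > 2
    ("odd pairing", 6.1, 1),          # invented: fails fq > 1
]
for row in select_collocations(candidates):
    print(row)
```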

Mutual Information • MI compares the probability of observing x and y together (the joint probability) with the probabilities of observing x and y independently (chance). If there is a genuine association between x and y, then the joint probability P(x,y) will be much larger than chance […]. (Church & Hanks 1990:77)
$$\mathrm{MI}(x;y) = \log_2 \frac{p(xy) \cdot N}{p(x) \cdot p(y)}$$
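A worked illustration of the formula on this slide, using hypothetical counts. It assumes that p(xy), p(x) and p(y) are read as raw frequencies in the reference corpus and N as the corpus size in tokens; this is an interpretation of the slide's notation, not something the slide states. With a pair seen 8 times, component words seen 64 and 128 times, and a corpus of 2^20 tokens:

$$\mathrm{MI}(x;y) = \log_2 \frac{8 \cdot 1\,048\,576}{64 \cdot 128} = \log_2 \frac{8\,388\,608}{8\,192} = \log_2 1024 = 10$$

In other words, under these assumed counts the pair occurs 1024 times more often than chance co-occurrence would predict.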

Mann-Whitney-Wilcoxon ranks test • Confidence with which we can reject the null hypothesis that two ranked sets of observations are taken from the same population • Non-parametric, i.e. makes no assumptions about observations being normally distributed • Used (and tested) by Kilgarriff (2001) in comparisons of the LOB and Brown corpora and of male and female speech in the BNC
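To make the test concrete, here is a minimal pure-Python sketch of the Mann-Whitney U statistic with a two-sided p-value from the normal approximation. It omits tie and continuity corrections and is only reasonable for fairly large samples; it is not the implementation used in the study, and the MI values in the usage example are invented.

```python
import math

def midranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Return (U_a, U_b, z, two-sided p) using the normal approximation."""
    n1, n2 = len(a), len(b)
    ranks = midranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:n1])
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    u_b = n1 * n2 - u_a
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_a - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))
    return u_a, u_b, z, p

# Invented MI scores for collocations in an original and a translated corpus
original = [2.1, 2.4, 3.0, 3.3, 4.1, 4.8]
translated = [2.9, 3.6, 4.4, 5.0, 5.2, 5.9]
print(mann_whitney_u(original, translated))
```

Because the test works on ranks rather than raw values, it suits MI scores, whose distribution is far from normal.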

Rankings (top 10) for JN (eng)

Original fiction corpus
MI      collocation             fq (BNC)
7,0621  Shredded Wheat          9
6,4372  open-toed sandals       5
5,9465  beta carotene           5
5,7365  Milky Way               80
5,5479  barbed wire             193
5,4172  floppy disks            63
5,3891  eternal damnation       14
5,3798  cursive script          18
5,3046  pearl necklace          14
5,2500  herbal teas             7

Translated fiction corpus
MI      collocation             fq (BNC)
6,2687  wall-to-wall carpeting  6
6,1698  vous plait              10
5,6773  pistachio nuts          10
5,3305  boric acid              5
5,2218  submachine gun          9
5,2170  Venetian blinds         16
5,2060  Neapolitan dialect      4
5,1170  nasal twang             2
5,0816  westering sun           4
5,0775  hard-boiled eggs        30

Results - Fiction

Results - OSS

Summing up • Translated fiction texts (Italian and English) tend to be (overall) richer in salient collocations than original texts in the same language • Italian (and English) open source software manuals however show the opposite trend…

Implications for descriptive TS • Norm/law-governed (rather than universal) trends (Toury 1995) • Law of interference • Stronger in OSS translation • Law of growing standardization • Stronger in fiction translation

Implications for applied TS • Parallel comparison (not discussed here) highlights strategies displayed by professional translators at the collocational level • Starting point for awareness-raising and revision exercises focusing on: • Normalization • Rise in formality • Explicitation

Methodological follow up • Crucial role played by reference corpora • What happens if we repeat the calculations with MI data from different reference corpora?

Adjective-Noun (Italian OSS texts) • Repubblica (fq>1 and MI>2) • itWaC (fq>10 and MI>1)

Noun – prep|conj - Noun (Italian fiction texts) • Repubblica (fq>1 and MI>2) • itWaC (fq>10 and MI>1)

Further work • Bottom-up search for regularities • Other genres? • Source-oriented approach • Starting from ST collocations • Collocation extraction and reference corpora • Evaluation of method • Search for creative exploitation of collocations • Can it be automatised?

Thank you Silvia Bernardini University of Bologna, Italy silvia.bernardini@unibo.it Aston Corpus symposium 23 May 2008
