Comparable corpora and its application

Comparable corpora and its application • Guide: Dr. Pushpak Bhattacharya • Presented by: Srijit Dutt (10305056), Janardhan Singh (10305067), Ashutosh Nirala (10305906), Brijesh Bhatt (10405301)

Outline • Motivation • Comparable Corpora (Non-parallel Corpora) • Basic Architecture • Geometrical view • Improvements

Motivation • A corpus is the holy grail in NLP • Bilingual dictionary generation • Parallel corpora • One-to-one correspondence in content • Parallel corpora are rare • Resource-constrained language pairs (e.g., Punjabi-Spanish) • Monolingual corpora are readily available • World Wide Web (non-parallel corpus) • Techniques are needed to work on non-parallel corpora

Non-parallel corpora • Characteristics • No parallel sentences • No parallel paragraphs • Fewer overlapping terms and words • Four dimensions • Author • Domain • Topic • Time • Source: Finding terminology translations from non parallel corpora, Fung et al., 1997

Comparable Corpora OneIndia.in

Comparable Corpora Navbharat Times

Postulates for non-parallel corpora • Basic postulate (Fung et al. 1997) • [Figure: terms A, B, C, D, E in text T and their counterparts A', B', C', D', E' in text T'] • If a domain-specific term A is related to another term B in some text T, then its counterpart A' is related to B' in some other text T' • Source: Finding terminology translations from non parallel corpora, Fung et al., 1997

Using non-parallel corpora • Basic postulate (Fung et al. 1997) • [Figure: terms A, B, C, D, E in text T and their counterparts A', B', C', D', E' in text T'] • If A is less associated with E, then A' is less associated with E' • Source: Finding terminology translations from non parallel corpora, Fung et al., 1997

Using non-parallel corpora • Basic postulate (Fung et al. 1997) • [Figure: terms A, B, C, D, E in text T and their counterparts A', B', C', D', E' in text T'] • Given a large set of words, a word is associated with only some of them. • Source: Finding terminology translations from non parallel corpora, Fung et al., 1997

Using non-parallel corpora • Basic postulate (Fung et al. 1997) • [Figure: terms A, B, C, D, E in text T and their counterparts A', B', C', D', E' in text T'] • If A is closely associated with words B and C in varying degrees, then A' is closely associated with B' and C' in the same varying degrees. • Source: Finding terminology translations from non parallel corpora, Fung et al., 1997

Histogram (Debenture) • [Figure: frequency plotted against seed words, one panel each for Corpus1 and Corpus2]

Histogram (Administration) • [Figure: frequency plotted against seed words, one panel each for Corpus1 and Corpus2]

Co-occurrence Relation • Known seed words of both languages (from an online dictionary) • Example: किताब = book

Co-occurrence matrix • Base lexicon / dictionary (column order): Knowledge = ज्ञान, Tree = पेड़, Library = पुस्तकालय • Target-language matrix (rows are words in the corpus), row "Book": 1 … 0 … 1 • Source-language word किताब, co-occurrence vector: 1 0 1
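
The matching step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the target words "forest" and "school", their counts, and the helper name `rank_translations` are hypothetical, and cosine similarity is assumed as the comparison measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Seed lexicon: position i of every vector refers to the same concept
# in both languages (knowledge, tree, library).
seed_pairs = [("ज्ञान", "knowledge"), ("पेड़", "tree"), ("पुस्तकालय", "library")]

# Co-occurrence vectors over the seed words. The किताब and "book" rows follow
# the slide; the other rows are made-up counts for illustration.
source_vectors = {"किताब": [1, 0, 1]}
target_vectors = {
    "book": [1, 0, 1],
    "forest": [0, 1, 0],
    "school": [1, 0, 0],
}

def rank_translations(source_word):
    """Rank target words by similarity of their seed-word vectors."""
    src = source_vectors[source_word]
    scored = [(cosine(src, vec), word) for word, vec in target_vectors.items()]
    return [word for _, word in sorted(scored, reverse=True)]

print(rank_translations("किताब"))  # "book" ranks first
```

Because both vectors are indexed by the same seed lexicon, words from different languages become directly comparable even though the corpora share no sentences.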

Improvements on Basic Architecture • Co-occurrence counts • Similarity measure • Window size • Is it the same for all words? • Dictionary • Polysemous and synonymous words • What if a dictionary is not available?

Context vector • [Figure: co-occurrence counts for word A over a token sequence of A, B, and X, with window size 3] • B occurs in the dictionary • X is any word • Source: Automatic Identification of Word Translations, Rapp, 1999
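
A windowed co-occurrence count of this kind might look as follows. This is a sketch under assumptions: the token sequence is a made-up example, and `context_vector` is a hypothetical helper name.

```python
from collections import Counter

def context_vector(tokens, target, seed_words, window=3):
    """Count seed words that occur within `window` tokens of each `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            # Skip the target position itself; count only dictionary words.
            if j != i and tokens[j] in seed_words:
                counts[tokens[j]] += 1
    return counts

# A is the word of interest, B is in the dictionary, X is any other word.
tokens = ["A", "X", "B", "X", "X", "X", "X", "B"]
print(context_vector(tokens, "A", {"B"}, window=3))
```

With a window of 3 only the first B is counted; widening the window to 7 would also pick up the second one, which is why the choice of window size matters.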

Co-occurrence Counts • Mutual Information (Church et al 1989) • Conditional Probability (Fung et al 1996) • Chi-Square Test (Dunning et al 1993) • Log-likelihood Ratio (Rapp 1998) • TF-IDF (Fung et al 1997)

Conditional Probability • k11 = co-occurrence frequency of word ws and word wt • k12 = corpus frequency of word ws - k11 • k21 = corpus frequency of word wt - k11 • k22 = size of corpus (no. of tokens) - corpus frequency of ws - corpus frequency of wt • Marginal and joint probabilities are estimated from these counts • Source: Finding terminology translations from non parallel corpora, Fung et al., 1997
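
The counts above can be turned into probabilities as in the sketch below. The slide's own formulas are not preserved, so the maximum-likelihood estimates used here (count divided by corpus size) are an assumption, and the function names are hypothetical.

```python
def contingency(freq_s, freq_t, cooc, corpus_size):
    """Build k11, k12, k21, k22 exactly as defined on the slide."""
    k11 = cooc
    k12 = freq_s - k11
    k21 = freq_t - k11
    k22 = corpus_size - freq_s - freq_t  # slide's definition
    return k11, k12, k21, k22

def probabilities(freq_s, freq_t, cooc, corpus_size):
    """Marginal, joint, and conditional estimates (assumed MLE form)."""
    k11, k12, k21, _ = contingency(freq_s, freq_t, cooc, corpus_size)
    p_s = (k11 + k12) / corpus_size      # P(ws)
    p_t = (k11 + k21) / corpus_size      # P(wt)
    p_joint = k11 / corpus_size          # P(ws, wt)
    p_t_given_s = k11 / (k11 + k12) if (k11 + k12) else 0.0  # P(wt | ws)
    return {"p_s": p_s, "p_t": p_t, "p_joint": p_joint, "p_t_given_s": p_t_given_s}

print(probabilities(freq_s=20, freq_t=10, cooc=5, corpus_size=1000))
```

For these made-up counts, ws occurs 20 times and co-occurs with wt 5 times, so the conditional estimate P(wt | ws) is 5/20.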

Co-occurrence Counts • Mutual information • TF-IDF • Source: Finding terminology translations from non parallel corpora, Fung et al., 1997
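
As an illustration of the first weighting, pointwise mutual information can be computed from raw counts. The slide's formula is not preserved, so the standard definition log2(P(x, y) / (P(x) P(y))) is assumed here, and the counts are made up.

```python
import math

def pmi(cooc, freq_x, freq_y, corpus_size):
    """Pointwise mutual information (base 2) from raw counts."""
    if cooc == 0:
        return float("-inf")  # the words never co-occur
    p_xy = cooc / corpus_size
    p_x = freq_x / corpus_size
    p_y = freq_y / corpus_size
    return math.log2(p_xy / (p_x * p_y))

# Words that co-occur 25 times more often than chance would predict.
print(pmi(cooc=5, freq_x=20, freq_y=10, corpus_size=1000))
```

A value near zero means the two words co-occur about as often as independence predicts; large positive values indicate a strong association.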

Co-occurrence Counts • Log likelihood, computed from: • k11 = co-occurrence frequency of word ws and word wt • k12 = corpus frequency of word ws - k11 • k21 = corpus frequency of word wt - k11 • k22 = size of corpus (no. of tokens) - corpus frequency of ws - corpus frequency of wt • Source: Automatic Identification of Word Translations, Rapp, 1999
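
The formula itself is not preserved in the transcript. The sketch below assumes the standard log-likelihood ratio (G-squared) statistic for a 2x2 contingency table, which may differ in constant factors from the form used on the original slide.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """G-squared statistic for a 2x2 table (assumed standard form)."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # a zero cell contributes nothing (limit of k * log k)
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total

print(log_likelihood_ratio(10, 10, 10, 10))  # independent words
print(log_likelihood_ratio(10, 0, 0, 10))    # perfectly associated words
```

The score is zero when the two words are independent and grows with the strength of the association, and it behaves better on rare words than raw co-occurrence counts.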

Similarity Measures • Cosine similarity • Jaccard similarity • Euclidean (L2) • Manhattan (L1, city-block)
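
The four measures can be written directly over count vectors, as in this minimal sketch. The weighted (min/max) form of Jaccard is an assumption, chosen because the vectors hold counts rather than sets.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def jaccard_similarity(u, v):
    # Weighted Jaccard: sum of element-wise minima over sum of maxima.
    top = sum(min(a, b) for a, b in zip(u, v))
    bottom = sum(max(a, b) for a, b in zip(u, v))
    return top / bottom if bottom else 0.0

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

u, v = [1, 0, 1], [1, 1, 0]
print(cosine_similarity(u, v), jaccard_similarity(u, v))
print(euclidean_distance(u, v), manhattan_distance(u, v))
```

Cosine and Jaccard are similarities (higher means closer), while Euclidean and Manhattan are distances (lower means closer), so candidate rankings must be sorted in opposite directions.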

Window Size • What is the ideal context size? • Should the same window size be used for all words? • "amount": more frequent • "debenture": less frequent

Dependency Tree • Source: Improving Translation Lexicon Induction from Monolingual Corpora via dependency context and POS equivalence, N. Garera et al., 2009

Modeling context using dependency tree • The four position vectors are mapped as follows: • -1: immediate parent • +1: immediate child • -2: grandparent • +2: grandchild • Source: Improving Translation Lexicon Induction from Monolingual Corpora via dependency context and POS equivalence, N. Garera et al., 2009
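
The position mapping can be illustrated on a toy dependency tree. This is a sketch, not the method of Garera et al.: the tree is hand-built for the sentence "the big dog chased a cat", the helper name is hypothetical, and words are assumed to be unique within the sentence.

```python
def dependency_context(parents, node):
    """Return the words at tree positions -1, +1, -2, +2 relative to `node`."""
    children = {}
    for child, parent in parents.items():
        if parent is not None:
            children.setdefault(parent, []).append(child)
    parent = parents.get(node)
    grandparent = parents.get(parent) if parent is not None else None
    kids = children.get(node, [])
    grandkids = [g for k in kids for g in children.get(k, [])]
    return {
        -1: [parent] if parent is not None else [],            # immediate parent
        +1: kids,                                              # immediate children
        -2: [grandparent] if grandparent is not None else [],  # grandparent
        +2: grandkids,                                         # grandchildren
    }

# child -> parent map; the root has parent None.
parents = {"chased": None, "dog": "chased", "cat": "chased",
           "the": "dog", "big": "dog", "a": "cat"}
print(dependency_context(parents, "dog"))
```

Here "chased" is the parent of "dog" regardless of how many adjectives separate them in the surface string, which is what makes tree positions more stable than adjacency windows.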

Context vector vs. dependency parsing • Source: Improving Translation Lexicon Induction from Monolingual Corpora via dependency context and POS equivalence, N. Garera et al., 2009

Dependency Tree • Context is captured better by dependency information than by adjacent words • Long-distance dependencies capture associated words • Handles languages with different word orders through parent and child relationships • Higher accuracy • Source: Improving Translation Lexicon Induction from Monolingual Corpora via dependency context and POS equivalence, N. Garera et al., 2009

Dictionary as seed word list (issues) • Multiple translations • Polysemous words • Words in one text may not be present in the other • A word may not appear in its dictionary form • Source: Finding terminology translations from non parallel corpora, Fung et al., 1997

Geometrical View (translation) • Source: A Geometric View on Bilingual Lexicon Extraction from Comparable Corpora, E. Gaussier, J. M. Renders, I. Matveeva, C. Goutte, H. Dejean, 2004

Geometric View (Extended Approach) (translation) • Source: A Geometric View on Bilingual Lexicon Extraction from Comparable Corpora, E. Gaussier, J. M. Renders, I. Matveeva, C. Goutte, H. Dejean, 2004

Translation without dictionary • What if a dictionary is not available? • Find a language for which dictionaries are available. • Use that language as an intermediate language between the source and target languages.

Use of pivot language • Problem: no bilingual lexicon is available for the language pair • Use a pivot language for which a bilingual lexicon is available. • [Figure: X (source language) → Y (pivot language) → Z (target language)] • What if Y is polysemous? • Source: Bilingual Lexicon Generation Using Non-Aligned Signatures, Shezaf et al., 2010

Use of pivot language • Source: Hindi; pivot: English • X = प्रकाश, Y = light • [Figure: X (source language) → Y (pivot language) → Z (target language)] • Lexicons are intransitive, which results in noisy translations. • Reference: Bilingual Lexicon Generation Using Non-Aligned Signatures, Shezaf et al., 2010
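
The noise introduced by a polysemous pivot word can be seen by composing two small dictionaries. This is a sketch under assumptions: the slide does not name a target language, so Spanish is chosen here purely for illustration, and both dictionaries are hypothetical two-entry toys.

```python
def pivot_translate(word, source_to_pivot, pivot_to_target):
    """Compose two lexicons through the pivot language, keeping order."""
    candidates = []
    for pivot_word in source_to_pivot.get(word, []):
        for target_word in pivot_to_target.get(pivot_word, []):
            if target_word not in candidates:
                candidates.append(target_word)
    return candidates

hindi_to_english = {"प्रकाश": ["light"]}
# "light" is polysemous: illumination (luz) versus not heavy (ligero).
english_to_spanish = {"light": ["luz", "ligero"]}

print(pivot_translate("प्रकाश", hindi_to_english, english_to_spanish))
```

Both senses of "light" survive the composition, so the wrong candidate "ligero" appears next to the correct "luz". This is the intransitivity that the next slide addresses with corpus evidence.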

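Corpus to handle intransitivity • C1: source corpus; C2: target corpus • The pivot maps X to the candidate set {Z1, Z2} • S(X) = signature of X, computed from C1 • Z1, Z2: target signatures, computed from C2 • NAS(s, t) scores each candidate; Z = the candidate with the winning signature • Source: Bilingual Lexicon Generation Using Non-Aligned Signatures, Shezaf et al., 2010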
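
One way to read the selection step is sketched below. It assumes that NAS(s, t) counts the words in the signature of s that have at least one lexicon translation inside the signature of t; the signatures, the Spanish candidates, and the small lexicon are all hypothetical.

```python
def nas(source_signature, target_signature, lexicon):
    """Count source-signature words with a translation in the target signature."""
    target_set = set(target_signature)
    return sum(1 for w in source_signature
               if target_set & set(lexicon.get(w, ())))

def pick_translation(source_signature, candidate_signatures, lexicon):
    """Return the candidate whose signature scores highest against the source."""
    return max(candidate_signatures,
               key=lambda c: nas(source_signature, candidate_signatures[c], lexicon))

# Hypothetical signature of प्रकाश from the source corpus: sun, lamp, darkness.
signature_x = ["सूर्य", "दीपक", "अंधेरा"]
# Hypothetical (possibly noisy) lexicon obtained through the pivot language.
lexicon = {"सूर्य": ["sol"], "दीपक": ["lámpara"], "अंधेरा": ["oscuridad"]}
# Hypothetical signatures of the two candidates from the target corpus.
candidates = {"luz": ["sol", "lámpara", "sombra"], "ligero": ["peso", "pluma"]}

print(pick_translation(signature_x, candidates, lexicon))
```

The signatures do not need to be aligned position by position; only the overlap after translation is counted, so a noisy lexicon can still separate the two senses.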

Limitation of Context-based Approach • Relies on the lexical context around translation candidates. • Words may appear in similar contexts without being translations of each other, which leads to false translations. • E.g., using a Chinese-English comparable corpus (with the definition of Fung 1995): • The distance between vectors 1 and 2 is 0.084, which is greater than the distance between vectors 1 and 3 (0.075) • Uses no rich syntactic information beyond bag-of-words. • Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009

Dependency Heterogeneity • The dependency heterogeneity phenomenon: a word in the source language shares similar heads and modifiers with its translation in the target language, whether or not they occur in similar contexts. • Uses rich syntactic information. • E.g.: • big (MOD) brown (MOD) dog (HEAD) • bird (MOD) song (HEAD) • song (MOD) bird (HEAD) • Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009

Does it work? • [Table: frequently used modifiers and frequently used heads] • Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009

Comparable Corpora Preprocessing • Focus is on Chinese-English bilingual dictionary extraction for single nouns • Raw corpora: Chinese and English pages from Wikipedia with inter-language links • Pipeline: morphological analyzer, POS tagger, MaltParser (to get syntactic dependencies), then refinement to get the preprocessed corpora • Refinements: • Stemming of translation candidates • Removal of stop words • Removal of sentences with more than k (= 30) words • Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009
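
Two of the refinements, stop-word removal and the sentence-length cutoff, can be sketched as follows. The stop-word list is a hypothetical English sample, whitespace tokenization is a simplification that would not work for Chinese, and stemming is left out because it depends on an external stemmer.

```python
STOP_WORDS = {"the", "a", "an", "of", "is", "in"}  # hypothetical sample list

def preprocess(sentences, max_len=30, stop_words=STOP_WORDS):
    """Drop sentences longer than `max_len` tokens, then remove stop words."""
    kept = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        if len(tokens) > max_len:
            continue  # overly long sentence: discard it
        kept.append([t for t in tokens if t not in stop_words])
    return kept

sentences = ["The study of economics is broad", "word " * 31]
print(preprocess(sentences))
```

The length check runs before stop-word removal, so the cutoff k applies to the original sentence length.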

Dependency Heterogeneity Vector Calculation • NMOD (noun modifier), SUB (subject), and OBJ (object) are the dependency labels produced by MaltParser. • No bilingual dictionary is needed • Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009

Bilingual Dictionary Extraction (contd) • With this method, the dependency heterogeneity distance (DH) gives: • DH(经济学, economics) = 0.222 • DH(经济学, medicine) = 0.496 • The smaller distance selects "economics" as the translation of 经济学. • Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009
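
The exact formulas are not preserved in the transcript. The sketch below assumes that each vector component is the ratio of distinct heads (or modifiers) to the total number observed under a dependency label, and that DH is the Euclidean distance between two such vectors; the component names and the observations are hypothetical.

```python
import math

# Assumed components: heads under NMOD, SUB, OBJ, and modifiers under NMOD.
COMPONENTS = ("NMOD_head", "SUB_head", "OBJ_head", "NMOD_mod")

def heterogeneity_vector(observations):
    """Map each component to (distinct words) / (total words), or 0 if empty."""
    vec = []
    for name in COMPONENTS:
        words = observations.get(name, [])
        vec.append(len(set(words)) / len(words) if words else 0.0)
    return vec

def dh_distance(obs_a, obs_b):
    """Euclidean distance between two heterogeneity vectors (assumed form)."""
    va, vb = heterogeneity_vector(obs_a), heterogeneity_vector(obs_b)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(va, vb)))

# Made-up head and modifier observations for three words.
word_a = {"NMOD_head": ["theory", "theory", "policy", "growth"],
          "NMOD_mod": ["modern", "modern"]}
word_b = {"NMOD_head": ["x", "y", "y", "z"], "NMOD_mod": ["m", "m"]}
word_c = {"NMOD_head": ["a", "a", "a", "a"], "NMOD_mod": ["m", "n"]}

print(dh_distance(word_a, word_b), dh_distance(word_a, word_c))
```

Because the vector records only how varied a word's heads and modifiers are, not which words they are, the two languages never need to be linked through a dictionary.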

Results of Bilingual Dictionary Extraction • Performed on 250 Chinese/English single-noun pairs • only-mod: (H_NMODMod) • only-head: (H_NMODHead, H_SUBHead, H_OBJHead) • only-NMOD: (H_NMODHead, H_NMODMod) • Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009

Result

Conclusion • Use of non-parallel corpora is inevitable and reduces the effort of developing parallel corpora. • Modern techniques achieve accuracy of up to 70% with non-parallel corpora. • Polysemy and sense disambiguation remain major challenges. • Comparing different implementations is difficult because the languages and corpora differ in nature.

References • Fung, P.; McKeown, K. (1997). Finding terminology translations from non-parallel corpora. Proceedings of the 5th Annual Workshop on Very Large Corpora, Hong Kong, 192-202. • Fung, P.; Yee, L. Y. (1998). An IR approach for translating new words from nonparallel, comparable texts. In: Proceedings of COLING-ACL 1998, Montreal, Vol. 1, 414-420. • Rapp, R. (1999). Automatic Identification of Word Translations from Unrelated English and German Corpora. In: Proceedings of ACL-99, College Park, USA, pp. 1–17.

References • Gaussier, E.; Renders, J.-M.; Matveeva, I.; Goutte, C.; Dejean, H. (2004). A geometric view on bilingual lexicon extraction from comparable corpora. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, 527–534. • Robitaille, X.; Sasaki, Y.; Tonoike, M.; Sato, S.; Utsuro, T. (2006). Compiling French Japanese Terminologies from the Web. Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics. • Morin, E.; Daille, B.; Takeuchi, K.; Kageura, K. (2007). Bilingual Terminology Mining – Using Brain, not Brawn Comparable Corpora. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, 664-671.

References • Garera, N.; Callison-Burch, C.; Yarowsky, D. (2009). Improving translation lexicon induction from monolingual corpora via dependency contexts and part-of-speech equivalences. In: Proceedings of the Conference on Natural Language Learning (CoNLL), Boulder, Colorado. • Yu, K.; Tsujii, J. (2009). Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity. Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL-HLT 2009). • Shezaf, D.; Rappoport, A. (2010). Bilingual lexicon generation using non-aligned signatures. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 11-16 July 2010, 98–107.

THANK YOU Questions ?
