
Catch the Link! Combining Clues for Word Alignment





Presentation Transcript


  1. Catch the Link! Combining Clues for Word Alignment Jörg Tiedemann Uppsala University joerg@stp.ling.uu.se

  2. Outline • Background • What do we want? • What do we have? • What do we need? • Clue Alignment • What is a clue? • How do we find clues? • How do we use clues? • What do we get?

  3. What do we want? [Slide diagram: a source text and its translations (translation 1, translation 2) form a parallel corpus; a sentence aligner turns it into an aligned corpus; a word aligner produces token links and type links. The whole pipeline should run automatically and language independently.]

  4. What do we have? • tokeniser (ca 99%) • POS tagger (ca 96%) • lemmatiser (ca 99%) • shallow parser (ca 92%), parser (> 80%) • sentence aligner (ca 96%) • word aligner • 75% precision • 45% recall

  5. What’s the problem with Word Alignment?
  (1) Alsop says, "I have a horror of the bad American practice of choosing up sides in other people's politics, ..."
  (2) Alsop förklarar: "Jag fasar för den amerikanska ovanan att välja sida i andra människors politik, ..."
  (Saul Bellow “To Jerusalem and back: a personal account”)
  (1) Neutralitetspolitiken stöds av ett starkt försvar till värn för vårt oberoende.
  (2) Our policy of neutrality is underpinned by a strong defence.
  (The Declarations of the Swedish Government, 1988)
  (1) Armén kommer att reformeras och effektiviseras.
  (2) The army will be reorganized with the aim of making it more effective.
  (The Declarations of the Swedish Government, 1988)
  • Word alignment challenges:
    • non-linear mapping
    • grammatical/lexical differences
    • translation gaps
    • translation extensions
    • idiomatic expressions
    • multi-word equivalences
  (1) I take the middle seat, which I dislike, but I am not really put out.
  (2) Jag tar mittplatsen, vilket jag inte tycker om, men det gör mig inte så mycket.
  (Saul Bellow “To Jerusalem and back: a personal account”)
  (1) Our Hasid is in his late twenties.
  (2) Vår chassid är bortåt de trettio.
  (Saul Bellow “To Jerusalem and back: a personal account”)

  6. So what? What are the real problems? • Word alignment • uses simple, fixed tokenisation • fails to identify appropriate translation units • ignores contextual dependencies • ignores relevant linguistic information • uses poor morphological analyses

  7. What do we need? • flexible tokenisation • possible multi-word units • linguistic tools for several languages • integration of linguistic knowledge • combination of knowledge resources • alignment in context

  8. Let’s go! • Clue Alignment! • finding clues • combining clues • aligning words

  9. Word Alignment Clues
  • The United Nations conference has started today .
  • Idag började FN-konferensen .
  [Slide figure: both sentences annotated with chunk brackets (English: NP VP ADVP; Swedish: ADVP VC NP) and POS tags (English: DT NNP NNP NN VBZ VBN RB; Swedish: RG0S V@IIAS NCUSN@DS); the pair conference / konferensen is highlighted as a clue.]

  10. Word Alignment Clues • Def.: A word alignment clue Ci(s,t) is a probability which indicates an association between two lexical items, s and t, from parallel texts. • Def.: A lexical item is a set of words with associated features attached to it.

  11. How do we find clues? (1)
  • Clues can be estimated from association scores: Ci(s,t) = wi · Ai(s,t)
  • co-occurrence:
    • Dice coefficient: A1(s,t) = Dice(s,t)
    • mutual information: A2(s,t) = I(s;t)
  • string similarity:
    • longest common sub-sequence ratio: A3(s,t) = LCSR(s,t)
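Both association scores are easy to sketch in code. The following is a minimal illustration, not the original implementation; the function names and the way co-occurrence counts are passed in are assumptions.

import math  # not strictly needed here, kept for symmetry with later sketches

def dice(cooc: int, freq_s: int, freq_t: int) -> float:
    """Dice coefficient: 2 * cooc(s, t) / (freq(s) + freq(t))."""
    return 2.0 * cooc / (freq_s + freq_t) if freq_s + freq_t else 0.0

def lcs_len(s: str, t: str) -> int:
    """Length of the longest common subsequence of s and t (dynamic programming)."""
    prev = [0] * (len(t) + 1)
    for cs in s:
        cur = [0]
        for j, ct in enumerate(t, 1):
            cur.append(prev[j - 1] + 1 if cs == ct else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcsr(s: str, t: str) -> float:
    """Longest common sub-sequence ratio: LCS length divided by the longer string's length."""
    return lcs_len(s, t) / max(len(s), len(t)) if s and t else 0.0

# lcsr("conference", "FN-konferensen") = 8/14 ≈ 0.57, the value reused on later slides.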

  12. How do we find clues? (2)
  • Clues can be estimated from training data: Ci(s,t) = wi · P(ft | fs) ≈ wi · freq(ft, fs) / freq(fs)
  • fs, ft are features of s and t, e.g.
    • part-of-speech sequences of s, t
    • phrase category (NP, VP etc.), syntactic function
    • word position
    • context features
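A minimal sketch of this relative-frequency estimate; `training_links`, `feat_s` and `feat_t` are assumed names for already-aligned training pairs and for feature extractors such as POS-tag sequences, not part of the original system.

from collections import Counter

def learn_feature_clue(training_links, feat_s, feat_t, weight=0.5):
    """Estimate Ci(s,t) = wi * freq(ft, fs) / freq(fs) from aligned training pairs."""
    pair_freq = Counter()   # counts of (fs, ft)
    src_freq = Counter()    # counts of fs
    for s, t in training_links:
        fs, ft = feat_s(s), feat_t(t)
        pair_freq[(fs, ft)] += 1
        src_freq[fs] += 1

    def clue(s, t):
        fs, ft = feat_s(s), feat_t(t)
        return weight * pair_freq[(fs, ft)] / src_freq[fs] if src_freq[fs] else 0.0

    return clue

# Hypothetical usage for a POS clue, with tokens represented as (word, tag) pairs:
# pos_clue = learn_feature_clue(aligned_pairs, feat_s=lambda w: w[1], feat_t=lambda w: w[1])
# pos_clue(("is", "VBZ"), ("är", "V@IPAS"))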

  13. How do we use clues? (1)
  • Clues are simply sets of association measures
  • The crucial point: we have to combine them!
  • If Ci(s,t) = P(ai), define the total clue as Call(s,t) = P(A) = P(a1 ∪ a2 ∪ ... ∪ an)
  • Clues are not mutually exclusive! P(a1 ∪ a2) = P(a1) + P(a2) − P(a1 ∩ a2)
  • Assume independence! P(a1 ∩ a2) = P(a1) · P(a2)
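Under the independence assumption the disjunction of n clues collapses to Call(s,t) = 1 − Π(1 − Ci(s,t)). A minimal sketch; the function name is illustrative.

def combine_clues(clue_values):
    """Combine clue probabilities as a disjunction of independent indications:
    P(a1 ∪ ... ∪ an) = 1 - (1 - P(a1)) * ... * (1 - P(an))."""
    total = 1.0
    for c in clue_values:
        total *= 1.0 - c
    return 1.0 - total

# Two clues for the same pair, e.g. 0.4 (co-occurrence) and 0.3 (string similarity):
# combine_clues([0.4, 0.3])  ->  0.58  (= 0.4 + 0.3 - 0.4 * 0.3)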

  14. How do we use clues? (2) • Clues can refer to any set of tokens from source and target language segments. • overlaps • inclusions • Def.: A clue shares its indication with all member tokens! • allow clue combinations at the level of single tokens
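In code, the sharing rule can be sketched as distributing each clue value to every source/target token pair covered by the clue's lexical items, then combining the contributions per pair with the rule from the previous slide. The data layout below is an assumption, not the original implementation.

import math
from collections import defaultdict

def token_clue_matrix(clues):
    """Build a token-level clue matrix from clues over (multi-)word items.

    `clues` is an iterable of (source_tokens, target_tokens, value); the token
    lists may cover more than one word, so overlaps and inclusions are allowed.
    Every member token pair inherits the clue value, and the contributions for
    a pair are combined as a disjunction of independent indications.
    """
    per_pair = defaultdict(list)
    for src_tokens, trg_tokens, value in clues:
        for s in src_tokens:
            for t in trg_tokens:
                per_pair[(s, t)].append(value)
    return {pair: 1.0 - math.prod(1.0 - v for v in values)
            for pair, values in per_pair.items()}

# With the clues on the next slide, ("Nations", "FN-konferensen") combines
# 0.4, 0.5 and 0.29 into 1 - 0.6 * 0.5 * 0.71 = 0.787.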

  15. Clue overlaps - an example
  • The United Nations conference has started today .
  • Idag började FN-konferensen .
  Clue 1 (co-occurrence):
    United Nations       FN-konferensen   0.4
    Nations conference   FN-konferensen   0.5
    United               FN-konferensen   0.3
  Clue 2 (string similarity):
    conference           FN-konferensen   0.57
    Nations              FN-konferensen   0.29
  Clue-all (combined):
    United               FN-konferensen   0.58
    Nations              FN-konferensen   0.787
    conference           FN-konferensen   0.785
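The combined values follow directly from the disjunction rule of slide 13: United combines 0.4 and 0.3 into 0.4 + 0.3 − 0.4·0.3 = 0.58, Nations combines 0.4, 0.5 and 0.29 into 1 − 0.6·0.5·0.71 = 0.787, and conference combines 0.5 and 0.57 into 1 − 0.5·0.43 = 0.785.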

  16. The Clue Matrix
  Clue 1 (co-occurrence):
    The United Nations   FN-konferensen   0.5
    United Nations       FN-konferensen   0.4
    has                  började          0.2
    started              började          0.6
    started today        idag             0.3
    Nations conference   började          0.4
  Clue 2 (string similarity):
    conference           FN-konferensen   0.57
    Nations              FN-konferensen   0.29
    today                idag             0.4
  Combined clue matrix (rows: English tokens, columns: Swedish tokens; empty cells are 0):
                   Idag     började   FN-konferensen
    The                               0.5
    United                            0.7
    Nations                 0.4       0.787
    conference              0.4       0.57
    has                     0.2
    started        0.3      0.72
    today          0.58

  17. Clue Alignment (1) • general principles: • combine all clues and fill the matrix • highest score = best link • allow overlapping links only • if there is no better link for both tokens • if tokens are next to each other • links which overlap at one point form a link cluster

  18. Clue Alignment (2)
  • the alignment procedure:
    1. find the best link
    2. remove the best link (set its value to 0)
    3. check for overlaps
       • accept: add to the set of link clusters
       • dismiss otherwise
    4. continue with 1 until no more links are found (or all values are below a certain threshold)
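A minimal sketch of this greedy loop; the data layout is assumed, and the adjacency and "no better link for both tokens" conditions of slide 17 are reduced to a simple cluster-merge test.

def clue_align(matrix, threshold=0.4):
    """Greedy clue alignment over a {(src_index, trg_index): value} matrix.

    Repeatedly take the highest-scoring cell, zero it out, and either open a
    new link cluster, extend an overlapping cluster, or dismiss the link.
    """
    scores = dict(matrix)               # work on a copy so the input stays intact
    clusters = []                       # each cluster: (set of src indices, set of trg indices)
    while scores:
        (s, t), best = max(scores.items(), key=lambda kv: kv[1])
        if best < threshold:
            break                       # all remaining values are below the threshold
        del scores[(s, t)]              # step 2: remove the best link
        overlapping = [c for c in clusters if s in c[0] or t in c[1]]
        if not overlapping:
            clusters.append(({s}, {t})) # accept as a new link cluster
        elif len(overlapping) == 1:
            overlapping[0][0].add(s)    # accept: extend the overlapping cluster
            overlapping[0][1].add(t)
        # a link that would bridge two clusters is dismissed in this sketch
    return clusters

# On the matrix of slide 16, with a threshold low enough to accept has/började,
# this grows the same clusters as the trace on the next slide.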

  19. Clue Alignment (3)
  Running the procedure on the clue matrix of slide 16:
  Best link: Nations FN-konferensen 0.787     Link clusters: Nations FN-konferensen
  Best link: started började 0.72             Link clusters: Nations FN-konferensen | started började
  Best link: United FN-konferensen 0.7        Link clusters: United Nations FN-konferensen | started började
  Best link: today idag 0.58                  Link clusters: United Nations FN-konferensen | started började | today idag
  Best link: conference FN-konferensen 0.57   Link clusters: United Nations conference FN-konferensen | started började | today idag
  Best link: The FN-konferensen 0.5           Link clusters: The United Nations conference FN-konferensen | started började | today idag
  Best link: has började 0.2                  Link clusters: The United Nations conference FN-konferensen | has started började | today idag

  20. Bootstrapping • again: clues can be estimated from training data • self-training: use available links as training data • goal: learn new clues for the next step • risk: increased noise (lower precision)

  21. Learning Clues • POS-clue: • assumption: word pairs with certain POS-tags are more likely to be translations of each other than other word pairs • features: POS-tag sequences • position clue: • assumption: translations are relatively close to each other (esp. in related languages) • features: relative word positions
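As an illustration of the position clue, here is a sketch that estimates the distribution of relative position offsets from existing links; the real system may normalise positions differently, so treat the details as assumptions.

from collections import Counter

def learn_position_clue(linked_positions, weight=0.5):
    """Estimate a position clue from training links.

    `linked_positions` is an iterable of (src_position, trg_position) pairs of
    tokens that are already linked; the clue value for a new pair is the
    weighted relative frequency of the observed offset trg_pos - src_pos.
    """
    offsets = Counter(t - s for s, t in linked_positions)
    total = sum(offsets.values())

    def clue(src_pos, trg_pos):
        return weight * offsets[trg_pos - src_pos] / total if total else 0.0

    return clue

# With links concentrated around offset 0 (compare the example distribution on
# slide 29), clue(i, i) is relatively high and clue(i, i + 5) is close to zero.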

  22. So much for the theory! Results?! • The setup: Corpus and basic tools: • Saul Bellow’s “To Jerusalem and back: a personal account ”, English/Swedish, about 170,000 words • English POS-tagger (Grok), trained on Brown, PTB • English shallow parser (Grok), trained on PTB • English stemmer, suffix truncation • Swedish POS-tagger (TnT), trained on SUC • Swedish CFG parser (Megyesi), rule-based • Swedish lemmatiser, database taken from SUC

  23. Results!?! … not yet
  • basic clues:
    • Dice coefficient (≥ 0.3)
    • LCSR (≥ 0.4), ≥ 3 characters/string
  • learned clues:
    • POS clue
    • position clue
  • clue alignment threshold = 0.4
  • uniform normalisation (0.5)

  24. Results!!! Come on! Preliminary results (… work in progress …) • Evaluation: 500 random samples have been linked manually (Gold standard) • Metrics: precision_PWA & recall_PWA (Ahrenberg et al., 2000)

  25. Give me more numbers! • The impact of parsing. • How much do we gain? • Alignment results with n-grams, (shallow) parsing, and both:

  26. One more thing. • Stemming, lemmatisation and all that … • Do we need morphological analyses for Swedish and English?

  27. Conclusions • Combining clues helps to find links • Linguistic knowledge helps • POS tags are valuable clues • word position gives hints for related languages • parsing helps with the segmentation problem • lemmatisation gives higher recall • We need more experiments, tests with other language pairs, more/other clues • recall & precision are still low

  28. POS clues - examples
  score               source        target
  ----------------------------------------------------------
  0.915479582146249   VBZ           V@IPAS
  0.91304347826087    WRB           RH0S
  0.761904761904762   VBP           V@IPAS
  0.701943844492441   RB            RG0S
  0.674033149171271   VBD           V@IIAS
  0.666666666666667   DT NNP NN     NCUSN@DS
  0.647058823529412   PRP VBZ       PF@USS@S V@IPAS
  0.625               NNS NNP       NP00N@0S
  0.611859838274933   VB            V@N0AS
  0.6                 RBR           RGCS
  0.5                 DT JJ JJ NN   DF@US@S AQP0SNDS NCUSN@DS

  29. Position clues - examples
  score                mapping
  ------------------------------------
  0.245022348638765    x -> 0
  0.12541095637398     x -> -1
  0.0896900742491966   x -> 1
  0.0767611096745595   x -> -2
  0.0560378264563555   x -> -3
  0.0514572790070555   x -> 2
  0.0395256916996047   x -> 6 7 8

  30. Open Questions • Normalisation! • How do we estimate the wi’s? • Non-contiguous phrases • Why not allow long distance clusters? • Independence assumption • What is the impact of dependencies? • Alignment clues • What is a bad clue, what is a good one? • Contextual clues

  31. Clue alignment - example
  Clue matrix for the sentence pair "amused , my wife asks why i ordered the kosher lunch ." /
  "min fru undrar road varför jag beställde en koscherlunch ." (columns: the Swedish tokens):
            min  fru  undrar  road  varför  jag  beställde  en  koscherlunch   .
  amused      0    0       0     0       0    0          0   0             0   0
  ,           0    0       0     0       0    0          0   0             0  48
  my         81   63       0     0       0    0          0   0             0   0
  wife       58   80       0     0       0    0          0   0             0   0
  asks        0    0      42     0       0    0          0   0             0   0
  why         0    0       0     0      74    0          0   0             0   0
  i           0    0       0     0       0    0          0   0             0   0
  ordered     0    0       0     0       0    0         36   0             0   0
  the         0    0       0     0       0    0          0  70            70   0
  kosher      0   34       0     0       0    0          0  53            86   0
  lunch       0   34       0     0       0    0          0  41            81   0
  .           0    0       0     0       0    0          0   0             0  76

  32. Alignment - examples
  the Middle East -> Mellersta Östern
  afford -> kosta på
  at least -> åtminstone
  an American satellite -> en satellit
  common sense -> sunda förnuftet
  Jerusalem area -> Jerusalemområdet
  kosher lunch -> koscherlunch
  leftist anti-Semitism -> vänsterantisemitism
  left-wing intellectuals -> vänsterintellektuella
  literary history -> litteraturhistoriska
  manuscript collection -> handskriftsamling
  Marine orchestra -> marinkårsorkester
  marionette theater -> marionetteatern
  mathematical colleagues -> matematikkolleger
  mental character -> mentalitet
  far too -> alldeles

  33. Alignment - examples
  a banquet -> en bankett
  a battlefield -> ett slagfält
  a day -> dagen
  the Arab states -> arabstaterna
  the Arab world -> arabvärlden
  the baggage carousel -> bagagekarusellen
  the Communist dictatorships -> kommunistdiktaturerna
  The Fatah terrorists -> Al Fatah-terroristerna
  the defense minister -> försvarsministern
  the defense minister -> försvarsminister
  the daughter -> dotter
  the first President -> förste president

  34. Alignment - examples
  American imperial interests -> amerikanska imperialistintressenas
  Chicago schools -> Chicagos skolor
  decidedly anti-Semitic -> avgjort antisemitiska
  his identity -> sin identitet
  his interest -> sitt intresse
  his interviewer -> hans intervjuare
  militant Islam -> militanta muhammedanismen
  no longer -> inte längre
  sophisticated arms -> avancerade vapen
  still clearly -> uppenbarligen ännu
  dozen Russian -> dussin ryska
  exceedingly intelligent -> utomordentligt intelligent
  few drinks -> några drinkar
  goyish democracy -> gojernas demokrati
  industrialized countries -> industrialiserade länderna
  has become -> har blivit

  35. Gold standard - MWUs
  link: Secretary of State -> Utrikesminister
  link type: regular
  unit type: multi -> single
  source text: Secretary of State Henry Kissinger has won the Middle Eastern struggle by drawing Egypt into the American camp.
  target text: Utrikesminister Henry Kissinger har vunnit slaget om Mellanöstern genom att dra in Egypten i det amerikanska lägret.

  36. Gold standard - fuzzy links
  link: unrelated -> inte tillhör hans släkt
  link type: fuzzy
  unit type: single -> multi
  source text: And though he is not permitted to sit beside women unrelated to him or to look at them or to communicate with them in any manner (all of which probably saves him a great deal of trouble), he seems a good-hearted young man and he is visibly enjoying himself.
  target text: Och fastän han inte får sitta bredvid kvinnor som inte tillhör hans släkt eller se på dem eller meddela sig med dem på något sätt (alltsammans saker som utan tvivel besparar honom en mängd bekymmer) verkar han vara en godhjärtad ung man, och han ser ut att trivas gott.

  37. Gold standard - null links
  link: do ->
  link type: null
  unit type: single -> null
  source text: "How is it that you do not know English?"
  target text: "Hur kommer det sig att ni inte talar engelska?"

  38. Gold standard - morphology
  link: the masses -> massorna
  link type: regular
  unit type: multi -> single
  source text: Arafat was unable to complete the classic guerrilla pattern and bring the masses into the struggle.
  target text: Arafat har inte kunnat fullborda det klassiska gerillamönstret och föra in massorna i kampen.

  39. Evaluation metrics
  • Csrc – number of overlapping source tokens in (partially) correct link proposals; Csrc = 0 for incorrect link proposals
  • Ctrg – number of overlapping target tokens in (partially) correct link proposals; Ctrg = 0 for incorrect link proposals
  • Ssrc – number of source tokens proposed by the system
  • Strg – number of target tokens proposed by the system
  • Gsrc – number of source tokens in the gold standard
  • Gtrg – number of target tokens in the gold standard
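One assumed way to turn these counts into scores (the exact PWA definitions are those of Ahrenberg et al., 2000) is token-overlap precision and recall: precision ≈ (Csrc + Ctrg) / (Ssrc + Strg) and recall ≈ (Csrc + Ctrg) / (Gsrc + Gtrg). On this reading, precision rewards proposals whose tokens overlap gold-standard links, and recall rewards covering as many gold-standard tokens as possible.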

  40. Evaluation metrics - example

  41. Corpus markup (Swedish)
  <s lang="sv" id="9">
    <c id="c-1" type="NP">
      <w span="0:3" pos="PF@NS0@S" id="w9-1" stem="det">Det</w>
    </c>
    <c id="c-2" type="VC">
      <w span="4:2" pos="V@IPAS" id="w9-2" stem="vara">är</w>
    </c>
    <c id="c-3">
      <w span="7:3" pos="CCS" id="w9-3" stem="som">som</w>
    </c>
    <c id="c-4" type="NPMAX">
      <c id="c-5" type="NP">
        <w span="11:3" pos="DI@NS@S" id="w9-4" stem="en">ett</w>
        <w span="15:5" pos="NCNSN@IS" id="w9-5">besök</w>
      </c>
      <c id="c-6" type="PP">
        <c id="c-7">
          <w span="21:1" pos="SPS" id="w9-6" stem="i">i</w>
        </c>
        <c id="c-8" type="NP">
          <w span="23:9" pos="NCUSN@DS" id="w9-7" stem="barndom">barndomen</w>
        </c>
      </c>
    </c>
  </s>

  42. Corpus markup (English)
  <s lang="en" id="9">
    <chunk type="NP" id="c-1">
      <w span="0:2" pos="PRP" id="w9-1">It</w>
    </chunk>
    <chunk type="VP" id="c-2">
      <w span="3:2" pos="VBZ" id="w9-2" stem="be">is</w>
    </chunk>
    <chunk type="NP" id="c-3">
      <w span="6:2" pos="PRP$" id="w9-3">my</w>
      <w span="9:9" pos="NN" id="w9-4">childhood</w>
    </chunk>
    <chunk type="VP" id="c-4">
      <w span="19:9" pos="VBD" id="w9-5">revisited</w>
    </chunk>
    <chunk id="c-5">
      <w span="28:1" pos="." id="w9-6">.</w>
    </chunk>
  </s>
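To show how this markup can be consumed, here is a small sketch (not the project's own tooling) that pulls the word elements with their POS tags and stems out of a sentence using a standard XML parser:

import xml.etree.ElementTree as ET

def read_tokens(sentence_xml):
    """Yield (id, word, pos, stem) for every <w> element in a sentence,
    regardless of how deeply the chunk elements (<c> or <chunk>) are nested."""
    sentence = ET.fromstring(sentence_xml)
    for w in sentence.iter("w"):
        yield w.get("id"), w.text, w.get("pos"), w.get("stem")

# For the English sentence above this yields
# ('w9-1', 'It', 'PRP', None), ('w9-2', 'is', 'VBZ', 'be'), ...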

  43. … is that all? • How good are the new clues? • Alignment results with learned clues only: (neither LCSR nor Dice)
