Using bilingual LSA for FN annotation of French text from generic resources

  1. Using bilingual LSA for FN annotation of French text from generic resources • Guillaume Pitel - LORIA/LED • FR.FrameNet Project • Funded by the France-Berkeley Fund

  2. Outline • The (small) FR.FrameNet project • The projection problem • Realizations • French Frames database • Annotated reference sub-corpus • English semantic clusters from FEs • Projection into French • Other potential applications Guillaume Pitel - LORIA - Nancy

  3. The (small) FR.FrameNet project • A Berkeley-Nancy collaboration funded by the France-Berkeley Fund: ICSI, ATILF, LORIA • French participants: Susanne Alt, Benoît Crabbé, Christiane Jadelot, Guillaume Pitel, Laurent Romary • Setting the foundations for cheap bootstrapping of a French FrameNet • Reusing existing French lexical semantic resources • Reusing any available resources • Focus on automatic methods Guillaume Pitel - LORIA - Nancy

  4. The projection problem • Use a semantic lexicon in language A to annotate a corpus in language B • Resulting data is expected to be of much lower quality than a handcrafted lexicon • It is a bootstrapping process: it requires manual correction • Important question: does it really speed up the final production? Guillaume Pitel - LORIA - Nancy

  5. Padó & Lapata approach • Uses a Source-language/Target-language parallel corpus • The Source side of the corpus must be FN-annotated • The roles are projected onto the Target corpus • Train a statistical semantic role parser for the Target language • Automatic annotation of any corpus in the Target language Guillaume Pitel - LORIA - Nancy

  6. Padó & Lapata approach • Problems • translation is not frame-preserving in many cases (20-30%) • parallel corpora are a rare resource • Berkeley's FrameNet is not built on the English side of a parallel corpus :( • But very useful with a resource like Europarl Guillaume Pitel - LORIA - Nancy

  7. The main bottleneck • Existence of parallel AND annotated corpora: rare and expensive to build • But… • Annotated corpora are available • Parallel, aligned corpora are available Guillaume Pitel - LORIA - Nancy

  8. The Semantic Space based approach (using LSA) • Pure semantic annotation • no grammatical function • no POS • Use a bilingual LSA space to make the projection • Preparation: • Find the lexical units in the Target language that fit each frame • Use an available resource • Compute them automatically • Compute the semantic clusters of each frame element Guillaume Pitel - LORIA - Nancy

  9. The Semantic Space based approach (using LSA) • Usage: automatic pre-annotation (or selection) • For each sentence in the Target corpus • Find potential frames from LUs • Compare each word (or head of constituent) of the sentence with the computed semantic clusters of the (core) roles of the candidate frames (or the corresponding roles in parent frames if training data is missing) • Candidate frames and FEs are rated by semantic distance (a sketch of this loop follows below) • What we can expect • Can't deal with anaphora • Can't deal with FEs that are not semantically narrow Guillaume Pitel - LORIA - Nancy
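
A minimal Python sketch of this pre-annotation loop. It assumes three precomputed resources whose names are purely illustrative: lu_to_frames maps a French lemma to the frames it may evoke, fe_centroids maps each frame's core FEs to cluster centers in the bilingual LSA space, and lsa_vectors maps lemmas to their space positions.

    import numpy as np

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    def preannotate(sentence_lemmas, lu_to_frames, fe_centroids, lsa_vectors):
        """Rate candidate frames and FEs for one target-language sentence."""
        candidates = []
        for lemma in sentence_lemmas:
            for frame in lu_to_frames.get(lemma, ()):   # frames this LU may evoke
                fe_scores = {}
                for fe, centroid in fe_centroids[frame].items():
                    # Best-matching sentence word for this core role.
                    sims = [cosine(lsa_vectors[w], centroid)
                            for w in sentence_lemmas
                            if w != lemma and w in lsa_vectors]
                    if sims:
                        fe_scores[fe] = max(sims)
                if fe_scores:
                    rating = sum(fe_scores.values()) / len(fe_scores)
                    candidates.append((frame, lemma, rating, fe_scores))
        # Highest-rated frame readings first.
        return sorted(candidates, key=lambda c: -c[2])

Heads of constituents would replace raw words once a parser is available; the fallback to parent-frame roles when training data is missing is left out for brevity.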

  10. Subprojects • Convert frames to French • Using the ISC Semantic Atlas (built from 2 synonym dictionaries + a minimal FR//EN corpus) • Annotation of a reference subcorpus • 1000 sentences from Europarl • Projection using LSA Guillaume Pitel - LORIA - Nancy

  11. Convert Frames to French Guillaume Pitel - LORIA - Nancy

  12. English LUs to French LUs • For each frame in Berkeley FrameNet • For each LU, find potential translations in French, using Semantic ATLAS (Ploux & Ji, 2003) - other languages? • Compute the French "profile" of the frame (see the sketch below) • Manually check that a lemma can actually evoke the frame (a purely subjective judgment) • Frame-by-frame procedure • Must be validated later by corpus evidence Guillaume Pitel - LORIA - Nancy
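
A small sketch of the profile computation. The translations dict below is a plain stand-in for a Semantic Atlas lookup, and the toy entries are taken from the "Filling" translations shown two slides later.

    from collections import Counter

    def french_profile(frame_lus, translations):
        """Rank candidate French lemmas for a frame by how many of its
        English LUs propose them (the frame's French "profile")."""
        profile = Counter()
        for lu in frame_lus:
            for fr_lemma in translations.get(lu, ()):
                profile[fr_lemma] += 1
        return profile.most_common()

    # Toy usage with two LUs of the Filling frame:
    translations = {
        "cover.v": ["couvrir", "recouvrir", "revêtir", "tapisser"],
        "coat.v":  ["enduire", "enrober", "revêtir"],
    }
    print(french_profile(["cover.v", "coat.v"], translations))
    # "revêtir" is proposed twice, making it the strongest candidate.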

  13. Lexical units in “Filling” Frame • adorn.v, anoint.v, asphalt.v, brush.v, butter.v, coat.v, cover.v, cram.v, crowd.v, dab.v, daub.v, douse.v, drape.v, drizzle.v, dust.v, embellish.v, fill.v, flood.v, gild.v, glaze.v, hang.v, heap.v, inject.v, jam.v, load.v, pack.v, paint.v, panel.v, pave.v, pile.v, plant.v, plaster.v, pump.v, scatter.v, seed.v, shower.v, smear.v, sow.v, spatter.v, splash.v, splatter.v, spray.v, spread.v, sprinkle.v, squirt.v, strew.v, stuff.v, suffuse.v, surface.v, tile.v, varnish.v, wallpaper.v, wrap.v Guillaume Pitel - LORIA - Nancy

  14. Translations 1/4 • Adorn: chamarrer, embellir, enjoliver, orner, parer, revêtir • Anoint: oindre • Asphalt: asphalter, bitumer • Brush: badigeonner, brosser, effleurer • Butter: beurrer • Coat: empâter, enduire, enrober, revêtir • Cover: badigeonner, barbouiller, couvrir, franchir, gainer, garnir, habiller, monter, parcourir, quadriller, recouvrir, revêtir, saillir, se couvrir, subvenir, tapisser • Cram: bachoter, bâfrer, bûcher, chauffer, engraisser, lester, potasser • Crowd: foule (should also be peupler) • Dab: bassiner, tamponner, toucher • Daub: badigeonner, barbouiller, peinturlurer • Douse: ??? • Drape: draper • Drizzle: brouillasser, bruiner, crachiner, pleuvasser, pleuviner • Dust: enlever la poussière, essuyer, poussière, saupoudrer, épousseter • Embellish: broder, embellir, enjoliver, orner • Fill: appliquer un enduit, boucher, bourrer, calfeutrer, combler, devenir plein, emplir, enfler, fourrer, garnir, gonfler, gorger, imprégner, lester, mastiquer, meubler, obturer, occuper, peupler, plomber, pourvoir, pourvoir à, pénétrer, remplir, s'enfler, se gonfler, se peupler, se remplir Guillaume Pitel - LORIA - Nancy

  15. Manual selection 1/4 • Adorn: chamarrer, embellir, enjoliver, orner, parer, revêtir • Anoint: oindre • Asphalt: asphalter, bitumer • Brush: badigeonner, brosser, effleurer • Butter: beurrer • Coat: empâter, enduire, enrober, revêtir • Cover: badigeonner, barbouiller, couvrir, franchir, gainer, garnir, habiller, monter, parcourir, quadriller, recouvrir, revêtir, saillir, se couvrir, subvenir, tapisser • Cram: bachoter, bâfrer, bûcher, chauffer, engraisser, lester, potasser • Crowd: foule (should also be peupler) • Dab: bassiner, tamponner, toucher • Daub: badigeonner, barbouiller, peinturlurer • Douse: ??? • Drape: draper • Drizzle: brouillasser, bruiner, crachiner, pleuvasser, pleuviner • Dust: enlever la poussière, essuyer, poussière, saupoudrer, épousseter • Embellish: broder, embellir, enjoliver, orner • Fill: appliquer un enduit, boucher, bourrer, calfeutrer, combler, devenir plein, emplir, enfler, fourrer, garnir, gonfler, gorger, imprégner, lester, mastiquer, meubler, obturer, occuper, peupler, plomber, pourvoir, pourvoir à, pénétrer, remplir, s'enfler, se gonfler, se peupler, se remplir Guillaume Pitel - LORIA - Nancy

  16. Frame building: Conclusion • Quite inexpensive compared to introspection from scratch or a corpus-based approach (Filling is a big frame with a lot of LUs; it took me ~30 min to select good instances, with manual color setting) • Probably far from perfect coverage, low precision • Need several annotators to duplicate the work Guillaume Pitel - LORIA - Nancy

  17. Our approach to cross-language semantic annotation • The goal: • A lemma can be related to several frames • We want to disambiguate between the possible choices, • And also try to attribute roles (at least core roles) once we have made the choice • All of this in French, while we have the training data in English Guillaume Pitel - LORIA - Nancy

  18. Bilingual LSA approach Guillaume Pitel - LORIA - Nancy

  19. Latent Semantic Analysis • Improvement over raw co-occurrence matrices • Reduces the number of dimensions • Example: • A occurs in documents (or contexts) 1, 2, 3 • B in 2, 3, 4, 5 • C in 4, 5, 6 • A and C never occur in the same document • LSA allows documents 1-6 to be reduced to one dimension, relating A and C through their shared co-occurrence with B (see the sketch below) Guillaume Pitel - LORIA - Nancy
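
The example can be reproduced with a few lines of numpy: the term-document matrix below is exactly the one described above, and a truncated SVD provides the rank reduction.

    import numpy as np

    # Rows = terms A, B, C; columns = documents 1..6 (occurrence counts).
    X = np.array([
        [1, 1, 1, 0, 0, 0],   # A occurs in documents 1, 2, 3
        [0, 1, 1, 1, 1, 0],   # B occurs in documents 2, 3, 4, 5
        [0, 0, 0, 1, 1, 1],   # C occurs in documents 4, 5, 6
    ], dtype=float)

    # Truncated SVD: keep one latent dimension instead of six document dimensions.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    terms = U[:, :1] * s[:1]              # 1-d term vectors

    print("documents shared by A and C:", int(X[0] @ X[2]))   # 0
    print("reduced coordinates:", terms.ravel().round(3))
    # A and C end up with (near-)identical coordinates: their shared
    # co-occurrence with B is enough to relate them in the reduced space.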

  20. Evaluating the semantic position of Frame Elements in LSA • Computing an English LSA space • Tools: TreeTagger + Infomap-NLP • Corpus: BNC + English part of Europarl + a translation of Balzac • POS+lemma tokens: "NNyear" (see the sketch below) • Keep only verbs, adjectives, nouns, adverbs • Other combinations (no POS, all POS, raw form) don't perform as well Guillaume Pitel - LORIA - Nancy
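
A sketch of this token scheme, assuming TreeTagger's usual token<TAB>POS<TAB>lemma output and Penn-style tags for English; the kept tag prefixes are an illustrative reading of "verbs, adjectives, nouns, adverbs".

    KEEP_PREFIXES = ("N", "V", "JJ", "RB")   # nouns, verbs, adjectives, adverbs

    def lsa_tokens(treetagger_lines):
        """Turn TreeTagger output lines into POS+lemma tokens like 'NNyear'."""
        for line in treetagger_lines:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue
            _, pos, lemma = parts
            if pos.startswith(KEEP_PREFIXES):
                yield pos + lemma.lower()

    print(list(lsa_tokens(["The\tDT\tthe", "year\tNN\tyear", "ended\tVBD\tend"])))
    # -> ['NNyear', 'VBDend']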

  21. Example • Annotations of the FE Filling.Theme • with water. • with a fungicide such as green or yellow sulphur. • with a soft brush and malathion dust. • with a little cayenne pepper. • … • Terms used for the FE's representation • NNwater;NNfungicide;JJsuch;JJgreen;JJyellow;NNsulphur;JJsoft;NNbrush;NNmalathion;NNdust;JJlittle;NNcayenne;NNpepper Guillaume Pitel - LORIA - Nancy

  22. Evaluating an FE's semantic coherence • Compute the semantic center of the FE = centroid of the FE's term positions • Find the N nearest neighbors of this center • If the center is in a semantically coherent region, the average similarity between the neighbors and the center is high (see the sketch below) Guillaume Pitel - LORIA - Nancy
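
A sketch of this measure, assuming space maps every vocabulary token to a unit-normalized LSA vector and fe_terms is the token list of one FE (cf. the Filling.Theme example above); the names are illustrative.

    import numpy as np

    def coherence(fe_terms, space, n_neighbors=20):
        """Average similarity between the FE's center and its N nearest
        neighbors; a high value means a semantically coherent region."""
        vecs = np.array([space[t] for t in fe_terms if t in space])
        center = vecs.mean(axis=0)
        center /= np.linalg.norm(center)
        vocab = list(space)
        sims = np.array([space[w] @ center for w in vocab])  # cosine: unit vectors
        top = np.argsort(-sims)[:n_neighbors]
        return sims[top].mean(), [(vocab[i], float(sims[i])) for i in top]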

  23. FEs of the Filling frame
  Frame.FE            Average    Std        Min        Max        Nb annot
  Filling.Agent       0.604941   0.0413504  0.563591   0.717469   279
  Filling.Cause       -          -          -          -          -
  Filling.Degree      0.595513   0.0431123  0.552401   0.697830   4
  Filling.Depictive   0.683302   0.0502735  0.633029   0.804053   1
  Filling.Goal        0.6483     0.0510976  0.597202   0.793063   543
  Filling.Instrument  0.646028   0.0715617  0.574466   0.844308   4
  Filling.Manner      0.647012   0.0795992  0.567413   0.896142   25
  Filling.Means       0.67356    0.0502949  0.623265   0.820630   1
  Filling.Path        0.708096   0.069683   0.638413   0.925448   2
  Filling.Place       0.562765   0.0364663  0.526299   0.683526   2
  Filling.Purpose     0.631099   0.0585047  0.572594   0.761788   5
  Filling.Result      0.734567   0.0585102  0.676057   0.825459   37
  Filling.Source      0.611222   0.0447367  0.566485   1.000000   1
  Filling.Subregion   0.782659   0.0756196  0.707039   0.944916   2
  Filling.Theme       0.747146   0.0485786  0.698567   0.890307   450
  Filling.Time        0.474269   0.0474972  0.426772   0.628049   16
  Guillaume Pitel - LORIA - Nancy

  24. Neighbors of Filling.Theme • powder 0.890307 • spray 0.836283 • dry 0.821666 • crushed 0.820905 • charcoal 0.813571 • plastic 0.806768 • copper 0.804459 • paste 0.802643 • foam 0.802201 • brush 0.799847 • … • Computed from : with fake diamonds. with pictures of cute white bunnies. with jewels and fine gowns. with one of these pegs. with pictures , flowers , and messages of peace. with wreaths of flowers and garlands of feathers. with the finest furniture from a firm in London 's New Bond Street. with a crown. with beautifully hooked melodies and harmonies. with chrism , the sacred ointment ,. with gel. with such a leaden armour of expectations. with the poison. with these substances. with vaseline. with his pungent urine. with holy oil. in bulb fibre. in whipped cream and honey. with a foot of topsoil. with her hand. … Guillaume Pitel - LORIA - Nancy

  25. Neighbors of Filling.Agent • oliver 0.717469 • jack 0.696716 • joe 0.691628 • marie 0.686812 • harry 0.684113 • charlie 0.681887 • billy 0.680378 • tom 0.678887 • jane 0.676179 • rose 0.669748 • … • Computed from :Your man. I. They. The priests. He. the wife of Cnut 's henchman Tofi the Proud. The Reclusiarch. she. What father. The Indians. Over 200 species of birds. He. He. Father Peter. Viktor. by ecclesiastics. We. One girl. She. she. he. the white gravel. the reluctant soldier. I. Eva. he. Two people. he. the good beachcombers. Sylvester. he. He. Two girls. you. Cecil Beaton. you. Larsen. you. He. you. you. He. he. she. Mina and K. She. you. she. the programme that turns the cameras on teenagers and let's them do the talking and the interviews. Baldwin. by Molly Fletcher. She. I. They. she. Endill. They. He. the BBC and official propaganda… Guillaume Pitel - LORIA - Nancy

  26. FEs' clusters • Grouping the terms of an FE by a minimal distance (arbitrarily set), i.e. 0.8 = 74° • Keeping clusters with more than 5% of the terms (see the sketch below) • http://guillaume.work.free.fr/Frames.en.3 Guillaume Pitel - LORIA - Nancy
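
The slide does not spell out the clustering procedure; the sketch below is one plausible reading of it: greedy single-link grouping at a fixed similarity threshold, then dropping clusters holding 5% of the terms or fewer. Vectors in space are assumed unit-normalized.

    def cluster_terms(terms, space, threshold=0.8, min_share=0.05):
        """Group FE terms whose vectors are within the similarity threshold
        of some cluster member; keep only clusters above the size cutoff."""
        clusters = []
        for term in terms:
            v = space[term]
            for cluster in clusters:
                if any(v @ space[m] >= threshold for m in cluster):
                    cluster.append(term)
                    break
            else:
                clusters.append([term])
        min_size = max(1, int(min_share * len(terms)))
        return [c for c in clusters if len(c) > min_size]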

  27. Clusters of the Filling frame • Agent: 2 cluster(s) • Degree: 4 cluster(s) • Depictive: 6 cluster(s) • Goal: 2 cluster(s) • Instrument: 6 cluster(s) • Manner: 2 cluster(s) • Means: 2 cluster(s) • Path: 1 cluster(s) • Place: 5 cluster(s) • Purpose: 1 cluster(s) • Result: 2 cluster(s) • Source: 1 cluster(s) • Subregion: 1 cluster(s) • Theme: 2 cluster(s) • Time: 0 cluster(s) Guillaume Pitel - LORIA - Nancy

  28. Clusters of Filling.Agent • Cluster 1: rachel 0.867663, sara 0.863332, ellen 0.856612, lily 0.855513, sally 0.853933, alice 0.849205, emily 0.847480, dad 0.845598, jenny 0.844003, kate 0.839664, maggie 0.836391 • Cluster 2: tom 0.924026, john 0.908828, hugh 0.898049, michael 0.897622, scott 0.892861, sir 0.891623, david 0.889539, frank 0.889324, murray 0.879660, anthony 0.879149, geoffrey 0.876748 Guillaume Pitel - LORIA - Nancy

  29. Clusters of Filling.Goal • Cluster 1: tin 0.924426, pot 0.908988, jar 0.908169, cake 0.893367, bottle 0.888083, bag 0.871596, jug 0.866099, bowl 0.860658, basket 0.858857, plastic 0.852992, dish 0.846176, peel 0.834313 • Cluster 2: wall 0.911646, wooden 0.864492, entrance 0.851708, front 0.846124, floor 0.834214, porch 0.834039, staircase 0.827131, roof 0.823297, rear 0.815847, corner 0.815765, rear 0.813187, front 0.813136 Guillaume Pitel - LORIA - Nancy

  30. Clusters of Filling.Theme • Cluster 1: powder 0.913015, salt 0.907773, dry 0.900202, aromatic 0.886529, vegetable 0.870903, spray 0.867004, bean 0.860508, herb 0.858321, meat 0.852165, apple 0.848998, vinegar 0.848045, pea 0.845492 • Cluster 2: shiny 0.915945, red 0.908281, pink 0.905748, tint 0.900729, grey 0.899490, yellow 0.882565, blue 0.882097, white 0.877434, ribbon 0.876266, brown 0.875334, pale 0.875016, silk 0.865824 Guillaume Pitel - LORIA - Nancy

  31. Projection • Compute French clusters from English clusters • Corpus collection • Europarl (French-English) • Parallel French-English Balzac from Project Gutenberg • French//English: 50M lemmas • Shakespeare and the Hansard Corpus to be included Guillaume Pitel - LORIA - Nancy

  32. Training data • Lemmas interleaved on a sentence-alignment basis (see the sketch below) • Training with a larger window • Only parallel corpus: experiments that introduce bits of purely monolingual corpus show a quality loss Guillaume Pitel - LORIA - Nancy
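
A sketch of this construction: each aligned sentence pair becomes one mixed pseudo-document, so French and English lemmas share contexts and land in a single bilingual LSA space. The exact interleaving granularity is an assumption; the slide only says lemmas are interleaved per aligned pair.

    def bilingual_documents(aligned_pairs):
        """Yield one mixed pseudo-document per aligned (EN, FR) lemma pair."""
        for en_lemmas, fr_lemmas in aligned_pairs:
            doc = []
            for i in range(max(len(en_lemmas), len(fr_lemmas))):
                # Alternate sides so any context window sees both languages.
                doc.extend(en_lemmas[i:i + 1])
                doc.extend(fr_lemmas[i:i + 1])
            yield doc

    pairs = [(["NNriver", "VBflow"], ["NNfleuve", "VBcouler"])]
    print(list(bilingual_documents(pairs)))
    # -> [['NNriver', 'NNfleuve', 'VBflow', 'VBcouler']]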

  33. Similarity between translations in the Bilingual Semantic Space • Results: • eat / manger: 0.98 (32°) • fleuve / river: 0.94 (55°) • green / vert: 0.83 (92°) • bleu / blue: 0.87 (81°) • eat / fleuve: 0.77 (107°) • drink / écran: 0.82 (96°) Guillaume Pitel - LORIA - Nancy

  34. Neighborhood in Bilingual Semantic Space • Eat/Manger Guillaume Pitel - LORIA - Nancy

  35. Neighborhood in Bilingual Semantic Space • Fleuve/River Guillaume Pitel - LORIA - Nancy

  36. Neighborhood in Bilingual Semantic Space • Vert/Green Guillaume Pitel - LORIA - Nancy

  37. Projection: Conclusion • Projecting whole clusters gives variable results • Results of the projection are very disappointing • Unusable in this state • It seems this may simply come from alignment mistakes • Can we improve the projected clusters with a bilingual dictionary? • Relating clusters to synsets? Not necessarily a good idea: Champagne and Caviar are not related in WordNet • More generally, "simple" translation may cause undesired broadening of the cluster Guillaume Pitel - LORIA - Nancy

  38. Potential applications • Statistical processing is interesting because it can capture "usage-based" regularities • Clusters built with LSA can be interesting information sources for the lexicographer • They can also, more simply, be used to automatically find new semantic types/selectional preferences emerging from the annotation of a new domain (e.g. frequently occurring metaphors) • In a multilingual, collaborative annotation task, they could be useful for transferring work between languages without requiring annotation of a parallel corpus Guillaume Pitel - LORIA - Nancy
