
DCU@FIRE 2012: Monolingual and Crosslingual SMS-based FAQ Retrieval


Presentation Transcript


  1. Johannes Leveling CNGL, School of Computing, Dublin City University, Ireland DCU@FIRE 2012: Monolingual and Crosslingual SMS-based FAQ Retrieval

  2. Outline • Motivation • System Setup and Changes • Monolingual Experiments • Crosslingual Experiments: SMT system, training data, translation results • OOV Reduction • FAQ Retrieval Results • Conclusions and Future Work

  3. Motivation • Task: given an SMS query, find FAQ documents answering the query • Last year’s DCU system: • SMS correction and normalisation • In-domain (ID) retrieval: three approaches (SOLR, Lucene, term overlap) • Out-of-domain (OOD) detection: three approaches (term overlap, normalized BM25 scores, machine learning) • Combination of ID retrieval and OOD detection results

  4. Motivation This year’s system: • Same SMS correction and normalisation, plus one additional (manually created) spelling-correction resource • Single retrieval approach: Lucene with the BM25 retrieval model • Single OOD detection approach: IB-1 classification using TiMBL (machine learning), with additional features for term overlap and normalized BM25 scores • Newly trained statistical machine translation system for document translation (Hindi to English)
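Retrieval in this year's system relies on Lucene's BM25 model. As an illustration only, here is a minimal self-contained sketch of BM25 scoring outside Lucene, using the common default parameters k1 = 1.2 and b = 0.75 (the slides do not state which parameter values the system used):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a query with the BM25 ranking function.

    doc_freq maps a term to the number of documents containing it;
    avg_doc_len is the average document length in the collection.
    """
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        # Inverse document frequency (Robertson/Sparck Jones form)
        df = doc_freq.get(term, 0)
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Term-frequency saturation with document-length normalisation
        norm_tf = (tf[term] * (k1 + 1)) / (
            tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_doc_len))
        score += idf * norm_tf
    return score
```

A document sharing query terms scores above one sharing none, which is the property the FAQ ranking depends on.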

  5. Questions Investigate: • the influence of OOD detection on system performance • the influence of out-of-vocabulary (OOV) words on crosslingual performance
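The OOD detection whose influence is investigated here is IB-1 classification (1-nearest neighbour, as implemented in TiMBL). A minimal sketch, assuming each query is represented by a small numeric feature vector such as term overlap and normalized BM25 score; the feature values below are made up for illustration:

```python
def ib1_classify(query_features, training_examples):
    """IB-1: label a query with the class of its single most similar
    training instance (Euclidean distance here; TiMBL offers several
    similarity metrics)."""
    best_label, best_dist = None, float("inf")
    for features, label in training_examples:
        dist = sum((a - b) ** 2 for a, b in zip(query_features, features)) ** 0.5
        if dist < best_dist:
            best_dist, best_label = dist, label
    return best_label

# Hypothetical training instances: (term overlap, normalized BM25 score)
train = [((0.9, 0.8), "ID"), ((0.1, 0.05), "OOD")]
```

A query with high overlap and a high retrieval score lands near an in-domain (ID) neighbour; one matching almost nothing is classified OOD and answered with "NONE".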

  6. Collection Statistics

  7. Monolingual Experiments (Setup) Experiments for English and Hindi • Processing steps: • Normalize SMS queries and FAQ documents • Correct SMS queries • Retrieve answers • Detect OOD queries (or skip this step); OOD queries are answered with “NONE” • Produce the final result
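The first two processing steps, normalizing and correcting SMS text, can be sketched as lookup-based textese expansion. The `TEXTESE` lexicon below is a hypothetical stand-in for the system's (partly manually created) spelling-correction resources:

```python
# Hypothetical textese lexicon; the actual DCU resources are far larger.
TEXTESE = {"2moro": "tomorrow", "ur": "your", "pls": "please", "hw": "how"}

def normalise_sms(sms):
    """Lowercase the SMS, strip trailing punctuation from each token,
    and expand known textese tokens to standard spelling."""
    tokens = []
    for tok in sms.lower().split():
        tok = tok.strip(".,?!")
        tokens.append(TEXTESE.get(tok, tok))
    return " ".join(tokens)
```

After this step the query is in standard orthography and can be matched against the (likewise normalized) FAQ documents.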

  8. Crosslingual Experiments (Setup) Experiments for English to Hindi • Additional step: translate the Hindi FAQ documents into English • Translation is based on a newly trained statistical machine translation (SMT) system • Problems: • sparse training data → combination of different training resources • out-of-vocabulary (OOV) words → OOV reduction

  9. Crosslingual Experiments (SMT System) Training an SMT system: • Data preparation: tokenization/normalization scripts • Data alignment: GIZA++ for word-level alignment • Phrase extraction: Moses MT toolkit • Language model training: SRILM for a trigram LM with Kneser-Ney smoothing • Tuning: minimum error rate training (MERT)
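The language-model step can be illustrated with a toy count-based trigram model. Unlike the SRILM model on the slide, this sketch uses raw relative frequencies with no Kneser-Ney smoothing, so unseen trigrams simply get probability zero:

```python
from collections import defaultdict, Counter

def train_trigram_lm(sentences):
    """Collect trigram counts over tokenized sentences, padding each
    sentence with start markers and an end marker."""
    counts = defaultdict(Counter)
    for sent in sentences:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(toks)):
            counts[(toks[i - 2], toks[i - 1])][toks[i]] += 1
    return counts

def trigram_prob(counts, w1, w2, w3):
    """Unsmoothed maximum-likelihood estimate P(w3 | w1, w2)."""
    ctx = counts[(w1, w2)]
    total = sum(ctx.values())
    return ctx[w3] / total if total else 0.0
```

In the real pipeline SRILM handles these counts and redistributes probability mass to unseen trigrams via Kneser-Ney discounting.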

  10. Crosslingual Experiments (Training Data) • Agro (agricultural domain): 246 sentences • Crowdsourced HI-EN data: 50k sentences • EILMT (tourism domain): 6700 sentences • ICON: 7000 sentences • TIDES: 50k sentences • FIRE ad-hoc queries: 200 titles, 200 descriptions • Interlanguage Wikipedia links: 27k entries • OPUS/KDE: 97k entries • UWdict: 128k entries

  11. Translation Results (Hindi to English)

  12. OOV Reduction • Problem: 15.4% untranslated words in the translation output • Idea: modify untranslated words to obtain a translation • OOV reduction is based on two resources: • UWdict • a manually created transliteration lexicon (TRL): 639 entries

  13. OOV Reduction Word modifications: • Character normalization: e.g. replace Chandrabindu with Bindu, delete the Virama character, replace long with short vowels • Stemming: Lucene Hindi stemmer • Transliteration: ITRANS transliteration rules, plus rules for cleaning up ITRANS results • Decompounding: the word is split at every position into candidate constituents; it is decompounded if both constituents have a translation
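The decompounding heuristic above can be sketched directly: try every split point and accept the first split whose two halves both appear in the translation lexicon. The example lexicon below uses a Latin-script toy compound in place of Hindi:

```python
def decompound(word, lexicon):
    """Split an untranslated word at every position; accept a split only
    if both candidate constituents have a translation in the lexicon."""
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            return (left, right)
    return None

# Toy translation lexicon (illustrative entries, not from the paper)
lexicon = {"rail": "rel", "way": "marg"}
```

An OOV word like "railway" is then translated constituent by constituent, while a word with no valid split stays untranslated and falls through to the other reduction steps.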

  14. OOV Reduction Results (Hindi to English)

  15. FAQ Retrieval Results

  16. Conclusions • Monolingual experiments: • good performance for English and Hindi • OOD detection improves MRR (but reduces the number of correctly answered ID queries) • Crosslingual experiments: • lower performance • OOD detection reduces MRR • OOV reduction reduces MRR

  17. Future work • Further analysis of our results is needed: • Normalization issues for the MT training data? • Unbalanced OOD training data for Hindi and English? • Is there a Hindi textese (e.g. abbreviations)? • Does the (manually or automatically created) training data match the test data? • Improve the transliteration approach • Compare with other submissions

  18. 10q 4 ur @ensn (textese: “Thank you for your attention”)
