
Automatic Keyphrase Extraction by Bridging Vocabulary Gap

Automatic Keyphrase Extraction by Bridging Vocabulary Gap. Xinxiong Chen, Tsinghua University, 2013-04-26. Main idea: appropriate keyphrases are not always statistically significant, and may not even appear in the given document (the vocabulary gap).


Presentation Transcript


  1. Automatic Keyphrase Extraction by Bridging Vocabulary Gap Xinxiong Chen Tsinghua University 2013-04-26 THUNLP, Tsinghua University http://nlp.csai.tsinghua.edu.cn

  2. Main Idea • Vocabulary gap: appropriate keyphrases are not always statistically significant, and may not even appear in the given document. • Use word alignment models from statistical machine translation to learn translation probabilities between the words in documents and the words in keyphrases.

  3. Introduction – Keyphrase • What is a keyphrase? • A set of terms selected from a document as a short summary of the document.

  4. Introduction – Keyphrase Extraction • Why keyphrase extraction? • Digital libraries • Information retrieval • Goal: automatically extract keyphrases from documents • Unsupervised

  5. Example • A news article (translated from Chinese)

  6. Example • Existing unsupervised methods: • TFIDF: Nuclear bombs, Iran, Israeli, enriched uranium, speech • TextRank: Iran, Israeli, chief, Nuclear bombs, Military • Use a fixed-size window to build a word graph • Use PageRank to decide which words are more important
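
The two TextRank steps on this slide (a fixed-size window to build the word graph, then PageRank over it) can be sketched roughly as follows. This is a minimal illustration, not the implementation behind the slide: the function name `textrank`, the default window size, the damping factor, and the iteration count are all assumptions.

```python
from collections import defaultdict


def textrank(words, window=3, damping=0.85, iterations=50):
    """Rank words with PageRank over a co-occurrence graph (illustrative sketch)."""
    # Link every pair of distinct words that co-occur within `window` tokens.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for v in words[i + 1 : i + window]:
            if v != w:
                neighbors[w].add(v)
                neighbors[v].add(w)

    # Power iteration for PageRank on the undirected, unweighted graph.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[v] / len(neighbors[v]) for v in neighbors[w])
            for w in neighbors
        }
    # Highest-scoring words first.
    return sorted(scores, key=scores.get, reverse=True)


# Toy token sequence (hypothetical, not the article from the slides).
tokens = "iran nuclear israel nuclear uranium nuclear speech".split()
ranking = textrank(tokens, window=2)
```

Because the graph is built only from the document itself, a word that never appears in the text can never be ranked, which is the vocabulary gap the talk sets out to bridge.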

  7. Example • LDA: Iran, England, America, Nation, Speech • Learn topics from documents • ExpandRank: Iran, enriched uranium, Israeli, atomic energy, Lebanon • Find the k nearest neighbor documents to build word graphs

  8. Idea – Association • When a word is mentioned, it reminds people of other words. • iPhone – Apple • Nuclear bombs – Nuclear Weapon • What is the association probability between “Nuclear bombs” and “Nuclear Weapon”?

  9. Idea – SMT for Keyphrase Extraction • Both the content and the keyphrases are parallel summaries of a news article • Unsupervised: use the title or summary instead • Estimate the translation probabilities between the words in the content and the title • Word alignment models

  10. Translation Probability • Example: • Nuclear bombs: • Nuclear bombs: 0.515757 • Liquid: 0.0871815 • Nuclear Weapon: 0.0808868 • Military Action: 0.0239178 • Israeli Military: 0.0215988 • Miniaturization: 0.0118 • Possible: 0.0113688 • enriched uranium: 0.0100252

  11. Keyphrase Extraction Using WAM • Given a news article, rank keyphrases by computing their scores • Iran, Israeli, chief, Nuclear bombs, Military … • Iran, Israeli, chief, Nuclear bombs, Nuclear weapon, Military, speech

  12. Word Trigger Method (WTM) • Three steps: • Preparing translation pairs • Learning a translation model • IBM Model-1 • Extracting keyphrases given a resource
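
The second step, learning a translation model with IBM Model-1, amounts to EM over word-aligned pairs. The sketch below is a simplified illustration under stated assumptions: it omits the NULL word used in the full model, the function name `train_ibm_model1` is hypothetical, and the toy content-title pairs are invented for the example.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate Pr(t | w) from (source_words, target_words) pairs with EM.

    Simplified IBM Model-1 without a NULL source word (an assumption).
    Returns a table keyed by (t, w).
    """
    target_vocab = {t for _, target in pairs for t in target}
    # Start from a uniform translation table.
    prob = defaultdict(lambda: 1.0 / len(target_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of t aligned to w
        total = defaultdict(float)  # expected count of w being aligned
        for source, target in pairs:
            for t in target:
                # E-step: share one unit of t among the source words.
                norm = sum(prob[(t, w)] for w in source)
                for w in source:
                    frac = prob[(t, w)] / norm
                    count[(t, w)] += frac
                    total[w] += frac
        # M-step: renormalise the expected counts per source word.
        prob = defaultdict(
            float, {(t, w): c / total[w] for (t, w), c in count.items()}
        )
    return prob


# Hypothetical (content words, title words) pairs.
pairs = [
    (["iran", "nuclear"], ["iran", "bomb"]),
    (["iran", "speech"], ["iran", "talk"]),
    (["chief", "speech"], ["leader", "talk"]),
]
table = train_ibm_model1(pairs)
```

Swapping the two sides of each pair before training gives the model for the opposite direction.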

  13. Translation Pairs • Length imbalance problem • Unable to list all tags on the annotation side • Tags may have different importance for the resource

  14. Content-Title Pairs • Length imbalance problem • Unable to list all tags on the annotation side • Tags may have different importance for the resource • Sampling method • Tag weighting type • TF_t, TF-IRF_t • Length ratio
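
One way to read the sampling method is: draw words from the long side in proportion to their weight until the pair reaches the chosen length ratio. The sketch below assumes plain term frequency as the weight and sampling with replacement; the function name `sample_content`, the `seed` parameter, and the rounding rule are illustrative assumptions, not details taken from the slides.

```python
import random
from collections import Counter


def sample_content(content_words, title_words, length_ratio=2.0, seed=0):
    """Resample the content so that len(content) ~ length_ratio * len(title)."""
    rng = random.Random(seed)
    tf = Counter(content_words)        # term frequency as the word weight
    vocab = list(tf)
    weights = [tf[w] for w in vocab]
    k = round(length_ratio * len(title_words))
    # Sample with replacement, proportionally to term frequency.
    return rng.choices(vocab, weights=weights, k=k)


# Hypothetical content and title.
content = "iran nuclear bombs iran enriched uranium speech iran".split()
title = ["iran", "nuclear"]
sampled = sample_content(content, title, length_ratio=2.0)
```

A TF-IRF weight could be swapped in by replacing the `weights` list; the sampling step itself would stay the same.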

  15. Learning Translation Probabilities • IBM Model-1 as the WAM algorithm • Asymmetric: Pr_d2a(t|w), Pr_a2d(t|w) • Linear combination • Pr_d2a(t|w) • Pr_a2d(t|w) • When λ = 1 or λ = 0, it simply uses the model Pr_d2a(t|w) or Pr_a2d(t|w), respectively
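
The combination formula itself is not preserved in this transcript. Assuming a plain linear interpolation, which matches the stated behaviour at λ = 1 and λ = 0, the two directional tables could be merged as sketched below. The function name `combine` and the `(t, w)` key layout are assumptions.

```python
def combine(pr_d2a, pr_a2d, lam=0.5):
    """Assumed form: Pr(t|w) = lam * Pr_d2a(t|w) + (1 - lam) * Pr_a2d(t|w)."""
    keys = set(pr_d2a) | set(pr_a2d)
    return {
        key: lam * pr_d2a.get(key, 0.0) + (1 - lam) * pr_a2d.get(key, 0.0)
        for key in keys
    }


# Hypothetical probabilities for a single (t, w) pair.
d2a = {("Nuclear Weapon", "Nuclear bombs"): 0.6}
a2d = {("Nuclear Weapon", "Nuclear bombs"): 0.2}
merged = combine(d2a, a2d, lam=0.5)
```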

  16. Tag Suggestion Using Triggered Words • Given a description, rank tags by computing their scores

  17. Tag Suggestion Using Triggered Words • Given a description, rank tags by computing their scores • Trigger power of the word w in the content • TF-IRF_w • TextRank • Their product

  18. Keyphrase Extraction Using Triggered Words • Given a description, rank tags by computing their scores • Translation probabilities from words in the description to keyphrases
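
Putting the two ingredients together, trigger power and translation probability, a candidate's score can be sketched as the sum over content words w of Pr(t|w) times the trigger power of w. The exact scoring formula is not preserved in this transcript, so this form is an assumption, as are the function name `rank_keyphrases` and the trigger power of 1.0. The probabilities are the ones listed on slide 10.

```python
from collections import defaultdict


def rank_keyphrases(trigger_power, translation, top_k=5):
    """Assumed score: score(t) = sum_w Pr(t | w) * power(w)."""
    scores = defaultdict(float)
    for (t, w), p in translation.items():
        if w in trigger_power:
            scores[t] += p * trigger_power[w]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Translation probabilities from slide 10, keyed by (t, w).
translation = {
    ("Nuclear bombs", "Nuclear bombs"): 0.515757,
    ("Liquid", "Nuclear bombs"): 0.0871815,
    ("Nuclear Weapon", "Nuclear bombs"): 0.0808868,
}
# Hypothetical trigger power for the only content word considered here.
trigger_power = {"Nuclear bombs": 1.0}
ranked = rank_keyphrases(trigger_power, translation)
```

Note that “Nuclear Weapon” receives a non-zero score even though it is not a content word in this toy input, which is how triggered words bridge the vocabulary gap.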

  19. Emphasize Tags Appearing In Content for WTM (EWTM) • Emphasize tags appearing in the description • I_t(w): indicator function that emphasizes tags appearing in the content • Equals 1 when t = w • Equals 0 when t ≠ w
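
The EWTM formula is not preserved in this transcript. One plausible reading is to mix the indicator I_t(w) with the translation probability through a weight, called `alpha` here; both the mixing form and the name `alpha` are assumptions made for this sketch, as is the function name `ewtm_score`.

```python
def ewtm_score(tag, trigger_power, translation, alpha=0.5):
    """Assumed form: sum_w (alpha * I_t(w) + (1 - alpha) * Pr(t|w)) * power(w)."""
    return sum(
        (
            alpha * (1.0 if tag == w else 0.0)                # I_t(w)
            + (1 - alpha) * translation.get((tag, w), 0.0)    # Pr(t | w)
        )
        * power
        for w, power in trigger_power.items()
    )


# Hypothetical trigger powers and translation probabilities.
trigger_power = {"iran": 0.6, "speech": 0.4}
translation = {("iran", "iran"): 0.5, ("talk", "speech"): 0.5}
in_content = ewtm_score("iran", trigger_power, translation)
not_in_content = ewtm_score("talk", trigger_power, translation)
```

With `alpha=0` the score falls back to the plain WTM score; larger values favour candidates that literally occur in the content.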

  20. Experiments • Datasets • 13,702 news articles from www.163.com • Evaluation metrics • Precision, recall, and F-measure
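
For reference, the three metrics can be computed per document as sketched below, using exact string matching between extracted and gold keyphrases. The matching rule, the function name `precision_recall_f1`, and the gold list in the example are assumptions; the slides do not say how matches were counted.

```python
def precision_recall_f1(extracted, gold):
    """Precision, recall and F-measure with exact-match keyphrases."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Extracted list taken from the TextRank example; the gold list is hypothetical.
extracted = ["Iran", "Israeli", "chief", "Nuclear bombs"]
gold = ["Iran", "Nuclear bombs", "Nuclear weapon"]
p, r, f = precision_recall_f1(extracted, gold)
```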

  21. Experiment Results

  22. Parameters – Length Ratio • The length ratio: content/title

  23. Application • SINA APP (http://app.thunlp.org/weibo) • Now we have more than 2 million registered users

  24. Thank you! Q & A
