
Automatic Summarization using IE Technologies and Pattern Discovery

This paper explores the use of Information Extraction (IE) technologies for text summarization by combining Named Entity recognition and automatic pattern discovery. The goal is to extract important phrases and sentences from a document to create a concise summary. The paper also discusses the evaluation and optimization of scores for sentence position, length, TF/IDF, and similarity to headline.





Presentation Transcript


  1. NYU/CRL system for DUC and Prospect for Single Document Summaries September 14, 2001 DUC2001 Workshop Satoshi Sekine (New York University) Chikashi Nobata (CRL – Japan)

  2. Objective • Use IE technologies for Summarization • Named Entity • Automatic pattern discovery Find important phrases (patterns) of the domain • Combine with Summarization technologies • Important Sentence Extraction • Sentence position, length, TF/IDF, Headline

  3. Important Sentence Extraction • Combining 5 scores • Sentence position • Sentence length • TF/IDF • Similarity to Headline • Pattern • Optimize functions/weights on training data
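The slide lists the five scores but does not say how they are combined. A minimal sketch, assuming a weighted linear sum and using hypothetical names (`FEATURES`, `combined_score`, `top_sentences`), might look like this:

```python
from typing import Dict, List

# Hypothetical feature names for the five scores listed on the slide.
FEATURES = ("position", "length", "tfidf", "headline", "pattern")


def combined_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the five per-sentence scores (assumed combination)."""
    return sum(weights[name] * scores[name] for name in FEATURES)


def top_sentences(all_scores: List[Dict[str, float]],
                  weights: Dict[str, float],
                  k: int) -> List[int]:
    """Return the indices of the k best sentences, restored to document order."""
    ranked = sorted(range(len(all_scores)),
                    key=lambda i: combined_score(all_scores[i], weights),
                    reverse=True)
    return sorted(ranked[:k])
```

In this sketch the weights would be the values tuned on training data; the tuning procedure itself is not described on the slide and is not shown here.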

  4. Alternative scores for Sentence position • Score = 1 (i < T), 0 (otherwise) • Score = 1/i • Score = max(1/i, 1/(n-i+1)) • [Plot: Score against Sentence position i, with T and n marked on the position axis]
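The three position functions can be written down directly. This is a sketch, assuming `i` is the 1-based sentence index, `n` the number of sentences in the document, and `threshold` the cut-off T; the function names are illustrative only.

```python
def position_step(i: int, threshold: int) -> float:
    """1 for sentences before the threshold T, 0 otherwise."""
    return 1.0 if i < threshold else 0.0


def position_reciprocal(i: int) -> float:
    """1/i: the score decays with distance from the start of the document."""
    return 1.0 / i


def position_both_ends(i: int, n: int) -> float:
    """max(1/i, 1/(n-i+1)): rewards sentences near the start or the end."""
    return max(1.0 / i, 1.0 / (n - i + 1))
```

For a 10-sentence document, `position_both_ends` gives the first and last sentences the same top score, while `position_reciprocal` favors only the opening sentences.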

  5. Alternative scores for Sentence length & TF/IDF • Sentence length 1. Score = Length 2. Score = Length (if L > C), Length - C (otherwise) • TF/IDF TF = tf(w), (tf(w)-1)/tf(w), tf(w)/(tf(w)+1)
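A sketch of the length and TF alternatives, taken literally from the slide. It assumes `length` is the sentence length L, `cutoff` is the constant C, and `tf` is the raw count of a word that occurs at least once; the IDF component is not specified on the slide and is left out. Function names are illustrative.

```python
def length_plain(length: int) -> float:
    """Alternative 1: the score is the sentence length itself."""
    return float(length)


def length_with_cutoff(length: int, cutoff: int) -> float:
    """Alternative 2: Length if L > C, otherwise Length - C (a penalty)."""
    return float(length) if length > cutoff else float(length - cutoff)


def tf_raw(tf: int) -> float:
    """TF = tf(w)."""
    return float(tf)


def tf_minus_one_ratio(tf: int) -> float:
    """TF = (tf(w) - 1) / tf(w); zero for words that occur only once."""
    return (tf - 1) / tf


def tf_saturating(tf: int) -> float:
    """TF = tf(w) / (tf(w) + 1); approaches 1 as the count grows."""
    return tf / (tf + 1)
```

The two ratio forms both stay below 1, so a very frequent word cannot dominate a sentence score the way it can with the raw count.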

  6. Alternative scores for Headline • TF/IDF ratio between the words in the sentence that overlap with the headline and all words in the sentence • TF ratio between overlapping Named Entities (NE) and all NEs in the sentence TF = tf(e)/(1+tf(e))
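One possible reading of the two headline scores is sketched below. It assumes tokenized input, a precomputed per-word weight table (`weight`, standing in for TF/IDF), and per-document entity counts (`entity_tf`); these names and the handling of empty sentences are assumptions, not details from the slides.

```python
from collections import Counter
from typing import Dict, Iterable, List


def headline_word_score(sentence: List[str],
                        headline: Iterable[str],
                        weight: Dict[str, float]) -> float:
    """Weight of sentence words also found in the headline, over all sentence words."""
    headline_words = set(headline)
    total = sum(weight.get(w, 0.0) for w in sentence)
    if total == 0.0:
        return 0.0  # assumed convention for sentences with no weighted words
    overlap = sum(weight.get(w, 0.0) for w in sentence if w in headline_words)
    return overlap / total


def headline_entity_score(sentence_entities: List[str],
                          headline_entities: Iterable[str],
                          entity_tf: Counter) -> float:
    """Same ratio over named entities, using TF = tf(e) / (1 + tf(e))."""
    def damped(entity: str) -> float:
        tf = entity_tf[entity]
        return tf / (1 + tf)

    headline_set = set(headline_entities)
    total = sum(damped(e) for e in sentence_entities)
    if total == 0.0:
        return 0.0
    overlap = sum(damped(e) for e in sentence_entities if e in headline_set)
    return overlap / total
```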

  7. Pattern • Assumption Patterns (phrases) that appear often in the domain are important • Strategy • Intended to use IR to find a larger set of documents in the domain, but used the given document set • NEs were treated as classes rather than as literal strings

  8. Pattern discovery • Procedure • Analyze sentences (NE, dependency) • Extract all sub-trees from the dependency trees in the domain • Score the trees based on frequency of the tree and TF/IDF of the words • High score trees are regarded as important patterns
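The procedure above can be illustrated with a small sketch. It assumes each dependency tree is a nested tuple `(label, (child, ...))` in which named entities have already been replaced by their class (for example `LOCATION`), and it scores a sub-tree as its frequency times the summed weight of its words. The slide names the two ingredients but not the formula, so that product is purely an assumption, as are all function names.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

Tree = Tuple[str, tuple]  # (label, tuple of child trees)


def rooted_subtrees(tree: Tree) -> List[Tree]:
    """All connected sub-trees that keep the root of `tree`."""
    label, children = tree
    # Each child is either dropped (None) or replaced by one of its own rooted sub-trees.
    options = [[None] + rooted_subtrees(child) for child in children]
    result = []
    for choice in product(*options):
        kept = tuple(sorted(c for c in choice if c is not None))
        result.append((label, kept))
    return result


def all_subtrees(tree: Tree) -> List[Tree]:
    """Rooted sub-trees of every node (exponential; fine for short sentences only)."""
    found = rooted_subtrees(tree)
    for child in tree[1]:
        found.extend(all_subtrees(child))
    return found


def labels(tree: Tree) -> List[str]:
    """All node labels in a tree."""
    out = [tree[0]]
    for child in tree[1]:
        out.extend(labels(child))
    return out


def score_patterns(trees: List[Tree], word_weight: Dict[str, float]) -> Dict[Tree, float]:
    """Assumed score: number of sentences containing the sub-tree times summed word weight."""
    freq: Counter = Counter()
    for tree in trees:
        freq.update(set(all_subtrees(tree)))
    return {p: n * sum(word_weight.get(w, 0.0) for w in labels(p))
            for p, n in freq.items()}
```

Sorting the kept children gives each sub-tree a canonical form, so the same pattern found in two sentences is counted as one key regardless of child order.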

  9. Optimal weight • Optimal weights are found on training set • Contribution

  10. Evaluation Result • Subjective evaluation (V; out of 12) • Average over all documents

  11. Prospect for Single Document Summaries Important Sentence Extraction CAN be Summarization but Summarization is NOT Important Sentence Extraction

  12. DUC • We are aiming for Document understanding • How can understanding be instantiated? • Make summary • Extract essential points, principal relations • Answer questions • Comprehension test

  13. Example Earthquake jolts Los Angeles area LOS ANGELES (AP) — An earthquake shook the greater Los Angeles area Sunday, but there were no immediate reports of damage or injuries. The quake had a preliminary magnitude of 4.2 and was centered about one mile southeast of West Hollywood, said Lucy Jones of the U.S. Geological Survey. The quake was felt in downtown Los Angeles where it rolled for about four seconds and also shook in the suburban areas of Van Nuys, Whittier and Glendale.

  14. Essential points • Event (Earthquake) • When: Sunday, September 9, 2001 • Where: greater Los Angeles area • Magnitude: 4.2 • Injury: No • Death: No • Damage: No

  15. How can we make it • IE is a hint (a step) • IE is a version of document understanding limited to a specific domain and task which are given in advance • Document understanding can be achieved by upgrading IE technologies by deleting “specific” and “given in advance”

  16. Our approach • Essential points can be found by searching frequently mentioned patterns in the same domain • Strategy • Given a document, find its domain by IR • Find frequently mentioned patterns • Extract information matching those patterns

  17. Single Document Summarization • Has to be continued • To pursue research on “Understanding” • To find something more than sentence extraction • To observe humans in the summary task • To have newcomers (like us)
