
CSC 9010: Text Mining Applications Document Summarization




  1. CSC 9010: Text Mining Applications: Document Summarization. Dr. Paula Matuszek, Paula_A_Matuszek@glaxosmithkline.com, (610) 270-6851

  2. Document Summarization
     • Provide a meaningful summary for each document
     • Examples:
       • Search tool returns “context”
       • Monthly progress reports from multiple projects
       • Summaries of news articles on the human genome
     • Often part of a document retrieval system, to help users judge documents better
     • Surprisingly hard to make sophisticated
     • Surprisingly easy to make effective

  3. Document Summarization -- How
     Three general approaches:
     • Extract predefined summary.
       • Useful in highly structured environments where you can specify format. Typically very good summaries.
     • Capture in abstract representation, generate summary.
       • Useful in well-defined domains with clear-cut information needs.
     • Extract representative sentences/clauses.
       • Useful in arbitrarily complex and unstructured domains; broadly applicable, and gets the "general feel".

  4. Extract Predefined Summary
     • Documents have a well-defined format.
     • Format includes a summary or abstract explicitly written by the document author.
     • Text mining may reorganize, regroup, restructure summaries.
     • Example:
       • People working on multiple projects write monthly reports based on what they have done, one sentence per project.
       • Reporting system collects person-level reports and reorganizes them into project-level reports.
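The reporting example on this slide can be sketched in a few lines. This is a minimal illustration, not the system the slides describe: the names `person_reports` and `regroup_by_project`, and all the sample data, are hypothetical. It assumes each person's report already arrives as one sentence per project.

```python
from collections import defaultdict

# Hypothetical person-level monthly reports: one sentence per project.
person_reports = {
    "Alice": {
        "Genome Search": "Indexed 2,000 new abstracts.",
        "Report Tool": "Fixed the export bug.",
    },
    "Bob": {
        "Genome Search": "Tuned ranking weights.",
    },
}


def regroup_by_project(reports):
    """Turn {person: {project: sentence}} into {project: [(person, sentence)]}."""
    by_project = defaultdict(list)
    # Sort people so the order within each project is deterministic.
    for person, entries in sorted(reports.items()):
        for project, sentence in entries.items():
            by_project[project].append((person, sentence))
    return dict(by_project)


project_reports = regroup_by_project(person_reports)

for project, lines in sorted(project_reports.items()):
    print(project)
    for person, sentence in lines:
        print(f"  {person}: {sentence}")
```

No text analysis happens here at all: the author wrote the summaries, and the system only regroups them, which is why this approach depends on a format fixed ahead of time.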

  5. Extract Predefined Summary: Methods
     • Extraction using some or all of:
       • NLP for document parsing/chunking (finding the abstract)
       • standard computer science: database retrieval, string processing, etc.
     • Reorganizing may be done using:
       • explicit fields specified by the author
       • keywords searched for in documents
       • business rules which capture knowledge about who is working on what tasks and projects
     • Grouping can shade into document classification for long summaries or an ill-defined match to categories.

  6. Extracting Predefined Summaries: Advantages and Disadvantages
     • Advantages
       • Summaries reflect intent of author.
       • If part of an overall reporting system, can actually make things simpler for the author.
       • Incremental effort for author not large.
     • Disadvantages
       • Incremental effort for author not zero either.
       • Only feasible in a structured situation where the requirement can be defined ahead of time.
       • Can't be used to summarize a group of documents.
       • Not all authors write good summaries.

  7. Capture and Generate
     • Documents can have arbitrary format.
     • Knowledge needed is well-defined.
     • Often the information need is for summarization across multiple documents.
     • Example:
       • Summarizing restaurant reviews. Take newspaper articles and produce price range, kind of food, atmosphere, quality, service.

  8. Capture and Generate: Methods
     • State of the art:
       • Create "template" or "frame"
         • Represent the knowledge you want to capture
       • Extract information to fill in frame
         • Standard information extraction problem
         • Typically relatively large frames with relatively few relations; mostly facts.
       • Generate based on template
         • Relatively simple "fill-in-the-blank"
         • More complex based on parse tree.
     • Still basically research: parse entire document into parse tree tied to rich semantic net; apply rules to trim tree; generate continuous narrative.
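The frame-filling steps above can be sketched for the restaurant-review example from the previous slide. This is a toy illustration under stated assumptions: the `RestaurantFrame` slots, the regular expressions in `PATTERNS`, and the sample review are all hypothetical, and a real information extraction component would be far more elaborate than a few patterns.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestaurantFrame:
    """Hypothetical frame: the knowledge we want to capture from a review."""
    cuisine: Optional[str] = None
    price_range: Optional[str] = None
    service: Optional[str] = None


# Assumed, deliberately tiny extraction patterns, one per slot.
PATTERNS = {
    "cuisine": re.compile(r"\b(Italian|Thai|French|Mexican)\b", re.IGNORECASE),
    "price_range": re.compile(r"(\$\d+\s*(?:-|to)\s*\$\d+)"),
    "service": re.compile(r"\b(attentive|slow|friendly|rude)\s+service\b",
                          re.IGNORECASE),
}


def fill_frame(text):
    """Fill each slot with the first match of its pattern, if any."""
    frame = RestaurantFrame()
    for slot, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            setattr(frame, slot, match.group(1))
    return frame


def generate(frame):
    """Fill-in-the-blank generation: emit one clause per filled slot."""
    parts = []
    if frame.cuisine:
        parts.append(f"Cuisine: {frame.cuisine}.")
    if frame.price_range:
        parts.append(f"Price range: {frame.price_range}.")
    if frame.service:
        parts.append(f"Service: {frame.service}.")
    return " ".join(parts) or "No information extracted."


review = ("The new Thai place downtown has entrees from $12 to $20 "
          "and friendly service.")
frame = fill_frame(review)
summary = generate(frame)
print(summary)
```

Because the frame is filled independently of how the text was written, frames from several reviews of the same restaurant could be merged before generation, which is what makes this approach suit multi-document summaries.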

  9. Capture and Generate: Advantages and Disadvantages
     • Advantages
       • Produces very focused summaries.
       • Can readily incorporate multiple documents.
       • Not dependent on authors.
     • Disadvantages
       • Assumes information need is clearly defined.
       • Information extraction component development time is significant.
       • Document parsing is slow; probably not real-time.
     • Comment
       • Makes no attempt to capture author's intent.

  10. Extract Representative Sentence
     • Document format can be arbitrary.
     • Document content can also be arbitrary; information need not clear-cut.
     • Summarization consists of text extracted directly from the document.
     • Examples:
       • Context returned by Google for each hit
       • Google News summaries

  11. Find Representative Sentences: Method
     • Typically, choose representative individual terms, then broaden to capture the sentences containing those terms. The more terms a sentence contains, the more important it is.
       • If responding to a search or other information request, the search terms are the representative terms.
       • If there is no prior query, use TF*IDF and other bag-of-words (BOW) approaches. May use pairs or n-ary groups of words.
     • May add a layer of rules using position and specific cue phrases such as "In summary,".
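The method on this slide can be sketched as follows. This is one simple reading of it, not a reference implementation: the function names, the smoothed IDF formula, the regex-based tokenizer and sentence splitter, the one-point bonus for "In summary", and the sample texts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase bag-of-words tokens (letters only)."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def top_terms(document, corpus, k=5):
    """Pick the k terms of `document` with the highest TF*IDF score.

    IDF is computed over the background `corpus` with add-one smoothing
    (an assumed variant; many others exist).
    """
    tf = Counter(tokenize(document))
    doc_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus)

    def idf(term):
        df = sum(term in s for s in doc_sets)
        return math.log((1 + n_docs) / (1 + df)) + 1

    scores = {term: count * idf(term) for term, count in tf.items()}
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])


def summarize(document, corpus, query_terms=None, n_sentences=1, k=5):
    """Return the n highest-scoring sentences, in document order."""
    # Search terms are the representative terms when a query exists;
    # otherwise fall back to TF*IDF.
    if query_terms:
        terms = {t.lower() for t in query_terms}
    else:
        terms = top_terms(document, corpus, k)
    sentences = split_sentences(document)

    def score(sentence):
        points = len(terms & set(tokenize(sentence)))
        # Rule layer: reward an explicit cue phrase.
        if sentence.lower().startswith("in summary"):
            points += 1
        return points

    ranked = sorted(range(len(sentences)),
                    key=lambda i: (-score(sentences[i]), i))
    return [sentences[i] for i in sorted(ranked[:n_sentences])]


doc = ("Gene therapy trials began in 2001. The weather was mild. "
       "In summary, gene therapy trials showed promise.")
background = ["the weather was mild today",
              "the trials began in the morning"]

print(summarize(doc, background, query_terms=["gene", "therapy"]))
print(summarize(doc, background, k=2))
```

Both calls select the sentence beginning "In summary," here: it contains both representative terms and also earns the cue-phrase bonus. Scoring only whole sentences is also why extracted summaries tend to read as choppy.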

  12. Find Representative Sentences: Advantages and Disadvantages
     • Advantages
       • Can be applied anywhere.
       • Relatively fast (compared to full parse).
       • Provides a good general idea or feel for content.
       • Can do multiple-document summaries.
     • Disadvantages
       • Often choppy or hard to read.
       • Does poorly when the document doesn't contain good summary sentences.
       • Can miss major information.

  13. Summary
     • Appropriate approach depends on what is known about the documents, the domain, and the information need.
     • All of the major approaches in use provide useful information in a reasonable time frame.
     • None of the automated methods is yet close to a good human summarizer. Research in this area is advancing fast, though.

  14. Some Useful References
     • This has been a seriously simplified presentation; I am focusing mostly on applications. Here are some references for more detail:
       • http://www.cs.unm.edu/~storm/TSPresent.html. Detailed overview of text summarization history, methods and current state.
       • http://www.summarization.com/. Bibliography, tools, conferences, research. Some good resources.
       • http://clg.wlv.ac.uk/help/summarisation.php. Relatively simple overview with some good links.
       • http://citeseer.nj.nec.com/525002.html. Paper on summarization using GATE.
