
CSC 9010: Text Mining Applications, Document Summarization





CSC 9010: Text Mining Applications
Document Summarization

Dr. Paula Matuszek

Paula_A_Matuszek@glaxosmithkline.com

(610) 270-6851


Document Summarization

  • Document Summarization

    • Provide meaningful summary for each document

  • Examples:

    • Search tool returns “context”

    • Monthly progress reports from multiple projects

    • Summaries of news articles on the human genome

  • Often part of a document retrieval system, to help users judge documents better

  • Surprisingly hard to make sophisticated

  • Surprisingly easy to make effective


Document Summarization -- How

Three general approaches:

  • Extract predefined summary.

    • Useful in highly structured environments where you can specify format. Typically very good summaries.

  • Capture in abstract representation, generate summary

    • Useful in well-defined domains with clearcut information needs.

  • Extract representative sentences/clauses.

    • Useful in arbitrarily complex and unstructured domains; broadly applicable, and gets "general feel".


Extract Predefined Summary

  • Documents have a well-defined format.

  • Format includes a summary or abstract explicitly written by document author.

  • Text mining may reorganize, regroup, restructure summaries.

  • Example:

    • People working on multiple projects write monthly reports based on what they have done, one sentence/project.

    • Reporting system collects person-level reports and reorganizes into project-level reports.
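
The regrouping step in this example can be sketched in a few lines. This is only an illustration under assumptions: the people, project names, report sentences, and the function name `to_project_reports` are all invented here, and a real reporting system would read the reports from a database or from form fields.

```python
from collections import defaultdict

# Hypothetical person-level monthly reports: one sentence per project.
person_reports = {
    "Alice": {"Genome DB": "Loaded the March sequence data.",
              "Search UI": "Fixed result paging."},
    "Bob": {"Genome DB": "Tuned the import job."},
}

def to_project_reports(reports):
    """Regroup {person: {project: sentence}} into {project: [(person, sentence)]}."""
    by_project = defaultdict(list)
    for person, entries in sorted(reports.items()):
        for project, sentence in entries.items():
            by_project[project].append((person, sentence))
    return dict(by_project)

project_reports = to_project_reports(person_reports)
for project, lines in sorted(project_reports.items()):
    print(project)
    for person, sentence in lines:
        print(f"  {person}: {sentence}")
```

No text analysis is needed here: because the author supplied the project field explicitly, the summary is produced by restructuring alone.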


Extract Predefined Summary: Methods

  • Extraction using some or all of

    • NLP for document parsing/chunking (finding abstract)

    • standard computer science: database retrieval, string processing, etc.

  • Reorganizing may be done using

    • explicit fields specified by author

    • keywords searched for in documents

    • business rules which capture knowledge about who is working on what tasks and projects

  • Grouping can shade into document classification when summaries are long or match the categories poorly
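
As a rough illustration of the string-processing route, the sketch below pulls an author-written abstract out of a plain-text document. The layout it assumes (a line reading "Abstract" or "Summary", followed by the text and then a blank line) is an assumption for the example, not something the slides specify, and `extract_abstract` is a made-up helper name.

```python
import re

# Assumed layout: a heading line "Abstract" or "Summary" (optionally with a
# colon), then the summary text, ended by a blank line or the end of the text.
ABSTRACT_RE = re.compile(
    r"^(?:abstract|summary)[ \t]*:?[ \t]*\n(.+?)(?=\n\s*\n|\Z)",
    re.IGNORECASE | re.MULTILINE | re.DOTALL,
)

def extract_abstract(text):
    """Return the abstract as one line of text, or None if no heading is found."""
    match = ABSTRACT_RE.search(text)
    if match is None:
        return None
    # Collapse line breaks and repeated spaces inside the abstract.
    return " ".join(match.group(1).split())

doc = "Project X Report\n\nAbstract\nWe study X.\nResults are good.\n\nIntroduction\nDetails follow."
print(extract_abstract(doc))
```

Documents that do not follow the expected layout simply return `None`, which is the practical limit of this approach: it works only where the format can be specified ahead of time.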


Extracting Predefined Summaries: Advantages and Disadvantages

  • Advantages

    • Summaries reflect intent of author.

    • If part of an overall reporting system, can actually make reporting simpler for the author.

    • Incremental effort for author not large.

  • Disadvantages

    • Incremental effort for author not zero either.

    • Only feasible in structured situation where requirement can be defined ahead of time.

    • Can't be used to summarize a group of documents.

    • Not all authors write good summaries.


Capture and Generate

  • Documents can have arbitrary format

  • Knowledge needed is well-defined.

  • Often information need is for summarizations across multiple documents

  • Example:

    • Summarizing restaurant reviews. Take newspaper articles and produce price range, kind of food, atmosphere, quality, service.


Capture and Generate: Methods

  • State of the art:

    • Create "template" or "frame"

      • Represent the knowledge you want to capture

    • Extract Information to fill in frame

      • Standard information extraction problem

      • Typically relatively large frames with relatively few relations; mostly facts.

    • Generate based on template

      • Relatively simple "fill-in-the-blank"

      • More complex based on parse tree.

  • Still basically research: parse entire document into parse tree tied to rich semantic net; apply rules to trim tree; generate continuous narrative.
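
The frame, extract, generate pipeline can be sketched for the restaurant-review example. Everything specific below is an assumption made for illustration: the slot names, the hand-written regular expressions (a stand-in for a real information extraction component), the cuisine list, and the sample review.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class RestaurantFrame:
    """The "frame": the facts we want to capture from a review."""
    name: Optional[str] = None
    cuisine: Optional[str] = None
    price_range: Optional[str] = None

# Hypothetical patterns; a real system would use a trained or rule-based extractor.
CUISINES = ("Italian", "Thai", "French", "Mexican")
NAME_RE = re.compile(r"\bat ([A-Z][\w']*(?: [A-Z][\w']*)*)")
PRICE_RE = re.compile(r"\$(\d+)\s*(?:-|to)\s*\$(\d+)")

def fill_frame(review):
    """Extract information from the review text to fill the frame's slots."""
    frame = RestaurantFrame()
    name = NAME_RE.search(review)
    if name:
        frame.name = name.group(1)
    for cuisine in CUISINES:
        if re.search(rf"\b{cuisine}\b", review, re.IGNORECASE):
            frame.cuisine = cuisine
            break
    price = PRICE_RE.search(review)
    if price:
        frame.price_range = f"${price.group(1)}-${price.group(2)}"
    return frame

def generate(frame):
    """Simple fill-in-the-blank generation; unfilled slots are skipped."""
    sentence = frame.name or "This restaurant"
    sentence += f" serves {frame.cuisine} food" if frame.cuisine else " serves food"
    if frame.price_range:
        sentence += f", with entrees priced {frame.price_range}"
    return sentence + "."

review = ("We had dinner at Luigi's Kitchen last week. "
          "The Italian menu is short, and entrees run $12 to $20.")
print(generate(fill_frame(review)))
```

The generated sentence contains none of the reviewer's wording, which is why this approach makes no attempt to capture the author's intent.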


Capture and Generate: Advantages and Disadvantages

  • Advantages:

    • Produces very focused summaries.

    • Can readily incorporate multiple documents.

    • Not dependent on authors

  • Disadvantages

    • Assumes information need is clearly defined.

    • Information extraction component development time is significant

    • Document parsing slow; probably not real-time.

  • Comment:

    • Makes no attempt to capture author's intent


Extract Representative Sentence

  • Document format can be arbitrary

  • Document content can also be arbitrary; the information need is not clear-cut

  • Summarization consists of text extracted directly from document.

  • Examples:

    • Context returned by Google for each hit

    • Google News summaries.
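
To make "context returned for each hit" concrete, here is a minimal sketch that returns the sentence of a document containing the most distinct query terms. It is an illustration only and not how any particular search engine builds its snippets; the sentence splitter is deliberately naive, and the function names and sample text are invented.

```python
import re

def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def best_context(text, query):
    """Return the sentence with the most distinct query terms (first wins ties)."""
    terms = set(re.findall(r"\w+", query.lower()))
    best, best_hits = None, 0
    for sentence in split_sentences(text):
        words = set(re.findall(r"\w+", sentence.lower()))
        hits = len(terms & words)
        if hits > best_hits:
            best, best_hits = sentence, hits
    return best  # None when no sentence contains any query term

text = ("The genome project began in 1990. "
        "Sequencing the human genome took over a decade. "
        "Funding came from several agencies.")
print(best_context(text, "human genome"))
```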


Find Representative Sentences: Method

  • Typically, choose representative individual terms, then broaden to capture sentence containing terms. The more terms contained, the more important the sentence.

    • If in response to a search or other information request, the search terms are representative

    • If there is no prior query, use TF*IDF and other bag-of-words (BOW) approaches. May use pairs or n-ary groups of words.

  • May add a layer of rules using position, some specific phrases such as "In summary,".
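
When there is no query, the method above can be sketched as follows: weight each term by TF*IDF, score each sentence by the weights of the distinct terms it contains, and add a small rule layer for a cue phrase. The details are assumptions chosen for a short example: the IDF formula, the background `corpus` argument, the "in summary" bonus value, and all function names. Summing weights without length normalization also favors long sentences.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tfidf_weights(document, corpus):
    """TF*IDF weight for each term of `document`; `corpus` is background text."""
    tf = Counter(tokenize(document))
    n_docs = len(corpus) + 1  # background documents plus this one
    doc_sets = [set(tokenize(d)) for d in corpus]
    weights = {}
    for term, count in tf.items():
        df = 1 + sum(term in s for s in doc_sets)  # documents containing term
        weights[term] = count * math.log(n_docs / df)
    return weights

def summarize(document, corpus, n_sentences=1, cue_bonus=1.0):
    """Return the top-scoring sentences, in their original order."""
    weights = tfidf_weights(document, corpus)
    scored = []
    for position, sentence in enumerate(split_sentences(document)):
        score = sum(weights.get(t, 0.0) for t in set(tokenize(sentence)))
        if sentence.lower().startswith("in summary"):
            score += cue_bonus  # rule layer: reward an explicit cue phrase
        scored.append((score, position, sentence))
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:n_sentences]
    return [sentence for _, _, sentence in sorted(top, key=lambda item: item[1])]

document = ("The meeting was on Monday. "
            "Gene expression data showed that the gene is active. "
            "The weather was fine.")
corpus = ["The meeting was long.", "The weather on Monday was cold.", "That was fine."]
print(summarize(document, corpus))
```

Terms that appear in every background document (here "was") get a weight of zero, so sentences made mostly of common words score low.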


Find Representative Sentences: Advantages and Disadvantages

  • Advantages

    • Can be applied anywhere.

    • Relatively fast (compared to full parse)

    • Provides a good general idea or feel for content.

    • Can do multiple-document summaries.

  • Disadvantages

    • Often choppy or hard to read

    • Does poorly when document doesn't contain good summary sentences.

    • Can miss major information


Summary

  • Appropriate approach depends on what is known about the documents, the domain, and the information need.

  • All of the major approaches in use provide useful information in a reasonable time frame.

  • None of the automated methods is yet close to a good human summarizer. Research in this area is advancing fast, though.


Some Useful References

  • This has been a seriously simplified presentation; I am focusing mostly on applications. Here are some references for more detail:

  • http://www.cs.unm.edu/~storm/TSPresent.html. Detailed overview of text summarization history, methods and current state.

  • http://www.summarization.com/. Bibliography, tools, conferences, research. Some good resources.

  • http://clg.wlv.ac.uk/help/summarisation.php. Relatively simple overview with some good links.

  • http://citeseer.nj.nec.com/525002.html. Paper on summarization using GATE.