
Introduction


Presentation Transcript


  1. Introduction What is Text Summarization?

  2. Introduction What is Text Summarization? A summary.

  3. Introduction What is Text Summarization? An automatically generated summary.

  4. Introduction What is Text Summarization? An automatically generated summary of a document or collection.

  5. Introduction What is Text Summarization? An automatically generated summary of a document or collection which is at least as good as a human can produce.

  6. Introduction We do not yet know good ways of doing this, so which other fields can we borrow from to do what we need? • Information Extraction • Information Retrieval • Text Mining • Text Generation

  7. Types of Text Summarization What types of summaries are there? • indicative versus informative • extract versus abstract • generic versus query-oriented • background versus just-the-news • single-document versus multi-document source

  8. Types of Text Summarization Summarization tasks can vary in what information is considered the source: • Summaries can look at all the information in a document(s) or • only the information that is deemed relevant for a specific task

  9. Types of Text Summarization This can be re-stated as: • top-down (query-driven focus) versus • bottom-up (text-driven focus)

  10. What do Human Summarizers Do? Generally, • delete extraneous information • generalize concepts • make concepts more compact

  11. What do Human Summarizers Do? Example: Father was washing dishes. Mother was working on her new book. The daughter was busy painting the window frames. After summarization: The whole family was busy.

  12. What do Human Summarizers Do? Example 2: Father was washing dishes. Mother was working on her new book. The daughter was busy painting the window frames. All of a sudden, the publisher called in and told mother that he needed the manuscript a month earlier than foreseen. Father left the dishes and finished the drawings instead. The daughter dropped the brush and rushed to do the proofreading. Supported by her family, mother managed to finish her book in time.

  13. What do Human Summarizers Do? • The topic of the story has shifted • The example stresses the importance of understanding the entire story before abstracting from it • Humans read the entire document before summarizing • Computational approaches can look at the entire document or only the subpart related to the task

  14. What do Human Summarizers Do? Discourse Cues that aid in summarization: • knowledge of the topic domain • syntactic cues (topic-comment, connectives (but, however, because, for example)) • stylistic and rhetorical cues (The most pressing thing to do was, I conclude that) • structural cues (narrative structure) • context or situational cues

  15. What do Human Summarizers Do? General strategies: • What to keep: facts, items relating to the topic, items that discuss purpose, items that are stated positively, items that contrast other items, items that are stressed • What to delete: reasons, comments, examples

  16. What do Human Summarizers Do? Studies on consistency found that when abstracting documents: • A single human subject varies widely in consistency when abstracting the same article at two different times • Variation among different abstractors was even greater • Despite this low consistency, all of the abstracts produced were adequate

  17. Computational Approaches How do we do Text Summarization? • Knowledge-based • Selection-based

  18. Historical Approaches The first text summarization algorithm, by Luhn (1958): 1. words are read in from the text; 2. common/non-substantive words are deleted through table look-up; 3. content words are stored, along with their position in the text and any punctuation immediately to the left and/or right of the word; 4. content words are sorted alphabetically
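Steps 1-4 above can be sketched in a few lines of Python; the stopword set here is a tiny illustrative stand-in for Luhn's look-up table of common words:

```python
import re

# Tiny illustrative stand-in for Luhn's table of common/non-substantive words.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "was", "in", "on"}

def preprocess(text):
    """Steps 1-4: tokenize the text, delete common words via table look-up,
    store each content word with its position, and sort alphabetically.
    (Step 3's adjacent-punctuation bookkeeping is omitted in this sketch.)"""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    content = [(tok, pos) for pos, tok in enumerate(tokens)
               if tok not in STOPWORDS]
    return sorted(content)  # step 4: alphabetical order
```

For example, `preprocess("The operations of the products vary")` keeps only the content words, each paired with its position in the text.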

  19. Historical Approaches Luhn Algorithm (cont.): 5. similar spellings are consolidated into word types (a rough approximation of a stemmer); 5a. any pair of tokens with fewer than seven non-matching letters is considered to be of the same word type, e.g., frequently vs. frequent: 10 letters, 8 match, 2 non-match
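The consolidation heuristic in step 5a can be sketched as a prefix comparison; the seven-letter cutoff is the one stated on the slide, and the function is only a rough approximation, as the slide itself notes:

```python
def same_word_type(a, b, max_nonmatch=7):
    """Step 5a: two tokens are consolidated into one word type when fewer
    than `max_nonmatch` letters fail to match. Letters are compared from
    the start of each spelling, as in the frequently/frequent example."""
    matches = 0
    for x, y in zip(a, b):
        if x != y:
            break
        matches += 1
    non_matches = max(len(a), len(b)) - matches
    return non_matches < max_nonmatch
```

Here `same_word_type("frequently", "frequent")` holds (10 letters, 8 match, 2 non-match). Note that short, unrelated tokens can also pass this test, which is part of why it is only a rough stand-in for a stemmer.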

  20. Historical Approaches Luhn Algorithm (cont.): 5b. the frequencies of word types are compared; 5c. low-frequency word types are deleted; 5d. the remaining words are considered significant. Problem: anaphora. Frequency counts miss a repeated concept with different surface forms, e.g., "white elephant" later referred to as "those big animals" or "they are big and white".

  21. Historical Approaches Luhn Algorithm (cont.): 6. remaining word types are sorted into location order; 7. sentence representativeness is determined by dividing sentences into substrings defined by the distances between significant words. Example: "Better to see you with, my dear" is divided into the substrings "Better to", "to see", "you with, my", and "with, my dear".

  22. Historical Approaches 8. for each substring, a representativeness value is calculated by dividing the number of representative tokens in the cluster by the total number of tokens in the cluster; 9. sentences reaching a value above a preset threshold are selected for inclusion. Example: "Better to see you with, my dear." Substring 1: Better (2) to: 2/2 = 1; Substring 2: to see (4): 4/2 = 2; Substring 3: you (6) with, my: 6/3 = 2; Substring 4: with, my dear (1): 1/3 = 0.333. Total value for the sentence = 5.33
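Steps 7-9 can be sketched as follows. This reading assumes each substring runs from one significant word up to (but not including) the next, and uses the count-based rule of step 8; the worked example on the slide sums per-word weights rather than counts, a detail the transcript does not fully specify:

```python
def luhn_sentence_score(tokens, significant):
    """Steps 7-9: divide the sentence into substrings anchored at
    significant words, score each substring as (significant tokens) /
    (total tokens), and sum the substring values."""
    anchors = [i for i, t in enumerate(tokens) if t in significant]
    if not anchors:
        return 0.0  # no significant words, nothing representative
    total = 0.0
    for j, start in enumerate(anchors):
        end = anchors[j + 1] if j + 1 < len(anchors) else len(tokens)
        cluster = tokens[start:end]
        sig = sum(1 for t in cluster if t in significant)
        total += sig / len(cluster)
    return total
```

A sentence whose total exceeds a preset threshold would then be selected for the extract (step 9).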

  23. Historical Approaches TRW (1960s) builds upon the Luhn model by: • adding weights for words that occurred in the title or subtitles of the document • giving sentences earlier or later in a paragraph higher weights than those in the middle However, the largest drawback at this point is that whole sentences are extracted, not rewritten.
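The TRW extensions can be sketched as extra terms added on top of a Luhn-style sentence score; the bonus weights below are illustrative assumptions, not values from the original system:

```python
def trw_score(base_score, sentence_tokens, title_words,
              index_in_paragraph, paragraph_length,
              title_bonus=1.0, edge_bonus=0.5):
    """TRW-style adjustments: extra weight for words that occur in the
    title or subtitles, and extra weight for sentences at the start or
    end of a paragraph (sentences in the middle get no bonus)."""
    score = base_score
    score += title_bonus * sum(1 for t in sentence_tokens if t in title_words)
    if index_in_paragraph in (0, paragraph_length - 1):
        score += edge_bonus
    return score
```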

  24. Historical Approaches Models Influenced by Cognitive Science • make use of frames and scripts to simulate schemas, which are formats of knowledge representations • FRUMP • PAULINE

  25. Historical Approaches FRUMP: • expectation-driven model • its knowledge base consists of sketchy scripts • looks for instances of the knowledge base in the text to be summarized • full parsing is not necessary for this method to work

  26. Historical Approaches PAULINE: • pragmatically driven • can generate 100 different summaries from 1 original • initially asks the user for information to help guide its behavior • asks the user for conversation topics • collects information on the topic and then creates sentences • pragmatics used include: making the listener like the speaker, using a "highfalutin" tone of voice, and persuading the listener to change their opinion

  27. Current Approaches Newer methods are characterized by: • stochastic methods • integration of corpus linguistics • shallow parsing methods • lexical semantics knowledge through use of WordNet • integration of different methods in one model • summarization from structured knowledge • integration of information from different media

  28. Current Approaches Using related fields: • IE • DB • Compression • Text Generation

  29. Current Approaches Think Smaller! [Sentence Compression]

  30. Current Approaches Sentence Compression Noisy Channel

  31. Current Approaches Sentence Compression Source Channel Decoder

  32. Current Approaches Sentence Compression Focus of the Compression

  33. Current Approaches Sentence Compression Sentences or Trees?

  34. Current Approaches Sentence Compression Q: So, how do we do it? A: By the probability that the original sentence is an expansion of the generated sentence
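The noisy-channel idea on this slide can be sketched as scoring each candidate compression c by log P(c) + log P(original | c) and keeping the best. The two model functions below are stand-in callables, since the slides do not specify the actual source and channel models:

```python
def rank_compressions(candidates, source_logprob, channel_logprob, original):
    """Score each candidate compression c by
    source_logprob(c) + channel_logprob(original, c),
    i.e. how fluent c is on its own, combined with how likely the original
    sentence is as an expansion of c; return the candidates best-first."""
    scored = [(source_logprob(c) + channel_logprob(original, c), c)
              for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]
```

With toy stand-ins (say, a length penalty for the source model and a dropped-word penalty for the channel), this reproduces the trade-off between brevity and faithfulness seen in the scored examples that follow.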

  35. Current Approaches Example Beyond that basic level, the operations of the three products vary widely (1514588)

  36. Current Approaches Example Beyond that level, the operations of the three products vary widely (1430374)

  37. Current Approaches Example Beyond that level, the operations of the three products vary (1249223)

  38. Current Approaches Example Beyond that basic level, the operations of the products vary (1181377)

  39. Current Approaches Example The operations of the three products vary widely (939912)

  40. Current Approaches Example The operations of the products vary widely (872066)

  41. Current Approaches Example The operations of the products vary (748761)

  42. Current Approaches Example The operations of products vary (809158)

  43. Current Approaches Example The operations vary (522402)

  44. Current Approaches Example Operations vary (662642)

  45. Current Approaches Example Finally, another advantage of broadband is distance.

  46. Current Approaches Example Finally another advantage of broadband is distance.

  47. Current Approaches Example Another advantage of broadband is distance.

  48. Current Approaches Example Advantage of broadband is distance.

  49. Current Approaches Example Another advantage is distance.

  50. Current Approaches Example Advantage is distance.
