807 - TEXT ANALYTICS

Massimo Poesio
Lecture 10: Summarization

What is summarization?

To take an information source, extract content from it, and present the most important content to the user in a condensed form and in a manner sensitive to the user’s application needs

Single-document summarization

Flu stopper

A new compound is set for human testing (Times)

Running nose. Raging fever. Aching joints. Splitting headache. Are there any poor souls suffering from the flu this winter who haven’t longed for a pill to make it all go away? Relief may be in sight. Researchers at Gilead Sciences, a pharmaceutical company in Foster City, California, reported last week in the Journal of the American Chemical Society that they have discovered a compound that can stop the influenza virus from spreading in animals. Tests on humans are set for later this year.

The new compound takes a novel approach to the familiar flu virus. It targets an enzyme, called neuraminidase, that the virus needs in order to scatter copies of itself throughout the body. This enzyme acts like a pair of molecular scissors that slices through the protective mucous linings of the nose and throat. After the virus infects the cells of the respiratory system and begins replicating, neuraminidase cuts the newly formed copies free to invade other cells. By blocking this enzyme, the new compound, dubbed GS 4104, prevents the infection from spreading.


Multi-document summarization

MULTI-DOCUMENT summarization (doing this from a large number of news items) is a particularly popular application

Human summarization and abstracting
  • What professional abstractors do
  • Ashworth:
      • “To take an original article, understand it and pack it neatly into a nutshell without loss of substance or clarity presents a challenge which many have felt worth taking up for the joys of achievement alone. These are the characteristics of an art form”.
Original version: There were significant positive associations between the concentrations of the substance administered and mortality in rats and mice of both sexes. There was no convincing evidence to indicate that endrin ingestion induced any of the different types of tumors which were found in the treated animals.

Edited version: Mortality in rats and mice of both sexes was dose related. No treatment-related tumors were found in any of the animals.

Cremmins 82, 96
Computational Approach: Basics
  • Bottom-Up:
    • I’m dead curious: what’s in the text?
    • User needs: anything that’s important
    • System needs: generic importance metrics, used to rate content
  • Top-Down:
    • I know what I want! Don’t confuse me with drivel!
    • User needs: only certain types of info
    • System needs: particular criteria of interest, used to focus search

Query-Driven vs. Text-Driven Focus
  • Top-down: Query-driven focus
    • Criteria of interest encoded as search specs.
    • System uses specs to filter or analyze text portions.
    • Examples: templates with slots with semantic characteristics; termlists of important terms.
  • Bottom-up: Text-driven focus
    • Generic importance metrics encoded as strategies.
    • System applies strategies over rep of whole text.
    • Examples: degree of connectedness in semantic graphs; frequency of occurrence of tokens.


Types of summaries
  • Extracts
    • Sentences from the original document are displayed together to form a summary
  • Abstracts
    • Material is transformed: paraphrased, restructured, shortened
Ideal stages of summarization
  • Analysis
    • Input representation and understanding
  • Transformation
    • Selecting important content
  • Realization
    • Generating novel text corresponding to the gist of the input
What current systems do
  • Most work bottom-up
  • Typically use shallow analysis methods
    • Rather than full understanding
  • Work by sentence extraction
    • Identify important sentences and piece them together to form a summary
  • More advanced work: move towards more abstractive summarization
Shallow approaches
  • Relying on features of the input documents that can be easily computed via statistical analysis
  • Word statistics
  • Cue phrases
  • Section headers
  • Sentence position
What is the input?
  • News, or clusters of news
    • a single article or several articles on a related topic
  • Email and email threads
  • Scientific articles
  • Health information: patients and doctors
  • Meeting summarization
  • Video
What is the output?
  • Keywords
  • Highlight information in the input
  • Chunks of text or speech taken directly from the input, or paraphrases that aggregate the input in novel ways
  • Modality: text, speech, video, graphics
Supervised methods
  • Ask people to select sentences
  • Use these as training examples for machine learning
    • Each sentence is represented as a number of features
    • Based on the features distinguish sentences that are appropriate for a summary and sentences that are not
  • Run on new inputs
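The pipeline above can be sketched as follows. This is a minimal illustration, not any published system: the feature set (length, relative position, title-word overlap) and the function names are assumptions, and the resulting vectors would be fed to any off-the-shelf classifier.

```python
# Sketch of the supervised extractive pipeline: turn each sentence into a
# feature vector, pair it with the human inclusion label, and hand the
# result to a classifier.

def sentence_features(sentence, position, n_sentences, title_words):
    """One sentence -> a small illustrative feature vector."""
    words = sentence.lower().split()
    return [
        len(words),                          # sentence length
        position / max(n_sentences - 1, 1),  # relative position in document
        len(set(words) & title_words),       # overlap with title words
    ]

def build_training_data(labelled_docs):
    """labelled_docs: list of (sentences, title_words, labels) triples,
    where labels[i] is 1 if annotators put sentence i in the summary."""
    X, y = [], []
    for sentences, title_words, labels in labelled_docs:
        for i, s in enumerate(sentences):
            X.append(sentence_features(s, i, len(sentences), title_words))
            y.append(labels[i])
    return X, y  # feed to any off-the-shelf classifier
```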
Edmundson 69
  • Cue method:
    • stigma words (“hardly”, “impossible”)
    • bonus words (“significant”)
  • Key method:
    • similar to Luhn
  • Title method:
    • title + headings
  • Location method:
    • sentences under headings
    • sentences near beginning or end of document and/or paragraphs (also [Baxendale 58])
Edmundson 69
  • Linear combination of four features: α1C + α2K + α3T + α4L
  • Manually labelled training corpus
  • Key not important!

[Figure: bar chart comparing the combinations C + T + L and C + K + T + L with LOCATION, CUE, TITLE, KEY and RANDOM baselines, on a 0–100% scale]
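The linear combination can be sketched as below. The weight values are illustrative only (they echo the “Key not important” finding; the actual tuned weights are not given in the slides):

```python
# Edmundson-style scoring: a sentence's score is a weighted sum of its
# Cue, Key, Title and Location feature values.

ALPHAS = {"cue": 1.0, "key": 0.0, "title": 1.0, "location": 1.0}  # illustrative

def edmundson_score(features):
    """features: dict mapping feature name -> value for one sentence."""
    return sum(ALPHAS[name] * features.get(name, 0.0) for name in ALPHAS)
```

Sentences are then ranked by this score and the top ones extracted.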

Kupiec et al. 95
  • Extracts of roughly 20% of original text
  • Feature set:
    • sentence length: |S| > 5
    • fixed phrases: 26 manually chosen
    • paragraph: sentence position in paragraph
    • thematic words
    • uppercase words: not common acronyms
  • Label: binary, whether the sentence is included in the manual extract
  • Corpus: 188 document + summary pairs from scientific journals
Kupiec et al. 95
  • Uses a Bayesian classifier:

    P(s ∈ S | F1, …, Fk) = P(F1, …, Fk | s ∈ S) · P(s ∈ S) / P(F1, …, Fk)

  • Assuming statistical independence of the features:

    P(s ∈ S | F1, …, Fk) = Πj P(Fj | s ∈ S) · P(s ∈ S) / Πj P(Fj)
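The independence-based score can be computed in log space, as in this sketch; the probability tables passed in are placeholders that would be estimated from the labelled corpus.

```python
import math

# Naive-Bayes-style score: with independent features,
# P(s in summary | F1..Fk) is proportional to
# P(s in summary) * prod_j P(Fj | s in summary) / prod_j P(Fj).

def kupiec_score(feature_values, p_summary, p_feat_given_summary, p_feat):
    """Log-space score proportional to P(s in summary | features).

    feature_values: {feature_name: observed_value}
    p_feat_given_summary / p_feat: {feature_name: {value: probability}}
    """
    score = math.log(p_summary)
    for f, v in feature_values.items():
        score += math.log(p_feat_given_summary[f][v]) - math.log(p_feat[f][v])
    return score
```

Sentences with the highest scores form the extract.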
Kupiec et al. 95
  • Performance:
    • For 25% summaries, 84% precision
    • For smaller summaries, 74% improvement over Lead
A typical modern supervised summarization system
  • Or, what you could do if asked to do one …
Features
  • Location
    • Absolute location of the sentence
    • Section structure: first sentence, last sentence, other
    • Paragraph structure
  • What section the sentence appeared in
    • Introduction, implementation, example, conclusion, result, evaluation, experiment etc
More features
  • Sentence length
    • Very long and very short sentences are unusual
  • Title word overlap
  • Tf.idf word content
    • Binary feature
    • “yes” if the sentence contains one of the 18 most important words
    • “no” otherwise
More features
  • Presence and type of citation
  • Formulaic expressions
    • “in traditional approaches”, “a novel method for”
Problems with supervised methods for summarization
  • Annotation is expensive
    • Here: relevance and rhetorical status judgments
  • People don’t agree
    • So more annotators are necessary
    • And/or more training of the annotators
Unsupervised methods for (extractive) summarization: basic idea
  • Compute word probability from input
  • Compute sentence weight as function of word probability
  • Pick best sentence
Sentence ranking options
  • Based on word probability:
    • weight(S) = (1/n) Σi pi
    • S is a sentence of length n
    • pi is the probability of the i-th word in the sentence
  • Based on word tf.idf
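The word-probability option can be sketched as follows (a minimal illustration; the helper name is an assumption):

```python
from collections import Counter

# Word-probability ranking: p(w) = count(w) / N over the whole input;
# a sentence's weight is the average probability of its words.

def sentence_weights(sentences):
    words = [w for s in sentences for w in s.lower().split()]
    p = {w: c / len(words) for w, c in Counter(words).items()}
    return [sum(p[w] for w in s.lower().split()) / len(s.split())
            for s in sentences]
```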
Centrality measures
  • How representative is a sentence of the overall content of a document
    • The more similar a sentence is to the document, the more representative it is
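One simple way to operationalize this idea, as a sketch: represent the sentence and the whole document as bags of words and use cosine similarity as the centrality score.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centrality(sentences):
    """Score each sentence by its similarity to the whole document."""
    doc = Counter(w for s in sentences for w in s.lower().split())
    return [cosine(Counter(s.lower().split()), doc) for s in sentences]
```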
Beyond word-based sentence extraction
  • Discourse information
    • Resolve anaphora, text structure
  • Use external lexical resources
    • Wordnet, adjective polarity lists, opinion
  • Using machine learning
The role of discourse structure
  • Claim: The multi-sentence coherence structure of a text can be constructed, and the ‘centrality’ of the textual units in this structure reflects their importance.
  • Tree-like representation of texts in the style of Rhetorical Structure Theory (Mann and Thompson,88).
  • Use the discourse representation in order to determine the most important textual units. Attempts:
    • (Ono et al., 94) for Japanese.
    • (Marcu, 97) for English.


Rhetorical parsing (Marcu, 97)

[With its distant orbit {– 50 percent farther from the sun than Earth –} and slim atmospheric blanket,1] [Mars experiences frigid weather conditions.2] [Surface temperatures typically average about –60 degrees Celsius (–76 degrees Fahrenheit) at the equator and can dip to –123 degrees C near the poles.3] [Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion,4] [but any liquid water formed that way would evaporate almost instantly5] [because of the low atmospheric pressure.6]

[Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop,7] [most Martian weather involves blowing dust or carbon dioxide.8] [Each winter, for example, a blizzard of frozen carbon dioxide rages over one pole, and a few meters of this dry-ice snow accumulate as previously frozen carbon dioxide evaporates from the opposite polar cap.9] [Yet even on the summer pole, {where the sun remains in the sky all day long,} temperatures never warm enough to melt frozen water.10]


Rhetorical parsing (3)

[Figure: RST discourse tree for the Mars text, linking units 1–10 with relations including Elaboration, Example, Antithesis, Concession, Background, Justification, Contrast, Evidence and Cause]

Summarization = selection of the most important units:

2 > 8 > 3, 10 > 1, 4, 5, 7, 9 > 6

Argumentative zoning
  • What is the purpose of the sentence? To communicate
    • Background
    • Aim
    • Basis (related work)
  • How can we know which sentence serves each aim?
Selecting important sentences (relevance)
  • How well can it be performed by people?
    • Rather subjective; depends on prior knowledge and interests
  • Even the same person would select 50% different sentences if she performs the task at different times
  • Still, judgments can be solicited by several people to mitigate the problem
  • For each sentence in an article, say if it is important and interesting enough to be included in a summary
Multi-document summarization
  • Very useful for presenting and organizing search results
    • Many results are very similar, and grouping closely related documents helps cover more event facets
    • Summarizing similarities and differences between documents
Standard Approaches
  • Salient information = similarities
      • Pairwise similarity between all sentences
      • Cluster sentences using similarity score (Themes)
      • Generate one sentence for each theme
        • Sentence extraction (one sentence/cluster)
        • Sentence fusion: intersect sentences within a theme and choose the repeated phrases. Generate sentence from phrases
  • Salient information = important words
      • Important words are simply the most frequent in the document set
      • SumBasic simply chooses sentences with the most frequent words. Conroy expands on this
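A minimal SumBasic-style sketch: greedily pick the sentence with the highest average word probability, then down-weight the chosen words (squaring is the standard re-weighting step) so repeated content is not selected again. Function and variable names are illustrative.

```python
from collections import Counter

def sumbasic(sentences, n_pick):
    """Greedy frequency-based extraction with probability re-weighting."""
    words = [w for s in sentences for w in s.lower().split()]
    p = {w: c / len(words) for w, c in Counter(words).items()}
    chosen, remaining = [], list(sentences)
    while remaining and len(chosen) < n_pick:
        best = max(remaining,
                   key=lambda s: sum(p[w] for w in s.lower().split())
                                 / len(s.split()))
        chosen.append(best)
        remaining.remove(best)
        for w in set(best.lower().split()):
            p[w] **= 2  # down-weight words already covered by the summary
    return chosen
```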
MEAD (Radev et al. 00)
  • Centroid-based
  • Based on sentence utility
  • Topic detection and tracking initiative [Allan et al. 98, Wayne 98]


ARTICLE 18853: ALGIERS, May 20 (AFP)

ARTICLE 18854: ALGIERS, May 20 (UPI)

1. Eighteen decapitated bodies have been found in a mass grave in northern Algeria, press reports said Thursday, adding that two shepherds were murdered earlier this week.2. Security forces found the mass grave on Wednesday at Chbika, near Djelfa, 275 kilometers (170 miles) south of the capital.3. It contained the bodies of people killed last year during a wedding ceremony, according to Le Quotidien Liberte.4. The victims included women, children and old men.5. Most of them had been decapitated and their heads thrown on a road, reported the Es Sahafa.6. Another mass grave containing the bodies of around 10 people was discovered recently near Algiers, in the Eucalyptus district.7. The two shepherds were killed Monday evening by a group of nine armed Islamists near the Moulay Slissen forest.8. After being injured in a hail of automatic weapons fire, the pair were finished off with machete blows before being decapitated, Le Quotidien d'Oran reported.9. Seven people, six of them children, were killed and two injured Wednesday by armed Islamists near Medea, 120 kilometers (75 miles) south of Algiers, security forces said.10. The same day a parcel bomb explosion injured 17 people in Algiers itself.11. Since early March, violence linked to armed Islamists has claimed more than 500 lives, according to press tallies.

1. Algerian newspapers have reported that 18 decapitated bodies have been found by authorities in the south of the country.2. Police found the ``decapitated bodies of women, children and old men,with their heads thrown on a road'' near the town of Jelfa, 275 kilometers (170 miles) south of the capital Algiers.3. In another incident on Wednesday, seven people -- including six children -- were killed by terrorists, Algerian security forces said.4. Extremist Muslim militants were responsible for the slaughter of the seven people in the province of Medea, 120 kilometers (74 miles) south of Algiers.5. The killers also kidnapped three girls during the same attack, authorities said, and one of the girls was found wounded on a nearby road.6. Meanwhile, the Algerian daily Le Matin today quoted Interior Minister Abdul Malik Silal as saying that ``terrorism has not been eradicated, but the movement of the terrorists has significantly declined.''7. Algerian violence has claimed the lives of more than 70,000 people since the army cancelled the 1992 general elections that Islamic parties were likely to win.8. Mainstream Islamic groups, most of which are banned in the country, insist their members are not responsible for the violence against civilians.9. Some Muslim groups have blamed the army, while others accuse ``foreign elements conspiring against Algeria.’’

MEAD
  • INPUT: Cluster of d documents with n sentences (compression rate = r)
  • OUTPUT: (n × r) sentences from the cluster with the highest values of SCORE

SCORE(si) = wc·Ci + wp·Pi + wf·Fi
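The MEAD-style score and the top-(n × r) selection can be sketched as below; the weight values are illustrative, and Ci, Pi, Fi stand for the centroid, position and first-sentence-overlap values of sentence i.

```python
def mead_score(c, p, f, wc=1.0, wp=1.0, wf=1.0):
    """Weighted combination of centroid (c), position (p), overlap (f)."""
    return wc * c + wp * p + wf * f

def select_summary(scored, n, r):
    """scored: list of (sentence, score); keep the top n*r sentences."""
    k = max(1, int(n * r))
    return [s for s, _ in sorted(scored, key=lambda x: -x[1])[:k]]
```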

Scientific article summarization
  • Not only what the article is about, but also how it relates to work it cites
  • Determine which approaches are criticized and which are supported
    • Automatic genre specific summaries are more useful than original paper abstracts
Other uses
  • Document indexing for information retrieval
  • Automatic essay grading, topic identification module
Evaluating summarization: the problem
  • Which human summary makes a good gold standard? Many summaries are good
  • At what granularity is the comparison made?
  • When can we say that two pieces of text match?
Evaluation
  • Many measures for extractive summarization
    • E.g., ROUGE
  • New ones for abstractive summarization
    • E.g., Pyramids
Radev: Cluster-Based Sentence Utility

CBSU method: a summary sentence extraction method

CBSU(system, ideal) = % of ideal utility covered by system summary

[Table: sentences S1–S10 of a cluster; an ideal summary and two system summaries each mark sentences as included (+) or not (−), with per-sentence utility scores such as S1: 10, S2: 8–9, S3: 2–4]

Relative utility

RU = 13 / 17 ≈ 0.765

ROUGE: Recall-Oriented Understudy for Gisting Evaluation
  • n-gram co-occurrence metrics measuring content overlap
  • ROUGE-N = (count of n-gram overlaps between candidate and model summaries) / (total n-grams in the model summaries)
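A sketch of the ROUGE-N computation just described. This is a simplified reimplementation for illustration, not the official ROUGE toolkit.

```python
from collections import Counter

def ngrams(text, n):
    """Counter of n-grams (as tuples) in a whitespace-tokenized text."""
    toks = text.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def rouge_n(candidate, references, n=1):
    """Clipped n-gram recall of the candidate against model summaries."""
    cand = ngrams(candidate, n)
    overlap = total = 0
    for ref in references:
        r = ngrams(ref, n)
        overlap += sum(min(c, r[g]) for g, c in cand.items() if g in r)
        total += sum(r.values())
    return overlap / total if total else 0.0
```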

ROUGE
  • Experimentation with different units of comparison: unigrams, bigrams, longest common substring, skip-bigrams, basic elements
  • Automatic and thus easy to apply
  • Important to consider confidence intervals when determining differences between systems
    • Scores falling within same interval not significantly different
    • Rouge scores place systems into large groups: can be hard to definitively say one is better than another
  • Sometimes results unintuitive:
    • Multilingual scores as high as English scores
    • Use in speech summarization shows no discrimination
  • Good for training regardless of intervals: can see trends
Pyramids
  • Human evaluation of content: Nenkova & Passonneau (2004)
  • based on the distribution of content in a pool of summaries
  • Summarization Content Units (SCU):
    • fragments from summaries
    • identification of similar fragments across summaries
      • “13 sailors have been killed” ~ “rebels killed 13 people”
  • SCU have
    • id, a weight, a NL description, and a set of contributors
  • SCU1 (w=4) (all similar/identical content)
    • A1 - two Libyans indicted
    • B1 - two Libyans indicted
    • C1 - two Libyans accused
    • D2 – two Libyans suspects were indicted
Pyramids
  • a “pyramid” of SCUs of height n is created for n gold standard summaries
  • each SCU in tier Ti in the pyramid has weight i
  • highly weighted SCUs sit at the top of the pyramid (tiers run from w = n at the top down to w = 1 at the bottom)
  • the best summary is one which contains all units of level n, then all units from n-1, …
  • if Di is the number of SCUs in a summary D which appear in tier Ti, then the weight of the summary is D = Σi (i × Di)
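The summary weight and the normalized score can be sketched as below; the score divides the observed weight by the weight of an ideally informative summary of the same size, and the function names are illustrative.

```python
def summary_weight(scu_tiers):
    """scu_tiers: for each SCU expressed in the summary, its tier i
    (the number of gold summaries containing that SCU). D = sum_i i*Di."""
    return sum(scu_tiers)

def pyramid_score(scu_tiers, max_weight):
    """Observed weight over the weight of an ideally informative summary
    with the same number of SCUs."""
    return summary_weight(scu_tiers) / max_weight if max_weight else 0.0
```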

Pyramids score
  • let X be the total number of units in a summary
  • it is shown that more than 4 ideal summaries are required to produce reliable rankings
Human performance / Best system

Metric      Human summarizers           Best system
Pyramid     B: 0.5472,  A: 0.4969       system 14: 0.2587
Modified    B: 0.4814,  A: 0.4617       system 10: 0.2052
Resp        A: 4.895,   B: 4.526        system 4:  2.85
ROUGE-SU4   A: 0.1722,  B: 0.1552       system 15: 0.139

Best system ~50% of human performance on manual metrics

Best system ~80% of human performance on ROUGE

ACKNOWLEDGMENTS
  • Many slides borrowed from Ani Nenkova (Penn), Dragomir Radev (University of Michigan) and Daniel Marcu (ISI)