MEAD 3.09 A platform for multidocument multilingual text summarization


Presentation Transcript


  1. MEAD 3.09 A platform for multidocument multilingual text summarization. University of Michigan, Smith College, Columbia University, University of Pennsylvania, Johns Hopkins University, Chinese University of Hong Kong, University of Alabama, University of Sheffield, University of Cambridge. JHU Summer School 2004 - Baltimore

  2. Text summarization • Identifying the “most important” information from a document or set of documents. • Extractive/abstractive • Single-document/multi-document • Informative/Indicative MEAD - JHU 2004 2

  3. MEAD • Multi-document, multilingual, extractive summarization platform • Open-source (Perl & Java), well-documented API and utilities • v. 1.0-2.0 (Michigan 2000), v. 3.0 (JHU 2001) • Latest release is v. 3.09 (Michigan 2001-2004)

  4. Four stages • Preprocessing and clustering (CIDR, XML representation) • Feature extraction (default + custom) • Score extraction (feature combination) • Sentence reranking (cross-sentence relationships: repetitions, chronology, source preferences)

  5. Sample .config file
     <MEAD-CONFIG TARGET='GA3' LANG='ENG'
                  CLUSTER-PATH='/clair4/mead/data/GA3'
                  DATA-DIRECTORY='/clair4/mead/data/GA3/docsent'>
       <FEATURE-SET BASE-DIRECTORY='/clair4/mead/data/GA3/feature/'>
         <FEATURE NAME='Centroid' SCRIPT='/clair4/mead/bin/feature-scripts/Centroid.pl HK-WORD-enidf ENG'/>
         <FEATURE NAME='Position' SCRIPT='/clair4/mead/bin/feature-scripts/Position.pl'/>
         <FEATURE NAME='Length' SCRIPT='/clair4/mead/bin/feature-scripts/Length.pl'/>
       </FEATURE-SET>
       <CLASSIFIER COMMAND-LINE='/clair4/mead/bin/default-classifier.pl \
                                 Centroid 1 Position 1 Length 9'
                   SYSTEM='MEADORIG' RUN='10/09'/>
       <RERANKER COMMAND-LINE='/clair4/mead/bin/default-reranker.pl MEAD-cosine 0.7'/>
       <COMPRESSION BASIS='sentences' PERCENT='20'/>
     </MEAD-CONFIG>
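To make the layout of the sample configuration concrete, here is a minimal Python sketch that reads it with the standard library. The function name `summarize_config` is hypothetical and not part of MEAD; the embedded XML is a cleaned copy of the slide's sample with the backslash line continuation removed, and the pairing of classifier arguments into name/value pairs is an assumption based on how the command line reads.

```python
import xml.etree.ElementTree as ET

# Cleaned copy of the slide's sample configuration (illustrative only).
CONFIG = """
<MEAD-CONFIG TARGET='GA3' LANG='ENG'
    CLUSTER-PATH='/clair4/mead/data/GA3'
    DATA-DIRECTORY='/clair4/mead/data/GA3/docsent'>
  <FEATURE-SET BASE-DIRECTORY='/clair4/mead/data/GA3/feature/'>
    <FEATURE NAME='Centroid' SCRIPT='/clair4/mead/bin/feature-scripts/Centroid.pl HK-WORD-enidf ENG'/>
    <FEATURE NAME='Position' SCRIPT='/clair4/mead/bin/feature-scripts/Position.pl'/>
    <FEATURE NAME='Length' SCRIPT='/clair4/mead/bin/feature-scripts/Length.pl'/>
  </FEATURE-SET>
  <CLASSIFIER COMMAND-LINE='/clair4/mead/bin/default-classifier.pl Centroid 1 Position 1 Length 9'
      SYSTEM='MEADORIG' RUN='10/09'/>
  <RERANKER COMMAND-LINE='/clair4/mead/bin/default-reranker.pl MEAD-cosine 0.7'/>
  <COMPRESSION BASIS='sentences' PERCENT='20'/>
</MEAD-CONFIG>
"""


def summarize_config(xml_text):
    """Return (feature names, classifier parameters, compression settings)."""
    root = ET.fromstring(xml_text.strip())
    features = [f.get("NAME") for f in root.iter("FEATURE")]
    # Drop the script path, then pair the remaining "Name value" arguments.
    args = root.find("CLASSIFIER").get("COMMAND-LINE").split()[1:]
    params = dict(zip(args[0::2], (float(v) for v in args[1::2])))
    compression = dict(root.find("COMPRESSION").attrib)
    return features, params, compression


features, params, compression = summarize_config(CONFIG)
print(features, params, compression)
```

The sketch only inspects the file; it does not run any of the Perl scripts the configuration points to.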

  6. Sample .sentfeature file
     <SENT-FEATURE>
       <S DID="87" SNO="1" > <FEATURE N="Centroid" V="0.2749" /> </S>
       <S DID="87" SNO="2" > <FEATURE N="Centroid" V="0.8288" /> </S>
       <S DID="81" SNO="1" > <FEATURE N="Centroid" V="0.1538" /> </S>
       <S DID="81" SNO="2" > <FEATURE N="Centroid" V="1.0000" /> </S>
       <S DID="41" SNO="1" > <FEATURE N="Centroid" V="0.1539" /> </S>
       <S DID="41" SNO="2" > <FEATURE N="Centroid" V="0.9820" /> </S>
     </SENT-FEATURE>
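As an illustration of how a .sentfeature file maps sentences to feature values, the following Python sketch loads the sample above into a dictionary keyed by (DID, SNO). The helper name `load_sentfeature` is hypothetical; MEAD itself is written in Perl and Java and this is not its API.

```python
import xml.etree.ElementTree as ET

# The sample .sentfeature content from the slide.
SENT_FEATURE = """
<SENT-FEATURE>
  <S DID="87" SNO="1"><FEATURE N="Centroid" V="0.2749"/></S>
  <S DID="87" SNO="2"><FEATURE N="Centroid" V="0.8288"/></S>
  <S DID="81" SNO="1"><FEATURE N="Centroid" V="0.1538"/></S>
  <S DID="81" SNO="2"><FEATURE N="Centroid" V="1.0000"/></S>
  <S DID="41" SNO="1"><FEATURE N="Centroid" V="0.1539"/></S>
  <S DID="41" SNO="2"><FEATURE N="Centroid" V="0.9820"/></S>
</SENT-FEATURE>
"""


def load_sentfeature(xml_text):
    """Map (document id, sentence number) to {feature name: value}."""
    table = {}
    for s in ET.fromstring(xml_text.strip()).iter("S"):
        key = (s.get("DID"), int(s.get("SNO")))
        table[key] = {f.get("N"): float(f.get("V")) for f in s.iter("FEATURE")}
    return table


scores = load_sentfeature(SENT_FEATURE)
# Sentence with the highest Centroid value in this sample.
best = max(scores, key=lambda k: scores[k]["Centroid"])
print(best, scores[best])
```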

  7. Sample .extract file
     <!DOCTYPE EXTRACT SYSTEM '/clair/tools/mead/dtd/extract.dtd'>
     <EXTRACT QID='GA3' LANG='ENG' COMPRESSION='7' SYSTEM='MEADORIG' RUN='Sun Oct 13 11:01:19 2002'>
       <S ORDER='1' DID='41' SNO='2' />
       <S ORDER='2' DID='41' SNO='3' />
       <S ORDER='3' DID='41' SNO='11' />
       <S ORDER='4' DID='81' SNO='3' />
       <S ORDER='5' DID='81' SNO='7' />
       <S ORDER='6' DID='87' SNO='2' />
       <S ORDER='7' DID='87' SNO='3' />
     </EXTRACT>

  8. Sample .query
     <!DOCTYPE QUERY SYSTEM "/clair4/mead/dtd/query.dtd" >
     <QUERY QID="Q-551-E" QNO="551" TRANSLATED="NO">
       <TITLE> Natural disaster victims aided </TITLE>
       <DESCRIPTION> The description is usually a few sentences describing the cluster. </DESCRIPTION>
       <NARRATIVE> The narrative often describes exactly what the user is looking for in the summary. </NARRATIVE>
     </QUERY>

  9. Features • Centroid: cosine overlap with the centroid vector of the cluster • SimWithFirst: cosine overlap with the first sentence in the document (or with the title, if it exists) • Length: 1 if the length of the sentence is above a given threshold and 0 otherwise • RealLength: the length of the sentence in words • Position: the position of the sentence in the document • QueryOverlap: cosine overlap with a query sentence or phrase • KeywordMatch: full match from a list of keywords • CosineCentrality: eigenvector centrality of the sentence on the lexical connectivity matrix with a defined threshold
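A few of the features listed above can be sketched in Python as follows. This is a simplified illustration, not MEAD's implementation: the cosine here uses raw term counts (MEAD's scripts take an IDF resource such as `enidf`, which is omitted), the tokenizer is a plain regular expression, and "above a given threshold" is read as strictly greater than. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(sentence):
    """Lowercased alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine(a, b):
    """Cosine overlap between two sentences using raw term counts."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def real_length(sentence):
    """RealLength: the length of the sentence in words."""
    return len(tokens(sentence))


def length_feature(sentence, threshold):
    """Length: 1 if the sentence is longer than the threshold, else 0."""
    return 1 if real_length(sentence) > threshold else 0


def sim_with_first(sentences):
    """SimWithFirst: cosine overlap of each sentence with the first one."""
    return [cosine(s, sentences[0]) for s in sentences]


doc = [
    "Iraq refuses to back down",
    "Iraq refuses to cooperate with inspectors",
    "Blair spoke to the press",
]
print([round(x, 3) for x in sim_with_first(doc)])
```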

  10. Centrality in summarization • Motivation: capture the most central words in a document or cluster • Centroid score [Radev & al. 2000, 2004a] • Alternative methods for computing centrality?

  11. Social networks • Induced by a relation r • Prestige (centrality) in social networks: • Degree centrality: number of friends • Geodesic centrality: bridge quality • Eigenvector centrality: who your friends are

  12. Eigenvectors of stochastic graphs • Square connectivity matrix • Directed vs. undirected • An eigenvalue of a square matrix A is a scalar λ such that there exists a vector x ≠ 0 with Ax = λx • The normalized eigenvector associated with the largest λ is called the principal eigenvector of A • A matrix is called a stochastic matrix when the entries in each row sum to 1 and none is negative. All stochastic matrices have a principal eigenvector • The connectivity matrix used in PageRank [Page & al. 1998] is irreducible [Langville & Meyer 2003] • An iterative method (power method) can be used to compute the principal eigenvector • That eigenvector corresponds to the stationary value of the Markov stochastic process described by the connectivity matrix • This is also equivalent to performing a random walk on the matrix
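The power method mentioned above can be sketched in a few lines of pure Python. The function name, tolerance, and the 3x3 example matrix are illustrative assumptions. Note that for a row-stochastic matrix the stationary distribution is the left eigenvector (p A = p), so the sketch multiplies the vector from the left.

```python
def power_method(matrix, tol=1e-10, max_iter=1000):
    """Approximate the stationary distribution p (with p A = p) of a
    row-stochastic matrix by repeated left multiplication."""
    n = len(matrix)
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        new_p = [sum(p[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


# Hypothetical 3-state chain: every row sums to 1 and no entry is negative.
A = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
stationary = power_method(A)
print([round(x, 4) for x in stationary])
```

For this small chain the iteration settles on the distribution (0.25, 0.5, 0.25), which can be checked by hand against p A = p.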

  13. Eigenvectors of stochastic graphs • The stationary value of the Markov stochastic matrix can be computed using an iterative power method: • PageRank adds an extra twist to deal with dead-end pages. With a probability 1-, a random starting point is chosen. This has a natural interpretation in the case of Web page ranking • Legend: su = successor nodes, pr = predecessor nodes
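The update formula itself is not present in this transcript. One common way to write a PageRank-style iteration that is consistent with the su/pr legend is sketched below. The symbols are assumptions rather than the slide's notation: α is the probability of following a link (so a random starting point is chosen with probability 1-α), N is the number of nodes, and p^{(k)}(v) is the score of node v after k iterations.

```latex
p^{(k+1)}(v) \;=\; \frac{1-\alpha}{N} \;+\; \alpha \sum_{u \in pr(v)} \frac{p^{(k)}(u)}{\lvert su(u) \rvert}
```

Each predecessor u of v passes on its current score divided evenly among its successors, and the (1-α)/N term is the share contributed by the random restart.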

  14. LexPageRank (Cosine centrality) Example (cluster d1003t)
      1 (d1s1) Iraqi Vice President Taha Yassin Ramadan announced today, Sunday, that Iraq refuses to back down from its decision to stop cooperating with disarmament inspectors before its demands are met.
      2 (d2s1) Iraqi Vice president Taha Yassin Ramadan announced today, Thursday, that Iraq rejects cooperating with the United Nations except on the issue of lifting the blockade imposed upon it since the year 1990.
      3 (d2s2) Ramadan told reporters in Baghdad that "Iraq cannot deal positively with whoever represents the Security Council unless there was a clear stance on the issue of lifting the blockade off of it.
      4 (d2s3) Baghdad had decided late last October to completely cease cooperating with the inspectors of the United Nations Special Commission (UNSCOM), in charge of disarming Iraq's weapons, and whose work became very limited since the fifth of August, and announced it will not resume its cooperation with the Commission even if it were subjected to a military operation.
      5 (d3s1) The Russian Foreign Minister, Igor Ivanov, warned today, Wednesday against using force against Iraq, which will destroy, according to him, seven years of difficult diplomatic work and will complicate the regional situation in the area.
      6 (d3s2) Ivanov contended that carrying out air strikes against Iraq, who refuses to cooperate with the United Nations inspectors, ``will end the tremendous work achieved by the international group during the past seven years and will complicate the situation in the region.''
      7 (d3s3) Nevertheless, Ivanov stressed that Baghdad must resume working with the Special Commission in charge of disarming the Iraqi weapons of mass destruction (UNSCOM).
      8 (d4s1) The Special Representative of the United Nations Secretary-General in Baghdad, Prakash Shah, announced today, Wednesday, after meeting with the Iraqi Deputy Prime Minister Tariq Aziz, that Iraq refuses to back down from its decision to cut off cooperation with the disarmament inspectors.
      9 (d5s1) British Prime Minister Tony Blair said today, Sunday, that the crisis between the international community and Iraq ``did not end'' and that Britain is still ``ready, prepared, and able to strike Iraq.''
      10 (d5s2) In a gathering with the press held at the Prime Minister's office, Blair contended that the crisis with Iraq ``will not end until Iraq has absolutely and unconditionally respected its commitments'' towards the United Nations.
      11 (d5s3) A spokesman for Tony Blair had indicated that the British Prime Minister gave permission to British Air Force Tornado planes stationed in Kuwait to join the aerial bombardment against Iraq.

  15. Cosine centrality

  16. Cosine centrality (t=0.3) [Figure: graph over the sentence nodes d1s1, d2s1, d2s2, d2s3, d3s1, d3s2, d3s3, d4s1, d5s1, d5s2, d5s3]

  17. Cosine centrality (t=0.2) [Figure: graph over the sentence nodes d1s1, d2s1, d2s2, d2s3, d3s1, d3s2, d3s3, d4s1, d5s1, d5s2, d5s3]

  18. Cosine centrality (t=0.1) [Figure: graph over the sentence nodes d1s1, d2s1, d2s2, d2s3, d3s1, d3s2, d3s3, d4s1, d5s1, d5s2, d5s3, with d4s1 labeled separately] Sentences vote for the most central sentence!

  19. Cosine centrality vs. centroid centrality
      ID     LPR (0.1)  LPR (0.2)  LPR (0.3)  Centroid
      d1s1   0.6007     0.6944     0.0909     0.7209
      d2s1   0.8466     0.7317     0.0909     0.7249
      d2s2   0.3491     0.6773     0.0909     0.1356
      d2s3   0.7520     0.6550     0.0909     0.5694
      d3s1   0.5907     0.4344     0.0909     0.6331
      d3s2   0.7993     0.8718     0.0909     0.7972
      d3s3   0.3548     0.4993     0.0909     0.3328
      d4s1   1.0000     1.0000     0.0909     0.9414
      d5s1   0.5921     0.7399     0.0909     0.9580
      d5s2   0.6910     0.6967     0.0909     1.0000
      d5s3   0.5921     0.4501     0.0909     0.7902
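The idea of thresholded cosine centrality can be sketched as follows: link two sentences when their similarity exceeds the threshold t, then run a PageRank-style iteration on the resulting graph. This is a rough sketch, not MEAD's code. The function name, the `follow` probability of 0.85, the uniform redistribution of score from isolated sentences, and the toy similarity matrix are all assumptions. Under these assumptions, a graph with no edges above the threshold gives every sentence the same score 1/n; for n = 11 that is about 0.0909, the value that fills the LPR (0.3) column, though the slides do not say that this is the reason.

```python
def cosine_centrality(sim, threshold, follow=0.85, iters=200):
    """PageRank-style scores on the graph that links sentences i and j
    whenever sim[i][j] exceeds the threshold."""
    n = len(sim)
    adj = [[i != j and sim[i][j] > threshold for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in adj]
    p = [1.0 / n] * n
    for _ in range(iters):
        # Score held by sentences with no neighbours is spread uniformly.
        dangling = sum(p[u] for u in range(n) if degree[u] == 0)
        p = [
            (1 - follow) / n
            + follow * (
                dangling / n
                + sum(p[u] / degree[u] for u in range(n) if adj[u][v])
            )
            for v in range(n)
        ]
    return p


# Hypothetical similarities: sentence 0 resembles 1 and 2, sentence 3 is unrelated.
SIM = [
    [1.00, 0.50, 0.40, 0.05],
    [0.50, 1.00, 0.05, 0.05],
    [0.40, 0.05, 1.00, 0.05],
    [0.05, 0.05, 0.05, 1.00],
]
low = cosine_centrality(SIM, threshold=0.1)
high = cosine_centrality(SIM, threshold=0.9)
print([round(x, 4) for x in low])
print([round(x, 4) for x in high])
```

Lowering the threshold adds edges and lets well-connected sentences collect more "votes"; raising it far enough removes all edges and flattens the scores.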

  20. Classifiers • Default: linear combination (possibly using thresholds) • Lead-based: positional and chronological • Random • Decision-tree: trainable
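The default classifier's "linear combination (possibly using thresholds)" can be illustrated with a short Python sketch. It assumes that the sentence score is a weighted sum of feature values and that a length cutoff sets the score of short sentences to 0; that reading of the threshold, the function name, and the Position and RealLength values below are assumptions for illustration. Only the Centroid values come from the sample .sentfeature slide.

```python
def linear_score(features, weights, length_cutoff=None):
    """Weighted sum of feature values, with an optional length cutoff."""
    if length_cutoff is not None and features.get("RealLength", 0) < length_cutoff:
        return 0.0  # assumed behaviour: too-short sentences are discarded
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


# Centroid values from the sample; Position and RealLength are made up.
sentences = {
    ("41", 2): {"Centroid": 0.9820, "Position": 0.5, "RealLength": 24},
    ("81", 2): {"Centroid": 1.0000, "Position": 0.5, "RealLength": 6},
    ("87", 1): {"Centroid": 0.2749, "Position": 1.0, "RealLength": 30},
}
weights = {"Centroid": 1.0, "Position": 1.0}

scored = {
    sid: linear_score(feats, weights, length_cutoff=9)
    for sid, feats in sentences.items()
}
ranking = sorted(scored, key=scored.get, reverse=True)
print(ranking)
```

In this toy run the sentence with the highest Centroid value drops to the bottom because it falls under the assumed length cutoff.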

  21. Rerankers • Identity: trivial • Default: remove sentences that are too similar • Time-based: use chronology • Source-based: source preference • Novelty: • CST-based: cross-document structure theory [Radev 2000, Zhang & al. 2002, Zhang & Radev 2004] • MMR: maximal marginal relevance [Carbonell & Goldstein 1998]
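The default reranker's behaviour, "remove sentences that are too similar", can be sketched as a greedy pass over the classifier's ranking. This is a minimal sketch under stated assumptions: similarity is a raw-count cosine, the 0.7 threshold echoes the `MEAD-cosine 0.7` argument in the sample configuration, and the function names and example sentences are hypothetical.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Raw term-count cosine between two sentences (no IDF weighting)."""
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def rerank(ranked, threshold=0.7, limit=None):
    """Walk the ranking best-first and keep a sentence only if it is not
    too similar to any sentence already kept."""
    chosen = []
    for candidate in ranked:
        if all(cosine(candidate, kept) <= threshold for kept in chosen):
            chosen.append(candidate)
            if limit is not None and len(chosen) == limit:
                break
    return chosen


ranked = [
    "Iraq refuses to cooperate with inspectors",
    "Iraq refuses to cooperate with the inspectors",
    "Blair says Britain is ready to strike",
]
summary = rerank(ranked)
print(summary)
```

The second sentence is nearly identical to the first, so it is skipped and the third takes its place in the summary.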

  22. Evaluation methods • Precision/recall/f-measure: baseline • Kappa: interjudge agreement and difficulty • Relative utility: non-binary judgements [Radev 2000] • Relevance correlation: IR-based • Cosine: default or TF*IDF • Longest common subsequence [Saggion & al. 2002] • Word overlap • BLEU: n-gram precision [Papineni & al. 2002] • ROUGE: n-gram recall and LCS [Lin 2004]
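Two of the simpler measures above can be sketched in Python: sentence-level precision, recall, and F-measure between a system extract and a reference extract, and a clipped n-gram recall in the spirit of ROUGE-N. These are simplified illustrations with hypothetical function names and data, not the official scoring tools; the n-gram recall uses whitespace tokenization and a single reference.

```python
from collections import Counter


def precision_recall_f(system, reference):
    """Sentence-level P/R/F over sets of (DID, SNO) identifiers."""
    system, reference = set(system), set(reference)
    hits = len(system & reference)
    p = hits / len(system) if system else 0.0
    r = hits / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def ngram_recall(candidate, reference, n=1):
    """Share of reference n-grams that also occur in the candidate
    (counts clipped), in the spirit of ROUGE-N."""
    def grams(text):
        words = text.lower().split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    cand, ref = grams(candidate), grams(reference)
    total = sum(ref.values())
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / total if total else 0.0


system = [("41", 2), ("41", 3), ("81", 3)]
reference = [("41", 2), ("81", 3), ("87", 2), ("87", 3)]
p, r, f = precision_recall_f(system, reference)
r1 = ngram_recall("iraq refuses to cooperate", "iraq refuses to back down", n=1)
r2 = ngram_recall("iraq refuses to cooperate", "iraq refuses to back down", n=2)
print(round(p, 3), round(r, 3), round(f, 3), r1, r2)
```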

  23. Recent applications • NewsInEssence (www.newsinessence.com) • DUC 2001-2004 • WapMEAD • Java-MEAD interface • Chronological fact extraction • Novelty detection • Protein interaction extraction

  27. More recent additions • MEAD “addons” – conversion from plain text, HTML, PDF, etc. to MEAD XML • Client + server • Summary to sentjudge conversion • Trainable version of MEAD using decision trees, maxent, and SVM

  28. Successes • Large-scale effort (more than 20 people have participated in it) • Open architecture • Downloaded more than 1,000 times in the last 2 years • Used in teaching • Novel models of centrality: centroid, degree, cosine centrality • Currently in five languages: English, Chinese, Korean, Spanish, Japanese • DUC (including several first-place rankings in 2003, 2004)

  29. Sample .meadrc file
      compression_basis sentences
      compression_absolute 1
      classifier \
        /clair4/projects/mead307/source/mead/bin/default-classifier.pl \
        Centroid 3.0 Position 1.0 Length 15 SimWithFirst 2.0
      reranker \
        /clair4/projects/mead307/source/mead/bin/default-reranker.pl \
        MEAD-cosine 0.9 enidf
