

  1. Advisor: Koh, Jia-Ling Presenter: Nonhlanhla Shongwe DOCUMENT UPDATE SUMMARIZATION USING INCREMENTAL HIERARCHICAL CLUSTERING CIKM’10 (DINGDING WANG, TAO LI)

  2. Preview • Introduction • Incremental Hierarchical Clustering Based Document Update Summarization • Incremental Hierarchical Sentence Clustering (IHSC) • The COBWEB algorithm • COBWEB for text • Algorithm • Evaluation measures • Experiments and results

  3. Introduction • Document summarization has been receiving much attention because • The number of documents on the internet keeps increasing • Summaries help readers efficiently extract the information they are interested in • Most document summarization techniques operate in batch mode

  4. Introduction cont'd • The two most widely used summarization methods • First: clustering-based (see the sketch below) • A term-sentence matrix is formed from the documents • Sentences are grouped into different clusters • A score is assigned to each sentence using the average cosine similarity • The sentence with the highest score in each cluster is added to the summary
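A minimal sketch of the clustering-based pipeline described above, assuming scikit-learn; the TF-IDF representation, KMeans, and the cluster count are illustrative choices, not the specific setup from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cluster_summarize(sentences, n_clusters=3):
    # Term-sentence matrix (TF-IDF here for simplicity)
    X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    summary = []
    for k in range(n_clusters):
        idx = np.where(labels == k)[0]
        sim = cosine_similarity(X[idx])       # pairwise similarities within cluster k
        scores = sim.mean(axis=1)             # average cosine similarity per sentence
        summary.append(sentences[idx[scores.argmax()]])  # top-scoring sentence
    return summary
```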

  5. Introduction cont'd • Second: graph-ranking-based (see the sketch below) • Constructs a sentence graph in which each node is a sentence from the document collection • An edge is formed between a pair of sentences if • The similarity between the pair is above a threshold, or • They belong to the same document • Sentences are selected for the summary by the votes of their neighbors
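An illustrative graph-ranking sketch in the LexRank/TextRank spirit; the similarity threshold and the use of PageRank as the neighbor-voting scheme are assumptions, and same-document edges are omitted for brevity.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def graph_summarize(sentences, threshold=0.2, top_n=3):
    X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    sim = cosine_similarity(X)
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if sim[i, j] > threshold:             # edge if similarity above threshold
                g.add_edge(i, j, weight=sim[i, j])
    scores = nx.pagerank(g)                       # neighbors "vote" for each sentence
    top = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]
```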

  6. Introduction cont'd • With the rapid growth of document collections • Existing summaries need to be updated when new documents arrive • Traditional methods are not suitable for this task • Most of them work in a batch way • Meaning all the documents must be processed again once new documents arrive, which is inefficient

  7. Introduction cont'd • This paper proposes • Integrating document summarization techniques into an incremental hierarchical clustering framework • Re-organizing the sentence clusters immediately after new documents arrive, so that the corresponding summaries can be updated efficiently

  8. INCREMENTAL HIERARCHICAL CLUSTERING BASED DOCUMENT UPDATE SUMMARIZATION • Framework • Preprocessing • Incremental Hierarchical Sentence Clustering (IHSC) • The COBWEB algorithm • COBWEB for text • Representative Sentence Selection for Each Node of the Hierarchy • The Algorithm

  9. Framework

  10. Preprocessing • Given a collection of documents • Decompose the documents into sentences • Remove stop words • Perform word stemming • Construct a term-sentence matrix in which each element is a term frequency (see the sketch below)
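A minimal preprocessing sketch, assuming NLTK with its 'punkt' and 'stopwords' resources downloaded; the sparse Counter representation stands in for the term-sentence matrix.

```python
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

def preprocess(documents):
    stop, stemmer = set(stopwords.words("english")), PorterStemmer()
    sentences, vectors = [], []
    for doc in documents:
        for sent in sent_tokenize(doc):            # decompose into sentences
            terms = [stemmer.stem(w.lower()) for w in word_tokenize(sent)
                     if w.isalpha() and w.lower() not in stop]
            sentences.append(sent)
            vectors.append(Counter(terms))         # term -> frequency column
    return sentences, vectors
```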

  11. Incremental Hierarchical Sentence Clustering (IHSC) • The update summarization system • Uses Incremental Hierarchical Clustering (IHC) • Benefits of the IHC method • It can efficiently process dynamic document collections as new documents are added • A hierarchy is built to facilitate users' browsing • The number of clusters is not pre-defined

  12. The COBWEB algorithm • COBWEB, one of the most popular incremental hierarchical clustering algorithms, is used • It is based on a heuristic measure called Category Utility (CU), defined in terms of • The clusters C_1, ..., C_K • The probability P(C_k) that a document belongs to cluster C_k • The total number of clusters K

  13. The COBWEB algorithm cont'd • A_i = the i-th attribute of the items being clustered • V_ij = the j-th value of the i-th attribute. For example: A_1 ∈ {male, female}, A_2 ∈ {Red, Green, Blue}, so V_12 = female and V_22 = Green • Under a probability-matching guessing strategy, Σ_j P(A_i = V_ij | C_k)² is the expected number of times we can correctly guess the value of the multinomial variable A_i to be V_ij for an item in cluster k • The standard category utility is CU = (1/K) Σ_k P(C_k) Σ_i Σ_j [ P(A_i = V_ij | C_k)² − P(A_i = V_ij)² ] • A good cluster, in which the attributes of the items take similar values, has a high CU • COBWEB maximizes the CU score over all possible assignments of a document to a cluster (a small numeric sketch follows below)
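A small sketch of category utility for nominal attributes in its standard COBWEB form; the toy items and attribute names are illustrative, and every item is assumed to carry every attribute.

```python
from collections import Counter

def category_utility(clusters):
    items = [it for c in clusters for it in c]
    n = len(items)
    attrs = {a for it in items for a in it}          # assumes a shared attribute set

    def guess_score(group):                          # sum over i, j of P(A_i = V_ij)^2
        return sum((cnt / len(group)) ** 2
                   for a in attrs
                   for cnt in Counter(it[a] for it in group).values())

    base = guess_score(items)                        # unconditional guessing score
    return sum(len(c) / n * (guess_score(c) - base)
               for c in clusters) / len(clusters)

# Example: two toy clusters over a single 'color' attribute -> 2/9
print(category_utility([[{"color": "red"}, {"color": "red"}],
                        [{"color": "blue"}]]))
```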

  14. The COBWEB algorithm cont'd • For each incoming sentence, COBWEB can perform one of four operations • Insert: add the sentence into an existing cluster • Create: create a new cluster • Merge: combine two clusters into a single cluster • Split: divide an existing cluster into several clusters

  15. The COBWEB algorithm cont'd Example:

  16. COBWEB for text • The original COBWEB algorithm • Assumes an attribute distribution (e.g., normal) that is not suitable for text data • Documents • Are represented in the "bag of words" model, where terms are the attributes • The best method • Calculating CU using Katz's distribution over term frequencies

  17. COBWEB for text cont'd • Katz's model gives the probability P(k) that word i occurs k times in a document • P(0) = 1 − df/N • df = document frequency, N = total number of documents • p = (cf − df)/cf = Pr(the word repeats | the word occurs) • cf = collection frequency • Therefore (1 − p) is the probability of the word occurring only once, given that it occurs

  18. COBWEB for text cont'd • Katz's mixture distribution: P(k) = (1 − α) δ_k0 + α (1 − p) p^k, where δ_k0 = 1 if k = 0 and 0 otherwise • Setting k = 0, using δ_k0 = 1, and adding both terms gives p(0) = 1 − αp • Hence α = (1 − p(0))/p = (df/N)/p

  19. COBWEB for text cont'd • These probabilities are substituted into the contribution of attribute (term) i towards the category utility of cluster k, Σ_f P(A_i = f | C_k)², where the attribute value f = V_ij is the term frequency • With P(0) = 1 − αp and P(f) = α(1 − p)p^f for f ≥ 1, this sum evaluates to (1 − αp)² + α²(1 − p)p²/(1 + p) (see the sketch below)
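A sketch of the Katz-based quantities for a single term, following the slide's definitions of df, cf, and N; the closed-form sum is derived from the mixture distribution above and assumes 0 < df < cf.

```python
def katz_contribution(df, cf, N):
    # Katz parameters for one term (assumes 0 < df < cf)
    p = (cf - df) / cf            # Pr(word repeats | word occurs)
    p0 = 1 - df / N               # Pr(word does not occur in a document)
    alpha = (1 - p0) / p          # from p(0) = 1 - alpha * p
    # sum_f P(f)^2 = P(0)^2 + sum_{f>=1} (alpha * (1 - p) * p^f)^2
    return (1 - alpha * p) ** 2 + alpha ** 2 * (1 - p) * p ** 2 / (1 + p)

# Example: a term in 40 of 100 documents, with 70 occurrences overall
print(katz_contribution(df=40, cf=70, N=100))
```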

  20. Representative Sentence Selection for Each Node of the Hierarchy • The update summarization system • Selects the most representative sentence to summarize each node and its subtree • Once a new sentence arrives, the sentence hierarchy is changed by one of the four operations

  21. Representative Sentence Selection for Each Node of the Hierarchy cont'd • Case 1: insert a sentence into cluster k • Recalculate the representative sentence R_k of cluster k as R_k = argmax_{s_i ∈ C_k} [ α Sim(s_i, query) + (1 − α) (1/K) Σ_{s_j ∈ C_k} Sim(s_i, s_j) ] (see the sketch below) • Where • K: the number of sentences in the cluster • Sim(): the similarity function between sentence pairs (cosine similarity) • α: a weighting parameter, set to α = 0.6
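A sketch of case-1 selection under the reconstruction above, i.e., assuming the score mixes query similarity with average intra-cluster similarity at weight α = 0.6; the paper's exact formula may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def representative(cluster_sentences, query, alpha=0.6):
    X = TfidfVectorizer().fit_transform(cluster_sentences + [query])
    sents, q = X[:-1], X[-1]
    query_sim = cosine_similarity(sents, q).ravel()    # Sim(s_i, query)
    intra_sim = cosine_similarity(sents).mean(axis=1)  # (1/K) sum_j Sim(s_i, s_j)
    scores = alpha * query_sim + (1 - alpha) * intra_sim
    return cluster_sentences[int(np.argmax(scores))]
```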

  22. Representative Sentence Selection for Each Node of the Hierarchy cont'd • Case 2: create a new cluster k • The new sentence represents the new cluster: R_k = s_new • Case 3: merge two clusters (cluster_a and cluster_b) into a new cluster (cluster_c) • Of the two representatives, the sentence with the higher similarity to the query is selected as the representative sentence of the new merged node

  23. Representative Sentence Selection for Each Node of the Hierarchy cont'd • Case 4: split a cluster into a set of clusters • (cluster_a into cluster_1, cluster_2, ..., cluster_n) • Remove node a • Substitute it with the roots of its sub-trees • The corresponding representative sentences are those of the original sub-tree roots

  24. The Algorithm • Input: a query/topic the user is interested in, and a sequence of documents/sentences 1. Read one sentence and check whether it is relevant to the given topic, i.e., checkrelevance(sentence, topic)

  25. The Algorithm cont'd 2. If relevant: initialize the hierarchy tree with the sentence as the root. Otherwise: discard it, read in the next sentence, and repeat Step 1 until a root node is formed 3. Repeat

  26. The Algorithm cont'd 4. Read in the next sentence, starting from the root node • If the node is a leaf, go to Step 5; otherwise choose the operation with the highest CU score • Insert a node and conduct case 1 summarization • Create a node and conduct case 2 summarization • Merge nodes and conduct case 3 summarization • Split a node and conduct case 4 summarization 5. If a leaf node is reached, create a new leaf node, merge the old leaf and the new leaf into one node, and conduct case 2 and case 3 summarization

  27. The Algorithm cont'd 6. Until the stopping condition is satisfied 7. Cut the hierarchy tree at one layer to obtain a summary of the corresponding length • Output: a sentence hierarchy and the updated summary (an end-to-end sketch follows below)
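An end-to-end sketch of the incremental loop on slides 24-27. To keep it short, the CU-driven choice among insert/create/merge/split is replaced by a simple similarity test (insert into the nearest cluster or create a new one), and the hierarchy is one level deep; all names and thresholds are illustrative, not the authors' implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def update_summarize(sentence_stream, topic, rel_thresh=0.05, new_thresh=0.2):
    vec = TfidfVectorizer(stop_words="english")
    clusters = []                                   # each cluster: list of sentences
    for s in sentence_stream:
        rel = cosine_similarity(vec.fit_transform([s, topic]))[0, 1]
        if rel < rel_thresh:                        # Step 1: topic-relevance filter
            continue
        if not clusters:
            clusters.append([s])                    # Step 2: first sentence -> root
            continue
        reps = [c[0] for c in clusters]
        X = vec.fit_transform(reps + [s])
        sim = cosine_similarity(X[-1], X[:-1]).ravel()
        k = sim.argmax()
        if sim[k] >= new_thresh:
            clusters[k].append(s)                   # stand-in for insert (case 1)
        else:
            clusters.append([s])                    # stand-in for create (case 2)
    return [c[0] for c in clusters]                 # Step 7: one sentence per cluster

# Example usage with toy data:
stream = ["Hurricane Wilma hit Florida.", "Wilma caused major flooding.",
          "Unrelated sports news.", "Relief workers arrived after Wilma passed."]
print(update_summarize(stream, "hurricane wilma updates"))
```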

  28. EXPERIMENTS • Data Description • Baselines • Evaluation Measures • Experimental Results

  29. Data Description • Hurricane Wilma news releases (Hurricane) • 1700 documents divided into 3 phases • TAC 2008 Update Summarization Track (TAC08) • A benchmark dataset for update summarization • 48 topics, with 20 newswire articles in each topic

  30. Baselines • The following widely used multi-document summarization methods were implemented as the baseline systems

  31. Evaluation Measures • The ROUGE toolkit • Compares system summaries with the human reference summaries • ROUGE-N = Σ_{S ∈ reference summaries} Σ_{gram_n ∈ S} Count_match(gram_n) / Σ_{S ∈ reference summaries} Σ_{gram_n ∈ S} Count(gram_n) • Count_match(gram_n): the maximum number of n-grams co-occurring in a candidate summary and the reference summaries • Count(gram_n): the number of n-grams in the reference summaries (see the sketch below)
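A minimal ROUGE-N sketch that follows the recall formula above with clipped n-gram matches; real evaluations use the official ROUGE toolkit rather than a re-implementation like this.

```python
from collections import Counter

def ngrams(text, n):
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def rouge_n(candidate, references, n=2):
    cand = ngrams(candidate, n)
    match = total = 0
    for ref in references:
        r = ngrams(ref, n)
        match += sum(min(c, cand[g]) for g, c in r.items())  # clipped co-occurrences
        total += sum(r.values())                             # n-grams in references
    return match / total if total else 0.0

# Example usage: bigram recall of 0.5
print(rouge_n("the storm hit florida", ["the storm hit southern florida"], n=2))
```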

  32. Experimental Results

  33. Experimental Results cont'd

  34. Experimental Results cont'd

  35. Conclusion • Traditional methods work in a batch way and are not suitable for incrementally updating summaries • This paper proposes Incremental Hierarchical Clustering Based Document Update Summarization • Incremental Hierarchical Sentence Clustering (IHSC) • Built on the COBWEB algorithm adapted for text • Can perform insert, create, merge, and split operations • IHSC outperforms the traditional methods and is more efficient

  36. Thank you!
