This study introduces a method for quantifying the diversity of documents based on their text content. Motivated by the hypothesis that interdisciplinary research can lead to discoveries faster than work within single disciplines, it combines LDA topic models with Rao's diversity measure. Experimenting with several topic-model settings and datasets, the study evaluates how well the proposed approach measures diversity within documents, showing its potential for broader applications in text analysis. The conclusions highlight the value of the data-driven methodology and suggest temporal document diversity as a direction for future work.
Text-Based Measures of Document Diversity
Date: 2014/02/12
Source: KDD '13
Authors: Kevin Bache, David Newman, and Padhraic Smyth
Advisor: Dr. Jia-Ling Koh
Speaker: Shun-Chen Cheng
Outline: Introduction • Method • Experiment • Conclusions
Introduction
The hypothesis: interdisciplinary research can lead to new discoveries at a rate faster than that of traditional research projects conducted within single disciplines.
Introduction
Task: assign a diversity score to each document.
Goal: quantify how diverse a document is in terms of its content.
Framework: corpus → LDA (learn T topics for the D documents) → D x T matrix (D: document, T: topic) → topic co-occurrence similarity measures → Rao's diversity measure → diversity score for each document.
Topic-based Diversity (1)
• LDA (collapsed Gibbs sampler): use the topic-word assignments from the final iteration of the Gibbs sampler.
• n_dj: the number of word tokens in document d that are assigned to topic j.
• Example D x T matrix:
        t1  t2  t3
  d1     9   0   1    (e.g. n_13 = 1)
  d2     0  10   6
  d3     2  15   8
  d4     1   2  16
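To make the counting step concrete, here is a minimal Python sketch of how the n_dj count matrix could be built from the final Gibbs-sampling iteration, assuming the sampler's output is available as one (document, topic) pair per word token; the function and variable names are illustrative and not from the paper or MALLET.

    import numpy as np

    def build_doc_topic_matrix(token_assignments, num_docs, num_topics):
        # n_dj: number of word tokens in document d assigned to topic j,
        # taken from the final iteration of the collapsed Gibbs sampler.
        counts = np.zeros((num_docs, num_topics), dtype=int)
        for d, j in token_assignments:   # one (doc index, topic index) pair per token
            counts[d, j] += 1
        return counts

    # Toy usage: three tokens of document 0 assigned to topics 0, 0, 2
    print(build_doc_topic_matrix([(0, 0), (0, 0), (0, 2)], num_docs=1, num_topics=3))
    # -> [[2 0 1]]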
Topic-based Diversity (2)
Rao's diversity for a document d:
  div(d) = Σ_i Σ_j (n_di / n_d) (n_dj / n_d) δ(i, j)
• n_dj: the value of entry (d, j) in the D x T matrix
• n_d: the number of word tokens in document d
• δ(i, j): a measure of the distance between topic i and topic j
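A small Python sketch of Rao's diversity as defined above, assuming the D x T count matrix and a T x T topic-distance matrix δ are already available; the toy distance values below are made up for illustration and are not the distances used in the paper.

    import numpy as np

    def rao_diversity(doc_topic_counts, topic_distance):
        # p_dj = n_dj / n_d : per-document topic proportions
        counts = np.asarray(doc_topic_counts, dtype=float)
        p = counts / counts.sum(axis=1, keepdims=True)
        # div(d) = sum_i sum_j p_di * p_dj * delta(i, j), for every document d
        return np.einsum('di,dj,ij->d', p, p, np.asarray(topic_distance, dtype=float))

    counts = [[9, 0, 1], [0, 10, 6], [2, 15, 8], [1, 2, 16]]     # the slide's example matrix
    delta = [[0.0, 1.0, 1.0], [1.0, 0.0, 0.5], [1.0, 0.5, 0.0]]  # toy topic distances
    print(rao_diversity(counts, delta))                          # one diversity score per document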
Topic-based Diversity (3)
Example of Rao's diversity:
        t1  t2  t3
  d1     9   0   1    div(1) = 1.26
  d2     0  10   6    div(2) = 0.04688
  d3     2  15   8    div(3) = 0.09344
  d4     1   2  16    div(4) = 1.557895
Topic Co-occurrence Similarity
Two ways to measure how similar topics i and j are, based on how they co-occur across documents:
• Cosine similarity
• Probability-based similarity
Here n_dj is the value of entry (d, j) in the D x T matrix and N is the number of word tokens in the corpus.
Similarity to Distance
Each similarity measure (cosine and probability-based) is converted into a distance between topics, which serves as δ(i, j) in Rao's diversity.
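The slides do not spell out the formulas, so the following Python sketch is only one plausible reading: cosine similarity between two topics is computed over their count columns in the D x T matrix, and similarity is turned into a distance as 1 - similarity. Both choices are assumptions for illustration, and the probability-based variant is omitted.

    import numpy as np

    def cosine_topic_similarity(doc_topic_counts):
        # Treat each topic as its column of counts in the D x T matrix
        # and compute pairwise cosine similarity between topics.
        cols = np.asarray(doc_topic_counts, dtype=float).T      # T x D
        norms = np.linalg.norm(cols, axis=1, keepdims=True)
        unit = cols / np.clip(norms, 1e-12, None)
        return unit @ unit.T                                    # T x T, values in [0, 1]

    def similarity_to_distance(sim):
        # Assumed conversion for illustration: distance = 1 - similarity.
        return 1.0 - np.asarray(sim, dtype=float)

    counts = [[9, 0, 1], [0, 10, 6], [2, 15, 8], [1, 2, 16]]
    delta = similarity_to_distance(cosine_topic_similarity(counts))  # usable as the distance matrix in Rao's diversity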
Experiment
Datasets:
• PubMed Central Open Access dataset (PubMed)
• NSF Awards from 2007 to 2012 (NSF)
• Association of Computational Linguistics Anthology Network (ACL)
Topic modeling (LDA):
• MALLET
• α: 0.05*(N/D*T), β: 0.01
• 5,000 iterations; keep only the final sample in the chain
• T = 10, 30, 100, and 300 topics
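As a worked example of the hyperparameter setting, reading the slide's α = 0.05*(N/D*T) as 0.05 * N / (D * T) (the grouping is an assumption, since the slide's expression is ambiguous), and using hypothetical corpus sizes:

    def lda_alpha(num_tokens, num_docs, num_topics):
        # Slide's setting, read as alpha = 0.05 * N / (D * T); beta is fixed at 0.01.
        return 0.05 * num_tokens / (num_docs * num_topics)

    # e.g. a corpus with N = 2,000,000 tokens, D = 10,000 documents, T = 100 topics:
    print(lda_alpha(2_000_000, 10_000, 100))   # 0.1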
Pseudo-Documents
• Reason: there is no ground-truth measure of a document's diversity, so pseudo-documents are constructed, half designed to have high diversity and half designed to have low diversity.
• High-diversity pseudo-document: manually select two relatively unrelated journals, A and B, then randomly select an article from A and one from B and combine them into a single pseudo-document.
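A minimal sketch of the high-diversity construction, assuming each journal's articles are available as a list of raw-text strings; combining the two articles by concatenation is an assumption, and the slide does not describe how the low-diversity half is built, so it is not shown.

    import random

    def high_diversity_pseudo_doc(journal_a_articles, journal_b_articles, rng=random):
        # Pair one randomly chosen article from each of two relatively
        # unrelated journals, A and B, into a single pseudo-document.
        article_a = rng.choice(journal_a_articles)
        article_b = rng.choice(journal_b_articles)
        return article_a + "\n" + article_b   # concatenation is an illustrative choice

    # pseudo = high_diversity_pseudo_doc(articles_from_journal_a, articles_from_journal_b)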
Experiment
ROC curve (figure). AUC: area under the ROC curve.
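Since the evaluation labels pseudo-documents as high- or low-diversity and ranks them by their diversity scores, the AUC can be computed as in this sketch (using scikit-learn's roc_auc_score; the labels and scores below are made up for illustration):

    from sklearn.metrics import roc_auc_score

    labels = [1, 1, 0, 0, 1, 0]                     # 1 = high-diversity pseudo-document, 0 = low
    scores = [1.40, 0.90, 0.20, 0.50, 1.10, 0.30]   # illustrative diversity scores
    print(roc_auc_score(labels, scores))            # 1.0 here: every high scores above every low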
Experiment
AUC scores for different diversity measures, based on 1,000 pseudo-documents from PubMed.
Experiment
Evaluating transformations.
Experiment
Most diverse NSF grant proposals.
Conclusions
• Presented an approach for quantifying the diversity of individual documents in a corpus based on their text content.
• The approach is more data-driven than fixed category schemes: it performs the equivalent of learning journal categories by learning topics from text.
• It can be run on any collection of text documents, even without a prior categorization scheme.
• A possible direction for future work is temporal document diversity.