This study introduces a method for quantifying the diversity of documents based on their text content. Motivated by the hypothesis that interdisciplinary research can lead to discoveries faster than work within single disciplines, it combines LDA topic models with Rao's diversity measure. Experimenting with several topic-model settings and datasets, the study evaluates how well the proposed approach measures diversity within documents, showing its potential for broader applications in text analysis. The conclusions highlight the value of the data-driven methodology and suggest temporal document diversity as a direction for future work.
Text-Based Measures of Document Diversity
Date: 2014/02/12
Source: KDD '13
Authors: Kevin Bache, David Newman, and Padhraic Smyth
Advisor: Dr. Jia-Ling Koh
Speaker: Shun-Chen Cheng
Outline: Introduction • Method • Experiment • Conclusions
Introduction
The hypothesis: interdisciplinary research can lead to new discoveries at a rate faster than that of traditional research projects conducted within single disciplines.
Introduction
Task: assign a diversity score to each document.
Goal: quantify how diverse a document is in terms of its content.
Framework: corpus → LDA (learn T topics for the D documents) → D x T matrix (D: document, T: topic) → topic co-occurrence similarity measures → Rao's diversity measure → diversity score for each document.
Topic-based Diversity (1)
• LDA (collapsed Gibbs sampler): use the topic-word assignments from the final iteration of the Gibbs sampler.
• n_dj: the number of word tokens in document d that are assigned to topic j.
• Example D x T matrix:
        t1  t2  t3
  d1     9   0   1    (e.g. n_13 = 1)
  d2     0  10   6
  d3     2  15   8
  d4     1   2  16
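To make the counting step concrete, here is a minimal Python sketch of how the n_dj count matrix could be built from the final Gibbs-sampling iteration, assuming the sampler's output is available as one (document, topic) pair per word token; the function and variable names are illustrative and not from the paper or MALLET.

    import numpy as np

    def build_doc_topic_matrix(token_assignments, num_docs, num_topics):
        # n_dj: number of word tokens in document d assigned to topic j,
        # taken from the final iteration of the collapsed Gibbs sampler.
        counts = np.zeros((num_docs, num_topics), dtype=int)
        for d, j in token_assignments:   # one (doc index, topic index) pair per token
            counts[d, j] += 1
        return counts

    # Toy usage: three tokens of document 0 assigned to topics 0, 0, 2
    print(build_doc_topic_matrix([(0, 0), (0, 0), (0, 2)], num_docs=1, num_topics=3))
    # -> [[2 0 1]]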
Topic-based Diversity (2)
Rao's diversity for a document d:
  div(d) = Σ_i Σ_j (n_di / n_d) (n_dj / n_d) δ(i, j)
• n_dj: the value of entry (d, j) in the D x T matrix
• n_d: the number of word tokens in document d
• δ(i, j): a measure of the distance between topic i and topic j
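A small Python sketch of Rao's diversity as defined above, assuming the D x T count matrix and a T x T topic-distance matrix δ are already available; the toy distance values below are made up for illustration and are not the distances used in the paper.

    import numpy as np

    def rao_diversity(doc_topic_counts, topic_distance):
        # p_dj = n_dj / n_d : per-document topic proportions
        counts = np.asarray(doc_topic_counts, dtype=float)
        p = counts / counts.sum(axis=1, keepdims=True)
        # div(d) = sum_i sum_j p_di * p_dj * delta(i, j), for every document d
        return np.einsum('di,dj,ij->d', p, p, np.asarray(topic_distance, dtype=float))

    counts = [[9, 0, 1], [0, 10, 6], [2, 15, 8], [1, 2, 16]]     # the slide's example matrix
    delta = [[0.0, 1.0, 1.0], [1.0, 0.0, 0.5], [1.0, 0.5, 0.0]]  # toy topic distances
    print(rao_diversity(counts, delta))                          # one diversity score per document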
Topic-based Diversity (3)
Example of Rao's diversity:
        t1  t2  t3
  d1     9   0   1    div(1) = 1.26
  d2     0  10   6    div(2) = 0.04688
  d3     2  15   8    div(3) = 0.09344
  d4     1   2  16    div(4) = 1.557895
Topic Co-occurrence Similarity
Two ways to measure how similar topics i and j are, based on how they co-occur across documents:
• Cosine similarity
• Probability-based similarity
Here n_dj is the value of entry (d, j) in the D x T matrix and N is the number of word tokens in the corpus.
Similarity to Distance
Each similarity measure (cosine and probability-based) is converted into a distance between topics, which serves as δ(i, j) in Rao's diversity.
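The slides do not spell out the formulas, so the following Python sketch is only one plausible reading: cosine similarity between two topics is computed over their count columns in the D x T matrix, and similarity is turned into a distance as 1 - similarity. Both choices are assumptions for illustration, and the probability-based variant is omitted.

    import numpy as np

    def cosine_topic_similarity(doc_topic_counts):
        # Treat each topic as its column of counts in the D x T matrix
        # and compute pairwise cosine similarity between topics.
        cols = np.asarray(doc_topic_counts, dtype=float).T      # T x D
        norms = np.linalg.norm(cols, axis=1, keepdims=True)
        unit = cols / np.clip(norms, 1e-12, None)
        return unit @ unit.T                                    # T x T, values in [0, 1]

    def similarity_to_distance(sim):
        # Assumed conversion for illustration: distance = 1 - similarity.
        return 1.0 - np.asarray(sim, dtype=float)

    counts = [[9, 0, 1], [0, 10, 6], [2, 15, 8], [1, 2, 16]]
    delta = similarity_to_distance(cosine_topic_similarity(counts))  # usable as the distance matrix in Rao's diversity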
Experiment
Datasets:
• PubMed Central Open Access dataset (PubMed)
• NSF Awards from 2007 to 2012 (NSF)
• Association of Computational Linguistics Anthology Network (ACL)
Topic modeling (LDA):
• MALLET
• α: 0.05*(N/D*T), β: 0.01
• 5,000 iterations; keep only the final sample in the chain
• T = 10, 30, 100, and 300 topics
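As a worked example of the hyperparameter setting, reading the slide's α = 0.05*(N/D*T) as 0.05 * N / (D * T) (the grouping is an assumption, since the slide's expression is ambiguous), and using hypothetical corpus sizes:

    def lda_alpha(num_tokens, num_docs, num_topics):
        # Slide's setting, read as alpha = 0.05 * N / (D * T); beta is fixed at 0.01.
        return 0.05 * num_tokens / (num_docs * num_topics)

    # e.g. a corpus with N = 2,000,000 tokens, D = 10,000 documents, T = 100 topics:
    print(lda_alpha(2_000_000, 10_000, 100))   # 0.1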
Pseudo-Documents
• Reason: there is no ground-truth measure of a document's diversity, so pseudo-documents are constructed, half designed to have high diversity and half designed to have low diversity.
• High-diversity pseudo-document: manually select two relatively unrelated journals, A and B, then randomly select an article from A and one from B and combine them into a single pseudo-document.
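A minimal sketch of the high-diversity construction, assuming each journal's articles are available as a list of raw-text strings; combining the two articles by concatenation is an assumption, and the slide does not describe how the low-diversity half is built, so it is not shown.

    import random

    def high_diversity_pseudo_doc(journal_a_articles, journal_b_articles, rng=random):
        # Pair one randomly chosen article from each of two relatively
        # unrelated journals, A and B, into a single pseudo-document.
        article_a = rng.choice(journal_a_articles)
        article_b = rng.choice(journal_b_articles)
        return article_a + "\n" + article_b   # concatenation is an illustrative choice

    # pseudo = high_diversity_pseudo_doc(articles_from_journal_a, articles_from_journal_b)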
Experiment
ROC curve (figure). AUC: area under the ROC curve.
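Since the evaluation labels pseudo-documents as high- or low-diversity and ranks them by their diversity scores, the AUC can be computed as in this sketch (using scikit-learn's roc_auc_score; the labels and scores below are made up for illustration):

    from sklearn.metrics import roc_auc_score

    labels = [1, 1, 0, 0, 1, 0]                     # 1 = high-diversity pseudo-document, 0 = low
    scores = [1.40, 0.90, 0.20, 0.50, 1.10, 0.30]   # illustrative diversity scores
    print(roc_auc_score(labels, scores))            # 1.0 here: every high scores above every low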
Experiment
AUC scores for different diversity measures, based on 1,000 pseudo-documents from PubMed.
Experiment
Evaluating transformations.
Experiment
Most diverse NSF grant proposals.
Conclusions
• Presented an approach for quantifying the diversity of individual documents in a corpus based on their text content.
• The approach is more data-driven than fixed category schemes: it performs the equivalent of learning journal categories by learning topics from text.
• It can be run on any collection of text documents, even without a prior categorization scheme.
• A possible direction for future work is temporal document diversity.