Jaegul Choo*, Barry L. Drake†, and Haesun Park* (*Georgia Institute of Technology)

Visual Analytics for Interactive Exploration of Large-Scale Documents via Nonnegative Matrix Factorization. Jaegul Choo*, Barry L. Drake†, and Haesun Park* (*Georgia Institute of Technology, †Georgia Tech Research Institute). Big Data Innovators Gathering (BIG) 2014.


Presentation Transcript


  1. Visual Analytics for Interactive Exploration of Large-Scale Documents via Nonnegative Matrix Factorization. Jaegul Choo*, Barry L. Drake†, and Haesun Park* (*Georgia Institute of Technology, †Georgia Tech Research Institute). Big Data Innovators Gathering (BIG) 2014

  2. What is Visual Analytics? Data Mining; Visualization

  3. What is Visual Analytics? Leveraging Both Worlds. Visual Analytics: Data Mining + Visualization

  4. Visual Analytics for Large-Scale Documents. UTOPIAN: User-driven Topic Modeling based on Interactive NMF (topic merging, keyword-induced topic creation, doc-induced topic creation, topic splitting). VisIRR: Information Retrieval and Personalized Recommender System

  5. Motivation: Too Many Documents to Read. Product reviews • Which tablet to buy? • iPad (2,000 reviews) vs. Galaxy Tab (1,300 reviews). Research papers • Which sub-area in data mining to focus on? • Thousands of new papers every year. Patent search. Many other applications.

  6. Topic Modeling: Summarizing Documents. Documents: Document 1, Document 2, Document 3, Document 4, … Keywords: brain, evolve, dna, gene, nerve, neuron, life, organism, …

  7. Topic Modeling: Summarizing Documents. Documents: Document 1, Document 2, Document 3, Document 4, … Topics: Topic 1, Topic 2, Topic 3 (topic: distribution over keywords). Keywords: brain, evolve, dna, gene, nerve, neuron, life, organism, …

  8. Topic Modeling: Summarizing Documents. Documents: Document 1, Document 2, Document 3, Document 4, … (document: distribution over topics). Topics: Topic 1, Topic 2, Topic 3 (topic: distribution over keywords). Keywords: brain, evolve, dna, gene, nerve, neuron, life, organism, …

  9. Nonnegative Matrix Factorization (NMF). Low-rank approximation via matrix factorization: $A \approx WH$, computed as $\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F$. Why nonnegativity constraints? Better interpretation (vs. better approximation, e.g., SVD).
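
The factorization on this slide can be illustrated with a minimal NumPy sketch. The slide does not say which solver is used for plain NMF, so this example swaps in the classical Lee-Seung multiplicative updates as an assumption; the function name `nmf_multiplicative`, the toy matrix, and the iteration count are all hypothetical choices made for illustration.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate a nonnegative A (m x n) as W (m x k) @ H (k x n).

    Uses Lee-Seung multiplicative updates for the Frobenius-norm objective.
    This solver is an assumption; the slides do not name one for plain NMF.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a nonnegative ratio, so W and H stay >= 0.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy nonnegative matrix with two obvious blocks (illustrative data only).
A = np.array([[1.0, 2.0, 0.0, 0.0],
              [2.0, 4.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 1.0],
              [0.0, 0.0, 6.0, 2.0]])

W, H = nmf_multiplicative(A, k=2)
residual = np.linalg.norm(A - W @ H, "fro")
print("Frobenius residual:", residual)
```

Both factors stay nonnegative by construction, which is the property the slide credits for better interpretability compared with an unconstrained low-rank method such as the SVD.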

  10. NMF as Topic Modeling: $A \approx WH$. Documents: Document 1, Document 2, Document 3, Document 4, … (document: distribution over topics). Topics: Topic 1, Topic 2, Topic 3 (topic: distribution over keywords). Keywords: brain, evolve, dna, gene, nerve, neuron, life, organism, …
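
To make the topic-modeling reading of $A \approx WH$ concrete, here is a small sketch under stated assumptions: rows of $A$ are keywords and columns are documents, so each column of $W$ is read as a topic (a distribution over keywords) and each column of $H$ as a document (a distribution over topics). The counts in `A` are invented for illustration, only the keyword list comes from the slide, and the multiplicative-update solver is again an assumed stand-in.

```python
import numpy as np

vocab = ["brain", "evolve", "dna", "gene", "nerve", "neuron", "life", "organism"]

# Hypothetical keyword-by-document counts (rows: keywords, columns: documents 1-4).
A = np.array([
    [3, 0, 2, 0],  # brain
    [0, 2, 0, 3],  # evolve
    [0, 3, 0, 1],  # dna
    [0, 4, 0, 2],  # gene
    [2, 0, 3, 0],  # nerve
    [4, 0, 3, 0],  # neuron
    [0, 1, 0, 4],  # life
    [0, 1, 0, 3],  # organism
], dtype=float)

def nmf(A, k, n_iter=500, eps=1e-10, seed=0):
    """Plain NMF via multiplicative updates (an assumed solver, for illustration)."""
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], k)) + eps
    H = rng.random((k, A.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

W, H = nmf(A, k=2)

# Topic: distribution over keywords (normalize each column of W).
topic_keywords = W / W.sum(axis=0, keepdims=True)
# Document: distribution over topics (normalize each column of H).
doc_topics = H / H.sum(axis=0, keepdims=True)

for t in range(topic_keywords.shape[1]):
    top = np.argsort(topic_keywords[:, t])[::-1][:3]
    print(f"Topic {t + 1}:", [vocab[i] for i in top])

dominant = doc_topics.argmax(axis=0)
print("Dominant topic per document:", dominant + 1)
```

The column normalizations are used only for reading the factors as distributions; they are applied after the factorization and do not change how $W$ and $H$ were computed.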

  11. Why NMF (instead of LDA)? Consistency from Multiple Runs. Data sets: 20 newsgroup data set; InfoVis/VAST paper data set. Measure: documents’ topical membership changes among 10 runs.

  12. Why NMF (instead of LDA)? Empirical Convergence. InfoVis/VAST paper data set. Plot labels: 10 minutes, 48 seconds; NMF, LDA. Measure: documents’ topical membership changes between iterations.
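
The convergence measure named on this slide, documents' topical membership changes between iterations, could be computed roughly as follows. The slides do not define membership precisely, so this sketch assumes a hard assignment: each document belongs to the topic with the largest entry in its column of $H$. The function name and the two example iterates are hypothetical.

```python
import numpy as np

def membership_changes(H_prev, H_curr):
    """Count documents whose dominant topic differs between two iterates of H (k x n).

    Assumes hard membership = argmax over topics for each document column.
    """
    prev_topics = np.argmax(H_prev, axis=0)
    curr_topics = np.argmax(H_curr, axis=0)
    return int(np.sum(prev_topics != curr_topics))

# Two hypothetical successive iterates for 2 topics and 3 documents.
H_prev = np.array([[0.9, 0.4, 0.1],
                   [0.1, 0.6, 0.9]])
H_curr = np.array([[0.8, 0.7, 0.2],
                   [0.2, 0.3, 0.8]])

changed = membership_changes(H_prev, H_curr)
print("Documents that changed topic:", changed)
```

Tracking this count per iteration gives a curve that drops toward zero as the assignments settle, which is one plausible reading of the plot described here.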

  13. NMF vs. LDA: Topic Summary (Top Keywords). InfoVis/VAST paper data set. • Topics are more consistent in NMF than in LDA. • Topic quality is comparable between NMF and LDA.

  14. UTOPIAN: User-Driven Topic Modeling Based on Interactive NMF [Choo et al., TVCG’13]. Keyword-induced topic creation; topic merging; doc-induced topic creation; topic splitting.

  15. Visualization Example: Car Reviews Topic summaries are NOT perfect. • UTOPIAN allows user interactions for improving them.

  16. Weakly Supervised NMF: Supporting User Interactions. Weakly supervised NMF [Choo et al., DMKD, accepted with rev.]: $\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F^2 + \alpha \|(W - W_r) M_W\|_F^2 + \beta \|M_H (H - D_H H_r)\|_F^2$. $W_r$, $H_r$: reference matrices for $W$ and $H$ (user input). $M_W$, $M_H$: diagonal matrices for weighting/masking columns and rows of $W$ and $H$. Algorithm: block-coordinate descent framework.
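
As a reading aid for the objective above, the sketch below only evaluates the three terms for given matrices; it does not implement the block-coordinate descent solver. The shapes are assumptions: $W$ is m x k, $H$ is k x n, and $M_W$, $M_H$, $D_H$ are taken to be k x k diagonal matrices supplied by the caller ($D_H$ is not defined on the slide, so it is treated as a given input). The function name and example values are hypothetical.

```python
import numpy as np

def ws_nmf_objective(A, W, H, W_r, H_r, M_W, M_H, D_H, alpha, beta):
    """Evaluate ||A - WH||_F^2 + alpha ||(W - W_r) M_W||_F^2 + beta ||M_H (H - D_H H_r)||_F^2."""
    fit = np.linalg.norm(A - W @ H, "fro") ** 2
    w_penalty = np.linalg.norm((W - W_r) @ M_W, "fro") ** 2
    h_penalty = np.linalg.norm(M_H @ (H - D_H @ H_r), "fro") ** 2
    return fit + alpha * w_penalty + beta * h_penalty

W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
H = np.array([[1.0, 2.0], [3.0, 4.0]])
A = W @ H  # exact fit, so the first term is zero

# Reference for W differs in both columns, but M_W masks out the second column.
W_r = np.array([[2.0, 5.0], [0.0, 5.0], [1.0, 5.0]])
M_W = np.diag([1.0, 0.0])

# Reference for H matches H exactly, so the third term is zero.
H_r = H.copy()
M_H = np.eye(2)
D_H = np.eye(2)

value = ws_nmf_objective(A, W, H, W_r, H_r, M_W, M_H, D_H, alpha=2.0, beta=1.0)
print("Objective value:", value)
```

The zero on the diagonal of `M_W` shows the masking role: the second topic is left unconstrained even though its reference column is far from the current one, so only the user-touched topic contributes a penalty.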

  17. Interaction Demo Video (http://tinyurl.com/UTOPIAN2013). InfoVis-VAST paper data: before interaction; after topic splitting (triangle) and topic merging (circle).

  18. VisIRR: Information Retrieval and Personalized Recommender System

  19. Features: Efficient Large-scale Data Processing. Document corpus: ~400,000 academic papers in CS. Data management: structured data (author, year, venue, keywords, citation/reference count); unstructured data (bag-of-words vectors of title, abstract, keywords); graph data (content, citation, and co-authorship). Efficient data handling: dynamic loading from disk to memory via a cache-like strategy; scalable data expansion in O(n).

  20. Features: Personalized Recommendation. Works based on user preference on documents • Preference scale of 1 (highly dislike) to 5 (highly like) • Various recommendation schemes • Based on content, citation network, and co-authorship. Algorithm • Preference propagation on graph using heat kernel: $r_\alpha = \alpha \sum_k (1 - \alpha)^k f W^k$ • $r_\alpha$ is a recommendation score vector with a control parameter $\alpha$, $f$ is a user-assigned rating, and $W$ is an input graph.
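
The propagation formula above can be sketched as a truncated sum. Several details are assumptions, not taken from the slide: $f$ is treated as a row vector of ratings, $W$ as a row-normalized adjacency matrix, and the infinite sum is cut off after `n_terms` terms. The function name and the three-document path graph are hypothetical.

```python
import numpy as np

def propagate_preferences(f, W, alpha, n_terms=50):
    """Approximate r_alpha = alpha * sum_k (1 - alpha)^k f W^k with a truncated sum."""
    r = np.zeros_like(f, dtype=float)
    term = f.astype(float)   # f W^0
    weight = alpha           # alpha * (1 - alpha)^0
    for _ in range(n_terms):
        r += weight * term
        term = term @ W           # advance to f W^(k+1)
        weight *= (1.0 - alpha)   # advance to alpha * (1 - alpha)^(k+1)
    return r

# Hypothetical graph: document 0 -- document 1 -- document 2 (a path).
adjacency = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])
W = adjacency / adjacency.sum(axis=1, keepdims=True)  # row-normalize (assumption)

f = np.array([5.0, 0.0, 0.0])  # user rates document 0 as 'highly like'
r = propagate_preferences(f, W, alpha=0.5)
print("Recommendation scores:", r)
```

In this toy graph the direct neighbor of the rated document scores higher than the document two hops away, which matches the intuition of preference spreading along graph edges with a decay controlled by $\alpha$.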

  21. VisIRR Demo: Citation-based Recommendation (http://tinyurl.com/VisIRR) • Preference-assigned item as ‘highly like’: ‘Automatic Classification System for the Diagnosis of Alzheimer Disease Using Component-Based SVM Aggregations’ • Most of the recommended items are highly cited. • Computational zoom-in shows sub-areas relevant to the article.

  22. VisIRR Demo: Co-authorship-based Recommendation (http://tinyurl.com/VisIRR) • Preference-assigned item as ‘highly like’: ‘Automatic Classification System for the Diagnosis of Alzheimer Disease Using Component-Based SVM Aggregations’ • It shows other areas of the authors of this paper. Panels: retrieved + recommended items; computational zoom-in on recommended items.

  23. Interested in learning Micro-Financing Analysis in Kiva.org? Check out my presentation at Room 104, Wed 4pm

  24. Thank you! Jaegul Choo, jaegul.choo@cc.gatech.edu (Currently on the Academic Job Market). UTOPIAN: User-driven Topic Modeling based on Interactive NMF (topic merging, keyword-induced topic creation, doc-induced topic creation, topic splitting). VisIRR: Information Retrieval and Personalized Recommender System. Micro-Financing Analysis in Kiva.org: Room 104, Wed 4pm. Selected Papers • Choo et al., Document Topic Modeling and Discovery in Visual Analytics via Nonnegative Matrix Factorization, TVCG, 2013 • Choo et al., VisIRR: Interactive Visual Information Retrieval and Recommendation for Large-scale Document Data, Tech Report, Georgia Tech, 2013

  25. UTOPIAN: Interactions and Key Techniques. • Visualization • Supervised t-SNE • Topic modeling • NMF. Interaction • Refining topic keywords • Merging topics • Splitting a topic • Creating new topics from seed documents/keywords. Weakly-supervised NMF. Per-iteration Visualization Framework.

  26. Supervised t-SNE: Visualizing documents. Supervised t-SNE • $d(x_i, x_j) \leftarrow \alpha \cdot d(x_i, x_j)$ if $x_i$ and $x_j$ belong to the same topic (e.g., $\alpha = 0.3$). Original t-SNE • Documents do not have clear topic clusters.
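
The distance adjustment behind supervised t-SNE can be sketched as a preprocessing step on a pairwise distance matrix. The slide does not specify the base distance, so Euclidean distance is an assumption here, and the function name, points, and topic labels are hypothetical. The adjusted matrix would then be passed to a t-SNE implementation that accepts precomputed distances.

```python
import numpy as np

def supervised_distances(X, topics, alpha=0.3):
    """Pairwise Euclidean distances, scaled by alpha for same-topic pairs.

    Implements d(x_i, x_j) <- alpha * d(x_i, x_j) when topics[i] == topics[j].
    """
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    same_topic = topics[:, None] == topics[None, :]
    D[same_topic] *= alpha  # the diagonal is also scaled but stays zero
    return D

# Three hypothetical documents in 2-D; the first two share a topic.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])
topics = np.array([0, 0, 1])

D = supervised_distances(X, topics, alpha=0.3)
print(D)
```

Shrinking same-topic distances pulls documents of one topic closer together before the embedding is computed, which is how the slide's supervised variant produces clearer topic clusters than the original t-SNE.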

  27. PIVE (Per-iteration Visualization Environment): Integration Methodology of Iterative Methods for Real-Time Interactive Visualization [Choo et al., VAST’14, to submit]. Standard approach; PIVE approach.
