Text and Documents - PowerPoint PPT Presentation

Presentation Transcript

  1. Text and Documents CS 4460 - Information Visualization Jim Foley, some material courtesy John Stasko. Some examples from Marti Hearst, Search User Interfaces, Cambridge University Press, 2009

  2. Text is Everywhere • We (may) use documents as the primary information artifact in our lives • Our access to documents/information has grown tremendously in recent years • Internet infrastructure • WWW • Google, Yahoo, Bing • Digital libraries • And the amount of information has grown!

  3. The Key Question for InfoVis • How can InfoVis help users gather, understand, and use information from • Document collections (macro-level)? • Such as everything on the web • Individual documents (micro-level)? • Such as a thesaurus, or a book or speech • Shakespeare, Bible, Koran, Torah, …

  4. Example Macro-level Tasks • Which documents contain text on topic XYZ? • Are there other documents that might be close enough to be worthwhile? • How do documents fit into a larger context? • What documents might be of interest to me? • Which documents have a negative/angry tone?

  5. Example Micro-level Tasks • What are the main themes of a document? • How are certain words or themes distributed through a document? • How does one document compare to or relate to other documents? • In what contexts is the word “inflation” used with the word “spending”?

  6. Related Topic – IR • Information Retrieval • The search process that locates particular entities based on selection criteria • Google search algorithms • Library catalog search • We will not discuss IR algorithms • We will discuss how InfoVis can help • Understand what can be retrieved • Understand what has been retrieved • Browse • Formulate more precise queries • Etc.

  7. Related Topic – Sensemaking • Making ‘sense’ of a collection of docs • Relate facts/info in document collection to create an understanding of a topic, or to ‘tell a story’ based on the facts • Discussed more in visual analytics lecture • InfoVis can make sensemaking more rapid

  8. Challenge • Text is nominal data with a huge (infinite) cardinality • Does not map to geometric representations as easily as ordinal and quantitative data • The “Raw data --> Data Table” mapping step now becomes very important – indeed, becomes central

  9. Process for Text/Doc InfoVis • Raw Data (documents) --> Analysis Algorithms --> Data tables for InfoVis --> Visualization • Analysis Algorithms: decomposition, statistics, similarity, clustering, relevance, thesaurus, word count, KWIC, etc. • Data tables for InfoVis: vectors, keywords, etc. • Visualization: 2D, 3D display

  10. Challenge (Cont’d) • Unstructured text does NOT have any explicit meta-data. • Just that infinitely big collection of nominal data • Meta-data is sometimes extracted from raw text • What Jigsaw calls “entity extraction” • Google News extracts dates • Contrast to structured text of an on-line library with explicit meta-data such as • Author name • Year of publication • Title • ISBN number • Library of Congress number • Publisher name • Etc. • This meta-information is itself mostly nominal but has much lower cardinality than for Google-style free text search, which simplifies and structures the retrieval process. • We will look at a few examples in the structured meta-data space

  11. Document Collections • Problem or challenge is how to present the contents/semantics/themes/etc of the documents to someone who does not have time to read them all • Who are the users? • How often do YOU use Google/Yahoo/Bing?? • Students, researchers, news people, everyday people, CIA/FBI

  12. Outline • Macro-level – searching larger document collections • Unstructured – no meta-data • Structured – explicit meta-data • Search history • Micro-level • Inter-document methods for smaller document collections • How do retrieved documents relate to a query? • How do retrieved documents relate to one another? • Intra-document methods • Word usage, grammatical style, … • With the caveat that some methods can be used in multiple ways

  13. Macro-Level: Large Unstructured • LARGE does not mean entire WWW!! • A number of systems endeavor to give a “big picture view” – the “gist” of a large collection of documents • Themescape • WebThemes • Galaxies • Feature Maps/WEBSOM • (Kohonen SOM-Self Organizing Maps) • ThemeRiver

  14. PNNL • Group has developed a number of visualization techniques for document collections • Galaxies • ThemeScapes • ThemeRiver • WebTheme • Wise et al., InfoVis ’95 • www.pnl.gov/infoviz

  15. Themescape • Height/color encode document density

  16. ThemeRiver

  17. ThemeRiver Video

  18. WebTheme

  19. Galaxies • Presentation of documents where similar ones cluster together

  20. Geo-like Maps • But not very useful; no longer offered as a product

  21. Feature Maps (SOMs) • SOMs = Self Organizing Maps • Developed by Teuvo Kohonen • Thus sometimes called Kohonen Maps • Expresses complex, non-linear relationships between high-dimensional data items as simple geometric relationships on a 2-D display • Creates clusters of like things • Uses neural network techniques • Lin, Visualization ’92

  22. WEBSOM • Self-organizing map of Net newsgroups and postings • Think of it as a top view of a ThemeScape, but organized with a different method • http://websom.hut.fi/websom/milliondemo/html/root.html (dead link)

  23. Another SOM • ai2.bpa.arizona.edu/ent/ (dead link)

  24. Another SOM • Xia Lin • faculty.cis.drexel.edu/Sitemap/ (dead link)

  25. Another SOM

  26. ThemeScapes vs. SOMs • Self-organizing maps (Kohonen) don’t reflect density of regions all that well • ThemeScapes use a 3D representation • Height represents density or number of documents in region • Could think of a SOM as a top view of a ThemeScape

  27. Basic Idea to Create Maps • Break each document into its words • Two documents are “similar” if they share many words • See later aside on Vector Space Analysis • Use a mass-spring (graph-layout-like) algorithm to cluster similar documents together and push dissimilar documents far apart
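
The mass-spring idea on this slide can be sketched in a few lines. This is a minimal illustration, not the algorithm used by ThemeScape or any other system named here. It assumes a precomputed similarity matrix `sim` with values in [0, 1]; the function name `spring_layout`, the rest length `1 - similarity`, and the step size are all assumptions made for the example.

```python
import math
import random

def spring_layout(sim, iterations=500, step=0.05, seed=0):
    """Place n documents in 2D so that similar documents end up close together.

    sim[i][j] is an assumed similarity in [0, 1]. Each pair of documents is
    joined by a spring whose rest length is 1 - sim[i][j].
    """
    rng = random.Random(seed)           # fixed seed so the layout is repeatable
    n = len(sim)
    pos = [[rng.random(), rng.random()] for _ in range(n)]
    for _ in range(iterations):
        for i in range(n):
            fx = fy = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                dist = math.hypot(dx, dy) or 1e-9   # avoid division by zero
                rest = 1.0 - sim[i][j]              # similar docs: short spring
                stretch = dist - rest               # > 0 pulls together, < 0 pushes apart
                fx += stretch * dx / dist
                fy += stretch * dy / dist
            pos[i][0] += step * fx
            pos[i][1] += step * fy
    return pos
```

With three documents where the first two are highly similar and the third is unlike both, the first two should land close together and the third farther away.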

  28. Map Attributes – What Have We Seen? • Colored areas correspond to different concepts in collection • Size of area corresponds to importance of concept relative to other concepts • Neighboring regions indicate commonalities in concepts • Adjacencies and sizes are computed from the documents themselves • Dots in regions can be used to represent documents • ResultMaps that we will see later share some of these properties – but their structure is predefined

  29. Map Review • Are these techniques useful? • Strengths/weaknesses? • Useful for entire set of docs on WWW? • So how large is large? • What determines viable size for each system/method?

  30. Aside - How to Characterize Documents – Vector Space Analysis • How to compare the similarity of two documents? Here’s one way: • Step 1, for each document • Make a list of each unique word in the document • Throw out common words (a, an, the, …) • Make different forms the same (bake, bakes, baked) • Store a count of how many times each word appeared • Alphabetize, make into a vector • One per document
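
Step 1 can be sketched as follows. The stop-word list and the suffix-stripping rule are deliberately tiny assumptions made for illustration (real systems use larger lists and a proper stemmer), and the names `crude_stem`, `term_counts`, and `to_vectors` are hypothetical.

```python
import re
from collections import Counter

# Assumed, deliberately short stop-word list.
STOP_WORDS = {"a", "an", "the", "and", "of", "in", "to", "is"}

def crude_stem(word):
    """Strip one common ending so that bake, bakes, baked share a stem."""
    for suffix in ("ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_counts(text):
    """Count stemmed, non-stop-word terms in one document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)

def to_vectors(docs):
    """Return (alphabetized vocabulary, one count vector per document)."""
    counts = [term_counts(d) for d in docs]
    vocab = sorted(set().union(*counts))
    vectors = [[c[t] for t in vocab] for c in counts]
    return vocab, vectors

vocab, vectors = to_vectors(["The baker bakes bread", "She baked a cake"])
```

Here `bakes` and `baked` both reduce to the stem `bak`, so the two example documents share exactly one vector component.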

  31. Aside - Vector Space Analysis • To compare two docs, determine how closely two vectors go in the same direction • Step 2 - form inner (dot) product of each doc’s vector with every other vector • Gives similarity of each document to every other one • Step 3 - use mass-spring layout algorithm to position representations of each document • Dot product => closeness • Themescape makes mountains from clusters • Some similarities to how search engines work
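
Step 2 can be sketched as below. The slide speaks of the dot product; dividing by the vector lengths (cosine similarity) is an added assumption here, so that the score measures direction only and long documents do not dominate. Function names are hypothetical.

```python
import math

def dot(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product normalized by vector lengths; 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def similarity_matrix(vectors):
    """Similarity of each document vector to every other one."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]
```

Two vectors that share no terms score 0, and two vectors pointing in the same direction score 1 (up to floating-point rounding), regardless of document length.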

  32. Aside – But Not All Words Are Equal • Not all terms or words are equally useful • Often apply TF-IDF • Term Frequency, Inverse Document Frequency • Weight of a word goes up if it appears often in a document, but not often in the collection
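
A minimal sketch of one common TF-IDF variant, weight = (count in document) × log(N / number of documents containing the term). The slide does not say which variant is meant, so this formula is an assumption, as is the function name `tfidf`. Documents are assumed to be already tokenized.

```python
import math
from collections import Counter

def tfidf(docs_tokens):
    """Return one {term: weight} dict per document.

    docs_tokens is a list of token lists. A term that appears in every
    document gets weight 0 because log(N / N) = 0.
    """
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))    # count each term once per document
    weights = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

weights = tfidf([
    ["inflation", "inflation", "spending"],
    ["spending", "taxes"],
    ["spending", "jobs"],
])
```

In this toy collection "spending" appears in every document and so carries no weight, while "inflation" appears twice in one document and nowhere else, giving it the highest weight.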

  34. Understanding Small Information Spaces • SMART – System for the Mechanical Analysis and Retrieval of Text • VIBE • Text Themes • SQWID

  35. SMART System • Uses vector space model for documents • May break document into chapters and sections and deal with those as atoms • Plot document atoms on circumference of circle • Atom - document, or section, or paragraph • Draw line between items if their similarity exceeds some threshold value • Salton et al., Automatic Analysis, Theme Generation, and Summarization of Machine-Readable Texts, Science, June 1994

  36. SMART System • Four documents shown • Lines give similarity between documents, if above 0.20 • Items evenly spaced • Doesn’t give viewer an idea of how big each section/document is • Very early system by Jerry Salton, the father of Information Retrieval
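
The relation map described on the two SMART slides above (atoms evenly spaced on a circle, a line wherever similarity exceeds a threshold) can be sketched as follows. This is an illustration of the idea, not Salton's implementation; the function name `relation_map` is hypothetical, and the similarity matrix is assumed to be precomputed.

```python
import math

def relation_map(sim, threshold=0.20):
    """Return (positions, edges) for a circular relation map.

    positions: one (x, y) point per atom, evenly spaced on the unit circle.
    edges: (i, j, similarity) for each pair whose similarity exceeds threshold.
    """
    n = len(sim)
    positions = [
        (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i in range(n)
    ]
    edges = [
        (i, j, sim[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if sim[i][j] > threshold
    ]
    return positions, edges
```

Because the atoms are evenly spaced, the layout carries no information about atom size, which is the limitation the slide points out.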

  37. SMART – Another Example • Connections between paragraphs in a single document • No weights shown • Clutter problem • How about dynamic query on weights?

  38. SMART – Another Example • Quoting Salton: “The convex graph structure reflects a homogeneous treatment of the topic; in this case, the "Smoking" article emphasizes the health problems connected with smoking and the difficulties that arise when people attempt to quit smoking. For a homogeneous map such as this, it should be easy to determine the basic text content by looking at only a few carefully chosen paragraphs.”

  39. SMART – Another Example • Again quoting Salton: “In contrast, consider … (this graph with) … the same similarity threshold of 0.30. This map is much less dense; there are many outliers consisting of a single node only, and there is a disconnected component that includes paragraphs 2 and 3 of section 5. Clearly, the "Symphony" topic does not receive the same homogeneous treatment in the encyclopedia as "Smoking," and a determination of text content by selectively looking at particular text excerpts is much more problematic in this case.”

  40. SMART – Refined Design • Four documents depicted by arcs • Arc length => document length • Paragraph-level similarities indicated by lines • Paragraph position shown within document arc • Proportional to document length • Links at correct position in document

  41. SMART – Text Themes • Look for sets of regions in a document (or sets of documents) that all have a common theme • Closely related to each other, but different from the rest • Need to run clustering process

  42. Algorithm • Recognize triangles in relation maps • Group of 3 atoms, each related, with edges above threshold • Make a new vector that is the average of the 3 • Triangles merged whenever averaged vectors are sufficiently similar (i.e., heading in the same direction)
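
The triangle-merging steps above can be sketched as follows. This is a simplified reading of the slide, not the published SMART procedure: cosine similarity is assumed as the similarity measure, a merged theme's vector is assumed to be the average over all of its member atoms, and the names `find_triangles`, `centroid`, and `merge_themes` are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    d = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return d / norm if norm else 0.0

def find_triangles(vectors, threshold):
    """All groups of 3 atoms whose pairwise similarities exceed threshold."""
    return [
        tri for tri in combinations(range(len(vectors)), 3)
        if all(cosine(vectors[a], vectors[b]) > threshold
               for a, b in combinations(tri, 2))
    ]

def centroid(vectors, members):
    """Average vector of the given atoms."""
    dims = len(vectors[0])
    return [sum(vectors[m][k] for m in members) / len(members) for k in range(dims)]

def merge_themes(vectors, threshold, merge_threshold):
    """Merge triangles whose averaged vectors point in nearly the same direction."""
    themes = [set(tri) for tri in find_triangles(vectors, threshold)]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(themes)), 2):
            if cosine(centroid(vectors, themes[a]),
                      centroid(vectors, themes[b])) >= merge_threshold:
                themes[a] |= themes[b]      # absorb theme b into theme a
                del themes[b]
                merged = True
                break                       # restart with the updated theme list
    return themes
```

Each returned set is one theme, given as the indices of the atoms it covers.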

  43. SMART – Text Theme Example • Using the preceding example, four themes emerge • Shown as four differently-shaded regions of (in some cases multiple) triangles • Key to document names

  44. Helpful? • What do you think? • Ways to improve??

  45. VIBE System • Smaller sets of documents than whole library • Example: Set of 100 documents retrieved from a web search • Idea is to understand how contents of documents relate to each other • Olsen et al., Info Process & Mgmt ’93

  46. Visualize Keywords and Doc’s • Show relation of each Doc to Keywords • “Similar” Doc’s cluster together

  47. Algorithm • Example: 2 Keywords, K1 at point P1 and K2 at point P2 • Document 1 vector • D1(K1, K2) = (0.4, 0.8) • $0.4 / (0.4 + 0.8) \approx 0.333$ • D1 is placed 1/3 of the way from K2 to K1
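
The two-keyword example on this slide amounts to a weighted average of the keyword positions, which extends naturally to more keywords. The sketch below assumes that generalization; the function name `vibe_position` and the keyword coordinates are hypothetical.

```python
def vibe_position(weights, keyword_positions):
    """Place a document at the weighted average of the keyword positions.

    weights: the document's score for each keyword, e.g. (0.4, 0.8).
    keyword_positions: one (x, y) point per keyword, in the same order.
    """
    total = sum(weights)
    if total == 0:
        raise ValueError("document has no weight on any keyword")
    x = sum(w * px for w, (px, _) in zip(weights, keyword_positions)) / total
    y = sum(w * py for w, (_, py) in zip(weights, keyword_positions)) / total
    return x, y

# Assumed coordinates: K1 at (0, 0), K2 at (3, 0).
d1 = vibe_position((0.4, 0.8), [(0.0, 0.0), (3.0, 0.0)])
```

With K1 at x = 0 and K2 at x = 3, D1 lands at about x = 2, which is one third of the way from K2 toward K1, matching the slide's arithmetic.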

  48. A VIBE Visualization

  49. VIBE Pros and Cons • Pros • Effectively communicates relationships • Straightforward methodology, and the visualization is easy to follow • Can show relatively large collections • Cons • Not showing much about a document • Could encode info in doc marks • Single items lose “detail” in the presentation • Starts to break down with a large number of terms

  50. SQWID: Search Query Weighted Info Display (VIBE-like) • Keywords “pull” Doc’s • (University, Visualization, Tools) • Doc’s can go outside convex hull of keywords (unlike some other approaches) • McCrickard and Kehoe, Visualizing Search Results using SQWID, Poster paper in Proceedings of the 6th World Wide Web Conference (WWW6), Santa Clara CA, April 1997