
Visualizing Large Text Collections

SIMS 296a-3: Current Topics in Information Access

Owen McGrath

mcgrath@socrates.berkeley.edu

www.itp.berkeley.edu/~owen

Review

Interface Principles (from 9/30)

- Select among available sources

- Understand search results

- Follow trails with unanticipated results

Visualization Techniques (from 9/30)

- icons and color highlighting

- brushing and linking

- panning and zooming

- focus-plus-context

- animation

Review
  • Interface Principles (from 9/30)

- Unsupervised Groupings

Clustering

Kohonen Feature Maps

- Supervised Categories

Yahoo!

Superbook

HiBrowse

Cat-a-Cone

- Combinations

DynaCat

SONIA

Review

Data Mining (from 11/4)

- KDD: The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. (Fayyad, Piatetsky-Shapiro, & Smyth, CACM 96)

- Fitting models to or determining patterns from very large datasets.

- A “regime” which enables people to interact effectively with massive data stores.

- Deriving new information from data.

finding patterns across large datasets

discovering heretofore unknown information

Review

Data Mining (from 11/4)

- Text Data Mining often confused with retrieval

- Classification vs. Clustering

Preface

What Information Visualization (IV) hopes to bring to text retrieval:

- Make a complex system more understandable (Mukherjea et al. 95)

- Visual Accessibility of otherwise hidden information (Ball & Eick 96)

- Broader content understanding through focus-related reorganization and progressive refinement of details (Rennison et al. 94)

- Enhanced visual browsing and analysis that reduce mental workload by avoiding language processing (Wise et al. 95)

Overview
  • Dimensions of IV
    • Reduced Representation w/High Information Density
    • 2D displayed on 2D
    • 3(+)D transformed/collapsed onto 2D
  • Visual Metaphors of IV
    • [None explicit] depicting sequential files (SeeSoft)
    • Star-gazing (SPIRE Galaxies)
    • Terrain gliding (SPIRE - ThemeScapes/WEBSOM)
    • Hierarchical overview graph (Mukherjea et al.)
    • Space Travel (Galaxy of News)
    • Particles in Motion (Chiang, Marks, & Shieber)

IV of Large Text Collections

SeeSoft: Reduced Representation/High Information Density (Ball & Eick 96)

- Used for tracking code revision/text change

- Displays indentation, line lengths, and paragraphs in miniature (iconographically?)

- Color coding shows the distribution of the salient properties

- Column lengths indicate relative file sizes

- Supplemental tree hierarchy view

SeeSoft: Color & Time

SeeSoft display showing code in colors tied to the age of each line (most recent in red, oldest in blue).

SeeSoft: Color & Time

As mouse moves along color scale, correspondingly aged revisions light up in their color.

SeeSoft: Color & Author

Rows corresponding to each programmer are shown in that programmer’s color.

SeeSoft: Color & Runtime Profile

Lines executed by test suite are in blue, non-executed in red, non-executable in gray.

SeeSoft

SeeSoft: Reduced Representation/High Information Density (Ball & Eick 96)

Generalizable Features?

- Reduced representation: display files as columns and lines of code as thin rows.

- Coloring by statistic (e.g. revision date, programmer/author)

- Direct manipulation

- Capability to read actual code/text (pan/zoom)
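
The reduced representation and coloring-by-statistic ideas above can be sketched in a few lines of Python. This is only an illustration, not SeeSoft's implementation: the function names `age_to_color` and `reduce_file`, the linear blue-to-red ramp, and the timestamp inputs are all assumptions.

```python
def age_to_color(timestamp, oldest, newest):
    """Map a line's timestamp onto a blue (oldest) to red (newest) RGB ramp."""
    if newest == oldest:
        return (255, 0, 0)
    t = (timestamp - oldest) / (newest - oldest)  # 0.0 = oldest, 1.0 = newest
    return (round(255 * t), 0, round(255 * (1 - t)))


def reduce_file(lines, timestamps):
    """Reduce each line of text to a thin row: (indent, length, color)."""
    oldest, newest = min(timestamps), max(timestamps)
    rows = []
    for text, ts in zip(lines, timestamps):
        stripped = text.lstrip()
        indent = len(text) - len(stripped)
        length = len(stripped.rstrip())
        rows.append((indent, length, age_to_color(ts, oldest, newest)))
    return rows


# Hypothetical two-line file; the second line was changed more recently.
rows = reduce_file(["int x;", "    return x;"], [100, 200])
```

Each tuple holds what a renderer would need to draw one miniature row: where it starts, how long it is, and which color encodes the chosen statistic.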

SeeSoft: King James Bible

Locations of selected words are indicated by color (e.g. angel = yellow).

IV of Large Text Collections

SPIRE: Reduction of High Dimension Representation Information Density (Wise et al. 95)

SPIRE (Spatial Paradigm for Information Retrieval & Exploration)

- Accepts large volumes of text in almost any format

- Determines the relationships within the text

- Presents them in a visual format that is ‘natural for the human mind’

- Allows users to rapidly discover known and hidden information relationships by reading only the pertinent documents rather than wading through large volumes of text.

IV of Large Text Collections

SPIRE Galaxies:

- Computes word similarities and patterns in documents

- Displays the documents on a computer screen to look like a universe of "docustars"

- Closely related documents will cluster together in a tight group

- Unrelated documents will be separated by large spaces
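
As a rough illustration of how document relatedness might be scored before documents are laid out as "docustars", the sketch below computes cosine similarity over raw word counts. The slides do not describe SPIRE's actual similarity measure, so the measure, the function name `cosine_similarity`, and the sample strings are all assumptions.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two documents using raw word counts."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: the first two share vocabulary, the third does not.
related = cosine_similarity("cancer research funding", "cancer research trials")
unrelated = cosine_similarity("cancer research funding", "stock market report")
```

A layout step would then place high-similarity pairs close together and low-similarity pairs far apart.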


SPIRE Galaxies

Galaxies visualization of documents and document clusters.

IV of Large Text Collections

SPIRE ThemeScapes:

- Themes within the document spaces appear on computer screen as a relief map of natural terrain

- Mountains indicate where themes are dominant; valleys indicate weak themes

- Their shapes (a broad butte or a high pinnacle) reflect how the thematic information is distributed and related across documents.

- Themes close in content will be close visually
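
One simple way to picture the terrain idea is to bin documents that already have 2D positions into a grid and sum their theme strength per cell, so that dense, strong regions become "mountains". This is a sketch under stated assumptions, not the ThemeScapes algorithm: the function name `theme_terrain`, the binning scheme, and the sample points are hypothetical.

```python
def theme_terrain(points, grid_size=4):
    """Accumulate theme strength into a grid_size x grid_size height map.

    points: iterable of (x, y, strength) with x and y in [0, 1].
    """
    heights = [[0.0] * grid_size for _ in range(grid_size)]
    for x, y, strength in points:
        col = min(int(x * grid_size), grid_size - 1)
        row = min(int(y * grid_size), grid_size - 1)
        heights[row][col] += strength
    return heights


# Two documents on a strong theme near one corner, one weak outlier.
terrain = theme_terrain([(0.1, 0.1, 1.0), (0.15, 0.2, 2.0), (0.9, 0.9, 0.5)])
```

A real terrain display would presumably smooth these heights; the grid here only shows where peaks and valleys would fall.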


SPIRE ThemeScapes

ThemeScape of an entire week of CNN news stories.

IV of Large Text Collections

SPIRE

Aspiration: A researcher could use SPIRE to find out what direction the United States was heading in breast cancer research. Drawing from a large, unstructured document base of information, the researcher uses SPIRE's visualization tools to automatically organize the documents into clusters according to their content similarities and into thematic terrains according to the themes in the text. Looking at these thematic spaces over time will enable the human mind to understand vast interrelated dynamic changes simply not possible to detect using traditional approaches.

IV of Large Text Collections

Hypermedia Networks through Multiple Hierarchical Views:

Content analysis

(Mukherjea et al. 95)

- Visualizing the information space of hypermedia systems using multiple hierarchical views

- Similar issues to those we saw with Overviews via Category Hierarchies

HIBROWSE (Pollitt 97)

Cat-A-Cone (Hearst 97)

IV of Large Text Collections

Hierarchization Algorithm:

For content analysis: for each attribute, the nodes of the graph are partitioned into branches based on the attribute values by Content-based Clustering. If too many or too few branches are formed, the attribute is not suitable for forming a pre-tree. Otherwise a new pre-tree is formed with these branches. The root of the pre-tree is a cluster representing all the nodes of the graph.
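
The content-analysis step can be sketched as follows. This is a minimal illustration, assuming nodes are dictionaries and that "content-based clustering" simply groups nodes by exact attribute value; the function name `partition_by_attribute` and the `min_branches`/`max_branches` thresholds are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def partition_by_attribute(nodes, attribute, min_branches=2, max_branches=5):
    """Group nodes into branches by attribute value.

    Returns {value: [nodes]} or None when the attribute produces too few
    or too many branches to be a suitable basis for a pre-tree.
    """
    branches = defaultdict(list)
    for node in nodes:
        branches[node[attribute]].append(node)
    if not (min_branches <= len(branches) <= max_branches):
        return None
    return dict(branches)


# Hypothetical automobile records.
cars = [
    {"name": "A", "country": "Japan"},
    {"name": "B", "country": "Sweden"},
    {"name": "C", "country": "Japan"},
]
pre_tree = partition_by_attribute(cars, "country")
```

The returned branches hang under a root cluster that represents all nodes; an attribute such as `name`, with one branch per node, would be rejected once it exceeds `max_branches`.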

For structural analysis: a pre-tree is formed for each node in the graph that can reach all other nodes. These nodes are designated as the roots of the pre-trees. The branches are the branches of the spanning tree formed by doing a breadth-first search from the designated root node.
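
The structural-analysis step maps directly onto a standard breadth-first search. The sketch below assumes the graph is an adjacency dictionary in which every node appears as a key; the function names `bfs_spanning_tree` and `candidate_roots` are illustrative only.

```python
from collections import deque


def bfs_spanning_tree(graph, root):
    """Return {node: [children]} for the BFS spanning tree rooted at root."""
    tree = {root: []}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in tree:  # first visit fixes the parent
                tree[neighbor] = []
                tree[node].append(neighbor)
                queue.append(neighbor)
    return tree


def candidate_roots(graph):
    """Nodes that can reach every other node, i.e. possible pre-tree roots."""
    return [n for n in graph if len(bfs_spanning_tree(graph, n)) == len(graph)]


# Hypothetical link graph of four pages.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
roots = candidate_roots(graph)
tree = bfs_spanning_tree(graph, roots[0])
```

Links that the spanning tree drops (here `c -> d`) are the information lost in translating the graph to a tree.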

IV of Large Text Collections

Role of User:

The user can guide the process both during the translation of the graph to a tree and during the visualization of the tree.

Translation phase: the users can control the various variables that are used in the translation process. For example, they can control the variable which specifies the maximum possible depth of the tree (the recursion stops when this depth is reached). The user can control the relative importance of the various submetrics in the overall metric that is used to rank a given pre-tree. For example, the user can specify that the "goodness" of a root is not a useful criterion for judging pre-trees. The user can also assign different weights to different link types to influence the submetric calculating the amount of information lost.
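
User-controlled ranking of pre-trees can be pictured as a weighted sum of submetrics. The sketch below is an assumption about the general shape of such a metric, not the paper's formula; the submetric names `root_goodness` and `info_preserved`, the scores, and the function name `rank_pre_trees` are hypothetical.

```python
def rank_pre_trees(pre_trees, weights):
    """Sort pre-trees from best to worst by a weighted sum of submetrics."""
    def score(pre_tree):
        return sum(
            weights.get(name, 0.0) * value
            for name, value in pre_tree["submetrics"].items()
        )
    return sorted(pre_trees, key=score, reverse=True)


# Hypothetical candidates with made-up submetric scores.
candidates = [
    {"name": "by_price", "submetrics": {"root_goodness": 0.9, "info_preserved": 0.2}},
    {"name": "by_country", "submetrics": {"root_goodness": 0.1, "info_preserved": 0.8}},
]
default_order = rank_pre_trees(candidates, {"root_goodness": 1.0, "info_preserved": 1.0})
# A user who considers root "goodness" irrelevant sets its weight to zero.
user_order = rank_pre_trees(candidates, {"root_goodness": 0.0, "info_preserved": 1.0})
```

Changing a single weight is enough to change which pre-tree is selected by default.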

Multiple Hierarchical Views

An overview diagram of an automobile database. The diagram is very difficult to comprehend.

Multiple Hierarchical Views

Top-level partitioning is by the attribute Price. The right-hand screen shows the tree formed if the top-level partitioning is done by the attribute Country.

Multiple Hierarchical Views

At each level various pre-trees can be used. A metric ranks these pre-trees. By default the pre-tree with the best metric is selected. However, the user can select others.

Multiple Hierarchical Views

Content and Structural analysis for forming pre-trees. The left hand screen represents the nodes for Japan; root is a cluster representing all Japanese cars. The nodes are partitioned by the attribute Manufacturer. The right hand screen is for Swedish cars.

Multiple Hierarchical Views

Left hand screen shows the top level of the default hierarchy formed with research.html as the root and the major research areas shown. The right hand screen shows a book view of a portion of this hierarchy.

Multiple Hierarchical Views

Treemap view of a hierarchy with partitioning by the attribute Topic. Colors represent different types of authors. [Note the similarity to Concept Landscapes.]

IV of Large Text Collections

Galaxy of News (Rennison et al.)

Associative relational network visualized using:

- visual clustering based on content

- pyramidal structuring

- semantic zooming

- fluid interaction in 3D info space

IV of Large Text Collections

Within a SemioMap, concepts are represented as nodes. Nodes are linked to each other in clusters.

IV of Large Text Collections

Users can fly through the Web using Apple's HotSauce fly-through, a navigation technology.

IV of Large Text Collections

Apple’s HotSauce Project

Related topics are spatially close to each other. Users can steer toward areas of interest by moving the mouse, and they control speed and direction by holding down modifier keys. As users zoom past one level of content, other levels come into view. To get more information about a subject, users can double-click the link, which will either reveal more levels or open a Web page.

One problem with the navigation paradigm is that it's easy to become disoriented in the free-form representation. Apple has added shortcuts to the viewer that allow users to quickly navigate back to the top of the hierarchy and to center the current topic on their screen, but the company is soliciting more ways to prevent users from getting lost in X Space.

- Scott Rubin (MacWeek)

IV of Large Text Collections

WEBSOM: Reduction of High Dimension Representation Information Density

- WEBSOM is a means for organizing miscellaneous text documents into meaningful maps for exploration and search

- WEBSOM is based on the SOM (Self-Organizing Map), which automatically organizes documents into a two-dimensional grid so that related documents appear close to each other
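
To make the SOM idea concrete, here is a very small self-organizing map in plain Python. It is a sketch under stated assumptions, not WEBSOM itself: the document vectors are made up, WEBSOM's own document encoding is not shown in these slides, and the names `train_som` and `best_matching_unit` plus all parameter values are illustrative.

```python
import math
import random


def best_matching_unit(grid, vector):
    """Return the grid cell whose weight vector is closest to the input."""
    return min(grid, key=lambda cell: math.dist(grid[cell], vector))


def train_som(vectors, rows, cols, epochs=50, lr=0.5, radius=1.0, seed=0):
    """Train a rows x cols map; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs  # learning rate shrinks over time
        for v in vectors:
            bmu = best_matching_unit(grid, v)
            for cell, w in grid.items():
                d = math.dist(cell, bmu)  # distance on the 2D grid
                if d <= radius:
                    influence = lr * decay * math.exp(-d * d)
                    for i in range(dim):
                        w[i] += influence * (v[i] - w[i])
    return grid


# Hypothetical 2D "document vectors" forming two loose groups.
docs = [[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 0.9]]
som = train_som(docs, rows=2, cols=3)
cells = [best_matching_unit(som, d) for d in docs]
```

Because each update also pulls the grid neighbors of the winning cell toward the input, documents with similar vectors tend to end up on the same or adjacent cells, which is what makes the map browsable.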

WEBSOM

Explanation of the symbols on the map

aiphil - AI and philosophy

assoc - process of association

assum - basic assumptions

beha - behaviourism, AI

conf - conferences, positions

cons1 - consciousness

cons2 - consciousness, society of mind

dna - DNA (pattern, order, etc.)

form - formality and AI

read - reading before kindergarten

will - free will

WEBSOM

Explanation of the symbols on the map

assum - basic assumptions

will - free will

WEBSOM

Re: Seeking help on Similarity Measurement David Pautler, Fri, 03 Nov 1995, Lines: 6.

Re: Base-rate neglect James Magnuson, Wed, 22 Nov 1995, Lines: 43.

Re: 20th cen. I.Q. rise "Rolf Marvin Bøe Lindgren", 07 Jan 1996, Lines: 15.

Re: BRAIN AND CONSCIOUSNESS ken collins, Wed, 03 Jul 1996, Lines: 24.

Re: BRAIN AND CONSCIOUSNESS Bobbie Johnson, Sat, 06 Jul 1996, Lines: 29.

Re: Jesus = Lucifer = THE Devil ilias kastanas 08-14-90, 15 Jul 1996, Lines: 96.

Re: Two Dogmas of Empiricism: Implications for Cognitive Science David Longley, Wed, 31 Jul 96, Lines: 19.

Re: Evolutionics Rolf Marvin Bøe Lindgren, 05 Aug 1996, Lines: 11.

Re: Just a bunch of chemicals with the illusion of life Samson, Sun, 15 Dec 1996, Lines: 22.

Re: is AA Medical Informatics "Leslie L. Cortes, M.D.", 2 Mar 97, Lines: 13.

Advice on Choosing a Research Advisor in Laboratory Sciences Marshall Dermer, 3 Mar 1997,

Re: What is consciousness? The biology of our experienced inner life. "Mark P. Line", Sat, 22 Mar 1997, Lines: 25.

Re: What is consciousness? The biology of our experienced inner life. Anthony F. Badalamenti, 23 Mar 1997, Lines: 43.

WEBSOM

NNTP-Posting-Host: sea-wa11-07.ix.netcom.com

Content-Transfer-Encoding: 7bit

X-NETCOM-Date: Sat Mar 22 9:09:36 PM CST 1997

Xref: nntp.hut.fi alt.consciousness:39348 comp.ai:42887 comp.ai.philosophy:52220 sci.cognitive:14872 sci.philosophy.meta:41934 sci.psychology.misc:12817

Mark:

>>How do you know it's destroyed? Maybe the brain is only a vehicle.

>>Then

>>it wouldn't be surprising for the passenger to leave if the vehicle is

>>too badly damaged. If we can't observe consciousness _in_ the brain,

>>then we can't expect to see it leave the brain, nor to observe it ex

>>vivo.

Anthony:

> There is no evidence that consciousness survives certain forms of brain

> damage or death. Perhaps it does survive, I for sure do not know. What

> I do know is that there is no verfiable evidence.

If you don't know (one way or the other), then I guess you don't have

any good evidence that consciousness is destroyed by certain forms of

brain damage or death. That's what I was wondering about when you spoke

of "the FACT that certain interventions in the brain can destroy

consciousness" (emphasis added).

-- Mark

WEBSOM

WEBSOM: Reduction of High Dimension Representation Information Density ...

(Wise et al. 95)

The documents related to each other are often in the same or nearby map nodes. If you find an interesting discussion, it is worthwhile to check the neighbors as well. The most specific discussions are mostly found in the clearest "clusters", i.e. light regions surrounded by darker color. Near the edges of the map you typically find the most "different" documents represented on the map. In the middle areas, the discussions are more "typical", or they may concern many different topics found on the map.

WEBSOM

Explanation of the symbols on the map

assum - basic assumptions

will - free will

WEBSOM

Explanation of the symbols on the map

will - free will

WEBSOM

Flat intonation in schizophrenic speech Minorkeys, 6 Jun 1995, Lines: 12.

Re: FIRST order? was: why Ginsberg grouses David Longley, Tue, 18 Jul 95, Lines: 61.

Re: Adaptive Chaos Oliver Sparrow, Tue, 30 Jan 96, Lines: 19.

listening related topics: bibliography LatComm, 6 Apr 1996, Lines: 8.

Re: A logic of discovery Burt Voorhees, 23 May 1996, Lines: 36.

medical question djensen@olympus.net, 25 May 1996, Lines: 13.

Re: SYMPOSIUM: Can Computers Compose Creatively? Dragomir R. Radev, 27 May 1996, Lines: 20.

amusia Jules, Sat, 06 Jul 1996, Lines: 3.

Re: use of the word PSYCHOLOGIST John Sproule, Fri, 02 Aug 1996, Lines: 74.

Re: use of the word PSYCHOLOGIST John G., Fri, 02 Aug 1996, Lines: 79.

WWW: Perspectives, Sept-Oct 1996 Issue Now Available Perspectives, 23 Sep 1996, Lines: 37.

Sensation seeking/risk taking activities Adam Satterthwaite, Thu, 7 Nov 1996, Lines: 22.

Re: Just a bunch of chemicals with the illusion of life G K GRAY, Sat, 14 Dec 1996, Lines: 32.

Re: Re: Just a bunch of chemicals with the illusion of life Ton Maas, Sun, 05 Jan 1997, Lines: 19.

Re: Re: Just a bunch of chemicals with the illusion of life Ton Maas, Sun, 05 Jan 1997, Lines: 17.

the effect of self-awareness on cognition Arapurakal, 30 Jan 1997, Lines: 36.

Re: Heisenberg: was he blind? Alexander Weber, Sat, 15 Feb 1997, Lines: 61.

WEBSOM

nntp.hut.fi!news.funet.fi!news.eunet.fi!EU.net!howland.erols.net!newsxfer3.itd.umich.edu!portc01.blue.aol.com!

Newsgroups: sci.cognitive

Subject: the effect of self-awareness on cognition

Date: 30 Jan 1997 04:21:33 GMT

The difficulty of focusing attention/awareness inwardly vis a vis

outwardly comes from a outmoded and defective conceptual framework that

recognizes the internal/external schism. This model establishes a

perimeter between the internal and external and locates attention in

between. This puts internal content BEHIND attention, and external

content in front. It is more difficult to look behind than it is to look

in front. Hence the problem

When we correct the conceptual framework to acknowledge that all content of

awareness/attention must, by definition, lie in front of it, there is

nothing left behind attention. Such a framework removes the fundamental

difficulty involved in observing non-sensory phenomena such as cognitive,

imaginative, emotive objects and events. However, for those of us who are

used to the old flawed conceptual framework, the momentum of habit might

yet leave some residue of resistance to the focusing of attention on such

non-sensory content. We just have to work at it.

Reducing sensory input for periods of time makes it easier to notice

non-sensory content. Such reduction of sensory input used to be called

meditation.

Ravi

WEBSOM

WEBSOM: Reduction of High Dimension Representation Information Density ...

(Wise et al. 95)

Criticism of Browsing tested with Kohonen SOM (from 9/30)

- Subjects who started with Yahoo were less successful in repeating the task with the SOM than vice versa

- Useful more for broad exploring than for searching

Conclusions
  • Dimensions of IV
    • Reduced Representation w/High Information Density
    • 2D displayed on 2D
    • 3(+)D transformed/collapsed onto 2D
  • Visual Metaphors of IV
    • [None explicit] depicting sequential files (SeeSoft)
    • Star-gazing (SPIRE Galaxies)
    • Terrain gliding (SPIRE - ThemeScapes/WEBSOM)
    • Hierarchical overview graph (Mukherjea et al.)
    • Space Travel (Galaxy of News)
    • Particles in Motion (Chiang, Marks, & Shieber)

Conclusions Cont.

What Information Visualization (IV) seems to bring to text retrieval:

- Make a complex system more understandable (Mukherjea et al. 95)

- Visual Accessibility of otherwise hidden information (Ball & Eick 96)

- Broader content understanding through focus-related reorganization and progressive refinement of details (Rennison et al. 94)

- Enhanced visual browsing and analysis that reduce mental workload by avoiding language processing (Wise et al. 95)