Using Memex to archive and mine community Web browsing experience

Soumen Chakrabarti, Sandeep Srivastava, Mallela Subramanyam, Mitul Tiwari

Indian Institute of Technology Bombay

Information sources on the Web
  • Web page contents
    • Early keyword search engines
  • Hyperlink structure
    • Later engines: Google, Raging Search
  • Searching behavior
    • Search sites monitor clicks on search results
  • Browsing behavior
    • Easily captured in stand-alone hypermedia
    • Need software infrastructure for the Web
Personal Memex
  • Archiving is feasible
    • ~25 GB in a lifetime
  • Why archive?
    • Recall past events
    • Create a ‘profile’
    • Correlate with sites, directories, searches
  • Challenges
    • Flexible architecture
    • Analysis techniques

“Your husband died, but here is his Memex” (from Jim Gray’s Turing Award Lecture)

Searching the personal Memex
  • Keyword search (never lose a page)
  • Advanced queries
    • Recreate my recent surfing history w.r.t. the topic ‘bicycling’
    • Extract from the MIT Web site all pages that match my ‘compiler research’ profile
  • Topic taxonomy plays a central role
    • Characterized by bookmark folders
    • More familiar than ‘universal’ directories
Archiving architecture choices
  • Bookmarks only or all click history
  • Installed application or plug-in
    • Closer integration, e.g. with COM
  • CGI and Javascript
    • Slow, hard to monitor all clicks
  • Applet-servlet
    • Portable, better UI compared to HTML
  • Proxy or wiretap
    • Proxy involves configuring browser
Memex block diagram

[Figure: block diagram. Labels recovered from the slide:]
  • Browser: client JAR, running client applet
  • Client actions: Visit, Search, Attach, Folder, Download, Context
  • Memex server: event-handler servlets, mining demons
  • Mining tasks: taxonomy synthesis, resource discovery, recommendation, classification, clustering
  • Archive: relational metadata, text index, topic models
  • Memex client-server protocol and workload sharing negotiations

Document workflow

[Figure: document workflow diagram. Labels recovered from the slide:]
  • Browser and Memex client: page visit and bookmarking events logged (NODE table)
  • Crawler
  • Per-document version queue: push new version; pop and discard old version
  • Demon registry: search indexer, classifier service, clustering service, garbage collector

Folder tab
  • Valuable user input and feedback on topics and example documents

  • User cuts and pastes to correct or reinforce the Memex classifier
  • ‘?’ indicates automatic placement by the Memex classifier
  • File manager-like interface
  • Privacy choice

Context tab

  • Replay of recent browsing context restricted to chosen topic
  • Choice of topic context
  • Better mobility than the one-dimensional history provided by popular browsers
  • Active browser monitoring and dynamic layout of new incremental context graph

Search tab
  • Search using keyword and visit statistics
  • “Find the paper about collaborative filtering I was reading a month back”
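
A query like the one above combines a keyword match with visit statistics. The following is a minimal sketch of that idea, not the Memex implementation: the `Visit` record, the `search` function, the example URLs, and the ten-day window are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Visit:
    """One logged page visit (hypothetical record layout)."""
    url: str
    text: str
    visited_at: datetime


def search(visits, keywords, around, window=timedelta(days=10)):
    """Return URLs whose text contains every keyword, ranked by the number
    of visits that fall within `window` of the time `around`."""
    wanted = [k.lower() for k in keywords]
    hits = {}
    for v in visits:
        if abs(v.visited_at - around) > window:
            continue  # visit is outside the time range the user remembers
        page_text = v.text.lower()
        if all(k in page_text for k in wanted):
            hits[v.url] = hits.get(v.url, 0) + 1
    # Most-visited first; ties broken alphabetically for a stable order.
    return sorted(hits, key=lambda u: (-hits[u], u))


now = datetime(2000, 5, 15)
visits = [
    Visit("http://example.org/cf-paper", "collaborative filtering paper", now - timedelta(days=30)),
    Visit("http://example.org/cf-paper", "collaborative filtering paper", now - timedelta(days=28)),
    Visit("http://example.org/cf-news", "collaborative filtering news", now - timedelta(days=2)),
]
results = search(visits, ["collaborative", "filtering"], around=now - timedelta(days=30))
print(results)
```

Here "a month back" is modelled as a window centred thirty days before the query time, so the recent page that matches the same keywords is filtered out.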
Mining issues
  • Two relations
    • occurs_in(term, document)
    • bookmarked_into(document, folder)
    • (Ignore hyperlinks for now)
  • Document classification and clustering
    • Exploit ‘bookmarked_into’
  • Taxonomy synthesis
    • Reconcile folders from a community of users into coherent themes
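
The two relations can be pictured as plain lists of tuples. The sketch below uses made-up toy data and a hypothetical helper, `folder_term_counts`, to join them on the document and obtain the term counts of each folder, which is the kind of input a classifier or clustering step would work from.

```python
from collections import Counter, defaultdict

# Toy data (assumed for illustration): each relation is a list of pairs.
occurs_in = [
    ("radio", "kpfa.org"), ("news", "kpfa.org"),
    ("news", "bbc.co.uk"), ("tv", "bbc.co.uk"),
    ("film", "miramax.com"), ("studio", "miramax.com"),
]
bookmarked_into = [
    ("kpfa.org", "Media"), ("bbc.co.uk", "Media"),
    ("miramax.com", "Studios"),
]


def folder_term_counts(occurs_in, bookmarked_into):
    """Join the two relations on document to get term counts per folder."""
    terms_of = defaultdict(list)
    for term, doc in occurs_in:
        terms_of[doc].append(term)
    counts = defaultdict(Counter)
    for doc, folder in bookmarked_into:
        counts[folder].update(terms_of[doc])
    return counts


counts = folder_term_counts(occurs_in, bookmarked_into)
for folder, terms in sorted(counts.items()):
    print(folder, dict(terms))
```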
Taxonomy synthesis: motivation
  • Autonomy vs collaboration
    • Personalizationpicking folders from Yahoo
    • Complex relations between users’ interests
  • Need the “simplest common ground”

[Figure: folder trees of User1, User2, User3, and Yahoo over topics such as Sports, Cycling, Hiking, Biz, Shops, and Bikeshops, illustrating subsumption and tree ‘inversion’]

Taxonomy synthesis: intuition

[Figure: graph linking folders (Media, Broadcasting, Entertainment, Studios) to documents (kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com, miramax.com). Callouts: share documents, share folder, share terms.]

Taxonomy synthesis: intuition

[Figure: the same folders (Media, Broadcasting, Entertainment, Studios) and documents (kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com, miramax.com), with theme nodes added: Radio, TV, Movies.]

Trade-off
  • Using theme nodes can simplify the graph
    • Shannon encoding of folder or theme ID
  • But it increases distortion of the term distribution
    • Kullback-Leibler (KL) distance of distorted folder w.r.t. ‘true’ folder
  • Compare cost in bits
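
One way to put both sides of the trade-off in bits is sketched below. It rests on assumptions the slide does not spell out: a uniform code over themes stands in for the Shannon code, term distributions are smoothed with a made-up constant `alpha`, and distortion is charged as the number of term occurrences times the KL distance (the expected extra bits per term when coding with the wrong distribution). All function names are hypothetical.

```python
import math


def term_distribution(counts, vocabulary, alpha=0.5):
    """Smoothed term distribution, so that no probability is zero."""
    total = sum(counts.get(t, 0) for t in vocabulary) + alpha * len(vocabulary)
    return {t: (counts.get(t, 0) + alpha) / total for t in vocabulary}


def kl_bits(p, q):
    """KL distance D(p || q) in bits; p and q share one vocabulary."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p)


def total_cost_bits(folder_counts, theme_counts, n_themes):
    """Bits to name the theme plus bits lost to the distorted distribution."""
    vocabulary = sorted(set(folder_counts) | set(theme_counts))
    p = term_distribution(folder_counts, vocabulary)
    q = term_distribution(theme_counts, vocabulary)
    mapping_bits = math.log2(n_themes) if n_themes > 1 else 0.0
    distortion_bits = sum(folder_counts.values()) * kl_bits(p, q)
    return mapping_bits + distortion_bits


broadcasting = {"radio": 3, "news": 3, "tv": 3}
media_theme = {"radio": 3, "news": 3, "tv": 3, "film": 3, "studio": 3}

# Four fine-grained themes: more bits per theme ID, no distortion.
exact = total_cost_bits(broadcasting, broadcasting, n_themes=4)
# Two coarse themes: fewer ID bits, but the folder's terms are distorted.
coarse = total_cost_bits(broadcasting, media_theme, n_themes=2)
print(exact, coarse)
```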
Algorithm BestSingle

[Figure: folders (Media, Entertainment, Broadcasting, Studios) mapped onto nodes of an HAC tree built over the documents]
  • Pool all documents
  • Build a bottom-up hierarchical agglomerative clustering (HAC) using text only
  • Map each original folder to the one HAC node at the smallest KL distance
  • Low mapping cost, high distortion
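
The steps above can be sketched in a few lines. This is an illustrative toy, not the authors' code: it assumes cosine similarity of pooled term counts as the merge criterion, additive smoothing with a made-up `alpha`, and hypothetical function names and example data.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def hac(doc_terms):
    """Bottom-up clustering on text only. Returns every node of the tree
    as a pair (frozenset of documents, pooled term counts)."""
    active = [(frozenset([d]), Counter(t)) for d, t in doc_terms.items()]
    nodes = list(active)
    while len(active) > 1:
        # Merge the two most similar clusters.
        i, j = max(combinations(range(len(active)), 2),
                   key=lambda ij: cosine(active[ij[0]][1], active[ij[1]][1]))
        merged = (active[i][0] | active[j][0], active[i][1] + active[j][1])
        active = [n for k, n in enumerate(active) if k not in (i, j)] + [merged]
        nodes.append(merged)
    return nodes


def smoothed(counts, vocab, alpha=0.5):
    total = sum(counts.values()) + alpha * len(vocab)
    return {t: (counts.get(t, 0) + alpha) / total for t in vocab}


def kl_bits(p, q):
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p)


def best_single(doc_terms, folders):
    """Map each folder to the one HAC node at the smallest KL distance."""
    vocab = sorted({t for terms in doc_terms.values() for t in terms})
    nodes = hac(doc_terms)  # all documents pooled, folders ignored
    mapping = {}
    for name, docs in folders.items():
        pooled = Counter()
        for d in docs:
            pooled.update(doc_terms[d])
        p = smoothed(pooled, vocab)
        best = min(nodes, key=lambda n: kl_bits(p, smoothed(n[1], vocab)))
        mapping[name] = best[0]
    return mapping


doc_terms = {
    "kpfa.org": ["radio", "news", "radio"],
    "bbc.co.uk": ["radio", "news", "tv"],
    "kron.com": ["tv", "news", "tv"],
    "foxmovies.com": ["film", "studio", "film"],
    "miramax.com": ["film", "studio", "studio"],
}
folders = {
    "Broadcasting": ["kpfa.org", "bbc.co.uk", "kron.com"],
    "Studios": ["foxmovies.com", "miramax.com"],
}
mapping = best_single(doc_terms, folders)
for name, node in mapping.items():
    print(name, "->", sorted(node))
```

Each folder gets exactly one node, which keeps the mapping cheap to describe; the distortion comes from folders whose documents do not line up with any single subtree.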
PatchHAC and Bicriteria
  • PatchHAC:
    • Start with BestSingle
    • Greedily introduce additional mappings from folders to HAC nodes
  • Bicriteria:
    • Start with each document a theme
    • Collapse greedily while total code length decreases
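
The Bicriteria idea of collapsing greedily while the total code length decreases can be sketched as follows. The cost function is an assumption made for illustration, since the slide does not define it: a fixed charge of `model_bits` per theme, a uniform-code theme ID per document, and each document's terms coded with its theme's smoothed term distribution. The names `code_length` and `bicriteria` and the toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def code_length(themes, doc_terms, vocab, model_bits=4.0, alpha=0.5):
    """Assumed two-part cost in bits for a partition of documents into themes."""
    total = model_bits * len(themes)
    id_bits = math.log2(len(themes)) if len(themes) > 1 else 0.0
    for docs in themes:
        pooled = Counter()
        for d in docs:
            pooled.update(doc_terms[d])
        denom = sum(pooled.values()) + alpha * len(vocab)
        for n in pooled.values():
            total -= n * math.log2((n + alpha) / denom)  # bits for the terms
        total += id_bits * len(docs)  # bits to name each document's theme
    return total


def bicriteria(doc_terms, **kw):
    """Start with each document as a theme; merge greedily while cost drops."""
    vocab = {t for terms in doc_terms.values() for t in terms}
    themes = [frozenset([d]) for d in doc_terms]
    best = code_length(themes, doc_terms, vocab, **kw)
    while len(themes) > 1:
        candidates = []
        for a, b in combinations(themes, 2):
            merged = [t for t in themes if t not in (a, b)] + [a | b]
            candidates.append((code_length(merged, doc_terms, vocab, **kw), merged))
        cost, merged = min(candidates, key=lambda c: c[0])
        if cost >= best:
            break  # no merge shortens the total code length
        best, themes = cost, merged
    return themes, best


doc_terms = {
    "kpfa.org": ["radio", "news", "radio"],
    "bbc.co.uk": ["radio", "news", "tv"],
    "kron.com": ["tv", "news", "tv"],
    "foxmovies.com": ["film", "studio", "film"],
    "miramax.com": ["film", "studio", "studio"],
}
themes, bits = bicriteria(doc_terms)
for theme in themes:
    print(sorted(theme))
```

The per-theme charge controls how far the collapse goes: a larger `model_bits` favours fewer, coarser themes, a smaller one keeps more themes.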
Conclusion
  • Recording history is feasible and useful
    • Few kilobytes per day per user
  • Bookmark taxonomies are a valuable source of information; can be…
    • Integrated into dynamic community-specific taxonomies
    • Used to drive discovery and collaboration
  • Memex can guide peer proxy caches
    • Cooperative caching between departments
Software
  • Demo: www.cs.berkeley.edu/~soumen
  • Client: Signed Swing/JFC applet
    • Netscape 4.5+ (IE and HotJava planned)
  • Server: DB2 + Berkeley DB + Servlets
  • Infrastructure for plugging in research prototypes using the Demon API
    • Clustering, classification, visualization
    • Collaborative filtering and recommendation
Related work
  • Archiving, searching, categorization
    • Vistabar (Alta Vista)
    • Bookmark organizer (IBM Haifa)
    • PowerBookmarks (NEC)
    • Purple Yogi
    • Netscape roaming access, Backflip
  • Mining
    • Attribute similarity via external probes
    • Non-linear dynamical systems