Using Memex to archive and mine community Web browsing experience



Soumen Chakrabarti, Sandeep Srivastava, Mallela Subramanyam, Mitul Tiwari

Indian Institute of Technology Bombay


Information sources on the Web

  • Web page contents

    • Early keyword search engines

  • Hyperlink structure

    • Later engines: Google, Raging Search

  • Searching behavior

    • Search sites monitor clicks on search results

  • Browsing behavior

    • Easily captured in stand-alone hypermedia

    • Need software infrastructure for the Web



Personal Memex

  • Archiving is feasible

    • ~25 GB in a lifetime

  • Why archive?

    • Recall past events

    • Create a ‘profile’

    • Correlate with sites, directories, searches

  • Challenges

    • Flexible architecture

    • Analysis techniques

“Your husband died, but here is his Memex” (from Jim Gray’s Turing Award Lecture)



Searching the personal Memex

  • Keyword search (never lose a page)

  • Advanced queries

    • Recreate my recent surfing history w.r.t. the topic ‘bicycling’

    • Extract from the MIT Web site all pages that match my ‘compiler research’ profile

  • Topic taxonomy plays a central role

    • Characterized by bookmark folders

    • More familiar than ‘universal’ directories



Archiving architecture choices

  • Bookmarks only or all click history

  • Installed application or plug-in

    • Closer integration, e.g. with COM

  • CGI and Javascript

    • Slow, hard to monitor all clicks

  • Applet-servlet

    • Portable, better UI compared to HTML

  • Proxy or wiretap

    • Proxy involves configuring browser


Memex block diagram

[Diagram labels: Browser, running client applet, client JAR; Memex server, event-handler servlets, mining demons; Visit, Search, Attach, Folder, Download, Context; Taxonomy synthesis, Resource discovery, Recommendation, Classification, Clustering, Archive; Relational metadata, Text index, Topic models; Memex client-server protocol and workload sharing negotiations.]


Document workflow

[Diagram labels: Browser; Memex client; page visit and bookmarking events logged; NODE table; Crawler; push new version; per-document version queue; pop and discard old version; Demon Registry; search indexer; classifier service; clustering service; garbage collector.]
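The workflow labels above suggest a per-document queue of versions that registered demons consume and a garbage collector trims. The following is a minimal sketch of that idea, not the Memex implementation: the class `VersionQueues`, its method names, and the rule that a version may be discarded once every registered demon has processed it and a newer version exists are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class VersionQueues:
    """Hypothetical per-document version queues with a demon registry."""

    def __init__(self):
        self.queues = defaultdict(deque)   # url -> deque of (version, text)
        self.demons = {}                   # demon name -> callback(url, version, text)
        self.done = defaultdict(set)       # (url, version) -> demons that finished it

    def register(self, name, callback):
        """Add a demon (e.g. indexer, classifier) to the registry."""
        self.demons[name] = callback

    def push(self, url, text):
        """Push a new version of a document, e.g. after a visit or a crawl."""
        queue = self.queues[url]
        version = queue[-1][0] + 1 if queue else 1
        queue.append((version, text))
        return version

    def run_demons(self):
        """Let each registered demon process every version it has not seen yet."""
        for url, queue in self.queues.items():
            for version, text in queue:
                for name, callback in self.demons.items():
                    if name not in self.done[(url, version)]:
                        callback(url, version, text)
                        self.done[(url, version)].add(name)

    def collect_garbage(self):
        """Pop superseded versions that all registered demons have processed."""
        discarded = []
        for url, queue in self.queues.items():
            while len(queue) > 1 and self.done[(url, queue[0][0])] >= set(self.demons):
                version, _ = queue.popleft()
                del self.done[(url, version)]
                discarded.append((url, version))
        return discarded
```

Under these assumptions the newest version of each document always stays in its queue, so a slow demon never loses the current text.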


Folder tab

  • Valuable user input and feedback on topics and example documents

  • User cuts and pastes to correct or reinforce the Memex classifier

  • ‘?’ indicates automatic placement by Memex classifier

  • File manager-like interface

  • Privacy choice


Context tab

  • Replay of recent browsing context restricted to chosen topic

  • Choice of topic context

  • Better mobility than one-dimensional history provided by popular browsers

  • Active browser monitoring and dynamic layout of new incremental context graph


Search tab

  • Search using keyword and visit statistics

  • “Find the paper about collaborative filtering I was reading a month back”
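A query such as the one quoted above combines keywords with visit statistics. The sketch below shows one way this could work, assuming a hypothetical `Visit` record and a `search` function that filters by keyword and by age of visit, then ranks by visit count. None of these names come from Memex, and the real system may rank differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Visit:
    url: str
    text: str          # archived page text
    when: datetime     # time of the visit


def search(visits, keywords, now, min_age_days=0, max_age_days=None):
    """Return URLs whose text contains all keywords, ranked by the number
    of visits that fall inside the requested age window."""
    words = [w.lower() for w in keywords]
    counts = {}
    for v in visits:
        age = now - v.when
        if age < timedelta(days=min_age_days):
            continue
        if max_age_days is not None and age > timedelta(days=max_age_days):
            continue
        text = v.text.lower()
        if all(w in text for w in words):
            counts[v.url] = counts.get(v.url, 0) + 1
    # Most-visited first; ties broken alphabetically for a stable order.
    return sorted(counts, key=lambda u: (-counts[u], u))
```

"A month back" could then be approximated by a window such as `min_age_days=21, max_age_days=45`; the exact bounds are a design choice.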



Mining issues

  • Two relations

    • occurs_in(term, document)

    • bookmarked_into(document, folder)

    • (Ignore hyperlinks for now)

  • Document classification and clustering

    • Exploit ‘bookmarked_into’

  • Taxonomy synthesis

    • Reconcile folders from a community of users into coherent themes
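The two relations above can be joined on the document to obtain per-folder term statistics, which is one way to exploit ‘bookmarked_into’ for classification. The sketch below uses made-up toy data and a Laplace-smoothed term likelihood that ignores folder priors. The helper names and the choice of classifier are assumptions for illustration, not the Memex classifier.

```python
import math
from collections import Counter, defaultdict

# Toy instances of the two relations (all values are made up).
occurs_in = [
    ("bike", "d1"), ("trail", "d1"),
    ("bike", "d2"), ("shop", "d2"),
    ("compiler", "d3"), ("parser", "d3"),
    ("bike", "d4"), ("trail", "d4"),      # d4 is not bookmarked yet
]
bookmarked_into = [("d1", "Cycling"), ("d2", "Cycling"), ("d3", "Compilers")]


def folder_term_counts(occurs_in, bookmarked_into):
    """Join the two relations on document to get term counts per folder."""
    folders_of = defaultdict(list)
    for doc, folder in bookmarked_into:
        folders_of[doc].append(folder)
    counts = defaultdict(Counter)
    for term, doc in occurs_in:
        for folder in folders_of.get(doc, []):
            counts[folder][term] += 1
    return counts


def classify(doc, occurs_in, counts):
    """Pick the folder with the highest Laplace-smoothed term log-likelihood."""
    terms = [t for t, d in occurs_in if d == doc]
    vocab = {t for t, _ in occurs_in}

    def score(folder):
        total = sum(counts[folder].values())
        return sum(math.log((counts[folder][t] + 1) / (total + len(vocab)))
                   for t in terms)

    return max(sorted(counts), key=score)


counts = folder_term_counts(occurs_in, bookmarked_into)
print(classify("d4", occurs_in, counts))
```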



Taxonomy synthesis: motivation

  • Autonomy vs collaboration

    • Personalizationpicking folders from Yahoo

    • Complex relations between users’ interests

  • Need the “simplest common ground”

[Diagram: folder trees for User1, User2, User3, and Yahoo, with folders including Sports, Cycling, Hiking, Biz, Shops, and Bikeshops; annotated with “Subsumption” and “Tree ‘inversion’”.]


Taxonomy synthesis: intuition

[Diagram: folders (Media, Broadcasting, Entertainment, Studios) linked to documents (kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com, miramax.com). Annotations: share documents, share folder, share terms.]


Taxonomy synthesis: intuition

[Diagram: folders (Media, Broadcasting, Entertainment, Studios) linked to documents (kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com, miramax.com) through intermediate themes (Radio, TV, Movies).]



Trade-off

  • Using theme nodes can simplify graph

    • Shannon encoding of folder or theme ID

  • Increases distortion of term distribution

    • Kullback-Leibler (KL) distance of distorted folder w.r.t. ‘true’ folder

  • Compare cost in bits
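Both sides of the trade-off can be expressed in bits, which is what makes them comparable. The sketch below computes a Shannon code length for naming a theme and a KL distance between smoothed term distributions. The function names, the Laplace smoothing, and the idea of weighting the KL distance by the folder's term count are assumptions for illustration, not taken from the slides.

```python
import math
from collections import Counter


def distribution(term_counts, vocab, alpha=1.0):
    """Laplace-smoothed term distribution over a shared vocabulary."""
    total = sum(term_counts.get(t, 0) for t in vocab) + alpha * len(vocab)
    return {t: (term_counts.get(t, 0) + alpha) / total for t in vocab}


def kl_bits(p, q):
    """KL distance D(p || q) in bits; q must be nonzero wherever p is."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p if p[t] > 0)


def shannon_bits(probability):
    """Ideal code length, in bits, of an event with the given probability."""
    return -math.log2(probability)


# Made-up term counts for one folder and the theme it would be mapped to.
folder = Counter({"radio": 6, "news": 4})
theme = Counter({"radio": 8, "news": 6, "tv": 6})
vocab = sorted(set(folder) | set(theme))
p, q = distribution(folder, vocab), distribution(theme, vocab)

mapping_cost = shannon_bits(1 / 4)                  # naming 1 of 4 equally likely themes
distortion = sum(folder.values()) * kl_bits(p, q)   # extra bits to code the folder's terms
print(round(mapping_cost, 2), round(distortion, 2))
```

Mapping many folders to a few themes keeps the mapping cost low but raises the distortion term, and the reverse holds when every folder gets its own theme.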


Algorithm BestSingle

[Diagram: HAC tree built over the documents, with the folders Media, Entertainment, Broadcasting, and Studios mapped to tree nodes.]

  • Pool all documents

  • Find bottom-up hierarchical clustering (HAC) using text only

  • Map each original folder to the one HAC node at the smallest KL distance

  • Low mapping cost, high distortion
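The steps above can be sketched as follows. This is a toy version under stated assumptions: clusters are merged by cosine similarity of pooled term counts, the KL distance uses additive smoothing with `alpha=0.1`, and the page contents are invented. The slides do not specify the similarity measure or the smoothing, so treat those choices as placeholders.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hac_nodes(docs):
    """Bottom-up clustering on text only; returns every node of the tree
    as (set of document ids, pooled term counts)."""
    active = [({d}, Counter(words)) for d, words in sorted(docs.items())]
    nodes = list(active)
    while len(active) > 1:
        # Merge the most similar pair of currently active clusters.
        i, j = max(combinations(range(len(active)), 2),
                   key=lambda ij: cosine(active[ij[0]][1], active[ij[1]][1]))
        merged = (active[i][0] | active[j][0], active[i][1] + active[j][1])
        active = [n for k, n in enumerate(active) if k not in (i, j)] + [merged]
        nodes.append(merged)
    return nodes


def kl_bits(p_counts, q_counts, vocab, alpha=0.1):
    """Smoothed KL distance D(p || q) in bits between two term-count bags."""
    def dist(c):
        total = sum(c.values()) + alpha * len(vocab)
        return {t: (c[t] + alpha) / total for t in vocab}
    p, q = dist(p_counts), dist(q_counts)
    return sum(p[t] * math.log2(p[t] / q[t]) for t in vocab)


def best_single(docs, folders):
    """Map each folder to the single HAC node at the smallest KL distance."""
    vocab = sorted({w for words in docs.values() for w in words})
    nodes = hac_nodes(docs)
    mapping = {}
    for name, members in folders.items():
        counts = sum((Counter(docs[d]) for d in members), Counter())
        mapping[name] = min(nodes, key=lambda n: kl_bits(counts, n[1], vocab))[0]
    return mapping


# Invented page contents for a few of the sites named in the slides.
docs = {
    "kpfa.org": ["radio", "radio", "news"],
    "bbc.co.uk": ["radio", "tv", "news"],
    "kron.com": ["tv", "tv", "news"],
    "foxmovies.com": ["movie", "studio", "film"],
    "miramax.com": ["movie", "film", "film"],
}
folders = {
    "Media": ["kpfa.org", "bbc.co.uk", "kron.com"],
    "Broadcasting": ["bbc.co.uk", "kron.com"],
    "Studios": ["foxmovies.com", "miramax.com"],
}
mapping = best_single(docs, folders)
```

A folder that coincides with a tree node maps to it with zero distortion. A folder such as Broadcasting, which may not match any node exactly, is forced onto the nearest one, which is where the high distortion noted above comes from.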



PatchHAC and Bicriteria

  • PatchHAC:

    • Start with BestSingle

    • Greedily introduce additional mappings from folders to HAC nodes

  • Bicriteria:

    • Start with each document a theme

    • Collapse greedily while total code length decreases



Conclusion

  • Recording history is feasible and useful

    • Few kilobytes per day per user

  • Bookmark taxonomies are a valuable source of information; can be…

    • Integrated into dynamic community-specific taxonomies

    • Used to drive discovery and collaboration

  • Memex can guide peer proxy caches

    • Cooperative caching between departments



Software

  • Demo: www.cs.berkeley.edu/~soumen

  • Client: Signed Swing/JFC applet

    • Netscape 4.5+ (IE, HotJava planned)

  • Server: DB2 + Berkeley DB + Servlets

  • Infrastructure for plugging in research prototypes using the Demon API

    • Clustering, classification, visualization

    • Collaborative filtering and recommendation



Related work

  • Archiving, searching, categorization

    • Vistabar (Alta Vista)

    • Bookmark organizer (IBM Haifa)

    • PowerBookmarks (NEC)

    • Purple Yogi

    • Netscape roaming access, Backflip

  • Mining

    • Attribute similarity via external probes

    • Non-linear dynamical systems

