Using Memex to archive and mine community Web browsing experience

Soumen Chakrabarti, Sandeep Srivastava, Mallela Subramanyam, Mitul Tiwari

Indian Institute of Technology Bombay




Information sources on the Web

  • Web page contents

    • Early keyword search engines

  • Hyperlink structure

    • Later engines: Google, Raging Search

  • Searching behavior

    • Search sites monitor clicks on search results

  • Browsing behavior

    • Easily captured in stand-alone hypermedia

    • Need software infrastructure for the Web


Personal Memex

  • Archiving is feasible

    • ~25 GB in a lifetime

  • Why archive?

    • Recall past events

    • Create a ‘profile’

    • Correlate with sites, directories, searches

  • Challenges

    • Flexible architecture

    • Analysis techniques

Your husband died, but here is his Memex

(From Jim Gray’s Turing Award Lecture)


Searching the personal Memex

  • Keyword search (never lose a page)

  • Advanced queries

    • Recreate my recent surfing history w.r.t. the topic ‘bicycling’

    • Extract from the MIT Web site all pages that match my ‘compiler research’ profile

  • Topic taxonomy plays a central role

    • Characterized by bookmark folders

    • More familiar than ‘universal’ directories
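The first advanced query above can be sketched as follows. This is a minimal illustration, assuming a hypothetical visit log in which every page visit already carries a topic label (for example, the bookmark folder it was classified into). The `Visit` record and the `recent_history` function are illustrative names, not part of Memex.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Visit:
    url: str
    when: datetime
    topic: str  # assumed: topic label assigned by the user or a classifier


def recent_history(visits, topic, since):
    """Return visits on `topic` at or after `since`, oldest first."""
    hits = [v for v in visits if v.topic == topic and v.when >= since]
    return sorted(hits, key=lambda v: v.when)


# Made-up log entries, for illustration only.
log = [
    Visit("http://example.org/bike-repair", datetime(2000, 5, 2), "bicycling"),
    Visit("http://example.org/compilers", datetime(2000, 5, 3), "compiler research"),
    Visit("http://example.org/bike-trails", datetime(2000, 4, 1), "bicycling"),
    Visit("http://example.org/old-bike-page", datetime(1999, 1, 1), "bicycling"),
]

trail = recent_history(log, "bicycling", since=datetime(2000, 1, 1))
trail_urls = [v.url for v in trail]
```

Here `trail_urls` holds the two bicycling pages visited in 2000, in the order they were visited.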


Archiving architecture choices

  • Bookmarks only or all click history

  • Installed application or plug-in

    • Closer integration, e.g. with COM

  • CGI and JavaScript

    • Slow, hard to monitor all clicks

  • Applet-servlet

    • Portable, better UI compared to HTML

  • Proxy or wiretap

    • Proxy involves configuring browser


Memex block diagram

[Figure: block diagram of the Memex browser client and server. The connections between boxes are not recoverable from this transcript. Labels: Browser; Memex server; Visit; Client JAR; Taxonomy synthesis; Resource discovery; Search; Attach; Recommendation; Folder; Download; Context; Classification; Mining demons; Running client applet; Event-handler servlets; Archive; Clustering; Relational metadata; Text index; Topic models; Memex client-server protocol and workload sharing negotiations.]


Document workflow

[Figure: document workflow diagram. The arrows between boxes are not recoverable from this transcript. Labels: Browser; Memex client; Page visit and bookmarking events logged; NODE table; Crawler; Push new version; Per-document version queue; Pop and discard old version; Demon Registry; Search indexer; Classifier service; Clustering service; Garbage collector.]
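One possible reading of the workflow labels is sketched below. The slide only names the components, so the control flow here is an assumption: a crawler pushes each new version onto a per-document queue, every registered demon is notified, and a garbage-collection step then discards older versions. The `VersionQueue`, `DemonRegistry`, and `crawl` names are hypothetical and do not reflect the actual Demon API.

```python
from collections import deque


class VersionQueue:
    """Versions of one document, oldest at the left."""

    def __init__(self):
        self.versions = deque()

    def push(self, content):
        self.versions.append(content)

    def pop_old(self):
        """Discard every version except the newest; return what was dropped."""
        dropped = []
        while len(self.versions) > 1:
            dropped.append(self.versions.popleft())
        return dropped


class DemonRegistry:
    """Maps demon names to handlers called on each new document version."""

    def __init__(self):
        self.demons = {}

    def register(self, name, handler):
        self.demons[name] = handler

    def notify(self, url, content):
        return {name: h(url, content) for name, h in self.demons.items()}


queues = {}
registry = DemonRegistry()
# Toy stand-ins for the search indexer and classifier service.
registry.register("indexer", lambda url, text: sorted(set(text.lower().split())))
registry.register(
    "classifier", lambda url, text: "cycling" if "bike" in text.lower() else "other"
)


def crawl(url, content):
    """Push a new version, let all demons process it, then drop old versions."""
    queue = queues.setdefault(url, VersionQueue())
    queue.push(content)
    results = registry.notify(url, content)
    queue.pop_old()  # assumed garbage-collection policy: keep only the newest
    return results


first = crawl("http://example.org/a", "Bike shops in town")
second = crawl("http://example.org/a", "Hiking trails nearby")
```

In this sketch the queue never holds more than one version after a crawl completes; a real system would presumably wait until every demon has finished with an old version before discarding it.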


Folder tab

  • Valuable user input and feedback on topics and example documents

  • User cuts and pastes to correct or reinforce the Memex classifier

  • ‘?’ indicates automatic placement by Memex classifier

  • File manager-like interface

  • Privacy choice


Context tab

  • Replay of recent browsing context restricted to chosen topic

  • Choice of topic context

  • Better mobility than one-dimensional history provided by popular browsers

  • Active browser monitoring and dynamic layout of new incremental context graph


Search tab

  • Search using keyword and visit statistics

  • “Find the paper about collaborative filtering I was reading a month back”
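A query like the one above combines a keyword match with visit times. The following is a minimal sketch under assumed data: each archived page is a dictionary with its text and a list of visit timestamps, and pages are ranked by how often they were visited near the target date. The `search` function, the page records, and the 14-day window are illustrative assumptions.

```python
from datetime import datetime, timedelta

# Made-up archive entries, for illustration only.
pages = [
    {
        "url": "http://example.org/cf-paper",
        "text": "a paper on collaborative filtering",
        "visits": [datetime(2000, 4, 3), datetime(2000, 4, 5)],
    },
    {
        "url": "http://example.org/cf-news",
        "text": "collaborative filtering startup news",
        "visits": [datetime(2000, 1, 10)],
    },
    {
        "url": "http://example.org/bikes",
        "text": "bicycle shops",
        "visits": [datetime(2000, 4, 4)],
    },
]


def search(pages, keywords, around, window_days=14):
    """Pages containing every keyword, ranked by visit count near `around`."""
    lo = around - timedelta(days=window_days)
    hi = around + timedelta(days=window_days)
    scored = []
    for page in pages:
        words = page["text"].lower().split()
        if not all(k.lower() in words for k in keywords):
            continue
        visits_in_window = sum(lo <= t <= hi for t in page["visits"])
        if visits_in_window:
            scored.append((visits_in_window, page["url"]))
    return [url for _, url in sorted(scored, reverse=True)]


now = datetime(2000, 5, 5)
results = search(pages, ["collaborative", "filtering"], around=now - timedelta(days=30))
```

Only the page that matches both keywords and was visited about a month before `now` is returned; the other keyword match was visited months earlier and falls outside the window.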


Mining issues

  • Two relations

    • occurs_in(term, document)

    • bookmarked_into(document, folder)

    • (Ignore hyperlinks for now)

  • Document classification and clustering

    • Exploit ‘bookmarked_into’

  • Taxonomy synthesis

    • Reconcile folders from a community of users into coherent themes
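The two relations can be held as plain lists of pairs, and joining them gives the term counts of each folder, which is the input that folder-based classification and clustering would start from. This is a minimal sketch with made-up tuples; the `folder_term_counts` helper is an illustrative name.

```python
from collections import Counter, defaultdict

# occurs_in(term, document) and bookmarked_into(document, folder), as tuples.
occurs_in = [
    ("bike", "d1"), ("shop", "d1"),
    ("bike", "d2"), ("trail", "d2"),
    ("movie", "d3"),
    ("radio", "d4"),  # d4 was visited but never bookmarked
]
bookmarked_into = [
    ("d1", "Cycling"),
    ("d2", "Cycling"),
    ("d2", "Hiking"),  # a document may sit in several folders
    ("d3", "Entertainment"),
]


def folder_term_counts(occurs_in, bookmarked_into):
    """Join the two relations on document; count term occurrences per folder."""
    folders_of = defaultdict(list)
    for doc, folder in bookmarked_into:
        folders_of[doc].append(folder)
    counts = defaultdict(Counter)
    for term, doc in occurs_in:
        for folder in folders_of.get(doc, []):
            counts[folder][term] += 1
    return counts


counts = folder_term_counts(occurs_in, bookmarked_into)
```

Documents that were never bookmarked (such as `d4` here) contribute nothing to any folder; these are the documents a classifier trained on the folder counts would have to place.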


Taxonomy synthesis: motivation

  • Autonomy vs collaboration

    • Personalizationpicking folders from Yahoo

    • Complex relations between users’ interests

  • Need the “simplest common ground”

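[Figure: folder trees of User1, User2, User3, and Yahoo, built from the folders Sports, Cycling, Hiking, Biz, Shops, and Bikeshops, illustrating subsumption and tree ‘inversion’. The tree structure is not recoverable from this transcript.]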


Taxonomy synthesis: intuition

[Figure: folders (Media, Broadcasting, Entertainment, Studios) linked to documents (kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com, miramax.com). Annotations: share documents, share folder, share terms. The individual links are not recoverable from this transcript.]


Taxonomy synthesis: intuition

[Figure: the same folders (Media, Broadcasting, Entertainment, Studios) and documents (kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com, miramax.com), with an added layer of themes (Radio, TV, Movies). The individual links are not recoverable from this transcript.]


Trade-off

  • Using theme nodes can simplify graph

    • Shannon encoding of folder or theme ID

  • Increases distortion of term distribution

    • Kullback-Leibler (KL) distance of distorted folder w.r.t. ‘true’ folder

  • Compare cost in bits
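Both sides of the trade-off can be measured in bits. The sketch below shows the two standard quantities involved: the ideal Shannon code length of an item with probability p, and the KL divergence between a folder's term distribution and the distribution of the theme that replaces it. The example distributions are made up, and how Memex combines the two costs is not stated on the slide.

```python
import math


def code_length_bits(p):
    """Ideal (Shannon) code length, in bits, for an event of probability p."""
    return -math.log2(p)


def kl_bits(p, q):
    """KL divergence D(p || q) in bits; q must be nonzero wherever p is."""
    return sum(pi * math.log2(pi / q[t]) for t, pi in p.items() if pi > 0)


# Term distribution of a 'true' folder and of a broader theme standing in for it.
true_folder = {"bike": 0.5, "shop": 0.25, "trail": 0.25}
theme = {"bike": 0.25, "shop": 0.25, "trail": 0.25, "movie": 0.25}

id_bits = code_length_bits(0.25)          # naming 1 of 4 equally likely themes
distortion = kl_bits(true_folder, theme)  # extra bits per term from the mismatch
```

With these numbers, naming the theme costs 2 bits and the distortion is 0.5 bits per term, so the two effects can be compared directly.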


Algorithm BestSingle

[Figure: the folders Media, Entertainment, Broadcasting, and Studios mapped onto nodes of an HAC tree built over the documents.]

  • Pool all documents

  • Find bottom-up hierarchical clustering (HAC) using text only

  • Map each original folder to the one HAC node at the smallest KL distance

  • Low mapping cost, high distortion
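The mapping step above can be sketched as follows. This illustration skips the clustering itself and takes the HAC nodes as given term-count tables; it then maps each folder to the node with the smallest KL divergence. Laplace smoothing over a shared vocabulary is an assumption added here so that the divergence is always finite. The names `normalize` and `best_single` are illustrative.

```python
import math


def normalize(counts, vocab, alpha=1.0):
    """Laplace-smoothed term distribution over a shared vocabulary."""
    total = sum(counts.get(t, 0) for t in vocab) + alpha * len(vocab)
    return {t: (counts.get(t, 0) + alpha) / total for t in vocab}


def kl_bits(p, q):
    """KL divergence D(p || q) in bits for distributions over the same keys."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p)


def best_single(folders, nodes):
    """Map each folder to the single node at the smallest KL divergence."""
    tables = list(folders.values()) + list(nodes.values())
    vocab = sorted({t for table in tables for t in table})
    node_dist = {name: normalize(c, vocab) for name, c in nodes.items()}
    mapping = {}
    for name, counts in folders.items():
        p = normalize(counts, vocab)
        mapping[name] = min(node_dist, key=lambda n: kl_bits(p, node_dist[n]))
    return mapping


# Made-up term counts for two folders.
folders = {
    "Broadcasting": {"radio": 3, "tv": 3, "news": 2},
    "Studios": {"movie": 4, "film": 3},
}
# Stand-ins for nodes of an HAC tree built over the pooled documents.
nodes = {
    "root": {"radio": 3, "tv": 3, "news": 2, "movie": 4, "film": 3},
    "left": {"radio": 3, "tv": 3, "news": 2},
    "right": {"movie": 4, "film": 3},
}

mapping = best_single(folders, nodes)
```

Each folder gets exactly one node, which keeps the mapping cheap to describe; a folder whose documents straddle several nodes would be forced onto a poorly fitting one, which is the high distortion the slide refers to.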


PatchHAC and Bicriteria

  • PatchHAC:

    • Start with BestSingle

    • Greedily introduce additional mappings from folders to HAC nodes

  • Bicriteria:

    • Start with each document a theme

    • Collapse greedily while total code length decreases
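The greedy collapse in Bicriteria can be illustrated with a simplified stand-in for the cost. The slide does not define the total code length, so this sketch assumes a two-part cost: a fixed number of bits per theme (`THEME_BITS`, a hypothetical constant) plus the bits needed to encode each theme's pooled terms under its own empirical distribution. Starting with each document as its own theme, the pair whose merge lowers the total the most is collapsed, until no merge helps.

```python
import math
from collections import Counter
from itertools import combinations

THEME_BITS = 8.0  # hypothetical fixed cost of describing one theme


def data_bits(docs):
    """Bits to encode the pooled terms of `docs` with their empirical distribution."""
    pooled = Counter()
    for doc in docs:
        pooled.update(doc)
    total = sum(pooled.values())
    return -sum(c * math.log2(c / total) for c in pooled.values())


def total_bits(themes):
    return sum(THEME_BITS + data_bits(theme) for theme in themes)


def collapse(docs):
    """Greedily merge themes while the total code length decreases."""
    themes = [[doc] for doc in docs]  # start with each document a theme
    while len(themes) > 1:
        current = total_bits(themes)
        best = None
        for i, j in combinations(range(len(themes)), 2):
            rest = [t for k, t in enumerate(themes) if k not in (i, j)]
            merged = rest + [themes[i] + themes[j]]
            cost = total_bits(merged)
            if cost < current and (best is None or cost < best[0]):
                best = (cost, merged)
        if best is None:
            break  # no merge lowers the total: stop
        themes = best[1]
    return themes


# Made-up term counts for four documents.
docs = [
    {"radio": 2, "news": 2},
    {"radio": 2, "news": 2},
    {"movie": 2, "film": 2},
    {"movie": 2, "film": 2},
]
themes = collapse(docs)
theme_terms = sorted(sorted(set().union(*theme)) for theme in themes)
```

With this cost, the two radio documents merge and the two movie documents merge, but the resulting themes stay separate because pooling unrelated terms costs more data bits than the saved theme description.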


Conclusion

  • Recording history is feasible and useful

    • Few kilobytes per day per user

  • Bookmark taxonomies are a valuable source of information; can be…

    • Integrated into dynamic community-specific taxonomies

    • Used to drive discovery and collaboration

  • Memex can guide peer proxy caches

    • Cooperative caching between departments


Software

  • Demo: www.cs.berkeley.edu/~soumen

  • Client: Signed Swing/JFC applet

    • Netscape 4.5+ (IE, HotJava planned)

  • Server: DB2 + Berkeley DB + Servlets

  • Infrastructure for plugging in research prototypes using the Demon API

    • Clustering, classification, visualization

    • Collaborative filtering and recommendation


Related work

  • Archiving, searching, categorization

    • Vistabar (Alta Vista)

    • Bookmark organizer (IBM Haifa)

    • PowerBookmarks (NEC)

    • Purple Yogi

    • Netscape roaming access, Backflip

  • Mining

    • Attribute similarity via external probes

    • Non-linear dynamical systems

