
Nurturing content-based collaborative communities on the Web

Soumen Chakrabarti

Center for Intelligent Internet Research
Computer Science and Engineering
Indian Institute of Technology Bombay

www.cse.iitb.ernet.in/~soumen
www.cse.iitb.ernet.in/~cfiir


Generic search engines

  • Struggle to cover the expanding Web

    • 35% coverage in 1997 (Bharat and Broder)

    • 18% in 1999 (Lawrence and Lee Giles)

    • Google rebounds to 50% in 2000

    • Moore’s law vs. Web population

    • Search quality, index freshness

  • Cannot afford advanced processing

    • Alta Vista serves >40 million queries / day

    • Cannot even afford to seek on disk (8ms)

    • Limits intelligence of search engines


Scale vs. quality

[Diagram: scale vs. quality trade-off. Toward the quality axis: lexical networks, parsing, semantic indexing; then resource discovery and focused crawling; then link-assisted ranking and topic distillation (Google, Clever); toward the scale axis: keyword-based search engines (HotBot, Alta Vista).]


The case for vertical portals

“Portals and search pages are changing rapidly, in part because their biggest strength — massive size and reach — can also be a drawback. The most interesting trend is the growing sense of natural limits, a recognition that covering a single galaxy can be more practical — and useful — than trying to cover the entire universe.”

(San Jose Mercury News, 1/1999)


Scaling through specialization

  • The Web shows content-based locality

    • Link-based clusters correlated with content

    • Content-based communities emerge in a spontaneous, decentralized fashion

  • Can learn and exploit locality patterns

    • Analyze page visits and bookmarks

    • Automatically construct a “focused portal” with resources that

      • Have high relevance and quality

      • Are up-to-date and collectively comprehensive


Roadmap

  • Hyperlink mining: a short history

  • Resource discovery

    • Content-based locality in hypertext

    • Taxonomy models, topic distillation

    • Strategies for focused crawling

  • Data capture and mining architecture

    • The Memex collaboration system

      • Collaborative construction of vertical portals

    • Link metadata management architecture

      • Surfing backwards on the Web


Historical background

  • First generation Web search engines

    • Delete ‘stopwords’ from queries

    • Can only do syntactic matching

  • Users stopped asking good questions!

    • TREC queries: tens to hundreds of words

    • Alta Vista: at most 2–3 words

  • Crisis of abundance

    • Relevance ranking for very short queries

    • Quality complements relevance — that’s where hand-made topic directories shine


Hyperlink Induced Topic Search (HITS)

[Diagram: a keyword search engine returns a response set for the query; the response is expanded by radius 1 into a Web subgraph whose nodes carry ‘hub’ (h) and ‘authority’ (a) scores.]

a = E h,   h = Eᵀ a
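These coupled equations are solved by iterating to a fixed point. A minimal Python sketch of that iteration, assuming a toy link map; the graph, iteration count, and normalization scheme are illustrative, not from the slides:

def hits(links, iterations=20):
    pages = set(links) | {v for outs in links.values() for v in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority of v accumulates the hub scores of pages citing v
        auth = {p: 0.0 for p in pages}
        for u, outs in links.items():
            for v in outs:
                auth[v] += hub[u]
        # hub score of u accumulates the authority scores of the pages u cites
        hub = {u: sum(auth[v] for v in links.get(u, [])) for u in pages}
        # normalize so scores stay bounded across iterations
        na = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        nh = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        auth = {p: a / na for p, a in auth.items()}
        hub = {p: h / nh for p, h in hub.items()}
    return hub, auth

# toy usage: two response pages acting as hubs for candidate authorities
hub, auth = hits({"q1": ["p1", "p2"], "q2": ["p1", "p3"], "p1": ["p3"]})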


PageRank and Google

  • Prestige of a page is proportional to the sum of the prestige of citing pages
  • Standard bibliometric measure of influence
  • Simulate a random walk on the Web (follow a random outlink from each page) to precompute the prestige of all pages
  • Sort keyword-matched responses by decreasing prestige

[Diagram: page p4 is cited by p1, p2 and p3, so p4 ∝ p1 + p2 + p3; i.e., p = E p]
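The fixed point p = E p is computed offline by power iteration over the random walk. A minimal sketch, assuming a damping factor and toy link data that are illustrative rather than Google's actual settings:

def pagerank(links, damping=0.85, iterations=50):
    pages = set(links) | {v for outs in links.values() for v in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for u in pages:
            outs = links.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share          # prestige flows to cited pages
            else:
                for v in pages:              # dangling page: spread its rank uniformly
                    new[v] += damping * rank[u] / n
        rank = new
    return rank

# toy usage matching the diagram: p1, p2, p3 all cite p4
rank = pagerank({"p1": ["p4"], "p2": ["p4"], "p3": ["p4"], "p4": ["p1"]})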


Observations

  • HITS

    • Uses text initially to select Web subgraph

    • Expands subgraph by radius 1 … magic!

    • h and a scores independent of content

    • Iterations required at query time

  • Google/PageRank

    • Precomputed query-independent prestige

    • No iterations needed at query time, faster

    • Keyword query selects subgraph to rank

    • No notion of hub or bipartite reinforcement


Limitations

  • Artificial decoupling of text and links

  • Connectivity-based topic drift (HITS)

    • “movie awards” → “movies”

    • Expanders at www.web-popularity.com

  • Feature diffusion (Google)

    • “more evil than evil” → www.microsoft.com

    • New threat of anchor-text spamming

  • Decoupled ranking (Google)

    • “harvard mother” → Bill Gates’s bio page!


Genealogy

[Diagram: genealogy of hyperlink-mining work, with nodes for bibliometry, HITS, exploiting anchor text, Google, outlier elimination, topic distillation (@Compaq, [email protected]), text classification, hypertext classification, relaxation labeling, focused crawling, crawling context graphs, and learning topic paths.]


Reducing topic drift: anchor text

  • Page modeled as a sequence of tokens and outlinks
  • “Radius of influence” around each token
  • A query term matching a token increases the weight of nearby links
  • Favors hubs and authorities near relevant pages
  • Better answers than HITS
  • Ad-hoc “spreading activation”, but no formal model as yet

[Diagram: query terms occurring in the page text activate outlinks within their radius of influence.]
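A minimal sketch of the proximity idea: each outlink gains weight from query terms that occur within a token window around it. The window size, the decay, and the data layout are assumptions for illustration, not the slide's exact scheme:

def link_weights(tokens, link_positions, query_terms, radius=10):
    """tokens: page text as a token list; link_positions: token offset of each outlink."""
    weights = {}
    for link, pos in link_positions.items():
        w = 1.0
        lo, hi = max(0, pos - radius), min(len(tokens), pos + radius + 1)
        for i in range(lo, hi):
            if tokens[i].lower() in query_terms:
                w += 1.0 / (1 + abs(i - pos))   # closer matching terms contribute more
        weights[link] = w
    return weights

tokens = "cheap bicycle parts and cycling gear from our online shop".split()
weights = link_weights(tokens, {"http://bikeshop.example/": 2}, {"cycling", "bicycle"})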


Reducing topic drift: outlier detection

  • The keyword search response is usually ‘purer’ than the radius-1 expansion
  • Compute document term vectors (vector-space model)
  • Compute the centroid of the response vectors
  • Eliminate expanded vectors far from the centroid
  • Results improve
  • Why stop at radius = 1?

[Diagram: the keyword search response sits inside the expanded graph; a cut-off radius around the response centroid prunes far-away expanded documents.]
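A minimal sketch of the pruning step, assuming raw term-frequency vectors and a cosine-similarity cut-off; the cut-off value is an illustrative assumption:

import math
from collections import Counter

def tf_vector(text):
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def prune_outliers(response_docs, expanded_docs, cutoff=0.2):
    # centroid of the keyword-search response set
    centroid = Counter()
    for d in response_docs:
        centroid.update(tf_vector(d))
    # keep only expanded documents close enough to the response centroid
    return [d for d in expanded_docs if cosine(tf_vector(d), centroid) >= cutoff]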


Resource discovery

  • Given

    • Yahoo-like topic tree with example URLs

    • A selection of good topics to explore

  • Examples, not queries, define topics

    • Need 2-way decision, not ad-hoc cut-off

  • Goal

    • Start from the good / relevant examples

    • Crawl to collect additional relevant URLs

    • Fetch as few irrelevant URLs as possible


A model for relevance

[Diagram: a Yahoo-like topic tree. All → Arts, Bus&Econ (→ Companies, ...), Recreation (→ Cycling → Bike Shops, Clubs, Mt. Biking). Classes are marked as good (e.g., Cycling), subsumed (its subtopics), path, or blocked.]


Pr(c|d) from Pr(d|c) using Bayes rule

  • Decide topic: topic c is picked with prior probability π(c); Σc π(c) = 1
  • Each c has parameters θ(c,t) for terms t
  • Coin with face probabilities θ(c,t); Σt θ(c,t) = 1
  • Fix document length n(d) and toss the coin
  • Naïve yet effective; can use other algorithms
  • Given c, the probability of the document is
    Pr(d|c) = ( n(d)! / ∏t n(d,t)! ) ∏t θ(c,t)^n(d,t), where n(d,t) is the count of term t in d,
    and Bayes rule gives Pr(c|d) ∝ π(c) Pr(d|c)
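A minimal sketch of the classification step Pr(c|d) ∝ π(c) ∏t θ(c,t)^n(d,t), worked in log space; the smoothing floor and the input dictionaries are illustrative assumptions:

import math
from collections import Counter

def classify(doc_tokens, prior, theta):
    """prior: {class: π(c)}; theta: {class: {term: θ(c,t)}} with smoothing already applied."""
    counts = Counter(doc_tokens)
    scores = {}
    for c in prior:
        log_p = math.log(prior[c])
        for t, n in counts.items():
            log_p += n * math.log(theta[c].get(t, 1e-6))  # tiny floor for unseen terms
        scores[c] = log_p
    # normalize the log scores into Pr(c|d)
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return {c: math.exp(s - m) / z for c, s in scores.items()}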


Enhanced models for hypertext

  • c = class, d = text, N = neighbors
  • Text-only model: Pr(d|c)
  • Using neighbors’ text to judge my topic: Pr(d, d(N) | c)
  • Better recursive model: Pr(d, c(N) | c)
  • Relaxation labeling over Markov random fields, or an EM formulation

[Diagram: a test page with unknown class ‘?’ linked to neighboring pages.]
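One plausible reading of the recursive model as relaxation labeling: each page's class distribution is repeatedly re-estimated from its own text evidence and its neighbors' current distributions. The coupling weight and iteration count below are assumptions, not the slide's formulation:

def relax_labels(text_scores, neighbors, coupling=0.5, iterations=10):
    """text_scores: {page: {class: Pr(c|text)}}; neighbors: {page: [linked pages]}."""
    labels = {p: dict(s) for p, s in text_scores.items()}
    classes = next(iter(text_scores.values())).keys()
    for _ in range(iterations):
        new = {}
        for p in labels:
            nbrs = [labels[q] for q in neighbors.get(p, []) if q in labels]
            dist = {}
            for c in classes:
                nbr_term = (sum(q[c] for q in nbrs) / len(nbrs)) if nbrs else 1.0 / len(classes)
                # blend own-text evidence with the neighborhood's current beliefs
                dist[c] = (1 - coupling) * text_scores[p][c] + coupling * nbr_term
            z = sum(dist.values())
            new[p] = {c: v / z for c, v in dist.items()}
        labels = new
    return labels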


Hyperlink modeling boosts accuracy

  • 9,600 patents from 12 classes marked by USPTO
  • Patents have text and prior-art links
  • Expand the test patent to include its neighborhood
  • ‘Forget’ and re-estimate a fraction of the neighbors’ classes
  • (Even better results for Yahoo)


Resource discovery: basic approach

  • Topic taxonomy with examples and ‘good’ topics specified
  • Crawler coupled to a hypertext classifier
  • Crawl frontier expanded in relevance order
  • Neighbors of good hubs expanded with high priority

[Diagram: example URLs seed the crawl; unvisited pages ‘?’ are reached by the radius-1 rule (neighbors of relevant pages) and the radius-2 rule (neighbors of good hubs).]
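A minimal sketch of the crawl loop with a relevance-ordered frontier; fetch(), relevance(), and extract_links() are assumed stand-ins, not real APIs from the system:

import heapq

def focused_crawl(seed_urls, fetch, relevance, extract_links, budget=1000):
    frontier = [(-1.0, url) for url in seed_urls]   # seeds get top priority
    heapq.heapify(frontier)
    visited, harvested = set(), []
    while frontier and len(visited) < budget:
        neg_score, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        score = relevance(page)          # classifier's Pr(good topic | page text)
        harvested.append((url, score))
        # radius-1 rule: outlinks inherit the parent's relevance as their priority
        for out in extract_links(page):
            if out not in visited:
                heapq.heappush(frontier, (-score, out))
    return harvested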


Focused crawler block diagram

[Block diagram: a taxonomy editor and example browser feed the taxonomy database; the hypertext classifier (learn phase) builds topic models from the examples and the classifier (apply phase) scores newly crawled pages; a topic distiller and the classifier provide feedback to the scheduler, which prioritizes URLs for the crawl workers and the crawl database.]


Focused crawling evaluation

  • Harvest rate

    • What fraction of crawled pages are relevant

  • Robustness across seed sets

    • Perform separate crawls with random disjoint samples

    • Measure overlap in URLs, server IP addresses, and best-rated resources

  • Evidence of non-trivial work

    • Path length to the best resources


Harvest rate

[Charts: harvest rate over time for a focused crawl vs. an unfocused crawl.]


Crawl robustness

[Charts: URL overlap and server overlap between two crawls (Crawl 1 and Crawl 2) started from disjoint seed sets.]


Robustness of resource quality

  • Sample disjoint sets of starting URLs
  • Run two separate crawls
  • Run HITS/Clever to find the best authorities
  • Order by rank
  • Measure the overlap in the top-rated resources


Distance to best resources

  • Cycling: cooperative topic
  • Mutual funds: competitive topic

[Charts: distance from the seed URLs to the best-rated resources for each topic.]


[Screenshot: a top hub on ‘airlines’ after half an hour of focused crawling.]


[Screenshot: a top hub on ‘bicycling’ after one hour of focused crawling.]


Learning context graphs

  • Topics form connected cliques
    • “heart disease” → ‘swimming’, ‘hiking’
    • ‘cycling’ → “first-aid”!
  • Radius-1 rule can be myopic
    • Trapped within boundaries of related topics
  • From short pre-crawled paths
    • Can learn frequent chains of related topics
    • Use this knowledge to circumvent local “topic traps” (see the sketch below)
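A hedged sketch of how a learned context graph could steer the frontier: a layer classifier, assumed here as predict_layer() and trained on short pre-crawled paths, estimates how many hops a page is from the target topic, and the frontier is popped in order of estimated distance:

import heapq

def context_crawl(seeds, fetch, predict_layer, extract_links, budget=500):
    frontier = [(0, url) for url in seeds]      # seeds assumed on-topic (layer 0)
    heapq.heapify(frontier)
    visited, on_topic = set(), []
    while frontier and len(visited) < budget:
        layer, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        est = predict_layer(page)               # 0 = target topic, 1 = one hop away, ...
        if est == 0:
            on_topic.append(url)
        for out in extract_links(page):
            if out not in visited:
                # a page estimated at layer k may lead to the target in about k hops
                heapq.heappush(frontier, (est, out))
    return on_topic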



Roadmap

  • Hyperlink mining: a short history

  • Resource discovery

    • Content-based locality in hypertext

    • Taxonomy models, topic distillation

    • Strategies for focused crawling

  • Data capture and mining architecture

    • The Memex collaboration system

      • Collaborative construction of vertical portals

    • Link metadata management architecture

      • Surfing backwards on the Web


Memex project goals

  • Infrastructure to support spontaneous formation of topic-based communities

  • Mining algorithms for personal and community level topic management and collaborative resource discovery

  • Extensible API for plugging in additional hypertext analysis tools


Memex project status

  • Java applet client

    • Netscape 4.5+ (Javascript) available

    • IE4+ (ActiveX) planned

  • Server code for Unix and Windows

    • Servlets + IBM Universal Database

    • Berkeley DB lightweight storage manager

    • Simple-to-install RPMs for Linux planned

  • About a dozen alpha testers

  • First beta available 12/2000


Creating personal topic spaces

  • Valuable user input and feedback on topics and associated examples

[Screenshot: a file manager-like interface with a privacy choice per folder; ‘?’ indicates automatic placement by the Memex classifier, and the user cuts and pastes documents to correct or reinforce the classifier.]


Replaying topic-based contexts

  • “Where was I when last surfing around /Software/Programming?”

[Screenshot: the user chooses a topic context and Memex replays the recent browsing context restricted to that topic; active browser monitoring and dynamic layout of the new/incremental context graph give better mobility than the one-dimensional history provided by popular browsers.]


Synthesis of a community taxonomy

  • Users classify URLs into folders
  • How to synthesize personal folders into a common taxonomy?
  • Combine multiple similarity hints: shared documents, shared folders, shared terms (see the sketch below)

[Diagram: personal folders such as Media, Broadcasting, Entertainment and Studios, holding URLs like kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com and miramax.com, are merged into community themes such as ‘Radio’, ‘Television’ and ‘Movies’.]
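A minimal sketch of combining two of these hints, Jaccard overlap of shared URLs plus overlap of page terms; the weights and data layout are illustrative assumptions, not Memex's actual synthesis algorithm:

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def folder_similarity(folder1, folder2, w_url=0.6, w_term=0.4):
    """Each folder is a dict with 'urls' and 'terms' drawn from the pages it holds."""
    return (w_url * jaccard(folder1["urls"], folder2["urls"])
            + w_term * jaccard(folder1["terms"], folder2["terms"]))

f1 = {"urls": ["kpfa.org", "bbc.co.uk"], "terms": ["radio", "news", "broadcast"]}
f2 = {"urls": ["bbc.co.uk", "kcbs.com"], "terms": ["radio", "station", "news"]}
score = folder_similarity(f1, f2)   # a high score suggests merging into one community theme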


Setting up the focused crawler

[Screenshot: the user drags current examples into the taxonomy editor; the system suggests additional examples.]


Monitoring harvest rate

[Chart: relevance/harvest rate over time, one point per URL, with a moving average.]
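A minimal sketch of such a monitor, assuming a fixed-size moving window over the classifier's per-URL relevance verdicts; the window size is illustrative:

from collections import deque

class HarvestMonitor:
    def __init__(self, window=100):
        self.recent = deque(maxlen=window)   # 1 if the fetched page was relevant, else 0

    def record(self, is_relevant):
        self.recent.append(1 if is_relevant else 0)

    def harvest_rate(self):
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

monitor = HarvestMonitor()
for relevant in [True, True, False, True]:   # e.g., classifier verdicts per fetched URL
    monitor.record(relevant)
rate = monitor.harvest_rate()                # moving-average harvest rate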


Overview of the Memex system

[Block diagram: the client applet (client JAR) running in the browser sends visit, search, attach, folder, download, and context events to event-handler servlets on the Memex server; mining demons perform taxonomy synthesis, resource discovery, recommendation, classification, clustering, and archiving over relational metadata, a text index, and topic models; the Memex client-server protocol negotiates workload sharing.]


Surfing backwards using contexts

  • Space-bounded referrer log
  • HTTP extension to query backlink data (“who points to S2/P2?”)

GET /P2 HTTP/1.0
Referer: http://S1/P1

[Diagram: client C requests http://S2/P2 from server S2 with Referer http://S1/P1; S2 records the referrer in its backlink database (local or on the Memex server), which another client C’ can later query.]
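A minimal sketch of a space-bounded referrer log, assuming a per-page bound on the number of distinct referrers retained; the bound and the class layout are illustrative:

from collections import OrderedDict, defaultdict

class ReferrerLog:
    def __init__(self, max_per_page=50):
        self.max_per_page = max_per_page
        self.backlinks = defaultdict(OrderedDict)   # page path -> {referrer URL: None}

    def record(self, path, referer):
        links = self.backlinks[path]
        links.pop(referer, None)        # refresh recency if already present
        links[referer] = None
        if len(links) > self.max_per_page:
            links.popitem(last=False)   # evict the oldest referrer to bound space

    def who_points_to(self, path):
        return list(self.backlinks.get(path, {}))

log = ReferrerLog()
log.record("/P2", "http://S1/P1")       # from: GET /P2 with Referer: http://S1/P1
backlinks = log.who_points_to("/P2")    # -> ["http://S1/P1"]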






User study and analysis

  • (1999) Significant improvement in finding comprehensive resource lists

    • Six broad information needs, 25 volunteers

    • Find good resources within limited time

    • Backlinks faked using search engines

    • Blind-reviewed by three other volunteers

  • (2000) The average path length of the undirected Web graph is much smaller than that of the directed Web graph

  • (2000) Better focused crawls using backlinks

  • Proposal to W3C


Backlinks improve focused crawling

  • Follow forward HREFs as before
  • Also expand backlinks using ‘link:’ queries
  • Classify pages as before

[Chart: harvest rate with and without backlink expansion; backlink expansion sometimes distracts the crawler into unrewarding work, but pays off in the end.]


Surfing backwards: summary

  • “Life must be lived forwards, but it can only be understood backwards” —Soren Kierkegaard

  • Hubs are everywhere!

    • To find them, look backwards

  • Bidirectional surfing is a valuable means to seed focused resource discovery

    • Even if one has to depend on search engines initially for link:… queries


Conclusion

  • Architecture for topic-specific web resource discovery

  • Driven by examples collected from surfing and bookmarking activity

  • Reduced dependence on large crawlers

  • Modest desktop hardware adequate

  • Variable radius goal-directed crawling

  • High harvest rate

  • High quality resources found far from keyword query response nodes

