
Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery

Soumen Chakrabarti

Martin van den Berg

Byron Dom


Portals and portholes

  • Popular search portals and directories

    • Useful for generic needs

    • Difficult to do serious research

  • Information needs of net-savvy users are getting very sophisticated

  • Relatively little business incentive

  • Need handmade specialty sites: portholes

  • Resource discovery must be personalized


Quote

The emergence of portholes will be one of the major Internet trends of 1999. As people become more savvy users of the Net, they want things which are better focused on meeting their specific needs. We're going to see a whole lot more of this, and it's going to potentially erode the user base of some of the big portals.

  • Jim Hake (Founder, Global Information Infrastructure Awards)


Scenario

  • Disk drive research group wants to track magnetic surface technologies

  • Compiler research group wants to trawl the web for graduate student resumés

  • ____ wants to enhance his/her collection of bookmarks about ____ with prominent and relevant links

  • Virtual libraries like the Open Directory Project and the Mining Co.


Structured web queries

  • How many links were found from an environment protection agency site to a site about oil and natural gas in the last year?

  • Apart from cycling, what is the most common topic cited by pages on cycling?

  • Find Web research pages which are widely cited by Hawaiian vacation pages


Goal

  • Automatically construct a focused portal (porthole) containing resources that are

    • Relevant to the user’s focus of interest

    • Of high influence and quality

    • Collectively comprehensive

  • Answer structured web queries by selectively exploring the topics involved in the query


Tools at hand

  • Keyword search engines

    • Synonymy, polysemy

    • Abundance, lack of quality

  • Hand compiled topic directories

    • Labor intensive, subjective judgements

  • Resources automatically located using keyword search and link graph distillation

    • Dependence on large crawls and indices


Estimating popularity

  • Extensive research on social network theory

    • Wasserman and Faust

  • Hyperlink based

    • Large in-degree indicates popularity/authority

    • Not all votes are worth the same

  • Several similar ideas and refinements

    • Google (Page and Brin) and HITS (Kleinberg)

    • Resource compilation (Chakrabarti et al)

    • Topic distillation (Bharat and Henzinger)


Topic distillation overview

  • Given web graph and query

  • Search engine selects sub-graph

  • Expansion, pruning and edge weights

  • Nodes iteratively transfer authority to cited neighbors

[Diagram: a query goes to the search engine, which selects a subgraph of the Web for distillation.]
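The iterative authority transfer on the selected subgraph can be sketched as Kleinberg's HITS power iteration. The toy graph and node names below are illustrative, not from the talk:

```python
def hits(edges, iterations=50):
    """Run the hub/authority power iteration of Kleinberg's HITS."""
    nodes = {n for e in edges for n in e}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages citing the node.
        auth = {n: sum(hub[u] for u, v in edges if v == n) for n in nodes}
        # Hub score: sum of authority scores of the pages the node cites.
        hub = {n: sum(auth[v] for u, v in edges if u == n) for n in nodes}
        # Normalize so the scores stay bounded across iterations.
        a_norm = sum(x * x for x in auth.values()) ** 0.5 or 1.0
        h_norm = sum(x * x for x in hub.values()) ** 0.5 or 1.0
        auth = {n: x / a_norm for n, x in auth.items()}
        hub = {n: x / h_norm for n, x in hub.items()}
    return hub, auth

# Illustrative graph: h1 cites both authorities, h2 and h3 cite one each.
edges = [("h1", "a"), ("h2", "a"), ("h1", "b"), ("h3", "b")]
hub, auth = hits(edges)
```

Here h1 ends up the strongest hub because it cites both authority pages, which is the "not all votes are worth the same" refinement over raw in-degree.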


Preliminary distillation-based approach

  • Design a keyword query to represent a topic

  • Run topic distillation periodically

  • Refine query through trial-and-error

  • Works well if answer is partially known, e.g., European airlines

    • +swissair +iberia +klm


Problems with preliminary approach

  • Dependence on large web crawl and index

    • System = crawler + index + distiller

  • Unreliability of keyword match

    • Engines differ significantly on a given query due to small overlap [Bharat and Broder]

    • Narrow, arbitrary view of relevant subgraph

    • Topic model does not improve over time

  • Difficulty of query construction

  • Lack of output sensitivity


Query construction

/Companies/Electronics/Power_Supply

+"power suppl*"

“switch* mode” smps

-multiprocessor*

"uninterrupt* power suppl*" ups

-parcel*


Query complexity

  • Complex queries (966 trials)

    • Average words 7.03

    • Average operators (+, -, *, ") 4.34

  • Typical Alta Vista queries are much simpler [Silverstein, Henzinger, Marais and Moricz]

    • Average query words 2.35

    • Average operators (+, -, *, ") 0.41

  • Forcibly adding a hub or authority node helped in 86% of the queries



Output sensitivity

  • Say the goal is to find a comprehensive collection of recreational and competitive bicycling sites and pages

  • Ideally effort should scale with size of the result

  • Time spent crawling and indexing sites unrelated to the topic is wasted

  • Likewise, time that does not improve comprehensiveness is wasted


Proposed solution

  • Resource discovery system that can be customized to crawl for any topic by giving examples

  • Hypertext mining algorithms learn to recognize pages and sites about the given topic, and a measure of their centrality

  • Crawler has guidance hooks controlled by these two scores
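A minimal sketch of such a guidance hook, assuming a priority-queue frontier ordered by the citing page's relevance score. The `classify` function, its topic word list, and the `fetch`/`out_links` callbacks are all stand-ins for the real hypertext classifier and crawler plumbing, not the paper's implementation:

```python
import heapq

def classify(page_text):
    # Placeholder relevance model (an assumption, not the paper's
    # classifier): fraction of topic words among the page's tokens.
    topic = {"cycling", "bike", "race"}
    words = page_text.lower().split()
    return sum(w in topic for w in words) / max(len(words), 1)

def focused_crawl(seeds, fetch, out_links, max_pages=100):
    # Frontier is a max-heap via negated scores; seeds start at priority 1.0.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    visited, harvested = set(), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        score = classify(fetch(url))
        harvested.append((url, score))
        # Guidance hook: out-links inherit the citing page's relevance,
        # so links found on relevant pages are expanded first.
        for link in out_links(url):
            if link not in visited:
                heapq.heappush(frontier, (-score, link))
    return harvested
```

The effect is that effort concentrates around relevant pages instead of spreading over the whole Web, which is what makes the crawl output-sensitive.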


Administration scenario

[Diagram: in the taxonomy editor, the administrator drags current examples onto taxonomy nodes, and the system suggests additional examples.]


Relevance

[Taxonomy tree: All at the root, with children Arts, Bus&Econ (containing Companies, ...), and Recreation (containing Cycling, whose children are Bike Shops, Clubs, Mt.Biking). Nodes on the path from the root are path nodes, Cycling is the good node, and the nodes beneath it are subsumed nodes.]


Classification

  • How relevant is a document w.r.t. a class?

    • Supervised learning, filtering, classification, categorization

  • Many types of classifiers

    • Bayesian, nearest neighbor, rule-based

  • Hypertext

    • Both text and links are class-dependent clues

    • How to model link-based features?


The “bag-of-words” document model

  • Decide the topic: topic c is picked with prior probability \pi(c), where \sum_c \pi(c) = 1

  • Each class c has a parameter \theta(c,t) for each term t

  • Toss a coin with face probabilities \theta(c,t), where \sum_t \theta(c,t) = 1

  • Fix the document length and keep tossing the coin

  • Given c, the probability of document d is Pr[d \mid c] = \binom{n(d)}{\{n(d,t)\}} \prod_{t \in d} \theta(c,t)^{n(d,t)}, where n(d,t) counts occurrences of term t in d
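Under this model, classification picks the class with the highest posterior. A toy sketch with made-up classes, priors, and term probabilities (none of these numbers are from the talk):

```python
import math

# Illustrative term probabilities theta[c][t] and class priors pi(c).
theta = {
    "cycling": {"bike": 0.5, "race": 0.3, "fund": 0.2},
    "finance": {"bike": 0.1, "race": 0.2, "fund": 0.7},
}
prior = {"cycling": 0.5, "finance": 0.5}

def log_posterior(tokens, c):
    # log pi(c) + sum over tokens of log theta(c, t). The multinomial
    # coefficient depends only on the document, so it cancels in argmax.
    return math.log(prior[c]) + sum(math.log(theta[c][t]) for t in tokens)

doc = ["bike", "race", "bike"]
best = max(prior, key=lambda c: log_posterior(doc, c))
```

Working in log space avoids underflow when documents have many terms.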


Exploiting link features

  • c = class, t = text, N = neighbors

  • Text-only model: Pr[t | c]

  • Using neighbors’ text to judge my topic: Pr[t, t(N) | c]

  • Better model: Pr[t, c(N) | c]

  • Non-linear relaxation

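A cartoon of the relaxation idea: each page keeps a class distribution that is repeatedly re-estimated from its own text evidence blended with its neighbors' current distributions. The blend weight, scores, and page names below are illustrative assumptions, not the paper's exact iteration:

```python
def relax(text_score, neighbors, rounds=10, alpha=0.6):
    """text_score: {page: {class: prob}}; neighbors: {page: [pages]}."""
    belief = {p: dict(s) for p, s in text_score.items()}
    classes = next(iter(text_score.values())).keys()
    for _ in range(rounds):
        new = {}
        for p, nbrs in neighbors.items():
            new[p] = {}
            for c in classes:
                nbr_avg = (sum(belief[q][c] for q in nbrs) / len(nbrs)
                           if nbrs else text_score[p][c])
                # Blend local text evidence with neighborhood consensus.
                new[p][c] = alpha * text_score[p][c] + (1 - alpha) * nbr_avg
            z = sum(new[p].values())
            new[p] = {c: v / z for c, v in new[p].items()}
        belief = new
    return belief

# Page "x" has ambiguous text but two firmly on-topic neighbors.
text_score = {"x": {"cyc": 0.5, "fin": 0.5},
              "n1": {"cyc": 0.9, "fin": 0.1},
              "n2": {"cyc": 0.9, "fin": 0.1}}
neighbors = {"x": ["n1", "n2"], "n1": ["x"], "n2": ["x"]}
belief = relax(text_score, neighbors)
```

After relaxation, the ambiguous page drifts toward its neighbors' class, which is exactly the Pr[t, c(N) | c] intuition above.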


Improvement using link features

  • 9600 patents from 12 classes marked by USPTO

  • Patents have text and cite other patents

  • Expand test patent to include neighborhood

  • ‘Forget’ fraction of neighbors’ classes


Putting it together

[System diagram: a taxonomy editor and example browser feed the taxonomy database and topic models; crawl workers fill the crawl database; the hypertext classifier (learn and apply phases) and the topic distiller provide feedback to the crawl scheduler.]


Monitoring the crawler

[Plot: relevance of each crawled URL against time, with a moving average overlaid.]
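The moving-average trace in the monitoring plot can be reproduced with a simple exponentially weighted average of per-URL relevance scores; the smoothing factor here is an assumption, not from the talk:

```python
def moving_average(relevances, alpha=0.1):
    """Exponentially weighted moving average of a relevance sequence."""
    avg, out = None, []
    for r in relevances:
        # Each new score nudges the running average by a factor alpha.
        avg = r if avg is None else alpha * r + (1 - alpha) * avg
        out.append(avg)
    return out

trace = moving_average([1, 0, 1, 1, 0, 1])
```

The smoothed trace makes it easy to see whether the crawler is staying on topic rather than reacting to every individual page.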


Measures of success

  • Harvest rate

    • What fraction of crawled pages are relevant

  • Robustness across seed sets

    • Separate crawls with random disjoint samples

    • Measure overlap in URLs and servers crawled

    • Measure agreement in best-rated resources

  • Evidence of non-trivial work

    • #Links from start set to the best resources


Harvest rate

[Charts: harvest rate over time for an unfocused crawl vs. a focused crawl.]


Crawl robustness

[Charts: URL overlap and server overlap between crawl 1 and crawl 2.]


Top resources after one hour

  • Recreational and competitive cycling

    • http://www.truesport.com/Bike/links.htm

    • http://reality.sgi.com/billh_hampton/jrvs/links.html

    • http://www.acs.ucalgary.ca/~bentley/mark_links.html

  • HIV/AIDS research and treatment

    • http://www.stopaids.org/Otherorgs.html

    • http://www.iohk.com/UserPages/mlau/aidsinfo.html

    • http://www.ahandyguide.com/cat1/a/a66.htm

  • Purer and better than root set


Distance to best resources

[Charts: link distance from the start set to the best resources, for cycling (a cooperative domain) and mutual funds (a competitive domain).]


Robustness of resource discovery

  • Sample disjoint sets of starting URLs

  • Two separate crawls

  • Find best authorities

  • Order by rank

  • Find overlap in the top-rated resources
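One plausible way to score this robustness check is the fraction of top-rated resources shared by the two crawls at a given rank cutoff; the URL names below are placeholders:

```python
def overlap_at_k(ranked_a, ranked_b, k):
    """Fraction of top-k URLs shared by two ranked crawl outputs."""
    return len(set(ranked_a[:k]) & set(ranked_b[:k])) / k

# Hypothetical ranked authority lists from two independent crawls.
crawl1 = ["u1", "u2", "u3", "u4"]
crawl2 = ["u2", "u1", "u5", "u4"]
assert overlap_at_k(crawl1, crawl2, 2) == 1.0  # both top-2 sets are {u1, u2}
```

A high overlap despite disjoint seed sets suggests the discovered resources are a property of the topic, not of the starting URLs.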


Related work

  • WebWatcher, HotList and ColdList

    • Filtering as post-processing, not acquisition

  • ReferralWeb

    • Social network on the Web

  • Ahoy!, Cora

    • Hand-crafted to find home pages and papers

  • WebCrawler, Fish, Shark, Fetuccino, agents

    • Crawler guided by query keyword matches


Comparison with agents

  • Agents usually look for keywords and hand-crafted patterns

    • Cannot learn new vocabulary dynamically

    • Do not use distance-2 centrality information

    • Client-side assistants

  • We use a taxonomy with statistical topic models

    • Models can evolve as the crawl proceeds

    • Combine relevance and centrality

    • Broader scope: inter-community linkage analysis and querying


Conclusion

  • New architecture for example-driven topic-specific web resource discovery

  • No dependence on full web crawl and index

  • Modest desktop hardware adequate

  • Variable radius goal-directed crawling

  • High harvest rate

  • High quality resources found far from keyword query response nodes

