FOCUSED CRAWLING

Context

World Wide Web growth.

Inktomi crawler:

Hundreds of Sun Sparc workstations;

Sun Sparc: 75GB RAM, 1TB disk;

Over 10M pages crawled.

Still only 30-40% Web crawled.

Long refreshes (weeks up to a month).

Low precision results for crafty queries.

Burden of indexing millions of pages.

Inefficient location of relevant topic-specific resources when using keyword queries.


Why Focused?

Better cover a single galaxy than the whole universe.

Work done on relatively narrow segment of Web.

Respectable coverage at rapid rate (due to segment-of-interest narrowness).

Small investment in hardware.

Low network resource usage.


Core Elements

Focused crawler = example-driven automatic porthole generator.

Guided by a classifier and a distiller.

Former recognizes relevance from examples embedded in topic taxonomy.

Latter identifies topical vantage points on Web.

Based on canonical topic taxonomy with examples.


Operation Synopsis

Taxonomy creation.

Example collection.

Taxonomy selection and refinement.

Interactive exploration.


Resource discovery.




Taxonomy Creation

Pre-training classifier with:

Canonical taxonomy,

Corresponding examples.


Example Collection

Collect URLs of interest (e.g., while browsing).

Import collected URLs.


Taxonomy Selection and Refinement

Propose most common classes where examples fit best.

Mark classes as GOOD.

Refine taxonomy, i.e.:

Refine categories and/or,

Move documents from one category to another.

Major changes require integration time:

A few hours for 260,000 Yahoo! documents.

Smaller changes (moving docs) are interactive.


Interactive Exploration

Propose URLs found in small neighbourhood of examples.

Examine and include some of these examples.



Integrate refinements into statistical class model (classifier-specific action).



Identify relevant hubs by running (intermittently and/or concurrently) a topic distillation algorithm.

Raise visit priorities of hubs and immediate neighbours.



Report most popular sites and resources.

Mark results as useful/useless.

Send feedback to classifier and distiller.


Some definitions...

G = directed hypertext graph.

C = tree-shaped hierarchical topic directory.

D(c) = examples referred to by topic node c ∈ C.

C* = subset of topics marked good and known as user's interest.


Good topic is not ancestor of another good topic.

p = web page; $R_{C^*}(p)$ = relevance of p with respect to C*, which must be furnished to the system.

$R_{\text{root}}(p) = 1$; $R_{c_0}(p) = \sum_i R_{c_i}(p)$, where $\{c_i\}$ are the children of $c_0$.


Crawler in terms of Graph

Start by visiting all pages ∈ D(C*).

Inspect V = set of visited pages.

Choose unvisited page from crawl frontier.

GOAL: visit as many relevant pages and as few irrelevant pages as possible, i.e.:

Find $V \supseteq D(C^*)$, with V reachable from D(C*), s.t. $\frac{1}{|V|}\sum_{v \in V} R(v) \to \max$.

Goal attainable due to citations.
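
The goal above can be illustrated with a small best-first crawl over a toy link graph. This is only a sketch under stated assumptions: the graph, the relevance values and the names `focused_crawl` and `mean_relevance` are hypothetical, and an unvisited page simply inherits the relevance of the first visited page that cites it.

```python
import heapq

def focused_crawl(graph, relevance, seeds, budget):
    """Best-first crawl of a toy link graph; returns pages in visit order."""
    visited = []
    seen = set(seeds)
    # heapq is a min-heap, so priorities are negated; seeds get top priority.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for out in graph.get(url, []):
            if out not in seen:
                seen.add(out)
                # Assumption: a cited page is as promising as the page citing it.
                heapq.heappush(frontier, (-relevance[url], out))
    return visited

def mean_relevance(visited, relevance):
    """Average R(v) over the visited set V, the quantity to maximize."""
    return sum(relevance[v] for v in visited) / len(visited)

# Hypothetical data: "a" is on topic, "b" is not.
graph = {"seed": ["a", "b"], "a": ["c"], "b": ["d"]}
relevance = {"seed": 1.0, "a": 0.9, "b": 0.1, "c": 0.8, "d": 0.05}
visited = focused_crawl(graph, relevance, ["seed"], budget=4)
print(visited, mean_relevance(visited, relevance))
```

With a budget of four fetches the page cited by the relevant page "a" is fetched before the page cited by the irrelevant page "b", which is the sense in which citations make the goal attainable.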




  • Definitions:
    • good(c) = c is marked as good.
    • For d=document:
      • P(root|d) = 1;
      • P(c|d) = P(parent(c)|d)*P(c|d,parent(c));
      • P(c|d,parent(c)) = P(c|parent(c)) * P(d|c) / ∑ P(ci|parent(c)) * P(d|ci), where ci ranges over c and its siblings;
      • P(d|c) depends on document generation model;
      • P(c|parent(c)) = prior distribution of documents.
  • Steps of the document generation model:
    • Pick leaf node c* using the probabilities defined above.
    • Class c* has a die with as many faces as there are unique tokens ∈ U.
    • Face t turns up with probability θ(c*,t).
    • Length n(d) is chosen arbitrarily by the generator.
    • Roll the die n(d) times, writing the token corresponding to each face.
    • If token t occurs n(d,t) times =>
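
A minimal sketch of how the definitions above could be evaluated, assuming the standard multinomial likelihood for the die model, i.e. P(d|c) proportional to the product of θ(c,t)^n(d,t) over tokens t. The taxonomy, priors and θ values below are hypothetical, and smoothing for unseen tokens is left out.

```python
import math
from collections import Counter

# Hypothetical two-level taxonomy: root -> {cycling, gardening}.
children = {"root": ["cycling", "gardening"]}
prior = {"cycling": 0.5, "gardening": 0.5}      # P(c | parent(c))
theta = {                                        # theta(c, t), one die per class
    "cycling":   {"bike": 0.6, "wheel": 0.3, "soil": 0.1},
    "gardening": {"bike": 0.1, "wheel": 0.1, "soil": 0.8},
}

def log_likelihood(tokens, c):
    """log P(d|c) up to the multinomial coefficient, which is equal for all classes."""
    counts = Counter(tokens)                     # n(d, t)
    return sum(n * math.log(theta[c][t]) for t, n in counts.items())

def posterior(tokens, node="root", p_node=1.0, out=None):
    """P(c|d) for every class, via P(c|d) = P(parent(c)|d) * P(c|d, parent(c))."""
    out = {node: p_node} if out is None else out
    kids = children.get(node, [])
    if not kids:
        return out
    scores = {c: math.log(prior[c]) + log_likelihood(tokens, c) for c in kids}
    top = max(scores.values())                   # subtract the max for numerical stability
    weights = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(weights.values())
    for c in kids:
        out[c] = p_node * weights[c] / total     # normalized over c and its siblings
        posterior(tokens, c, out[c], out)
    return out

p = posterior(["bike", "bike", "wheel"])
print(p)
```

The posteriors of the children of any node sum to the posterior of that node, so P(root|d) = 1 is preserved down the tree.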


Remarks on Classification

Documents seen as bag of words, without order information and inter-term correlation.

During crawling the task is the reverse of generation.

Two types of focus possible with classifier:

Hard focus:

Find c* with highest probability;

If ∃ ancestor of c* s.t. good(ancestor) => allow future visits of links ∈ d;

Else prune at d.

Soft focus:

Page relevance $R(d) = \sum_{c:\,\mathrm{good}(c)} P(c|d)$;

Assume priority of neighbour(d) = R(d);

If multiple paths lead to a page => take maximum of relevance;

When neighbour visited => update score.
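
The two rules above (pruning on the best leaf class, and prioritising by R(d)) can be compared on a toy taxonomy. This is a sketch only: the taxonomy, the function names and the posterior values are hypothetical, and the best class c* is assumed to count as its own ancestor.

```python
# Hypothetical taxonomy given as child -> parent; only "cycling" is marked good.
parent = {"recreation": "root", "cycling": "recreation", "gardening": "recreation",
          "business": "root", "mutual_funds": "business"}
good = {"cycling"}

def ancestors_or_self(c):
    """Yield c, its parent, and so on up to the root."""
    while c is not None:
        yield c
        c = parent.get(c)

def hard_focus_allows(leaf_posterior):
    """Pruning rule: follow the links of d only if the best leaf c* has a good ancestor."""
    best = max(leaf_posterior, key=leaf_posterior.get)
    return any(a in good for a in ancestors_or_self(best))

def soft_relevance(posterior):
    """R(d): sum of P(c|d) over the classes marked good."""
    return sum(p for c, p in posterior.items() if c in good)

def update_priority(priorities, url, r):
    """Several pages may cite the same URL: keep the maximum relevance seen so far."""
    priorities[url] = max(priorities.get(url, 0.0), r)

on_topic = {"cycling": 0.4, "gardening": 0.35, "mutual_funds": 0.25}
borderline = {"cycling": 0.3, "gardening": 0.45, "mutual_funds": 0.25}
print(hard_focus_allows(on_topic), hard_focus_allows(borderline))
print(soft_relevance(borderline))

priorities = {}
update_priority(priorities, "http://example.org/x", soft_relevance(borderline))
update_priority(priorities, "http://example.org/x", soft_relevance(on_topic))
```

The borderline page is pruned by the first rule, yet its out-links still receive priority 0.3 under the second, which is why the second rule is the more forgiving of the two.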



Goal: identify hubs.

Borrowed idea:

Each node v ∈ Web has two scores, a(v) and h(v) =>

$h(u) = \sum_{(u,v) \in E} a(v)$ (1)

$a(v) = \sum_{(u,v) \in E} h(u)$ (2)

E = edge set of the link graph (adjacency matrix)


Non-unit edge weight;

Forward and backward weight matrices: $E_F$ and $E_B$

$E_F[u,v] = R(v)$ prevents leakage of prestige from relevant hubs to irrelevant authorities;

$E_B[u,v] = R(u)$ prevents a relevant authority from reflecting prestige onto irrelevant hubs;

ρ = threshold for including relevant authorities into graph.


Construct edge set E, only for pages on different sites, with forward and backward edge weights.

Apply (1) and (2) always restricting authorities using ρ.
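
A rough sketch of the weighted iteration described above. How the weights enter equations (1) and (2) is an assumption here: the authority update is scaled by the forward weight R(v), the hub update by the backward weight R(u), and a page can act as an authority only if R(v) > ρ. The graph, relevance values and function names are hypothetical, and the edge list is assumed to contain only links between different sites.

```python
def normalize(scores):
    """Scale scores so that they sum to 1 (all zeros stay zero)."""
    total = sum(scores.values())
    return {n: (s / total if total else 0.0) for n, s in scores.items()}

def distill(edges, R, rho, iterations=20):
    """Return (hub, authority) scores for a relevance-weighted link graph."""
    nodes = {n for edge in edges for n in edge}
    h = {n: 1.0 for n in nodes}
    a = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Equation (2), weighted by the forward weight R(v) and restricted by rho.
        new_a = dict.fromkeys(nodes, 0.0)
        for u, v in edges:
            if R[v] > rho:
                new_a[v] += R[v] * h[u]
        # Equation (1), weighted by the backward weight R(u).
        new_h = dict.fromkeys(nodes, 0.0)
        for u, v in edges:
            new_h[u] += R[u] * new_a[v]
        a, h = normalize(new_a), normalize(new_h)
    return h, a

# Hypothetical cross-site links and relevance scores.
edges = [("hub1", "p1"), ("hub1", "p2"), ("hub1", "p3"),
         ("hub2", "p1"), ("junk", "spam")]
R = {"hub1": 0.9, "hub2": 0.8, "junk": 0.1,
     "p1": 0.9, "p2": 0.8, "p3": 0.7, "spam": 0.05}
h, a = distill(edges, R, rho=0.2)
print(max(h, key=h.get), max(a, key=a.get))
```

The page "spam" falls below ρ, so it gains no authority and passes nothing back to the page linking to it.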


Integration with the Crawler

One watchdog thread:

Inspect new work from crawl frontier (stored on disk);

Pass new work to worker threads (using shared memory buffers).

Many worker threads:

Save details of newly explored pages in per-worker disk structures;

Invoke classifier for each new page.

Stop workers, collect and integrate results into central pool (priority queue).

Soft crawling -> URLs ordered by:

(# page-fetches ascending, R descending)

Hard crawling -> surviving URLs ordered by:

# page-fetches ascending

Populate link graph.

Periodically stop crawler and execute distiller => revisit obtained hubs + visit unvisited pages pointed to by hubs.
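
The two URL orderings named above can be expressed as sort keys on a priority queue. A small sketch, assuming an in-memory heap in place of the on-disk frontier; the key functions and URLs are hypothetical.

```python
import heapq

def soft_key(num_fetches, relevance):
    """Soft crawling: (# page-fetches ascending, R descending)."""
    return (num_fetches, -relevance)

def hard_key(num_fetches):
    """Hard crawling: surviving URLs ordered by # page-fetches ascending."""
    return (num_fetches,)

# Hypothetical frontier entries: (url, page-fetches so far, relevance R).
entries = [("http://a.example/", 0, 0.2),
           ("http://b.example/", 0, 0.9),
           ("http://c.example/", 1, 0.95)]

frontier = []
for url, fetches, r in entries:
    heapq.heappush(frontier, (soft_key(fetches, r), url))

order = [heapq.heappop(frontier)[1] for _ in range(len(frontier))]
print(order)
```

Under the soft ordering a URL that has never been fetched always precedes one that has, and relevance only breaks ties among URLs with the same fetch count.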



Performance parameters:

Precision (relevance);

Quality of resource discovery.


Experimental setup;

Harvesting rate of relevant pages;

Acquisition robustness;

Resource discovery robustness;

Good resources remoteness;

Effect of distillation on crawling.


Experimental Setup

Crawler = C++ application.

Operating through firewall.

Crawler run with relatively few threads.

Up to 12 example web pages used per category.

6,000 URLs / hour returned.

20 topics (gardening, mutual funds, cycling, etc).


Harvesting Rate of Relevant Pages

Goal: high relevant-page acquisition rate.

Low harvest rate -> time spent merely on eliminating irrelevant pages => better use ordinary crawl instead.

3 crawls done:

Same sample set, containing a few dozen relevant URLs.


All out-links registered for exploration;

No use of R, except measurement => little slow down.


Probably more robust than hard crawling, BUT needs more skill against unwanted topic diffusion.

Problem: distinguishing between a noisy and a systematic drop in relevance.
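
One way to track the harvest rate discussed on this slide is a moving average of R over the most recently fetched pages. The sketch below assumes that definition; the window size and the sample relevance values are hypothetical.

```python
def harvest_rate(relevances, window=100):
    """Moving average of R over the last `window` fetched pages."""
    rates = []
    running = 0.0
    for i, r in enumerate(relevances):
        running += r
        if i >= window:
            running -= relevances[i - window]    # drop the page leaving the window
        rates.append(running / min(i + 1, window))
    return rates

# Hypothetical relevance scores in fetch order.
rates = harvest_rate([1.0, 0.0, 1.0, 1.0], window=2)
print(rates)
```

A short window reacts quickly but is noisy; a longer one smooths the curve, which may help separate a noisy dip from a systematic drop in relevance.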



Acquisition Robustness
  • Goal: maintain proper acquisition rate without being too sensitive to the start set.
  • Tests:
    • 2 disjoint subsets, each 30% of the starting URLs, randomly chosen.
    • For each subset launch a focused crawler.
    • Goal attainment checked by measuring URL overlap.
    • Generous visits to new IP-addresses and also normal increase in overlapping IP-addresses.
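
The URL-overlap measurement in the test above could be computed along these lines. A minimal sketch, assuming each crawl is available as a list of URLs in fetch order; the function name and the sample URLs are hypothetical.

```python
def overlap_curve(crawl_a, crawl_b):
    """Number of URLs fetched by both crawls after each has fetched n pages."""
    seen_a, seen_b, curve = set(), set(), []
    for url_a, url_b in zip(crawl_a, crawl_b):
        seen_a.add(url_a)
        seen_b.add(url_b)
        curve.append(len(seen_a & seen_b))
    return curve

# Hypothetical fetch logs of two crawls started from disjoint seed sets.
crawl_a = ["u1", "u2", "u3", "u4"]
crawl_b = ["u5", "u1", "u3", "u2"]
curve = overlap_curve(crawl_a, crawl_b)
print(curve)
```

A curve that keeps rising suggests that both crawls converge on the same region of the Web despite their different starting points. The same function applies to IP addresses or server names in place of URLs.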


Resource Discovery Robustness

2 sets of crawlers launched from different random samples.

Popularity/quality algorithm run with 50 iterations.

Server overlap measured.

Result: most popular sites identified by both sets of crawlers although different sample sets were used.


Good Resources Remoteness

Any real exploration done?

Non-trivial work done by focused crawler, i.e., pursuing certain paths while pruning others.

Large # of servers found at 10 links away and beyond from starting set.

Millions of pages within 10 links distance.


Effect of Distillation on Crawling

Relevant page may be abandoned due to misclassification (e.g., page has many images, or classifier mistakes).

Distiller reveals top hubs => new unvisited URLs.




Steady collection of relevant resources;

Robustness to different starting conditions;

Localization of good resources;

Immunity to noise;

Learning specialization from examples;

Filtering done at data-acquisition level rather than as post-processing;

Crawling done to greater depths due to frontier crawling;

Still to go:

At what specificity can focused crawl be sustained? I.e., how do harvest rates depend on topics?

Sociology of citations between topics => insights on how Web evolves.