  1. Adaptive Focused Crawling Dr. Alexandra Cristea a.i.cristea@warwick.ac.uk http://www.dcs.warwick.ac.uk/~acristea/

  2. 1. Contents • Introduction • Crawlers and the World Wide Web • Focused Crawling • Agent-Based Adaptive Focused Crawling • Machine Learning-Based Adaptive Focused Crawling • Evaluation Methodologies • Conclusions

  3. Motivation • Large amount of info on web • Standard crawler: traverses web downloading all • Focused, adaptive crawler: selects only related documents, ignores rest

  4. Introduction

  5. A focused crawler’s retrieval (figure)

  6. Adaptive Focused Crawler • Traditional non-adaptive focused crawlers: suitable for user communities with shared interests & goals that do not change over time. • Focused crawler + learning methods that adapt its behavior to the particular environment and its relationships with the given input parameters (e.g. the set of retrieved pages and the user-defined topic) => adaptive. • Adaptive FC: • for personalized search systems with info needs, user’s interests, goals, preferences, etc. • for single users rather than communities of people. • sensitive to potential alterations in the environment.

  7. Crawlers and the WWW

  8. Growth and size of the Web • Growth: • 2005: at least 11.5 billion pages • Doubling in less than 2 years • http://www.worldwidewebsize.com/ • Today (2008) indexable: 23 billion pages • Change: • 23% of pages change daily, 40% within a week • Challenge: keeping the search engine’s local copies fresh • Crawls: time consuming => tradeoffs needed • Alternatives: • Google Sitemaps: an XML file lists a web site’s pages and how often they change (push instead of pull) • distributed crawling • Truism: the Web is growing faster than search engines can index it

  9. Reaching the Web: Hypertextual Connectivity and Deep Web • Dark matter: info not accessible to search engines • Page sets: In, Out, SCC (Strongly Connected Component) • What happens if you crawl from Out?

  10. Deep Web • dynamic page generators • estimate (2001): public information on the deep Web is up to 550 times larger than the normally accessible Web • databases

  11. Crawling Strategies • Important pages first: ordering metrics: e.g., Breadth-First, Backlink, PageRank

  12. Backlink • # of pages linking ‘in’ • Based on citation counting in bibliographic research • Local minima issue • Based on this: PageRank, HITS
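
As a concrete illustration of the backlink metric, here is a short Python sketch that counts in-links over the portion of the link graph crawled so far (the dict-based link graph is an assumption made for the example, not part of the lecture):

```python
from collections import Counter

def backlink_counts(link_graph):
    """link_graph: dict url -> list of out-link urls (pages crawled so far).
    Returns the number of known in-links per URL; crawl higher counts first."""
    counts = Counter()
    for src, outs in link_graph.items():
        for dst in outs:
            counts[dst] += 1
    return counts

# Toy graph: 'c' has the most known in-links, so it would be crawled first.
graph = {"a": ["b", "c"], "b": ["c"], "d": ["c", "b"]}
print(backlink_counts(graph).most_common())  # [('c', 3), ('b', 2)]
```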

  13. Focused Crawling • Exploiting additional info on web pages, such as anchors or text surrounding the links, to skip some of the pages encountered

  14. Exploiting the Hypertextual Info • Links and (topical) locality are just as important as IR info

  15. Fish search • Input: user’s query + starting URLs (e.g., bookmarks) – kept in a priority list • The first URL in the list is downloaded and scored; a heuristic decides whether to continue in that direction; if not, its links are ignored • If yes, its links are scanned, each with a depth value (e.g., parent’s depth − 1); when depth reaches zero, the direction is dropped • A timeout or a maximum number of pages is also possible • Very heavy and demanding; places a big burden on the web
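
A minimal Python sketch of the fish-search loop, assuming requests and BeautifulSoup are available; the relevance() heuristic, the threshold and the depth handling are illustrative simplifications, not the original fish-search scoring:

```python
import heapq
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def relevance(text, query_terms):
    """Toy relevance heuristic (an assumption): fraction of query terms in the page text."""
    text = text.lower()
    return sum(t in text for t in query_terms) / max(len(query_terms), 1)

def fish_search(start_urls, query_terms, max_depth=3, max_pages=50, threshold=0.3):
    """Priority list of (negated score, remaining depth, url); best-scored URL first."""
    frontier = [(-1.0, max_depth, u) for u in start_urls]
    heapq.heapify(frontier)
    visited, results = set(), []
    while frontier and len(visited) < max_pages:
        neg_score, depth, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        score = relevance(soup.get_text(" "), query_terms)
        if score >= threshold:
            results.append((url, score))
            if depth > 0:  # follow links of relevant pages only, decrementing depth
                for a in soup.find_all("a", href=True):
                    heapq.heappush(frontier, (-score, depth - 1, urljoin(url, a["href"])))
    return results
```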

  16. Other focused crawlers • Taxonomy and distillation • Classifier: evaluates the relevance of hypertext docs with respect to the topic • Distiller: identifies nodes that act as access points to pages (via the HITS algorithm) • Tunneling • allows a limited number of ‘bad’ pages, to avoid losing info (pages on a close topic may not point to each other) • Contextual crawling • Context graph: each page is assigned a distance (minimum number of links to traverse from the initial set) • Naïve Bayes classifiers – category identification according to distance; predicting a generic document’s distance is possible • Problem: requires reverse-link info • Semantic Web • Ontologies • Improvements in performance

  17. Agent-based Adaptive Focused Crawling • Genetic-based • Ants

  18. Genetic-based crawling • GA: • approximate solutions to hard-to-solve combinatorial optimization problems • genetic operators: inheritance, mutation, crossover + population evolution • GA crawler agents (InfoSpiders http://www.informatics.indiana.edu/fil/IS/ ): • genotype (chromosome set) defining search behaviour: • trust in out-links; • query terms; • weights (uniform distribution initially; feed-forward NN info later + supervised/unsupervised backpropagation) • Energy = Benefit() – Cost() (Fitness)

  19. Genotype and NN (figure; NN output: relevant / irrelevant)

  20. Algorithm 1. Pseudo-code of the InfoSpiders algorithm
      {initialize each agent’s genotype, energy and starting page}
      PAGES ← maximum number of pages to visit
      while number of visited pages < PAGES do
          for each agent a do
              {pick and visit an out-link from the current agent’s page}
              {update the energy estimating benefit() − cost()}
              {update the genotype as a function of the current benefit}
              if agent’s energy > THRESHOLD then
                  {apply the genetic operators to produce offspring}
              else
                  {kill the agent}
              end if
          end for
      end while
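
A runnable Python sketch of this loop under simplifying assumptions: benefit() is approximated by query-term overlap, cost() by a constant per download, and reproduction copies the parent’s genotype with a small mutation; these stand-ins are illustrative, not the original InfoSpiders operators.

```python
import random
import re

def benefit(text, query_terms):
    # Illustrative stand-in for benefit(): query-term overlap with the page text
    words = set(re.findall(r"\w+", text.lower()))
    return len(words & set(query_terms)) / max(len(query_terms), 1)

COST = 0.05       # assumed constant download cost
THRESHOLD = 1.0   # assumed reproduction threshold

def infospiders(agents, fetch, query_terms, max_pages=100):
    """agents: list of dicts {'page': url, 'energy': float, 'trust': float (genotype)}.
    fetch(url) must return (text, out_links) and is supplied by the caller."""
    visited = 0
    while visited < max_pages and agents:
        for agent in list(agents):
            if visited >= max_pages:
                break
            text, links = fetch(agent["page"])
            visited += 1
            b = benefit(text, query_terms)
            agent["energy"] += b - COST                         # energy = benefit() - cost()
            agent["trust"] = min(1.0, agent["trust"] + 0.1 * b)  # crude genotype update
            if links:
                agent["page"] = random.choice(links)            # pick and visit an out-link
            if agent["energy"] > THRESHOLD:                     # reproduce: mutated copy, split energy
                agent["energy"] /= 2
                agents.append({"page": agent["page"], "energy": agent["energy"],
                               "trust": agent["trust"] + random.uniform(-0.05, 0.05)})
            elif agent["energy"] <= 0:                          # kill exhausted agents
                agents.remove(agent)
    return agents
```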

  21. Ant-based Crawling • Collective intelligence • Simple individual behaviour, complex results (shortest path) • Pheromone + trail

  22. Ant Crawling: preferred path

  23. Transition probabilities (p) according to pheromone trails (τ)

  24. Task accomplishing behaviors 1. at the end of the cycle, the agent updates the pheromone trails of the followed path and places itself at one of the start resources 2. if an ant trail exists, the agent decides to follow it with a probability that is a function of the respective pheromone intensity 3. if the agent does not have any available information, it moves randomly

  25. Transition probability Pij(t) = τij(t) / Σl τil(t), where the sum runs over the out-going links (i, l) of urli and τij(t) corresponds to the pheromone trail between urli and urlj
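
A small Python helper showing this computation; representing the pheromone store as a nested dict is an implementation assumption made for the example:

```python
def transition_probabilities(tau, i):
    """tau: dict mapping url_i -> {url_j: pheromone tau_ij(t)}.
    Returns P_ij(t) = tau_ij(t) / sum_l tau_il(t) for each out-link j of url_i."""
    out = tau.get(i, {})
    total = sum(out.values())
    if total == 0:  # no pheromone information yet: move uniformly at random
        return {j: 1.0 / len(out) for j in out} if out else {}
    return {j: t / total for j, t in out.items()}
```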

  26. Trail updating • p(k) is the ordered set of pages visited by the k-th ant • p(k)[i] is the i-th element of p(k) • score(p) returns, for each page p, the similarity measure with the current info needs, in [0, 1], where 1 is the highest similarity • M is the # of ants • ρ is the trail evaporation coefficient

  27. Algorithm 2. Pseudo-code of the Ant-based crawler
      {initialize each agent’s starting page}
      PAGES ← maximum number of pages to visit
      cycle ← 1
      t ← 1
      while number of visited pages < PAGES do
          for each agent a do
              for move = 0 to cycle do
                  {calculate the probabilities Pij(t) of the out-going links as in Eq.}
                  {select the next page to visit for the agent a}
              end for
          end for
          {update all the pheromone trails}
          {initialize each agent’s starting page}
          cycle ← cycle + 1
          t ← t + 1
      end while
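
A compact Python sketch of this cycle. The Ant System-style trail update used below (evaporation by ρ plus a score-based deposit along each ant’s path) is an assumption made for illustration, and fetch() / score() are placeholders supplied by the caller, not parts of the original formulation.

```python
import random

def ant_crawler(start_pages, fetch, score, n_ants=5, max_pages=200, rho=0.2):
    """fetch(url) -> list of out-links; score(url) -> similarity in [0, 1]."""
    tau = {}                                        # tau[i][j]: pheromone on link (i, j)
    visited, cycle = 0, 1
    while visited < max_pages:
        paths = []
        for _ in range(n_ants):
            page = random.choice(start_pages)       # place the ant on a start resource
            path = [page]
            for _ in range(cycle):                  # walks get longer as cycles go on
                links = fetch(page)
                visited += 1
                if not links:
                    break
                trails = tau.get(page, {})
                total = sum(trails.get(l, 0.0) for l in links)
                if total > 0:                       # follow trails with probability P_ij(t)
                    weights = [trails.get(l, 0.0) / total for l in links]
                    page = random.choices(links, weights=weights)[0]
                else:                               # no information available: move randomly
                    page = random.choice(links)
                path.append(page)
            paths.append(path)
        # Trail update (assumed form): evaporation, then a deposit proportional to page scores
        for i in tau:
            for j in tau[i]:
                tau[i][j] *= (1.0 - rho)
        for path in paths:
            for i, j in zip(path, path[1:]):
                tau.setdefault(i, {})[j] = tau.get(i, {}).get(j, 0.0) + score(j)
        cycle += 1
    return tau
```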

  28. Machine Learning-Based Adaptive Focused Crawling • Intelligent Crawling Statistical Model • Reinforcement Learning-Based Approaches

  29. Intelligent Crawling Statistical Model • statistically learn the characteristics of the Web’s linkage structure while performing the search • Unseen page: predicates (content of the pages linking in, tokens on the unseen page) • Evidence E is used to update the probability that a page is relevant to the user’s needs.

  30. Evidence-based update P(C|E) > P(C) Interest Ratio: I(C,E) = P(C|E) / P(C) = P(C ∩ E) / (P(C) · P(E)) e.g., E = 10% of the pages pointing in contain ‘Bach’ No initial collection needed: at the beginning, users specify their needs via predicates, e.g., the page content or the title must contain a given set of keywords.
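
A tiny numeric illustration of the interest ratio; the counts are made up for the example, not taken from the lecture:

```python
def interest_ratio(n_total, n_C, n_E, n_C_and_E):
    """I(C, E) = P(C|E) / P(C) = P(C ∩ E) / (P(C) * P(E)), estimated from counts."""
    p_C = n_C / n_total
    p_E = n_E / n_total
    p_C_and_E = n_C_and_E / n_total
    return p_C_and_E / (p_C * p_E)

# Hypothetical counts: 1000 crawled pages, 100 relevant (C), 50 whose in-links
# mention 'Bach' (E), 20 pages satisfying both.
print(interest_ratio(1000, 100, 50, 20))  # 4.0 -> the evidence makes relevance 4x more likely
```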

  31. Reinforcement Learning-Based Approaches • Traditional focused crawler + • apprentice: assigns priorities to unvisited URLs (based on DOM features) for the next steps of crawling • Naïve Bayes text classifiers score the text around links to prioritise the next crawling steps DOM = Document Object Model
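
A sketch of the link-prioritising idea using scikit-learn’s Naïve Bayes; the toolkit choice, the training data and the bag-of-words features are illustrative assumptions, not the setup of the original work:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative training data: anchor/context text of links that led to relevant (1)
# or irrelevant (0) pages in earlier crawls.
contexts = ["focused crawling survey", "adaptive web search agents",
            "buy cheap tickets now", "celebrity gossip photos"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(contexts)
apprentice = MultinomialNB().fit(X, labels)

def link_priority(link_context_text):
    """Priority of an unvisited URL = estimated probability that it leads to a relevant page."""
    x = vectorizer.transform([link_context_text])
    return apprentice.predict_proba(x)[0][1]

print(link_priority("a survey of focused crawlers"))
```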

  32. Evaluation Methodologies • For fixed Web corpus and standard crawl: • computation time to complete the crawl, or • the number of downloaded resources per time unit • For focused crawl: • We need correctly retrieved documents only, not all of them; so:

  33. Focused Crawling • Precision: Pr = found / (found + false alarm) • Recall: Rr = found / (found + miss)
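
A trivial helper making the two formulas concrete; the counts are whatever an evaluation run produces:

```python
def precision_recall(found, false_alarm, miss):
    """found: relevant pages retrieved; false_alarm: irrelevant pages retrieved;
    miss: relevant pages the crawler never retrieved."""
    precision = found / (found + false_alarm)
    recall = found / (found + miss)
    return precision, recall

print(precision_recall(found=80, false_alarm=20, miss=40))  # (0.8, 0.666...)
```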

  34. Precision / Recall

  35. Conclusions • Focused Crawling = interesting alternative to Web search • Adaptive Focused Crawlers: • Learning methods are able to adapt the system behaviour to a particular environment and input parameters during the search • Dark matter research, NLP, Semantic Web

  36. Any questions?
