  1. Adaptive Focused Crawling Dr. Alexandra Cristea a.i.cristea@warwick.ac.uk http://www.dcs.warwick.ac.uk/~acristea/

  2. 1. Contents • Introduction • Crawlers and the World Wide Web • Focused Crawling • Agent-Based Adaptive Focused Crawling • Machine Learning-Based Adaptive Focused Crawling • Evaluation Methodologies • Conclusions

  3. Motivation • Large amount of info on web • Standard crawler: traverses web downloading all • Focused, adaptive crawler: selects only related documents, ignores rest

  4. Introduction

  5. A focused crawler’s retrieval (figure)

  6. Adaptive Focused Crawler • Traditional non-adaptive focused crawlers: suitable for user communities with shared interests & goals that do not change over time. • Focused crawler + learning methods that adapt its behavior to the particular environment and its relationships with the given input parameters (e.g. the set of retrieved pages and the user-defined topic) => adaptive. • Adaptive FC: • for personalized search systems with info needs, user’s interests, goals, preferences, etc. • for single users rather than communities of people. • sensitive to potential alterations in the environment.

  7. Crawlers and the WWW

  8. Growth and size of the Web • Growth: • 2005: at least 11.5 billion pages • Doubling in less than 2 years • http://www.worldwidewebsize.com/ • Today (2008) indexable: 23 billion pages • Change: • 23% of pages change daily, 40% within a week • Challenge: keeping the search engine’s local copies fresh • Crawls: time consuming => tradeoffs needed • Alternatives: • Google Sitemaps: an XML file lists a web site’s pages and how often they change (push instead of pull) • distributed crawling • Truism: the Web is growing faster than search engines can index it

  9. Reaching the Web: Hypertextual Connectivity and Deep Web • Dark matter: info not accessible to search engines • Page sets: In, Out, SCC (Strongly Connected Component) • What happens if you crawl from Out?

  10. Deep Web • dynamic page generators • estimate (2001): public information on the deep Web is up to 550 times larger than the normally accessible Web • databases

  11. Crawling Strategies • Important pages first: ordering metrics: e.g., Breadth-First, Backlink, PageRank

  12. Backlink • # of pages linking ‘in’ • Based on citation counting in bibliographic research • Local minima issue • Based on this: PageRank, HITS
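
As a concrete illustration of the backlink metric, here is a short Python sketch that counts in-links over the portion of the link graph crawled so far (the dict-based link graph is an assumption made for the example, not part of the lecture):

```python
from collections import Counter

def backlink_counts(link_graph):
    """link_graph: dict url -> list of out-link urls (pages crawled so far).
    Returns the number of known in-links per URL; crawl higher counts first."""
    counts = Counter()
    for src, outs in link_graph.items():
        for dst in outs:
            counts[dst] += 1
    return counts

# Toy graph: 'c' has the most known in-links, so it would be crawled first.
graph = {"a": ["b", "c"], "b": ["c"], "d": ["c", "b"]}
print(backlink_counts(graph).most_common())  # [('c', 3), ('b', 2)]
```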

  13. Focused Crawling • Exploiting additional info on web pages, such as anchors or text surrounding the links, to skip some of the pages encountered

  14. Exploiting the Hypertextual Info • Links and (topical) locality are just as important as IR info

  15. Fish search • Input: user’s query + starting URLs (e.g., bookmarks) – kept in a priority list • The first URL in the list is downloaded and scored; a heuristic decides whether to continue in that direction; if not, its links are ignored • If yes, its links are scanned, each with a depth value (e.g., parent’s depth − 1); when depth reaches zero, the direction is dropped • A timeout or a maximum number of pages is also possible • Very heavy and demanding; places a big burden on the web
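
A minimal Python sketch of the fish-search loop, assuming requests and BeautifulSoup are available; the relevance() heuristic, the threshold and the depth handling are illustrative simplifications, not the original fish-search scoring:

```python
import heapq
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def relevance(text, query_terms):
    """Toy relevance heuristic (an assumption): fraction of query terms in the page text."""
    text = text.lower()
    return sum(t in text for t in query_terms) / max(len(query_terms), 1)

def fish_search(start_urls, query_terms, max_depth=3, max_pages=50, threshold=0.3):
    """Priority list of (negated score, remaining depth, url); best-scored URL first."""
    frontier = [(-1.0, max_depth, u) for u in start_urls]
    heapq.heapify(frontier)
    visited, results = set(), []
    while frontier and len(visited) < max_pages:
        neg_score, depth, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        score = relevance(soup.get_text(" "), query_terms)
        if score >= threshold:
            results.append((url, score))
            if depth > 0:  # follow links of relevant pages only, decrementing depth
                for a in soup.find_all("a", href=True):
                    heapq.heappush(frontier, (-score, depth - 1, urljoin(url, a["href"])))
    return results
```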

  16. Other focused crawlers • Taxonomy and distillation • Classifier: evaluates the relevance of hypertext docs with respect to the topic • Distiller: identifies nodes that act as access points to pages (via the HITS algorithm) • Tunneling • allows a limited number of ‘bad’ pages, to avoid losing info (pages on a close topic may not point to each other) • Contextual crawling • Context graph: each page is assigned a distance (minimum number of links to traverse from the initial set) • Naïve Bayes classifiers – category identification according to distance; predicting a generic document’s distance is possible • Problem: requires reverse-link info • Semantic Web • Ontologies • Improvements in performance

  17. Agent-based Adaptive Focused Crawling • Genetic-based • Ants

  18. Genetic-based crawling • GA: • approximate solutions to hard-to-solve combinatorial optimization problems • genetic operators: inheritance, mutation, crossover + population evolution • GA crawler agents (InfoSpiders http://www.informatics.indiana.edu/fil/IS/ ): • genotype (chromosome set) defining search behaviour: • trust in out-links; • query terms; • weights (uniform distribution initially; feed-forward NN info later + supervised/unsupervised backpropagation) • Energy = Benefit() – Cost() (Fitness)

  19. Genotype and NN (figure; NN output: relevant / irrelevant)

  20. Algorithm 1. Pseudo-code of the InfoSpiders algorithm
      {initialize each agent’s genotype, energy and starting page}
      PAGES ← maximum number of pages to visit
      while number of visited pages < PAGES do
          for each agent a do
              {pick and visit an out-link from the current agent’s page}
              {update the energy estimating benefit() − cost()}
              {update the genotype as a function of the current benefit}
              if agent’s energy > THRESHOLD then
                  {apply the genetic operators to produce offspring}
              else
                  {kill the agent}
              end if
          end for
      end while
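
A runnable Python sketch of this loop under simplifying assumptions: benefit() is approximated by query-term overlap, cost() by a constant per download, and reproduction copies the parent’s genotype with a small mutation; these stand-ins are illustrative, not the original InfoSpiders operators.

```python
import random
import re

def benefit(text, query_terms):
    # Illustrative stand-in for benefit(): query-term overlap with the page text
    words = set(re.findall(r"\w+", text.lower()))
    return len(words & set(query_terms)) / max(len(query_terms), 1)

COST = 0.05       # assumed constant download cost
THRESHOLD = 1.0   # assumed reproduction threshold

def infospiders(agents, fetch, query_terms, max_pages=100):
    """agents: list of dicts {'page': url, 'energy': float, 'trust': float (genotype)}.
    fetch(url) must return (text, out_links) and is supplied by the caller."""
    visited = 0
    while visited < max_pages and agents:
        for agent in list(agents):
            if visited >= max_pages:
                break
            text, links = fetch(agent["page"])
            visited += 1
            b = benefit(text, query_terms)
            agent["energy"] += b - COST                         # energy = benefit() - cost()
            agent["trust"] = min(1.0, agent["trust"] + 0.1 * b)  # crude genotype update
            if links:
                agent["page"] = random.choice(links)            # pick and visit an out-link
            if agent["energy"] > THRESHOLD:                     # reproduce: mutated copy, split energy
                agent["energy"] /= 2
                agents.append({"page": agent["page"], "energy": agent["energy"],
                               "trust": agent["trust"] + random.uniform(-0.05, 0.05)})
            elif agent["energy"] <= 0:                          # kill exhausted agents
                agents.remove(agent)
    return agents
```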

  21. Ant-based Crawling • Collective intelligence • Simple individual behaviour, complex results (shortest path) • Pheromone + trail

  22. Ant Crawling: preferred path

  23. Transition probabilities (p) according to pheromone trails (τ)

  24. Task accomplishing behaviors 1. at the end of the cycle, the agent updates the pheromone trails of the followed path and places itself at one of the start resources 2. if an ant trail exists, the agent decides to follow it with a probability that is a function of the respective pheromone intensity 3. if the agent does not have any available information, it moves randomly

  25. Transition probability Pij(t) = τij(t) / Σl τil(t), where the sum runs over the out-going links (i, l) of urli and τij(t) corresponds to the pheromone trail between urli and urlj
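
A small Python helper showing this computation; representing the pheromone store as a nested dict is an implementation assumption made for the example:

```python
def transition_probabilities(tau, i):
    """tau: dict mapping url_i -> {url_j: pheromone tau_ij(t)}.
    Returns P_ij(t) = tau_ij(t) / sum_l tau_il(t) for each out-link j of url_i."""
    out = tau.get(i, {})
    total = sum(out.values())
    if total == 0:  # no pheromone information yet: move uniformly at random
        return {j: 1.0 / len(out) for j in out} if out else {}
    return {j: t / total for j, t in out.items()}
```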

  26. Trail updating • p(k) is the ordered set of pages visited by the k-th ant • p(k)[i] is the i-th element of p(k) • score(p) returns, for each page p, the similarity measure with the current info needs, in [0, 1], where 1 is the highest similarity • M is the # of ants • ρ is the trail evaporation coefficient

  27. Algorithm 2. Pseudo-code of the Ant-based crawler
      {initialize each agent’s starting page}
      PAGES ← maximum number of pages to visit
      cycle ← 1
      t ← 1
      while number of visited pages < PAGES do
          for each agent a do
              for move = 0 to cycle do
                  {calculate the probabilities Pij(t) of the out-going links as in Eq.}
                  {select the next page to visit for the agent a}
              end for
          end for
          {update all the pheromone trails}
          {initialize each agent’s starting page}
          cycle ← cycle + 1
          t ← t + 1
      end while
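
A compact Python sketch of this cycle. The Ant System-style trail update used below (evaporation by ρ plus a score-based deposit along each ant’s path) is an assumption made for illustration, and fetch() / score() are placeholders supplied by the caller, not parts of the original formulation.

```python
import random

def ant_crawler(start_pages, fetch, score, n_ants=5, max_pages=200, rho=0.2):
    """fetch(url) -> list of out-links; score(url) -> similarity in [0, 1]."""
    tau = {}                                        # tau[i][j]: pheromone on link (i, j)
    visited, cycle = 0, 1
    while visited < max_pages:
        paths = []
        for _ in range(n_ants):
            page = random.choice(start_pages)       # place the ant on a start resource
            path = [page]
            for _ in range(cycle):                  # walks get longer as cycles go on
                links = fetch(page)
                visited += 1
                if not links:
                    break
                trails = tau.get(page, {})
                total = sum(trails.get(l, 0.0) for l in links)
                if total > 0:                       # follow trails with probability P_ij(t)
                    weights = [trails.get(l, 0.0) / total for l in links]
                    page = random.choices(links, weights=weights)[0]
                else:                               # no information available: move randomly
                    page = random.choice(links)
                path.append(page)
            paths.append(path)
        # Trail update (assumed form): evaporation, then a deposit proportional to page scores
        for i in tau:
            for j in tau[i]:
                tau[i][j] *= (1.0 - rho)
        for path in paths:
            for i, j in zip(path, path[1:]):
                tau.setdefault(i, {})[j] = tau.get(i, {}).get(j, 0.0) + score(j)
        cycle += 1
    return tau
```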

  28. Machine Learning-Based Adaptive Focused Crawling • Intelligent Crawling Statistical Model • Reinforcement Learning-Based Approaches

  29. Intelligent Crawling Statistical Model • statistically learn the characteristics of the Web’s linkage structure while performing the search • Unseen page: predicates (content of the pages linking in, tokens on the unseen page) • Evidence E is used to update the probability that a page is relevant to the user’s needs.

  30. Evidence-based update P(C|E) > P(C) Interest Ratio: I(C,E) = P(C|E) / P(C) = P(C ∩ E) / (P(C) · P(E)) e.g., E = 10% of the pages pointing in contain ‘Bach’ No initial collection needed: at the beginning, users specify their needs via predicates, e.g., the page content or the title must contain a given set of keywords.
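
A tiny numeric illustration of the interest ratio; the counts are made up for the example, not taken from the lecture:

```python
def interest_ratio(n_total, n_C, n_E, n_C_and_E):
    """I(C, E) = P(C|E) / P(C) = P(C ∩ E) / (P(C) * P(E)), estimated from counts."""
    p_C = n_C / n_total
    p_E = n_E / n_total
    p_C_and_E = n_C_and_E / n_total
    return p_C_and_E / (p_C * p_E)

# Hypothetical counts: 1000 crawled pages, 100 relevant (C), 50 whose in-links
# mention 'Bach' (E), 20 pages satisfying both.
print(interest_ratio(1000, 100, 50, 20))  # 4.0 -> the evidence makes relevance 4x more likely
```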

  31. Reinforcement Learning-Based Approaches • Traditional focused crawler + • apprentice: assigns priorities to unvisited URLs (based on DOM features) for the next steps of crawling • Naïve Bayes text classifiers score the text around links to prioritise the next crawling steps DOM = Document Object Model
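
A sketch of the link-prioritising idea using scikit-learn’s Naïve Bayes; the toolkit choice, the training data and the bag-of-words features are illustrative assumptions, not the setup of the original work:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative training data: anchor/context text of links that led to relevant (1)
# or irrelevant (0) pages in earlier crawls.
contexts = ["focused crawling survey", "adaptive web search agents",
            "buy cheap tickets now", "celebrity gossip photos"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(contexts)
apprentice = MultinomialNB().fit(X, labels)

def link_priority(link_context_text):
    """Priority of an unvisited URL = estimated probability that it leads to a relevant page."""
    x = vectorizer.transform([link_context_text])
    return apprentice.predict_proba(x)[0][1]

print(link_priority("a survey of focused crawlers"))
```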

  32. Evaluation Methodologies • For fixed Web corpus and standard crawl: • computation time to complete the crawl, or • the number of downloaded resources per time unit • For focused crawl: • We need correctly retrieved documents only, not all of them; so:

  33. Focused Crawling • Precision: Pr = found / (found + false alarm) • Recall: Rr = found / (found + miss)
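
A trivial helper making the two formulas concrete; the counts are whatever an evaluation run produces:

```python
def precision_recall(found, false_alarm, miss):
    """found: relevant pages retrieved; false_alarm: irrelevant pages retrieved;
    miss: relevant pages the crawler never retrieved."""
    precision = found / (found + false_alarm)
    recall = found / (found + miss)
    return precision, recall

print(precision_recall(found=80, false_alarm=20, miss=40))  # (0.8, 0.666...)
```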

  34. Precision / Recall

  35. Conclusions • Focused Crawling = interesting alternative to Web search • Adaptive Focused Crawlers: • Learning methods are able to adapt the system behaviour to a particular environment and input parameters during the search • Dark matter research, NLP, Semantic Web

  36. Any questions?
