
FICA: A Fast and Intelligent Crawling Algorithm for Efficient Web Information Retrieval

FICA (Fast Intelligent Crawling Algorithm) improves crawling efficiency by downloading high-value pages first. Because search engines do not index the entire Web, FICA prioritizes pages using a logarithmic distance measure and an intelligent surfer model based on reinforcement learning. The slides compare it with breadth-first crawling and with partial ranking algorithms, report a time complexity of O(E log V) and low memory use, and describe the algorithm as online and adaptive.



Presentation Transcript


  1. Web Information Retrieval (Web IR) • Handout #11: FICA: A Fast Intelligent Crawling Algorithm • Ali Mohammad Zareh Bidoki, ECE Department, Yazd University • alizareh@yaduni.ac.ir

  2. Web Crawling • Search engines do not index the entire Web • Therefore, we have to focus on the most valuable and appealing pages • To do this, a better crawling criterion is required • FICA

  3. Breadth-First Crawling • [Figure: example web graph with nodes p, q, r, s, t, u, v, w, x, y, z] • BFS advantages • Why is it an acceptable algorithm?

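  4. Logarithmic Distance Crawling • When i points to j, then: [formula not preserved in the transcript] • [Figure: example web graph with nodes p, q, r, s, t, u, v, w, x, y, z] • d_pt = log 4 ≈ 0.60 • d_pz = log 4 + log 2 ≈ 0.90 • d_pv = log 4 + log 3 ≈ 1.08 • (The values correspond to base-10 logarithms.)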
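
The worked values on slide 4 are consistent with giving every link i → j a weight of log10 of i's out-degree and summing those weights along a path. That weighting rule is inferred from the numbers rather than quoted from the slide. The sketch below, in Python, reproduces the three values under that assumption. The adjacency list `graph` and the helper `path_distance` are hypothetical; the slide's actual edges are not preserved, so the graph only matches the out-degrees implied by the arithmetic (4 for p, 3 and 2 for two of its children).

```python
import math

# Hypothetical adjacency list: only the out-degrees along each path matter here.
graph = {
    "p": ["q", "r", "s", "t"],  # out-degree 4
    "q": ["u", "v", "w"],       # out-degree 3
    "r": ["x"],                 # out-degree 1
    "s": ["y", "z"],            # out-degree 2
}


def path_distance(graph, path):
    """Sum log10(out-degree of the source) over every link on the path.

    Assumption: this is the link weight implied by the slide's examples.
    """
    total = 0.0
    for src, dst in zip(path, path[1:]):
        if dst not in graph.get(src, []):
            raise ValueError(f"no link {src} -> {dst}")
        total += math.log10(len(graph[src]))
    return total


d_pt = path_distance(graph, ["p", "t"])       # log 4
d_pz = path_distance(graph, ["p", "s", "z"])  # log 4 + log 2
d_pv = path_distance(graph, ["p", "q", "v"])  # log 4 + log 3
```

Under this reading, pages reached through nodes with few outgoing links get a smaller distance, so z would be preferred over v even though both are two clicks from p.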

  5. FICA • Intelligent surfer model • It is based on reinforcement learning

  6. FICA (On-line) • Distance is used as the priority value in the priority queue • [Diagram components: Web, Downloader, Repository (text and metadata), FICA scheduler with priority queue (URL1, URL2, …), Seeds; flows labeled "Web pages" and "URLs"]
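
A minimal sketch of the scheduling loop on slide 6, assuming the priority of a URL is its accumulated logarithmic distance from the seeds and that each link adds log10 of the source page's out-degree. It deliberately omits the reinforcement-learning update, which the slides do not spell out, so it should be read as an illustration of distance-ordered crawling rather than as FICA itself. The names `crawl_by_distance`, `fetch_outlinks`, `budget`, and `toy_graph` are hypothetical; `fetch_outlinks` stands in for the downloader.

```python
import heapq
import math


def crawl_by_distance(seeds, fetch_outlinks, budget):
    """Download up to `budget` pages, always taking the smallest distance next."""
    best = {url: 0.0 for url in seeds}       # best known distance per URL
    frontier = [(0.0, url) for url in seeds]  # priority queue of (distance, url)
    heapq.heapify(frontier)
    done = set()
    downloaded = []
    while frontier and len(downloaded) < budget:
        dist, url = heapq.heappop(frontier)
        if url in done:
            continue  # stale queue entry for an already downloaded page
        done.add(url)
        downloaded.append(url)
        links = fetch_outlinks(url)  # stands in for the downloader
        if not links:
            continue
        step = math.log10(len(links))  # assumed link weight
        for target in links:
            candidate = dist + step
            if target not in done and candidate < best.get(target, math.inf):
                best[target] = candidate
                heapq.heappush(frontier, (candidate, target))
    return downloaded


# Hypothetical toy graph used in place of real downloads.
toy_graph = {
    "p": ["q", "r", "s", "t"],
    "q": ["u", "v", "w"],
    "r": ["x"],
    "s": ["y", "z"],
}
order = crawl_by_distance(["p"], lambda url: toy_graph.get(url, []), budget=20)
```

A binary heap keeps every push and pop logarithmic in the queue size, which is the usual way a distance-ordered traversal stays within a bound of the form O(E log V).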

  7. Comparison with Others • [Diagram components: Web, Downloader, Repository, Partial Ranking Algorithm, URL queue (URL1, URL2, …), Seeds; flow labeled "URLs and Links"]

  8. Experimental Results • The experiment was run on a UK web graph of 18 million web pages • We chose PageRank as the ideal (reference) ranking mechanism

  9. FICA Properties • Its time complexity is O(E log V) • Complexity of Partial PageRank is [formula not preserved in the transcript] • FICA outperforms the others in discovering highly important pages • It requires little memory for computation • It is online and adaptive

  10. FICA as a Ranking Algorithm • We used Kendall's metric to measure the correlation between two rank lists • PageRank is the ideal (reference) ranking
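
To make the rank-list comparison on slide 10 concrete, here is a small sketch of Kendall's tau for two orderings of the same pages with no ties. The slides do not say which variant of Kendall's metric was used, so this plain pairwise form (tau-a) is an assumption, and the function name `kendall_tau` and the sample lists are hypothetical.

```python
from itertools import combinations


def kendall_tau(order_a, order_b):
    """Kendall's tau for two tie-free rankings of the same items.

    Returns 1.0 for identical orderings and -1.0 for exactly reversed ones.
    """
    if len(order_a) != len(order_b) or len(set(order_a)) != len(order_a):
        raise ValueError("rankings must have equal length and no duplicates")
    if set(order_a) != set(order_b):
        raise ValueError("rankings must contain the same items")
    n = len(order_a)
    if n < 2:
        raise ValueError("need at least two items")
    pos_a = {item: i for i, item in enumerate(order_a)}
    pos_b = {item: i for i, item in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        # A pair is concordant when both rankings order it the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


reference = ["a", "b", "c", "d"]   # stand-in for the reference ranking
candidate = ["a", "c", "b", "d"]   # stand-in for a crawler-derived ranking
tau = kendall_tau(reference, candidate)
```

With the sample lists, one of the six pairs is swapped, so five pairs agree and one disagrees, giving tau = (5 - 1) / 6. The quadratic pair loop is fine for illustration but would need a faster method for a graph of millions of pages.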

  11. Dynamic Version of FICA
