FICA (Fast Intelligent Crawling Algorithm) enhances web crawling efficiency by focusing on high-value pages. Unlike traditional search engine crawlers that index the entire web, FICA prioritizes pages using a distance-based model and reinforcement learning, improving relevance in search results. Its Breadth-First Crawling approach ensures systematic exploration while minimizing computational requirements. FICA demonstrates superior performance in discovering important pages with a time complexity of O(E log V) and operates adaptively, making it a promising solution for modern web crawling challenges.
Web Information Retrieval (Web IR) • Handout #11: FICA: A Fast Intelligent Crawling Algorithm • Ali Mohammad Zareh Bidoki, ECE Department, Yazd University, alizareh@yaduni.ac.ir
Web Crawling • Search engines do not index the entire Web • Therefore, a crawler must focus on the most valuable and appealing pages • To do this, a better crawling criterion is required: FICA
Breadth-First Crawling [Figure: example web graph over pages p, q, r, s, t, u, v, w, x, y, z] • BFS advantages: why is it an acceptable algorithm?
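The breadth-first strategy above can be sketched as a FIFO-queue crawl. This is a minimal illustration, not the slides' implementation; `get_links` is a hypothetical fetch function supplied by the caller.

```python
from collections import deque

def bfs_crawl(seeds, get_links, max_pages=1000):
    """Visit pages in breadth-first order starting from seed URLs.

    get_links(url) is a placeholder (an assumption, not from the
    slides) that returns the outgoing links of a fetched page.
    """
    visited = set(seeds)
    frontier = deque(seeds)
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()  # FIFO queue gives breadth-first order
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return order
```

Because the frontier is a plain FIFO queue, pages are downloaded level by level outward from the seeds, which is what makes BFS a simple but acceptable baseline crawler.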
Logarithmic Distance Crawling • When page i points to page j, the edge i → j has weight log O(i), where O(i) is the out-degree of i [Figure: example graph rooted at p, which has out-degree 4] • d_pt = log 4 ≈ 0.60 • d_pv = log 4 + log 3 ≈ 1.07 • d_pz = log 4 + log 2 ≈ 0.90
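A sketch of how these distances can be computed, assuming (as the slide's numbers suggest) that an edge i → j weighs log₁₀ of i's out-degree and that the crawler wants the shortest log-distance from the seed to every page. Running Dijkstra's algorithm over these weights matches the O(E log V) complexity stated later; the graph encoding here is illustrative.

```python
import heapq
import math

def log_distances(graph, seed):
    """Shortest logarithmic distance from the seed to every page.

    Edge u -> v is weighted log10(out-degree of u), matching the
    slide's example (d_pv = log 4 + log 3 ~ 1.07). Dijkstra's
    algorithm over these weights runs in O(E log V).
    """
    dist = {seed: 0.0}
    heap = [(0.0, seed)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        out = graph.get(u, [])
        w = math.log10(len(out)) if out else 0.0
        for v in out:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

On the slide's example graph (p → {q, r, s, t}, q → {u, v, w}, t → {y, z}) this reproduces d_pt ≈ 0.60, d_pv ≈ 1.07, and d_pz ≈ 0.90.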
FICA • Intelligent surfer model • Based on reinforcement learning
FICA (On-line) [Figure: crawler pipeline — seeds feed the FICA scheduler, which maintains a priority queue of URLs (URL1, URL2, …); the downloader fetches pages from the Web, stores text and metadata in the repository, and feeds extracted URLs back to the scheduler] • Distance is used as the priority value
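The scheduler's priority queue can be sketched as follows; this is a minimal illustration of "distance as priority value," with illustrative names, assuming lower logarithmic distance means higher crawl priority.

```python
import heapq

class FicaScheduler:
    """Minimal sketch of a distance-prioritized URL frontier.

    Lower distance = crawled sooner; class and method names are
    illustrative, not from the slides.
    """
    def __init__(self, seeds):
        # Seeds start at distance 0, so they are crawled first.
        self.heap = [(0.0, url) for url in sorted(seeds)]
        heapq.heapify(self.heap)
        self.seen = set(seeds)

    def push(self, url, distance):
        """Add a newly discovered URL with its log-distance priority."""
        if url not in self.seen:
            self.seen.add(url)
            heapq.heappush(self.heap, (distance, url))

    def pop(self):
        """Return the not-yet-crawled URL with the smallest distance."""
        _, url = heapq.heappop(self.heap)
        return url
```

The downloader repeatedly calls `pop()`, fetches the page, and calls `push()` for each extracted link, so the crawl stays focused on low-distance (high-value) pages.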
Comparison with Others [Figure: baseline crawler pipeline — seeds feed a partial ranking algorithm that orders URLs (URL1, URL2, …) for the downloader, which stores pages and links in the repository]
Experimental Results • The experiment was done on the UK web graph, comprising 18 million web pages • We chose PageRank as the ideal ranking mechanism
FICA Properties • Its time complexity is O(E log V) • Complexity of Partial PageRank is • FICA outperforms the others in discovering highly important pages • It requires little memory for computation • It is online and adaptive
FICA as a Ranking Algorithm • We used Kendall's tau metric to measure the correlation between two rank lists • The ideal ranking is PageRank
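Kendall's tau compares two rankings of the same items by counting concordant and discordant pairs; a minimal sketch (the dictionaries map each item to its rank position, an encoding chosen here for illustration):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same items.

    rank_a and rank_b map item -> rank position. Returns 1.0 for
    identical orderings and -1.0 for fully reversed ones.
    """
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        a = rank_a[x] - rank_a[y]
        b = rank_b[x] - rank_b[y]
        if a * b > 0:
            concordant += 1   # pair ordered the same way in both lists
        elif a * b < 0:
            discordant += 1   # pair ordered oppositely
    total = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / total
```

Applied to the experiment above, the two rank lists would be FICA's crawl order and the ideal PageRank ordering of the same pages.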