This presentation discusses the integration of MapReduce into web archiving systems using the SILOs architecture. It addresses the lifespan of web pages, the limitations of existing crawling methods, and how MapReduce can enhance web analysis. It covers the Hadoop framework, its components, and practical implementations such as distributed crawling, keyword extraction, and inverted indexing. Experiments illustrate the efficacy of the proposed methods, paving the way for improved search response quality and page ranking.
SILOs - Distributed Web Archiving & Analysis using Map Reduce
Anushree Venkatesh • Sagar Mehta • Sushma Rao
AGENDA • Motivation • What is Map-Reduce? • Why Map-Reduce? • The HADOOP Framework • Map Reduce in SILOs • SILOs Architecture • Modules • Experiments
Motivation • Life span of a web page: 44 to 75 days • Limitations of centralized/distributed crawling • Exploring Map-Reduce • Analysis of a subset of the web: web graph, search response quality, tweaked PageRank, inverted index
Why Map-Reduce? • Divide and conquer • Functional-programming counterparts carry over to distributed data processing • The framework handles the plumbing behind the scenes, letting us focus on the problem • Map: divides the key space • Reduce: combines partial results • Supports pipelining of jobs (a local analogy is sketched below)
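To make the functional-programming analogy concrete, here is a minimal single-machine sketch (not part of the original slides) that mirrors the map and reduce phases with Java streams; the class name, the toy "pages", and the word-count task are illustrative assumptions, not SILOs code.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LocalMapReduceAnalogy {
    public static void main(String[] args) {
        // Two toy "pages" stand in for crawled web content.
        List<String> pages = List.of("hadoop map reduce", "map reduce web archive");

        // Map-phase analogue: split each page into tokens (one record per token).
        // Shuffle + reduce analogue: group identical tokens and count them.
        Map<String, Long> wordCounts = pages.stream()
                .flatMap(page -> Arrays.stream(page.split("\\s+")))
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

        // Prints each word with its count, e.g. map=2, reduce=2 (order unspecified).
        System.out.println(wordCounts);
    }
}

In the distributed setting, Hadoop performs the same grouping of intermediate keys across machines and re-runs any piece that fails.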
Hadoop Framework • Open-source implementation of Map-Reduce in Java • HDFS: the Hadoop Distributed File System • Takes care of fault tolerance and dependencies between nodes • Setup through a VM instance caused problems
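The "plumbing" Hadoop hides is mostly wiring a job together. Below is a hedged sketch of a job driver using the org.apache.hadoop.mapreduce API; the driver class name is invented, the mapper/reducer names refer to the keyword-extraction classes sketched on a later slide, and the input is assumed to be tab-separated <url, page_content> records staged in HDFS. This is not the exact SILOs code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SilosJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "silos-keyword-extraction");
        job.setJarByClass(SilosJobDriver.class);

        // Mapper and reducer are sketched on the keyword-extraction slide (illustrative names).
        job.setMapperClass(KeywordExtraction.KeywordMapper.class);
        job.setReducerClass(KeywordExtraction.InvertedIndexReducer.class);

        // Input records are <url \t page_content> lines already copied into HDFS.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Once the driver is submitted, HDFS handles data placement and the framework handles task scheduling, retries, and the shuffle between map and reduce.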
SETUP • Currently a single-node cluster • HDFS setup • Incorporation of Berkeley DB
SILOs Architecture (diagram rendered as text)
Components: Seed List, Distributed Crawler, URL Extractor, Graph Builder, Key Word Extractor, Back Links Mapper, Diff Compression
Tables: URL Table, Page Content Table, Adjacency List Table, Inverted Index Table, Back Links Table
Map-Reduce stages shown in the diagram:
• Distributed Crawler: Map <URL, value> → Reduce <URL, page content>, with Diff Compression into the Page Content Table
• URL Extractor: Map parses pages for URLs → Reduce <URL, 1>, removing duplicates; the Graph Builder feeds the Adjacency List Table
• Key Word Extractor: Map parses pages for keywords → Reduce <keyword, URL> into the Inverted Index Table
• Back Links Mapper, over <URL, parent URL> pairs: Map <parent, URL> → Reduce <URL, parent> into the Back Links Table
Distributed Crawler (pseudocode)

Map (input <url, 1>):
  if (!duplicate(url)) {
      page_content = http_get(url);
      insert into url_table <hash(url), url, hash(page_content), time_stamp>;
      emit intermediate pair <url, page_content>;
  } else if (duplicate(url) && (current_time - time_stamp(url) > threshold)) {
      page_content = http_get(url);
      update url_table(hash(url), current_time);
      emit intermediate pair <url, page_content>;
  } else {
      update url_table(hash(url), current_time);
  }

Reduce (input <url, page_content>):
  if (!exists hash(url) in page_content_table) {
      insert into page_content_table <hash(page_content), compress(page_content)>;
  } else if (hash(page_content_table(hash(url))) != hash(current_page_content)) {
      insert into page_content_table <hash(page_content), compress(diff_with_latest(page_content))>;
  }
Distributed Crawler • Currently runs outside of Map-Reduce • Files are transferred to HDFS manually • Currently depth-first search; will be modified to breadth-first search (a sketch of moving the fetch step into a map task follows)
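As a sketch of how the fetch step could eventually run inside Map-Reduce rather than outside it, the following hypothetical Hadoop mapper reads one URL per input line and emits <url, page_content> pairs. It is an assumption-laden illustration, not the current SILOs crawler: duplicate detection and the re-crawl threshold check against the URL table (see the crawler pseudocode slide) are deliberately omitted.

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String url = line.toString().trim();
        if (url.isEmpty()) {
            return;
        }
        try {
            // Fetch the page; a real crawler would also deduplicate URLs and
            // apply the time-stamp threshold before re-fetching.
            URLConnection connection = new URL(url).openConnection();
            connection.setConnectTimeout(5_000);
            connection.setReadTimeout(5_000);
            try (InputStream in = connection.getInputStream()) {
                String pageContent = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                context.write(new Text(url), new Text(pageContent));
            }
        } catch (IOException unreachable) {
            // Skip URLs that cannot be fetched; one bad page should not fail the job.
        }
    }
}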
EXPERIMENT: Keyword Extraction

Map (input <url, page_content>):
  keywords = parse(page_content);
  for each keyword, emit intermediate pair <keyword, url>;

Reduce:
  combine all <keyword, url> pairs that share a keyword and emit <keyword, list<urls>>;
  insert <keyword, list<urls>> into the inverted index table;

A Java sketch of this job follows.
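The sketch below is a hedged Java rendering of this experiment, not the exact SILOs code. It assumes the input is staged as <url \t page_content> records (hence KeyValueTextInputFormat in the driver sketched earlier), and a crude whitespace/punctuation split stands in for the HTML Parser used in SILOs; the class names are illustrative.

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KeywordExtraction {

    // Map: input <url, page_content>; emit one <keyword, url> pair per keyword.
    public static class KeywordMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text url, Text pageContent, Context context)
                throws IOException, InterruptedException {
            // Crude tokenization; SILOs parses the HTML instead.
            for (String token : pageContent.toString().toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    context.write(new Text(token), url);
                }
            }
        }
    }

    // Reduce: combine all <keyword, url> pairs sharing a keyword into <keyword, list of urls>.
    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text keyword, Iterable<Text> urls, Context context)
                throws IOException, InterruptedException {
            Set<String> uniqueUrls = new LinkedHashSet<>();
            for (Text url : urls) {
                uniqueUrls.add(url.toString());
            }
            // The SILOs pipeline would insert this row into the inverted index table.
            context.write(keyword, new Text(String.join(",", uniqueUrls)));
        }
    }
}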
References • HTML Parser • Hadoop Framework (Apache) • Peer Crawl