This presentation discusses the integration of MapReduce into web archiving systems using the SILOs architecture. It addresses the lifespan of web pages, the limitations of existing crawling methods, and how MapReduce can enhance web analysis. It covers the Hadoop framework, its components, and practical implementations such as distributed crawling, keyword extraction, and inverted indexing. Experiments illustrate the efficacy of the proposed methods, paving the way for improved search response quality and page ranking.
SILOs - Distributed Web Archiving & Analysis using Map Reduce
Anushree Venkatesh • Sagar Mehta • Sushma Rao
AGENDA • Motivation • What is Map-Reduce? • Why Map-Reduce? • The HADOOP Framework • Map Reduce in SILOs • SILOs Architecture • Modules • Experiments
Motivation • Life span of a web page: 44 to 75 days • Limitations of centralized/distributed crawling • Exploring Map-Reduce • Analysis of a subset of the web: web graph, search response quality, tweaked PageRank, inverted index
Why Map-Reduce? • Divide and conquer • Functional-programming counterparts carry over to distributed data processing • The framework handles the plumbing behind the scenes, letting us focus on the problem • Map: divides the key space • Reduce: combines partial results • Supports pipelining of jobs (a local analogy is sketched below)
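To make the functional-programming analogy concrete, here is a minimal single-machine sketch (not part of the original slides) that mirrors the map and reduce phases with Java streams; the class name, the toy "pages", and the word-count task are illustrative assumptions, not SILOs code.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LocalMapReduceAnalogy {
    public static void main(String[] args) {
        // Two toy "pages" stand in for crawled web content.
        List<String> pages = List.of("hadoop map reduce", "map reduce web archive");

        // Map-phase analogue: split each page into tokens (one record per token).
        // Shuffle + reduce analogue: group identical tokens and count them.
        Map<String, Long> wordCounts = pages.stream()
                .flatMap(page -> Arrays.stream(page.split("\\s+")))
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

        // Prints each word with its count, e.g. map=2, reduce=2 (order unspecified).
        System.out.println(wordCounts);
    }
}

In the distributed setting, Hadoop performs the same grouping of intermediate keys across machines and re-runs any piece that fails.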
Hadoop Framework • Open-source implementation of Map-Reduce in Java • HDFS: the Hadoop Distributed File System • Takes care of fault tolerance and dependencies between nodes • Setup through a VM instance caused problems
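The "plumbing" Hadoop hides is mostly wiring a job together. Below is a hedged sketch of a job driver using the org.apache.hadoop.mapreduce API; the driver class name is invented, the mapper/reducer names refer to the keyword-extraction classes sketched on a later slide, and the input is assumed to be tab-separated <url, page_content> records staged in HDFS. This is not the exact SILOs code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SilosJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "silos-keyword-extraction");
        job.setJarByClass(SilosJobDriver.class);

        // Mapper and reducer are sketched on the keyword-extraction slide (illustrative names).
        job.setMapperClass(KeywordExtraction.KeywordMapper.class);
        job.setReducerClass(KeywordExtraction.InvertedIndexReducer.class);

        // Input records are <url \t page_content> lines already copied into HDFS.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Once the driver is submitted, HDFS handles data placement and the framework handles task scheduling, retries, and the shuffle between map and reduce.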
SETUP • Currently a single-node cluster • HDFS setup • Incorporation of Berkeley DB
SILOs Architecture (diagram rendered as text)
Components: Seed List, Distributed Crawler, URL Extractor, Graph Builder, Key Word Extractor, Back Links Mapper, Diff Compression
Tables: URL Table, Page Content Table, Adjacency List Table, Inverted Index Table, Back Links Table
Map-Reduce stages shown in the diagram:
• Distributed Crawler: Map <URL, value> → Reduce <URL, page content>, with Diff Compression into the Page Content Table
• URL Extractor: Map parses pages for URLs → Reduce <URL, 1>, removing duplicates; the Graph Builder feeds the Adjacency List Table
• Key Word Extractor: Map parses pages for keywords → Reduce <keyword, URL> into the Inverted Index Table
• Back Links Mapper, over <URL, parent URL> pairs: Map <parent, URL> → Reduce <URL, parent> into the Back Links Table
Distributed Crawler (pseudocode)

Map (input <url, 1>):
  if (!duplicate(url)) {
      page_content = http_get(url);
      insert into url_table <hash(url), url, hash(page_content), time_stamp>;
      emit intermediate pair <url, page_content>;
  } else if (duplicate(url) && (current_time - time_stamp(url) > threshold)) {
      page_content = http_get(url);
      update url_table(hash(url), current_time);
      emit intermediate pair <url, page_content>;
  } else {
      update url_table(hash(url), current_time);
  }

Reduce (input <url, page_content>):
  if (!exists hash(url) in page_content_table) {
      insert into page_content_table <hash(page_content), compress(page_content)>;
  } else if (hash(page_content_table(hash(url))) != hash(current_page_content)) {
      insert into page_content_table <hash(page_content), compress(diff_with_latest(page_content))>;
  }
Distributed Crawler • Currently runs outside of Map-Reduce • Files are transferred to HDFS manually • Currently depth-first search; will be modified to breadth-first search (a sketch of moving the fetch step into a map task follows)
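As a sketch of how the fetch step could eventually run inside Map-Reduce rather than outside it, the following hypothetical Hadoop mapper reads one URL per input line and emits <url, page_content> pairs. It is an assumption-laden illustration, not the current SILOs crawler: duplicate detection and the re-crawl threshold check against the URL table (see the crawler pseudocode slide) are deliberately omitted.

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String url = line.toString().trim();
        if (url.isEmpty()) {
            return;
        }
        try {
            // Fetch the page; a real crawler would also deduplicate URLs and
            // apply the time-stamp threshold before re-fetching.
            URLConnection connection = new URL(url).openConnection();
            connection.setConnectTimeout(5_000);
            connection.setReadTimeout(5_000);
            try (InputStream in = connection.getInputStream()) {
                String pageContent = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                context.write(new Text(url), new Text(pageContent));
            }
        } catch (IOException unreachable) {
            // Skip URLs that cannot be fetched; one bad page should not fail the job.
        }
    }
}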
EXPERIMENT: Keyword Extraction

Map (input <url, page_content>):
  keywords = parse(page_content);
  for each keyword, emit intermediate pair <keyword, url>;

Reduce:
  combine all <keyword, url> pairs that share a keyword and emit <keyword, list<urls>>;
  insert <keyword, list<urls>> into the inverted index table;

A Java sketch of this job follows.
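The sketch below is a hedged Java rendering of this experiment, not the exact SILOs code. It assumes the input is staged as <url \t page_content> records (hence KeyValueTextInputFormat in the driver sketched earlier), and a crude whitespace/punctuation split stands in for the HTML Parser used in SILOs; the class names are illustrative.

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KeywordExtraction {

    // Map: input <url, page_content>; emit one <keyword, url> pair per keyword.
    public static class KeywordMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text url, Text pageContent, Context context)
                throws IOException, InterruptedException {
            // Crude tokenization; SILOs parses the HTML instead.
            for (String token : pageContent.toString().toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    context.write(new Text(token), url);
                }
            }
        }
    }

    // Reduce: combine all <keyword, url> pairs sharing a keyword into <keyword, list of urls>.
    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text keyword, Iterable<Text> urls, Context context)
                throws IOException, InterruptedException {
            Set<String> uniqueUrls = new LinkedHashSet<>();
            for (Text url : urls) {
                uniqueUrls.add(url.toString());
            }
            // The SILOs pipeline would insert this row into the inverted index table.
            context.write(keyword, new Text(String.join(",", uniqueUrls)));
        }
    }
}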
References • HTML Parser • Hadoop Framework (Apache) • Peer Crawl