
SILOs - Distributed Web Archiving & Analysis using Map Reduce



  1. SILOs - Distributed Web Archiving & Analysis using Map Reduce Anushree Venkatesh Sagar Mehta Sushma Rao

  2. AGENDA • Motivation • What is Map-Reduce? • Why Map-Reduce? • The Hadoop Framework • Map Reduce in SILOs • SILOs Architecture • Modules • Experiments

  3. Motivation • Life span of a web page – 44 to 75 days • Limitations of centralized/distributed crawling • Exploring map-reduce • Analysis of the web [ subset ] • Web graph • Search response quality • Tweaked PageRank • Inverted index

  4. What is Map-Reduce?

  5. Why Map-Reduce • Divide and conquer • Functional programming counterparts -> distributed data processing • Plumbing handled behind the scenes -> focus on the problem • Map – division of the key space • Reduce – combining of per-key results • Pipelining functionality (see the word-count sketch below)
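
  To make the division/combination concrete, here is the canonical Hadoop word-count job in Java (standard example code, not from the SILOs slides): the mapper partitions the key space by emitting <word, 1> pairs, and the reducer combines the counts for each key.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: for each input line, emit <word, 1>; this divides the key space
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: combine all counts emitted for the same word into one total
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation ("pipelining")
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }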

  6. Hadoop Framework • Open-source implementation of Map-Reduce in Java • HDFS – the Hadoop Distributed File System • Takes care of • fault tolerance • dependencies between nodes • Setup through a VM instance – problems encountered

  7. SETUP • Currently a single-node cluster • HDFS setup • Incorporation of Berkeley DB

  8. Map Reduce in SILOs

  9. SILOs Architecture [Architecture diagram: a Seed List feeds the Distributed Crawler (Map: parse for URLs, emit <URL, 1>; Reduce: remove duplicates, emit <URL, page content>). The URL Extractor and Graph Builder maintain the Adjacency List table; the Back Links Mapper turns <URL, parent URL> pairs into the Back Links table (Map: <parent, URL>; Reduce: <URL, parents>). The Key Word Extractor (Map: parse for keywords; Reduce: <keyword, URL>) fills the Inverted Index table. Diff compression feeds the Page Content table, alongside the URL table.]

  10. Berkeley DB Schema
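
  The schema figure itself is not reproduced in this transcript. As a stand-in, here is a minimal sketch of how the url_table from slide 11 might be persisted with Berkeley DB Java Edition; the record layout, database name, and environment path are illustrative assumptions, not the actual SILOs schema.

    import java.io.File;
    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;

    public class UrlTableSketch {
      public static void main(String[] args) throws Exception {
        // Open (or create) the on-disk environment; the directory is hypothetical
        File dir = new File("/tmp/silos-db");
        dir.mkdirs();
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(dir, envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database urlTable = env.openDatabase(null, "url_table", dbConfig);

        // Key: hash(url); value: <url, hash(page_content), time_stamp>, as on slide 11
        String url = "http://example.com/";
        DatabaseEntry key = new DatabaseEntry("hash-of-url".getBytes("UTF-8"));
        String record = url + "\t" + "hash-of-page-content" + "\t" + System.currentTimeMillis();
        DatabaseEntry data = new DatabaseEntry(record.getBytes("UTF-8"));
        urlTable.put(null, key, data); // null = no explicit transaction

        urlTable.close();
        env.close();
      }
    }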

  11. Distributed Crawler
  Map – input <url, 1>:
    if (!duplicate(url)) {
        // First visit: fetch the page and record the URL
        page_content = http_get(url);
        insert into url_table <hash(url), url, hash(page_content), time_stamp>;
        emit intermediate pair <url, page_content>;
    } else if (duplicate(url) && (current_time - time_stamp(url) > threshold)) {
        // Known URL whose entry has gone stale: re-fetch it
        page_content = http_get(url);
        update url_table(hash(url), current_time);
        emit intermediate pair <url, page_content>;
    } else {
        // Known, fresh URL: only refresh its timestamp
        update url_table(hash(url), current_time);
    }
  Reduce – input <url, page_content>:
    if (!exists(hash(url)) in page_content_table) {
        insert into page_content_table <hash(page_content), compress(page_content)>;
    } else if (hash(page_content_table(hash(url))) != hash(current_page_content)) {
        // Page changed since last crawl: store only the compressed diff
        insert into page_content_table <hash(page_content), compress(diff_with_latest(page_content))>;
    }
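
  The pseudocode above relies on hash(url) and hash(page_content) for duplicate and change detection, but the slides do not say which hash function is used. A minimal helper, assuming MD5, could look like this:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class HashUtil {
      // Hex-encoded MD5 digest; the choice of MD5 is an assumption, not from the slides
      public static String hashOf(String s) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
          hex.append(String.format("%02x", b));
        }
        return hex.toString();
      }

      public static void main(String[] args) throws NoSuchAlgorithmException {
        // Same page content -> same hash, so a re-crawled, unchanged page is skipped
        System.out.println(hashOf("http://example.com/"));
      }
    }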

  12. Distributed Crawler • Currently runs outside of Map-Reduce • Manual transfer of files to HDFS (a programmatic sketch follows) • Currently depth-first search; will be modified to breadth-first search
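
  For the manual-transfer step, one way to push crawled files into HDFS programmatically is Hadoop's FileSystem API; both paths below are hypothetical, and the equivalent shell command is hadoop fs -put.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsUpload {
      public static void main(String[] args) throws Exception {
        // Picks up the cluster address from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a locally crawled page dump into HDFS
        fs.copyFromLocalFile(new Path("/local/crawl/pages.txt"),
                             new Path("/silos/input/pages.txt"));
        fs.close();
      }
    }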

  13. EXPERIMENT: Keyword Extraction
  Map – input <url, page_content>:
    List<keywords> = parse(page_content);
    for each keyword, emit intermediate pair <keyword, url>;
  Reduce:
    combine all <keyword, url> pairs with the same keyword to emit <keyword, List<urls>>;
    insert into inverted_index_table <keyword, List<urls>>;
  (A runnable Hadoop sketch of this job follows.)
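
  A runnable Hadoop sketch of this job, under two assumptions not stated in the slides: each input line is a tab-separated <url, page_content> pair (the default split of KeyValueTextInputFormat), and parse() is simplified to whitespace tokenization.

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class InvertedIndex {

      // Map: <url, page_content> -> <keyword, url> for every keyword on the page
      public static class KeywordMapper extends Mapper<Text, Text, Text, Text> {
        private final Text keyword = new Text();

        @Override
        public void map(Text url, Text pageContent, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(pageContent.toString());
          while (itr.hasMoreTokens()) {
            keyword.set(itr.nextToken().toLowerCase());
            context.write(keyword, url);
          }
        }
      }

      // Reduce: <keyword, [url, url, ...]> -> <keyword, deduplicated URL list>
      public static class UrlListReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text keyword, Iterable<Text> urls, Context context)
            throws IOException, InterruptedException {
          Set<String> unique = new HashSet<String>();
          for (Text url : urls) {
            unique.add(url.toString());
          }
          context.write(keyword, new Text(String.join(",", unique)));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "inverted index");
        job.setJarByClass(InvertedIndex.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class); // key = text before first tab
        job.setMapperClass(KeywordMapper.class);
        job.setReducerClass(UrlListReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }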

  14. EXPERIMENT: Keyword extraction

  15. EXPERIMENT: Keyword extraction

  16. Experiment: Inverted Index

  17. Experiment: URL Count

  18. Experiment: URL Depth

  19. Questions, Comments, Criticisms

  20. References • HTML Parser • Hadoop Framework (Apache) • Peer Crawl

  21. Thank You!
