SILOs - Distributed Web Archiving & Analysis using Map Reduce

Anushree Venkatesh

Sagar Mehta

Sushma Rao

AGENDA
  • Motivation
  • What is Map-Reduce?
  • Why Map-Reduce?
  • The HADOOP Framework
  • Map Reduce in SILOs
    • SILOs Architecture
    • Modules
  • Experiments
Motivation
  • Life span of a web page: 44 to 75 days
  • Limitations of centralized/distributed crawling
    • Exploring Map-Reduce
  • Analysis of the web [a subset]
    • Web graph
    • Search response quality
      • Tweaked PageRank
      • Inverted index
Why Map-Reduce
  • Divide and conquer
  • Functional programming counterparts -> distributed data processing
  • Plumbing handled behind the scenes -> focus on the problem
  • Map – division of the key space
  • Reduce – combining the results (see the word-count sketch after this list)
  • Pipelining functionality
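
To make the map/reduce split concrete, here is a minimal word-count sketch against Hadoop's Java map-reduce API (illustrative only, not part of SILOs): each map task handles its slice of the input and emits <word, 1> pairs, and each reduce task combines the counts for one word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: called once per input record; the framework divides the key space across mappers.
    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String word : line.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), ONE);      // emit <word, 1>
                }
            }
        }
    }

    // Reduce: combine all values that share a key.
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));       // emit <word, total>
        }
    }
}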
Hadoop Framework
  • Open-source implementation of Map-Reduce in Java
  • HDFS – the Hadoop Distributed File System (used for job input/output in the driver sketch below)
  • Takes care of
    • fault tolerance
    • dependencies between nodes
  • Setup through a VM instance caused problems
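
As a sketch of how little plumbing the job author writes, the driver below wires the WordCountMapper and WordCountReducer from the previous sketch into a job whose input and output live on HDFS; Hadoop handles splitting the input, shuffling keys, and re-running failed tasks. The paths are placeholders, and the code assumes a reasonably recent Hadoop release with the org.apache.hadoop.mapreduce API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // reads core-site.xml / hdfs-site.xml
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.WordCountMapper.class);
        job.setReducerClass(WordCount.WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/silos/input"));     // HDFS input (placeholder)
        FileOutputFormat.setOutputPath(job, new Path("/user/silos/output"));  // HDFS output (placeholder)
        System.exit(job.waitForCompletion(true) ? 0 : 1);    // block until the cluster finishes the job
    }
}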
SETUP
  • Currently a single-node cluster
  • HDFS setup
  • Incorporation of Berkeley DB (see the sketch below)
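
The slide does not say how Berkeley DB is used; a plausible minimal sketch, assuming the Java Edition (com.sleepycat.je) backs the crawler's url_table with <url, time_stamp> records, could look like this. The database name, home directory, and record layout are assumptions for illustration only.

import java.io.File;
import java.nio.charset.StandardCharsets;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

public class UrlTableStore {
    public static void main(String[] args) throws Exception {
        // Open (or create) a Berkeley DB JE environment and the url_table database.
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(new File("silos-db"), envConfig);
        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database urlTable = env.openDatabase(null, "url_table", dbConfig);

        // Record the crawl time of a URL.
        String url = "http://www.example.com/";
        DatabaseEntry key = new DatabaseEntry(url.getBytes(StandardCharsets.UTF_8));
        DatabaseEntry value = new DatabaseEntry(
                Long.toString(System.currentTimeMillis()).getBytes(StandardCharsets.UTF_8));
        urlTable.put(null, key, value);

        // Look the URL up again, e.g. for the duplicate(url) check in the crawler.
        DatabaseEntry found = new DatabaseEntry();
        if (urlTable.get(null, key, found, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
            System.out.println(url + " last crawled at "
                    + new String(found.getData(), StandardCharsets.UTF_8));
        }

        urlTable.close();
        env.close();
    }
}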
SILOs Architecture

[Architecture diagram] Components shown: seed list, URL extractor, distributed crawler, graph builder, key word extractor, back-links mapper, and diff compression. Map-reduce stages shown: parse for URLs, emitting <URL, 1> pairs and reducing to remove duplicates; map <URL, value> reducing to <URL, page content>; parse for keywords, emitting <keyword, URL>; and map <parent, URL> reducing to <URL, parent>. Output tables: URL table, page content table, adjacency list table, inverted index table, and back links table.
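
The back-links stage in the diagram maps <parent, URL> pairs and reduces them to <URL, parent> lists. A minimal sketch of that inversion in Hadoop's Java API might look as follows; it assumes, purely for illustration, that the crawler records each discovered link as a tab-separated "parent_url<TAB>child_url" line, since the slides do not give the on-disk format.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BackLinks {
    // Map: invert each edge so the child URL becomes the key.
    public static class EdgeMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] edge = line.toString().split("\t");     // parent_url \t child_url (assumed format)
            if (edge.length == 2) {
                context.write(new Text(edge[1]), new Text(edge[0]));   // emit <URL, parent>
            }
        }
    }

    // Reduce: collect every parent of a URL into one back-links row.
    public static class BackLinkReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text url, Iterable<Text> parents, Context context)
                throws IOException, InterruptedException {
            StringBuilder list = new StringBuilder();
            for (Text parent : parents) {
                if (list.length() > 0) {
                    list.append(',');
                }
                list.append(parent.toString());
            }
            context.write(url, new Text(list.toString()));   // row for the back links table
        }
    }
}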

Distributed Crawler

Map

Input: <url, 1>

if (!duplicate(url)) {
    page_content = http_get(url);
    insert into url_table <hash(url), url, hash(page_content), time_stamp>
    output intermediate pair <url, page_content>
} else if (duplicate(url) && (current_time - time_stamp(url) > threshold)) {
    page_content = http_get(url);
    update url_table(hash(url), current_time);
    output intermediate pair <url, page_content>
} else {
    update url_table(hash(url), current_time);
}

Reduce

Input: <url, page_content>

if (!exists hash(url) in page_content_table) {
    insert into page_content_table <hash(page_content), compress(page_content)>
} else if (hash(page_content_table(hash(url))) != hash(current_page_content)) {
    insert into page_content_table <hash(page_content), compress(diff_with_latest(page_content))>
}
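
As a rough, standalone Java counterpart to the map step above (not the SILOs code): an in-memory set stands in for url_table, and the re-crawl threshold, hashing, and timestamps are omitted for brevity.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

public class CrawlMapStep {
    private static final Set<String> seenUrls = new HashSet<String>();   // stands in for url_table

    // http_get(url): fetch the page body over HTTP.
    static String httpGet(String url) throws Exception {
        StringBuilder page = new StringBuilder();
        BufferedReader in = new BufferedReader(new InputStreamReader(new URL(url).openStream()));
        String line;
        while ((line = in.readLine()) != null) {
            page.append(line).append('\n');
        }
        in.close();
        return page.toString();
    }

    // The map step: skip URLs already crawled, otherwise fetch and emit <url, page_content>.
    static void map(String url) throws Exception {
        if (seenUrls.contains(url)) {
            return;                                   // duplicate(url): nothing to do in this sketch
        }
        seenUrls.add(url);                            // insert into url_table
        String pageContent = httpGet(url);
        System.out.println(url + "\t" + pageContent.length() + " bytes");   // intermediate pair
    }

    public static void main(String[] args) throws Exception {
        for (String url : args) {
            map(url);
        }
    }
}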

Distributed Crawler
  • Currently runs outside of Map-Reduce
  • Files are transferred to HDFS manually (see the sketch below)
  • Currently depth-first search; will be modified to breadth-first search
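
Since the transfer to HDFS is currently manual, a small sketch like the one below could automate it with Hadoop's FileSystem API, assuming the default configuration of the single-node cluster; the local and HDFS paths are placeholders passed on the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyCrawlToHdfs {
    public static void main(String[] args) throws Exception {
        // args[0]: local directory holding the crawler output, args[1]: destination path on HDFS
        Configuration conf = new Configuration();     // picks up the cluster's fs settings
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));
        fs.close();
    }
}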
EXPERIMENT: Keyword Extraction

Map

Input: <url, page_content>

list<keywords> = parse(page_content);
for each keyword, output intermediate pair <keyword, url>

Reduce

Combine all <keyword, url> pairs with the same keyword to emit <keyword, list<urls>>
Insert into inverted_index_table <keyword, list<urls>>
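
A minimal Java sketch of this job, assuming the crawler output is available as <url, page_content> Text pairs (for example via Hadoop's KeyValueTextInputFormat or a SequenceFile), and with a crude tokenizer standing in for parse(page_content), which in SILOs would use an HTML parser:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {
    // Map: emit an intermediate <keyword, url> pair for each keyword on the page.
    public static class KeywordMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text url, Text pageContent, Context context)
                throws IOException, InterruptedException {
            for (String word : pageContent.toString().toLowerCase().split("[^a-z]+")) {
                if (word.length() > 2) {                     // skip very short tokens
                    context.write(new Text(word), url);
                }
            }
        }
    }

    // Reduce: combine all <keyword, url> pairs with the same keyword into <keyword, list of urls>.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text keyword, Iterable<Text> urls, Context context)
                throws IOException, InterruptedException {
            StringBuilder list = new StringBuilder();
            for (Text url : urls) {
                if (list.length() > 0) {
                    list.append(',');
                }
                list.append(url.toString());
            }
            context.write(keyword, new Text(list.toString()));   // one inverted-index row
        }
    }
}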

References
  • HTML Parser
  • Hadoop Framework (Apache)
  • Peer Crawl