
ENEE 759D | ENEE 459D | CMSC 858Z

4. Scalability and MapReduce

Prof. Tudor Dumitraș

Assistant Professor, ECE, University of Maryland, College Park

http://ter.ps/759d

https://www.facebook.com/SDSAtUMD

Today’s Lecture
  • Where we’ve been
    • How to say “hapax legomenon” and “heteroskedasticity”
    • Interpretation of Statistics
    • Attributes of Big Data
  • Where we’re going today
    • Threats to validity
    • Scalability
    • MapReduce
  • Where we’re going next
    • Machine learning
The IROP Keyboard [Zeller, 2011]

To prevent bugs, remove the keystrokes that predict 74% of failure-prone modules in Eclipse

Does this work?

What am I measuring?

How well does this work in the real world?

Will this work tomorrow?

[Figure: “Reconstruct Lineage” diagram for the Korgo worm family; labels include Sample C, Sample D, Sample E, and V1 ?, V2 ?, V3 ?]

What Am I Measuring: Scalability vs. Latency

Can we make use of 1000s of cheap computers?

  • Analyzing data in parallel
    • To access 1 TB in 1 min, must distribute data over 20 disks
    • Parallelism is useful for algorithms where complexity constants matter
      • N log N operations sequentially => (N log N)/K operations in parallel
    • Scalability: ability to throw resources at the problem
  • You can measure scalability
    • Scaleup (weak scalability):
      • More resources => solve proportionally bigger problem with same latency
    • Speedup (strong scalability):
      • More resources => proportionally lower latency with same problem size
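The definitions above can be turned into a few lines of arithmetic. The following is a minimal sketch, not something taken from the lecture: the function names and the timing numbers are hypothetical, and a decimal terabyte (10^12 bytes) is assumed for the disk example.

```python
import math

def speedup(t_one_unit: float, t_k_units: float) -> float:
    """Strong scalability: same problem size, more resources.
    Ideal value is k when k times the resources are used."""
    return t_one_unit / t_k_units

def scaleup(t_small_on_one: float, t_big_on_k: float) -> float:
    """Weak scalability: problem and resources both grow by a factor k.
    Ideal value is 1.0 (same latency for the bigger problem)."""
    return t_small_on_one / t_big_on_k

def parallel_ops(n: int, k: int) -> float:
    """(N log N)/K operations per worker, assuming perfect division of work."""
    return n * math.log2(n) / k

def per_disk_mb_per_s(total_bytes: float, seconds: float, disks: int) -> float:
    """Throughput each disk must sustain to read total_bytes in the given time."""
    return total_bytes / seconds / disks / 1e6

# Hypothetical measurements: 100 s on 1 machine, 25 s on 4 machines.
print(speedup(100.0, 25.0))
# Hypothetical: 100 s for size N on 1 machine, 110 s for size 4N on 4 machines.
print(scaleup(100.0, 110.0))
# 1 TB in 1 min over 20 disks, as in the bullet above.
print(per_disk_mb_per_s(1e12, 60, 20))
```

Under these assumptions, the 1 TB / 1 min / 20 disks example works out to roughly 833 MB/s per disk.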
Some Problems Are Embarrassingly Parallel (1)

Task: Convert 405K TIFF images (~4 TB) to PNG

Input: many TIFF images

Distribute images among K computers

f is a function to convert TIFF to PNG; apply it to every item

Output: a big distributed set of converted images

http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/
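The pattern on this slide (apply f independently to every item) might be sketched as follows. This is an illustration under stated assumptions: a thread pool on one machine stands in for the K computers, and `convert` is a hypothetical placeholder for f that only rewrites the file name rather than decoding any image data.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def convert(name: str) -> str:
    """Hypothetical stand-in for f: maps a TIFF file name to a PNG file name.
    A real f would read the TIFF and write the PNG."""
    return os.path.splitext(name)[0] + ".png"

def parallel_apply(f, items, k):
    """Apply f to every item using k workers; results keep the input order."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        return list(pool.map(f, items))

images = [f"page_{i:03d}.tif" for i in range(6)]
converted = parallel_apply(convert, images, k=3)
print(converted)
```

No item depends on any other, so no communication between workers is needed; that independence is what makes the task embarrassingly parallel.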

Some Problems Are Embarrassingly Parallel (2)

Task: Compute the word frequency of 5M documents

Input: millions of documents

Distribute documents among K computers

For each document, f returns a set of <word, freq> pairs

Output: a big distributed list of sets of word freqs.

Adapted from slides by Bill Howe
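A minimal sketch of the per-document function f from this slide is shown below. The tokenization (lowercasing and splitting on whitespace) and the sample documents are assumptions made for illustration; the slides do not specify them.

```python
from collections import Counter

def word_freq(document: str) -> Counter:
    """f: return the <word, freq> pairs for a single document.
    Assumes words are separated by whitespace and case is ignored."""
    return Counter(document.lower().split())

# Hypothetical input: document name -> contents.
documents = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog sat",
}

# One independent histogram per document; no document needs any other.
per_doc = {name: word_freq(text) for name, text in documents.items()}
print(per_doc["doc1"])
```

Each call touches only its own document, so the calls can be spread over K computers exactly as in the image-conversion task.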

Some Problems Are Embarrassingly Parallel (3)

Task: Compute the word frequency across all documents

Input: millions of documents

Distribute documents among K computers

For each document, f returns a set of <word, freq> pairs

We don’t want a bunch of little histograms – we want one big histogram

Now what?

MapReduce

Task: Compute the word frequency across all documents

Distribute documents among K computers

Map: for each document, f returns a set of <word, freq> pairs

A big distributed list of sets of word freqs.

Shuffle <word, freq> pairs so that all the counts for a word are sent to the same host

Reduce: add the counts of each word

Output: the distributed histogram
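The map, shuffle, and reduce steps on this slide can be simulated in a single process. This is a sketch under assumptions, not the framework's implementation: the list of reducers stands in for separate hosts, and the CRC32-based `partition` function is one hypothetical way to send every count for a word to the same place.

```python
import zlib
from collections import defaultdict

def map_phase(document):
    """Map: emit <word, freq> pairs for one document."""
    counts = defaultdict(int)
    for word in document.lower().split():
        counts[word] += 1
    return list(counts.items())

def partition(word, num_reducers):
    """Pick a reducer for a word; the same word always maps to the same one."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers

def shuffle(mapped_outputs, num_reducers):
    """Shuffle: group the counts by word, one group table per reducer."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped_outputs:
        for word, freq in pairs:
            reducers[partition(word, num_reducers)][word].append(freq)
    return reducers

def reduce_phase(grouped):
    """Reduce: add the counts of each word held by one reducer."""
    return {word: sum(freqs) for word, freqs in grouped.items()}

documents = ["the cat sat on the mat", "the dog sat", "a cat and a dog"]
mapped = [map_phase(d) for d in documents]
shuffled = shuffle(mapped, num_reducers=2)
histogram_parts = [reduce_phase(g) for g in shuffled]   # the distributed histogram
histogram = {w: c for part in histogram_parts for w, c in part.items()}
print(histogram)
```

Each word lands in exactly one part, so the parts can be merged (or left distributed) without any further addition.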

Hadoop on One Slide
  • MapReduce was invented at Google [Dean & Ghemawat, OSDI’04]
  • Hadoop = open source implementation
  • Data stored on HDFS distributed file system
    • Direct-attached storage
    • No schema needed on load
  • Programmers write Map and Reduce functions
  • Framework provides automated parallelization and fault tolerance
    • Data replication, restarting failed tasks
    • Scheduling Map and Reduce tasks on hosts with local copies of input data

Source: Huy Vo
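The idea of restarting failed tasks can be illustrated with a toy retry loop. This is only a sketch of the concept: it does not use or mirror Hadoop's API, and `run_with_restarts`, `FlakyTask`, and the limit of three attempts are all hypothetical names and choices.

```python
def run_with_restarts(task, max_attempts=3):
    """Run task(); if it fails, restart it, up to max_attempts tries in total.
    Returns the task result and the number of attempts used."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except RuntimeError as err:
            last_error = err  # remember the failure and try again
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error

class FlakyTask:
    """Simulated task that fails a fixed number of times before succeeding."""
    def __init__(self, failures_before_success):
        self.remaining = failures_before_success

    def __call__(self):
        if self.remaining > 0:
            self.remaining -= 1
            raise RuntimeError("simulated host failure")
        return "done"

result, attempts = run_with_restarts(FlakyTask(2))
print(result, attempts)
```

Restarting is safe here because the task has no side effects other than its return value; the same property is what lets a framework rerun Map or Reduce tasks transparently.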

MapReduce Programming Model
  • Input & Output: each a set of key/value pairs
  • Programmer specifies two functions:

map (in_key, in_value) -> list(out_key, intermediate_value)

    • Processes input key/value pair
    • Produces set of intermediate pairs

reduce (out_key, list(intermediate_value)) -> list(out_value)

    • Combines all intermediate values for a particular key
    • Produces a set of merged output values (usually just one)
  • Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell

Slide source: Google
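The two signatures above can be written down as typed Python to make the data flow explicit. The sequential driver `run_mapreduce` below is a hypothetical single-process sketch, not the framework itself, and the example (listing which documents contain each word) is an illustration chosen here rather than one taken from the slides.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
V3 = TypeVar("V3")

def run_mapreduce(
    inputs: Iterable[Tuple[K1, V1]],
    mapper: Callable[[K1, V1], List[Tuple[K2, V2]]],
    reducer: Callable[[K2, List[V2]], List[V3]],
) -> Dict[K2, List[V3]]:
    """Call mapper on every input pair, group intermediate values by key,
    then call reducer once per key."""
    groups: Dict[K2, List[V2]] = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, intermediate_value in mapper(in_key, in_value):
            groups[out_key].append(intermediate_value)
    return {key: reducer(key, values) for key, values in groups.items()}

def index_map(doc_name: str, contents: str) -> List[Tuple[str, str]]:
    """map: emit (word, document name) once per distinct word."""
    return [(word, doc_name) for word in set(contents.lower().split())]

def index_reduce(word: str, doc_names: List[str]) -> List[str]:
    """reduce: merge all document names for one word into a sorted list."""
    return sorted(doc_names)

result = run_mapreduce(
    [("d1", "cat sat"), ("d2", "cat ran")], index_map, index_reduce
)
print(result)
```

Only `index_map` and `index_reduce` are specific to the problem; the grouping step in the middle is the part the framework provides.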

Example: What Does This Do?

map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, 1);

reduce(String output_key, Iterator intermediate_values):
    // output_key: word
    // output_values: ????
    int result = 0;
    for each v in intermediate_values:
        result += v;
    EmitFinal(output_key, result);

Big Data in the Security Industry
  • Booz Allen Hamilton
    • Dr. Brian Keller’s colloquium “Innovating with Analytics”
    • Sponsors Data Science Bowl, October 5th 1-5:30 pm CSIC 2117 & 2120 https://www.datasciencebowl.com/
  • Symantec
    • WINE platform for data analytics in security
  • Google
    • Mine user access patterns to mitigate data loss due to stolen credentials
      • Supplementary to passwords and two-factor authentication
    • Fuzz testing at scale
Big Data for Security: Benefits and Challenges
  • Benefits
    • Ability to analyze data at scale (e.g., the information on the 403 million malware variants created in 2011)
    • MapReduce provides simple programming model, automated parallelization and fault tolerance
      • Commercial parallel DBs (e.g. Vertica, Greenplum, Aster Data) also provide some of these benefits, but they are very expensive
  • Challenges
    • Lack of ground truth on malware families
    • Lack of contextual data: e.g., date and time of appearance
    • Inability to collect some types of data owing to privacy concerns
    • Sharing data (e.g., malware samples are dangerous, some data sets may include personal information)

Illustrate general threats to validity in experimental cyber security

Threats to Validity

Construct validity: use metrics that model the hypothesis

Internal validity: establish a causal connection

Content validity: include only and all relevant data

External validity: generalize results beyond experimental data

Questions to ask: Does it work? What am I measuring? Will it work tomorrow? Will it work in the real world?

Review of Lecture
  • What did we learn?
    • Construct, content, internal, external validity
    • Programming in MapReduce
    • Measuring scalability
  • What’s next?
    • Paper discussion: ‘Before We Knew It: An Empirical Study of Zero-Day Attacks In The Real World’
    • Next lecture: Machine learning techniques
  • Deadline reminder
    • Pilot project reports due on Wednesday
    • Post report on Piazza