
Handling Big Data

Howles

Credits to Sources on Final Slide


Handling Large Amounts of Data

  • Current technologies are to:

    • Parallelize – use multiple processors or threads. Can be a single machine, or a machine with multiple processors

    • Distribute – use a network to partition work across many computers


Parallelized Operations

  • This is relatively easy if the task itself can be cleanly split into units, but it still presents some problems, including:

    • How is the work assigned?

    • What happens if we have more work units than threads or processors?

    • How do we know when all work units have completed?

    • How do we aggregate results in the end?

    • What do we do if the work can’t be cleanly divided?


Parallelized Operations

  • To solve these problems, we need communication mechanisms

  • We need synchronization mechanisms for communication (timing/notification of events) and to control sharing (mutual exclusion, or mutex)


Why is it needed?

  • Data consistency

  • Orderly execution of instructions or activities

  • Timing – control race conditions


Examples

  • Two people want to buy the same seat on a flight

  • Readers and writers

  • P1 needs a resource but it’s being held by P2

  • Two threads updating a single counter

  • Bounded Buffer

  • Producer/Consumer

  • …….
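The shared-counter case above can be made concrete in Python; a minimal sketch, where the thread count and iteration count are illustrative choices:

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        # Without the lock, the "read, add, write back" steps of two
        # threads can interleave and lose updates (a race condition).
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 200000 with the lock; often less without it
```

Removing the `with lock:` line makes the final count nondeterministic, which is exactly why the result is tricky to recreate.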


Synchronization Primitives

  • Review:

  • A special shared variable used to guarantee atomic operations

  • Hardware support

    • The processor may lock the memory bus so no other reads/writes can occur during the operation

  • Semaphores, monitors, conditions are examples of language-level synchronization mechanisms


Needed when:

  • Resources need to be shared

  • Timing needs to be coordinated

    • Access data

    • Send messages or data

  • Potential race conditions – timing

    • Difficult to predict

    • Results in inconsistent, corrupt or destroyed info

    • Tricky to find; difficult to recreate

  • Activities need to be synchronized


  • Producer/Consumer

    Producer:
        while counter == MAX
            NOP              // busy-wait until there is space
        put item in buffer
        counter++

    Consumer:
        while counter == 0
            NOP              // busy-wait until there is an item
        remove item from buffer
        counter--
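The busy-waiting loops above can be replaced with blocking waits; a sketch of the same bounded buffer using Python's threading.Condition (the buffer size and item count are arbitrary):

```python
import threading
from collections import deque

MAX = 4
buffer = deque()
cond = threading.Condition()

def produce(item):
    with cond:
        while len(buffer) == MAX:   # buffer full: block instead of spinning
            cond.wait()
        buffer.append(item)
        cond.notify_all()           # wake any waiting consumer

def consume():
    with cond:
        while len(buffer) == 0:     # buffer empty: block
            cond.wait()
        item = buffer.popleft()
        cond.notify_all()           # wake any waiting producer
        return item

results = []
t = threading.Thread(target=lambda: [results.append(consume()) for _ in range(8)])
t.start()
for i in range(8):
    produce(i)
t.join()
print(results)  # items arrive in FIFO order: [0, 1, ..., 7]
```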


    Race Conditions

    • … can result in an incorrect solution

    • An issue with any shared resource (including devices)

      • Printer

      • Writers to a disk


    Critical Section

    • Also called the critical region

    • Segment of code (or device) for which a process must have exclusive use


    Examples of Critical Sections

    • Updating/reading a shared counter

    • Controlling access to a device or other resource

    • Two users want write access to a file


    Rules for solutions

    • Must enforce mutual exclusion (mutex)

    • Must not postpone a process unnecessarily (a process may not be excluded from the CR if no other process is in the CR)

    • Bounded waiting (a process must eventually be allowed to enter the CR)

    • No assumptions may be made about relative execution speeds


    Atomic Operation

    • The operation is guaranteed to complete without interruption

    • How do we enforce atomic operations?


    Semaphores

    • Dijkstra, circa 1965

    • Two standard operations: wait() and signal()

    • Older books may still use P() and V(), respectively (or down() and up()). You should be familiar with any notation


    Semaphores

    • A semaphore is comprised of an integer counter and a waiting list of blocked processes

    • Initialize the counter (depends on application)

    • wait() decrements the counter and determines if the process must block

    • signal() increments the counter and determines if a blocked process can unblock
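These semantics can be sketched with Python's threading.Semaphore; here the counter is initialized to 2, so at most two threads are inside the guarded section at once (the thread count and sleep are illustrative):

```python
import threading
import time

sem = threading.Semaphore(2)   # counter initialized to 2
state_lock = threading.Lock()
active = 0
peak = 0

def worker():
    global active, peak
    sem.acquire()              # wait(): decrement; block if the counter was 0
    with state_lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.01)           # simulate work inside the guarded section
    with state_lock:
        active -= 1
    sem.release()              # signal(): increment; unblock a waiter

threads = [threading.Thread(target=worker) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(peak)  # never exceeds the initial counter value of 2
```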


    Semaphores

    • wait() and signal() are atomic operations

    • What is the other advantage of a semaphore over the previous solutions?


    Binary Semaphore

    • Initialized to one

    • Allows only one process access at a time


    Semaphores

    • wait() and signal() are usually system calls. Within the kernel, interrupts are disabled to make the counter operations atomic.


    Problems with Semaphores

    Assume both semaphores are initialized to 1.

    Process 0:

    wait(s);     // 1st
    wait(q);     // 3rd – blocks: q is already held by Process 1
    …
    signal(s);
    signal(q);

    Process 1:

    wait(q);     // 2nd
    wait(s);     // 4th – blocks: s is already held by Process 0
    …
    signal(q);
    signal(s);

    Each process now holds one semaphore while waiting for the other: deadlock.
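One standard fix for this deadlock is to make every process acquire the semaphores in the same global order; a sketch in Python, using Lock objects as binary semaphores:

```python
import threading

s = threading.Lock()
q = threading.Lock()
log = []

def process(name):
    # Both processes acquire in the SAME global order (s before q),
    # so neither can hold one lock while waiting for the other.
    with s:
        with q:
            log.append(name)

t0 = threading.Thread(target=process, args=("P0",))
t1 = threading.Thread(target=process, args=("P1",))
t0.start()
t1.start()
t0.join()
t1.join()

print(sorted(log))  # ['P0', 'P1'] – both complete, no deadlock
```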


    Other problems

    • Incorrect order

    • Forgetting to signal()

    • Incorrect initial value


    Monitors

    • Encapsulates the synchronization with the code

    • Only one process may be active in the monitor at a time

    • Waiting processes are blocked (no busy waiting)


    Monitors

    • Condition variables control access to the monitor

    • Two operations: wait() and signal() (easy to confuse with semaphores, so be careful!)

    • enter() and leave() or other named functions may be used


    Monitors

    if (some condition)
        call wait() on the monitor

    << mutex-protected critical section >>

    call signal() on the monitor
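This pattern maps naturally onto Python's threading.Condition; a sketch of a one-slot mailbox as a monitor-style class (the class and method names are invented for illustration). Note the wait() sits in a while loop, because a signaled process does not necessarily run immediately:

```python
import threading

class Mailbox:
    """Monitor-style class: one lock guards all shared state,
    and a condition variable provides wait()/notify()."""
    def __init__(self):
        self._cond = threading.Condition()
        self._msg = None

    def put(self, msg):
        with self._cond:                 # enter the monitor
            while self._msg is not None: # slot occupied: wait
                self._cond.wait()
            self._msg = msg
            self._cond.notify_all()      # signal waiting readers

    def get(self):
        with self._cond:
            while self._msg is None:     # slot empty: wait
                self._cond.wait()
            msg, self._msg = self._msg, None
            self._cond.notify_all()      # signal waiting writers
            return msg

box = Mailbox()
t = threading.Thread(target=box.put, args=("hello",))
t.start()
received = box.get()
t.join()
print(received)  # hello
```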


    States in the Monitor

    • Active (running)

    • Waiting (blocked, waiting on a condition)




    Signals in the Monitor

    • When an ACTIVE process issues a signal(), it must allow a blocked process to become active

    • But this would briefly give us 2 ACTIVE processes, which cannot be allowed in a CR

    • So: the process that executes the signal() must be active in order to issue it, and the signal() makes a waiting process become active – these two facts must be reconciled


    Signals

    • Two solutions:

    • Delay the signal

    • Delay the waiting process from becoming active


    Gladiator monitor (Cavers & Brown, 1978)

    • Delay the signaled process, signaling process continues

    • Create a new state (URGENT) to hold the process that has just been signaled. This signals the process but delays execution of the process just signaled.

    • When the signal-er leaves the monitor (or wait()s again), the process in URGENT is allowed to run.


    Mediator (Cavers & Brown, adapted from Hoare, 1974)

    • Delay the signaling process

    • When the process signal()s, it is blocked so the signaled process becomes active right away.

    • With this monitor it may be harder to get the interaction correct. Be warned, especially if you have loops in your CR.


    Tips for Using Monitors

    • Remember that a signal() with no waiting process is simply lost (signals are not counted), so don’t test for or try to count excess signal()s.

    • Don’t intermix with semaphores.

    • Be sure everything shared is declared inside the monitor

    • Carefully think about the process ordering (which monitor you wish to use)


    Deadlocks

    T3                  T4
    Lock-X(B)
    Read(B)
    B = B - 50
    Write(B)
                        Lock-S(A)
                        Read(A)
                        Lock-S(B)
    Lock-X(A)

    Deadlock occurs whenever a transaction T1 holds a lock on an item A and is requesting a lock on an item B, while a transaction T2 holds a lock on item B and is requesting a lock on item A.

    Are T3 and T4 deadlocked here?


    Deadlock:

    T1 is waiting for T2 to release lock on X

    T2 is waiting for T1 to release lock on Y

    Deadlock: graph cycle


    Two strategies:

    Pessimistic: assume deadlock will happen, and therefore use “preventive” measures: deadlock prevention.

    Optimistic: assume deadlock will rarely occur, so wait until it happens and then try to fix it. This requires a mechanism to “detect” a deadlock: deadlock detection.


    Deadlock Prevention

    • Locks:

      • Lock all items before transaction begins execution

      • Either all are locked in one step or none are locked

      • Disadvantages:

        • Hard to predict what data items need to be locked

        • Data-item utilization may be very low


    Detection

    • Circular Wait

      • Graph the resources. If a cycle, you are deadlocked

    • A symptom: no (or reduced) throughput – reduced rather than none, because the deadlock may not involve all users
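Cycle detection in a wait-for graph can be sketched with a depth-first search; the dictionary encoding of the graph is an assumption for illustration:

```python
def has_cycle(wait_for):
    """Detect a cycle in a wait-for graph given as
    {transaction: [transactions it waits for]}."""
    visiting, done = set(), set()

    def dfs(node):
        if node in done:
            return False
        if node in visiting:
            return True                # back edge found: cycle -> deadlock
        visiting.add(node)
        for nxt in wait_for.get(node, []):
            if dfs(nxt):
                return True
        visiting.remove(node)
        done.add(node)
        return False

    return any(dfs(n) for n in wait_for)

deadlocked = has_cycle({"T1": ["T2"], "T2": ["T1"]})  # mutual wait: cycle
fine = has_cycle({"T1": ["T2"], "T2": []})            # T2 waits on nothing
print(deadlocked, fine)  # True False
```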


    Deadlock Recovery

    • Pick a victim and rollback

      • Select a transaction, rollback, and restart

    • What criteria would you use to determine a victim?


    Synchronization is Tricky

    • Forgetting to signal or release a semaphore

    • Blocking while holding a lock

    • Synchronizing on the wrong synchronization mechanism

    • Deadlock

    • Must use locks consistently, and minimize the amount of shared state


    Java

    • The synchronized keyword

    • wait(), notify(), and notifyAll()

    • Code examples


    Java Threads

    • P1 is in the monitor (synchronized block of code)

    • P2 wants to enter the monitor

    • P2 must wait until P1 exits

    • While P2 is waiting, think of it as “waiting at the gate”

    • When P1 finishes, the monitor allows one process waiting at the gate to become active.

    • Leaving the gate is not initiated by P2 – it is a side effect of P1 leaving the monitor


    Big Data


    What does “Big Data” mean?

    • Most everyone thinks “volume”

    • Laney [3] expanded to include velocity and variety


    Defining “Big Data”

    • It’s more than just big – meaning a lot of data

    • Can be viewed as 3 issues

      • Volume

        • Size

      • Velocity

        • How quickly data arrives versus how quickly it can be consumed; response time

      • Variety

        • Diverse sources, formats, quality, structures


    Specific Problems with Big Data

    • I/O Bottlenecks

    • The cost of failure

    • Resource limitations


    I/O Bottlenecks

    • Moore’s Law: Gordon Moore, the co-founder of Intel

    • Observed that transistor counts – and, roughly, processor capability – double about every 2 years (often quoted as 18 months)

      • Regardless …

    • The issue is that I/O, network, and memory speeds have not kept up with processor speeds

    • This creates a huge bottleneck


    Other Issues

    • What are the restart operations if a thread/processor fails?

      • If dealing with “Big Data”, parallelized solutions may not be sufficient because of the high cost of failure

    • Distributed systems involve network communication that brings an entirely different and complex set of problems


    Cost of Failure

    • The failure of many jobs is a problem

      • Can’t just restart because data has been modified

      • Need to roll-back and restart

      • May require human intervention

      • Resource costly (time, lost processor cycles, delayed results)

    • This is especially problematic if a process has been running a very long time


    Using a DBMS for Big Data

    • Due to the volume of data:

      • May overwhelm a traditional DBMS system

      • The data may lack structure to easily integrate into a DBMS system

      • The time or cost to clean/prepare the data for use in a traditional DBMS may be prohibitive

      • Time may be critical. Need to look at today’s online transactions to know how to run business tomorrow


    Memory & Network Resources

    • Might be too much data to use existing storage or software mechanisms

      • Too much data for memory

      • Files too large to realistically distribute over a network

    • Because of the volume, need new approaches


    Would this work?

    • Reduce the data

      • Dimensionality reduction

      • Sampling
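Sampling, in particular, can be done in one pass over data far too large for memory; a sketch of reservoir sampling (Algorithm R), where k is the desired sample size:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of
    unknown length, using only O(k) memory."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)            # fill the reservoir first
        else:
            j = rng.randrange(i + 1)       # keep item with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

sample = reservoir_sample(range(10_000), 5, random.Random(42))
print(len(sample))  # 5
```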


    Weaknesses in Current Architectures

    • Monolithic Servers scale-up

      • Large server farms

      • Buy more equipment as the load increases

    • Distributed systems scale-out

      • Duplicate data across >1 machine or server

      • Remaining problem of efficiency: I/O still the bottleneck because of large file sizes

    • What are other issues with these architectures?


    Needed: New Tools and Approaches

    • Need tools and architectures that are:

      • Able to handle very large amounts of data

      • Available and accessible

      • Robust

      • Simple to use and easy to learn

      • Cost effective


    A New Generation of Tools and Technologies


    Hadoop


    Advantages of Hadoop

    • Can support very large datasets (multi-terabytes)

    • Runs as a cluster using commodity hardware


    Hadoop

    • Open source

    • Derived from Google’s MapReduce and Google File System (GFS) papers

    • We have a Hadoop cluster under development


    File Systems

    • Uses a distributed file system for persistent data storage

    • More efficient than trying to store one file in one location

    • Provides options for recovery if failures occur


    GFS

    • Proprietary

    • Uses commodity machines, not specialized hardware

    • Scalable – easy to increase capacity when needed

    • Fault-tolerant


    DFS

    • File is chunked

    • Typical chunk is 16-64 MB

    • Chunks are replicated across multiple hosts in case of failure

    • When distributing chunks, tries to move copies to different racks (physical location)


    HDFS

    • Hadoop Distributed File System

    • DataNodes – communicate with each other for pipeline file reads/writes

    • Files on the DataNodes are chunked in blocks

    • Copies of blocks appear across several DataNodes (default is 3)

    • The NameNode tracks the DataNodes and the file blocks assigned to each


    How does this differ?

    • Suppose you are to count all words and the number of times each occurs

      • The file may fit in memory

      • The file may be too large for memory but your data structures to store the word counts may fit in memory

      • Neither may fit in memory

    • Need some type of parallelized solution – but we previously looked at the associated problems


    Map Reduce


    Map Reduce

    • Distributes the work over a large set of computers – divide and conquer

    • Has built-in fault tolerance; if one node fails, the failure is detected and the work is sent to another node


    Map/Reduce View for Programmers

    • Map: Maps to a key, emitting a temporary (k,v)

    • Reduce: Receives data arranged by the key; you apply what you need done with the (k,value-list)

    • Example: counting words. Map each key (word) to a count: if you are reading a file, tokenize and emit (theWord, 1)

    • Reduce: you receive a (key, list of counts) – process the list and emit the (key, aggregatedData)


    Credit: aws.typepad.com


    Under the Hood

    • Reader processes split the file and send the assigned blocks to the worker machines

    • A Combiner process may take the Map results and aggregate to optimize performance

      • Example: May aggregate emitted results before sending out over the network

    • A Shuffle and Sort process sits between Map and Reduce to

      • Determine which Reducer should receive the interim result

      • Ensure the keys sent to a Reducer are sorted
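The shuffle-and-sort step can be sketched in a few lines of Python; the hash-based partitioning mirrors the common default, and the function name is invented for illustration:

```python
from collections import defaultdict

def shuffle(mapped_pairs, num_reducers):
    """Group intermediate (key, value) pairs by reducer, with each
    reducer's keys sorted -- the 'shuffle and sort' step."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        # Hash the key so every value for one key lands on the same reducer.
        buckets[hash(key) % num_reducers][key].append(value)
    # Sort each reducer's keys before handing them over.
    return [dict(sorted(b.items())) for b in buckets]

pairs = [("dog", 1), ("cat", 1), ("dog", 1)]
buckets = shuffle(pairs, 2)
total = sum(len(v) for b in buckets for v in b.values())
print(total)  # all 3 values survive, distributed across 2 reducer buckets
```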


    Under the Hood [2]

    • TaskTrackers run the tasks that perform the Map Reduce work

    • A JobTracker controls and schedules the TaskTracker


    Map/Reduce

    • You need to program a Map and Reduce function – plus any other functions you may need for your specific application or problem

      • Map: Input is a <k,v> pair; emit intermediate values consisting of 0 or more <k,v>

      • Reduce: Input is a <k,list-of-values>

    • No data model – data is stored in files


    Example: Word Count

    • Suppose we have a very large file of text data

    • In the Map phase, we will expect lines of the file

    • Remember any one Map function will not see all the data, only the portion of the file assigned to that node

    • How would you construct the Map function?


    Example: Word Count

    • How would you construct the Reduce function?

      • Remember that Reduce will receive a <key, list-of-values> from all the Mappers

    • What will be contained in the list-of-values?

    • What should be emitted?

    • Think back to the Map phase: How could this be made more efficient?


    In Pseudocode

    Map(key, value):
        for each word w in value (a sentence, paragraph, document – whatever)
            emit (w, 1)

    Reduce(key, list of values):
        for each item in the list
            aggregate the values
        emit (result)
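The pseudocode above, filled in as runnable Python with a tiny in-memory driver standing in for the framework (the function and driver names are invented for illustration):

```python
from collections import defaultdict

def map_fn(key, value):
    # key: e.g. a line number; value: the line of text
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    return (key, sum(values))

def run_mapreduce(records):
    """Tiny in-memory driver: map, group intermediate pairs by key,
    then reduce each (key, list-of-values) group."""
    groups = defaultdict(list)
    for k, v in records:
        for ik, iv in map_fn(k, v):
            groups[ik].append(iv)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = run_mapreduce([(0, "the dog chased another dog"),
                        (1, "the cat slept")])
print(counts["dog"], counts["the"])  # 2 2
```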


    Real M-R Example in Python

    • Expects a JSON-format file where record[0] is a book title and record[1] is the text

    • Map breaks the text up by tokens

    • Reduce counts the occurrences in (key, list) that is provided to the function


    Sample Code

    Map:

        def mapper(record):
            value = record[1]            # the text (record[0] is the title – ignored)
            words = value.split()
            for w in words:
                mr.emit_intermediate(w, 1)

    Reduce:

        def reducer(key, list_of_values):
            # list_of_values is the argument provided to the function –
            # the counts for each word (a list of ones in this case)
            total = 0
            for v in list_of_values:
                total += v
            mr.emit((key, total))


    Under the Hood (simplified)

    • Prior to the Map function, the data was split (chunked) and sent to each of the worker nodes (the user does not see this happen)

    • The output of each Map function is grouped by keys (the user does not see this happen either). This grouping is sent to the Reducer as the (key, list of values)

    • Another function (not seen by the user) aggregates the results from the Reducer functions and returns to the main program


    Detecting Failures

    • A Master node pings Workers to determine if a Worker node has failed/crashed

    • The Master waits for all Worker nodes to complete. If it detects that a Worker has failed or is too slow (bottleneck) it reassigns the work to another Worker node

    • It checks for completion and ignores a second Worker reporting completion of the same work (e.g. a Worker was slow and the work was reassigned to another Worker)


    Optimizing

    • You can optimize MR jobs by

      • Preprocessing some of the data

      • Aggregating results in the Map function

        • Example: Map (key, sentence)

        • If counting words, emit one aggregated count per word in the sentence instead of emitting a 1 for each occurrence

        • “The dog chased another dog” → emit (“dog”, 2) once rather than (“dog”, 1) twice

      • The file system does support writes – typically done in the Reduce phase

      • I have also seen examples where users indicate a single Reducer – all work goes to this node where post-processing can be done
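The sentence-level aggregation idea above can be sketched with Python's collections.Counter (the function name is invented for illustration):

```python
from collections import Counter

def map_with_combiner(sentence):
    """Aggregate counts locally before emitting, so 'The dog chased
    another dog' emits ('dog', 2) once instead of ('dog', 1) twice --
    fewer intermediate pairs cross the network."""
    return list(Counter(sentence.lower().split()).items())

emitted = map_with_combiner("The dog chased another dog")
print(dict(emitted)["dog"])  # 2
```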


    Counting Cars

    What is the result?


    NoSQL


    Traditional DBMS

    • Are efficient, reliable, convenient

      • Well understood schemas

      • Query language support

      • Transaction guarantees

    • Enforce security so multiple users can access

    • Can store massive amounts of persistent data


    Traditional DBMS

    • Usually designed from the bottom up

    • Generally maintained by highly skilled DBA

    • Changes to the schema must be carefully managed

    • As the volume grows, more hardware (scale up) is often needed to support

    • Data accessed through a schema, using a structured query language


    Shortcomings of DBMS

    • Not all problems fit into a DBMS data model

    • DBMS may provide more than what we need


    NoSQL

    • NoSQL doesn’t quite mean what the name may imply

      • It means that a traditional SQL database structure may not work for the current problem or data

    • May not have a specific schema

    • May have few restrictions on the data model

    • Most data represented as <k,v> pairs


    Why NoSQL

    • Data may be “unruly”, unstructured

      • May need a flexible schema

    • Time

      • May need something quick and cheap to set up

    • May need to be updated or purged frequently

    • May have too much data

      • Massive scalability


    Example

    • Analyzing web logs

      • We want to find all entries for a given user, URL, or within a time window

    • Cleaning this data, designing a schema, loading into DBMS is time consuming

      • Data may be obsolete tomorrow

    • If we don’t use a DBMS, we can parallelize the solution

    • If concerns about consistency are relaxed, “close enough” may be “good enough”


    NoSQL: Love it or hate it

    • Hate it:

      • It lacks structure

      • Lack of a query language leaves the data difficult to use unless users have knowledge of other tools

      • Not well understood – has not been around very long

      • Significant skill to install and maintain (same is true for DBMS, but that technology has been around a long time)

      • All NoSQL developers are in a “learning mode”

    • Love it:

      • All open source (which could also be a reason to hate it)

      • Don’t need DBA (cost)

      • Flexible

      • Quick and cheap to set up


    Non-DBMS Solutions

    • Map Reduce is an example of a NoSQL solution

      • NoSQL: A SQL-only solution may not fit for all problems

    • But a NoSQL solution (exclusively without a DBMS) may not work either

      • May miss the structure of a schema and the ability to perform SQL-like queries


    Other non-DBMS (NoSQL) Approaches

    • A lack of a schema turned out to be limiting – some problems become more difficult without one

    • Pig and Hive


    Hive

    • Supports a schema

    • A SQL-like query language

    • Compiles to a workload of Hadoop (M/R jobs)


    Pig

    • Supports relational operators

    • Also compiles to a workload of Hadoop (M/R jobs)


    Limitations of Hadoop

    • Not everything is fault-tolerant

      • Prior to the second version, used single-master model

      • No redundancy if the master fails

    • Security

      • Most services are not protected

      • Malicious users can subvert and assume identities

      • Any user can kill another user’s jobs


    References & Credits

    [1] Chuck Lam, “Hadoop in Action”. Manning Publications, 2010.

    [2] Aaron Kimball, video lectures at Google and the University of Washington.

    [3] Doug Laney, “3D Data Management: Controlling Data Volume, Velocity, and Variety”.

    [4] Alex Holmes, “Hadoop in Practice”. Manning Publications.

    [5] Jennifer Widom, Stanford University.

