Handling Big Data

Presentation Transcript

  1. Handling Big Data • Howles • Credits to sources on final slide

  2. Handling Large Amounts of Data • Current approaches are to: • Parallelize – use multiple processors or threads. This can be done on a single machine, whether it has one processor or several • Distribute – use a network to partition work across many computers

  3. Parallelized Operations • This is relatively easy if the task itself can easily be split into units. It still presents some problems, including: • How is the work assigned? • What happens if we have more work units than threads or processors? • How do we know when all work units have completed? • How do we aggregate results in the end? • What do we do if the work can’t be cleanly divided?
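
The questions on this slide can be made concrete with a small sketch. The Java example below is only an illustration, not something taken from the deck: the class name `ParallelSumSketch`, the method `parallelSum`, and the chunking scheme are all assumptions. A fixed-size thread pool queues work units when there are more units than threads, `invokeAll` returns only when every unit has finished, and the partial sums are aggregated at the end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class; names and chunking scheme are illustrative only.
class ParallelSumSketch {

    // Sums data by splitting it into about `units` work units run on `threads` workers.
    static long parallelSum(int[] data, int units, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (data.length + units - 1) / units; // ceiling division
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int start = 0; start < data.length; start += chunk) {
                final int from = start;
                final int to = Math.min(start + chunk, data.length);
                // Each work unit sums its own slice; no state is shared between units.
                tasks.add(() -> {
                    long partial = 0;
                    for (int i = from; i < to; i++) {
                        partial += data[i];
                    }
                    return partial;
                });
            }
            long total = 0;
            // invokeAll blocks until every work unit has completed.
            for (Future<Long> result : pool.invokeAll(tasks)) {
                total += result.get(); // aggregate the partial results
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each unit works on a disjoint slice, this sketch needs no locking. The slides that follow deal with the harder case where work units share data.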

  4. Parallelized Operations • To solve these problems, we need communication mechanisms • We need a synchronization mechanism for communication (timing/notification of events) and to control sharing (mutual exclusion, or mutex)

  5. Why is it needed? • Data consistency • Orderly execution of instructions or activities • Timing – control race conditions

  6. Examples • Two people want to buy the same seat on a flight • Readers and writers • P1 needs a resource but it’s being held by P2 • Two threads updating a single counter • Bounded Buffer • Producer/Consumer • …….

  7. Synchronization Primitives • Review: • A special shared variable used to guarantee atomic operations • Hardware support • The processor may lock the memory bus so that no other reads or writes occur during the operation • Semaphores, monitors, and conditions are examples of language-level synchronization mechanisms

  8. Needed when: • Resources need to be shared • Timing needs to be coordinated • Access data • Send messages or data • Potential race conditions – timing • Difficult to predict • Results in inconsistent, corrupt or destroyed info • Tricky to find; difficult to recreate • Activities need to be synchronized

  9. Producer/Consumer
     Producer: while (count == MAX) NOP; put item in buffer; count++
     Consumer: while (count == 0) NOP; remove item from buffer; count--

  10. Race Conditions • … can result in an incorrect solution • An issue with any shared resource (including devices) • Printer • Writers to a disk

  11. Critical Section • Also called the critical region • Segment of code (or device) for which a process must have exclusive use

  12. Examples of Critical Sections • Updating/reading a shared counter • Controlling access to a device or other resource • Two users want write access to a file

  13. Rules for solutions • Must enforce mutual exclusion (mutex) • Must not postpone a process unless warranted (do not keep a process out of the CR if no other process is in the CR) • Bounded waiting (to enter the CR) • No execution time guarantees

  14. Atomic Operation • Operation is guaranteed to process without interruption • How do we enforce atomic operations?
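
One possible answer to the question on this slide, in Java, is to lean on hardware-supported atomic instructions through `java.util.concurrent.atomic`. The sketch below is an assumption-laden illustration (the class `AtomicCounterSketch` and its method are hypothetical names): several threads update one shared counter, and `incrementAndGet()` performs each read-modify-write as a single atomic step, so no update is lost.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class; not part of the original slides.
class AtomicCounterSketch {

    // Starts `threads` workers that each increment one shared counter.
    static int countWithThreads(int threads, int incrementsPerThread) {
        AtomicInteger counter = new AtomicInteger(0);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    counter.incrementAndGet(); // atomic read-modify-write
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // wait until this worker has finished
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return counter.get();
    }
}
```

With a plain `int` and `counter++` in place of the `AtomicInteger`, the final value could come out lower than expected, which is the "two threads updating a single counter" race from the earlier examples slide.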

  15. Semaphores • Dijkstra, circa 1965 • Two standard operations: wait() and signal() • Older books may still use P() and V(), respectively (or Down() and Up()). You should be familiar with each notation

  16. Semaphores • A semaphore consists of an integer counter and a waiting list of blocked processes • Initialize the counter (the value depends on the application) • wait() decrements the counter and determines if the process must block • signal() increments the counter and determines if a blocked process can unblock
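
The description above can be sketched as a small hand-rolled Java class. This is a simplified variant for illustration only, with assumptions stated plainly: the names `SimpleSemaphore`, `semWait`, `semSignal`, and `available` are hypothetical (`Object.wait()` is final in Java, so the operation cannot literally be called `wait()`), the counter never goes below zero, and the JVM's built-in wait set on the object stands in for the waiting list.

```java
// Hypothetical counting semaphore built on Java's intrinsic lock.
class SimpleSemaphore {
    private int count;

    SimpleSemaphore(int initial) {
        count = initial; // initial value depends on the application
    }

    // wait(): block while no permit is available, then take one.
    synchronized void semWait() throws InterruptedException {
        while (count == 0) {
            wait(); // joins this object's wait set, which acts as the waiting list
        }
        count--;
    }

    // signal(): return a permit and wake one blocked thread, if there is one.
    synchronized void semSignal() {
        count++;
        notify();
    }

    // Number of permits currently available (for inspection only).
    synchronized int available() {
        return count;
    }
}
```

Production code would normally use `java.util.concurrent.Semaphore` rather than a class like this.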

  17. Semaphores • wait() and signal() are atomic operations • What is the other advantage of a semaphore over the previous solutions?

  18. Binary Semaphore • Initialized to one • Allows only one process access at a time
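
As a hedged illustration of a binary semaphore used as a mutex, the sketch below uses the standard `java.util.concurrent.Semaphore` initialized to one. The class `BinarySemaphoreSketch` and its method are hypothetical names, and the shared counter is just a stand-in for any critical section.

```java
import java.util.concurrent.Semaphore;

// Hypothetical example class; not part of the original slides.
class BinarySemaphoreSketch {

    // Each worker increments a plain shared int inside a semaphore-guarded critical section.
    static int run(int threads, int incrementsPerThread) {
        Semaphore mutex = new Semaphore(1); // binary semaphore: initialized to one
        int[] shared = {0};                 // plain, non-atomic shared state
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    mutex.acquireUninterruptibly(); // wait()
                    try {
                        shared[0]++;                // critical section
                    } finally {
                        mutex.release();            // signal()
                    }
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return shared[0];
    }
}
```

Releasing in a `finally` block guards against one of the problems listed later in the deck: forgetting to signal().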

  19. Semaphores • wait() and signal() are usually system calls. Within the kernel, interrupts are disabled to make the counter operations atomic.

  20. Problems with Semaphores • Assume both semaphores are initialized to 1
      Process 0: wait(s); // 1st   wait(q); // 3rd   …….   signal(s); signal(q);
      Process 1: wait(q); // 2nd   wait(s); // 4th   …….   signal(q); signal(s);
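
In the interleaving on this slide, each process ends up holding one semaphore while waiting for the other, so neither can proceed. One common remedy, shown here as a sketch rather than as the deck's own solution, is to have every process acquire the semaphores in the same order. The class and method names are hypothetical.

```java
import java.util.concurrent.Semaphore;

// Hypothetical example class; illustrates consistent acquisition order.
class OrderedAcquireSketch {

    // Runs two workers that both take s first and q second; returns true if both finish.
    static boolean bothFinish() {
        Semaphore s = new Semaphore(1);
        Semaphore q = new Semaphore(1);
        Runnable work = () -> {
            for (int i = 0; i < 1000; i++) {
                s.acquireUninterruptibly(); // always first
                q.acquireUninterruptibly(); // always second
                try {
                    // critical section using both resources would go here
                } finally {
                    q.release();
                    s.release();
                }
            }
        };
        Thread p0 = new Thread(work);
        Thread p1 = new Thread(work);
        p0.start();
        p1.start();
        try {
            p0.join(5000); // bounded wait so a bug cannot hang the caller forever
            p1.join(5000);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return !p0.isAlive() && !p1.isAlive();
    }
}
```

Since no thread ever holds `q` while waiting for `s`, a circular wait cannot form.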

  21. Other problems • Incorrect order • Forgetting to signal() • Incorrect initial value

  22. Monitors • Encapsulates the synchronization with the code • Only one process may be active in the monitor at a time • Waiting processes are blocked (no busy waiting)

  23. Monitors • Condition variables control access to the monitor • Two operations: wait() and signal() (easy to confuse with semaphores, so be careful!) • enter() and leave() or other named functions may be used

  24. Monitors
      if (some condition) call wait() on the monitor
      <<mutex>>
      call signal() on the monitor
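
The pattern on this slide can be sketched with Java's built-in monitors, applied to the bounded buffer from the earlier Producer/Consumer slide. The class `BoundedBufferMonitor` and its methods are hypothetical names. One deliberate difference from the slide: the condition is tested with `while` instead of `if`, because a thread woken in a Java monitor has to re-check the condition (another thread may have run first, and spurious wakeups are permitted).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical monitor-style bounded buffer using synchronized, wait() and notifyAll().
class BoundedBufferMonitor {
    private final Deque<Integer> items = new ArrayDeque<>();
    private final int capacity;

    BoundedBufferMonitor(int capacity) {
        this.capacity = capacity;
    }

    synchronized void put(int item) throws InterruptedException {
        while (items.size() == capacity) {
            wait(); // buffer full: block instead of busy waiting
        }
        items.addLast(item);
        notifyAll(); // signal: a consumer may be able to proceed
    }

    synchronized int take() throws InterruptedException {
        while (items.isEmpty()) {
            wait(); // buffer empty: block instead of busy waiting
        }
        int item = items.removeFirst();
        notifyAll(); // signal: a producer may be able to proceed
        return item;
    }

    // Demo: one producer sends 1..n through the buffer, one consumer sums what it takes.
    static long transferSum(int n, int capacity) {
        BoundedBufferMonitor buffer = new BoundedBufferMonitor(capacity);
        long[] sum = {0};
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) buffer.put(i);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) sum[0] += buffer.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return sum[0];
    }
}
```

Unlike the NOP loops in the earlier Producer/Consumer pseudocode, blocked threads here consume no processor time while they wait.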

  25. States in the Monitor • Active (running) • Waiting (blocked, waiting on a condition)

  26. Examples

  27. Signals in the Monitor • When an ACTIVE process issues a signal(), it must allow a blocked process to become active • That would leave 2 ACTIVE processes, which can’t be allowed in a CR • So the process that executes the signal() must be active in order to issue it, yet the signal() makes a waiting process become active as well

  28. Signals • Two solutions: • Delay the signal • Delay the waiting process from becoming active

  29. Gladiator monitor (Cavers & Brown, 1978) • Delay the signaled process; the signaling process continues • Create a new state (URGENT) to hold the process that has just been signaled. This signals the process but delays its execution. • When the signaler leaves the monitor (or wait()s again), the process in URGENT is allowed to run.

  30. Mediator (Cavers & Brown adapted from Hoare, 1974) • Delay the signaling process • When the process signal()s, it is blocked so the signaled process becomes active right away. • With this monitor it may be more difficult to get the interaction correct. Be warned, especially if you have loops in your CR.

  31. Tips for Using Monitors • Remember that excess signal() instructions don’t matter so don’t test for them or try to count them. • Don’t intermix with semaphores. • Be sure everything shared is declared inside the monitor • Carefully think about the process ordering (which monitor you wish to use)

  32. Deadlocks • Schedule, in time order:
      T3: Lock-X(B); Read(B); B = B - 50; Write(B)
      T4: Lock-S(A); Read(A); Lock-S(B)
      T3: Lock-X(A)
      • Deadlock occurs whenever a transaction T1 holds a lock on an item A and is requesting a lock on an item B, and a transaction T2 holds a lock on item B and is requesting a lock on item A. • Are T3 and T4 deadlocked here?

  33. Deadlock • T1 is waiting for T2 to release its lock on X • T2 is waiting for T1 to release its lock on Y • Deadlock: graph cycle

  34. Two strategies • Pessimistic: deadlock will happen, so use “preventive” measures: deadlock prevention • Optimistic: deadlock will rarely occur, so wait until it happens and then try to fix it. This requires a mechanism to “detect” a deadlock: deadlock detection.

  35. Deadlock Prevention • Locks: • Lock all items before transaction begins execution • Either all are locked in one step or none are locked • Disadvantages: • Hard to predict what data items need to be locked • Data-item utilization may be very low
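
The "all in one step or none" rule can be approximated in Java with `ReentrantLock.tryLock()`. The sketch below is an assumption, not the deck's method: the class `AllOrNothingLocking` and its methods are hypothetical, and a real transaction manager would lock data items, not bare lock objects. The idea is that a caller never holds some locks while blocking on others.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper illustrating all-or-nothing lock acquisition.
class AllOrNothingLocking {

    // Tries to take every lock without blocking. On the first failure it
    // releases the locks already taken and reports false.
    static boolean tryLockAll(List<ReentrantLock> locks) {
        List<ReentrantLock> held = new ArrayList<>();
        for (ReentrantLock lock : locks) {
            if (lock.tryLock()) {
                held.add(lock);
            } else {
                for (ReentrantLock h : held) {
                    h.unlock(); // back out: end up holding none
                }
                return false;
            }
        }
        return true;
    }

    // Releases every lock in the list; call only after tryLockAll returned true.
    static void unlockAll(List<ReentrantLock> locks) {
        for (ReentrantLock lock : locks) {
            lock.unlock();
        }
    }
}
```

A caller that gets `false` would typically wait briefly and retry. Both disadvantages from the slide still apply: the full set of items must be known up front, and items stay locked longer than strictly needed.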

  36. Detection • Circular wait • Graph the resources. If the graph has a cycle, you are deadlocked • No (or reduced) throughput (reduced because the deadlock may not involve all users)
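
Cycle detection on a wait-for graph can be sketched with a depth-first search. In this illustrative Java version, the class `WaitForGraphSketch`, the method `hasDeadlock`, and the string transaction names are all assumptions. An edge from T1 to T2 means "T1 is waiting for a lock held by T2".

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical wait-for graph checker; not part of the original slides.
class WaitForGraphSketch {

    // waitsFor maps each transaction to the transactions it is waiting on.
    static boolean hasDeadlock(Map<String, List<String>> waitsFor) {
        Set<String> done = new HashSet<>();   // fully explored, known cycle-free
        Set<String> onPath = new HashSet<>(); // nodes on the current DFS path
        for (String node : waitsFor.keySet()) {
            if (dfs(node, waitsFor, done, onPath)) {
                return true;
            }
        }
        return false;
    }

    private static boolean dfs(String node, Map<String, List<String>> graph,
                               Set<String> done, Set<String> onPath) {
        if (onPath.contains(node)) {
            return true; // reached a node already on the path: cycle
        }
        if (done.contains(node)) {
            return false;
        }
        onPath.add(node);
        List<String> next = graph.getOrDefault(node, Collections.emptyList());
        for (String waitedOn : next) {
            if (dfs(waitedOn, graph, done, onPath)) {
                return true;
            }
        }
        onPath.remove(node);
        done.add(node);
        return false;
    }
}
```

For the two-transaction case on the earlier slide, a graph with edges T1 → T2 and T2 → T1 is reported as deadlocked.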

  37. Deadlock Recovery • Pick a victim and roll back • Select a transaction, roll it back, and restart it • What criteria would you use to determine a victim?

  38. Synchronization is Tricky • Forgetting to signal or release a semaphore • Blocking while holding a lock • Synchronizing on the wrong synchronization mechanism • Deadlock • Must use locks consistently, and minimize the amount of shared resources

  39. Java • The synchronized keyword • wait(), notify(), and notifyAll() • Code examples

  40. Java Threads • P1 is in the monitor (synchronized block of code) • P2 wants to enter the monitor • P2 must wait until P1 exits • While P2 is waiting, think of it as “waiting at the gate” • When P1 finishes, monitor allows one process waiting at the gate to become active. • Leaving the gate is not initiated by P2 – it is a side effect of P1 leaving the monitor

  41. Big Data

  42. What does “Big Data” mean? • Most everyone thinks “volume” • Laney [3] expanded to include velocity and variety

  43. Defining “Big Data” • It’s more than just big – meaning a lot of data • Can be viewed as 3 issues • Volume: size • Velocity: how quickly it arrives vs. how quickly it is consumed, or response time • Variety: diverse sources, formats, quality, structures

  44. Specific Problems with Big Data • I/O Bottlenecks • The cost of failure • Resource limitations

  45. I/O Bottlenecks • Moore’s Law: Gordon Moore, a co-founder of Intel • Stated that the number of transistors on a chip (and, loosely, processor capability) roughly doubles every 2 years (often quoted as 18 months) • Regardless … • The issue is that I/O, network, and memory speeds have not kept up with processor speeds • This creates a huge bottleneck

  46. Other Issues • What are the restart operations if a thread/processor fails? • If dealing with “Big Data”, parallelized solutions may not be sufficient because of the high cost of failure • Distributed systems involve network communication that brings an entirely different and complex set of problems

  47. Cost of Failure • The failure of many jobs is a problem • Can’t just restart because data has been modified • Need to roll back and restart • May require human intervention • Costly in resources (time, lost processor cycles, delayed results) • This is especially problematic if a process has been running a very long time

  48. Using a DBMS for Big Data • Due to the volume of data: • May overwhelm a traditional DBMS system • The data may lack structure to easily integrate into a DBMS system • The time or cost to clean/prepare the data for use in a traditional DBMS may be prohibitive • Time may be critical. Need to look at today’s online transactions to know how to run business tomorrow

  49. Memory & Network Resources • Might be too much data to use existing storage or software mechanisms • Too much data for memory • Files too large to realistically distribute over a network • Because of the volume, need new approaches

  50. Would this work? • Reduce the data • Dimensionality reduction • Sampling
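
As one possible way to sample data that is too large to hold in memory, the sketch below uses reservoir sampling (Algorithm R). This technique is not named in the slides and is offered only as an illustration; the class `ReservoirSamplingSketch` and its method signature are hypothetical. It keeps a uniform random sample of `k` items from a stream while reading each item once.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical example class; reservoir sampling is an assumed choice of sampling method.
class ReservoirSamplingSketch {

    // Returns up to k items chosen uniformly at random from the stream,
    // using O(k) memory regardless of how long the stream is.
    static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(item); // fill the reservoir with the first k items
            } else {
                // Pick j uniformly from [0, seen); the item is kept with probability k / seen.
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, item);
                }
            }
        }
        return reservoir;
    }
}
```

Whether a sample is good enough depends on the question being asked of the data, which is the point of the slide's "Would this work?".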