Handling Big Data Howles Credits to Sources on Final Slide
Handling Large Amounts of Data • Current approaches are to: • Parallelize – use multiple processors or threads. This can be a single-processor machine or a machine with multiple processors • Distribute – use a network to partition work across many computers
Parallelized Operations • This is relatively easy if the task itself can easily be split into units. It still presents some problems, including: • How is the work assigned? • What happens if we have more work units than threads or processors? • How do we know when all work units have completed? • How do we aggregate results in the end? • What do we do if the work can’t be cleanly divided?
Parallelized Operations • To solve these problems, we need communication mechanisms • Need a synchronization mechanism for communication (timing/notification of events) and to control sharing (mutual exclusion, or mutex)
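One possible way to answer several of these questions in Java is a fixed-size thread pool. The sketch below is only an illustration: the class name `ParallelSum`, the chunk size, and the choice of summing integers are assumptions, not part of the slides. The pool queues work units when there are more units than threads, `Future.get()` tells us when each unit has completed, and the calling thread aggregates the partial results. The last chunk may be shorter than the others, which covers the simplest case of work that does not divide evenly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelSum {
    // Splits the array into chunks, sums each chunk on a pool thread,
    // then adds up the partial sums on the calling thread.
    static long sum(int[] data, int threads, int chunkSize) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (int start = 0; start < data.length; start += chunkSize) {
                final int from = start;
                final int to = Math.min(start + chunkSize, data.length);
                Callable<Long> unit = () -> {
                    long s = 0;
                    for (int i = from; i < to; i++) s += data[i];
                    return s;
                };
                // Units beyond the number of threads simply wait in the pool's queue.
                partials.add(pool.submit(unit));
            }
            long total = 0;
            // get() blocks until that unit is done, so the loop ends
            // only when every work unit has completed.
            for (Future<Long> f : partials) total += f.get();
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] data = new int[1000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(sum(data, 4, 64));
    }
}
```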
Why is it needed? • Data consistency • Orderly execution of instructions or activities • Timing – control race conditions
Examples • Two people want to buy the same seat on a flight • Readers and writers • P1 needs a resource but it’s being held by P2 • Two threads updating a single counter • Bounded Buffer • Producer/Consumer • …….
Synchronization Primitives • Review: • A special shared variable used to guarantee atomic operations • Hardware support • Processor may lock the memory bus so that no other reads/writes occur during the operation • Semaphores, monitors, conditions are examples of language-level synchronization mechanisms
Needed when: • Resources need to be shared • Timing needs to be coordinated • Access data • Send messages or data • Potential race conditions – timing • Difficult to predict • Results in inconsistent, corrupt or destroyed info • Tricky to find; difficult to recreate • Activities need to be synchronized
Producer/Consumer
Producer:
    while count == MAX
        NOP
    put in buffer
    count++
Consumer:
    while count == 0
        NOP
    remove from buffer
    count--
Race Conditions • … can result in an incorrect solution • An issue with any shared resource (including devices) • Printer • Writers to a disk
Critical Section • Also called the critical region • Segment of code (or device) for which a process must have exclusive use
Examples of Critical Sections • Updating/reading a shared counter • Controlling access to a device or other resource • Two users want write access to a file
Rules for solutions • Must enforce mutual exclusion (mutex) • Must not postpone a process if not warranted (e.g., must not exclude it from the CR when no other process is in the CR) • Bounded waiting (to enter the CR) • No assumptions about execution time or relative speed
Atomic Operation • Operation is guaranteed to process without interruption • How do we enforce atomic operations?
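As one illustration of an enforced atomic operation, Java's `java.util.concurrent.atomic.AtomicInteger` provides an increment that cannot be interrupted partway through. The sketch below applies it to the "two threads updating a single counter" case; the class name `CounterDemo` and its `run` helper are made up for this example. With a plain `int` and `counter++`, the final value could come out lower than expected because increments can be lost.

```java
import java.util.concurrent.atomic.AtomicInteger;

class CounterDemo {
    // Starts `threads` workers that each increment a shared counter
    // `incrementsPerThread` times, then returns the final value.
    static int run(int threads, int incrementsPerThread) {
        AtomicInteger counter = new AtomicInteger(0);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    counter.incrementAndGet(); // atomic read-modify-write
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join(); // wait for all workers to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 100_000));
    }
}
```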
Semaphores • Dijkstra, circa 1965 • Two standard operations: wait() and signal() • Older books may still use P() and V(), respectively (or Down() and Up()). You should be familiar with any notation
Semaphores • A semaphore consists of an integer counter and a waiting list of blocked processes • Initialize the counter (depends on application) • wait() decrements the counter and determines if the process must block • signal() increments the counter and determines if a blocked process can unblock
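A minimal sketch of this structure in Java, under some assumptions: the class `SimpleSemaphore` and the method names `semWait()` and `semSignal()` are hypothetical (Java reserves `wait()` on every object, so the slide's names cannot be reused directly), and the JVM's built-in wait set for the object stands in for the waiting list. This variant never lets the counter go negative; a process blocks while the counter is zero instead. Production code would normally use `java.util.concurrent.Semaphore`.

```java
class SimpleSemaphore {
    private int count; // the integer counter

    SimpleSemaphore(int initial) {
        count = initial;
    }

    // wait(): block while no permit is available, then take one.
    synchronized void semWait() {
        try {
            while (count == 0) {
                wait(); // caller joins this object's wait set (the "waiting list")
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        count--;
    }

    // signal(): return a permit and wake one blocked caller, if any.
    synchronized void semSignal() {
        count++;
        notify();
    }

    synchronized int available() {
        return count;
    }
}
```

Because both methods are `synchronized`, the counter update and the block/unblock decision happen as one indivisible step from the point of view of other threads.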
Semaphores • wait() and signal() are atomic operations • What is the other advantage of a semaphore over the previous solutions?
Binary Semaphore • Initialized to one • Allows only one process access at a time
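As an illustration, the "two people want to buy the same seat" example can be guarded by a binary semaphore. This sketch uses the real `java.util.concurrent.Semaphore` class initialized to one; the class `SeatSale` and its methods are invented for the example.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

class SeatSale {
    private final Semaphore mutex = new Semaphore(1); // binary semaphore
    private boolean sold = false;                     // shared state guarded by mutex

    // Returns true only for the one buyer who actually gets the seat.
    boolean tryBuy() {
        mutex.acquireUninterruptibly(); // plays the role of wait()
        try {
            if (sold) return false;
            sold = true;
            return true;
        } finally {
            mutex.release();            // plays the role of signal()
        }
    }

    // Lets `buyers` threads compete for a single seat and counts the winners.
    static int buyersWhoSucceeded(int buyers) {
        SeatSale seat = new SeatSale();
        AtomicInteger successes = new AtomicInteger();
        Thread[] threads = new Thread[buyers];
        for (int i = 0; i < buyers; i++) {
            threads[i] = new Thread(() -> {
                if (seat.tryBuy()) successes.incrementAndGet();
            });
            threads[i].start();
        }
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return successes.get();
    }
}
```

Releasing in a `finally` block is one way to guard against the "forgetting to signal()" mistake.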
Semaphores • wait() and signal() are usually system calls. Within the kernel, interrupts are disabled to make the counter operations atomic.
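Problems with Semaphores
Assume both semaphores are initialized to 1.
Process 0:
    wait(s);     // 1st
    wait(q);     // 3rd
    …….
    signal(s);
    signal(q);
Process 1:
    wait(q);     // 2nd
    wait(s);     // 4th
    …….
    signal(q);
    signal(s);
With the numbered interleaving, Process 0 holds s and waits for q while Process 1 holds q and waits for s: deadlock.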
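One common remedy for this kind of problem is to have every process acquire the semaphores in the same order. The sketch below is an assumption about how that could look in Java, not something prescribed by the slides; `OrderedSemaphores`, `work()` and `run()` are illustrative names. Both threads take `s` before `q`, so neither can hold `q` while waiting for `s`.

```java
import java.util.concurrent.Semaphore;

class OrderedSemaphores {
    static final Semaphore s = new Semaphore(1);
    static final Semaphore q = new Semaphore(1);
    static int shared = 0; // protected by holding both s and q

    // Every caller uses the same order: s first, then q.
    static void work() {
        s.acquireUninterruptibly();
        q.acquireUninterruptibly();
        try {
            shared++;
        } finally {
            q.release();
            s.release();
        }
    }

    // Runs two "processes" that each call work() `iterations` times.
    static int run(int iterations) {
        shared = 0;
        Runnable body = () -> {
            for (int i = 0; i < iterations; i++) work();
        };
        Thread p0 = new Thread(body);
        Thread p1 = new Thread(body);
        p0.start();
        p1.start();
        try {
            p0.join();
            p1.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return shared;
    }
}
```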
Other problems • Incorrect order • Forgetting to signal() • Incorrect initial value
Monitors • Encapsulates the synchronization with the code • Only one process may be active in the monitor at a time • Waiting processes are blocked (no busy waiting)
Monitors • Condition variables control access to the monitor • Two operations: wait() and signal() (easy to confuse with semaphores, so be careful!) • enter() and leave() or other named functions may be used
Monitors
    if (some condition)
        call wait() on the monitor
    <<mutex>>
    call signal() on the monitor
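In Java, this pattern maps roughly onto `synchronized` methods with `wait()` and `notifyAll()`. The bounded buffer below is a sketch under that assumption; `BoundedBuffer`, `awaitChange()` and `transfer()` are names made up for the example. The condition is tested in a `while` loop rather than an `if`, because a thread that wakes up in Java is not guaranteed that the condition it waited for still holds.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BoundedBuffer {
    private final Deque<Integer> items = new ArrayDeque<>();
    private final int capacity;

    BoundedBuffer(int capacity) {
        this.capacity = capacity;
    }

    // Producer side: wait while the buffer is full.
    synchronized void put(int value) {
        while (items.size() == capacity) awaitChange();
        items.addLast(value);
        notifyAll(); // wake any consumer waiting for an item
    }

    // Consumer side: wait while the buffer is empty.
    synchronized int take() {
        while (items.isEmpty()) awaitChange();
        int value = items.removeFirst();
        notifyAll(); // wake any producer waiting for space
        return value;
    }

    // Called only from synchronized methods, so the monitor lock is held.
    private void awaitChange() {
        try {
            wait();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Sends 1..n through a buffer of the given capacity and returns the sum received.
    static long transfer(int n, int capacity) {
        BoundedBuffer buffer = new BoundedBuffer(capacity);
        long[] sum = new long[1];
        Thread producer = new Thread(() -> {
            for (int i = 1; i <= n; i++) buffer.put(i);
        });
        Thread consumer = new Thread(() -> {
            for (int i = 0; i < n; i++) sum[0] += buffer.take();
        });
        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum[0];
    }
}
```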
States in the Monitor • Active (running) • Waiting (blocked, waiting on a condition)
Signals in the Monitor • When an ACTIVE process issues a signal(), it must allow a blocked process to become active • This would leave 2 ACTIVE processes in the monitor, which can’t be allowed in a CR. • So – the first process must be active in order to issue the signal(), and the signal() will make a waiting process become active. One of the two has to be delayed.
Signals • Two solutions: • Delay the signal • Delay the waiting process from becoming active
Gladiator monitor (Cavers & Brown, 1978) • Delay the signaled process; the signaling process continues • Create a new state (URGENT) to hold the process that has just been signaled. This signals the process but delays its execution. • When the signaler leaves the monitor (or wait()s again), the process in URGENT is allowed to run.
Mediator (Cavers & Brown adapted from Hoare, 1974) • Delay the signaling process • When the process signal()s, it is blocked so the signaled process becomes active right away. • Correct interaction may be more difficult to achieve with this monitor. Be warned, especially if you have loops in your CR.
Tips for Using Monitors • Remember that excess signal() instructions don’t matter so don’t test for them or try to count them. • Don’t intermix with semaphores. • Be sure everything shared is declared inside the monitor • Carefully think about the process ordering (which monitor you wish to use)
Deadlocks
T3                  T4
Lock-X(B)
Read(B)
B=B-50
Write(B)
                    Lock-S(A)
                    Read(A)
                    Lock-S(B)
Lock-X(A)
Deadlock occurs whenever a transaction T1 holds a lock on an item A and is requesting a lock on an item B and a transaction T2 holds a lock on item B and is requesting a lock on item A. Are T3 and T4 deadlocked here?
Deadlock • T1 is waiting for T2 to release its lock on X • T2 is waiting for T1 to release its lock on Y • Deadlock: the waits form a cycle in the graph
Two strategies: • Pessimistic: deadlock will happen, so use “preventive” measures: deadlock prevention • Optimistic: deadlock will rarely occur, so wait until it happens and then try to fix it. This requires a mechanism to “detect” a deadlock: deadlock detection.
Deadlock Prevention • Locks: • Lock all items before transaction begins execution • Either all are locked in one step or none are locked • Disadvantages: • Hard to predict what data items need to be locked • Data-item utilization may be very low
Detection • Circular wait • Graph the resources. If the graph contains a cycle, you are deadlocked • Symptom: no (or reduced) throughput (reduced rather than none because the deadlock may not involve all users)
Deadlock Recovery • Pick a victim and roll back • Select a transaction, roll it back, and restart it • What criteria would you use to determine a victim?
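The cycle check can be sketched as a depth-first search over a "who waits for whom" graph. This is a minimal illustration with assumed names (`WaitForGraph`, `addWait`, `hasDeadlock`), and it assumes each edge means one transaction is blocked until the other releases a lock.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class WaitForGraph {
    // waiter -> list of transactions it is waiting on
    private final Map<String, List<String>> waitsFor = new HashMap<>();

    void addWait(String waiter, String holder) {
        waitsFor.computeIfAbsent(waiter, k -> new ArrayList<>()).add(holder);
    }

    // True if any chain of waits leads back to where it started.
    boolean hasDeadlock() {
        Set<String> done = new HashSet<>();   // fully explored, known cycle-free
        Set<String> onPath = new HashSet<>(); // nodes on the current DFS path
        for (String t : waitsFor.keySet()) {
            if (dfs(t, done, onPath)) return true;
        }
        return false;
    }

    private boolean dfs(String t, Set<String> done, Set<String> onPath) {
        if (onPath.contains(t)) return true; // reached a node already on this path: cycle
        if (done.contains(t)) return false;
        onPath.add(t);
        for (String next : waitsFor.getOrDefault(t, Collections.emptyList())) {
            if (dfs(next, done, onPath)) return true;
        }
        onPath.remove(t);
        done.add(t);
        return false;
    }
}
```

For example, adding the edges T1 waits for T2 and T2 waits for T1 makes `hasDeadlock()` return true, while either edge alone does not.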
Synchronization is Tricky • Forgetting to signal or release a semaphore • Blocking while holding a lock • Synchronizing on the wrong synchronization mechanism • Deadlock • Must use locks consistently, and minimize amount of shared resources
Java • synchronized keyword • wait(), notify(), and notifyAll() • Code examples
Java Threads • P1 is in the monitor (synchronized block of code) • P2 wants to enter the monitor • P2 must wait until P1 exits • While P2 is waiting, think of it as “waiting at the gate” • When P1 finishes, monitor allows one process waiting at the gate to become active. • Leaving the gate is not initiated by P2 – it is a side effect of P1 leaving the monitor
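The "gate" behavior can be observed with a small experiment. In this sketch (the class `GateDemo` and its methods are hypothetical), each thread records how many threads are inside the `synchronized` block at once. Because other threads wait at the gate until the current one leaves, the recorded maximum should never exceed one.

```java
class GateDemo {
    private final Object monitor = new Object();
    private int inside = 0;    // threads currently in the synchronized block
    private int maxInside = 0; // highest value of `inside` ever seen

    void visit() {
        synchronized (monitor) { // others wait here, "at the gate"
            inside++;
            maxInside = Math.max(maxInside, inside);
            Thread.yield();      // invite a context switch while still inside
            inside--;
        }                        // leaving lets one waiting thread in
    }

    // Runs several threads through visit() and reports the peak occupancy.
    static int maxOccupancy(int threads, int visitsPerThread) {
        GateDemo gate = new GateDemo();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < visitsPerThread; i++) gate.visit();
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return gate.maxInside;
    }
}
```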
What does “Big Data” mean? • Most everyone thinks “volume” • Laney expanded the definition to include velocity and variety
Defining “Big Data” • It’s more than just big – meaning a lot of data • Can be viewed as 3 issues • Volume • Size • Velocity • How quickly data arrives versus how quickly it is consumed; response time • Variety • Diverse sources, formats, quality, structures
Specific Problems with Big Data • I/O Bottlenecks • The cost of failure • Resource limitations
I/O Bottlenecks • Moore’s Law: Gordon Moore, a co-founder of Intel • Observed that the number of transistors on a chip (and with it, roughly, processor capability) doubles about every 2 years (often quoted as 18 months) • Regardless … • The issue is that I/O, network, and memory speeds have not kept up with processor speeds • This creates a huge bottleneck
Other Issues • What are the restart operations if a thread/processor fails? • If dealing with “Big Data”, parallelized solutions may not be sufficient because of the high cost of failure • Distributed systems involve network communication that brings an entirely different and complex set of problems
Cost of Failure • The failure of many jobs is a problem • Can’t just restart because data has been modified • Need to roll-back and restart • May require human intervention • Resource costly (time, lost processor cycles, delayed results) • This is especially problematic if a process has been running a very long time
Using a DBMS for Big Data • Due to the volume of data: • May overwhelm a traditional DBMS • The data may lack the structure needed to integrate easily into a DBMS • The time or cost to clean/prepare the data for use in a traditional DBMS may be prohibitive • Time may be critical. Need to look at today’s online transactions to know how to run the business tomorrow
Memory & Network Resources • Might be too much data to use existing storage or software mechanisms • Too much data for memory • Files too large to realistically distribute over a network • Because of the volume, need new approaches
Would this work? • Reduce the data • Dimensionality reduction • Sampling
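To make the sampling option concrete, here is a sketch of reservoir sampling, one technique for drawing a fixed-size uniform sample from a stream too large to hold in memory. The slides do not name a specific sampling method, so this choice, along with the names `ReservoirSampler` and `sample`, is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

class ReservoirSampler {
    // Keeps a uniform random sample of up to k items while reading the
    // stream once; only k items are ever held in memory.
    // Uses an int counter, so it assumes fewer than 2^31 items.
    static <T> List<T> sample(Iterator<T> stream, int k, Random rnd) {
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(item);        // fill the reservoir first
            } else {
                int j = rnd.nextInt(seen);  // uniform in [0, seen)
                if (j < k) {
                    reservoir.set(j, item); // replace with probability k/seen
                }
            }
        }
        return reservoir;
    }
}
```

Whether a sample is good enough depends on the question being asked; rare events, for instance, may not appear in the sample at all.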