Multitasking and Parallelism

Kristopher Windsor

CS 147, Fall 2008


Table of contents

  • Parallel processing on one core

  • Multicore usage, difficulties, and next steps

  • Alternatives to multicore CPUs

  • Multicore benchmarks


Optimizing each clock cycle

  • Multiple instructions and/or multiple data elements can be processed each cycle, which makes batch processing more efficient

  • For example, MMX (a SIMD extension to x86) applies one instruction to several packed data elements at once

  • Vector architecture is similar to SIMD, but its speed comes from pipelined operations on data streamed through vector registers, not from many ALUs working at once
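
The idea of processing multiple data elements with one operation can be sketched in plain Python. The example below packs eight 8-bit values into one 64-bit integer and adds all eight lanes with a single integer addition, roughly what a packed byte add such as MMX's PADDB does in hardware. The function names and the masking trick are illustrative assumptions, not something taken from the slides.

```python
LOW7 = 0x7F7F7F7F7F7F7F7F   # low seven bits of each 8-bit lane
HIGH1 = 0x8080808080808080  # top bit of each 8-bit lane


def pack(lanes):
    """Pack eight values in 0..255 into one 64-bit integer."""
    if len(lanes) != 8:
        raise ValueError("expected exactly eight lanes")
    return int.from_bytes(bytes(lanes), "little")


def unpack(word):
    """Split a 64-bit integer back into its eight 8-bit lanes."""
    return list(word.to_bytes(8, "little"))


def packed_add_u8(a, b):
    """Add eight 8-bit lanes at once; each lane wraps modulo 256."""
    # Add the low 7 bits of every lane; the carry can reach bit 7 of its
    # own lane but never spills into the neighbouring lane.
    low = (a & LOW7) + (b & LOW7)
    # Fold in the top bit of each lane without generating a carry out.
    return low ^ ((a ^ b) & HIGH1)


x = [1, 2, 3, 250, 128, 255, 0, 77]
y = [10, 20, 30, 10, 128, 1, 0, 200]
result = unpack(packed_add_u8(pack(x), pack(y)))
```

Here `result` matches adding each pair of elements separately and wrapping at 256, but the work is done by one addition on the packed word instead of eight.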


Hardware multithreading

  • Lets several threads share one core's functional units, which helps whenever there are more threads than cores

  • There are multiple ways for a core to switch to a different thread

    • Fine-grained multithreading: switch every cycle

    • Coarse-grained multithreading: switch only when the current thread stalls (e.g. while it waits for data to come back from RAM)

    • Simultaneous multithreading (SMT): multiple threads are processed each cycle
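
The difference between the first two switching policies can be shown with a toy scheduler. This is a minimal sketch under simplifying assumptions: each thread is a list of instruction labels, a stall is modeled only as a reason to switch (its duration is ignored), and the function names are hypothetical.

```python
from collections import deque


def fine_grained(threads):
    """Issue one instruction per cycle, rotating to the next thread every cycle."""
    queues = deque(deque(t) for t in threads if t)
    order = []
    while queues:
        current = queues.popleft()
        order.append(current.popleft())
        if current:                      # thread still has work: back of the line
            queues.append(current)
    return order


def coarse_grained(threads, stalls):
    """Stay on one thread until it issues a stalling instruction or finishes."""
    queues = deque(deque(t) for t in threads if t)
    order = []
    while queues:
        current = queues.popleft()
        while current:
            op = current.popleft()
            order.append(op)
            if op in stalls:             # e.g. a load that misses the cache
                break
        if current:
            queues.append(current)
    return order


threads = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
fine_order = fine_grained(threads)
coarse_order = coarse_grained(threads, stalls={"a2"})
```

With these inputs the fine-grained policy alternates between the two threads on every instruction, while the coarse-grained policy runs thread a until `a2` stalls, runs thread b to completion, and only then returns to `a3`.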


Reasons for multiple cores and processors

  • Clock speed limits for each core due to heat

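    • Heat produced rises steeply with clock speed (dynamic power is proportional to capacitance × voltage² × frequency, and higher frequencies generally require higher voltage), and cooling methods are limited

    • This limit has already been reached, so a single faster core is no longer enough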

  • Power efficiency

    • Smaller CPU designs can be optimized better

    • Individual cores or processors can be turned off when not needed
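
The power argument can be made concrete with the standard approximation that dynamic power scales with voltage squared times frequency. The sketch below compares one core at full speed with two slower cores. The 85% figures and the assumption that voltage can be lowered in step with frequency are illustrative choices, not numbers from the slides.

```python
def relative_dynamic_power(voltage_scale, frequency_scale, cores=1):
    """Dynamic power relative to one baseline core, using P ~ V^2 * f per core.

    Capacitance is assumed to be unchanged, so it cancels out of the ratio.
    """
    return cores * voltage_scale ** 2 * frequency_scale


one_fast_core = relative_dynamic_power(1.0, 1.0)
# Hypothetical design point: two cores, each at 85% voltage and 85% clock.
two_slow_cores = relative_dynamic_power(0.85, 0.85, cores=2)
```

Under these assumptions the two slower cores draw roughly 1.23 times the power of the single fast core while offering up to 1.7 times its peak throughput, provided the workload can actually be split across both.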


Two types of multicore use

Job-level parallelism

  • Each process can only use one core

  • Easier to code

  • Most programs are written like this

  • Inefficient when you have multiple cores but only one main program

Parallel processing program

  • Each process can have multiple threads, which run on different cores

  • Harder to code

  • Used in OS, which has many independent tasks, and in web servers, where each request can be handled separately

  • Best use of multiple cores
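
A parallel processing program splits one job into pieces that separate threads can work on. The following is a minimal sketch of that structure using Python's standard `concurrent.futures` module; `chunk_sum` and `parallel_sum` are hypothetical names. In CPython the global interpreter lock keeps these threads from running Python bytecode on several cores at once, so the example shows the shape of such a program rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_sum(bounds):
    """Sum the integers in the half-open range [start, stop)."""
    start, stop = bounds
    return sum(range(start, stop))


def parallel_sum(n, workers=4):
    """Sum 0..n-1 by splitting the range into at most one chunk per worker."""
    if n <= 0:
        return 0
    step = -(-n // workers)  # ceiling division
    chunks = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is independent, so the partial sums need no locking.
        return sum(pool.map(chunk_sum, chunks))


total = parallel_sum(1000)
```

Because each worker only touches its own chunk and the partial results are combined at the end, no shared variable is modified concurrently.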


Problem: Parallel processing: Game programming dilemma

  • A software-rendered display accounts for most of the game’s CPU usage (i.e. more than the physics calculations), and the graphics output cannot naturally be split into multiple threads

  • With 3D hardware-accelerated graphics, rendering is typically the performance bottleneck, and since a video card’s GPU renders 50x or more faster than a CPU can, multicore CPUs will not help

  • In games where every object can collide with every other object, physics cannot be parallelized easily because any two collisions may need to access the same memory

  • Every event has to happen in order, but code running in parallel does not naturally preserve that order


Problem: Parallel processing: Complexity

Sequential

    Dim Shared As Integer total

    Sub program ()
        'this part can be done several times at once
        'because it does not depend on
        'other parts of the program
        Dim As Integer addme = 0
        For i As Integer = 1 To 10000
            addme += 1
        Next i

        'accesses a global variable
        total += addme
    End Sub

    For i As Integer = 1 To 100
        program()
    Next i

Concurrent

    Dim Shared As Integer total
    Dim Shared As Any Ptr mutex

    'ThreadCreate expects a Sub that takes one Any Ptr parameter
    Sub program (ByVal userdata As Any Ptr)
        Dim As Integer addme = 0
        For i As Integer = 1 To 10000
            addme += 1
        Next i

        'only one thread at a time may update the shared total
        MutexLock(mutex)
        total += addme
        MutexUnlock(mutex)
    End Sub

    mutex = MutexCreate()

    'start 100 threads, passing the Sub's address (no call parentheses)
    Dim As Any Ptr threads(1 To 100)
    For i As Integer = 1 To 100
        threads(i) = ThreadCreate(@program)
    Next i

    'wait for every thread to finish before cleaning up
    For i As Integer = 1 To 100
        ThreadWait(threads(i))
    Next i

    MutexDestroy(mutex)


Problem: Parallel processing: Cache coherence

  • Each processor has its own cache

  • If one processor changes the memory, the other processors may have the wrong data cached

  • Snooping protocol: every cache monitors the shared bus, and when one processor changes the data, every other processor must remove (invalidate) its copy

  • AMD’s MOESI protocol: every cache block is in one of these five states: modified, owned, exclusive, shared, or invalid
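
The invalidation step of a snooping protocol can be sketched as a small simulation. The classes below are hypothetical and heavily simplified: memory is written through on every store, there is one value per address, and the five MOESI states are not modeled. The point is only that a write by one cache removes the stale copies held by the others.

```python
class Bus:
    """Shared bus plus main memory; every attached cache snoops writes."""

    def __init__(self):
        self.memory = {}
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_invalidate(self, address, writer):
        # Every cache except the writer drops its copy of the address.
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(address, None)


class Cache:
    """Private cache of one processor."""

    def __init__(self, bus):
        self.lines = {}
        self.bus = bus
        bus.attach(self)

    def read(self, address):
        if address not in self.lines:            # miss: fetch from memory
            self.lines[address] = self.bus.memory.get(address, 0)
        return self.lines[address]

    def write(self, address, value):
        self.bus.broadcast_invalidate(address, writer=self)
        self.lines[address] = value
        self.bus.memory[address] = value         # write-through, for simplicity


bus = Bus()
cache0, cache1 = Cache(bus), Cache(bus)
before = (cache0.read(16), cache1.read(16))      # both caches now hold address 16
cache0.write(16, 7)                              # cache1's copy is invalidated
stale_copy_removed = 16 not in cache1.lines
after = cache1.read(16)                          # miss, refetched from memory
```

Without the invalidation broadcast, `cache1` would keep returning the old value for address 16 after `cache0` changed it, which is exactly the wrong-data problem described above.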


Amdahl’s law

  • Adding several cores to a machine will provide limited speed improvements, because the overall speedup is bounded by the parts of the system that have not been upgraded

  • In this example, adding cores allows more FLOPs, but not more data transfer
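
Amdahl's law is usually written as speedup = 1 / ((1 - p) + p / n), where p is the fraction of execution time that benefits from the improvement and n is the improvement factor (here, the number of cores). The function below is a minimal sketch of that formula; the 90% figure in the usage lines is an illustrative assumption, not a measurement from the slides.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Overall speedup when only `parallel_fraction` of the run time scales with cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical workload: 90% of the time scales with cores, 10% does not.
speedup_8_cores = amdahl_speedup(0.9, 8)
speedup_many_cores = amdahl_speedup(0.9, 1_000_000)
```

With 90% of the work scalable, 8 cores give a speedup of about 4.7, and no number of cores can push it to 10, because the unimproved 10% always remains.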


Parallel processing: next steps

  • Intel is developing 6- and 8-core processors (Westmere and Nehalem)

  • Tilera produces 64-core chips (TILE64) with an architecture made for many cores

    • Removes the bus data-transfer bottleneck

    • Saves power by powering off individual cores

    • Comes with developer tools for making parallel processing programs


Alternative architecture: the GPU

CPU

  • Slowly adopting multiple cores

  • Caches exploit locality

  • Needs low-latency RAM

GPU

  • Naturally better suited to parallelism, and uses heavy multithreading to achieve performance

    • The GeForce 8800 GTX has 16 multiprocessors, each with 8 multithreaded floating-point processors (16 * 8 = 128 in total)

  • Does not rely on locality; uses coarse-grained hardware multithreading to minimize time lost waiting on memory

  • Needs high-bandwidth RAM


Alternative architecture: clusters

Costs

  • Maintenance and storage costs for each machine

  • Each machine runs its own copy of the operating system, which takes up some of its RAM

  • Resources such as RAM cannot be shared well among machines

Benefits

  • Can be built with mass-produced computers and standard LAN hardware

  • Can reach sizes beyond the limits of current multicore chips

  • Can be spread over multiple physical locations

    • Gives your company more bandwidth than any one ISP offers

    • Provides redundancy in case of fire or power outage

  • Can be upgraded without replacing the current hardware


Benchmarks

  • Sparse Matrix-Vector multiplication test and the Lattice-Boltzmann Magneto-Hydrodynamics test give different results

  • Fewer FLOPs per core when there are many cores

    • Upgrading from 2 cores to 4 may have little effect

  • Certain processors are better for certain applications (e.g. Xeon)

    • Multicores demand new methods of software optimization



The end