Lecture 6: Multicore Systems


Multicore Computers (chip multiprocessors)

Combine two or more processors (cores) on a single piece of silicon

Each core consists of an ALU, registers, pipeline hardware, and L1 instruction and data caches

Multithreading is used


Pollack’s Rule

  • Performance increase is roughly proportional to the square root of the increase in complexity

    performance ∝ √complexity

  • Power consumption increase is roughly linearly proportional to the increase in complexity

    power consumption ∝ complexity


Pollack’s Rule

complexity   power   performance
1            1       1
4            4       2
25           25      5

100s of low-complexity cores, each operating at very low power

Ex: Four small cores

complexity   power   performance
4 x 1        4 x 1   4
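
The arithmetic in the table above can be checked with a short sketch. The function names are hypothetical, the square root is computed by Newton's iteration only so that the block builds without linking the math library, and the multi-core figures assume perfectly parallel work, which is an idealization.

```c
#include <assert.h>

/* Square root by Newton's iteration (avoids a dependency on libm). */
static double newton_sqrt(double x)
{
    assert(x >= 0.0);
    if (x == 0.0)
        return 0.0;
    double g = x > 1.0 ? x : 1.0;      /* starting guess at or above the root */
    for (int i = 0; i < 60; i++)
        g = 0.5 * (g + x / g);
    return g;
}

/* Pollack's Rule: performance of one core grows as sqrt(complexity). */
double core_performance(double complexity)
{
    return newton_sqrt(complexity);
}

/* Power consumption of one core grows linearly with complexity. */
double core_power(double complexity)
{
    return complexity;
}

/* Aggregate figures for n identical cores.
 * Assumption: the workload is perfectly parallel. */
double chip_performance(int n_cores, double complexity)
{
    return n_cores * core_performance(complexity);
}

double chip_power(int n_cores, double complexity)
{
    return n_cores * core_power(complexity);
}
```

Under these assumptions, one core of complexity 4 and four cores of complexity 1 draw the same power, but the four small cores deliver about twice the throughput, which is the point of the "four small cores" row.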


Increasing CPU Performance

Manycore Chip

  • Composed of hybrid cores

    • Some general purpose

    • Some graphics

    • Some floating point


Exascale Systems

  • Board composed of multiple manycore chips sharing memory

  • Rack composed of multiple boards

  • A room full of these racks

    Millions of cores

    Exascale systems (10^18 Flop/s)


Moore’s Law Reinterpreted

  • Number of cores per chip doubles every 2 years

  • Number of threads of execution doubles every 2 years
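
As a rough illustration of the doubling rule above, the following sketch projects a core count forward in time. The function name and the starting values are hypothetical, and the projection is only the slide's rule of thumb applied mechanically, not a forecast.

```c
#include <assert.h>

/* Projected core count if the number of cores doubles every 2 years.
 * Only completed 2-year periods count. */
unsigned long long cores_after(unsigned long long cores_now, unsigned years)
{
    unsigned doublings = years / 2;
    assert(doublings < 64);            /* keep the shift well defined */
    return cores_now << doublings;     /* multiply by 2^doublings */
}
```

For example, `cores_after(4, 10)` applies five doublings to a 4-core chip.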


Shared Memory MIMD

Shared memory

Single address space

All processes have access to the pool of shared memory

[Figure: four processors (P) connected through a bus to a shared memory]


Shared Memory MIMD

Each processor executes different instructions asynchronously, using different data

[Figure: four control unit (CU) / processing element (PE) pairs, each with its own instruction and data streams, connected to a shared memory]
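
A minimal sketch of the MIMD idea in C with OpenMP (the same model the STREAM code later in this lecture uses): two unrelated functions run as separate sections, so different instruction streams work on different data in one shared address space. The function names are hypothetical. Without OpenMP enabled, the pragmas are ignored and the sections simply run one after the other.

```c
#include <assert.h>
#include <stddef.h>

/* Instruction stream 1: integer summation. */
long sum_ints(const int *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Instruction stream 2: floating-point maximum. */
double max_double(const double *v, size_t n)
{
    assert(n > 0);
    double m = v[0];
    for (size_t i = 1; i < n; i++)
        if (v[i] > m)
            m = v[i];
    return m;
}

/* Run both streams, possibly on different cores, writing results into
 * the shared address space through the two output pointers. */
void run_mimd(const int *iv, size_t ni, const double *dv, size_t nd,
              long *sum_out, double *max_out)
{
    assert(sum_out != NULL && max_out != NULL);
    #pragma omp parallel sections
    {
        #pragma omp section
        *sum_out = sum_ints(iv, ni);

        #pragma omp section
        *max_out = max_double(dv, nd);
    }
}
```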


Symmetric Multiprocessors (SMP)

MIMD

Shared memory

UMA (uniform memory access)

[Figure: two processors, each with private L1 and L2 caches, connected by a system bus to main memory and I/O devices]


Symmetric Multiprocessors (SMP)

Characteristics:

Two or more similar processors

Processors share the same memory and I/O facilities

Processors are connected by bus or other internal connection scheme, such that memory access time is the same for each processor

All processors share access to I/O devices

All processors can perform the same functions

The system is controlled by an integrated operating system that provides interaction between processors and their programs


Symmetric Multiprocessors (SMP)

Operating system:

Provides tools and functions to exploit the parallelism

Schedules processes or threads across all of the processors

Takes care of:

  • scheduling of threads and processes on processors

  • synchronization among processors
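
From the program's side, synchronization shows up whenever threads on different processors update the same memory location. The sketch below is one possible illustration using OpenMP, not a description of what the operating system does internally. The function name is hypothetical. The atomic update keeps concurrent increments of the shared counter from being lost.

```c
#include <assert.h>
#include <stddef.h>

/* Count the even values in v[0..n-1] using all available processors. */
long count_even(const int *v, long n)
{
    assert(v != NULL || n == 0);
    long count = 0;                    /* shared by all threads */

    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        if (v[i] % 2 == 0) {
            /* Without this, two threads could read the same old value
             * of count and one of the increments would be lost. */
            #pragma omp atomic
            count++;
        }
    }
    return count;
}
```

A reduction clause would usually be faster for a counter like this. The atomic form is shown only because it makes the synchronization point explicit.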


Multicore Computers

Dedicated L1 Cache (ARM11 MPCore)

[Figure: CPU cores 1 to n, each with its own L1-I and L1-D caches, in front of a single L2 cache, main memory, and I/O]


Multicore Computers

Dedicated L2 Cache (AMD Opteron)

[Figure: CPU cores 1 to n, each with its own L1-I, L1-D, and L2 caches, connected to main memory and I/O]


Multicore Computers

Shared L2 Cache (Intel Core Duo)

[Figure: CPU cores 1 to n, each with its own L1-I and L1-D caches, sharing one L2 cache that connects to main memory and I/O]


Multicore Computers

Shared L3 Cache (Intel Core i7)

[Figure: CPU cores 1 to n, each with its own L1-I, L1-D, and L2 caches, sharing one L3 cache that connects to main memory and I/O]


Multicore Computers

Advantages of Shared L2 cache

  • Reduced overall miss rate: a thread on one core may cause a frame to be brought into the cache, and a thread on another core may then access the same location without missing

  • Data shared by multiple cores is not replicated

  • The amount of shared cache allocated to each core may be dynamic

  • Interprocessor communication is easy to implement

Advantages of Dedicated L2 cache

  • Each core can access its private cache more rapidly

L3 cache

  • As the amount of memory and the number of cores grow, an L3 cache provides better performance


Multicore Computers

On-chip interconnects

Bus

Crossbar

Off-chip communication (CPU-to-CPU or I/O):

Bus-based


Multicore Computers

Multithreading

A multithreaded processor provides a separate program counter (PC) for each thread (hardware multithreading)

Implicit multithreading

Concurrent execution of multiple threads extracted from a single sequential program

Explicit multithreading

Execute instructions from different explicit threads by interleaving instructions from different threads on shared or parallel pipelines


Multicore Computers

Explicit Multithreading

Fine-grained multithreading (Interleaved multithreading)

Processor deals with two or more thread contexts at a time

Switching from one thread to another at each clock cycle

Coarse-grained multithreading (Blocked multithreading)

Instructions of a thread are executed sequentially until an event that causes a delay (e.g., a cache miss) occurs

This event causes a switch to another thread

Simultaneous multithreading (SMT)

Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor

Thread-level parallelism is combined with instruction-level parallelism (ILP)

Chip multiprocessing (CMP)

Each processor of a multicore system handles separate threads
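
The difference between the interleaved and blocked policies can be seen in a toy scheduler. This is a simplified sketch under stated assumptions: each thread is a string with one character per instruction, `'M'` marks an instruction that misses in the cache, one instruction issues per cycle, and miss latency is not modelled. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_THREADS 8

/* Write the issuing thread's index ('0', '1', ...) for every cycle into
 * order, which must hold the total instruction count plus one byte. */
static void schedule(const char *progs[], int n, int fine_grained, char *order)
{
    assert(n > 0 && n <= MAX_THREADS);
    size_t pc[MAX_THREADS] = {0};      /* one program counter per thread */
    size_t remaining = 0;
    for (int t = 0; t < n; t++)
        remaining += strlen(progs[t]);

    int cur = 0;
    size_t k = 0;
    while (remaining > 0) {
        while (progs[cur][pc[cur]] == '\0')   /* skip finished threads */
            cur = (cur + 1) % n;

        char insn = progs[cur][pc[cur]++];
        order[k++] = (char)('0' + cur);
        remaining--;

        /* Interleaved: switch every cycle. Blocked: switch only after
         * an instruction that causes a delay. */
        if (fine_grained || insn == 'M')
            cur = (cur + 1) % n;
    }
    order[k] = '\0';
}

void fine_grained_order(const char *progs[], int n, char *order)
{
    schedule(progs, n, 1, order);
}

void coarse_grained_order(const char *progs[], int n, char *order)
{
    schedule(progs, n, 0, order);
}
```

With the two threads `"aaMa"` and `"bbb"`, the interleaved policy alternates between them every cycle, while the blocked policy stays on thread 0 until its miss, runs thread 1 to completion, and then returns to thread 0.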


Coarse-grained, Fine-grained, Simultaneous Multithreading, CMP


GPUs (Graphics Processing Units)

Characteristics of GPUs

GPUs are accelerators for CPUs

SIMD

GPUs have many parallel processors and many concurrent threads (e.g., 10 or more cores; 100s or 1000s of threads per core)

The CPU-GPU combination is an example of heterogeneous computing

GPGPU (general-purpose GPU): using a GPU to run computations traditionally handled by the CPU
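
The SIMD style that GPUs favour can be suggested in plain C: one small operation applied independently to every element of an array. The sketch below is a CPU-side stand-in and not GPU code. The names are hypothetical, and the `omp simd` pragma is only a hint that the compiler may vectorize the loop (it is ignored when OpenMP is disabled).

```c
#include <assert.h>
#include <stddef.h>

/* The per-element work. On a GPU, each thread would typically execute
 * this body for one index. */
static inline float saxpy_element(float a, float x, float y)
{
    return a * x + y;
}

/* y[i] = a * x[i] + y[i] for all i: the same instruction sequence over
 * many independent data elements. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    assert((x != NULL && y != NULL) || n == 0);
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = saxpy_element(a, x[i], y[i]);
}
```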


GPUs

Core Complexity

Out-of-order execution

Dynamic branch prediction

Deeper pipelines for higher clock rates

→ More circuitry

→ High performance


GPUs

Complex cores are preferable for:

  • Numeric applications with high instruction-level parallelism

  • Floating-point applications

A large number of simple cores is preferable when:

  • The application's serial part is small
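
Why the size of the serial part matters can be made concrete with Amdahl's law, which the slides do not name but which is the standard way to express this trade-off. The function name below is hypothetical.

```c
#include <assert.h>

/* Amdahl's law: speedup on n cores when a fraction serial_fraction of
 * the work cannot be parallelized. */
double amdahl_speedup(double serial_fraction, int n_cores)
{
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(n_cores >= 1);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores);
}
```

With no serial part, 4 cores give a speedup of 4. With a serial fraction of 0.5, the speedup can never reach 2 however many simple cores are added, which is the case where a few complex cores are the better choice.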


Cache Performance

  • Intel Core i7


Roofline Performance Model

Arithmetic intensity is the ratio of floating-point operations in a program to the number of data bytes accessed by the program from main memory

Arithmetic intensity = floating-point operations / number of data bytes = FLOPs/Byte


Roofline Performance Model

Attainable GFLOPs/second = min(Peak memory bandwidth x Arithmetic intensity, Peak floating-point performance)


Roofline Performance Model

Peak floating-point performance is given by the hardware specifications of the computer (FLOPs/second)

For multicore chips, peak performance is the collective performance of all the cores on the chip. For a system with multiple chips, multiply the peak per chip by the number of chips

Peak memory performance is also given by the hardware specifications of the computer (Mbytes/second)

The maximum floating-point performance that the memory system of the computer can support for a given arithmetic intensity can be plotted as

Peak memory bandwidth x Arithmetic intensity

(bytes/second) x (FLOPs/byte) ==> FLOPs/second
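
The roofline bound is simply the smaller of the two limits, which a few lines of C make explicit. The function name and the units chosen (GFLOP/s and GB/s) are assumptions made for the sketch. Any consistent pair of units works.

```c
#include <assert.h>

/* Attainable performance under the roofline model.
 *   peak_gflops : peak floating-point performance (GFLOP/s)
 *   peak_bw_gbs : peak memory bandwidth (GB/s)
 *   intensity   : arithmetic intensity (FLOPs/byte) */
double roofline_gflops(double peak_gflops, double peak_bw_gbs, double intensity)
{
    assert(peak_gflops >= 0.0 && peak_bw_gbs >= 0.0 && intensity >= 0.0);

    /* (GB/s) x (FLOPs/byte) = GFLOP/s: what memory can keep fed. */
    double memory_bound = peak_bw_gbs * intensity;

    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

For a hypothetical machine with a 16 GFLOP/s peak and 8 GB/s of bandwidth, a kernel at 0.5 FLOPs/byte is limited by memory to 4 GFLOP/s, while a kernel at 4 FLOPs/byte is limited by the 16 GFLOP/s compute peak.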


Roofline Performance Model

Roofline sets an upper bound on performance

Roofline of a computer does not vary by benchmark kernel


Stream Benchmark

A synthetic benchmark

Measures the performance of long vector operations

These operations have no temporal locality, and they access arrays that are larger than the cache size

http://www.cs.virginia.edu/stream/ref.html

#define N 2000000

...

/* Copy: c[j] = a[j] */
void tuned_STREAM_Copy() {
    int j;
    #pragma omp parallel for
    for (j = 0; j < N; j++)
        c[j] = a[j];
}

/* Scale: b[j] = scalar * c[j] */
void tuned_STREAM_Scale(double scalar) {
    int j;
    #pragma omp parallel for
    for (j = 0; j < N; j++)
        b[j] = scalar * c[j];
}

/* Add: c[j] = a[j] + b[j] */
void tuned_STREAM_Add() {
    int j;
    #pragma omp parallel for
    for (j = 0; j < N; j++)
        c[j] = a[j] + b[j];
}

/* Triad: a[j] = b[j] + scalar * c[j] */
void tuned_STREAM_Triad(double scalar) {
    int j;
    #pragma omp parallel for
    for (j = 0; j < N; j++)
        a[j] = b[j] + scalar * c[j];
}
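
The four kernels above tie back to the roofline model through their arithmetic intensity. The sketch below tabulates floating-point operations and bytes moved per loop iteration, assuming 8-byte doubles and counting one access per array element named in the statement (extra traffic such as write-allocate fills is ignored). The struct and function names are hypothetical.

```c
#include <assert.h>

/* Per-iteration cost of one STREAM kernel. */
struct kernel_cost {
    const char *name;
    int flops;     /* floating-point operations per iteration */
    int bytes;     /* bytes loaded plus bytes stored per iteration */
};

static const struct kernel_cost stream_kernels[] = {
    {"Copy",  0, 16},   /* c[j] = a[j]:               1 load,  1 store */
    {"Scale", 1, 16},   /* b[j] = scalar * c[j]:      1 load,  1 store */
    {"Add",   1, 24},   /* c[j] = a[j] + b[j]:        2 loads, 1 store */
    {"Triad", 2, 24},   /* a[j] = b[j] + scalar*c[j]: 2 loads, 1 store */
};

/* Arithmetic intensity in FLOPs/byte. */
double arithmetic_intensity(const struct kernel_cost *k)
{
    assert(k != 0 && k->bytes > 0);
    return (double)k->flops / k->bytes;
}
```

All four intensities are well below 1 FLOP/byte, so on most machines these kernels sit on the memory-bandwidth side of the roofline, which is why they are used to measure sustainable bandwidth.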

