
Parallel Architectures

Martino Ruggiero

[email protected]


Why Multicores?

The SPECint performance of the hottest chip grew by 52% per year from 1986 to 2002, and then grew only 20% in the next three years (about 6% per year).

[from Patterson & Hennessy]

Diminishing returns from uniprocessor designs


Power Wall

[from Patterson & Hennessy]

  • The design goal for the late 1990’s and early 2000’s was to drive the clock rate up.

    • This was done by adding more transistors to a smaller chip.

  • Unfortunately, this increased the power dissipation of the CPU chip beyond the capacity of inexpensive cooling techniques


Roadmap for CPU Clock Speed: Circa 2005

[from Patterson & Hennessy]

This was the best projection in 2005: by 2015, the clock speed of the top “hot chip” was expected to be in the 12–15 GHz range.


The CPU Clock Speed Roadmap (A Few Revisions Later)

[from Patterson & Hennessy]

This reflects the practical experience gained with dense chips that were literally “hot”: they radiated considerable thermal power and were difficult to cool.

Law of Physics: All electrical power consumed is eventually radiated as heat.


The Multicore Approach

Multiple cores on the same chip

  • Simpler

  • Slower

  • Less power demanding


The Memory Gap

[from Patterson & Hennessy]

  • Bottom line: memory access is increasingly expensive, and computer architects must devise new ways of hiding this cost


Transition to Multicore

Figure: sequential application performance over time.


Parallel Architectures

  • Definition: “A parallel architecture is a collection of processing elements that cooperate and communicate to solve large problems fast”

  • Questions about parallel architectures:

    • How many processing elements are there?

    • How powerful are processing elements?

    • How do they cooperate and communicate?

    • How are data transmitted?

    • What type of interconnection?

    • What are HW and SW primitives for programmer?

    • Does it translate into performance?


Flynn Taxonomy of parallel computers

M. J. Flynn, "Very High-Speed Computing Systems", Proc. of the IEEE, vol. 54, no. 12, pp. 1901–1909, Dec. 1966.

  • Flynn's Taxonomy provides a simple, but very broad, classification of computer architectures:

    • Single Instruction, Single Data (SISD)

      • A single processor with a single instruction stream, operating sequentially on a single data stream.

    • Single Instruction, Multiple Data (SIMD)

      • A single instruction stream is broadcast to every processor, all processors execute the same instructions in lock-step on their own local data stream.

    • Multiple Instruction, Multiple Data (MIMD)

      • Each processor can independently execute its own instruction stream on its own local data stream.

  • SISD machines are the traditional single-processor, sequential computers - also known as Von Neumann architecture, as opposed to “non-Von" parallel computers.

  • SIMD machines are synchronous, with more fine-grained parallelism - they run a large number of parallel processes, one for each data element in a parallel vector or array.

  • MIMD machines are asynchronous, with more coarse-grained parallelism - they run a smaller number of parallel processes, one for each processor, operating on large chunks of data local to each processor.


Single Instruction/Single Data Stream: SISD

  • Sequential computer

  • No parallelism in either the instruction or data streams

  • Examples of SISD architecture are traditional uniprocessor machines



Multiple Instruction/Single Data Stream: MISD

  • Computer that exploits multiple instruction streams against a single data stream for data operations that can be naturally parallelized

    • For example, certain kinds of array processors

  • No longer commonly encountered, mainly of historical interest only


Single Instruction/Multiple Data Stream: SIMD

  • Computer that exploits multiple data streams against a single instruction stream, for operations that may be naturally parallelized

    • e.g., SIMD instruction extensions or Graphics Processing Unit (GPU)

  • Single control unit

  • Multiple datapaths (processing elements – PEs) running in parallel

    • PEs are interconnected and exchange/share data as directed by the control unit

    • Each PE performs the same operation on its own local data
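To make the SIMD model concrete, here is a minimal C sketch (an illustration, not taken from the slides) using the x86 SSE intrinsics: one instruction operates on four packed 32-bit floats at a time, which is exactly the "same operation on multiple data elements" idea above.

    #include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

    /* c[i] = a[i] + b[i]; for brevity, n is assumed to be a multiple of 4. */
    void add_simd(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_add_ps(va, vb);    /* one instruction, four additions in lock-step */
            _mm_storeu_ps(&c[i], vc);          /* store 4 results */
        }
    }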


Multiple Instruction/Multiple Data Streams: MIMD

  • Multiple autonomous processors simultaneously executing different instructions on different data.

  • MIMD architectures include multicore and Warehouse Scale Computers (datacenters)


Parallel Computing Architectures: Memory Model

Figure: memory organizations. Processors sharing one centralized memory over an interconnection give UMA (Uniform Memory Access), the symmetric multiprocessor (SMP). Processors with physically distributed local memories but a shared address space give NUMA (Non-Uniform Memory Access), the distributed-shared-memory multiprocessor. Processors with physically distributed memories and private address spaces, communicating by explicit send/receive over the interconnection, give the MPP (Massively Parallel Processors), a message-passing (shared-nothing) multiprocessor.

Parallel Architecture = Computer Architecture + Communication Architecture

  • Question: how do we organize and distribute memory in a multicore architecture?

  • 2 classes of multiprocessors with respect to memory:

    • Centralized-memory multiprocessor

    • Physically distributed-memory multiprocessor

  • 2 classes of multiprocessors with respect to addressing:

    • Shared

    • Private


Memory Performance Metrics

  • Latency is the overhead in setting up a connection between processors for passing data.

    • This is the most crucial problem for all parallel architectures - obtaining good performance over a range of applications depends critically on low latency for accessing remote data.

  • Bandwidth is the amount of data per unit time that can be passed between processors.

    • This needs to be large enough to support efficient passing of large amounts of data between processors, as well as collective communications, and I/O for large data sets.

  • Scalability is how well latency and bandwidth scale with the addition of more processors.

    • This is usually only a problem for architectures with very many cores.
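A simple first-order model ties the first two metrics together: transfer time ≈ latency + message size / bandwidth. The C sketch below is purely illustrative (the numbers are made up) and shows how small messages are latency-bound while large ones are bandwidth-bound.

    #include <stdio.h>

    /* First-order communication cost model: time = latency + bytes / bandwidth. */
    double transfer_time(double latency_s, double bandwidth_bytes_per_s, double bytes)
    {
        return latency_s + bytes / bandwidth_bytes_per_s;
    }

    int main(void)
    {
        /* Illustrative numbers only: 1 microsecond latency, 10 GB/s link. */
        printf("64 B message:  %.3g s (latency-bound)\n",   transfer_time(1e-6, 10e9, 64.0));
        printf("64 MB message: %.3g s (bandwidth-bound)\n", transfer_time(1e-6, 10e9, 64.0e6));
        return 0;
    }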


Distributed Shared Memory Architecture: NUMA

  • Data set is distributed among processors:

    • each processor accesses only its own data from local memory

    • if data from another section of memory (i.e. another processor) is required, it is obtained by a remote access.

  • Much larger latency for accessing non-local data, but can scale to large numbers (thousands) of processors for many applications.

    • Advantage: scalability

    • Disadvantage: locality problems and network congestion

  • The aggregated memory of the whole system appears as one single address space.

Figure: processors P1–P3, each with a local memory M1–M3, plus a host processor, connected by a communication network.


Distributed Memory—Message Passing Architectures

Figure: processors (P), each with its own memory (Mem) and a network interface (NI), connected by an interconnect network.

  • Each processor is connected to exclusive local memory

    • i.e. no other CPU has direct access to it

  • Each node comprises at least one network interface (NI) that mediates the connection to a communication network.

  • Each CPU runs a serial process that can communicate with processes on other CPUs by means of the network (see the MPI sketch below).

  • Non-blocking vs. Blocking communication

  • MPI Problems:

    • All data layout must be handled by software

    • Message passing has high software overhead
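As a concrete example of this model, the sketch below uses MPI (a common message-passing library; the rank numbers, tag, and payload are arbitrary): each process owns its private memory, and data moves only through explicit, in this case blocking, send/receive calls.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;   /* lives only in rank 0's private memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* blocking send to rank 1 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

The non-blocking counterparts (MPI_Isend/MPI_Irecv plus MPI_Wait) let communication overlap with computation, which is one way to attack the software overhead noted above.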


Shared Memory Architecture: UMA

Figure: processors, each with its own primary and secondary cache, connected by a bus to a global memory.

  • Each processor has access to all the memory, through a shared memory bus and/or communication network

    • Memory bandwidth and latency are the same for all processors and all memory locations.

  • Lower latency for accessing non-local data, but difficult to scale to large numbers of processors, usually used for small numbers (order 100 or less) of processors.
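By contrast with message passing, in the shared-memory model every thread sees the same address space, so parallelization can be expressed without any explicit data movement. The OpenMP sketch below is illustrative (compile with an OpenMP-enabled compiler, e.g. with -fopenmp).

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    static double a[N];

    int main(void)
    {
        double sum = 0.0;

        /* All threads read and write the same shared array 'a' directly: no messages needed. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = 0.5 * i;
            sum += a[i];
        }

        printf("sum = %f (max threads = %d)\n", sum, omp_get_max_threads());
        return 0;
    }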


Shared Memory Candidates

Figure: three candidate points at which to share memory: a shared main memory (each processor keeps private primary and secondary caches), a shared secondary cache (private primary caches only), and a shared primary cache (all processors behind one cache), each backed by a global memory.

  • Caches are used to reduce latency and to lower bus traffic

  • Must provide hardware to ensure that caches and memory are consistent (cache coherency)

  • Must provide a hardware mechanism to support process synchronization


Challenge of Parallel Processing

  • Two biggest performance challenges in using multiprocessors

    • Insufficient parallelism

      • The problem of inadequate application parallelism must be attacked primarily in software with new algorithms that can have better parallel performance.

    • Long-latency remote communication

      • Reducing the impact of long remote latency can be attacked both by the architecture and by the programmer.


Amdahl's Law

  • Speedup due to enhancement E is:

    Speedup(E) = Exec time without E / Exec time with E

  • Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected. Then:

    Exec time with E = Exec time without E × [(1 - F) + F/S]

    Speedup(E) = 1 / [(1 - F) + F/S]


Amdahl's Law

Example: the execution time of half of the program can be accelerated by a factor of 2. What is the overall program speed-up?

    Speedup = 1 / [(1 - F) + F/S]        ((1 - F): non-sped-up part, F/S: sped-up part)
            = 1 / (0.5 + 0.5/2)
            = 1 / (0.5 + 0.25)
            = 1.33
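The same calculation can be written directly as code; this small C sketch (illustrative) evaluates Amdahl's formula, reproduces the 1.33 result above, and shows how quickly the un-enhanced fraction starts to dominate.

    #include <stdio.h>

    /* Amdahl's law: speedup = 1 / ((1 - F) + F / S),
       where F is the enhanced fraction and S is the speedup of that fraction. */
    double amdahl_speedup(double F, double S)
    {
        return 1.0 / ((1.0 - F) + F / S);
    }

    int main(void)
    {
        printf("F = 0.5, S = 2   -> speedup = %.2f\n", amdahl_speedup(0.5, 2.0));    /* 1.33 */
        printf("F = 0.9, S = 100 -> speedup = %.2f\n", amdahl_speedup(0.9, 100.0));  /* ~9.2: limited by the 10% serial part */
        return 0;
    }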


Amdahl's Law

If the portion of the program that can be parallelized is small, then the speedup is limited.

The non-parallel portion limits the performance.



Strong and Weak Scaling

  • To get good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.

    • Strong scaling: when speedup can be achieved on a parallel processor without increasing the size of the problem

    • Weak scaling: when speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors

Increasing the problem size is needed to amortize sources of OVERHEAD (additional code, not present in the original sequential program, needed to execute the program in parallel); the sketch below compares the two regimes.
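To see the difference numerically, the sketch below compares fixed-problem-size (strong scaling) speedup, given by Amdahl's law, with scaled-problem-size (weak scaling) speedup, here modeled with Gustafson's law. Gustafson's formula is not stated on the slides and is used only as an illustration, with p the parallel fraction.

    #include <stdio.h>

    /* p = parallel fraction of the work, n = number of processors. */
    double strong_speedup(double p, int n) { return 1.0 / ((1.0 - p) + p / n); }  /* Amdahl: fixed problem size    */
    double weak_speedup(double p, int n)   { return (1.0 - p) + p * n; }          /* Gustafson: scaled problem size */

    int main(void)
    {
        double p = 0.95;
        for (int n = 4; n <= 256; n *= 4)
            printf("n = %3d   strong = %6.1f   weak = %6.1f\n",
                   n, strong_speedup(p, n), weak_speedup(p, n));
        return 0;
    }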


Symmetric Shared-Memory Architectures

Figure: processors, each with its own primary and secondary cache, connected by a bus to a global memory (the UMA organization shown earlier).

Symmetric shared-memory machines usually support the caching of both shared and private data.

Private data are used by a single processor, while shared data are used by multiple processors.

When a private item is cached, its location is migrated to the cache, reducing the average access time as well as the memory bandwidth required. Since no other processor uses the data, the program behavior is identical to that in a uniprocessor.

When shared data are cached, the shared value may be replicated in multiple caches. This replication reduces access latency and also reduces the contention that may exist for shared data items that are being read by multiple processors simultaneously.

Caching of shared data, however, introduces a new problem: cache coherence.


Example Cache Coherence Problem

Figure: three processors P1, P2 and P3, each with its own cache ($), on a bus with memory and I/O devices. Memory initially holds u = 5; P1 and P3 read u into their caches (events 1 and 2), P3 then writes u = 7 (event 3), and the later reads of u (events 4 and 5) may return different values.

  • Cores see different values for u after event 3

  • With write-back caches, the value written back to memory depends on which cache flushes or writes back its value, and when

  • Unacceptable for programming, and it is frequent!


Keeping Multiple Caches Coherent

  • Architect’s job: shared memory => keep cache values coherent

  • Idea: When any processor has cache miss or writes, notify other processors via interconnection network

    • If only reading, many processors can have copies

    • If a processor writes, invalidate all other copies

  • Shared written result can “ping-pong” between caches
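The slides do not name a specific protocol; as one concrete instance of the write-invalidate idea above, here is a C sketch of the state transitions of a basic three-state MSI snoopy protocol (types and functions are hypothetical, and bus signalling and data movement are omitted).

    /* Per-cache-line state for a basic MSI write-invalidate snoopy protocol (illustrative). */
    typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

    /* Transition taken on a local processor read or write to the line. */
    line_state_t on_processor_access(line_state_t s, int is_write)
    {
        if (is_write)
            return MODIFIED;             /* write: obtain exclusive ownership; other copies get invalidated */
        return (s == INVALID) ? SHARED   /* read miss: fetch the line; other caches may keep shared copies */
                              : s;       /* read hit: state unchanged */
    }

    /* Transition taken when this cache snoops another processor's request for the same line. */
    line_state_t on_snooped_request(line_state_t s, int other_is_write)
    {
        if (other_is_write)
            return INVALID;              /* another core wants to write: invalidate our copy */
        if (s == MODIFIED)
            return SHARED;               /* another core reads: supply the dirty data and downgrade to shared */
        return s;
    }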


Shared Memory Multiprocessor

Figure: processors M1, M2 and M3, each with a snoopy cache, on a memory bus together with physical memory and a DMA engine connected to disks.

Use snoopy mechanism to keep all processors’ view of memory coherent


Example: Write-thru Invalidate

Figure: the same three-processor example; when P3 writes u = 7, the write goes through to memory and the other cached copies of u are invalidated, so the later reads return 7.

  • Must invalidate before step 3

  • Write-update uses more broadcast-medium bandwidth, so all recent SMP multicores use write-invalidate


Need for a more scalable protocol

  • Snoopy schemes do not scale because they rely on broadcast

  • Hierarchical snoopy schemes have the root as a bottleneck

  • Directory-based schemes allow scaling

    • They avoid broadcasts by keeping track of all CPUs caching a memory block, and then using point-to-point messages to maintain coherence

    • They allow the flexibility to use any scalable point-to-point network


Scalable Approach: Directories

  • Every memory block has associated directory information

    • keeps track of copies of cached blocks and their states

    • on a miss, find directory entry, look it up, and communicate only with the nodes that have copies if necessary

    • in scalable networks, communication with directory and copies is through network transactions

  • Many alternatives for organizing directory information


Basic Operation of Directory

• With k processors, each cache block in memory has k presence bits and 1 dirty bit

• Each cache block in a cache has 1 valid bit and 1 dirty (owner) bit

• Read from main memory by processor i:

• If dirty-bit OFF then { read from main memory; turn p[i] ON; }

• If dirty-bit ON then { recall line from the dirty processor (downgrade its cache state to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply recalled data to i; }

• Write to main memory by processor i:

• If dirty-bit OFF then { send invalidations to all caches that have the block; turn dirty-bit ON; supply data to i; turn p[i] ON; ... }
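The directory state and the read/write handling above can be sketched in C roughly as follows; the entry layout mirrors the k presence bits plus one dirty bit, while recall_from_owner(), invalidate_sharers() and supply_block() are hypothetical helpers standing in for the network transactions.

    #define K 64   /* number of processors (k) */

    typedef struct {
        unsigned char present[K];   /* k presence bits: which caches hold a copy of the block */
        unsigned char dirty;        /* 1 dirty bit: one cache holds the only up-to-date copy  */
    } dir_entry_t;

    /* Read of the block by processor i. */
    void directory_read(dir_entry_t *d, int i)
    {
        if (d->dirty) {
            /* recall_from_owner(d);  -> downgrade the owner to shared, update memory */
            d->dirty = 0;
        }
        d->present[i] = 1;          /* turn p[i] ON */
        /* supply_block(i); */
    }

    /* Write to the block by processor i. */
    void directory_write(dir_entry_t *d, int i)
    {
        /* invalidate_sharers(d);  (or recall from the dirty owner if d->dirty is set) */
        for (int j = 0; j < K; j++)
            d->present[j] = 0;
        d->present[i] = 1;          /* the writer becomes the single owner */
        d->dirty = 1;               /* turn the dirty bit ON */
        /* supply_block(i); */
    }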


Real Manycore Architectures

  • ARM Cortex A9

  • GPU

  • P2012


ARM Cortex-A9 processors

  • 98% of mobile phones use at least one ARM processor

  • 90% of embedded 32-bit systems use ARM

  • The Cortex-A9 processors are the highest performance ARM processors implementing the full richness of the widely supported ARMv7 architecture.


Cortex-A9 CPU

  • Superscalar out-of-order instruction execution

    • Any of the four subsequent pipelines can select instructions from the issue queue

  • Advanced processing of instruction fetch and branch prediction

  • Up to four instruction cache line prefetches pending

    • Further reduces the impact of memory latency so as to maintain instruction delivery

  • Between two and four instructions per cycle forwarded continuously into instruction decode

  • Counters for performance monitoring


The Cortex-A9 MPCore Multicore Processor

  • Design-configurable processor supporting between 1 and 4 CPUs

  • Each processor may be independently configured for its cache sizes, FPU and NEON

  • Snoop Control Unit

  • Accelerator Coherence Port


Snoop Control Unit and Accelerator Coherence Port

  • The SCU is responsible for managing:

    • the interconnect,

    • arbitration,

    • communication,

    • cache-2-cache and system memory transfers,

    • cache coherence

  • The Cortex-A9 MPCore processor also exposes these capabilities to other system accelerators and non-cached DMA driven mastering peripherals:

    • To increase the performance

    • To reduce the system wide power consumption by sharing access to the processor’s cache hierarchy

  • This system coherence also reduces the software complexity involved in otherwise maintaining software coherence within each OS driver.


What is GPGPU?

  • The graphics processing unit (GPU) on commodity video cards has evolved into an extremely flexible and powerful processor

    • Programmability

    • Precision

    • Power

  • GPGPU: an emerging field seeking to harness GPUs for general-purpose computation other than 3D graphics

    • GPU accelerates critical path of application

  • Data parallel algorithms leverage GPU attributes

    • Large data arrays, streaming throughput

    • Fine-grain SIMD parallelism

    • Low-latency floating point (FP) computation

  • Applications – see //GPGPU.org

    • Game effects (FX) physics, image processing

    • Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting


Motivation 1:

  • Computational Power

    • GPUs are fast…

    • GPUs are getting faster, faster


Motivation 2:

  • Flexible, Precise and Cheap:

    • Modern GPUs are deeply programmable

      • Solidifying high-level language support

    • Modern GPUs support high precision

      • 32 bit floating point throughout the pipeline

      • High enough for many (not all) applications













Stalls!

  • Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.

  • Memory access latency = 100s to 1000s of cycles

  • We've removed the fancy caches and logic that help avoid stalls.

  • But we have LOTS of independent work items.

  • Idea #3: Interleave processing of many elements on a single core to avoid stalls caused by high latency operations.








NVIDIA Tesla

  • Three key ideas

    • Use many “slimmed down cores” to run in parallel

    • Pack cores full of ALUs (by sharing instruction stream across groups of work items)

    • Avoid latency stalls by interleaving execution of many groups of work-items/ threads/ ...

      • When one group stalls, work on another group


On-chip memory

  • Each multiprocessor has on-chip memory of the four following types:

    • One set of local 32-bit registers per processor,

    • A parallel shared memory that is shared by all scalar processor cores and is where the shared memory space resides,

    • A read-only constant cache that is shared by all scalar processor cores and speeds up reads from the constant memory space, which is a read-only region of device memory,

    • A read-only texture cache that is shared by all scalar processor cores and speeds up reads from the texture memory space, which is a read-only region of device memory; each multiprocessor accesses the texture cache via a texture unit that implements the various addressing modes and data filtering.

  • The local and global memory spaces are read-write regions of device memory and are not cached.


Shared Memory

  • Is on-chip:

    • much faster than the global memory

    • divided into equally-sized memory banks

    • as fast as a register when no bank conflicts

  • Successive 32-bit words are assigned to successive banks

  • Each bank has a bandwidth of 32 bits per clock cycle.
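A rough way to see how the bank mapping behaves (assuming 32 banks, the Fermi organization; earlier GPUs used 16): the plain C sketch below computes which bank each of 32 threads touches for a unit-stride and a stride-2 word access pattern, and stride 2 makes pairs of threads collide on the same bank.

    #include <stdio.h>

    #define NUM_BANKS 32   /* assumption: 32 banks, successive 32-bit words in successive banks */

    static int bank_of_word(int word_index) { return word_index % NUM_BANKS; }

    int main(void)
    {
        for (int stride = 1; stride <= 2; stride++) {
            printf("stride %d banks:", stride);
            for (int tid = 0; tid < 32; tid++)        /* thread tid reads 32-bit word tid*stride */
                printf(" %2d", bank_of_word(tid * stride));
            printf("\n");                             /* stride 2 hits only 16 banks -> 2-way conflicts */
        }
        return 0;
    }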


Shared Memory

Examples of Shared Memory Access Patterns

without Bank Conflicts


Shared Memory

Examples of Shared Memory Access Patterns

with Bank Conflicts


Global Memory: Coalescing

  • The device is capable of reading 4-byte, 8-byte, or 16-byte words from global memory into registers in a single instruction.

  • Global memory bandwidth is used most efficiently when the simultaneous memory accesses can be coalesced into a single memory transaction of 32, 64, or 128 bytes.
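The index arithmetic below (plain C, illustrative; the 128-byte segment size comes from the slide above) contrasts a unit-stride pattern, where 32 consecutive threads each load a 4-byte word and all accesses fall into a single aligned segment, with a strided pattern that spreads the same 32 loads over many segments and therefore many memory transactions.

    #include <stdio.h>

    #define SEGMENT_BYTES 128   /* one coalesced memory transaction covers an aligned 128-byte segment */

    static unsigned segment_of(unsigned byte_addr) { return byte_addr / SEGMENT_BYTES; }

    int main(void)
    {
        const unsigned base = 0;                      /* assume the array starts on a segment boundary */
        const int strides[] = { 1, 8 };

        for (int k = 0; k < 2; k++) {
            int stride = strides[k];
            unsigned first = segment_of(base), last = first;
            for (int tid = 0; tid < 32; tid++) {      /* 32 threads each load one 4-byte word */
                unsigned seg = segment_of(base + 4u * (unsigned)(tid * stride));
                if (seg < first) first = seg;
                if (seg > last)  last  = seg;
            }
            printf("stride %d: 32 loads span %u segment(s)\n", stride, last - first + 1);
        }
        return 0;
    }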




NVIDIA’s Fermi Generation CUDA Compute Architecture:

The key architectural highlights of Fermi are:

  • Third Generation Streaming Multiprocessor (SM)

    • 32 CUDA cores per SM, 4x over GT200

    • 8x the peak double precision floating-point performance over GT200

  • Second Generation Parallel Thread Execution ISA

    • Unified Address Space with Full C++ Support

    • Optimized for OpenCL and DirectCompute

  • Improved Memory Subsystem

    • NVIDIA Parallel DataCache hierarchy with Configurable L1 and Unified L2 Caches

    • improved atomic memory op performance

  • NVIDIA GigaThread™ Engine

    • 10x faster application context switching

    • Concurrent kernel execution

    • Out of Order thread block execution

    • Dual overlapped memory transfer engines


Third Generation Streaming Multiprocessor

  • 512 High Performance CUDA cores

    • Each SM features 32 CUDA processors

    • Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU)

  • 16 Load/Store Units

    • Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock.

    • Supporting units load and store the data at each address to cache or DRAM.

  • Four Special Function Units

    • Special Function Units (SFUs) execute transcendental instructions such as sine, cosine, reciprocal, and square root.


P2012 Introduction

Figure: the P2012 fabric, an array of P2012 clusters connected through system bridges and managed by a fabric controller.

The P2012 cluster is the computing node of the P2012 Fabric

The P2012 cluster has two variants:

  • A homogeneous computing variant,

  • A heterogeneous computing variant.

    A single architecture for both variants.


P2012 Cluster Main Features

Symmetric Multi-processing

Uniform Memory Access within the cluster

Non Uniform Memory Access between clusters

Up to 16 +1 processors per cluster.

Up to 30.6 GOPS peak per cluster (assuming non-SIMD extension) at 600 MHz.

Up to 20.4 GFLOPs (32 bits) peak per cluster at 600 MHz.

2 DMA channels allowing up to 6.4 GB/s data transfer

HW Support for synchronization:

  • Fast barrier (within a cluster only) in ~4 Cycles for 16 processors

  • Flexible barrier ~20 cycles for 16 processors

    Seamless combination of non-programmable (HWPEs) and programmable (PEs) processing elements

    High level of customization through:

  • The number of STxP70 processing elements

  • The STxP70 extensions (ISA customization)

  • Up to 32 User-defined HWPEs,

  • Memory sizes,

  • Banking factor of the shared memory,


P2012 Cluster Overview

P2012 Cluster Architecture

  • N x STxP70 Cores

  • 2xN-banked Shared Data Memory

  • N-to-2M Logarithmic interconnect (memory)

  • Peripheral Logarithmic interconnect

  • Runtime accelerator (HWS)

  • Timers

  • Cluster interfaces (I/O)

Figure: block diagram highlighting the multi-core sub-system (ENCore<N>) and the global interconnect interface.


P2012 Cluster Overview

P2012 Cluster Architecture

  • 1 STxP70-based Cluster processor

  • 16KB P$ & TCDM

  • CC peripheral (boot, …)

  • Clock, variability, power controller (CVP)

  • Cluster Controller Interconnect

Figure: block diagram highlighting the Cluster Controller (CC).


P2012 Cluster Overview

P2012 Cluster Architecture

Debug and Test Unit (DTU)

  • Provides controllability and observability to the application developer

  • Breakpoint propagation inside the cluster and across the fabric


P2012 Cluster Overview

P2012 Cluster Architecture

Custom HW Processing Elements

  • P x HW Processing Elements

  • Stream Flow Local Interconnect (LIC)

  • HWPE to/from LIC interfaces (HWPE_WPR)

  • CC to/from LIC interface (SIF)


P2012 Cluster Overview (Cont'd)

P2012 Cluster Architecture

Figure: cluster block diagram. N STxP70 cores (shown with and without the FPx floating-point extension), each with a 16 KB P$, reach a 2xN-banked 32-KB shared Tightly Coupled Data Memory (TCDM) through a logarithmic interconnect. The cluster also contains a peripheral logarithmic interconnect, the HWS runtime accelerator, timers and two DMA channels; a Cluster Controller (an STxP70-based CP with 16 KB P$, CC peripherals, the CVP and a CC interconnect); P HWPEs attached through HWPE_WPR blocks to a stream-flow local interconnect with a streaming interface (SIF); a Debug and Test Unit (DTU); and a global interconnect interface.

STxP70 processing element:

  • 32-bit RISC processor

  • 16 KB P$, no local data memory

  • 600 MHz in 32 nm

  • Variable-length ISA

  • Up to two instructions executed per cycle

  • Configurable core

  • Extendible through its ISA

  • Complete software development tool chain


P2012 Cluster Overview (Cont'd)

P2012 Cluster Architecture

Logarithmic interconnect (TCDM):

  • Parametric multi-core crossbar with a logarithmic structure

  • Reduced arbitration complexity

  • Round-robin arbitration scheme

  • Up to N memory accesses per cycle

  • Test-and-Set support


P2012 Cluster Overview (Cont'd)

P2012 Cluster Architecture

DMA channels:

  • Support 1D & 2D transfers

  • Up to 3.2 GB/s peak per DMA

  • Support up to 16 outstanding transactions

  • Support of Out of Order (OoO)


P2012 Cluster Overview (Cont'd)

P2012 Cluster Architecture

Clock, variability and power controller (CVP):

  • Ultrafast frequency adaptation (power control)

  • Continuous critical path monitoring (dynamic bin sampling)

  • Continuous thermal sensing (temperature control)


P2012 Cluster Overview (Cont'd)

P2012 Cluster Architecture

Local Interconnect (Stream flow):

  • Highly flexible and configurable interconnect

  • Asynchronous implementation

  • Low-area or high-performance targets

  • Natural GALS enabler

  • High robustness to variations




LD/ST and DMA memory transfers

Figure: the P2012 fabric of clusters with an L2 memory, connected through system bridges and a fabric controller.

Intra-Cluster:

  • LD/ST (UMA)

  • DMA: From/to TCDM to/from HWPE

    Inter-Cluster:

  • LD/ST (NUMA)

  • DMA: L1-to/from-L1

    Cluster to/from L2-Mem:

  • LD/ST (NUMA)

  • DMA: L1 to/from L2

    Cluster to/from L3-Mem (through the system bridge):

  • LD/ST (NUMA)

  • DMA: L1 to/from L3


P2012 as GP Accelerator

Figure: an ARM host attached to the P2012 fabric (Cluster 0 to Cluster 3, each with an L1 TCDM, plus a fabric controller (FC) and an L2 memory), with L3 (DRAM) behind it.


Summary

The P2012 Cluster includes up to 16 + 1 STxP70 cores, delivering up to 30.6 GOPS and 20.4 GFLOPS peak.

~7 GB/s DMA transfers

Symmetric multi-processing in a UMA fashion within a Cluster; shared data memory in a NUMA fashion between Clusters.

Fast multi processor synchronization thanks to HW support

Seamless combination of non-programmable (HWPEs) and programmable (PEs) processing elements


Mobile SOC in 2012…

Lots of (cool) memory, but we need more

NVIDIA Tegra II SoC (2011)

  • Features

    • TSMC 40nm (LP/G)

    • Dual core A9 – 1-1.2GHz (G)

    • GPU, etc. - 330-400MHz (LP)

    • GEForce ULV (8 shaders)

    • 2 separate Vdd rails

    • 1MB L2$

    • 32b LPDDR2 (600MHz DR)

A few (2, 4, 8) high-power processors (ARM): we need to handle power peaks

Efficient accelerator fabrics with many (10s of) PEs: we need to improve efficiency

