Morphable Multithreaded Memory Tiles (M3T)

[Figure labels: M3T; CPU; Cache; Memory; P; M; K; Barrier; Safe; Speculative; Overhead; 6%. Chart labels: Performance; Parallelism; Superscalar; SMT; CMP; TaskScalar; SpecInt; SpecFP; Scientific.]

Morphable Multithreaded Memory Tiles (M3T)

Josep Torrellas (University of Illinois at Urbana-Champaign), Ben Abbott (Southwest Research Institute), Ted Bapty (Vanderbilt University), Bob Bassett and David Ngo (BAE SYSTEMS), Hubertus Franke and Jose Moreira (IBM Research)

Architecture

M3T Architecture

  • 64 processor cores
  • 0.11 µm
  • 309.3 mm²
  • ASIC: 800 MHz
  • Thermal power: 66 W
  • Multichip support
  • Morphs into MIMD, VLIW, TaskScalar, and Stream

Compiler Support

  • Novel compiler algorithms to build tasks
  • Novel inter-task optimizations
    • Task vectorization
    • Task fusion
    • Task fission
    • Task partitioning
    • Task motion
    • Task telescoping
    • Task elimination
  • Compiler flow
    • Front End
    • High-Level Transformations
    • Task Selection
    • Inter-Task Optimizations
    • Intra-Task Optimizations
    • Code Generation

[Figure labels: Task T1; Task T2; Task T3; Task T1,3; X; X+1; X+2; X, X+2]
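The poster names the inter-task optimizations without defining them. As a rough illustration only, the sketch below assumes that task fusion works like loop fusion: two tasks are merged into one when moving the later task past the tasks in between does not violate a data dependence. This mirrors the figure labels, where T1, T2, T3 become T1,3. The `Task` record, its `reads`/`writes` sets, and the `fuse_across` helper are all hypothetical names, not part of the M3T compiler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: a name, its read/write sets, and its statements."""
    name: str
    reads: frozenset
    writes: frozenset
    body: tuple  # statements kept as strings, for illustration only


def independent(a, b):
    """True when a and b may be reordered: no RAW, WAR, or WAW overlap."""
    return not (a.writes & b.reads or a.reads & b.writes or a.writes & b.writes)


def fuse(a, b):
    """Merge two tasks into one that runs a's statements, then b's."""
    return Task(f"{a.name},{b.name}", a.reads | b.reads,
                a.writes | b.writes, a.body + b.body)


def fuse_across(tasks, i, j):
    """Fuse tasks[j] into tasks[i] (i < j) if hoisting tasks[j] is legal."""
    between = tasks[i + 1:j]
    if not all(independent(t, tasks[j]) for t in between):
        return list(tasks)  # illegal move: leave the task list unchanged
    return tasks[:i] + [fuse(tasks[i], tasks[j])] + between + tasks[j + 1:]


t1 = Task("T1", frozenset({"x"}), frozenset({"a"}), ("a = f(x)",))
t2 = Task("T2", frozenset({"y"}), frozenset({"b"}), ("b = g(y)",))
t3 = Task("T3", frozenset({"a"}), frozenset({"c"}), ("c = h(a)",))

fused_names = [t.name for t in fuse_across([t1, t2, t3], 0, 2)]

# If T2 also wrote "a", T3 could not move past it and nothing is fused.
t2_blocking = Task("T2", frozenset({"y"}), frozenset({"a"}), ("a = g(y)",))
blocked_names = [t.name for t in fuse_across([t1, t2_blocking, t3], 0, 2)]
```

Under the same assumption, task fission would be the inverse step: splitting one task's statements into two tasks whose read and write sets allow them to run separately.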

Software Productivity

TaskScalar Morph

Unit of execution: a task

A task can be committed or squashed in one shot
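A minimal software sketch of what "committed or squashed in one shot" can mean, assuming a task buffers its writes privately until it finishes. The `SpeculativeTask` class and its method names are hypothetical; the poster describes hardware that keeps this state in caches, which this model does not attempt to reproduce.

```python
class SpeculativeTask:
    """Hypothetical model of a task whose writes stay buffered until commit."""

    def __init__(self, memory):
        self.memory = memory      # shared, committed state
        self.write_buffer = {}    # speculative state, private to this task

    def load(self, addr):
        # A task sees its own speculative writes before committed memory.
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        return self.memory.get(addr, 0)

    def store(self, addr, value):
        self.write_buffer[addr] = value

    def commit(self):
        # Every buffered write becomes visible in a single step.
        self.memory.update(self.write_buffer)
        self.write_buffer.clear()

    def squash(self):
        # Every buffered write is discarded; committed memory is untouched.
        self.write_buffer.clear()


memory = {"A": 1}

t1 = SpeculativeTask(memory)
t1.store("A", t1.load("A") + 1)
t1.squash()
after_squash = dict(memory)

t2 = SpeculativeTask(memory)
t2.store("A", t2.load("A") + 1)
t2.store("B", 7)
t2.commit()
after_commit = dict(memory)
```

Because nothing reaches shared memory before `commit`, squashing a task needs no per-write cleanup.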

Debugging Data Races [ISCA03]

Reducing Parallel Programming Effort [ASPLOS02]

TST: Task State Table

PTW: Pending Task Window

[Figure labels: Sync Bus; Task X; Task Y]

  • Do not fine-tune synchronization
    • Code coarse critical sections
    • Insert barriers even where they may be unnecessary
  • Speculative synchronization: speculate past active barriers, locks, and flags
    • Detect conflicts and roll back offending threads
    • Use caches to store speculative state
  • Maintain one or more safe tasks to guarantee forward progress
    • Lock: owner
    • Flag: producer
    • Barrier: lagging tasks
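The bullets above can be illustrated with a small, single-threaded simulation of a speculative lock. It is a sketch under stated assumptions, not the hardware protocol: the lock owner is the safe task and writes memory directly, while the other threads run the critical section speculatively with buffered writes and a recorded read set. A speculative thread whose reads overlap data written by safe code is rolled back and re-executed. `Speculator`, `increment`, and `run_with_speculative_lock` are hypothetical names; the critical section is the load, increment, store sequence from the poster's lock example.

```python
class Speculator:
    """Hypothetical speculative thread: buffers writes, records reads."""

    def __init__(self, memory):
        self.memory = memory
        self.reads = set()
        self.writes = {}

    def load(self, addr):
        self.reads.add(addr)
        if addr in self.writes:
            return self.writes[addr]
        return self.memory[addr]

    def store(self, addr, value):
        self.writes[addr] = value


def increment(addr):
    """Build a critical section that performs LD addr; INC; ST addr."""
    def section(load, store):
        store(addr, load(addr) + 1)
    return section


def run_with_speculative_lock(memory, sections):
    """sections[0] belongs to the lock owner (safe); the rest speculate."""
    owner, *others = sections

    # Non-owners speculate past the held lock instead of waiting.
    speculators = []
    for section in others:
        spec = Speculator(memory)
        section(spec.load, spec.store)
        speculators.append((spec, section))

    dirty = set()  # addresses written by safe or committed code

    def safe_load(addr):
        return memory[addr]

    def safe_store(addr, value):
        memory[addr] = value
        dirty.add(addr)

    # The safe owner always makes forward progress.
    owner(safe_load, safe_store)

    rollbacks = 0
    for spec, section in speculators:
        if spec.reads & dirty:
            # Conflict: discard the speculative work and re-execute safely.
            rollbacks += 1
            section(safe_load, safe_store)
        else:
            # No conflict: the buffered writes commit as they are.
            for addr, value in spec.writes.items():
                safe_store(addr, value)
    return rollbacks


memory = {"A": 0, "B": 0}
rollbacks = run_with_speculative_lock(
    memory, [increment("A"), increment("A"), increment("B")]
)
```

In this run the thread that increments `B` never conflicts with the owner, so its critical section finishes without waiting for the lock; only the second increment of `A` is rolled back.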

lock(L)
LD A
INC
ST A
unlock(L)

[Figure labels: CPU+L1 (x3); TST (x3); Banked L2 (x3); PTW (x3); task; A; a second LD A, INC, ST A sequence; ?]

No explicit order between [symbol lost] and [symbol lost]

Sync Time Reduction

[Figure labels: Speculative Barrier; A; B; BARRIER; C; 17.7%; On-Chip Network; Off-Chip Memory]

  • Detect unsynchronized communication
  • Incremental undo and re-execution
  • Re-execution is deterministic
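To make "incremental undo and re-execution" concrete, here is one simple software analogue, assuming execution is divided into numbered epochs and every store logs the value it overwrites. This is an illustration only; the poster does not describe the mechanism at this level, and `UndoLog` and `epoch_two` are hypothetical names.

```python
_MISSING = object()  # marks an address that did not exist before the store


class UndoLog:
    """Hypothetical per-epoch undo log over a dictionary used as memory."""

    def __init__(self, memory):
        self.memory = memory
        self.entries = []  # (epoch, address, previous value)
        self.epoch = 0

    def begin_epoch(self):
        self.epoch += 1
        return self.epoch

    def store(self, addr, value):
        old = self.memory.get(addr, _MISSING)
        self.entries.append((self.epoch, addr, old))
        self.memory[addr] = value

    def undo_to(self, epoch):
        """Undo every store made in `epoch` or later, newest first."""
        while self.entries and self.entries[-1][0] >= epoch:
            _, addr, old = self.entries.pop()
            if old is _MISSING:
                del self.memory[addr]
            else:
                self.memory[addr] = old
        self.epoch = epoch - 1


def epoch_two(log):
    """Work done in the second epoch; depends only on memory contents."""
    log.begin_epoch()
    log.store("x", log.memory["x"] * 2)
    log.store("y", 5)
    return dict(log.memory)


memory = {"x": 0}
log = UndoLog(memory)
log.begin_epoch()             # epoch 1
log.store("x", 3)

first_run = epoch_two(log)    # epoch 2
log.undo_to(2)                # roll back epoch 2 only
after_undo = dict(memory)
second_run = epoch_two(log)   # re-execute epoch 2
```

Re-execution reproduces the first run here because the epoch's work depends only on the restored memory contents, which is the property the "deterministic" bullet relies on.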

TaskScalar Morph Evaluation

Applications:

  • Matrix: 20x20 element dense matrix multiply
  • Bubble: application with bursty task spawns that create bubbles of activity
  • Pathological: like Matrix, but with all activity concentrated on one cluster

[Figure labels: Task Ordering; Speculative Lock; Lock L; ACQUIRE; Wait F; Set F; Unlock L; RELEASE; A; B; C; D; E]

[Figure panels: Effect of Task Size; Effect of Number of Processors; Effect of Network Latency]

  • Large reduction: 40%

  • Speedups are high even for small tasks
  • Scalability of the speedups depends significantly on the application
  • Speedups are very tolerant of network latency

[Figure labels: Code; Speedup; TaskScalar; Hardware; Effectiveness; No TaskScalar; Hardware]

[Figure panels: Timeline of Tasks (Bubble); Timeline of Tasks (Matrix); Timeline of Tasks (Pathological)]

[Figure axis: Man-Hours Invested Programming]

  • TaskScalar attempts to
    • run sections in parallel
    • speculate past synchronization
  • Result: the code performs as if more man-hours had been invested in programming it

Bursty spawn and execution of tasks

High load imbalance; waste of resources

Smooth execution of tasks
