Vector Class on Limited Local Memory (LLM) Multi-core Processors
Vector Class on Limited Local Memory (LLM) Multi-core Processors. Ke Bai Di Lu and Aviral Shrivastava. Compiler Microarchitecture Lab Arizona State University, USA. Summary. Cannot improve performance without improving power-efficiency Cores are becoming simpler in multicore architectures


Presentation Transcript



Vector Class on Limited Local Memory (LLM) Multi-core Processors

Ke Bai, Di Lu, and Aviral Shrivastava

Compiler Microarchitecture Lab

Arizona State University, USA



Summary

  • Cannot improve performance without improving power-efficiency

    • Cores are becoming simpler in multicore architectures

  • Caches not scalable (both power and performance)

    • Limited Local Memory multicore architectures

      • Each core has a scratch pad (e.g., Cell processor)

      • Need explicit DMAs to communicate with global memory

  • Objective:

    • How to enable the vector data structure (dynamic arrays) on the LLM cores?

  • Challenges:

    • 1. Use local store as temporary buffer (e.g., software cache) for vector data

    • 2. Dynamic global memory management, and core request arbitration

    • 3. How to use pointers when the data pointed to may have moved?

  • Experiments

    • Vectors of any size are supported

    • All SPUs may use the vector library simultaneously, and the approach is scalable



From multi- to many-core processors

  • Simpler design and verification

    • Reuse the cores

  • Can improve performance without much increase in power

    • Each core can run at a lower frequency

  • Tackle thermal and reliability problems at core granularity

[Examples: IBM XCell 8i, Tilera TILE64, NVIDIA GeForce 9800 GT]



Memory Scaling Challenge

  • In Chip Multi-Processors (CMPs), caches guarantee data coherency

    • Bring the required data, wherever it resides, into the cache

    • Make sure that the application gets the latest copy of the data

  • Caches consume too much power

    • e.g., 44% of the power and more than 34% of the area (StrongARM 1100)

  • Cache coherency protocols do not scale well

    • Intel's 48-core Single-chip Cloud Computer has non-coherent caches

[Figures: StrongARM 1100 die photo; Intel 48-core chip]



Limited Local Memory Architecture

[Figure: IBM Cell architecture: the PPE and eight SPEs (SPE 0-7), each with an SPU and a Local Store (LS), connected by the Element Interconnect Bus (EIB) to off-chip global memory]

PPE: Power Processor Element

SPE: Synergistic Processor Element

LS: Local Store

  • Cores have small local memories (scratch pads)

    • A core can only access its own local memory

    • Accesses to global memory go through explicit DMAs in the program (see the sketch below)

  • e.g., the IBM Cell architecture, used in the Sony PS3
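For illustration, a single fetch from global memory on an SPE might look like this minimal sketch using the standard MFC intrinsics (buf, ea, and the tag choice are illustrative, not from the slides):

#include <spu_mfcio.h>

/* DMA buffers must be 16-byte aligned */
static volatile int buf[4] __attribute__((aligned(16)));

void fetch(unsigned long long ea)      /* ea: effective (global) address */
{
    unsigned int tag = 0;
    mfc_get(buf, ea, sizeof(buf), tag, 0, 0);   /* explicit DMA: global -> local */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();                  /* block until the DMA completes */
}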



LLM Programming

Local Core (SPE) code, the same program running on each local core:

#include <stdio.h>
#include <spu_mfcio.h>

int main(speid, argp)
{
    printf("Hello world!\n");
}

Main Core (PPE) code:

#include <libspe2.h>

extern spe_program_handle_t hello_spu;

int main(void)
{
    int speid, status;
    speid = spe_create_thread(&hello_spu);
}

  • Extremely power-efficient computation

    • If all code and data fit into the local memory of the cores

Otherwise, efficient data management is required!

Task-based programming, MPI-like communication



Managing Data

Original Code:

int global;

f1() {
    int a, b;
    global = a + b;
    f2();
}

Local Memory Aware Code:

int global;

f1() {
    int a, b;
    DMA.fetch(global);
    global = a + b;
    DMA.writeback(global);
    DMA.fetch(f2);
    f2();
}



Vector Class Introduction

The vector class is a widely used library for programming!

  • One of the container classes in the C++ Standard Template Library (STL)

  • Implemented as a dynamic array; a sequential container

  • Elements are stored in contiguous storage locations

    • They can be accessed through iterators, or via offsets on regular pointers to elements

  • Compared to arrays:

    • Vectors can be easily resized

    • Capacity increases and decreases are handled automatically

    • They usually consume more memory than arrays when their capacity is managed automatically

      • This is to accommodate extra storage space for future growth
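For reference, typical vector usage looks like this (plain C++, before any LLM-specific management):

#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> v;            // empty vector; capacity grows on demand
    for (int i = 0; i < 100; i++)
        v.push_back(i);            // may trigger reallocation and data movement
    std::printf("%d\n", v[42]);    // random access, like a regular array
    return 0;
}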



Vector Class Management

  • All code and data need to be managed

  • This paper focuses on vector data management

    • Vector management is difficult

      • Vector size is dynamic and can be unbounded

    • Cell programming manual suggests “Use dynamic data at your own risk”.

      • But prohibiting dynamic data is too restrictive for programmers.

SPE code:

main() {
    vector<int> vec;
    for (int i = 0; i < N; i++)
        vec.push_back(i);
}

Max N is 8192. But 8192 ints occupy only 32 KB, far less than the 256 KB local store. Why does it crash so early?



Outline of the Talk

  • Motivation

  • Related Works on Vector Data Management

  • Our Approach of Vector Data Management

  • Experiments



Related Works

  • Different threads can access a vector concurrently, whether it lives in one address space or in several.

  • These libraries provide efficient parallel implementations, abstract away platform details, give programmers an interface for expressing the parallelism of their problems, and automatically translate addresses from one space to another:

    • Shared memory: MPTL [Baertschiger 2006], MCSTL [Singler 2007], and Intel TBB [Intel 2006]

    • Distributed memory: POOMA [Reynders 1996], AVTL [Sheffler 1995], STAPL [Buss 2010], and PSTL [Johnson 1998]

[Figure: LLM architecture: SPEs, each with a small local memory, connected by DMA to global memory]

These libraries ensure data coherency across different address spaces. But what happens when the local memory is small?



Space Allocation and Reallocation

  • push_back & insert

    • Add elements

    • When there is no unused space left, a larger space must be allocated

[Figure: (a) the vector uses up the allocated space; (b) a larger space is allocated and all data are moved]

Supporting unlimited vectors requires evicting older vector data to global memory and reallocating larger spaces in global memory! The grow-and-move step itself is sketched below.
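As a point of reference, here is a minimal sketch of the classic grow-and-move step behind push_back, assuming a simple capacity-doubling policy (Vec and vec_push_back are illustrative names, not the paper's implementation):

#include <cstddef>
#include <cstring>

struct Vec {
    int *data;
    std::size_t size, capacity;
};

void vec_push_back(Vec &v, int x)
{
    if (v.size == v.capacity) {                            // allocated space used up
        std::size_t new_cap = v.capacity ? 2 * v.capacity : 1;
        int *p = new int[new_cap];                         // allocate a larger space
        if (v.size)
            std::memcpy(p, v.data, v.size * sizeof(int));  // move all existing data
        delete[] v.data;
        v.data = p;
        v.capacity = new_cap;
    }
    v.data[v.size++] = x;                                  // append the new element
}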



Space Allocation and Reallocation

The SPE and PPE threads cooperate through a mailbox-based protocol:

(1) The SPE transfers the request parameters to global memory by DMA:

struct msgStruct {
    int vector_id;
    int request_size;
    int data_size;
    int new_gAddr;
};

(2) The SPE sends the operation type to the PPE through the mailbox.

(3) The PPE thread operates on the vector data in global memory and updates new_gAddr in the data structure.

(4) The PPE sends a restart signal back through the mailbox.

(5) The SPE gets the new vector address by DMA.

  • A static buffer?

    • Small vector -> low utilization; large vector -> overflow

  • The SPU thread can't call malloc() and free() on global memory

  • Hybrid: DMA + mailbox (sketched below)
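A loose sketch of the SPE side of this hybrid handshake, using the standard MFC intrinsics (msg_ea, OP_REALLOC, and the tag choice are illustrative assumptions, not the paper's code):

#include <spu_mfcio.h>

struct msgStruct {
    int vector_id;
    int request_size;
    int data_size;
    int new_gAddr;
};

/* 16 bytes and 16-byte aligned: a legal DMA transfer unit */
static volatile struct msgStruct msg __attribute__((aligned(16)));

enum { OP_REALLOC = 1 };   /* hypothetical operation code */

int request_realloc(unsigned long long msg_ea, int id, int req_size, int data_size)
{
    unsigned int tag = 0;
    msg.vector_id    = id;
    msg.request_size = req_size;
    msg.data_size    = data_size;
    /* (1) transfer the parameters to global memory by DMA */
    mfc_put(&msg, msg_ea, sizeof(msg), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
    /* (2) tell the PPE which operation is requested */
    spu_write_out_mbox(OP_REALLOC);
    /* (3) happens on the PPE; (4) block until its restart signal arrives */
    spu_read_in_mbox();
    /* (5) fetch the updated message, which carries the new vector address */
    mfc_get(&msg, msg_ea, sizeof(msg), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
    return msg.new_gAddr;
}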



Element Retrieving

  • Block index: the index of the 1st element in the block

  • Each block contains a block index in addition to the data; blocks are kept in a linked list

[Figure: vector data stored as a linked list of blocks, block size 16: Block 0 holds the 0th-15th elements; the block with block index 128 holds the 128th-143rd elements]

Example: for the 133rd element, block index = 133 / 16 * 16 = 128 (integer division).

  • Global address: based on the global address, we can tell whether the block is already in local memory; if not, we fetch it (see the lookup sketch below).
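A minimal sketch of this lookup (Block and lookup_block are illustrative names; lookup_block stands in for the linked-list / local-buffer search that DMA-fetches a block on a miss):

enum { BLOCK_SIZE = 16 };

struct Block {
    int block_index;               /* index of the block's 1st element */
    int data[BLOCK_SIZE];
    struct Block *next;            /* blocks form a linked list */
};

struct Block *lookup_block(int block_index);   /* hypothetical: fetches on a miss */

int get_element(int i)
{
    int block_index = (i / BLOCK_SIZE) * BLOCK_SIZE;   /* e.g., 133 -> 128 */
    struct Block *b = lookup_block(block_index);
    return b->data[i % BLOCK_SIZE];                    /* offset within the block */
}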



Vector Function Implementation

  • To preserve the semantics, we implemented all vector functions; only insert is shown here.

    • The original insertion can take advantage of pointers:

for (……)

(*b++) = (*a++);

[Figure: inserting a new element shifts the elements after it; with LLM, those elements may live partly in local memory and partly in global memory]

  • But element shifting is now a challenging task on the LLM architecture (a block-wise sketch follows)

    • We cannot use local-memory pointers to access global memory, and DMA requires alignment
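One way to picture the managed shift: data moves one aligned block at a time, staged through local memory, since global memory is reachable only through aligned DMA. A loose sketch under that assumption (get_block and put_block are hypothetical wrappers over mfc_get/mfc_put):

enum { BLOCK_BYTES = 64 };   /* a DMA-legal block size (multiple of 16 bytes) */

void get_block(volatile void *ls, unsigned long long ea);  /* DMA: global -> local */
void put_block(unsigned long long ea, volatile void *ls);  /* DMA: local -> global */

/* Copy one block from src_ea to dst_ea through a local staging buffer,
   shifting vector data without ever dereferencing a global pointer. */
void shift_block(unsigned long long src_ea, unsigned long long dst_ea)
{
    static volatile char tmp[BLOCK_BYTES] __attribute__((aligned(16)));
    get_block(tmp, src_ea);
    put_block(dst_ea, tmp);
}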



Pointer Problem

struct S {
    ……
    int* ptr;
};

[Figure: (a) a pointer points to a vector element in local memory; (b) the vector element is moved to global memory, and the pointer no longer points to valid data]

The pointer problem needs to be solved! To support limitless vector data, global memory must be leveraged. Two address spaces then co-exist, and no matter what scheme is implemented, the pointer issue exists.



Pointer Resolution

  • Local address should not be used to identify the data.

(a) Original Program:

main()
{
    vector<int> vec;
    int* a = &vec.at(index);
    int sum = 1 + *a;
    int* b = a;
}

(b) Transformed Program:

main()
{
    vector<int> vec;
    int* a = ppu_addr(vec, index);
    a = ptrChecker(a);
    int sum = 1 + *a;
    a = s2p(a);
    int* b = a;
}

  • ppu_addr: returns the global address pointing to the vector element.

  • ptrChecker (a loose sketch follows):

    • checks whether the pointer is pointing to vector data;

    • guarantees the data pointed to is in local memory;

    • returns the local address.

  • s2p: transforms the local address back to a global address.
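Based on those three bullets, a loose sketch of what ptrChecker might do (is_vector_addr, ensure_block_local, and to_local are hypothetical helpers, not the paper's names):

int  is_vector_addr(unsigned int ga);      /* hypothetical: is ga inside any vector? */
void ensure_block_local(unsigned int ga);  /* hypothetical: DMA-fetch block on a miss */
int *to_local(unsigned int ga);            /* hypothetical: global -> local address */

int *ptrChecker(int *a)
{
    unsigned int ga = (unsigned int)a;     /* the value carries a global address */
    if (!is_vector_addr(ga))               /* not vector data: return unchanged */
        return a;
    ensure_block_local(ga);                /* make the enclosing block local */
    return to_local(ga);                   /* local address of the element */
}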



Experimental Setup

  • Hardware

    • PlayStation 3 with IBM Cell BE

  • Software

    • Operating System: Linux Fedora 9 and IBM SDK 3.1

    • Benchmarks: representative applications that use vector data



Unlimited Vector Data

[Figure: the allocated space doubles at each reallocation: 4 B, 8 B, 16 B, ..., 2^(n+2) B. The repeated reallocations answer the earlier question of why the unmanaged vector crashes so early]



Impact of Block Size



Impact of Buffer Space

buffer_size = number_of_blocks × block_size



Impact of Associativity

Higher associativity -> more computation spent looking up the data structure, but a lower miss ratio



Scalability



Summary

  • Cannot improve performance without improving power-efficiency

    • Cores are becoming simpler in multicore architectures

  • Caches not scalable (both power and performance)

    • Limited Local Memory multicore architectures

      • Each core has a scratch pad (e.g., Cell processor)

      • Need explicit DMAs to communicate with global memory

  • Objective:

    • How to enable the vector data structure (dynamic arrays) on the LLM cores?

  • Challenges:

    • 1. Use local store as temporary buffer (e.g., software cache) for vector data

    • 2. Dynamic global memory management, and core request arbitration

    • 3. How to use pointers when the data pointed to may have moved?

  • Experiments

    • Vectors of any size are supported

    • All SPUs may use the vector library simultaneously, and the approach is scalable

