
CBE Tutorial The Toy Version


lakia


Presentation Transcript


  1. CBE Tutorial: The Toy Version. The Basics

  2. What is covered
     • Basic SPE thread creation
     • Aligned data transfers
       • To and from the SPE
     • Unaligned data transfers
       • From main memory to the SPE
     • Mailbox signals and uses
       • Communication using the mailbox channels

  3. Outline • Introduction to CBE • Running Code • Example: Hello, World • Example: Aligned Data • Example: Unaligned data • Example: Mailbox communications • Appendix A: API Parts

  4. Intro to CBE: The Cell Broadband Engine
     Composed of nine computing elements. Created by STI.
     • SPE
       • A modified vector architecture
       • Limited memory: 256 KiB
       • All accesses are to and from this local memory
       • Main memory accesses go through DMA transfers
     • PPE
       • The brain of the system
       • Organizer
       • Runs Linux
       • PowerPC dual-issue architecture
     • MFC
       • Each SPE has an MFC unit
       • Issues and receives DMA to and from main memory
       • Gatekeeper of the bus
     • Bus
       • Four rings
       • Has QoS in a limited fashion
     • Other blocks in the diagram: Flex IO, RAM, BEI, PPSS, Memory Interface
     • Coherency and consistency are maintained between all memory units (the MFC, main memory and PPE caches, but not across the local memory of the SPEs)

  5. Outline • Introduction to CBE • Running Code • Example: Hello, World • Example: Aligned Data • Example: Unaligned data • Example: Mailbox communications • Appendix A: API Parts

  6. Running Code: The Hardware Way
     • SPU C code: test.c
       • spu-gcc -o test test.c produces the SPU executable: test
       • ppu32-embedspu -m32 test test test-embed.o produces the embedded object: test-embed.o
       • ppu-ar -qcs spu-lib.a test-embed.o produces the embedded library: spu-lib.a
     • PPU C code: driver.c
       • gcc -c driver.c produces the PPU object file: driver.o
       • gcc driver.o spu-lib.a -lspe -o driver produces the PPU executable: driver
     • Command line: ./driver

  7. Outline • Introduction to CBE • Running Code • Example: Hello, World • Example: Aligned Data • Example: Unaligned data • Example: Mailbox communications • Bibliography

  8. Example 1: Hello, World
     • To learn about the PPU thread management functions and thread structures
     [Figure: the PPE calls spe_create_thread to start execution on the SPEs, then calls spe_wait.]

  9. Functions
     spe_program_handle_t
     • Type: Structure
     • Location: libspe.h [PPU side]
     • Usage: Keeps details about the SPE program. Its name should be the same as the one provided in the ppu-embedspu command.
     speid_t
     • Type: Type definition
     • Location: libspe.h [PPU side]
     • Usage: Keeps some thread-specific information.

  10. Functions: API Parts
     speid_t spe_create_thread(spe_gid_t gid, spe_program_handle_t *handle, void *argp, void *envp, unsigned long mask, int flags)
     • Type: Function (thread management)
     • Location: libspe.h [PPU side]
     • Input:
       • gid: Group id for this thread (usually zero)
       • handle: A pointer to the SPU program information
       • argp: A pointer to the arguments for the SPU thread
       • envp: A pointer to an environment (global variables, shell settings, etc.). Usually zero
       • mask: Affinity of the processor that should run the thread. Usually -1 (all SPUs equally probable)
       • flags: Bitwise flags for thread-specific properties. Usually zero
     • Output:
       • speid_t: A unique identifier for the thread (plus some extra information)
     • Usage: Creates a thread that will run on an SPU. It loads the SPU image into the SPU, begins execution on that element, and returns to the PPU.
     • POSIX equivalent: pthread_create

  11. Functions: API Parts
     int spe_wait(speid_t speid, int *status, int options)
     • Type: Function (thread management)
     • Location: libspe.h [PPU side]
     • Input:
       • speid: The identifier of the thread to wait for
       • status: A pointer that receives the return status of the thread
       • options: Options for the waiting behavior. Usually zero
     • Output:
       • int: Zero represents a correct return value. Any other value represents an error.
     • Usage: Blocks the calling PPU thread until the SPE thread identified by speid has finished.
     • POSIX equivalent: pthread_join

  12. Example 1: Hello, World
     PPU C source code:
         #include <stdlib.h>
         #include <stdio.h>
         #include <libspe.h>
         extern spe_program_handle_t test;   /* SPU binary handle */
         #define THREADS 6
         int main(){
             speid_t speid[THREADS];         /* SPU identifiers */
             int status, i;
             /* Create the SPU threads with the binary */
             for(i = 0; i < THREADS; ++i){
                 speid[i] = spe_create_thread(0, &test, 0, NULL, -1, 0);
                 if(speid[i] == NULL) return 1;
             }
             /* Wait for the threads to finish */
             for(i = 0; i < THREADS; ++i)
                 spe_wait(speid[i], &status, 0);
             return 0;
         }
     SPU C source code:
         #include <stdio.h>
         int main(){
             printf("Hello, World \n");
             return 0;
         }
     Output window: "Hello, World" printed six times, once per SPU thread.

  13. Outline • Introduction to CBE • Running Code • Example: Hello, World • Example: Aligned Data • Example: Unaligned data • Example: Mailbox communications • Appendix A: API Parts

  14. Example 2: Aligned Data
     • To learn about DMA transfers with aligned data
     [Figure: mfc_get and mfc_put move data between main memory and local storage.]

  15. Example 2: Aligned Data
     • Explicit Rule 1: The PPU and SPU addresses (the sending and receiving locations) MUST be aligned to 16 bytes or be naturally aligned.
     Examples (sender address, receiver address):
       • 0x10BF0, 0x2230: both addresses are divisible by 16, that is, the last hexadecimal digit of each address is zero
       • 0x10BF3, 0x22F3: the last hexadecimal digits of both addresses are the same
       • 0x10BF3, 0x2234: not aligned
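The three address pairs on this slide can be checked mechanically. The following is a minimal host-side sketch in plain C, not Cell SDK code. The names `dma_align_t`, `classify_dma_pair` and `check_before_transfer` are hypothetical and introduced only for illustration, and the rule is encoded exactly as the slide states it, by comparing the low four bits of both addresses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classification of a (sender, receiver) address pair. */
typedef enum {
    DMA_ALIGNED_16,   /* both addresses are divisible by 16         */
    DMA_SAME_OFFSET,  /* last hexadecimal digits are equal, nonzero */
    DMA_NOT_ALIGNED   /* last hexadecimal digits differ             */
} dma_align_t;

dma_align_t classify_dma_pair(uint64_t sender, uint64_t receiver)
{
    unsigned s = (unsigned)(sender & 0xF);    /* last hex digit of sender   */
    unsigned r = (unsigned)(receiver & 0xF);  /* last hex digit of receiver */

    if (s == 0 && r == 0)
        return DMA_ALIGNED_16;
    if (s == r)
        return DMA_SAME_OFFSET;
    return DMA_NOT_ALIGNED;
}

/* Debug-time guard that could run before a transfer is issued. */
void check_before_transfer(uint64_t sender, uint64_t receiver)
{
    assert(classify_dma_pair(sender, receiver) != DMA_NOT_ALIGNED);
}
```

Calling `classify_dma_pair(0x10BF3, 0x2234)` reports the pair the slide marks as not aligned, which is the case the unaligned-data example later works around with a bounce buffer.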

  16. Example 2: Aligned Data
     • Explicit Rule 2: Any address in main memory is assumed to be a 64-bit address (!). Thus main memory pointers are of the unsigned long long type.
     • Explicit Rule 3: A single DMA transfer should be at most 16 KiB.
     • Implicit Rule 1: If arrays of structures are used, the structures themselves should have a size that is a multiple of 16 bytes.
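A small host-side sketch of how these rules might be applied in practice, under stated assumptions: the per-transfer limit is taken to be 16384 bytes inclusive, and the helper names (`round_up_16`, `dma_next_chunk`, `dma_chunk_count`, `small_params`) are hypothetical, not part of any SDK. It shows a structure padded to a multiple of 16 bytes with the same GCC attribute the slides use, and a large transfer split into pieces that respect the size limit.

```c
#include <assert.h>
#include <stddef.h>

#define DMA_MAX_BYTES 16384u   /* assumed per-transfer limit (16 KiB) */

typedef unsigned long long ea_t;   /* 64-bit main memory address (Rule 2) */

/* 12 bytes of payload; the attribute pads the type to 16 bytes. */
typedef struct {
    ea_t in;
    int  size;
} __attribute__((aligned(16))) small_params;

/* Round a byte count up to the next multiple of 16. */
size_t round_up_16(size_t bytes)
{
    return (bytes + 15u) & ~(size_t)15u;
}

/* Size of the next transfer when `remaining` bytes are still to be moved. */
size_t dma_next_chunk(size_t remaining)
{
    return remaining > DMA_MAX_BYTES ? DMA_MAX_BYTES : remaining;
}

/* Number of transfers needed for a buffer whose size is a multiple of 16. */
size_t dma_chunk_count(size_t bytes)
{
    size_t count = 0;
    assert(bytes % 16 == 0);
    while (bytes > 0) {
        bytes -= dma_next_chunk(bytes);
        ++count;
    }
    return count;
}
```

For instance, a 64 KiB buffer would be moved in four transfers of 16 KiB each, with one `mfc_get` or `mfc_put` per chunk.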

  17. The Code
     • Uses a simple dot product
     [Figure: mfc_get and mfc_put transfers between main memory and local storage.]

  18. Functions: API Parts
     mfc_get(ls, ea, size, tag, tid, rid)
     • Type: Macro (DMA communication)
     • Location: spu_mfcio.h [SPE side]
     • Input:
       • ls: The starting address in the SPE local storage (LS)
       • ea: The starting address in main memory (64-bit)
       • size: The number of bytes to transfer
       • tag: DMA tag
       • tid: Transfer class id (usually zero)
       • rid: Replacement class id (usually zero)
     • Usage: Initiates a DMA transfer of size bytes from main memory to this SPE's local storage.
     • POSIX equivalent: None

  19. Functions: API Parts
     mfc_put(ls, ea, size, tag, tid, rid)
     • Type: Macro (DMA communication)
     • Location: spu_mfcio.h [SPE side]
     • Input:
       • ls: The starting address in the SPE local storage (LS)
       • ea: The starting address in main memory (64-bit)
       • size: The number of bytes to transfer
       • tag: DMA tag
       • tid: Transfer class id (usually zero)
       • rid: Replacement class id (usually zero)
     • Usage: Initiates a DMA transfer of size bytes from this SPE's local storage to main memory.
     • POSIX equivalent: None

  20. Functions: API Parts
     mfc_write_tag_mask(mask)
     • Type: Macro (DMA communication)
     • Location: spu_mfcio.h [SPE side]
     • Input:
       • mask: A bit mask that selects the tag groups of the DMA transfers of interest
     • Usage: Writes a mask with the tags of the current DMA transfers. Used in conjunction with mfc_read_tag_status_all to wait for a DMA transfer (or a group of them) to complete.
     • POSIX equivalent: None

  21. Functions: API Parts
     mfc_read_tag_status_all()
     • Type: Macro (DMA communication)
     • Location: spu_mfcio.h [SPE side]
     • Usage: Reads the tag status of all DMA transfers selected by the preceding mfc_write_tag_mask call. It is a blocking operation and only completes when all selected DMA transfers have completed.
     • POSIX equivalent: None
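The mask passed to mfc_write_tag_mask has one bit per DMA tag, so a transfer issued with tag 31 is selected by bit 31. The sketch below is plain host-side C that only shows how such a mask might be built. The helper names `tag_bit` and `tag_group_mask` are hypothetical, and tags are assumed to lie in the range 0 to 31. An unsigned shift is used because shifting a signed 1 into bit 31 is not well defined in standard C.

```c
#include <assert.h>
#include <stdint.h>

/* Bit that selects a single DMA tag (assumed range: 0..31). */
uint32_t tag_bit(unsigned tag)
{
    assert(tag < 32);
    return (uint32_t)1 << tag;
}

/* Mask that selects every tag in the given list, so that one blocking
 * status read can wait for a whole group of transfers. */
uint32_t tag_group_mask(const unsigned *tags, int count)
{
    uint32_t mask = 0;
    int i;
    for (i = 0; i < count; ++i)
        mask |= tag_bit(tags[i]);
    return mask;
}
```

On the SPE, the result would be handed to mfc_write_tag_mask before calling mfc_read_tag_status_all.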

  22. PPE Code
         /* Include files and number of threads */
         …
         #define CMPTS 16384                  /* number of elements */
         typedef unsigned long long ea_t;     /* 64-bit data type */
         /* Structure for parameter passing */
         typedef struct {
             ea_t incomming_array1[4];        /* addresses of both incoming arrays */
             ea_t incomming_array2[4];
             ea_t outgoing_array;             /* address of the result array */
             int size;                        /* size of the DMA chunk */
             int id;                          /* thread id */
         } params;
         /* Buffers and variables for DMA communication.
            The attribute makes the variables aligned. */
         int vec1[CMPTS] __attribute__((aligned(16)));
         int vec2[CMPTS] __attribute__((aligned(16)));
         int vect[THREADS * 4] __attribute__((aligned(16)));
         params p[THREADS] __attribute__((aligned(16)));

  23. PPE Code
         …
         /* The dot product function */
         int dotp(int vec1[], int vec2[], int size){
             int i, res = 0;
             for(i = 0; i < size; ++i)
                 res += vec1[i] * vec2[i];
             return res;
         }
         …                                    /* utility functions */
         int main(){
             …                                /* variable declarations */
             size = CMPTS / THREADS;          /* size of a thread chunk */
             …                                /* filling the vectors */
             /* Filling the parameter structure: divide the vector arrays
                (incoming and outgoing), decide the size and assign the thread id */
             for(i = 0; i < THREADS; ++i){
                 for(j = 0; j < 4; ++j){
                     p[i].incomming_array1[j] = (ea_t)(&vec1[i*size + j*(size/4)]);
                     p[i].incomming_array2[j] = (ea_t)(&vec2[i*size + j*(size/4)]);
                 }
                 p[i].outgoing_array = (ea_t)(&vect[i*4 + 0]);
                 p[i].size = size / 4;
                 p[i].id = i;
             }

  24. PPE Code
             …
             /* Thread creation and waiting */
             speid[status] = spe_create_thread(0, &test, (ea_t)(&p[status]), NULL, -1, 0);
             …
             /* Reduce all the thread results to a single value */
             size = 0;
             for(i = 0; i < THREADS; ++i){
                 for(j = 0; j < 4; ++j)
                     size += vect[i*4 + j];
             }
             /* Calculate the correct result in the PPE and compare it
                with the one obtained from the threads */
             rem = dotp(vec1, vec2, CMPTS);
             if(rem == size)
                 printf("Sucess :) with %d \n", rem);
             else
                 printf("Failure :P (%d != %d) \n", rem, size);
             return 0;
         }
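The partition-and-reduce logic of the PPE code can be tried on any host without SPEs. The sketch below is an illustration under assumptions, not the tutorial's program: the per-chunk dot products that the SPEs would compute are evaluated serially, `chunked_dotp` is a hypothetical name, and the thread count is a parameter because the value of THREADS for this example is not shown on the slides. The assertions encode the alignment rule: with 4-byte int elements, each chunk must hold a multiple of 4 elements so that every chunk starts on a 16-byte boundary and has a size that is a multiple of 16 bytes.

```c
#include <assert.h>

/* Same serial dot product as on the slides. */
int dotp(const int *vec1, const int *vec2, int size)
{
    int i, res = 0;
    for (i = 0; i < size; ++i)
        res += vec1[i] * vec2[i];
    return res;
}

/* Split n elements into `threads` blocks of 4 chunks each, compute one
 * partial result per chunk (the SPE's job in the tutorial), and reduce
 * all partial results to a single value (the PPE's job). */
int chunked_dotp(const int *vec1, const int *vec2, int n, int threads)
{
    int size, chunk, i, j, total = 0;

    assert(threads > 0 && n % threads == 0);
    size = n / threads;       /* elements per thread */
    assert(size % 4 == 0);
    chunk = size / 4;         /* elements per DMA chunk */
    assert(chunk % 4 == 0);   /* keeps every chunk 16-byte aligned for int data */

    for (i = 0; i < threads; ++i)
        for (j = 0; j < 4; ++j)
            total += dotp(vec1 + i * size + j * chunk,
                          vec2 + i * size + j * chunk, chunk);
    return total;
}
```

Comparing `chunked_dotp(a, b, n, threads)` with `dotp(a, b, n)` mirrors the success check that the PPE code performs at the end.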

  25. SPE Code
         #include <spu_mfcio.h>               /* the DMA and communication header file */
         /* The mirror image of the PPE parameter structure */
         typedef struct {
             ea_t incomming_array1[4];
             ea_t incomming_array2[4];
             ea_t outgoing_array;
             int size;
             int id;
         } params;
         /* Buffers and variables involved in the DMA transfers */
         params pp __attribute__((aligned(16)));
         int *arr1 __attribute__((aligned(16)));
         int *arr2 __attribute__((aligned(16)));
         int vecr[4] __attribute__((aligned(16)));
         /* The mirror image of the PPE dot product function */
         int dotp(int *vec1, int *vec2, int size){ … }

  26. SPE Code
     The PPE parameter structure is passed as one of the parameters to the main function.
         int main(ea_t speid, ea_t argp, ea_t envp){
             …
             /* The mfc_get that loads the PPE parameter structure */
             mfc_get(&pp, argp, sizeof(params), 31, 0, 0);
             mfc_write_tag_mask(1 << 31);
             mfc_read_tag_status_all();
             …
             /* memalign is used to create aligned blocks in the heap */
             arr1 = (int *)memalign(16, sizeof(int) * size);
             arr2 = (int *)memalign(16, sizeof(int) * size);
             for(i = 0; i < 4; ++i){
                 /* The mfc_get that loads the first vector */
                 mfc_get(arr1, pp.incomming_array1[i], sizeof(int) * size, 31, 0, 0);
                 mfc_write_tag_mask(1 << 31);
                 mfc_read_tag_status_all();
                 …
                 vecr[i] = dotp(arr1, arr2, size);
             }
             /* The mfc_put that stores the results into main memory */
             mfc_put(vecr, pp.outgoing_array, sizeof(int) * 4, 31, 0, 0);
             mfc_write_tag_mask(1 << 31);
             mfc_read_tag_status_all();
             …
             return 0;
         }

  27. Outline • Introduction to CBE • Running Code • Example: Hello, World • Example: Aligned Data • Example: Unaligned data • Example: Mailbox communications • Appendix A: API Parts

  28. Example 3: Unaligned Data
     • To learn about DMA transfers with unaligned data (copying)
     [Figure: mfc_get and mfc_put move data between main memory and local storage.]

  29. PPE Code
     It is the same as in the aligned example.
         …
         int vec1[CMPTS] __attribute__((aligned(32)));
         int vec2[CMPTS] __attribute__((aligned(32)));
         …

  30. SPE Code
     Make sure that the buffers are unaligned.
         …
         int *arr1 __attribute__((aligned(8)));
         int *arr2 __attribute__((aligned(8)));
         …
         arr1 = (int *)memalign(7, sizeof(int) * size);
         arr2 = (int *)memalign(7, sizeof(int) * size);
         …
         /* Replace the mfc_get calls with these calls */
         xfer_data_unaligned(arr1, pp.incomming_array1[i], sizeof(int) * size);
         xfer_data_unaligned(arr2, pp.incomming_array2[i], sizeof(int) * size);

  31. SPE Code
     Get the first chunk of memory that is not aligned (by calculating its address and size).
         unsigned char buff[128] __attribute__((aligned(16)));
         void xfer_data_unaligned(void *ls, GA ga, int size){
             int left;
             unsigned int mask = 0x8;
             GA tga;
             int rem;
             tga = ga & ~0xFULL;              /* aligned address just below ga */
             rem = ga & 0xFULL;               /* offset of ga inside that 16-byte block */
             global_id = (global_id + 1) & 0x1F;
             /* (Over)load the 16 bytes that contain the remainder bytes
                that are not aligned, and wait for the transfer to end */
             mfc_get(buff, tga, 16, global_id, 0, 0);
             mfc_write_tag_mask(1 << global_id);
             mfc_read_tag_status_all();
             /* Copy the temporary buffer to its final location
                (both the buffer and the location should be aligned) */
             memcpy(ls, (void *)((unsigned int)(buff)+(unsigned int)(rem)), 16 - rem);
             ls += 16 - rem;
             size -= 16 - rem;
             ga += (GA)(16 - rem);

  32. SPE Code
     Load the rest of the data in chunks of 16 or 128 bytes, in series.
             while( size >= 16 ){
                 left = (size >= 128) ? 128 : 16;
                 mfc_get(buff, ga, left, global_id, 0, 0);
                 mfc_write_tag_mask(1 << global_id);
                 mfc_read_tag_status_all();
                 /* Copy the temporary buffer to its final location */
                 memcpy(ls, buff, left);
                 ls += left;
                 size -= left;
                 ga += (GA)(left);
             }
             if( size > 0 ){
                 /* (Over)load the last 16 bytes, which contain the final data */
                 mfc_get(buff, ga, 16, global_id, 0, 0);
                 mfc_write_tag_mask(1 << global_id);
                 mfc_read_tag_status_all();
                 memcpy(ls, buff, size);
             }
         }
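The head, body and tail structure of this routine can be exercised on a host machine by replacing the DMA with a checked memcpy. The following sketch is a simulation under assumptions, not SPE code. `main_mem`, `sim_dma_get` and `xfer_unaligned_sim` are hypothetical names, the simulated DMA accepts only 16-byte aligned addresses and sizes that are multiples of 16, and a guard for requests shorter than the unaligned head is added that the slide code does not have.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simulated main memory and the aligned bounce buffer from the slides. */
unsigned char main_mem[512] __attribute__((aligned(16)));
unsigned char buff[128] __attribute__((aligned(16)));

/* Stand-in for an aligned mfc_get followed by a tag wait. */
static void sim_dma_get(void *ls, size_t ea, size_t size)
{
    assert(ea % 16 == 0);                  /* main memory address aligned */
    assert((uintptr_t)ls % 16 == 0);       /* local buffer aligned        */
    assert(size > 0 && size % 16 == 0);    /* whole 16-byte blocks only   */
    assert(ea + size <= sizeof main_mem);
    memcpy(ls, main_mem + ea, size);
}

/* Copy `size` bytes starting at the possibly unaligned offset `ea`. */
void xfer_unaligned_sim(void *dst, size_t ea, size_t size)
{
    unsigned char *out = dst;
    size_t rem = ea & 0xF;                 /* offset inside the 16-byte block */

    if (rem != 0 && size > 0) {
        /* Head: overload the enclosing aligned block, keep the useful tail. */
        size_t head = 16 - rem;
        if (head > size)
            head = size;                   /* added guard for tiny requests */
        sim_dma_get(buff, ea - rem, 16);
        memcpy(out, buff + rem, head);
        out += head; ea += head; size -= head;
    }
    while (size >= 16) {
        /* Body: aligned chunks of 128 or 16 bytes, one after another. */
        size_t left = (size >= 128) ? 128 : 16;
        sim_dma_get(buff, ea, left);
        memcpy(out, buff, left);
        out += left; ea += left; size -= left;
    }
    if (size > 0) {
        /* Tail: overload the last 16 bytes, keep only what is needed. */
        sim_dma_get(buff, ea, 16);
        memcpy(out, buff, size);
    }
}
```

Every simulated transfer passes the alignment assertions even when the requested offset is odd, which is the point of staging the data through the aligned buffer.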

  33. Outline • Introduction to CBE • Running Code • Example: Hello, World • Example: Aligned Data • Example: Unaligned data • Example: Mailbox communications • Appendix A: API Parts

  34. Example 5: Mailbox Communications
     • Learn about mailboxes and their role in SPU communication
     • SPU side: spu_readch, spu_readchcnt, spu_writech
     • PPU side: spe_write_in_mbox, spe_stat_in_mbox, spe_read_out_mbox, spe_stat_out_mbox
     [Figure: one PPU exchanging mailbox messages with eight SPUs.]

  35. The Code
     [Figure: data is loaded from main memory into local storage with mfc_get; the mailbox is used for communication.]

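  36. The Code
     [Figure: message flow between the PPU and an SPU.]
     • SPU: compute, send results, wait for the signal to continue, and finally send the completion signal
     • PPU: wait for a result, send the signal to continue, and repeat until the completion signal arrives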

  37. Functions: API Parts
     int spe_write_in_mbox(speid_t speid, unsigned int data);
     • Type: Function (mailbox communication)
     • Location: libspe.h [PPU side]
     • Input:
       • speid: The identifier of the thread to which the mailbox message will be delivered
       • data: The data that will be passed to the SPE
     • Output:
       • int: Returns 0 on success and -1 on failure
     • Usage: Writes a 32-bit value into a mailbox belonging to a given SPE thread
     • POSIX equivalent: pthread_cond_signal (?)

  38. Functions: API Parts
     int spe_stat_in_mbox(speid_t speid);
     • Type: Function (mailbox communication)
     • Location: libspe.h [PPU side]
     • Input:
       • speid: The identifier of the thread whose mailbox will be checked
     • Output:
       • int: Returns 0 if the mailbox is full, or a positive number that represents the number of entries that are available
     • Usage: Checks the status of the mailbox for a given thread
     • POSIX equivalent: None

  39. Functions: API Parts
     unsigned int spe_read_out_mbox(speid_t speid);
     • Type: Function (mailbox communication)
     • Location: libspe.h [PPU side]
     • Input:
       • speid: The identifier of the thread from which the mailbox entry will be obtained
     • Output:
       • unsigned int: Returns the value from the outbound mailbox of the given SPE, or -1 if no data is available
     • Usage: Returns the element in the SPE outbound mailbox
     • POSIX equivalent: None

  40. Functions: API Parts
     int spe_stat_out_mbox(speid_t speid)
     • Type: Function (mailbox communication)
     • Location: libspe.h [PPU side]
     • Input:
       • speid: The identifier of the thread from which the status will be read
     • Output:
       • int: Returns 0 if the mailbox is empty, or a positive number that represents the number of unread messages in the mailbox
     • Usage: Returns the status of the SPE outbound mailbox
     • POSIX equivalent: None

  41. Functions: API Parts
     spu_readch(imm)
     • Type: Macro (mailbox communication)
     • Location: spu_internals.h [SPE side]
     • Input:
       • imm: A channel identifier
     • Output: Returns the channel contents
     • Usage: Returns the contents of the channel identified by imm
     • POSIX equivalent: None

  42. Functions: API Parts
     spu_readchcnt(imm)
     • Type: Macro (mailbox communication)
     • Location: spu_internals.h [SPE side]
     • Input:
       • imm: A channel identifier
     • Output: Returns the number of entries in this channel
     • Usage: Returns the number of entries in the channel identified by imm
     • POSIX equivalent: None

  43. Functions: API Parts
     spu_writech(imm, ra)
     • Type: Macro (mailbox communication)
     • Location: spu_internals.h [SPE side]
     • Input:
       • imm: A channel identifier
       • ra: The value to be written to the channel
     • Usage: Writes the value ra to the channel identified by imm
     • POSIX equivalent: None

  44. PPE Code
         void check_for_results(speid_t spe[]) {
             unsigned char flags[THREADS];
             int i, val, total = 0;
             for(i = 0; i < THREADS; ++i){
                 flags[i] = 1;
             }
             while (total != THREADS){
                 for(i = 0; i < THREADS; ++i){
                     if(flags[i] == 0) continue;
                     /* Check if a mailbox has been written and then consume the value */
                     if(spe_stat_out_mbox(spe[i]) != 0){
                         val = spe_read_out_mbox(spe[i]);
                         if(val != -1){
                             vect += val;
                             /* Write a signal to tell the SPE that its value has been received */
                             spe_write_in_mbox(spe[i], -1);
                         }
                         else{
                             flags[i] = 0;
                             total++;
                         }
                     }
                 }
             }
         }

  45. PPE Code
         …
         for(status = 0; status < THREADS; ++status){
             speid[status] = spe_create_thread(0, &test, (ea_t)(&p[status]), NULL, -1, 0);
             if(speid[status] == NULL) return 1;
         }
         /* Blocking function that will check for mailbox messages */
         check_for_results(speid);
         for(i = 0; i < THREADS; ++i)
             spe_wait(speid[i], &status, 0);
         …

  46. SPE Code
         /* Send the result and wait for acknowledgment */
         int send_result(int vec) {
             spu_writech(SPU_WrOutMbox, vec);
             while(!spu_readchcnt(SPU_RdInMbox));
             return spu_readch(SPU_RdInMbox);
         }
         /* Send the termination signal and return */
         void send_end() {
             int vec = -1;
             spu_writech(SPU_WrOutMbox, vec);
         }

  47. SPE Code
         for(i = 0; i < 4; ++i){
             …
             vecr = dotp(arr1, arr2, size);
             /* Send partial result to the PPE */
             send_result(vecr);
         }
         …
         /* Send termination signal to the PPE and terminate */
         send_end();
         return 0;
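The handshake between send_result, send_end and check_for_results can be traced without Cell hardware. The sketch below is a single-threaded simulation in plain C, written under assumptions. `mbox_t`, `spe_sim_t`, `spe_step` and `run_mailbox_protocol` are hypothetical names, the outbound mailbox is modelled with one entry and the inbound mailbox with four, and both the acknowledgment and the termination signal use the value -1 (0xFFFFFFFF as an unsigned word), as in the slides. A partial result that happens to equal that value would be mistaken for the termination signal, which is a limitation of the protocol itself.

```c
#include <assert.h>
#include <stdint.h>

#define END_SIGNAL 0xFFFFFFFFu   /* the -1 used as ack and as end marker */

/* Tiny FIFO standing in for a hardware mailbox. */
typedef struct {
    uint32_t slot[4];
    unsigned head, count, capacity;
} mbox_t;

static int mbox_write(mbox_t *m, uint32_t v)   /* 0 on success, -1 if full */
{
    if (m->count == m->capacity) return -1;
    m->slot[(m->head + m->count) % m->capacity] = v;
    m->count++;
    return 0;
}

static int mbox_read(mbox_t *m, uint32_t *v)   /* 0 on success, -1 if empty */
{
    if (m->count == 0) return -1;
    *v = m->slot[m->head];
    m->head = (m->head + 1) % m->capacity;
    m->count--;
    return 0;
}

/* State of one simulated SPE thread. */
typedef struct {
    mbox_t in, out;
    const uint32_t *partials;
    int n, next, waiting_ack, finished;
} spe_sim_t;

/* One scheduling step of the SPE side (send_result / send_end). */
static void spe_step(spe_sim_t *s)
{
    uint32_t ack;
    if (s->finished) return;
    if (s->waiting_ack) {
        if (mbox_read(&s->in, &ack) != 0) return;   /* no ack yet */
        s->waiting_ack = 0;
    }
    if (s->next < s->n) {
        if (mbox_write(&s->out, s->partials[s->next]) == 0) {
            s->next++;
            s->waiting_ack = 1;                     /* wait before sending more */
        }
    } else if (mbox_write(&s->out, END_SIGNAL) == 0) {
        s->finished = 1;
    }
}

/* PPE side (check_for_results for a single SPE): sum the partial results. */
uint32_t run_mailbox_protocol(const uint32_t *partials, int n)
{
    spe_sim_t spe = { .in = { .capacity = 4 }, .out = { .capacity = 1 },
                      .partials = partials, .n = n };
    uint32_t total = 0, val;

    assert(n >= 0);
    for (;;) {
        spe_step(&spe);
        if (spe.out.count == 0) continue;     /* nothing to consume yet    */
        mbox_read(&spe.out, &val);
        if (val == END_SIGNAL) break;         /* SPE has finished          */
        total += val;
        mbox_write(&spe.in, END_SIGNAL);      /* signal the SPE to continue */
    }
    return total;
}
```

Because the simulated SPE waits for an acknowledgment after every result, the one-entry outbound mailbox never overflows, which is the reason the tutorial's send_result blocks on the inbound mailbox.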

  48. One More Thing!!!! The output of all programs is: Sucess :) with 6020

  49. Outline • Introduction to CBE • Running Code • Example: Hello, World • Example: Aligned Data • Example: Unaligned data • Example: Double Buffer • Example: Mailbox communications • Appendix A: API Parts

  50. Appendix A • API Parts • Structures • Function to create and manage threads • PPU / SPU DMA Functions • PPU / SPU mailbox functions
