
A Framework for Parallelizing Load/Stores on Embedded Processors

Xiaotong Zhuang

Santosh Pande

John S. Greenland Jr.

College of Computing, Georgia Tech


Background and Motivation

  • Speed gap between memory and CPU remains

  • Multi-bank memory architecture: Motorola DSP56000 series, NEC 77016, SONY pDSP, Analog Devices ADSP-210x, Starcore SC140 processor core

  • Parallel instructions allow parallel access to memory banks: PLDXY r1, @a, r2, @b loads @a into r1 and @b into r2 at the same time.

  • Objective:

    • Generate as many parallel Load/Store instructions (such as PLDXY) as possible through compiler optimizations.

    • Controlled code & data segment growth

    • Reasonable speed of compilation


General approaches

  • Model as an ILP problem: Rainer Leupers, Daniel Kotte, “Variable partitioning for dual memory bank DSPs”, ICASSP, May ’01

    • Variables Ni with value 0/1 for each LD/ST instr. to represent its memory bank assignment (X or Y)

    • Variables Eij with value 0/1 to represent whether two instructions can be merged

    • Enforce the remaining constraints and maximize the total weight of the selected edges

  • Model as a graph problem: A. Sudarsanam, S. Malik, “Simultaneous Reference Allocation in Code Generation for Dual Data Memory Bank ASIPs”, TODAES, Apr ’00

    • Each Load/Store is a node

    • An edge between two nodes means they can be merged

    • Pick the maximum number of pairwise disjoint edges


Major contributions

  • Keep the model simple and easy to solve mathematically

  • Identify the movable boundary problem, which impedes modeling and simplifying the problem

  • Propose the Motion Schedule Graph (MSG) and two heuristic approaches to solving it

  • Merge with instruction duplication and variable duplication

  • Cross basic block merges

  • Other improvements like local conflict elimination through rematerialization and some global optimization issues

  • An iterative approach, which systematically grows the code segment and then the data segment minimally.


Basic concepts (1)

  • Post-pass approach: assumes a good register allocator has already run (Appel & George’s register allocation algorithm)

  • Alias analysis

    • Disambiguation of memory access instructions

    • Most aliases can be uniquely determined in our benchmark programs

  • Memory access instructions

    • ST[addr],r is the definition of a memory address

    • LD[addr],r is the use of a memory address

    • For base-offset Load/Store instructions, normally used for arrays, arrays are assumed to be inseparable, and more register conflicts are considered.

  • Dependencies

    • Address conflicts

    • Register conflicts
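The two kinds of dependency listed above can be sketched as a pair of checks. This is only an illustration under stated assumptions, not the authors' implementation: the `MemOp` record and both helper functions are hypothetical names, addresses are treated as symbolic names already disambiguated by alias analysis, and only conflicts between two memory instructions are modeled (conflicts with other instructions that read or write the same register are left out).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemOp:
    kind: str  # "LD" (use of a memory address) or "ST" (definition of one)
    addr: str  # symbolic address, assumed already disambiguated
    reg: str   # register written by an LD or read by an ST

def address_conflict(a: MemOp, b: MemOp) -> bool:
    """Same address and at least one ST: reordering could change the value seen."""
    return a.addr == b.addr and "ST" in (a.kind, b.kind)

def register_conflict(a: MemOp, b: MemOp) -> bool:
    """Same register and at least one LD writes it: reordering could clobber it."""
    return a.reg == b.reg and "LD" in (a.kind, b.kind)
```

Two loads from the same address, or two stores that read the same register, do not conflict under this simplified model, so neither restricts the other's motion.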


Basic concepts (2)

  • Building Webs

    • Webs: maximal unions of du-chains. All defs/uses of a variable on the same web MUST be allocated to the same memory location

    • A variable that appears in separate webs can be placed in different memory locations

    • Achieve value separation

  • Motion range determination

    • Defined as the interval between program points within which a Load/Store can be legally moved, as restricted by dependencies

    • Load/Store instructions with overlapping ranges MAY be merged

    • Note the movable boundary problem
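Web building as described above can be sketched with a union-find structure: du-chains that share a def or a use are merged into one web. This is a minimal sketch under assumptions, not the authors' code; `build_webs` and the input shape (one `(def_id, [use_id, ...])` pair per du-chain of a single variable) are hypothetical.

```python
def build_webs(du_chains):
    """Group instruction ids into webs by unioning du-chains that share a def or use.

    du_chains: iterable of (def_id, [use_id, ...]) pairs for one variable.
    Returns a list of sets, one set of instruction ids per web.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for d, uses in du_chains:
        find(d)  # register the def even if it has no uses
        for u in uses:
            union(d, u)

    webs = {}
    for x in list(parent):
        webs.setdefault(find(x), set()).add(x)
    return list(webs.values())
```

For instance, if defs `d1` and `d2` both reach use `u2`, they end up on one web and must share a memory location, while an unrelated `d3`/`u3` chain forms a second web that is free to go elsewhere.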


Movable boundary problem

  • The motion boundary of one Load/Store instruction may itself be a Load/Store instruction, which can also move

  • Assuming fixed boundaries will cause incorrect merges


Motion schedule graph

  • Pseudo fixed-boundary

    • For Store: move as early as possible assuming other instructions are fixed

    • For Load: move as late as possible assuming other instructions are fixed

  • Motion Schedule Graph

    • Nodes represent individual Load/Store instructions

    • An oval encloses the Load/Stores on the same web

    • Edges link nodes that have overlapping motion ranges (with respect to pseudo fixed-boundaries)
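The graph construction above can be sketched as follows. The sketch assumes each motion range has already been computed under pseudo fixed-boundaries and is given as a closed interval of integer program points; `build_msg` and its argument shapes are hypothetical, and how the ranges themselves are derived is not shown.

```python
from itertools import combinations

def build_msg(ranges, web_of):
    """Build a Motion Schedule Graph from precomputed motion ranges.

    ranges: {node: (lo, hi)} closed interval of program points (assumed given).
    web_of: {node: web_id}.
    Returns (edges, ovals): edges between nodes with overlapping ranges,
    and the grouping of nodes by web (the ovals on the slide).
    """
    edges = set()
    for a, b in combinations(sorted(ranges), 2):
        (alo, ahi), (blo, bhi) = ranges[a], ranges[b]
        if alo <= bhi and blo <= ahi:  # closed intervals overlap
            edges.add((a, b))

    ovals = {}
    for node, web in web_of.items():
        ovals.setdefault(web, set()).add(node)
    return edges, ovals
```

An edge here only records that two instructions MAY be merged; whether they actually can be depends on the bank assignment and on the other merges chosen.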


Conflict resolution


Example


Graph solving

  • The whole problem is provably NP-complete (see Appendix A)

  • Two separate problems: Bank Assignment and Edge Picking

  • For predetermined bank assignments, the Edge Picking problem can be optimally solved in polynomial time

  • Heuristic algorithms

    • Brute-force search takes O(|V|^3 · 2^n) time, which is feasible only for small programs

    • Simulated annealing (SA) can approach the optimal solution but greatly increases the compilation time

    • Use a heuristic to solve Bank Assignment, then solve Edge Picking optimally
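The split into Bank Assignment and Edge Picking can be illustrated with a small brute-force sketch. It rests on simplifying assumptions that are not taken from the slides: banks are assigned per web, an edge is usable only when its endpoints lie in different banks (X and Y), edges are unweighted, and picking one edge never invalidates another (so the movable boundary problem is ignored). Under those assumptions the usable edges form a bipartite graph, and a standard augmenting-path matching picks a maximum set of disjoint edges in polynomial time. The names `pick_edges` and `brute_force` are hypothetical.

```python
from itertools import product

def pick_edges(edges, bank):
    """Maximum set of disjoint X-Y edges for a fixed bank assignment."""
    adj = {}
    for a, b in edges:
        if bank[a] == bank[b]:
            continue  # same bank: the pair cannot be accessed in parallel
        x, y = (a, b) if bank[a] == "X" else (b, a)
        adj.setdefault(x, []).append(y)

    match = {}  # Y-bank node -> X-bank node it is paired with

    def augment(x, seen):
        for y in adj.get(x, []):
            if y in seen:
                continue
            seen.add(y)
            if y not in match or augment(match[y], seen):
                match[y] = x
                return True
        return False

    for x in adj:
        augment(x, set())
    return {(x, y) for y, x in match.items()}

def brute_force(webs, edges):
    """Try every X/Y assignment of webs; webs maps web_id -> list of nodes."""
    best, best_bank = set(), None
    web_ids = sorted(webs)
    for choice in product("XY", repeat=len(web_ids)):
        bank = {n: b for w, b in zip(web_ids, choice) for n in webs[w]}
        picked = pick_edges(edges, bank)
        if best_bank is None or len(picked) > len(best):
            best, best_bank = picked, dict(zip(web_ids, choice))
    return best, best_bank
```

The outer loop is exponential in the number of webs, which is why a heuristic for Bank Assignment followed by exact Edge Picking is the more practical route for larger programs.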


Post-pass phases


Cross BB merge (Instr. duplication)

  • Move instructions into predecessor/successor blocks to create new merge opportunities

  • To guarantee profitability

    • Move to where the reference is live

    • Move ST on EBB

    • Move LD on reverse EBB

    • Ensure the instruction can be combined when pushed to at least one of the live predecessors/successors


Variable duplication


Local conflict elimination

  • Motivation

    • The register allocator may assign the same register to neighboring ranges, which leads to register conflicts

    • ISA restrictions may require particular registers that are not available at the program point

  • Use rematerialization to free a register for the merge, then reconstruct its value afterwards.


Compilation time


Runtime performance


Code size comparison


Conclusion

  • A framework to analyze and merge LD/STs.

  • Our heuristic approach comes close to exhaustive search with less compilation time.

  • Variable and instruction replication enlarge the instructions' range of motion, so the generated code is of higher quality than that of the exhaustive methods previously proposed.