A Framework for Parallelizing Load/Stores on Embedded Processors

Xiaotong Zhuang

Santosh Pande

John S. Greenland Jr.

College of Computing, Georgia Tech


Background and Motivation
  • Speed gap between memory and CPU remains
  • Multi-bank memory architecture: Motorola DSP56000 series, NEC 77016, SONY pDSP, Analog Devices ADSP-210x, Starcore SC140 processor core
  • Parallel instructions allow parallel access to memory banks: PLDXY r1, @a, r2, @b loads @a into r1 and @b into r2 at the same time.
  • Objective:
    • Generate as many parallel Load/Store instructions (such as PLDXY) as possible through compiler optimizations.
    • Controlled code & data segment growth
    • Reasonable speed of compilation


General approaches
  • Model as ILP problem--Rainer Leupers, Daniel Kotte, “Variable partitioning for dual memory bank DSPs”, ICASSP, May’01
    • Variables Ni with value 0/1 for each LD/ST instr. to represent its memory bank assignment (X or Y)
    • Variables Eij with value 0/1 to represent whether two instructions can be merged
    • Enforce the remaining constraints and maximize the total weight of the selected edges
  • Model as Graph problem--A.Sudarsanam, S.Malik, “Simultaneous Reference Allocation in Code Generation for Dual Data Memory Bank ASIPs”, TODAES, Apr’00
    • Each Load/Store as a node
    • An edge between two nodes means they can be merged
    • Pick the maximum number of disjoint edges


Major contributions
  • Keep the model simple and easy to solve mathematically
  • Identify the movable boundary problem, which impedes problem modeling and simplification
  • Propose the Motion Schedule Graph (MSG) and two approaches to solve it heuristically
  • Merge with instruction duplication and variable duplication
  • Cross basic block merges
  • Other improvements like local conflict elimination through rematerialization and some global optimization issues
  • An iterative approach, which systematically grows the code segment and then the data segment minimally.


Basic concepts (1)
  • Post-pass approach: assumes a good register allocator has already been used (Appel & George’s register allocation algorithm)
  • Alias analysis
    • Disambiguation of memory access instructions
    • Most aliases can be uniquely determined in our benchmark programs
  • Memory access instructions
    • ST[addr],r is the definition of a memory address
    • LD[addr],r is the use of a memory address
    • For base-offset Load/Store instructions, normally used for arrays, arrays are assumed to be inseparable and more register conflicts must be considered.
  • Dependencies
    • Address conflicts
    • Register conflicts


Basic concepts (2)
  • Building Webs
    • Webs: maximal unions of du-chains. All defs/uses of a variable on the same web MUST be allocated to the same memory location
    • A variable that appears in separate webs can be placed in different memory locations
    • Achieve value separation
  • Motion range determination
    • Defined as interval between program points where a Load/Store can be legally moved, restrained by dependencies
    • Load/Store instructions with overlapping range MAY be merged
    • Note the Movable Boundary problem
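
Web construction can be sketched with a union-find over du-chains. This is only an illustration under stated assumptions: each du-chain is given as a hypothetical `(def, use)` pair of instruction labels for a single variable, and chains that share a def or a use are merged into the same web.

```python
def build_webs(du_chains):
    """Group instruction labels into webs (maximal unions of du-chains)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Every du-chain ties its definition and its use to the same web.
    for definition, use in du_chains:
        union(definition, use)

    webs = {}
    for label in parent:
        webs.setdefault(find(label), set()).add(label)
    return sorted(sorted(web) for web in webs.values())

# Hypothetical du-chains for one variable: stores s1 and s2 both reach
# load l2, so they share a web; s3/l3 form a separate web.
chains = [("s1", "l1"), ("s1", "l2"), ("s2", "l2"), ("s3", "l3")]
print(build_webs(chains))
```

In this example the variable ends up in two webs, which is what lets its two value ranges be placed in different memory locations.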


Movable boundary problem
  • The motion boundary of a Load/Store instruction may itself be a Load/Store instruction, which can also move
  • Assuming a fixed boundary can cause incorrect merges


Motion schedule graph
  • Pseudo fixed-boundary
    • For Store: move as early as possible assuming other instructions are fixed
    • For Load: move as late as possible assuming other instructions are fixed
  • Motion Schedule Graph
    • Nodes represent individual Load/Store instructions
    • Oval encloses Load/Store on the same web
    • Edges link nodes whose motion ranges overlap (with respect to pseudo fixed-boundaries)
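
The edge rule can be sketched as an interval-overlap test. This is a minimal sketch under stated assumptions: each instruction's motion range is already known and given as a hypothetical inclusive `(earliest, latest)` pair of program points, and web membership (the ovals) is left out.

```python
from itertools import combinations

def build_msg_edges(ranges):
    """Return MSG edges: pairs of instructions with overlapping motion ranges.

    ranges maps an instruction label to an inclusive (earliest, latest)
    interval of program points (an assumed representation).
    """
    edges = []
    for a, b in combinations(sorted(ranges), 2):
        (lo_a, hi_a), (lo_b, hi_b) = ranges[a], ranges[b]
        # Two closed intervals overlap when each starts before the other ends.
        if lo_a <= hi_b and lo_b <= hi_a:
            edges.append((a, b))
    return edges

# Hypothetical motion ranges with respect to pseudo fixed-boundaries.
ranges = {"ld_a": (1, 4), "ld_b": (3, 6), "st_c": (5, 8), "ld_d": (9, 10)}
print(build_msg_edges(ranges))
```

An edge only says that two instructions MAY be merged. Whether a merge is actually legal still depends on the bank assignment and on the movable boundary problem.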


Graph solving
  • The whole problem is provably NP-complete (see Appendix A)
  • Two separate problems: Bank Assignment and Edge Picking
  • For predetermined bank assignments, the Edge Picking problem can be optimally solved in polynomial time
  • Heuristic algorithms
    • Brute-force search takes O(|V|^3 · 2^n) time, which is feasible only for small programs
    • Simulated annealing (SA) can approach the optimal solution but greatly increases compilation time
    • Use heuristic to solve bank assignment, then get optimal solution for Edge Picking
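
The split into Bank Assignment and Edge Picking can be sketched with the brute-force variant, which is practical only for tiny inputs. Several things here are assumptions for illustration: the names `web_of` and `best_assignment`, the rule that an edge is usable only when its endpoints' webs sit in different banks (X and Y), and the exhaustive edge picker, which stands in for the polynomial-time algorithm mentioned above.

```python
from itertools import combinations, product

def pick_edges(edges):
    """Exhaustively find a largest set of edges that share no node."""
    for k in range(len(edges), 0, -1):
        for subset in combinations(edges, k):
            nodes = [n for edge in subset for n in edge]
            if len(nodes) == len(set(nodes)):
                return list(subset)
    return []

def best_assignment(web_of, edges):
    """Try all 2^n bank assignments of the webs and keep the best one."""
    webs = sorted(set(web_of.values()))
    best_banks, best_picked = {}, []
    for banks in product("XY", repeat=len(webs)):
        bank_of = dict(zip(webs, banks))
        # Assumption: a parallel Load/Store needs one access per bank.
        usable = [(a, b) for a, b in edges
                  if bank_of[web_of[a]] != bank_of[web_of[b]]]
        picked = pick_edges(usable)
        if len(picked) > len(best_picked):
            best_banks, best_picked = bank_of, picked
    return best_banks, best_picked

# Hypothetical input: ld1 and st1 belong to the same web w1.
web_of = {"ld1": "w1", "st1": "w1", "ld2": "w2", "ld3": "w3"}
edges = [("ld1", "ld2"), ("st1", "ld3"), ("ld2", "ld3")]
bank_of, picked = best_assignment(web_of, edges)
print(bank_of, picked)
```

Replacing the outer loop with a heuristic bank assignment keeps the Edge Picking step unchanged, which is the structure the last bullet describes.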


Cross BB merge (Instr. duplication)
  • Move to predecessor/successor to create new opportunities
  • To guarantee profitability
    • Move to where the reference is live
    • Move ST on EBB
    • Move LD on reverse EBB
    • Ensure the instruction can be merged when pushed to at least one of the live predecessors/successors


Local conflict elimination
  • Motivation
    • The register allocator may assign the same register to neighboring ranges, which leads to register conflicts
    • ISA restrictions may require particular registers that are not available at the program point
  • Use rematerialization to free a register for the merge, then reconstruct its value after the merge.


  • A framework to analyze and merge LD/STs.
  • Our heuristic approach comes close to the results of exhaustive search while requiring less compilation time.
  • Variable and instruction replication enlarges the motion range of instructions, so the generated code is superior in quality to that of the exhaustive methods previously proposed.