

A Framework for Parallelizing Load/Stores on Embedded Processors

Xiaotong Zhuang

Santosh Pande

John S. Greenland Jr.

College of Computing, Georgia Tech


Background and Motivation

  • Speed gap between memory and CPU remains

  • Multi-bank memory architecture: Motorola DSP56000 series, NEC 77016, SONY pDSP, Analog Devices ADSP-210x, Starcore SC140 processor core

  • Parallel instructions allow parallel access to memory banks: PLDXY r1, @a, r2, @b loads @a into r1 and @b into r2 at the same time.

  • Objective:

    • Try to maximally generate parallel Load/Store (such as PLDXY) instructions through compiler optimizations.

    • Controlled code & data segment growth

    • Reasonable speed of compilation


General approaches

  • Model as ILP problem--Rainer Leupers, Daniel Kotte, “Variable partitioning for dual memory bank DSPs”, ICASSP, May’01

    • Variables Ni with value 0/1 for each LD/ST instr. to represent its memory bank assignment (X or Y)

    • Variables Eij with value 0/1 to represent whether two instructions can be merged

    • Enforce the other constraints and maximize the total weight of the selected edges

  • Model as Graph problem--A.Sudarsanam, S.Malik, “Simultaneous Reference Allocation in Code Generation for Dual Data Memory Bank ASIPs”, TODAES, Apr’00

    • Each Load/Store as a node

    • Edge between nodes represents they can be merged

    • Pick the maximum number of pairwise disjoint edges (no two picked edges share a node)
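
The graph formulation above can be illustrated with a small sketch. It is not the algorithm from the cited paper: it is an exhaustive search, practical only for tiny graphs, and the function name `max_disjoint_edges` and the node labels are hypothetical.

```python
def max_disjoint_edges(edges):
    """Return a largest set of edges in which no two edges share a node.

    Exhaustive search: for the first edge, try both skipping it and taking it.
    """
    if not edges:
        return []
    (u, v), rest = edges[0], edges[1:]
    # Option 1: skip the first edge.
    best = max_disjoint_edges(rest)
    # Option 2: take it, then drop every remaining edge that touches u or v.
    compatible = [e for e in rest if u not in e and v not in e]
    taken = [(u, v)] + max_disjoint_edges(compatible)
    return taken if len(taken) > len(best) else best


# Hypothetical example: four Load/Stores, an edge means "these two can be merged".
mergeable = [("ld1", "ld2"), ("ld2", "ld3"), ("ld3", "st4")]
picked = max_disjoint_edges(mergeable)
```

Here two merges are possible (ld1 with ld2, ld3 with st4); picking the middle edge first would allow only one.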


Major contributions

  • Keep the model simple and easy to solve mathematically

  • Identify the movable boundary problem, which impedes the problem modeling and simplification

  • Propose Motion Schedule Graph (MSG) and two approaches to solve it heuristically

  • Merge with instruction duplication and variable duplication

  • Cross basic block merges

  • Other improvements like local conflict elimination through rematerialization and some global optimization issues

  • An iterative approach, which systematically grows the code segment and then the data segment minimally.


Basic concepts (1)

  • Post-pass approach: assumes a good register allocator has already been run (Appel & George’s register allocation algorithm)

  • Alias analysis

    • Disambiguation of memory access instructions

    • Most aliases can be uniquely determined in our benchmark programs

  • Memory access instructions

    • ST[addr],r is the definition of a memory address

    • LD[addr],r is the use of a memory address

    • For base-offset Load/Store instructions, normally used for arrays, arrays are assumed to be inseparable, and additional register conflicts are taken into account.

  • Dependencies

    • Address conflicts

    • Register conflicts
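
A minimal sketch of the two dependency kinds, assuming the usual data-dependence reading (the slide does not spell out the rules): two accesses have an address conflict when they touch the same address and at least one is a store, and a register conflict when they name the same register and at least one is a load, which writes it. The `MemOp` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    kind: str  # "LD" (uses the address, defines the register) or "ST" (the reverse)
    addr: str
    reg: str


def address_conflict(a: MemOp, b: MemOp) -> bool:
    # Same address and at least one of the two writes memory.
    return a.addr == b.addr and "ST" in (a.kind, b.kind)


def register_conflict(a: MemOp, b: MemOp) -> bool:
    # Same register and at least one of the two writes it (a load does).
    return a.reg == b.reg and "LD" in (a.kind, b.kind)


st_a = MemOp("ST", "a", "r1")
ld_a = MemOp("LD", "a", "r2")
ld_b = MemOp("LD", "b", "r2")
```

Under these assumptions, `st_a` and `ld_a` cannot be reordered (address conflict), and `ld_a` and `ld_b` cannot be reordered either, because both write r2.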


Basic concepts (2)

  • Building Webs

    • Webs: maximal unions of du-chains. All variable defs/uses on a web MUST be allocated to the same memory location

    • A variable that appears in separate webs can be placed in different memory locations

    • Achieve value separation
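
Web construction can be sketched with a union-find structure: every du-chain links a definition to its uses, and chains that share a definition or a use end up in the same web. This is an illustrative sketch, not the authors' implementation; `build_webs` and the instruction labels are hypothetical.

```python
def build_webs(du_chains):
    """Group defs and uses into webs.

    du_chains: list of (definition, [uses]) pairs.
    Returns a sorted list of webs, each a sorted list of instruction labels.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for definition, uses in du_chains:
        root = find(definition)  # registers the def even if it has no uses
        for use in uses:
            parent[find(use)] = root
            root = find(definition)

    webs = {}
    for label in list(parent):
        webs.setdefault(find(label), set()).add(label)
    return sorted(sorted(web) for web in webs.values())


# Hypothetical example: ld3 is reached by both st1 and st4, so they share a web.
chains = [("st1", ["ld2", "ld3"]), ("st4", ["ld3"]), ("st5", ["ld6"])]
webs = build_webs(chains)
```

The second web (st5, ld6) is independent of the first, so it may be placed in a different memory location and bank even if both webs belong to the same source variable.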

  • Motion range determination

    • Defined as the interval between the program points to which a Load/Store can be legally moved, as constrained by dependencies

    • Load/Store instructions with overlapping ranges MAY be merged

    • Caveat: the movable boundary problem
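
A sketch of motion range determination inside one straight-line block, under simplifying assumptions that are not from the slides: every instruction is described by the resources (registers or addresses) it reads and writes, and all other instructions are treated as fixed. The names `conflicts` and `motion_range` are hypothetical.

```python
def conflicts(a, b):
    """True if a and b may not be reordered (one writes what the other reads or writes)."""
    _, a_reads, a_writes = a
    _, b_reads, b_writes = b
    return bool(a_writes & (b_reads | b_writes) or b_writes & a_reads)


def motion_range(block, i):
    """Return (lo, hi): the positions instruction i can occupy without crossing a dependency."""
    lo = i
    while lo > 0 and not conflicts(block[lo - 1], block[i]):
        lo -= 1
    hi = i
    while hi < len(block) - 1 and not conflicts(block[hi + 1], block[i]):
        hi += 1
    return lo, hi


# Each instruction: (text, resources read, resources written).
block = [
    ("add r1, r0",  {"r0"}, {"r1"}),
    ("LD [a], r2",  {"a"},  {"r2"}),
    ("mul r3, r1",  {"r1"}, {"r3"}),
    ("add r4, r2",  {"r2"}, {"r4"}),
]
load_range = motion_range(block, 1)
```

The load can move up to position 0 and down to position 2, but not past the instruction that reads r2. Treating the neighbours as fixed is exactly the simplification that breaks down when a boundary is itself a movable Load/Store.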


Movable boundary problem

  • The motion boundary of one Load/Store instruction is also a Load/Store instruction

  • Assuming a fixed boundary can lead to incorrect merges


Motion schedule graph

  • Pseudo fixed-boundary

    • For Store: move as early as possible assuming other instructions are fixed

    • For Load: move as late as possible assuming other instructions are fixed

  • Motion Schedule Graph

    • Nodes represent individual Load/Store instructions

    • Oval encloses Load/Store on the same web

    • Edges link nodes that have overlapping motion ranges (with respect to pseudo fixed-boundaries)
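
The structure of the graph can be sketched as follows, assuming the motion ranges (already computed with respect to the pseudo fixed-boundaries) and the web of each instruction are given as inputs. The function `build_msg` and its dictionary layout are hypothetical, not the authors' data structure.

```python
from itertools import combinations


def build_msg(ranges, web_of):
    """Build a Motion Schedule Graph.

    ranges: {instruction: (lo, hi)} inclusive motion ranges.
    web_of: {instruction: web id}.
    Returns (nodes, edges, ovals), where ovals groups the nodes by web.
    """
    nodes = sorted(ranges)
    edges = [
        (a, b)
        for a, b in combinations(nodes, 2)
        # Two inclusive intervals overlap when the later start is not after the earlier end.
        if max(ranges[a][0], ranges[b][0]) <= min(ranges[a][1], ranges[b][1])
    ]
    ovals = {}
    for node in nodes:
        ovals.setdefault(web_of[node], []).append(node)
    return nodes, edges, ovals


ranges = {"ld1": (0, 3), "ld2": (2, 5), "st3": (6, 8)}
web_of = {"ld1": "web_a", "ld2": "web_b", "st3": "web_a"}
nodes, edges, ovals = build_msg(ranges, web_of)
```

In this example only ld1 and ld2 overlap, so the graph has a single edge, while ld1 and st3 share an oval because they belong to the same web.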


Conflict resolution




Graph solving

  • The whole problem is provably NP-complete (refer to Appendix A)

  • Two separate problems: Bank Assignment and Edge Picking

  • For predetermined bank assignments, the Edge Picking problem can be optimally solved in polynomial time

  • Heuristic algorithms

    • Brute-force search takes O(|V|^3 · 2^n) time, which is feasible for small programs

    • Simulated annealing (SA) can approach the optimal solution but greatly increases the compilation time

    • Use a heuristic to solve bank assignment, then obtain the optimal solution for Edge Picking
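
The split into Bank Assignment and Edge Picking can be sketched as below. The sketch assumes that, once banks are fixed, only edges joining an X node to a Y node are usable, which makes the graph bipartite; it then finds a maximum matching with augmenting paths (equivalent to a unit-capacity max flow). The exact flow construction used in the slides is not in the transcript, and `pick_edges` and `brute_force` are hypothetical names.

```python
from itertools import product


def pick_edges(edges, bank):
    """Maximum set of disjoint edges whose endpoints lie in different banks."""
    adj = {}
    for u, v in edges:
        if bank[u] == bank[v]:
            continue  # same bank: cannot be accessed in parallel
        x, y = (u, v) if bank[u] == "X" else (v, u)
        adj.setdefault(x, []).append(y)
    match = {}  # Y node -> X node it is paired with

    def augment(x, seen):
        for y in adj.get(x, []):
            if y in seen:
                continue
            seen.add(y)
            if y not in match or augment(match[y], seen):
                match[y] = x
                return True
        return False

    for x in adj:
        augment(x, set())
    return sorted((x, y) for y, x in match.items())


def brute_force(edges, web_of):
    """Try every bank assignment per web; keep the one that allows the most merges."""
    webs = sorted(set(web_of.values()))
    best = []
    for banks in product("XY", repeat=len(webs)):
        web_bank = dict(zip(webs, banks))
        bank = {node: web_bank[web] for node, web in web_of.items()}
        picked = pick_edges(edges, bank)
        if len(picked) > len(best):
            best = picked
    return best


web_of = {"ld1": "a", "ld2": "b", "ld3": "c", "st4": "a"}
edges = [("ld1", "ld2"), ("ld2", "ld3"), ("ld3", "st4")]
best = brute_force(edges, web_of)
```

The enumeration visits 2^k assignments for k webs, which is why it is only workable for small inputs; a heuristic would replace `brute_force` and call `pick_edges` once on its chosen assignment.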


Edge Picking as max flow problem


Bank assignment heuristic


Post-pass phases


Cross BB merge (Instr. duplication)

  • Move to predecessor/successor to create new opportunities

  • To guarantee profitability

    • Move to where the reference is live

    • Move ST along the extended basic block (EBB)

    • Move LD along the reverse EBB

    • Make sure the instruction can be combined when pushed to at least one of the live predecessors/successors


Variable duplication


Local conflict elimination

  • Motivation

    • The register allocator may assign the same register to neighboring ranges, which leads to register conflicts

    • ISA restrictions may require particular registers that are not available at the program point

  • Use rematerialization: free a register so that it is available for the merge, then reconstruct its value after the merge.


Merge type and MSG properties


Compilation time


Runtime performance


Code size comparison



Conclusion

  • A framework to analyze and merge LD/STs.

  • Our heuristic approach comes close to exhaustive search while taking less compilation time.

  • Variable and instruction duplication extend the instructions' range of motion, so the quality of the generated code is superior to that of the exhaustive methods previously proposed.

