
A Framework for Parallelizing Load/Stores on Embedded Processors

Xiaotong Zhuang

Santosh Pande

John S. Greenland Jr.

College of Computing, Georgia Tech


Background and Motivation

  • Speed gap between memory and CPU remains

  • Multi-bank memory architecture: Motorola DSP56000 series, NEC 77016, SONY pDSP, Analog Devices ADSP-210x, Starcore SC140 processor core

  • Parallel instructions allow parallel access to memory banks: PLDXY r1, @a, r2, @b loads @a into r1 and @b into r2 at the same time.

  • Objective:

    • Try to maximally generate parallel Load/Store (such as PLDXY) instructions through compiler optimizations.

    • Controlled code & data segment growth

    • Reasonable speed of compilation


General approaches

  • Model as an ILP problem: Rainer Leupers, Daniel Kotte, “Variable partitioning for dual memory bank DSPs”, ICASSP, May ’01

    • Variables Ni with value 0/1 for each LD/ST instr. to represent its memory bank assignment (X or Y)

    • Variables Eij with value 0/1 to represent whether two instructions can be merged

    • Enforce the remaining constraints and maximize the total weight of the selected edges

  • Model as a graph problem: A. Sudarsanam, S. Malik, “Simultaneous Reference Allocation in Code Generation for Dual Data Memory Bank ASIPs”, TODAES, Apr ’00

    • Each Load/Store as a node

    • Edge between nodes represents they can be merged

    • Pick maximal number of edges that are disjoint
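The graph formulation above (picking as many pairwise disjoint edges as possible) is a maximum matching problem. The following is a minimal sketch of that idea only, not the cited authors' algorithm: it uses exhaustive search, and the instruction names are hypothetical.

```python
from itertools import combinations

def max_disjoint_edges(edges):
    """Return a largest set of edges in which no node appears twice.

    Exhaustive search, so only suitable for tiny illustrative graphs.
    """
    for size in range(len(edges), 0, -1):
        for subset in combinations(edges, size):
            nodes = [n for edge in subset for n in edge]
            if len(nodes) == len(set(nodes)):  # no shared endpoint
                return list(subset)
    return []

# Hypothetical load/store nodes; an edge means "these two may be merged".
edges = [("ld_a", "ld_b"), ("ld_b", "st_c"), ("st_c", "ld_d")]
pairs = max_disjoint_edges(edges)
print(pairs)
```

Here the middle edge is dropped so that the two outer pairs can both be merged.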


Major contributions

  • Keep the model simple and easy to solve mathematically

  • Identify the movable boundary problem, which impedes the problem modeling and simplification

  • Propose Motion Schedule Graph (MSG) and two approaches to solve it heuristically

  • Merge with instruction duplication and variable duplication

  • Cross basic block merges

  • Other improvements like local conflict elimination through rematerialization and some global optimization issues

  • An iterative approach, which systematically grows the code segment and then the data segment minimally.


Basic concepts (1)

  • Post-pass approach: assumes a good register allocator has already been run (Appel & George’s register allocation algorithm)

  • Alias analysis

    • Memory access instruction disambiguation

    • Most aliases can be uniquely determined in our benchmark programs

  • Memory access instructions

    • ST[addr],r is the definition of a memory address

    • LD[addr],r is the use of a memory address

    • For base-offset Load/Store instructions, normally used for arrays, assume each array is inseparable; more register conflicts must then be considered.

  • Dependencies

    • Address conflicts

    • Register conflicts


Basic concepts (2)

  • Building Webs

    • Webs: maximal unions of du-chains. All variable defs/uses on a web MUST be allocated to the same memory location

    • A variable that appears in separate webs can be placed in different memory locations

    • Achieve value separation

  • Motion range determination

    • Defined as interval between program points where a Load/Store can be legally moved, restrained by dependencies

    • Load/Store instructions with overlapping range MAY be merged

    • Beware of the Movable Boundary problem
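Web construction can be sketched with a union-find structure: du-chains that share a definition or a use are merged until no more merging is possible. This is an illustrative sketch under assumed input conventions (a dictionary from definition labels to the uses they reach), not the presenters' implementation; all labels are hypothetical.

```python
def build_webs(du_chains):
    """Group du-chains into webs: chains sharing any def or use end up together.

    du_chains maps a definition label to the set of uses it reaches.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for definition, uses in du_chains.items():
        find(definition)                   # register defs that have no uses
        for use in uses:
            union(definition, use)

    webs = {}
    for point in list(parent):
        webs.setdefault(find(point), set()).add(point)
    return sorted(sorted(w) for w in webs.values())

# Hypothetical stores (defs) and loads (uses) of one source variable.
du_chains = {"st1": {"ld1", "ld2"}, "st2": {"ld2"}, "st3": {"ld3"}}
webs = build_webs(du_chains)
print(webs)
```

In this example st1 and st2 both reach ld2, so they fall into one web, while st3 and ld3 form a second web that could be placed in a different memory location.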


Movable boundary problem

  • The motion boundary of a Load/Store instruction may itself be a Load/Store instruction, which can also move

  • Assuming a fixed boundary can therefore cause incorrect merges
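A small numeric illustration of the problem, under simplifying assumptions that are not from the slides: a straight-line block with integer program points, and a store that must stay below the load it depends on.

```python
def earliest_slot(boundary_pos):
    """A store may move up to the slot just below the instruction it depends on."""
    return boundary_pos + 1

# Hypothetical block: LD [a] at position 1, ST [a] at position 5.
ld_a_pos, st_a_pos = 1, 5

# Fixed-boundary assumption: LD [a] stays at position 1.
assumed_range = range(earliest_slot(ld_a_pos), st_a_pos + 1)      # slots 2..5

# Suppose LD [a] is itself moved down to position 3 to merge with another load.
ld_a_new_pos = 3
actual_range = range(earliest_slot(ld_a_new_pos), st_a_pos + 1)   # slots 4..5

merge_point = 2  # where the fixed-boundary model planned to merge the store
legal_under_assumption = merge_point in assumed_range
legal_in_reality = merge_point in actual_range
print(legal_under_assumption, legal_in_reality)
```

The merge looks legal when the boundary is assumed fixed, but becomes illegal once the boundary instruction has moved.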


Motion schedule graph

  • Pseudo fixed-boundary

    • For Store: move as early as possible assuming other instructions are fixed

    • For Load: move as late as possible assuming other instructions are fixed

  • Motion Schedule Graph

    • Nodes represent individual Load/Store instructions

    • Oval encloses Load/Store on the same web

    • Edges link nodes that have overlapping motion ranges (with respect to pseudo fixed-boundaries)
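The three ingredients listed above (nodes, web grouping, overlap edges) can be sketched as a small data structure. The `MemOp` fields and the closed-interval representation of motion ranges are assumptions made for illustration; the sketch encodes only the bullets above and applies no further filtering.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class MemOp:
    name: str  # hypothetical instruction label, e.g. "ld_a"
    web: int   # web id; ops on one web share a memory location
    lo: int    # earliest legal program point (pseudo fixed-boundary)
    hi: int    # latest legal program point

def build_msg(ops):
    """Return (webs, edges): ops grouped by web, and pairs with overlapping ranges."""
    webs = {}
    for op in ops:
        webs.setdefault(op.web, []).append(op.name)
    edges = [
        (a.name, b.name)
        for a, b in combinations(ops, 2)
        if a.lo <= b.hi and b.lo <= a.hi  # closed intervals overlap
    ]
    return webs, edges

ops = [MemOp("ld_a", 0, 0, 3), MemOp("st_b", 1, 2, 5), MemOp("ld_a2", 0, 6, 8)]
webs, edges = build_msg(ops)
print(webs, edges)
```

Only ld_a and st_b have overlapping ranges here, so they are the single merge candidate; ld_a and ld_a2 sit in the same oval (web 0).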


Conflict resolution





Graph solving

  • The whole problem is provably NP-complete (see Appendix A)

  • Two separate problems: Bank Assignment and Edge Picking

  • For predetermined bank assignments, the Edge Picking problem can be optimally solved in polynomial time

  • Heuristic algorithms

    • Brute-force search takes O(|V|^3 · 2^n) time; doable for small programs

    • Simulated annealing (SA) can approach the optimal solution but greatly increases compilation time

    • Use heuristic to solve bank assignment, then get optimal solution for Edge Picking
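The split into Bank Assignment and Edge Picking can be sketched as follows. The sketch assumes that a merged pair must access different banks (X and Y), so once banks are fixed the usable edges form a bipartite graph and a maximum bipartite matching gives an optimal pick. The matching is found with augmenting paths rather than an explicit max-flow solver, and the outer loop is plain exhaustive search rather than the presenters' heuristic. All names are hypothetical.

```python
from itertools import product

def pick_edges(edges, bank):
    """Maximum matching over edges whose endpoints lie in different banks."""
    adj = {}
    for u, v in edges:
        if bank[u] == bank[v]:
            continue                      # same bank: cannot be paired
        x, y = (u, v) if bank[u] == "X" else (v, u)
        adj.setdefault(x, []).append(y)

    match = {}                            # Y-side node -> X-side node

    def augment(x, seen):
        for y in adj.get(x, []):
            if y in seen:
                continue
            seen.add(y)
            if y not in match or augment(match[y], seen):
                match[y] = x
                return True
        return False

    for x in adj:
        augment(x, set())
    return sorted((x, y) for y, x in match.items())

def best_assignment(web_of, edges):
    """Try every X/Y assignment of webs; exponential, for tiny inputs only."""
    webs = sorted(set(web_of.values()))
    best = (None, [])
    for banks in product("XY", repeat=len(webs)):
        web_bank = dict(zip(webs, banks))
        bank = {op: web_bank[w] for op, w in web_of.items()}
        picked = pick_edges(edges, bank)
        if best[0] is None or len(picked) > len(best[1]):
            best = (web_bank, picked)
    return best

web_of = {"ld_a": 0, "ld_b": 1, "st_c": 2, "ld_d": 0}
edges = [("ld_a", "ld_b"), ("ld_b", "st_c"), ("st_c", "ld_d")]
web_bank, picked = best_assignment(web_of, edges)
print(web_bank, picked)
```

Replacing the exhaustive loop in `best_assignment` with a heuristic bank assignment while keeping `pick_edges` exact mirrors the last bullet above.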


Edge Picking as max flow problem


Bank assignment heuristic


Post-pass phases


Cross BB merge (Instr. duplication)

  • Move to predecessor/successor to create new opportunities

  • To guarantee profitability

    • Move to where the reference is live

    • Move ST along an extended basic block (EBB)

    • Move LD along a reverse EBB

    • Ensure the instruction can be merged once pushed into at least one of the live predecessors/successors


Variable duplication


Local conflict elimination

  • Motivation

    • The register allocator may assign the same register to neighboring live ranges, which leads to register conflicts

    • ISA restrictions may require particular registers that are not available at that program point

  • Use rematerialization to free a register for the merge, then reconstruct its value afterwards.


Merge type and MSG properties


Compilation time


Runtime performance


Code size comparison




Conclusions

  • A framework to analyze and merge LD/STs.

  • Our heuristic approach comes close to exhaustive search while taking less compilation time.

  • Variable and instruction duplication widen the motion ranges of instructions, so the generated code is better than that of the exhaustive methods previously proposed.

