DSP architecture

Review of basic computer architecture concepts. C6000 architecture: VLIW principle and scheduling, addressing, assembly and linear assembly, pipelining.

michel

Presentation Transcript


  1. DSP architecture • Review of basic computer architecture concepts • C6000 architecture: VLIW • Principle and Scheduling • Addressing • Assembly and linear assembly • Pipelining

  3. Instruction Set Architecture (ISA) • Computers run programs made of simple operations called “instructions” • The list of instructions offered by the machine is the “instruction set” • The instruction set is what is visible to the programmer (really the compiler, although humans can directly program in “assembly language”) • Many different DSPs can share the same ISA but have different hardware (i.e. the implementation of the ISA is different)

  4. Instructions • Two kinds of information in a computer: • instructions • data • Instructions are stored as numbers, just like data • Instructions and data are stored in the memory

  5. Basic Computer Organization • CPU: a limited number of fast registers for temporary storage, plus the program counter (PC) and the instruction register (IR), which holds an OPCODE and OPERANDS • Instructions are loaded into the IR from the address pointed to by the PC • The PC is incremented by the instruction size (in bytes) for each new instruction, e.g. PC ← PC + 4 • Memory: a large amount of slow memory, arranged as an array of bytes and accessed through load and store

  6. Load/Store Architecture (Reg-Reg) • Instructions can ONLY read their data from, and write their results to, registers • The register numbers are specified in the operand fields of the instruction • Since data is stored in memory, special “load” and “store” instructions are needed for transfers between registers and memory; these two instructions are the ONLY ones allowed to access memory

  7. DSP architecture • Review of basic computer architecture concepts • C6000 architecture: VLIW • Principle and Scheduling • Addressing • Assembly and linear assembly • Pipelining

  8. C6000 Architecture • TMS320C62x/C64x • 16-bit fixed point DSP • TMS320C67x • 32-bit floating point DSP • Instruction set is a superset of that of the C62x • VLIW Architecture • Very Long Instruction Word

  9. VLIW • VLIW is an architecture that exploits instruction level parallelism (ILP) in the code • What is ILP? • Instructions that do not depend on one another can be executed at the same time • An instruction is dependent on another if it uses a value produced by the other instruction, or produces a value used by it

  10. Example • add c,d,e • mult b,e,a • The mult instruction must wait for the add instruction to finish before it can execute, because it reads the value e that the add produces (sequential data flow)

  11. Example • add a,b,e • add c,d,f • add e,f,g • The first two adds have no data dependency and could even be switched in the code with no effect on the correctness of the answer • The first two adds could be executed in parallel if we had the hardware to do it (two adders) • [Figure: data-flow graph in which two adders produce e and f from a, b and c, d, and a third adder combines e and f into g]

  12. Scheduling • Given a set of hardware resources (functional units), e.g. a number of adders, multipliers, etc., the process of determining which instructions can be executed in parallel and which functional units to use on any given clock cycle is called instruction scheduling
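
The three-add example from the earlier slide can be scheduled by a few lines of code. The sketch below is only an illustration in Python, not the algorithm used by any TI tool: the function name `schedule`, the tuple layout, and the assumptions that every instruction is a one-cycle add and that each value is written only once are all made up for this example.

```python
def schedule(instructions, num_adders):
    """Greedy cycle-by-cycle scheduling of one-cycle add instructions.

    instructions: list of (dest, src1, src2) tuples in program order.
    Returns a list of cycles; each cycle is the list of instructions issued in it.
    """
    computed_here = {dest for dest, _, _ in instructions}  # values the program produces
    ready = set()                                          # values finished in earlier cycles
    pending = list(instructions)
    cycles = []
    while pending:
        issued = []
        for ins in pending:
            _, a, b = ins
            # An operand is available if it is a program input or already computed.
            operands_ok = all(s not in computed_here or s in ready for s in (a, b))
            if operands_ok and len(issued) < num_adders:
                issued.append(ins)
        if not issued:
            raise ValueError("no instruction can issue (cyclic dependency or no adders)")
        for ins in issued:
            pending.remove(ins)
        ready.update(dest for dest, _, _ in issued)
        cycles.append(issued)
    return cycles


# add a,b,e / add c,d,f / add e,f,g  (operands written as dest, src1, src2)
program = [("e", "a", "b"), ("f", "c", "d"), ("g", "e", "f")]
for n in (1, 2):
    print(n, "adder(s):", schedule(program, n))
```

With two adders the first two adds share a cycle and the third follows; with a single adder the same program needs three cycles.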

  13. VLIW • VLIW is an architecture that depends on the user (compiler) to do the scheduling • Instructions are packed into a very long instruction word (256 bits) • There is no scheduling hardware on the chip, unlike a Pentium 4, which schedules instructions in hardware (dynamic scheduling) • Benefits • simple hardware • Drawbacks • requires sophisticated compilers • code compatibility: need to recompile if you use a different DSP, even one with the same ISA

  14. C6713 Architecture

  15. Maximum Performance • C6713 • 8 functional units, two MACs per cycle • 8 units × 225 MHz = 1800 MIPS • 6 of the 8 units are floating point • 6 units × 225 MHz = 1350 MFLOPS

  16. DSP architecture • Review of basic computer architecture concepts • C6000 architecture: VLIW • Principle and Scheduling • Addressing • Assembly and linear assembly • Pipelining

  17. Addressing Modes • Load/Store • must load registers from memory, process data, store back to memory • Linear (indirect addressing) • 32 registers A0-A15, B0-B15 can act as pointers • *R: register R contains the address of the memory location where a data value is stored

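  18. Linear Addressing • *R++(d): R contains the address; after R is used, it is postincremented by the displacement d (default is d = 1); -- postdecrements instead • *++R(d): preincrement (or predecrement with --), i.e. R is modified by d before it is used • *+R(d): the address is R + d (R - d with *-R(d)), and R itself is not modified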
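
The difference between the three forms is easiest to see by tracking the pointer value. The following is a toy Python model, not C6000 code: the class `Pointer`, its method names, and the choice to index a list by element rather than by byte address are assumptions made for illustration.

```python
class Pointer:
    """Toy model of a pointer register R over a list of data values."""

    def __init__(self, mem, addr=0):
        self.mem = mem
        self.addr = addr

    def post_inc(self, d=1):
        """*R++(d): use the current address, then add d to R."""
        value = self.mem[self.addr]
        self.addr += d
        return value

    def pre_inc(self, d=1):
        """*++R(d): add d to R first, then use the new address."""
        self.addr += d
        return self.mem[self.addr]

    def offset(self, d=1):
        """*+R(d): use address R + d; R itself is left unchanged."""
        return self.mem[self.addr + d]


mem = [10, 20, 30, 40]
r = Pointer(mem)
print(r.post_inc(), r.addr)   # reads mem[0], then R moves to 1
print(r.pre_inc(), r.addr)    # R moves to 2, then reads mem[2]
print(r.offset(), r.addr)     # reads mem[3], R stays at 2
```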

  19. Circular Addressing

  20. Circular Addressing • Address Mode Register (AMR)

  21. DSP architecture • Review of basic computer architecture concepts • C6000 architecture: VLIW • Principle and Scheduling • Addressing • Assembly and linear assembly • Pipelining

  22. TMS320 Assembly Language • Statement format: [label][:] mnemonic [operand list] [; comment], where [x] means that x is optional • label • symbolic name for the address of the program line • mnemonic • instruction, assembler directive, macro • cannot start in column 1 • operands • constants: binary (e.g. 010101b), decimal, hexadecimal (e.g. 0x9f or 9fh) • register names • symbols defined by assembler directives

  23. Assembler Directives • The assembler produces COFF (Common Object File Format) files • COFF files are divided into sections that contain instructions or data • Assembler directives are instructions to the assembler on how to manipulate these sections or to define constants • they are not machine instructions • see Section 4.1 in the text for more details

  24. C6000 ISA • [Figure: instruction format, with labels for functional unit, conditional execution, and parallel execution]

  25. Instruction Packing • VelociTI: 1 to 8 execute packets in a fetch packet • || marks an instruction that executes in parallel with the instruction before it
        Instruction 1      ; instructions 1 and 2
        Instruction 2      ; are executed sequentially
        Instruction 3      ; instructions 3, 4, and 5
     || Instruction 4      ; are executed in parallel
     || Instruction 5
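
How the || marker groups a listing into execute packets can be sketched in a few lines. This is a simplified Python illustration rather than a description of the real assembler: the function name `execute_packets` is hypothetical, and comments, labels and the eight-instruction fetch-packet limit are ignored.

```python
def execute_packets(lines):
    """Group source lines into execute packets using the || parallel marker."""
    packets = []
    for line in lines:
        text = line.strip()
        parallel = text.startswith("||")
        if parallel:
            text = text[2:].strip()
        if parallel and packets:
            packets[-1].append(text)   # joins the packet of the previous instruction
        else:
            packets.append([text])     # starts a new execute packet
    return packets


listing = [
    "Instruction 1",
    "Instruction 2",
    "Instruction 3",
    "|| Instruction 4",
    "|| Instruction 5",
]
for cycle, packet in enumerate(execute_packets(listing)):
    print(cycle, packet)
```

The five instructions form three execute packets: two with a single instruction each and one with three instructions issued together.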

  26. Sample Instructions
        ADD  .L1  A3,A7,A7    ; add A3+A7 -> A7
        SUB  .S1  A1,1,A1     ; subtract 1 from A1
        MPY  .M2  A7,B7,B6    ; mult 16 LSBs of A7,B7 -> B6
     || MPYH .M1  A7,B7,A6    ; mult 16 MSBs of A7,B7 -> A6
        LDH  .D2  *B2++,B7    ; load (B2) -> B7, inc B2
     || LDH  .D1  *A2++,A7    ; load (A2) -> A7, inc A2

  27. Sample Instructions
Loop    MVKL .S1  x,A4        ; move 16 LSBs of x addr -> A4
        MVKH .S1  x,A4        ; move 16 MSBs of x addr -> A4
        SUB  .S1  A1,1,A1     ; decrement A1
   [A1] B    .S2  Loop        ; branch to Loop if A1 != 0
        NOP  5                ; 5 NOP instructions
        STW  .D1  A3,*A7      ; store A3 into (A7)

  28. Linear Assembly • To effectively program a DSP using assembly language, you need to do the scheduling by hand! • Need to account for the number of clock cycles each functional unit takes, etc. • This is difficult, so TI provides linear assembly • you don’t have to schedule it, the compiler does it for you • can use CPU resources without worrying about scheduling, register allocation, etc.

  29. DSP architecture • Review of basic computer architecture concepts • C6000 architecture: VLIW • Principle and Scheduling • Addressing • Assembly and linear assembly • Pipelining

  30. Pipelining • Key technique to make fast CPUs • Multiple instructions are overlapped in execution • E.g. Automotive assembly line

  31. Pipelining: principle • Building a car takes three steps • body (B): 1 hour • paint (P): 1 hour • wheels (W): 1 hour

  32. Pipelining: principle (II) • One worker (Bob) does every step: 2 cars / 6 hours = 1/3 car / hour
        Time (h)   Bob
        0-1        B1
        1-2        P1
        2-3        W1
        3-4        B2
        4-5        P2
        5-6        W2

  33. Pipelining: principle (III) • Three workers, one per step: 1 car / hour (3 x speedup)
        Time (h)   Bob   Alice   Bill
        0-1        B1
        1-2        B2    P1
        2-3        B3    P2      W1
        3-4        B4    P3      W2
        4-5        B5    P4      W3
        5-6        B6    P5      W4
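
The two timetables reduce to a simple formula for the time at which the last car is finished. The sketch below is an illustrative Python calculation; the function name `finish_time` and its parameters are made up for this example, and it assumes that every stage takes the same time and that nothing ever stalls.

```python
def finish_time(cars, stages, stage_hours=1, pipelined=True):
    """Hours until the last of `cars` cars is finished."""
    if pipelined:
        # The first car needs `stages` steps; each later car finishes one step after it.
        return (stages + cars - 1) * stage_hours
    # One worker does every step of every car in sequence.
    return cars * stages * stage_hours


print(finish_time(2, 3, pipelined=False))   # Bob alone, two cars
print(finish_time(4, 3))                    # three workers, four cars

# For a long production run the speedup approaches the number of stages.
n = 1000
print(finish_time(n, 3, pipelined=False) / finish_time(n, 3))
```

The speedup only approaches 3 because the pipeline has to fill first: the first car still takes three hours.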

  34. Pipelining: principle (IV) • [Figure: combinational logic (COMB. LOGIC) and the resulting cycle time]

  35. Performance Gain • Pipelining a datapath m times can result in up to m times improvement in cycle time • E.g. 5-stage pipelined processor is potentially 5 times faster than an unpipelined processor • In reality, this is limited to less than m because of restrictions in overlapping instructions

  36. 5-Stage RISC Pipeline

  37. 16-Stage C6713 Pipeline • Fetch (4 stages) • calc. address, send address, wait, receive • Decode (2 stages) • separate fetch packets into execute packets • Execute (10 stages) • Different instructions require a different number of cycles to execute

  38. Software and I/O

  39. Software and I/O • Code efficiency and programming techniques • Loop unrolling • Software pipelining • I/O considerations • Interrupts • DMA • Block processing

  41. Code Efficiency • Intrinsic functions • e.g. _add2, _mpy, _sadd • see TMS320C62x/C67x Programmer's Guide • Packed data • use word access to operate on 16-bit data stored in the high and low parts of a 32-bit register
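
The idea of packed data can be illustrated with a small model of a packed 16-bit add. This Python function is a sketch of the behaviour I understand _add2 to have (two independent 16-bit additions inside one 32-bit word, with no carry from the low half into the high half); the name `add2` and the model itself are assumptions, and the Programmer's Guide remains the authority on the real intrinsic.

```python
def add2(a, b):
    """Add the low halves and the high halves of two 32-bit words separately."""
    lo = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF                  # carry out of the low half is dropped
    hi = (((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF)) & 0xFFFF  # high halves added on their own
    return (hi << 16) | lo


# Two 16-bit samples per register: (3, 4) + (1, 2)
print(hex(add2(0x00030004, 0x00010002)))

# The low halves overflow here, but the high half is not disturbed.
print(hex(add2(0x0001FFFF, 0x00010001)))
```

One packed operation therefore does the work of two 16-bit adds, which is why storing 16-bit data in both halves of a register helps code efficiency.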

  42. Loop Unrolling • A loop is a compact way of representing a repetitive sequence of instructions, but… • The loop condition test is overhead • To remove the loop overhead, unroll the loop (make copies of the loop code) • key way of exposing parallelism !!! • The compiler can now look across loop iterations to find parallel instructions • parallelism increased, but so is code size

  43. Example
      ; program A: code without unrolling
            MVK  4,B0
      loop: LDH  *A5++,A0
         || LDH  *A6++,A1
            ADD  A0,A1,A2       ; add 4 times
            …
            SUB  B0,1,B0
       [B0] B    loop

  44. Example
      ; program B: code with unrolling once
            MVK  2,B0
      loop: LDH  *A5++,A0
         || LDH  *A6++,A1       ; add first 2 numbers
            ADD  A0,A1,A2
            …
            LDH  *A5++,A0
         || LDH  *A6++,A1       ; add other 2 numbers
            ADD  A0,A1,A2
            …
            SUB  B0,1,B0
       [B0] B    loop
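
The same transformation can be written in a high-level language to count how much loop overhead it removes. The two Python functions below are an illustrative analogue of programs A and B, not a translation of them: the function names and the `branches` counter are invented for this sketch, and the unrolled version assumes an even number of elements.

```python
def add_rolled(x, y):
    """Program A analogue: one addition per loop iteration."""
    sums, branches = [], 0
    i, count = 0, len(x)
    while count:
        sums.append(x[i] + y[i]); i += 1
        count -= 1
        branches += 1        # one decrement-and-branch per iteration
    return sums, branches


def add_unrolled_by_2(x, y):
    """Program B analogue: the loop body is copied once, so the loop runs half as often."""
    sums, branches = [], 0
    i, count = 0, len(x) // 2
    while count:
        sums.append(x[i] + y[i]); i += 1   # first copy of the body
        sums.append(x[i] + y[i]); i += 1   # second copy of the body
        count -= 1
        branches += 1
    return sums, branches


x, y = [1, 2, 3, 4], [10, 20, 30, 40]
print(add_rolled(x, y))
print(add_unrolled_by_2(x, y))
```

Both versions produce the same sums, but the unrolled loop executes half as many loop tests, and its two body copies are independent of each other, which is what gives a scheduler room to run them in parallel.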

  45. Software and I/O • Code efficiency and programming techniques • Loop unrolling • Software pipelining • I/O considerations • Interrupts • DMA • Block processing

  46. Software Pipelining • Software pipelining • compiler technique (don’t confuse with h/w pipelining) • Schedule multiple iterations of a loop together to fill any empty cycles and maximize functional unit usage • compiler optimization levels -O2 and -O3

  47. Software Pipelining • The general idea of this optimization is to uncover long sequences of statements without branch statements • Reorganize loops to interleave instructions from different iterations • Dependent instructions within a single loop iteration are then separated from one another by an entire loop body • Increases possibilities of scheduling
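
Interleaving iterations can be pictured as a timetable, much like the assembly-line example. The Python sketch below is only an illustration of the idea and not what the TI compiler actually emits: the function name `software_pipeline`, the three stage names, and the assumption that each stage takes one cycle on its own functional unit are all made up.

```python
def software_pipeline(n_iterations, stages=("load", "add", "store")):
    """Return, cycle by cycle, which stage of which loop iteration is issued."""
    depth = len(stages)
    schedule = []
    for cycle in range(n_iterations + depth - 1):
        slot = []
        for s, name in enumerate(stages):
            iteration = cycle - s            # stage s of an iteration runs s cycles after its start
            if 0 <= iteration < n_iterations:
                slot.append(f"{name}{iteration}")
        schedule.append(slot)
    return schedule


for cycle, slot in enumerate(software_pipeline(4)):
    print(cycle, slot)
```

In the middle cycles the load of one iteration, the add of the previous one and the store of the one before that are issued together, so every unit stays busy; the partly filled cycles at the start and end correspond to filling and draining the pipeline.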

  48. Software Pipelining

  49. Software Pipelining • Advantage: yields shorter code than loop unrolling and uses fewer registers • Software pipelining is crucial for VLIW processors • Often, both software pipelining and loop unrolling are used

  50. Software and I/O • Code efficiency and programming techniques • Loop unrolling • Software pipelining • I/O considerations • Interrupts • DMA • Block processing
