
MODULE 2 - PowerPoint PPT Presentation

MODULE 2. Syllabus: fixed- and floating-point formats, code improvement, constraints, the TMS320C64x CPU, and simple programming examples using C/assembly. Fixed-point arithmetic gives a fast and inexpensive implementation, but it is limited in the range of numbers and susceptible to problems of overflow.


PowerPoint Slideshow about ' MODULE 2' - vega

Presentation Transcript

• Fixed and floating point formats

• code improvement

• Constraints

• TMS 320C64x CPU

• simple programming examples using C/assembly.

Fixed-point processing:

• Fast and inexpensive implementation

• Limited in the range of numbers

• Susceptible to problems of overflow

In a fixed-point processor, numbers are represented in integer format.

Fixed-point numbers and their data types are characterized by:

• word size in bits

• position of the binary point

• whether they are signed or unsigned

Fixed point numbers

• The dynamic range of an N-bit number based on 2's-complement representation is between -2^(N-1) and 2^(N-1) - 1, or between -32,768 and 32,767 for a 16-bit system.

• By normalizing the dynamic range between -1 and 1, the range is divided into 2^N sections, each of size 2^-(N-1), starting at -1 and going up to 1 - 2^-(N-1).

• For a 4-bit system, there would be 16 sections, each of size 1/8, from -1 to 7/8 .
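The range and normalization rules above can be checked with a few lines of C. This is only a minimal sketch: the helper names (twos_min, twos_max, to_fraction) are made up for illustration, and the code assumes word sizes between 1 and 32 bits.

```c
#include <stdint.h>

/* Smallest N-bit two's-complement value: -(2^(N-1)). Assumes 1 <= bits <= 32. */
int64_t twos_min(int bits)
{
    return -((int64_t)1 << (bits - 1));
}

/* Largest N-bit two's-complement value: 2^(N-1) - 1. */
int64_t twos_max(int bits)
{
    return ((int64_t)1 << (bits - 1)) - 1;
}

/* Interpret an N-bit signed integer as a fraction in [-1, 1 - 2^-(N-1)]
   by dividing by 2^(N-1), the normalization described in the text. */
double to_fraction(int64_t value, int bits)
{
    return (double)value / (double)((int64_t)1 << (bits - 1));
}
```

For a 4-bit word, to_fraction maps the integers -8..7 onto -1, -7/8, ..., 7/8, which are the 16 sections of size 1/8 mentioned above.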

For a 16-bit word:

• Unsigned integer: the stored number can take on any integer value from 0 to 65,535.

• Signed integer: uses two's complement to allow negative numbers; it ranges from -32,768 to 32,767.

• Unsigned fraction: 65,536 levels spread uniformly between 0 and 1.

• Signed fraction: allows negative numbers, with levels equally spaced between -1 and 1.

Example on the 4-bit number wheel: 6 + (-2) = 4

• The 4-bit unsigned numbers represent a modulo (mod) 16 system.

• If 1 is added to the largest number (15), the operation wraps around to give 0 as the answer.

• A number wheel graphically demonstrates the addition properties of a finite bit system. To add x and y:

1. Find the first number x on the wheel.

2. Step off y units in the clockwise direction, which brings you to the answer.
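The number-wheel procedure is ordinary addition modulo 16. A minimal C sketch follows; the function names wheel_add4 and as_signed4 are hypothetical helpers, not part of any TI library.

```c
/* Add two 4-bit values on the number wheel: the sum wraps modulo 16. */
unsigned wheel_add4(unsigned x, unsigned y)
{
    return (x + y) & 0xFu;
}

/* Reinterpret a 4-bit pattern as a two's-complement value in -8..7. */
int as_signed4(unsigned pattern)
{
    pattern &= 0xFu;
    return (pattern & 0x8u) ? (int)pattern - 16 : (int)pattern;
}
```

Stepping 1 past 15 gives wheel_add4(15, 1) == 0. Because -2 is stored as the pattern 1110 (14), the signed sum 6 + (-2) is the same wheel operation as 6 + 14, which lands on 4.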

Carry and Overflow

• Carry applies to unsigned numbers: when adding or subtracting, a carry means the result is incorrect.

• Overflow applies to signed numbers: when adding or subtracting, an overflow means the result is incorrect.

Examples:

Overflow (signed): 01111 + 00111 = 10110 (the sign bit of the result is set although both operands are positive)

Carry (unsigned): 100 + 111 = 1011 (a carry out of the most significant bit)
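The carry and overflow conditions described above can be detected in software. The sketch below assumes word sizes of fewer than 32 bits held in an unsigned int; the function names are illustrative only.

```c
#include <stdbool.h>

/* Unsigned add of two n-bit values: a carry occurs when the true sum
   does not fit in n bits. Assumes 1 <= bits < 32. */
bool add_carry(unsigned a, unsigned b, int bits)
{
    unsigned mask = (1u << bits) - 1u;
    return ((a & mask) + (b & mask)) > mask;
}

/* Two's-complement add of two n-bit patterns: overflow occurs when both
   operands have the same sign bit and the n-bit sum has the other one. */
bool add_overflow(unsigned a, unsigned b, int bits)
{
    unsigned mask = (1u << bits) - 1u;
    unsigned sign = 1u << (bits - 1);
    unsigned sum = (a + b) & mask;
    return (~(a ^ b) & (a ^ sum) & sign) != 0;
}
```

With these helpers, 01111 + 00111 in 5 bits reports an overflow but no carry, while 100 + 111 in 3 bits reports a carry.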

Data types

A fractional fixed-point number that has values between +0.99... and -1 can be used.

1. short:

16 bits, represented as 2's complement, with a range from -2^15 to (2^15 - 1)

2. int or signed int:

32 bits, represented as 2's complement, with a range from -2^31 to (2^31 - 1)

3. float:

32 bits, represented as IEEE 32-bit, with a range from 2^-126 (1.175494 x 10^-38) to 2^+128 (3.40282346 x 10^38)

4. double:

64 bits, represented as IEEE 64-bit, with a range from 2^-1022 (2.22507385 x 10^-308) to 2^1024 (1.79769313 x 10^308)

The advantage of floating-point over fixed-point representation is that it can support a much wider range of values.

The floating-point format needs slightly more storage.

The speed of floating-point operations is measured in FLOPS.

Floating-point representation

General format of a floating-point number:

X = M · b^e

where M is the value of the significand (mantissa),

b is the base

e is the exponent.

Mantissa determines the accuracy of the number

Exponent determines the range of numbers that can be represented

Floating-point numbers can be represented as:

Single precision :

• called "float" in the C language family

• it is a binary format that occupies 32 bits

• its significand has a precision of 24 bits

Double precision :

• called "double" in the C language family

• it is a binary format that occupies 64 bits

• its significand has a precision of 53 bits

[Figure: single-precision bit layout: S (bit 31), e (bits 30-23), f (bits 22-0)]

Single Precision (SP):

Bit 31 is the sign bit

Bits 23 to 30 are the exponent bits

Bits 0 to 22 are the fractional bits

Numbers as small as 10^-38 and as large as 10^38 can be represented
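The single-precision fields described above can be pulled out of a float with shifts and masks. This sketch assumes the host stores float as a 32-bit IEEE 754 value (true on common platforms); the sp_fields type and sp_decode function are illustrative names.

```c
#include <stdint.h>
#include <string.h>

/* The three fields of a single-precision number. */
typedef struct {
    uint32_t sign;      /* bit 31 */
    uint32_t exponent;  /* bits 30-23, biased by 127 */
    uint32_t fraction;  /* bits 22-0 */
} sp_fields;

/* Split a float into its sign, exponent and fraction fields.
   Assumes float is a 32-bit IEEE 754 single-precision value. */
sp_fields sp_decode(float x)
{
    uint32_t bits;
    sp_fields f;
    memcpy(&bits, &x, sizeof bits);   /* copy the raw bit pattern */
    f.sign = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```

For 1.0f the exponent field holds the bias 127 and the fraction is zero; for -2.5f (that is, -1.25 x 2^1) the sign is 1, the exponent field is 128 and the fraction is 0x200000.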

[Figure: double-precision bit layout across a register pair. Second register: s (bit 31), e (bits 30-20), f (bits 19-0). First register: f (bits 31-0).]

Double precision (DP) :

• With 64 bits, more exponent and fractional bits are available.

• A pair of registers is used:

Bits 0 to 31 of the first register are fractional bits

Bits 0 to 19 of the second register are also fractional bits

Bits 20 to 30 of the second register are the exponent bits

Bit 31 of the second register is the sign bit

Numbers as small as 10^-308 and as large as 10^+308 can be represented

• Instructions ending in SP or DP operate on single- and double-precision values, respectively.

• Some floating-point instructions have longer latencies than fixed-point instructions.

E.g.: MPY requires one delay slot

MPYSP requires three delay slots

MPYDP requires nine delay slots

• A single-precision floating-point value can be loaded into a single register, whereas double-precision values need a pair of registers:

A1:A0, A3:A2, ... B1:B0, B3:B2, ...

• C6711 processor has a single precision reciprocal instruction RCPSP for performing division
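A reciprocal instruction such as RCPSP returns an approximation of 1/x rather than a fully accurate result, so division is usually finished with a few Newton-Raphson steps. The sketch below shows that refinement in portable C. It does not call RCPSP: the seed argument merely stands in for the instruction's result, and all function names are hypothetical.

```c
/* One Newton-Raphson step toward y = 1/v: y_next = y * (2 - v * y).
   Each step roughly doubles the number of correct bits. */
float recip_step(float v, float y)
{
    return y * (2.0f - v * y);
}

/* Refine a rough reciprocal estimate. On the DSP the seed would come
   from the reciprocal instruction; here the caller supplies it. */
float recip_refine(float v, float seed, int steps)
{
    float y = seed;
    int i;
    for (i = 0; i < steps; i++) {
        y = recip_step(v, y);
    }
    return y;
}

/* Compute a / b as a * (1 / b) using the refined reciprocal. */
float div_by_recip(float a, float b, float seed)
{
    return a * recip_refine(b, seed, 3);
}
```

Starting from the crude seed 0.24 for 1/4, three steps already agree with 0.25 to within single-precision rounding.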

Code Optimization

Code optimization is used to drastically reduce the execution time of the code.

There are several techniques: (i) use instructions in parallel, (ii) word-wide data, (iii) intrinsic functions, (iv) software pipelining.

Optimized assembly (ASM) code runs faster than C and requires less memory space.

Comparison of Programming Techniques

Source | Technique | Efficiency* | Effort
C, C++ | Optimising Compiler | 80 - 100% | Low
Linear ASM | Assembly Optimiser | 95 - 100% | Med
ASM | Hand Optimised | 100% | High

* Typical efficiency vs. hand optimized assembly.

Linear Assembly

• The resulting assembly-coded program produced by the assembler optimizer is typically more efficient than one resulting from the C compiler optimizer.

• Linear assembly code programming provides a compromise between coding effort and coding efficiency.

Optimization Steps

1. Program in C. Build your project without optimization.

2. Use intrinsic functions when appropriate, as well as the various optimization levels.

3. Use the profiler to identify the functions that may need to be further optimized. Then convert these functions to linear ASM.

4. Optimize code in ASM.

Profiler

• The profiler analyzes program execution and shows you where your program is spending its time.

• A profile analysis can report how many cycles a particular function takes to execute and how often it is called.

• Profiling helps you to direct valuable development time toward optimizing the sections of code that most dramatically affect program performance.

Compiler options:

A C-coded program is first passed through a parser that performs preprocessing functions and generates an intermediate file (.if), which becomes the input to an optimizer.

The optimizer generates an .opt file, which becomes the input to a code generator that performs further optimization and generates the ASM file.

C code -> Parser -> .if -> Optimizer -> .opt -> Code generator -> ASM

The options for optimization levels:

1. -O0 optimizes the use of registers.

2. -O1 performs a local optimization in addition to the optimization done by -O0.

3. -O2 performs global optimization in addition to the optimizations done by -O0 and -O1.

4. -O3 performs file optimization in addition to the optimizations done by -O0, -O1 and -O2.

-O2 and -O3 attempt to do software optimizations.

Intrinsic C functions:

• Similar to run-time support library functions

• C intrinsic functions are used to increase the efficiency of code.

1. int _mpy() has an equivalent ASM instruction MPY, which multiplies the 16 LSBs of a number by the 16 LSBs of another number.

2. int _mpyh() has an equivalent ASM instruction MPYH, which multiplies the 16 MSBs of a number by the 16 MSBs of another number.

3. int _mpylh() has an equivalent ASM instruction MPYLH, which multiplies the 16 LSBs of a number by the 16 MSBs of another.

4. int _mpyhl() has an equivalent ASM instruction MPYHL, which multiplies the 16 MSBs of a number by the 16 LSBs of another.

5. void _nassert(int) generates no code. It tells the compiler that the expression declared with the assert function is true.

6. uint _lo(double) and uint _hi(double) obtain the low and high 32 bits of a double word.
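To make the four multiply variants concrete, the sketch below emulates them in portable C. These are not the TI intrinsics themselves: the emu_ names are hypothetical stand-ins, useful only for checking on a host machine which 16-bit halves each variant combines.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of v to a 32-bit signed value. */
int32_t sext16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 65536 : (int32_t)v;
}

/* Signed 16-bit lower and upper halves of a 32-bit word. */
int32_t lo16(int32_t x) { return sext16((uint32_t)x); }
int32_t hi16(int32_t x) { return sext16((uint32_t)x >> 16); }

/* Host-side emulations of the behaviour described for each intrinsic. */
int32_t emu_mpy(int32_t a, int32_t b)   { return lo16(a) * lo16(b); } /* LSBs x LSBs */
int32_t emu_mpyh(int32_t a, int32_t b)  { return hi16(a) * hi16(b); } /* MSBs x MSBs */
int32_t emu_mpylh(int32_t a, int32_t b) { return lo16(a) * hi16(b); } /* LSBs x MSBs */
int32_t emu_mpyhl(int32_t a, int32_t b) { return hi16(a) * lo16(b); } /* MSBs x LSBs */
```

With a = 0x00030002 (halves 3 and 2) and b = 0x00050004 (halves 5 and 4), the four variants return 8, 15, 10 and 12. This is why two 16-bit samples packed into one 32-bit word can feed two multiplies per load.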

Trip directive for loop count:

The linear assembly directive .trip is used to specify the number of times a loop iterates. If the exact number is known and used, redundant loops are not generated, which can improve both code size and execution time.

Cross-Paths

• Data and address cross-path instructions are used to increase code efficiency.

• MPY .M1x A2,B2,A4

• MPY .M2x A2,B2,B4

Software pipelining

• Software pipelining is a scheme that uses the available resources to obtain efficient pipelined code.

• The aim is to use all eight functional units within one cycle.

There are three stages:

1. prolog (warm-up)- This stage contains instructions needed to build up the loop kernel cycle.

2. Loop kernel (cycle)- within this loop, all instructions are executed in parallel.

Entire loop is executed in one cycle.

3. Epilog (cool-off)- This stage contains the instructions necessary to complete all iterations

Procedure for software pipelining:

1. Draw the dependency graph

2. Set up a scheduling table

3. Obtain code from the scheduling table.

Dependency graph (procedure):

1. Draw the nodes and paths

2. Write the number of cycles to complete an instruction

3. Assign functional units associated with each code

4. Separate the data paths, so that the maximum number of units are utilized.

Dependency graph

• A node has one or more data paths going in and/or out of the node.

• The numbers next to each node represent the number of cycles required to complete the associated instruction.

• A parent node contains an instruction that writes to a variable; whereas a child node contains an instruction that reads a variable written by the parent.

• LDH -> parent of the MPY node.

• MPY -> parent of the ADD node.

• The ADD instruction is fed back as input for the next iteration; similarly with the SUB instruction.

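Dependency graph (e.g., two sums of products):

[Figure: dependency graph with Side A and Side B. LDW on .D1 and .D2 loads ai and bi (5 cycles); MPY and MPYH on .M1x and .M2x form the low and high products (2 cycles); the adds on .L1 and .L2 accumulate sum l and sum h (1 cycle); SUB updates the count and B branches to the loop on the .S units (1 cycle).]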

Scheduling table:

1. LDW starts in cycle 1

2. MPY and MPYH must start five cycles after LDW, due to four delay slots.

Therefore MPY/MPYH starts at cycle 6.

3. ADD must start two cycles after MPY/MPYH due to one delay slot of MPY/MPYH.

Therefore ADD starts in cycle 8.

4. B has 5 delay slots and starts in cycle 3, since branching occurs in cycle 9, after ADD instructions.

5. SUB instruction must start one cycle before branch instruction, since the loop count is decremented before branching occurs.

Therefore SUB starts in cycle 2.
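The start cycles in the steps above follow from one rule: an instruction can use a result one cycle after the producer's delay slots have elapsed. A minimal C sketch of that arithmetic follows; the constants come from the text, and the function names are illustrative.

```c
/* Delay slots quoted in the text. */
enum { LDW_DELAY = 4, MPY_DELAY = 1, B_DELAY = 5 };

/* First cycle in which a consumer can start, given the cycle in which
   the producer started and the producer's delay slots. */
int earliest_start(int producer_start, int delay_slots)
{
    return producer_start + delay_slots + 1;
}

/* Cycle in which a branch must be issued so that the jump takes
   effect at target_cycle. */
int branch_issue(int target_cycle, int delay_slots)
{
    return target_cycle - delay_slots - 1;
}
```

Starting LDW in cycle 1 gives MPY/MPYH in cycle 6 and ADD in cycle 8. A branch that must take effect in cycle 9 is issued in cycle 3, and SUB goes one cycle earlier, in cycle 2.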

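Schedule for a single iteration (units vs. cycles; ADD placed in cycle 8 as stated in step 3):

Units | 1,9,17.. | 2,10,18.. | 3,11,.. | 4,12,.. | 5,13,.. | 6,14,.. | 7,15,.. | 8,16,..
.D1 | LDW | - | - | - | - | - | - | -
.D2 | LDW | - | - | - | - | - | - | -
.M1 | - | - | - | - | - | MPY | - | -
.M2 | - | - | - | - | - | MPYH | - | -
.L1 | - | - | - | - | - | - | - | ADD
.L2 | - | - | - | - | - | - | - | ADD
.S1 | - | SUB | - | - | - | - | - | -
.S2 | - | - | B | - | - | - | - | -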

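Schedule after software pipelining (units vs. cycles; ADD placed in cycle 8 as stated in step 3):

Units | 1,9,17.. | 2,10,18.. | 3,11,.. | 4,12,.. | 5,13,.. | 6,14,.. | 7,15,.. | 8,16,..
.D1 | LDW | LDW | LDW | LDW | LDW | LDW | LDW | LDW
.D2 | LDW | LDW | LDW | LDW | LDW | LDW | LDW | LDW
.M1 | - | - | - | - | - | MPY | MPY | MPY
.M2 | - | - | - | - | - | MPYH | MPYH | MPYH
.L1 | - | - | - | - | - | - | - | ADD
.L2 | - | - | - | - | - | - | - | ADD
.S1 | - | SUB | SUB | SUB | SUB | SUB | SUB | SUB
.S2 | - | - | B | B | B | B | B | B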

Loop Kernel

• Within the loop kernel (cycle 8), multiple iterations of the loop execute in parallel, i.e., different iterations are processed at the same time.

MPY/MPYH multiply data for iteration 3

LDW loads data for iteration 8

SUB decrements the counter for iteration 7

B branches for iteration 6

• That is, the values being multiplied are loaded into registers 5 cycles before the cycle in which they are actually multiplied. Before the first multiplication occurs, the fifth load has just completed.

• This software pipelining is 8 iterations deep.

• If the loop count is 100 (200 numbers):

Cycle 1: LDW, LDW (also initialization of count and accumulators A7 and B7)

Cycle 2: LDW, LDW, SUB

Cycle 3-5: LDW, LDW, SUB, B

Cycle 6-7: LDW, LDW, MPY, MPYH, SUB, B

• Prolog section is within cycles 1-7

• Loop kernel is in cycle 8

• Epilog section is in cycle 108.

Execution Cycles:

Number of cycles (with software pipelining):

Fixed point = 7 + (N/2) + 1

e.g.: N = 200; 7 + 100 + 1 = 108

Floating point = 9 + (N/2) + 15

Technique | Fixed Point | Floating Point
No optimization | 2 + (16 x 200) = 3202 | 2 + (18 x 200) = 3602
With parallel instructions | 1 + (8 x 200) = 1601 | 1 + (10 x 200) = 2001
Two sums per iteration | 1 + (8 x 100) = 801 | 1 + (10 x 100) + 7 = 1008
With S/W pipelining | 7 + (200/2) + 1 = 108 | 9 + (200/2) + 15 = 124
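The cycle counts above can be reproduced for other data sizes with a few small functions. This sketch simply encodes the formulas as they appear in the text, with n the number of values (assumed even); the function names are made up for illustration.

```c
/* Cycle-count formulas from the text; n is the number of values
   multiplied (assumed even, e.g. n = 200). */
int fixed_no_opt(int n)    { return 2 + 16 * n; }
int fixed_parallel(int n)  { return 1 + 8 * n; }
int fixed_two_sums(int n)  { return 1 + 8 * (n / 2); }
int fixed_pipelined(int n) { return 7 + n / 2 + 1; }

int float_no_opt(int n)    { return 2 + 18 * n; }
int float_parallel(int n)  { return 1 + 10 * n; }
int float_two_sums(int n)  { return 1 + 10 * (n / 2) + 7; }
int float_pipelined(int n) { return 9 + n / 2 + 15; }
```

For n = 200 the fixed-point count drops from 3202 cycles with no optimization to 108 with software pipelining, roughly a 30-fold reduction.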

Memory Constraints:

• Internal memory is arranged in several banks so that loads and stores can occur simultaneously.

• Since the banks are single ported, only one access to each bank can be performed per cycle.

• Two memory accesses per cycle can be performed if they do not access the same bank.

• If multiple accesses are made to the same bank, the pipeline will stall.

Cross Path Constraints:

• Since there is one cross path on each side of the two data paths, at most two instructions per cycle can use a cross path.

e.g.: Valid code segment (because both available cross paths are utilized)

|| MPY .M2X A2, B2, B3

e.g.: Not valid (because one cross path is used for both instructions)

|| MPY .M1X A2, B2, A3

Load/store constraints:

• The address register to be used must be on the same side as the .D unit.

e.g.: Valid code:

LDW .D1 *A1, A2

|| LDW .D2 *B1, B2

e.g.: Invalid code:

LDW .D1 *A1, A2

|| LDW .D2 *A3, B2

• A load and a store issued in parallel cannot both use the same register file for their data.

e.g.: Valid code:

LDW .D1 *A0, B1

|| STW .D2 A1, *B2

e.g.: Invalid code:

LDW .D1 *A0, A1

|| STW .D2 A2, *B2

Pipelining Effects with More Than One EP within an FP

TMS320C64x forces the pipeline to stall so that EP2 and EP3, within FP1, can each start its dispatching phase in cycles 6 and 7, respectively

• TMS320C64x is a family of fixed-point Very Long Instruction Word (VLIW) DSPs from Texas Instruments

• At clock rates of up to 1 GHz, C64x DSPs can process information at rates up to 8000 MIPS

• C64x DSPs can do more work each cycle with built-in extensions.

• They can run all C62x object code unmodified (but not vice versa)

Applications for the C64x

• TMS320C64x can be used as a CPU in the following devices:

• Wireless local base stations;

• Remote access server (RAS);

• Digital subscriber loop (DSL) systems;

• Cable modems;

• Multichannel telephony systems;

• Pooled modems;

New extensions

• Register file enhancements

• Data path extensions

• Packed data processing

• Increased orthogonality

Register file enhancements

• The ’C64x register file has twice as many general-purpose registers as the ’C62x/’C67x cores

• There are 32 32-bit registers per data path A0-A31 for file A and B0-B31 for file B

• A0 may also be used as a condition register bringing the total to six condition registers.

• In all ’C6000 devices, registers A4-A7 and B4-B7 can be used for circular addressing.

Packed data processing

• The ’C64x register file supports all the ’C62x data types and extends this by additionally supporting packed 8-bit types and 64-bit fixed-point data types.

• Packed data types store either four 8-bit values or two 16-bit values in a single 32-bit register or four 16-bit values in a 64-bit register pair.

• Besides being able to perform all the ’C62x instructions, the ’C64x also contains many 8–bit and 16–bit extensions to the instruction set.

Eg: MPYU4 instruction performs four 8x8 unsigned multiplies with a single instruction on a .M unit.
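The idea of packed processing can be illustrated on a host machine by emulating a four-lane 8x8 unsigned multiply in portable C. This is a sketch only: it is not the MPYU4 instruction or a TI intrinsic, the function names are hypothetical, and the placement of the four products in the 64-bit result (lane 0 in the lowest 16 bits) is an assumption made for the example.

```c
#include <stdint.h>

/* Extract byte lane 0..3 from a packed word (lane 0 = least significant). */
uint32_t byte_lane(uint32_t packed, int lane)
{
    return (packed >> (8 * lane)) & 0xFFu;
}

/* Multiply the four unsigned byte lanes of a and b pairwise and pack the
   four 16-bit products into 64 bits, lane 0 in bits 15..0. */
uint64_t mpyu4_emulated(uint32_t a, uint32_t b)
{
    uint64_t result = 0;
    int lane;
    for (lane = 0; lane < 4; lane++) {
        uint64_t product = (uint64_t)(byte_lane(a, lane) * byte_lane(b, lane));
        result |= product << (16 * lane);
    }
    return result;
}
```

With a = 0x04030201 and b = 0x08070605 the lanes multiply to 5, 12, 21 and 32. The hardware instruction does this work in a single issue slot, whereas the emulation needs a loop.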

Data path extensions

• On the ’C64x, all eight of the functional units have access to the register file on the opposite side via a cross path.

• On the ’C62x/’C67x, only six functional units have access to the register file on the opposite side via a cross path; the .D units do not have a data cross path.

• The ’C64x pipelines data cross path accesses allowing multiple units per side to read the same cross path source simultaneously.

• In ’C62x/’C67x, only one functional unit per data path per execute packet could get an operand from the opposite register file.

Additional Functional Unit Hardware

• The .L units can perform byte shifts and the .M units can perform bi-directional variable shifts, in addition to the .S unit’s ability to do shifts.

• Bit-count and rotate hardware on the .M unit extends support for bit-level algorithms such as binary morphology, image metric calculations and encryption algorithms.

Increased Orthogonality

• The .D unit can now perform 32-bit logical instructions in addition to the .S and .L units.

• Also, the .D unit now directly supports load and store instructions for double-word data values

Block diagram

[Figure: CPU core with an L1 program cache (direct-mapped, 16 Kbytes total), an L1 data cache (2-way set-associative, 16 Kbytes total), L2 memory (1024 Kbytes), an enhanced DMA controller (64-channel), and EMIF A and EMIF B connecting to SDRAM, SBSRAM, ZBT RAM, FIFO, SRAM and I/O devices.]

C64x CPU

Architecture Overview

• 2 (almost) identical fixed-point data paths that each contain

• 1 ALU (the .L unit)

• 1 shifter (the .S unit)

• 1 multiplier (the .M unit)

• 1 register file containing thirty-two 32-bit registers

General-Purpose Register Files

• The C64x register file contains 32 32-bit registers (A0-A31 for file A and B0-B31 for file B);

• can be used for data, pointers or conditions

• Values larger than 32 bits (40-bit long and 64-bit float quantities) are stored in register pairs.

• Packed data types are: four 8-bit values or two 16-bit values in a single 32-bit register, four 16-bit values in a 64-bit register pair.

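[Figure: a 40-bit value in a register pair: bits 31-0 in the even register, bits 39-32 in the odd register, the remaining bits of the odd register zero filled]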

Delay Slots

• Delay slots mean “how many CPU cycles come between the current instruction and when the results of the instruction can be used by another instruction”

• Single Cycle Instructions: 0 delay slots

• 16x16 Single Multiply and .M Unit non-multiply Instructions: 1 delay slot

• Store: 0 delay slots

• If a load occurs before a store (either in parallel or not), then the old data is loaded from memory before the new data is stored.

• If a load occurs after a store, (either in parallel or not), then the new data is stored before the data is loaded.

• C64x Multiply Extensions: 3 delay slots

• Branch: 5 delay slots

• The branch target is in the PG slot when the branch condition is determined in E1. There are 5 slots between PG and E1 when the branch target begins executing useful code again.

Memory

• The C64x has different spaces for program and data memory;

• Uses two-level cache memory scheme;

Internal Memory

• The C64x has a 32-bit byte-addressable memory with the following features:

• Separate data and program address spaces;

• Large on chip RAM, up to 7MB;

• 2-level cache;

• Single internal program memory port with an instruction-fetch bandwidth of 256 bits;

• Two 64-bit internal data memory ports;

Memory Map (Internal and External Memory)

• Level 1 Program Cache is 128 Kbit direct mapped

• Level 1 Data cache is 128Kbit 2-way set-associative

• Shared Level 2 Program/Data Memory/Cache of 4Mbit

• Can be configured as mapped memory

• Cache (up to 256 Kbytes)‏

• Combination of the two

Memory Buses

• Instruction fetch using a 32-bit address bus and a 256-bit data bus

• Two 64-bit load buses (LD1 and LD2)

• Two 64-bit store buses (ST1 and ST2)

Interrupts

• 16 prioritized interrupts: INT_00 to INT_15

• INT_00 has the highest priority and is dedicated to RESET. This halts the CPU and returns it to a known state

• The first four interrupts (INT_00 – INT_03) are fixed and non maskable

• INT_01 – INT_03 are generally used to alert the CPU of an impending hardware problem, such as an imminent power failure

• The remaining interrupts are maskable and can be programmed

Interrupt Performance Consideration

• Overhead for all CPU interrupts is 7 cycles

• Interrupt latency is 11 cycles

• Interrupts can be recognized every 2 cycles

• 2 occurrences of a specific interrupt can be recognized in 2 cycles

Peripheral Set

• 2 multichannel buffered audio serial ports

• 2 inter-integrated circuit bus modules (I2Cs)‏

• 3 multichannel buffered serial ports (McBSPs)‏

• 3 32-bit general-purpose timers

• 1 user-configurable 16-bit or 32-bit host-port interface (HPI16/HPI32)‏

• 1 16-pin general-purpose input/output port (GP0) with programmable interrupt/event generation modes

• 1 32-bit glueless external memory interface (EMIFA), capable of interfacing to synchronous and asynchronous memories and peripherals.

ZBT RAM

• Zero Bus Turnaround (ZBT) is a synchronous SRAM architecture optimized for networking and telecommunications applications.

• It can increase the internal bandwidth of a switch fabric when compared to standard SyncBurst SRAM.

• The ZBT architecture is optimized for switching and other applications with highly random READs and WRITEs.

• ZBT SRAMs eliminate all idle cycles when turning the data bus around from a WRITE operation to a READ operation

Packaging - Top View

Packaging - Bottom View

Sum of products example

C code:

/* Sum of products of two arrays of 16-bit values. */
int DotP(short* m, short* n, int count) {
    int i, product, sum = 0;
    for (i = 0; i < count; i++)
    {
        product = m[i] * n[i];   /* 16 x 16 multiply */
        sum += product;          /* accumulate */
    }
    return(sum);
}
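The "two sums per iteration" idea from the optimization discussion can also be written at the C level. The variant below is only a sketch (the name dotp_two_sums is hypothetical): it keeps two independent accumulators so that two multiply-accumulates can proceed side by side, and it handles an odd count with one trailing step.

```c
/* Dot product with two independent partial sums per loop pass. */
int dotp_two_sums(const short *m, const short *n, int count)
{
    int sum_even = 0;   /* accumulates elements 0, 2, 4, ... */
    int sum_odd = 0;    /* accumulates elements 1, 3, 5, ... */
    int i;

    for (i = 0; i + 1 < count; i += 2) {
        sum_even += m[i] * n[i];
        sum_odd  += m[i + 1] * n[i + 1];
    }
    if (i < count) {    /* leftover element when count is odd */
        sum_even += m[i] * n[i];
    }
    return sum_even + sum_odd;
}
```

Because the two accumulators do not depend on each other, a compiler or assembly optimizer is free to schedule them on the two data paths, which is the effect the hand-built schedule achieves with MPY and MPYH.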

TI TMS C64x code:

LOOP:
      [A0]  SUB  .L1   A0, 1, A0
   || [!A0] ADD  .S1   A6, A5, A5
   ||       MPY  .M1X  B4, A4, A6
   || [B0]  BDEC .S2   LOOP, B0
            LDH  .D1T1 *A3++, A4
            LDH  .D2T2 *B5++, B4

Another code example

MIPS:

loop: LW   R1, 0(R11)
      MUL  R2, R1, R10
      SW   R2, 0(R12)
      BGTZ R12, loop

TI TMS C64x:

      ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MVK .S2 #-4,B1
      ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MUL .M1 A1,A10,A2 || ADDK .S2 #-12,B12
loop: ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MUL .M1 A1,A10,A2 || STW .D2x A2,0(B12) ||
      ADD .L2 B12,B1,B12 || BGTZ .S2 B12, loop
      ADD .L2 B12, B1, B12 || MUL .M1 A1,A10,A2 || STW .D2x A2,0(B12)
      ADD .L2 B12, B1, B12 || STW .D2x A2,0(B12)

Instruction | Description | Example Application
BITC4 | Bit counter | Machine vision
GMPY4 | Galois Field MPY | Reed Solomon support
SHFL | Bit interleaving | Convolution encoder
DEAL | Bit de-interleaving | Cable modem
SWAP4 | Byte swap | Endian swap
XPNDx | Bit expansion | Graphics
MPYHIx, MPYLIx | Extended precision 16x32 MPYs | Audio
AVGx | | Motion compensation
SUBABS4 | |