Intel Pentium 4

ENCM 515 - 2002

Jonathan Bienert

Tyson Marchuk

Overview:
  • Product review
  • Specialized architectural features (NetBurst)
  • SIMD instructional capabilities (MMX, SSE2)
  • SHARC 2106x comparison
Intel Pentium 4
  • Reworked micro-architecture for high-bandwidth applications
      • Internet audio and streaming video, image processing, video content creation, speech, 3D, CAD, games, multi-media, and multi-tasking user environments
  • These are DSP intensive applications!
    • What about uses outside the PC?
Hardware Features: (NetBurst micro-architecture)
  • Hyper pipelined technology
  • Advanced dynamic execution
  • Cache (data, L1, L2)
  • Rapid ALU execution engines
  • 400 MHz bus
  • Out-of-order execution (OOE)
  • Microcode ROM
Hyper Pipeline
  • 20-stage pipeline!!!
  • breaks down complex CISC instructions
    • sub-stages mimic RISC
    • faster execution
Filling the pipeline...
  • Looks ahead at the next 126 instructions to be executed
  • Branch prediction
    • on a mispredict, must flush the 20-stage pipeline!!!
    • branch target buffer (BTB)
    • 4K branch history table (BHT)
    • assembly instruction hints
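
To illustrate how a branch history table can guess a branch's direction, here is a minimal sketch in C. The 2-bit saturating counter scheme, the index function, and the names `bht_predict` and `bht_update` are textbook assumptions chosen for illustration; the slides give only the 4K table size, not the Pentium 4's actual prediction algorithm.

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 4096u  /* 4K entries, matching the table size on the slide */

/* One 2-bit saturating counter per entry:
   0-1 predict "not taken", 2-3 predict "taken".
   This scheme is an assumption, not the documented P4 design. */
static uint8_t bht[BHT_ENTRIES];

/* Hypothetical index function: low bits of the branch address. */
static unsigned bht_index(uint32_t branch_addr) {
    return branch_addr % BHT_ENTRIES;
}

/* Predict the direction of the branch at branch_addr. */
bool bht_predict(uint32_t branch_addr) {
    return bht[bht_index(branch_addr)] >= 2;
}

/* Train the counter with the branch's real outcome. */
void bht_update(uint32_t branch_addr, bool taken) {
    uint8_t *counter = &bht[bht_index(branch_addr)];
    if (taken) {
        if (*counter < 3) (*counter)++;
    } else {
        if (*counter > 0) (*counter)--;
    }
}
```

With this scheme a loop branch that is almost always taken survives one odd outcome without flipping its prediction, which matters when each mispredict costs a full pipeline flush.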
Cache
  • 8KB Data Cache
  • L1 Execution Trace Cache
    • stores 12K previously decoded micro-instructions
    • saves having to translate (decode) them again
  • L2 Advanced Transfer Cache
    • 256K for data
    • 256-bit transfer every cycle
      • allows ~77 GB/s data transfer at 2.4 GHz
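
The 77 GB/s figure follows from the bus width and the clock rate. A small sketch of the arithmetic, where the function name `peak_bandwidth_bytes_per_s` is a hypothetical helper and one transfer per clock is assumed, as the slide states:

```c
/* Peak bytes per second for a path that moves bits_per_cycle bits on every clock.
   Assumes one full-width transfer per cycle, as stated on the slide. */
double peak_bandwidth_bytes_per_s(unsigned bits_per_cycle, double clock_hz) {
    return (bits_per_cycle / 8.0) * clock_hz;
}
```

Calling `peak_bandwidth_bytes_per_s(256, 2.4e9)` gives 32 bytes x 2.4 GHz = 76.8 GB/s, which the slide rounds to 77 GB/s.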
Rapid ALU Execution Engines
  • 2 ALUs
    • allow parallel operations
  • Many arithmetic operations take 1/2 cycle
    • each double-speed (2X) ALU can complete 2 operations per cycle
Software Features:
  • Multimedia Extensions (MMX)
    • 8 MMX registers
  • Streaming SIMD Extensions 2 (SSE2)
    • 8 SSE/SSE2 registers
  • Standard x86 Registers
    • EAX, EBX, ECX, EDX, ESI, etc.
    • Register renaming onto over 100 internal registers
MMX (Multimedia Extensions)
  • Accelerated performance through SIMD
      • multimedia, communication, internet applications
  • 64-bit packed INTEGER data
        • signed/unsigned
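
To make "64-bit packed integer data" concrete, here is a portable C sketch that emulates what one MMX packed add (PADDW) does: four independent 16-bit additions inside a single 64-bit value, each lane wrapping on overflow. The helper names `pack_4x16` and `add_4x16` are hypothetical, and real MMX code would use the instruction directly rather than this bit trick.

```c
#include <stdint.h>

/* Pack four 16-bit lanes into one 64-bit word, lane 0 in the low bits. */
uint64_t pack_4x16(uint16_t l0, uint16_t l1, uint16_t l2, uint16_t l3) {
    return (uint64_t)l0 | ((uint64_t)l1 << 16) |
           ((uint64_t)l2 << 32) | ((uint64_t)l3 << 48);
}

/* Lane-wise wraparound add of four 16-bit values.
   The top bit of every lane is masked off before adding so that no carry
   can cross into the neighbouring lane, then restored with XOR. */
uint64_t add_4x16(uint64_t a, uint64_t b) {
    const uint64_t top = 0x8000800080008000ULL;   /* top bit of each lane */
    uint64_t low = (a & ~top) + (b & ~top);       /* add the low 15 bits per lane */
    return low ^ ((a ^ b) & top);                 /* add the top bits, carry-out dropped */
}
```

For example, adding lanes (1, 2, 3, 4) and (10, 20, 30, 40) yields (11, 22, 33, 44), and a lane holding 0xFFFF plus 1 wraps to 0 without disturbing its neighbour.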
SSE2 (Streaming SIMD Extensions 2)
  • Accelerate a broad range of applications
        • video, speech, image and photo processing, encryption, financial, engineering, and scientific applications
  • 128-bit SIMD instruction formats
        • 4 single precision FP values
        • 2 double precision FP values
        • 16 byte values
        • 8 word values
        • 4 double word values
        • 2 quad word values
        • 1 128-bit integer value
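
The formats listed above are different ways of slicing the same 128 bits. A C sketch of that idea, where the union name `xmm_view` and the helper `lanes_per_register` are hypothetical and `float` and `double` are assumed to be 32 and 64 bits wide:

```c
#include <stdint.h>

/* Different views of one 128-bit register. Every member occupies the same 16 bytes. */
typedef union {
    float    f32[4];   /* 4 single precision FP values */
    double   f64[2];   /* 2 double precision FP values */
    uint8_t  u8[16];   /* 16 byte values */
    uint16_t u16[8];   /* 8 word values */
    uint32_t u32[4];   /* 4 double word values */
    uint64_t u64[2];   /* 2 quad word values */
} xmm_view;

_Static_assert(sizeof(xmm_view) == 16, "all views must cover exactly 128 bits");

/* Number of elements of the given width that fit in one 128-bit register. */
unsigned lanes_per_register(unsigned element_bits) {
    return 128u / element_bits;
}
```

The narrower the element, the more values one instruction processes: `lanes_per_register(8)` is 16, while `lanes_per_register(64)` is 2.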
SIMD Example (16-tap FIR filter - Real numbers)
  • Applications for real FIR filters
      • general purpose filters in image processing, audio, and communication algorithms
  • Will utilize SSE2 SIMD instruction set
Thinking about SIMD
  • SSE2 operates on 128-bit data
      • 128-bit SSE2 registers
      • Many data formats!
      • What precision do we want?
  • Let's use 32-bit floating point for coefficients, input, output

4 values x 32 bits = 128 bits

Parallelizing
  • Require many single multiplications (coefficients x inputs), then add the results for output!
  • Multiplications…
  • then need to perform additions...
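
As a baseline before parallelizing, here is a plain scalar sketch of one FIR output in C: 16 multiplications followed by 16 additions. The function name `fir16_scalar` and the indexing convention (the newest sample at `x[n]`) are assumptions for illustration.

```c
#include <stddef.h>

#define TAPS 16

/* One output of a 16-tap real FIR filter:
   y[n] = h[0]*x[n] + h[1]*x[n-1] + ... + h[15]*x[n-15].
   The caller must ensure n >= TAPS - 1 so every index is valid. */
float fir16_scalar(const float h[TAPS], const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t k = 0; k < TAPS; ++k) {
        acc += h[k] * x[n - k];   /* one multiply, one add per tap */
    }
    return acc;
}
```

Each loop iteration is independent apart from the running sum, which is what makes the multiplications easy to spread across SIMD lanes.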
Using SSE2 format
  • Can hold 4 elements of an array (of 32-bit data) in each 128-bit register
  • 4 single precision floating point ops per cycle (32-bit)
Additions...
  • In both registers, now have 4 32-bit results
    • First add the results into an accumulator register
  • 4 single precision floating point ops per cycle (32-bit)
Additions...
  • In one register, we now have four 32-bit partial sums
    • however, there is NO SSE2 instruction to add these 4 together!
    • But can use other instructions
      • Some BIT INTERTWINING (interleaving/shuffling lanes)… then add
    • This will give results for several output values!
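
The multiply, accumulate, and shuffle-then-add steps can be sketched with compiler intrinsics. This is a minimal illustration, not the code from the Intel application note: it computes a single output rather than several at once, the name `fir16_simd` and the `window[k] = x[n-k]` layout are assumptions, and the packed single-precision intrinsics used here come from the original SSE header, which SSE2-capable processors also support. A scalar fallback with the same lane structure is included for compilers without SSE.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>   /* packed single-precision intrinsics */
#define HAVE_SSE 1
#endif

#define TAPS 16

/* One FIR output as a 16-element dot product.
   window[k] is assumed to hold x[n-k], so it lines up with h[k]. */
float fir16_simd(const float h[TAPS], const float window[TAPS]) {
#ifdef HAVE_SSE
    __m128 acc = _mm_setzero_ps();
    for (size_t k = 0; k < TAPS; k += 4) {
        __m128 c = _mm_loadu_ps(h + k);            /* 4 coefficients */
        __m128 s = _mm_loadu_ps(window + k);       /* 4 samples */
        acc = _mm_add_ps(acc, _mm_mul_ps(c, s));   /* 4 multiplies, 4 adds */
    }
    /* acc = [a0 a1 a2 a3]: no single instruction sums the lanes,
       so move lanes around and add twice. */
    __m128 high  = _mm_movehl_ps(acc, acc);        /* [a2 a3 a2 a3] */
    __m128 pairs = _mm_add_ps(acc, high);          /* lane0 = a0+a2, lane1 = a1+a3 */
    __m128 odd   = _mm_shuffle_ps(pairs, pairs, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 total = _mm_add_ss(pairs, odd);         /* lane0 = a0+a2+a1+a3 */
    return _mm_cvtss_f32(total);
#else
    /* Portable fallback that mirrors the 4-lane structure. */
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t k = 0; k < TAPS; k += 4)
        for (size_t j = 0; j < 4; ++j)
            lane[j] += h[k + j] * window[k + j];
    return (lane[0] + lane[2]) + (lane[1] + lane[3]);
#endif
}
```

The loop runs only four times for 16 taps, and the lane shuffling happens once per output, so its cost is small relative to the multiply-accumulate work.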
ADI SHARC 21k vs. P4

SHARC disadvantages

  • Slower clock speed (40 MHz vs. 2400 MHz)
  • Fewer opportunities for parallelism (5 vs. 11)
  • Much less memory (cache and system)
    • Limited algorithm applicability
    • Limited applications
  • Older (less compiler support)
    • 1994 vs. 2001
ADI SHARC 21k vs. P4

SHARC advantages

  • Hardware loops
  • Easier to program for optimal speed
  • Cheaper
  • Lower power consumption
  • Runs cooler
FIR Performance
  • Hard to obtain P4 performance numbers
  • Can estimate from 2 FP multiplies per clock, the clock rate, and the assumption that the pipeline can be kept full.
    • 2 * 2.4 GHz ~ 4.8 billion multiplies per second
    • If ~4 multiplies per element and 44,000 samples/s
    • FIR length > ~25k taps
  • SHARC => ~200 taps (Lab 4)
  • Factor of ~125x
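
The estimate above can be reproduced with a few lines of C. The function name `max_fir_taps` is hypothetical, and the inputs are the slide's own assumptions (2 multiplies per clock, a full pipeline, ~4 multiplies per element, 44,000 samples/s).

```c
/* Longest FIR filter, in taps, that can be computed in real time
   given a multiply throughput and a per-tap multiply cost. */
double max_fir_taps(double multiplies_per_s, double multiplies_per_tap,
                    double sample_rate_hz) {
    return multiplies_per_s / (multiplies_per_tap * sample_rate_hz);
}
```

`max_fir_taps(2 * 2.4e9, 4.0, 44000.0)` comes to roughly 27,000 taps, consistent with the slide's "> ~25k". Dividing the rounded 25,000 by the SHARC's ~200 taps gives the quoted factor of ~125.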
IIR Performance
  • Hard to obtain P4 performance numbers
  • No hardware circular buffers
  • Does have BTB, BHT, etc.
  • Prefetches ~256 bytes ahead of current position in code.
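
Without hardware circular buffers, the index wrap-around that a DSP performs for free has to be done in software. A minimal sketch in C, assuming a power-of-two buffer length so the wrap is a single AND; the names `delay_line`, `delay_push`, and `delay_get` are hypothetical.

```c
#include <stddef.h>

#define DELAY_LEN  16u                 /* power of two so the wrap is one AND */
#define DELAY_MASK (DELAY_LEN - 1u)

typedef struct {
    float  data[DELAY_LEN];
    size_t head;                       /* index of the most recent sample */
} delay_line;

/* Store a new sample, overwriting the oldest one. */
void delay_push(delay_line *d, float sample) {
    d->head = (d->head + 1u) & DELAY_MASK;
    d->data[d->head] = sample;
}

/* Read the sample from `age` steps ago (0 = newest); age must be < DELAY_LEN. */
float delay_get(const delay_line *d, size_t age) {
    return d->data[(d->head - age) & DELAY_MASK];
}
```

Every access costs an extra add and mask, which is exactly the per-sample overhead that hardware circular addressing removes.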
FFT Performance
  • Hard to obtain P4 performance numbers
  • Prime95 uses FFTs to run the Lucas-Lehmer test for Mersenne primes
    • Involves FFT, squaring and iFFT, etc.
  • 256k points on P4 2.3 GHz ~ 10.517 ms
  • Compare to SHARC 2048-point FFT ~ 0.37 ms
  • If SHARC could do 256k points (scaling linearly), ~46.25 ms (But…)
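
The 46.25 ms figure is a linear extrapolation of the 2048-point time, taking 256k as 256,000 points. A short C sketch of that arithmetic, with the hypothetical helper `scale_linear`:

```c
/* Extrapolate a measured time to a different problem size,
   assuming run time grows in direct proportion to the number of points. */
double scale_linear(double time_ms, double points_measured, double points_target) {
    return time_ms * (points_target / points_measured);
}
```

`scale_linear(0.37, 2048.0, 256000.0)` multiplies 0.37 ms by 125, giving 46.25 ms, about 4.4 times the Pentium 4's 10.517 ms. Because FFT work grows as N log N rather than N, the linear figure is optimistic for the SHARC.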
Optimization Example
  • Hard to optimize Pentium 4 assembly
  • Example of multiplying by a constant, 10
  • Taken mainly from: www.emulators.com/docs/pentium_1.htm
Multiplying by 10
  • Slowest way:
    • IMUL EAX, 10
  • Usually optimal way (Visual C++ 6.0)
    • LEA EAX, [EAX+EAX*4]
    • SHL EAX, 1
    • Shift – Add – Shift
    • On most x86 processors takes 2 cycles
    • Pentium MMX and before 3 cycles
    • On Pentium 4 takes 6 cycles!
Multiplying by 10
  • Optimal for Pentium 4
    • LEA ECX, [EAX + EAX]
    • LEA EAX, [ECX+EAX*8]
    • On most x86 still takes 2 cycles
    • On Pentium 4 takes ~ 3 cycles (OOE - Ops)
    • But on older processors Pentium MMX and before this now takes 4 cycles!
Multiplying by 10
  • Best generic case
    • LEA EAX, [EAX + EAX*4]
    • ADD EAX, EAX
    • On most x86 still takes 2 cycles
    • On older processors Pentium MMX and before this now takes 3 cycles again
    • On Pentium 4 this takes 4 cycles
  • Obviously really hard to optimize
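
The three instruction sequences compute the same product in different ways. The C sketch below mirrors each one step by step so the arithmetic can be checked; the function names are hypothetical, and the code says nothing about cycle counts, which depend on the processor as described above.

```c
#include <stdint.h>

/* LEA EAX, [EAX+EAX*4] ; SHL EAX, 1      ->  (x + 4x) * 2 */
uint32_t mul10_lea_shl(uint32_t x) {
    uint32_t t = x + (x << 2);   /* 5x */
    return t << 1;               /* 10x */
}

/* LEA ECX, [EAX+EAX] ; LEA EAX, [ECX+EAX*8]  ->  2x + 8x */
uint32_t mul10_two_lea(uint32_t x) {
    uint32_t c = x + x;          /* 2x */
    return c + (x << 3);         /* 2x + 8x = 10x */
}

/* LEA EAX, [EAX+EAX*4] ; ADD EAX, EAX    ->  5x + 5x */
uint32_t mul10_lea_add(uint32_t x) {
    uint32_t t = x + (x << 2);   /* 5x */
    return t + t;                /* 10x */
}
```

All three agree with `x * 10` for every 32-bit input, including values that overflow, because unsigned arithmetic in C wraps the same way the 32-bit registers do.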
REFERENCES
  • Intel application note: AP 809 - Real and Complex Filter Using Streaming SIMD Extensions
  • graphics from: http://www6.tomshardware.com/cpu/00q4/001120/p4-01.html