A HIGH THROUGHPUT PIPELINED ARCHITECTURE FOR H.264/AVC DEBLOCKING FILTER. Kefalas Nikolaos, Theodoridis George. VLSI Design Lab., Electrical & Computer Eng. Department, University of Patras, Greece.

Presentation Transcript


  1. A HIGH THROUGHPUT PIPELINED ARCHITECTURE FOR H.264/AVC DEBLOCKING FILTER • Kefalas Nikolaos, Theodoridis George • VLSI Design Lab., Electrical & Computer Eng. Department, University of Patras, Greece

  2. Outline • Deblocking filter algorithm • Filtering ordering • Memory organization • Pipelined architecture • Synthesis results and comparisons • Conclusions and future work

  3. Deblocking Filter Algorithm (1/3) • The deblocking filter is used in H.264/AVC to reduce blocking artifacts • It improves subjective and objective quality and typically reduces the bit rate by 5-10% • It is performed on a macroblock (MB) basis after the macroblock reconstruction stage has completed • It includes a large number of data-dependent branches, and each 4x4 pixel area is processed up to four times • It accounts for over one-third (1/3) of the total decoding time

  4. Deblocking Filter Algorithm (2/3) • Each MB is processed in 4x4 blocks • The vertical edges are filtered first, from left to right • from edge V0 to edge V3 • Then the horizontal edges are filtered, from top to bottom • from edge H0 to edge H3 • The 8 pixels that straddle the edge between two adjacent 4x4 sub-blocks are filtered at the same time • The same process is repeated for the chroma components

  5. Deblocking Filter Algorithm (3/3) • Each sub-edge shares one BS value • The BS, along with two thresholds α and β, decides the filtering strength of each sub-edge • A filter-samples flag is calculated • Three filter types are used • Strong filter (4- or 5-tap filter) • Weak filter • No filtering
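The decision described on slide 5 can be sketched as follows. This is a minimal illustration based on the filter-samples condition of the H.264/AVC standard, not the authors' hardware; the function names and the string labels are hypothetical, and `alpha` and `beta` are assumed to have been looked up already.

```python
def filter_samples_flag(bs, p1, p0, q0, q1, alpha, beta):
    """True when one line of samples across the edge should be filtered."""
    return (bs != 0
            and abs(p0 - q0) < alpha
            and abs(p1 - p0) < beta
            and abs(q1 - q0) < beta)


def filter_type(bs, p1, p0, q0, q1, alpha, beta):
    """Pick one of the three filter types named on the slide (labels are illustrative)."""
    if not filter_samples_flag(bs, p1, p0, q0, q1, alpha, beta):
        return "none"
    # BS = 4 selects the strong filter, BS = 1..3 the weak filter.
    return "strong" if bs == 4 else "weak"
```

A large step across the edge (|p0 - q0| >= α) is treated as real image content and left unfiltered, even when BS is non-zero.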

  6. Outline • Deblocking filter algorithm • Filtering ordering • Memory organization • Pipelined architecture • Synthesis results and comparisons • Conclusions and future work

  7. Filtering Order • During filtering all four sub-edges of each sub-block are filtered and almost all pixels are involved and updated • A suitable filtering order is needed to: • Reduce the size of the on-chip memory for buffering intermediate data • Increase data reuse • Reduce the external memory accesses • Simplify control and steering logic • Avoid pipeline stalls due to data and resource hazards

  8. Proposed Filtering Order • The vertical sub-edges are filtered in raster-scan order, followed by the horizontal ones • The filtering direction does not change until all vertical edges of luma and chroma have been filtered • The proposed order complies with the standard
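A rough sketch of the ordering on slide 8, assuming 4:2:0 video (16 luma sub-edges plus 4 per chroma component in each direction). The tuple layout, the function name, and the choice to visit luma before Cb and Cr within a direction are assumptions made for illustration; the slide only states that all vertical sub-edges precede the horizontal ones and that raster-scan order is used.

```python
def sub_edge_order():
    """List the sub-edges of one macroblock in a vertical-first, raster-scan order.

    Each entry is (direction, component, row, col), naming the 4x4 sub-block
    whose left ("V") or top ("H") sub-edge is filtered.
    """
    order = []
    for direction in ("V", "H"):          # direction changes only once
        for row in range(4):              # luma: 4x4 grid of sub-blocks
            for col in range(4):
                order.append((direction, "Y", row, col))
        for comp in ("Cb", "Cr"):         # chroma (4:2:0): 2x2 grid each
            for row in range(2):
                for col in range(2):
                    order.append((direction, comp, row, col))
    return order
```

With this enumeration each direction contains 24 sub-edges, which matches the 4x24 = 96 cycle figure quoted later in the deck for the two-port case.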

  9. Outline • Deblocking filter algorithm • Filtering ordering • Memory organization • Pipelined architecture • Synthesis results and comparisons • Conclusions and future work

  10. Memory Organization (1/2) • Four single-port memories are employed (sizes in bits) • Current-A (CM-A): 96x32 • Current-B (CM-B): 96x32 • Left_mem (LM): 32x32 • Upper_mem (UM): 2xFWx32 + 2x(FW/16)x32 • Transpose buffers TR-P and TR-Q (4x32), built as a typical systolic array • All internal buses are 32 bits wide
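The sizes on slide 10 can be turned into a quick capacity estimate. The sketch below reads "AxB" as A words of B bits and takes FW as the frame width in pixels; both readings are assumptions, as are the function names.

```python
def upper_mem_bits(frame_width):
    """Upper_mem size in bits: 2 x FW x 32 + 2 x (FW/16) x 32."""
    return 2 * frame_width * 32 + 2 * (frame_width // 16) * 32


def total_memory_bits(frame_width):
    """Sum of the four single-port memories (transpose buffers excluded)."""
    cm_a = 96 * 32   # Current-A
    cm_b = 96 * 32   # Current-B
    lm = 32 * 32     # Left_mem
    return cm_a + cm_b + lm + upper_mem_bits(frame_width)
```

For a 1920-pixel-wide frame this gives 130,560 bits for Upper_mem and 137,728 bits in total, so the frame-width-dependent upper memory dominates the on-chip storage.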

  11. Memory Organization (2/2)

  12. Outline • Deblocking filter algorithm • Filtering ordering • Memory organization • Pipelined architecture • Synthesis results and comparisons • Conclusions and future work

  13. Algorithm Features • Computationally intensive operations of the deblocking filter algorithm • LUT operations: retrieve the values α(IndexA), β(IndexB), c1(IndexA, BS) • BS calculation • Weak filter, BS(1~3): filtering, δ calculation and clipping operations • Strong filter, BS(4) • The introduced pipeline exploits specific algorithmic features • BS is the same for all micro-edges of a sub-edge for the luma component • BS of the luma component is reused for the chroma components • For the 4:2:0 format, BS changes every 2 micro-edges in the chroma components
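The LUT operations on slide 13 need the two table indices first. The sketch below follows the index derivation of the H.264/AVC standard (average QP of the two blocks plus a slice-level offset, clipped to 0..51); the slide itself does not show it, the names are hypothetical, and the α, β and c1 tables themselves are left out.

```python
def clip3(lo, hi, x):
    """Clamp x to the range [lo, hi]."""
    return max(lo, min(hi, x))


def lut_indices(qp_p, qp_q, filter_offset_a=0, filter_offset_b=0):
    """Return (IndexA, IndexB) for the blocks on either side of an edge."""
    qp_av = (qp_p + qp_q + 1) >> 1                 # rounded average QP
    index_a = clip3(0, 51, qp_av + filter_offset_a)
    index_b = clip3(0, 51, qp_av + filter_offset_b)
    return index_a, index_b
```

None of these inputs depends on pixel values, so this step can be computed ahead of the filtering itself.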

  14. Proposed Pipeline Organization

  15. Pipeline Operation • Each sub-block needs 4 cycles to be processed • The BS unit spends 4 cycles (BS calculation & LUT operations) • BS and LUT operations do not depend on pixel values • BS calculation & LUT operations are overlapped with the filtering operations for the luma component • Four initialization cycles are needed to calculate the BS and the α, β, c1 values for the first luma sub-block

  16. BS=4 Filtering • Filter equations modified to improve delay & area • BS=4 filter: 13 adders instead of 28 • Total components: adders 13+14+4 = 31
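The transcript does not include the modified equations, so the following is only an illustration of the general idea of saving adders by sharing partial sums. It uses the luma strong-filter equations of the H.264/AVC standard for the p side (the branch taken when the strongest filtering applies); the particular factoring and the function names are assumptions and are not claimed to be the authors' 13-adder design.

```python
def strong_filter_p_direct(p3, p2, p1, p0, q0, q1):
    """Standard BS=4 luma equations for p0', p1', p2', written out term by term."""
    p0n = (p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3
    p1n = (p2 + p1 + p0 + q0 + 2) >> 2
    p2n = (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3
    return p0n, p1n, p2n


def strong_filter_p_shared(p3, p2, p1, p0, q0, q1):
    """Same outputs, with the common partial sums computed once."""
    s = p1 + p0 + q0            # shared by all three outputs
    t = s + p2                  # shared by all three outputs
    p0n = (t + s + q1 + 4) >> 3
    p1n = (t + 2) >> 2
    p2n = (2 * (p3 + p2) + t + 4) >> 3
    return p0n, p1n, p2n
```

Multiplications by two are shifts in hardware, so the cost is dominated by additions, and reusing `s` and `t` removes several of them.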

  17. Pipeline Benefits • LUT operations and BS calculation are not squeezed into a single pipeline stage • The BS unit takes 4 cycles • The filtering operations are spread over three pipeline stages • The BS values are reused for filtering the chroma components • The original filtering equations are modified (improving performance & area) • The proposed ordering eases control logic and memory addressing, avoiding any potential increase in the critical path

  18. Edge Filter Process

  19. Vertical Edge Filter Process • Total cycles = 4x27 = 108 • If a two-port memory were used, the total would be 4x24 = 96 cycles, which is the optimum

  20. Processing Cycles • Vertical edges: 108 cycles • Horizontal edges: 108 cycles • Initialization: 10 cycles • 6 cycles to fetch coding info and initialize control • 4 cycles for the first BS calculation • Normal operation: 226 cycles • For the last row (edges 27, 31, 35, 41, 45): 5x4 = 20 extra cycles • Resource hazard (bus conflict) • For the last MB in the frame, 12 extra cycles are needed (edges 39, 43, 47) • Resource hazard (bus conflict) • Worst-case total: 258 cycles
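The cycle budget on slide 20 is simple arithmetic, reproduced below as a check. All figures come from the slides; the constant and function names are hypothetical.

```python
CYCLES_PER_SUB_EDGE = 4


def cycle_budget():
    """Return (normal, worst_case) cycles per macroblock from the slide figures."""
    vertical = CYCLES_PER_SUB_EDGE * 27    # 108, single-port memories
    horizontal = 108                       # as stated on the slide
    initialize = 6 + 4                     # fetch/control + first BS calculation
    normal = vertical + horizontal + initialize
    last_row_extra = 5 * CYCLES_PER_SUB_EDGE   # edges 27, 31, 35, 41, 45
    last_mb_extra = 12                         # edges 39, 43, 47
    worst_case = normal + last_row_extra + last_mb_extra
    return normal, worst_case
```

The totals come out to 226 cycles in normal operation and 258 cycles in the worst case, in agreement with the slide.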

  21. Outline • Deblocking filter algorithm • Filtering ordering • Memory organization • Pipelined architecture • Synthesis results and comparisons • Conclusions and future work

  22. Experimental Setup • Synthesis setup • Synopsys Design Compiler • TSMC 0.18 µm • FPGA proven • Stand-alone, compared with the JM reference software • It has also been verified as part of an H.264 hardware encoder • It achieves 280 MHz on a Virtex-5, speed grade 3

  23. Synthesis Results and Comparisons (table notes) • 1: 1P: Single-Port, 2P: Two-Port • 2: FW: Frame width • 3: Filtering cycles only • 4: Filtering cycles only • 5: It takes 246 cycles to filter a MB at the right frame boundary • 6: It takes 246 cycles to filter a MB at the bottom frame row • 7: The 2x(FW/16)x32 bits are stored in upper memory

  24. Conclusions • A novel high-speed pipelined architecture for the H.264/AVC deblocking filter is proposed • It operates at 400 MHz and occupies 19.2 Kgates in 0.18 µm CMOS technology • It achieves 216 and 54 fps for Full-HD and Ultra-HD frames, respectively • Only single-port memories are employed • No external memory accesses are needed during filtering • Parameters and neighbors are stored internally • Only fully filtered data are written to external memories
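The frame rates quoted in the conclusions can be reproduced from the 400 MHz clock and the 226-cycle normal-operation count per macroblock. The sketch assumes that "Full" and "Ultra-HD" mean 1920x1080 and 3840x2160 and that the 226-cycle figure is the one behind the quoted rates; neither is stated explicitly in the slides.

```python
import math


def frames_per_second(width, height, clock_hz=400e6, cycles_per_mb=226):
    """Frames per second when every 16x16 macroblock costs cycles_per_mb cycles."""
    macroblocks = math.ceil(width / 16) * math.ceil(height / 16)
    return clock_hz / (macroblocks * cycles_per_mb)
```

Under these assumptions the result is about 216.9 fps for 1920x1080 and about 54.6 fps for 3840x2160, which round down to the 216 and 54 fps on the slide.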

  25. Questions ???

  26. Hardware Architecture (Pipeline organization) 5/ Threshold Calculation

  27. BS=4 Filtering

  28. Deblocking Filter Algorithm 3/3 • Each sub-edge between two adjacent 4x4 luma sub-blocks shares a Boundary Strength (BS) • The BS value, along with two threshold variables α and β, decides the filtering strength of the sub-edge

  29. Hardware Architecture (Pipeline organization) 5/ Bs 1,2,3 filter

  30. Deblocking Filter Algorithm 4/4 • Boundary strength across horizontal edges • The boundary strength is calculated for each sub-edge for the luma component • It is reused for the chroma components in 2:1 ratio for 4:2:0 format
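The 2:1 reuse on slide 30 can be sketched as a simple index mapping. It assumes 4:2:0 video, where an 8-sample chroma edge corresponds to a 16-sample luma edge made of four 4-sample sub-edges; the function name and the list representation are hypothetical.

```python
def chroma_bs_per_line(luma_bs):
    """Map the four luma sub-edge BS values of one edge to its 8 chroma lines.

    Chroma line k corresponds to luma line 2k, which lies in luma
    sub-edge (2k) // 4 == k // 2, so the BS changes every 2 chroma lines.
    """
    assert len(luma_bs) == 4
    return [luma_bs[k >> 1] for k in range(8)]
```

For luma values [1, 2, 3, 4] the chroma lines receive [1, 1, 2, 2, 3, 3, 4, 4], so no separate BS calculation is needed for chroma.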
