
Parallel H.264 Decoding on an Embedded Multicore Processor

Parallel H.264 Decoding on an Embedded Multicore Processor. Arnaldo Azevedo¹, Cor Meenderinck¹, Ben Juurlink¹, Andrei Terechko², Jan Hoogerbrugge², Mauricio Alvarez³, Alex Ramirez³,⁴. 1 - Delft University of Technology, Netherlands; 2 - NXP, Netherlands


Presentation Transcript


  1. Parallel H.264 Decoding on an Embedded Multicore Processor • Arnaldo Azevedo¹, Cor Meenderinck¹, Ben Juurlink¹, Andrei Terechko², Jan Hoogerbrugge², Mauricio Alvarez³, Alex Ramirez³,⁴ • 1 - Delft University of Technology, Netherlands • 2 - NXP, Netherlands • 3 - Barcelona Supercomputing Center, Spain • 4 - Universitat Politecnica de Catalunya, Spain • HiPEAC (The 4th International Conference on High Performance and Embedded Architectures and Compilers) 2009

  2. Outline • Introduction • 3D-Wave • 3D-Wave Implementation • Experimental Results • Conclusions

  3. Introduction • Industry shift to multicores • Increasing demand for higher media quality/resolution • Efficient and scalable exploitation of multicore architectures for video coding • H.264 is widely used and computationally demanding • Decoding is part of encoding and more challenging

  4. Parallel H.264 Decoding: The H.264 Decoder • [Figure: the H.264 decoding process. Blocks shown: Encoded Bitstream, Stream Parsing, Entropy Decoder, Inverse Quantization, Inverse DCT, Spatial Prediction, Motion Compensation, Deblocking, Reference Frames. Stage labels: Parser, Reconstructor, Data-Parallel Processing. Source: http://www.powercam.cc/slide/1580]

  5. H.264 Parallelization • Frame-level: motion compensation introduces inter-frame dependencies; frame-level parallelism is very limited • Slice-level: slice-level parallelism is uncertain and increases bitrate • [Figure: a frame divided into Slice 1, Slice 2, and Slice 3; a frame sequence labelled I0, B1, B2, P3, B4, B5, P6, P9]

  6. H.264 Parallelization: Macroblock-level • 2D-Wave: exploits MB-level parallelism • [Figure: the current MB and the neighbouring MBs it depends on, labelled Intra and Intra DF]

  7. H.264 Parallelization: Macroblock-level • 2D-Wave: exploits MB-level parallelism • Full HD: up to 60 MBs in parallel • [Figure: the current MB and the neighbouring MBs it depends on, labelled Intra and Intra DF]
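The figure of 60 parallel MBs can be reproduced with a small model. This is only a sketch under stated assumptions: every MB takes one time slot, and an MB may start once its left and upper-right neighbours are done, which is the usual way the 2D-Wave dependencies are summarised. Under those assumptions MB (x, y) becomes ready in slot x + 2y. The function name `wave_2d_profile` is illustrative and does not come from the slides.

```python
from collections import Counter

def wave_2d_profile(width_mbs, height_mbs):
    """Number of MBs that can be decoded in each time slot of a 2D-Wave.

    Assumes unit decode time per MB and dependencies on the left and
    upper-right neighbours, so MB (x, y) becomes ready in slot x + 2*y.
    """
    slots = Counter(x + 2 * y
                    for y in range(height_mbs)
                    for x in range(width_mbs))
    return [slots[t] for t in range(max(slots) + 1)]

# Full HD is coded as 1920x1088 luma samples, i.e. 120x68 MBs of 16x16.
profile = wave_2d_profile(120, 68)
print(max(profile))   # peak number of independent MBs -> 60
print(len(profile))   # time slots needed with unlimited cores -> 254
```

The peak equals min(ceil(120 / 2), 68) = 60, and the ramp-up and ramp-down at the start and end of each frame are why the average parallelism in this model is well below the peak.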

  8. H.264 Parallelization: overview of current strategies • Frame-level: very limited parallelism • Slice-level: uncertain parallelism; increases bitrate • MB-level: reasonable parallelism • None of these is sufficient to leverage a many-core!

  9. 3D-Wave • [Figure: motion compensation dependencies between frame 0 (I), frame 1 (P), and frame 2 (P)]

  10. 3D-Wave: maximum parallelism • For full HD, the maximum available parallelism ranges from 5000 to 9000 MBs! • Note: this requires more than 200 frames in flight.

  11. 3D-Wave Implementation • 3D-Wave was implemented on an NXP multicore consisting of TM3270 TriMedia processors • The TM3270 was designed for SD video processing • VLIW-based media processor with SIMD support • In-house simulator capable of simulating up to 64 cores • 2D-Wave was already implemented • Tail submit (proposed by Hoogerbrugge and Terechko [13]): after decoding an MB, checks the right and down-left MBs; executes one of them if it is ready and sends the other to the task queue (TQ) • [13] Hoogerbrugge, J., Terechko, A.: “A Multithreaded Multicore System for Embedded Media Processing,” Transactions on High-Performance Embedded Architectures and Compilers 2008.
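The tail-submit step described on this slide can be sketched as follows. This is a single-worker illustration, not the NXP implementation: readiness is tracked with per-MB counters of undecoded dependencies (left and upper-right neighbour), which is an assumption made here, and the names `decode_frame_tail_submit` and `decode_mb` are hypothetical.

```python
from collections import deque

def decode_frame_tail_submit(width, height, decode_mb):
    """Decode one frame of width x height MBs with a single worker.

    Hypothetical sketch: each MB waits for its left and upper-right
    neighbour, tracked by a counter of dependencies still undecoded.
    """
    assert width >= 2, "sketch assumes at least two MB columns"
    pending = {(x, y): (x > 0) + (y > 0 and x + 1 < width)
               for y in range(height) for x in range(width)}
    task_queue = deque([(0, 0)])          # only MB (0, 0) has no dependencies
    while task_queue:
        mb = task_queue.popleft()
        while mb is not None:             # tail-submit chain, no queue round trip
            decode_mb(*mb)
            x, y = mb
            ready = []
            for nx, ny in ((x + 1, y), (x - 1, y + 1)):   # right, down-left
                if 0 <= nx < width and ny < height:
                    pending[(nx, ny)] -= 1
                    if pending[(nx, ny)] == 0:
                        ready.append((nx, ny))
            mb = ready.pop(0) if ready else None   # execute one directly
            task_queue.extend(ready)               # send the other to the TQ

order = []
decode_frame_tail_submit(4, 3, lambda x, y: order.append((x, y)))
print(order)
```

The point of the design is that a worker which has just finished an MB continues with a ready neighbour without going through the task queue, so the queue is only touched when both neighbours become ready at once or when the chain ends.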

  12. 3D-Wave Implementation: Reference Frame Buffer Structure • [Figure: Reference Frame Buffer Structure. One decoder; a reference frame buffer holding Frame 0 to Frame 5; a single Sync info block]

  13. 3D-Wave Implementation: Reference Frame Buffer Structure • [Figure: Parallel Reference Frame Buffer Structure. One decoder; Frame 0 to Frame 4, each with its own Sync info block]

  14. 3D-Wave Implementation: Reference Frame Buffer Structure • [Figure: Parallel Reference Frame Buffer Structure. Three decoders; Frame 0 to Frame 4, each with its own Sync info block]

  15. 3D-Wave Implementation: Inter-frame dependencies • mb_decode checks inter-frame dependencies • On failure, it inserts the MB in the Kick-Off List of the reference MB (Ref MB) • [Figure: Frame 0 and Frame 1; Kick-Off List of the Ref MB: F1;MB(1,3), NULL]

  16. 3D-Wave Implementation: Inter-frame dependencies • Decoding process continues normally • [Figure: Frame 0 and Frame 1; Kick-Off List of the Ref MB: F1;MB(1,3), NULL]

  17. 3D-Wave Implementation: Inter-frame dependencies • mb_decode checks the Kick-Off List and submits the subscribed tasks • [Figure: Frame 0 and Frame 1; Kick-Off List of the Ref MB: F1;MB(1,3), NULL]

  18. 3D-Wave Implementation: Inter-frame dependencies • And the decoding process carries on • [Figure: Frame 0 and Frame 1; Kick-Off List of the Ref MB: NULL]
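The Kick-Off List sequence on slides 15 to 18 can be sketched in a few lines. Only the name `mb_decode` and the example entry F1;MB(1,3) come from the slides; the data structures, the tuple layout (frame, x, y), and the position of the reference MB are assumptions made for illustration.

```python
from collections import defaultdict, deque

decoded = set()                  # (frame, x, y) of MBs already decoded
kick_off = defaultdict(list)     # reference MB -> tasks waiting for it
task_queue = deque()

def mb_decode(mb, ref_mb=None):
    """Try to decode `mb`; return False if it had to be parked."""
    if ref_mb is not None and ref_mb not in decoded:
        # Dependency check failed: subscribe to the reference MB.
        kick_off[ref_mb].append((mb, ref_mb))
        return False
    decoded.add(mb)              # stands in for the actual decoding work
    # Submit every task that subscribed to the MB just decoded.
    for waiting in kick_off.pop(mb, []):
        task_queue.append(waiting)
    return True

ref = (0, 2, 3)                  # frame 0, hypothetical reference MB
dep = (1, 1, 3)                  # frame 1, MB(1,3), as on the slides
print(mb_decode(dep, ref))       # False: parked in the Kick-Off List of ref
print(mb_decode(ref))            # True: ref decoded, dep is re-submitted
while task_queue:
    mb_decode(*task_queue.popleft())
print(dep in decoded)            # True
```

In a multithreaded decoder the check and the insertion into the Kick-Off List would have to be atomic with respect to the completion of the reference MB; this sketch is single-threaded and ignores that.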

  19. 3D-Wave Implementation: Frame Scheduling • 3D-Wave can have many frames in flight • A practical implementation requires few frames in flight • A policy was developed to limit the number of frames in flight • Implementation: uses the Kick-Off List; subscribes the first MB of the next frame to a specific MB in the current frame; the position of that MB defines the number of frames in flight

  20. 3D-Wave Implementation: Frame Priority • Frame latency is an important factor in video decoding • 3D-Wave interleaves the processing of all frames in flight • Frame Priority is necessary to limit frame latency in 3D-Wave • Implementation: splits the Task Queue (TQ) into high and low priority task queues; sends the tasks of the frame next-in-line to the high priority task queue; a core first checks the high priority TQ and executes from the low priority TQ only if the high priority TQ is empty
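A minimal sketch of the two-queue policy follows. The class and method names are hypothetical, and the promotion of already queued tasks in `frame_done` is an assumption of this sketch; the slides do not say what happens to low priority tasks of a frame that becomes next-in-line.

```python
from collections import deque

class PriorityTaskQueue:
    """Task queue split into a high and a low priority queue."""

    def __init__(self):
        self.high = deque()
        self.low = deque()
        self.next_frame = 0          # frame next in line to be completed

    def submit(self, frame, mb):
        # Tasks of the frame next-in-line go to the high priority queue.
        queue = self.high if frame == self.next_frame else self.low
        queue.append((frame, mb))

    def fetch(self):
        # Serve the high priority queue first, the low one otherwise.
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None

    def frame_done(self):
        # Assumption: queued tasks of the new next-in-line frame are promoted.
        self.next_frame += 1
        promoted = [t for t in self.low if t[0] == self.next_frame]
        self.low = deque(t for t in self.low if t[0] != self.next_frame)
        self.high.extend(promoted)

tq = PriorityTaskQueue()
tq.submit(1, (0, 0))     # frame 1 is not next in line: low priority
tq.submit(0, (5, 2))     # frame 0 is next in line: high priority
print(tq.fetch())        # (0, (5, 2)), served first although submitted later
```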

  21. Experimental Results • Uses the highly optimized NXP H.264 decoder • Machine-dependent optimizations (e.g. SIMD operations) • Machine-independent optimizations (e.g. code restructuring) • The experiments use all 4 videos from HD-VideoBench [10] • [10] Alvarez, M., Salami, E., Ramirez, A., Valero, M.: “HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications,” IEEE International Symposium on Workload Characterization 2007.

  22. Experimental Results: Methodology • Entropy decoding results of the entire sequence are buffered • Sequence contains only I and P frames with one slice • All frames are scheduled to execute at once • Reference Frame Buffer keeps all the frames of the sequence • Presented results are for 25 frames (1 second) of Rush_Hour in Full High Definition (FHD) • On a single core, 2D-Wave can decode 39 SD, 18 HD, and 8 FHD frames per second, respectively.

  23. Experimental Results: Scalability • Efficiency of more than 80% for 64 cores • Start-up and ramp-down times of the short sequence limit efficiency • With 64 cores, decoding is 16x faster than real-time for FHD
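The 16x figure is consistent with the other numbers on the slides, as the following back-of-the-envelope check suggests. It assumes that real-time means 25 frames per second (slide 22 equates 25 frames with 1 second) and uses the single-core 2D-Wave rate of 8 FHD frames per second as the baseline; the slides do not state which baseline the efficiency refers to.

```python
single_core_fps = 8     # FHD on a single core (slide 22)
cores = 64
efficiency = 0.80       # lower bound reported on slide 23
realtime_fps = 25       # assumption: 25 frames correspond to 1 second

speedup = cores * efficiency          # parallel speedup over one core
fps = single_core_fps * speedup       # estimated FHD frames per second

print(f"{fps:.1f} frames/s, {fps / realtime_fps:.1f}x real-time")
# prints "409.6 frames/s, 16.4x real-time"
```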

  24. Experimental Results: Frame Scheduling • FHD Rush_Hour decoding on 16 cores • Different colors represent different frames • Frame Scheduling limits the number of frames in flight • Performance loss is < 5% for at most 6 frames in flight

  25. Experimental Results: Frame Scheduling and Priority • FHD Rush_Hour decoding on 16 cores • Frame Priority reduces frame latency to the same level as 2D-Wave (10 ms) • Latency of the 1st frame: 58.5 ms → Frame Scheduling (15.1 ms) → Frame Scheduling and Priority (9.2 ms) • Does not reduce performance significantly (< 1%)

  26. Experimental Results: Bandwidth Requirements • Bandwidth required for 64 cores is approximately 21 GB/s • 3D-Wave is 20% more bandwidth efficient than 2D-Wave • Scheduling and Priority reduce locality and increase bandwidth

  27. Conclusions • 3D-Wave scales with high efficiency to a large number of cores • 3D-Wave allows efficient use of many-core architectures for video processing • Frame priority reduces latency to its minimum

  28. References • [3] Meenderinck, C., Azevedo, A., Alvarez, M., Juurlink, B., Ramirez, A.: “Parallel Scalability of H.264,” First Workshop on Programmability Issues for Multi-Core Computers 2008. • [10] Alvarez, M., Salami, E., Ramirez, A., Valero, M.: “HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications,” IEEE International Symposium on Workload Characterization 2007. • [13] Hoogerbrugge, J., Terechko, A.: “A Multithreaded Multicore System for Embedded Media Processing,” Transactions on High-Performance Embedded Architectures and Compilers 2008. • M. Alvarez, A. Ramirez, M. Valero, A. Azevedo, C.H. Meenderinck, B.H.H. Juurlink, “Performance Evaluation of Macroblock-level Parallelization of H.264 Decoding on a cc-NUMA Multiprocessor Architecture,” The 4CCC: 4th Colombian Computing Conference, Bucaramanga, Colombia, April 2009. • A. Azevedo, B.H.H. Juurlink, C.H. Meenderinck, A. Terechko, J. Hoogerbrugge, M. Alvarez, A. Ramirez, M. Valero, “A Highly Scalable Parallel Implementation of H.264,” Transactions on High-Performance Embedded Architectures and Compilers (HiPEAC), September 2009.
