
Software Performance Tuning Project Monkey’s Audio

Software Performance Tuning Project: Monkey’s Audio. Prepared by: Meni Orenbach, Roman Kaplan. Advisors: Liat Atsmon, Kobi Gottlieb. MAC – Ape File Encoder. Monkey’s Audio is a lossless audio codec. It can compress at different levels, and its output can be decompressed back to a WAV file.


Presentation Transcript


  1. Software Performance Tuning Project: Monkey’s Audio • Prepared by: Meni Orenbach, Roman Kaplan • Advisors: Liat Atsmon, Kobi Gottlieb

  2. MAC – Ape File Encoder • Monkey’s Audio – a lossless audio codec • Can compress at different levels • Can be decompressed back to a WAV file • Used to save storage space while maintaining all the original data • Playable

  3. Platform and Benchmark Used • Platform: Intel Core i7 with 3GB of RAM, running the Windows Vista operating system • Benchmark: a 238MB song; original encoding duration: 98.9 sec

  4. Algorithm Description • The input file is read frame by frame • Every frame contains a constant number of channels • Channels are encoded with a dependency between them • Every frame is encoded and immediately written

  5. The Encoding Process [diagram; three stages are marked “MultiThread Here!”]

  6. Function Data Flow [diagram; labels: encoding every frame, encode with a predictor, encoding the error for every channel; the most time-consuming functions are marked]

  7. Optimization Method • Dealing with the most time consuming functions • Two approaches were taken: • Multi-threading • SIMD

  8. Optimization Method 1: Threads • Monkey’s Audio originally ran on a single thread • The threaded version should maintain 1:1 bit compatibility with the original output • Changing the flow of the program is required

  9. Changing The Program Flow • Originally: • Each frame is encoded and written immediately • After the change: • Each frame is encoded and written to a buffer • The buffer is filled during the encode process • The buffer is written once all previous frames have been encoded and written

  10. Our Implementation • We use the following threads: • Main thread: transfers frame data to the encode threads • Write thread: writes the encoded buffers to the output file • Encode threads: each encodes the frame it is given • Note: we use N+2 threads, where N is the number of threads available.
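The division of work on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's code: `encode_frame` is a hypothetical stand-in (it uses `zlib`, not the Monkey's Audio encoder), the names `encode_worker`, `write_worker`, and `encode_all` are invented here, and the inter-frame dependency discussed on later slides is ignored. It shows N encode threads, one write thread that buffers out-of-order results and emits them in frame order, and a main thread that hands out frames.

```python
import queue
import threading
import zlib


def encode_frame(frame: bytes) -> bytes:
    """Hypothetical stand-in for the real frame encoder (uses zlib)."""
    return zlib.compress(frame)


def encode_worker(jobs: queue.Queue, done: queue.Queue) -> None:
    """Encode thread: process (index, frame) jobs until a None sentinel."""
    while True:
        job = jobs.get()
        if job is None:
            break
        index, frame = job
        done.put((index, encode_frame(frame)))


def write_worker(done: queue.Queue, total: int, out: list) -> None:
    """Write thread: hold out-of-order buffers, emit them in frame order."""
    pending = {}
    next_index = 0
    while next_index < total:
        index, data = done.get()
        pending[index] = data
        # Flush every buffer whose predecessors have all been written.
        while next_index in pending:
            out.append(pending.pop(next_index))
            next_index += 1


def encode_all(frames: list, n_encoders: int = 4) -> list:
    """Main thread: feed frames to N encode threads plus one write thread."""
    jobs, done, out = queue.Queue(), queue.Queue(), []
    writer = threading.Thread(target=write_worker,
                              args=(done, len(frames), out))
    encoders = [threading.Thread(target=encode_worker, args=(jobs, done))
                for _ in range(n_encoders)]
    for t in [writer, *encoders]:
        t.start()
    for item in enumerate(frames):
        jobs.put(item)
    for _ in encoders:
        jobs.put(None)  # one stop sentinel per encode thread
    for t in [*encoders, writer]:
        t.join()
    return out
```

Counting the caller as the main thread, `encode_all(frames, n_encoders=N)` runs N+2 threads in total, matching the note on the slide. The output list stands in for the output file.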

  11. Data Structures Used • ThreadParam – a linked list of objects that contain the encoded data • EncodeParam – an object containing the data needed to encode a frame • WriteParam – an object containing the data needed to write to the output • FramePredictor – a global array that signals dependencies between frames

  12. Threads Schema

  13. Dependencies Between Frames • Once a frame has finished encoding, there may be leftover data, which is dealt with in one of two ways: • Writing the leftover data after the encoded frame • Re-encoding the leftover data with the next frame • We always write the leftover data after the encoded frame

  14. Dealing With Dependencies Between Frames • Using the write thread to start a new encode thread • Remove the ‘wrongly encoded’ frame from the list • Keep encoding the rest normally • Keep writing to the output file in the right order!

  15. The Problem • There is also leftover data between frames • This dependency is unpredictable • It is impossible to maintain 1:1 bit compatibility • We ‘guess’ the best value so we don’t lose data!

  16. Results: VTune Thread Profiler

  17. Results: VTune Thread Checker

  18. MultiThreading Conclusion • Total speedup from using MT: x3.15!

  19. Explaining The Speedup • Considering Amdahl’s law, we have 2 serial parts (reading the first frames and encoding the last frame) that take about 8% of our benchmark, so we get: • In addition, while implementing our solution we added ~20% more instructions in order to deal with the dependencies, thus we expect:
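The two formulas on this slide were not preserved in the transcript. As a hedged illustration only, Amdahl's law can be written as a small function. The 8% serial share and the ~20% extra instructions come from the slide; the worker count is left as a parameter because the slide does not state it, and treating the extra instructions as a uniform 1.2x slowdown of the whole run is one plausible reading, not necessarily the authors' model. The function names are invented here.

```python
def amdahl_speedup(serial_fraction: float, n_workers: int) -> float:
    """Amdahl's law: speedup when only the parallel part scales."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


def speedup_with_overhead(serial_fraction: float, n_workers: int,
                          extra_work: float) -> float:
    """Assumption: extra instructions slow the whole run by (1 + extra_work)."""
    return amdahl_speedup(serial_fraction, n_workers) / (1.0 + extra_work)
```

For example, `speedup_with_overhead(0.08, n, 0.20)` gives the expected speedup for `n` workers under these assumptions. With an 8% serial share, the plain Amdahl bound approaches 1/0.08 = 12.5 as the worker count grows.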

  20. Optimization Method 2: SIMD • The original code is written using MMX technology • Operations work only on arrays of 16-bit integers • Two main functions we used SSE on: • Adapt() • CalculateDotProduct() • Note: these functions are written entirely in ASM

  21. Adapt() – Improvements • Add and Sub instructions on arrays of 16-bit integers (supported in MMX) • Each iteration goes over 32 sequential array elements • The input and output arrays were aligned to prevent ‘split loads’

  22. Adapt() – Main Loop • Old code: movq mm0, [eax]; paddw mm0, [ecx]; movq [eax], mm0; movq mm1, [eax + 8]; ...; movq mm3, [eax + 24]; paddw mm3, [ecx + 24]; movq [eax + 24], mm3 • New code (aligned): movdqa xmm0, [eax]; movdqa xmm2, [ecx]; paddw xmm0, xmm2; movdqa [eax], xmm0; movdqa xmm1, [eax + 16]; movdqa xmm3, [ecx + 16]; paddw xmm1, xmm3; movdqa [eax + 16], xmm1 • 16 vs. 12 instructions per iteration • An MMX register is 8 bytes; an SSE register is 16 bytes • Note: there is an equivalent loop with SUB operations
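To make the register-width point concrete, here is a scalar Python model of a packed 16-bit add. It is not the project's assembly and not real SIMD: it only mimics what `paddw` computes (lane-wise addition that wraps at 16 bits) and counts how many register-sized steps are needed for a given register width. The helper names `to_int16` and `packed_add` are invented for this sketch.

```python
def to_int16(value: int) -> int:
    """Wrap an integer into a signed 16-bit lane, as paddw does."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def packed_add(dst: list, src: list, register_bytes: int) -> int:
    """Add src into dst one register-sized chunk at a time.

    Returns the number of packed-add steps that were needed.
    """
    lanes = register_bytes // 2  # 16-bit elements per register
    assert len(dst) == len(src) and len(dst) % lanes == 0
    steps = 0
    for start in range(0, len(dst), lanes):
        for i in range(start, start + lanes):
            dst[i] = to_int16(dst[i] + src[i])
        steps += 1
    return steps
```

For the 32 elements (64 bytes) handled per iteration, `packed_add(a, b, 8)` takes 8 steps with 8-byte MMX registers, while `packed_add(a, b, 16)` takes 4 steps with 16-byte SSE registers, and both leave the same values in `a`.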

  23. SIMD – CalculateDotProduct() • Multiply-add over arrays of 16-bit integers • Each iteration goes over 32 array elements • Speedup will be calculated for both functions together

  24. CalculateDotProduct() • Old code: movq mm0, [eax]; pmaddwd mm0, [ecx]; paddd mm7, mm0; movq mm1, [eax + 8]; ...; movq mm3, [eax + 24]; pmaddwd mm3, [ecx + 24]; paddd mm7, mm3 • New code (aligned): movdqa xmm0, [eax]; movdqa xmm4, [ecx]; pmaddwd xmm0, xmm4; paddd xmm7, xmm0; movdqa xmm1, [eax + 16]; movdqa xmm4, [ecx + 16]; pmaddwd xmm1, xmm4; paddd xmm7, xmm1 • Multiply-add: 16 vs. 12 instructions per iteration • Each iteration multiply-adds 32 array elements
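The arithmetic behind the `pmaddwd` / `paddd` pair can be modelled in scalar Python. This is a sketch of the instruction semantics, not the project's code: `pmaddwd` multiplies signed 16-bit lanes pairwise and adds each adjacent pair of products into a 32-bit lane, and `paddd` accumulates those 32-bit lanes with wraparound. The final horizontal sum of the accumulator lanes is not shown on the slide and is an assumption here, as is the helper name `dot_product`.

```python
def to_int32(value: int) -> int:
    """Wrap an integer into a signed 32-bit lane."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def pmaddwd(a: list, b: list) -> list:
    """Multiply 16-bit lanes pairwise, add adjacent products into 32-bit lanes."""
    return [to_int32(a[i] * b[i] + a[i + 1] * b[i + 1])
            for i in range(0, len(a), 2)]


def dot_product(a: list, b: list, lanes: int = 8) -> int:
    """Dot product built from register-sized pmaddwd + paddd steps.

    lanes=8 models a 16-byte SSE register, lanes=4 an 8-byte MMX register.
    """
    assert len(a) == len(b) and len(a) % lanes == 0
    acc = [0] * (lanes // 2)  # accumulator register (mm7 / xmm7 on the slide)
    for start in range(0, len(a), lanes):
        partial = pmaddwd(a[start:start + lanes], b[start:start + lanes])
        acc = [to_int32(x + y) for x, y in zip(acc, partial)]  # paddd
    return to_int32(sum(acc))  # assumed final horizontal sum
```

Both register widths produce the same result; only the number of loop steps changes, which is where the instruction-count saving on the slide comes from.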

  25. SIMD Speedup Achieved • Adapt(): local speedup x1.72, overall speedup x1.2 • CalculateDotProduct(): local speedup x1.62, overall speedup x1.2 • Total speedup using SIMD: x1.4!

  26. Intel Tuning Assistant • No micro-architectural problems were found in the optimized code.

  27. Final Results • A total speedup of x4.017 was achieved by using only MT and SIMD
