Accelerating Applications using FPGAs Satnam Singh, Microsoft Research, Cambridge UK

Presentation Transcript


  1. Accelerating Applications using FPGAs. Satnam Singh, Microsoft Research, Cambridge UK

  2. A Heterogeneous Future

  3. Example Speedup: DNA Sequence Matching

  4. Why are regular computers not fast enough?

  5. FPGAs are the Lego of Hardware

  6. multiple independent multi-ported memories; hard and soft embedded processors; fine-grain parallelism and pipelining

  7. The heart of an FPGA

  8. LUT4 (OR)

  9. LUT4 (AND)

  10. LUTs are higher order functions
      [Diagram: lut1, lut2, lut3 and lut4 blocks with inputs i0 to i3 and output o]
      inv = lut1 not
      and2 = lut2 (&&)
      mux = lut3 (λs d0 d1 . if s then d1 else d0)
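
The idea that a LUT takes a function and yields a circuit element can be sketched in C#. This is a minimal software model, not the notation or tooling used in the talk: the names `Lut`, `Lut1`, `Lut2`, `Lut3` and `LutDemo` are hypothetical, and each factory simply tabulates the supplied Boolean function into a truth table, which is the role the LUT's configuration bits play.

```csharp
using System;

static class Lut
{
    // A k-input LUT stores a 2^k-entry truth table. Each factory evaluates the
    // supplied function once per input combination, then answers from the table.
    public static Func<bool, bool> Lut1(Func<bool, bool> f)
    {
        bool[] table = { f(false), f(true) };
        return i0 => table[i0 ? 1 : 0];
    }

    public static Func<bool, bool, bool> Lut2(Func<bool, bool, bool> f)
    {
        var table = new bool[4];
        for (int n = 0; n < 4; n++)
            table[n] = f((n & 1) != 0, (n & 2) != 0);
        return (i0, i1) => table[(i0 ? 1 : 0) | (i1 ? 2 : 0)];
    }

    public static Func<bool, bool, bool, bool> Lut3(Func<bool, bool, bool, bool> f)
    {
        var table = new bool[8];
        for (int n = 0; n < 8; n++)
            table[n] = f((n & 1) != 0, (n & 2) != 0, (n & 4) != 0);
        return (i0, i1, i2) =>
            table[(i0 ? 1 : 0) | (i1 ? 2 : 0) | (i2 ? 4 : 0)];
    }
}

static class LutDemo
{
    // The three definitions from the slide, expressed with the sketch above.
    public static readonly Func<bool, bool> Inv = Lut.Lut1(x => !x);
    public static readonly Func<bool, bool, bool> And2 = Lut.Lut2((a, b) => a && b);
    public static readonly Func<bool, bool, bool, bool> Mux =
        Lut.Lut3((s, d0, d1) => s ? d1 : d0);

    static void Main()
    {
        Console.WriteLine(Inv(true));               // False
        Console.WriteLine(And2(true, false));       // False
        Console.WriteLine(Mux(true, false, true));  // True (s selects d1)
    }
}
```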

  11. FPGAs as Co-Processors
      XD2000i: FPGA in-socket accelerator for Intel FSB
      XD2000F: FPGA in-socket accelerator for AMD socket F
      XD1000: FPGA co-processor module for socket 940

  12. What kind of problems fit well on FPGA?

  13. scientific computing, data mining, search, image processing, financial analytics; opportunity, challenge

  14. Fibonacci Example 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, ...

  15. entity fib is
        port (signal clk, rst : in bit;
              signal fibnr : out natural);
      end entity fib;

      architecture behavioural of fib is
        signal lastFib, currentFib : natural;
      begin
        compute_fibs : process
        begin
          wait until clk'event and clk = '1';
          if rst = '1' then
            lastFib <= 0;
            currentFib <= 1;
          else
            currentFib <= lastFib + currentFib;
            lastFib <= currentFib;
          end if;
        end process compute_fibs;
        fibnr <= currentFib;
      end architecture behavioural;
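
A software model can make the clocked behaviour of this circuit easier to follow. The sketch below is an illustration only: `FibModel` and `Simulate` are hypothetical names, `uint` stands in for VHDL's `natural`, and each loop iteration represents one rising clock edge after reset. Both signal assignments in the VHDL read the values from before the edge, so the model computes both next-state values before updating either register.

```csharp
using System;
using System.Collections.Generic;

static class FibModel
{
    // Returns the value seen on fibnr during each of the first `cycles`
    // clock cycles after reset has been released.
    public static List<uint> Simulate(int cycles)
    {
        uint lastFib = 0, currentFib = 1;   // register contents after rst = '1'
        var outputs = new List<uint>();
        for (int c = 0; c < cycles; c++)
        {
            outputs.Add(currentFib);        // fibnr <= currentFib

            // Compute both next-state values from the OLD register contents,
            // mirroring the semantics of VHDL signal assignment.
            uint nextCurrent = lastFib + currentFib;
            uint nextLast = currentFib;
            currentFib = nextCurrent;
            lastFib = nextLast;
        }
        return outputs;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", Simulate(10)));
        // prints: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55
    }
}
```

Updating `lastFib` before computing `nextCurrent` would give a different sequence, which is the main difference between sequential variable assignment and the register-transfer behaviour of the hardware.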

  16. demonstration...

  17. [Diagram labels: data parallel descriptions; FPGA hardware (VHDL); GPU code (Accelerator); C++ SMP]

  18. The Accidental Semi-colon ;

  19. Kiwi
      [Diagram labels: gate-level VHDL/Verilog; Kiwi C-to-gates; structural; parallel; imperative; imperative (C); thread 1; thread 2; thread 3; jpeg.c]

  20. [Diagram labels: Kiwi Library; circuit model; Kiwi.cs; JPEG.cs; Visual Studio; Kiwi Synthesis; multi-thread simulation; debugging; verification; circuit implementation; JPEG.v]

  21. [Diagram labels: C# parallel program; Thread 1, Thread 2, Thread 3; C to gates (one per thread); circuit (one per thread); Verilog for system]

  22. Our Implementation
      • Use regular Visual Studio technology to generate a .NET IL assembly language file.
      • Our system then processes this file to produce a circuit:
          • The .NET stack is analyzed and removed.
          • The control structure of the code is analyzed and broken into basic blocks which are then composed.
          • The concurrency constructs used in the program are used to control the concurrency / clocking of the generated circuit.

  23. System Composition
      • We need a way to separately develop components and then compose them together.
      • Don’t invent new language constructs: reuse existing concurrency machinery.
      • Adopt single-place channels for the composition of components.
      • Model channels with regular concurrency constructs (monitors).

  24. Writing to a Channel
      public class Channel<T>
      {
          T datum;
          bool empty = true;

          public void Write(T v)
          {
              lock (this)
              {
                  while (!empty)
                      Monitor.Wait(this);
                  datum = v;
                  empty = false;
                  Monitor.PulseAll(this);
              }
          }

  25. Reading from a Channel
          public T Read()
          {
              T r;
              lock (this)
              {
                  while (empty)
                      Monitor.Wait(this);
                  empty = true;
                  r = datum;
                  Monitor.PulseAll(this);
              }
              return r;
          }
      } // end of class Channel<T>
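
To see the one-place channel from the two slides above working in ordinary .NET, outside any hardware flow, the two halves can be joined into one class and exercised by a producer and a consumer thread. This is a sketch under a few assumptions: `ChannelDemo` and `Run` are hypothetical names, and the class locks a private `gate` object where the slides lock `this`, which is a common C# convention and does not change the protocol.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public class Channel<T>
{
    private readonly object gate = new object();  // assumption: private lock object
    private T datum;
    private bool empty = true;

    public void Write(T v)
    {
        lock (gate)
        {
            while (!empty) Monitor.Wait(gate);    // block until the slot is free
            datum = v;
            empty = false;
            Monitor.PulseAll(gate);               // wake any waiting reader
        }
    }

    public T Read()
    {
        lock (gate)
        {
            while (empty) Monitor.Wait(gate);     // block until a value arrives
            empty = true;
            T r = datum;
            Monitor.PulseAll(gate);               // wake any waiting writer
            return r;
        }
    }
}

public static class ChannelDemo
{
    // One producer writes 0..count-1; one consumer collects what it reads.
    public static List<int> Run(int count)
    {
        var chan = new Channel<int>();
        var received = new List<int>();
        var producer = new Thread(() =>
        {
            for (int i = 0; i < count; i++) chan.Write(i);
        });
        var consumer = new Thread(() =>
        {
            for (int i = 0; i < count; i++) received.Add(chan.Read());
        });
        producer.Start();
        consumer.Start();
        producer.Join();
        consumer.Join();
        return received;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", Run(5)));  // prints: 0 1 2 3 4
    }
}
```

Because the channel holds a single value, each `Write` blocks until the previous value has been read, so with one producer and one consumer the values arrive in order.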

  26. [Diagram layers, top to bottom: user applications; domain specific languages; rendezvous, join patterns, transactional memory, data parallelism; systems level concurrency constructs: threads, events, monitors, condition variables]

  27. class FIFO2
      {
          [Kiwi.OutputWordPort("result", 31, 0)]
          public static int result;
          static Kiwi.Channel<int> chan1 = new Kiwi.Channel<int>();
          static Kiwi.Channel<int> chan2 = new Kiwi.Channel<int>();

  28.     public static void Consumer()
          {
              while (true)
              {
                  int i = chan1.Read();
                  chan2.Write(2 * i);
                  Kiwi.Pause();
              }
          }

          public static void Producer()
          {
              for (int i = 0; i < 10; i++)
              {
                  chan1.Write(i);
                  Kiwi.Pause();
              }
          }

  29.     public static void Behaviour()
          {
              Thread ProducerThread = new Thread(new ThreadStart(Producer));
              ProducerThread.Start();
              Thread ConsumerThread = new Thread(new ThreadStart(Consumer));
              ConsumerThread.Start();

  30. Filter Example
      [Diagram legend: thread; one-place channel]

  31. public static int[] SequentialFIRFunction(int[] weights, int[] input)
      {
          int[] window = new int[size];
          int[] result = new int[input.Length];
          // Clear the window of x values to all zero.
          for (int w = 0; w < size; w++)
              window[w] = 0;
          // For each sample...
          for (int i = 0; i < input.Length; i++)
          {
              // Shift in the new x value
              for (int j = size - 1; j > 0; j--)
                  window[j] = window[j - 1];
              window[0] = input[i];
              // Compute the result value
              int sum = 0;
              for (int z = 0; z < size; z++)
                  sum += weights[z] * window[z];
              result[i] = sum;
          }
          return result;
      }
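
A worked run helps check what the sequential filter computes. The sketch below repeats the slide's algorithm in self-contained form; `FirDemo` and `SequentialFir` are hypothetical names, and `size` is assumed to equal `weights.Length`, since the slide does not show where `size` is defined. Feeding in an impulse (a single 1 followed by zeros) should reproduce the weights at the output.

```csharp
using System;

static class FirDemo
{
    public static int[] SequentialFir(int[] weights, int[] input)
    {
        int size = weights.Length;            // assumption: one tap per weight
        int[] window = new int[size];         // C# zero-initialises new arrays
        int[] result = new int[input.Length];

        for (int i = 0; i < input.Length; i++)
        {
            // Shift the window and bring in the newest sample at index 0.
            for (int j = size - 1; j > 0; j--)
                window[j] = window[j - 1];
            window[0] = input[i];

            // Weighted sum of the current window.
            int sum = 0;
            for (int z = 0; z < size; z++)
                sum += weights[z] * window[z];
            result[i] = sum;
        }
        return result;
    }

    static void Main()
    {
        int[] weights = { 1, 2, 3 };
        int[] input = { 1, 0, 0, 0, 2 };
        Console.WriteLine(string.Join(" ", SequentialFir(weights, input)));
        // prints: 1 2 3 0 2
    }
}
```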

  32. Transposed Filter

  33. static void Tap(int i, byte w,
                      Kiwi.Channel<byte> xIn,
                      Kiwi.Channel<int> yIn,
                      Kiwi.Channel<int> yout)
      {
          byte x;
          int y;
          while (true)
          {
              y = yIn.Read();
              x = xIn.Read();
              yout.Write(x * w + y);
          }
      }
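
Each tap computes `x * w + y`, where `y` is the partial sum handed on by a neighbouring tap. A sequential model of the transposed form, with plain integers in place of channels, shows how those partial sums add up to the filter output. This is an illustrative sketch, not the Kiwi code: `TransposedFirDemo`, `TransposedFir` and the `reg` array are hypothetical, and at least one weight is assumed.

```csharp
using System;

static class TransposedFirDemo
{
    // reg[k] holds the partial sum waiting to be added at tap k-1.
    // reg[n] is never written and stays 0, feeding the last tap.
    public static int[] TransposedFir(int[] weights, int[] input)
    {
        int n = weights.Length;               // assumption: n >= 1
        int[] reg = new int[n + 1];
        int[] result = new int[input.Length];

        for (int t = 0; t < input.Length; t++)
        {
            int x = input[t];                 // every tap sees the same sample
            result[t] = weights[0] * x + reg[1];

            // Ascending order means reg[k + 1] still holds its old value
            // when reg[k] is updated, like registers clocked together.
            for (int k = 1; k < n; k++)
                reg[k] = weights[k] * x + reg[k + 1];
        }
        return result;
    }

    static void Main()
    {
        int[] weights = { 1, 2, 3 };
        int[] input = { 1, 0, 0, 0, 2 };
        Console.WriteLine(string.Join(" ", TransposedFir(weights, input)));
        // prints: 1 2 3 0 2
    }
}
```

For these inputs the result matches a direct-form FIR with the same weights; the transposed form reaches it with one multiply-add per tap and no shifting window.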

  34. Inter-thread Communication and Synchronization
      // Create the channels to link together the taps
      for (int c = 0; c < size; c++)
      {
          Xchannels[c] = new Kiwi.Channel<byte>();
          Ychannels[c] = new Kiwi.Channel<int>();
          Ychannels[c].Write(0); // Pre-populate y-channel registers with zeros
      }

  35. // Connect up the taps for a transposed filter
      for (int i = 0; i < size; i++)
      {
          int j = i; // Quiz: why do we need the local j?
          Thread tapThread = new Thread(delegate()
          {
              Tap(j, weights[j], Xchannels[j], Ychannels[j], Ychannels[j + 1]);
          });
          tapThread.Start();
      }
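
One likely answer to the quiz concerns closure capture. In C#, the variable of a `for` loop is a single variable shared by all iterations, so a delegate that captures `i` sees whatever value `i` has when the delegate eventually runs, while a local declared inside the loop body is fresh on every iteration. The sketch below shows the difference without threads so that the result is deterministic; `CaptureDemo` and its methods are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

static class CaptureDemo
{
    // Every lambda captures the same variable i, which is `size` once the loop ends.
    public static List<int> CaptureLoopVariable(int size)
    {
        var actions = new List<Func<int>>();
        for (int i = 0; i < size; i++)
            actions.Add(() => i);

        var seen = new List<int>();
        foreach (var a in actions) seen.Add(a());
        return seen;
    }

    // Each iteration declares its own j, so each lambda keeps its own value.
    public static List<int> CaptureLocalCopy(int size)
    {
        var actions = new List<Func<int>>();
        for (int i = 0; i < size; i++)
        {
            int j = i;
            actions.Add(() => j);
        }

        var seen = new List<int>();
        foreach (var a in actions) seen.Add(a());
        return seen;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(" ", CaptureLoopVariable(3)));  // prints: 3 3 3
        Console.WriteLine(string.Join(" ", CaptureLocalCopy(3)));     // prints: 0 1 2
    }
}
```

With threads the shared-variable version is worse than wrong: which value each tap sees depends on when its thread starts, so taps could be wired to the wrong channels or index past the end of the arrays.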

  36. using System;
      using System.Collections.Generic;
      using System.Text;
      using Microsoft.Research.DataParallelArrays;
      using PA = Microsoft.Research.DataParallelArrays.ParallelArrays;
      using IPA = Microsoft.Research.DataParallelArrays.IntParallelArray;

      namespace ForOxford
      {
          class Program
          {
              static void Main(string[] args)
              {
                  PA.InitGPU();
                  IPA is1 = new IPA(4, new int[] { 1, 2, 3, 4 });
                  IPA is2 = new IPA(4, new int[] { 5, 6, 7, 8 });
                  IPA is3 = new IPA(4, is1.Shape);
                  is3 = PA.Add(is1, is2);
                  IPA result = PA.Evaluate(is3);
                  int[] ra1;
                  PA.ToArray(result, out ra1);
                  foreach (int i in ra1)
                      Console.Write(i + " ");
                  Console.WriteLine("");
              }
          }
      }

  37. Example: Bitmap Blur (Using Accelerator v1.1.1)
      using PA = Microsoft.Research.DataParallelArrays.ParallelArrays;
      using FPA = Microsoft.Research.DataParallelArrays.FloatParallelArray;

      float[,] Blur(float[] kernel)
      {
          FPA pa = new FPA(bitmap);
          // Convolve in X direction.
          FPA resultX = new FPA(0, pa.Shape);
          for (int i = 0; i < kernel.Length; i++)
          {
              resultX += PA.Shift(pa, 0, i) * kernel[i];
          }
          // Convolve in Y direction.
          FPA resultY = new FPA(0, pa.Shape);
          for (int i = 0; i < kernel.Length; i++)
          {
              resultY += PA.Shift(resultX, i, 0) * kernel[i];
          }
          float[,] result;
          PA.ToArray(resultY, out result);
          return result;
      }

  38. Expression Graphs
      FPA pa = new FPA(bitmap);
      // Convolve in X direction
      FPA rX = new FPA(0, pa.Shape);
      for (int i = 0; i < kernel.Length; i++)
      {
          rX += PA.Shift(pa, 0, i) * kernel[i];
      }
      [Expression graph: rX = (Shift(pa, 0, 0) * k[0]) + (Shift(pa, 0, 1) * k[1]) + …]
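
The notion of an expression graph can be illustrated with a small stand-alone model: the overloaded operators build nodes without computing anything, and the work happens only when the result is converted to an array. This is a sketch of the general technique, not Accelerator's implementation. All type names here (`Expr`, `Data`, `Shift`, `Add`, `Scale`) are hypothetical, the arrays are one-dimensional for brevity, and the shift direction and edge clamping are assumptions made for the example.

```csharp
using System;

abstract class Expr
{
    public abstract int Length { get; }
    public abstract float At(int index);      // evaluate one element on demand

    // Operators only build graph nodes; nothing is computed here.
    public static Expr operator +(Expr a, Expr b) => new Add(a, b);
    public static Expr operator *(Expr a, float k) => new Scale(a, k);

    // Walking the graph happens only when the array is requested.
    public float[] ToArray()
    {
        var result = new float[Length];
        for (int i = 0; i < Length; i++) result[i] = At(i);
        return result;
    }
}

sealed class Data : Expr
{
    private readonly float[] values;
    public Data(float[] values) { this.values = values; }
    public override int Length => values.Length;
    public override float At(int index) => values[index];
}

sealed class Shift : Expr
{
    private readonly Expr source;
    private readonly int offset;
    public Shift(Expr source, int offset) { this.source = source; this.offset = offset; }
    public override int Length => source.Length;
    // Assumption: element i reads source element i + offset, clamped at the edges.
    public override float At(int index) =>
        source.At(Math.Min(Math.Max(index + offset, 0), source.Length - 1));
}

sealed class Add : Expr
{
    private readonly Expr left, right;
    public Add(Expr left, Expr right) { this.left = left; this.right = right; }
    public override int Length => left.Length;
    public override float At(int index) => left.At(index) + right.At(index);
}

sealed class Scale : Expr
{
    private readonly Expr source;
    private readonly float factor;
    public Scale(Expr source, float factor) { this.source = source; this.factor = factor; }
    public override int Length => source.Length;
    public override float At(int index) => source.At(index) * factor;
}

static class ExprDemo
{
    // Same loop shape as the slide: accumulate shifted, scaled copies of pa.
    public static float[] Convolve(float[] samples, float[] kernel)
    {
        Expr pa = new Data(samples);
        Expr rX = new Data(new float[samples.Length]);   // all zeros
        for (int i = 0; i < kernel.Length; i++)
            rX += new Shift(pa, i) * kernel[i];          // grows the graph
        return rX.ToArray();                             // evaluates the graph
    }

    static void Main()
    {
        float[] r = Convolve(new float[] { 1, 2, 3, 4 }, new float[] { 1, 2 });
        Console.WriteLine(string.Join(" ", r));          // prints: 5 8 11 12
    }
}
```

Deferring evaluation in this way gives a library the whole computation to inspect before it runs anything, which is what makes it possible to translate the graph for a different target.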

  39. class Program
      {
          static void Main(string[] args)
          {
              PA.InitGPU();
              IPA ipa1 = new IPA(5, new int[] { 1, 2, 3, 4, 5 });
              IPA ipa2 = new IPA(5, new int[] { 10, 20, 30, 40, 50 });
              IPA ipa3 = new IPA(5, new int[] { 21, 5, 7, 4, 8 });
              IPA ipa4 = new IPA(5, new int[] { 4, 1, 7, 2, 5 });
              IPA ipa5 = new IPA(5, ipa1.Shape);
              ipa5 = PA.Add(ipa1, ipa2);
              IPA result = PA.Multiply(ipa4,
                               PA.Subtract(ipa3, PA.Add(ipa1, ipa2)));
              int[] ra1;
              PA.ToArray(result, out ra1);
              foreach (int i in ra1)
                  Console.Write(i + " ");
              Console.WriteLine("");
          }
      }
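
The element-wise expression on this slide, `ipa4 * (ipa3 - (ipa1 + ipa2))`, can be checked by hand with ordinary arrays. The sketch below is a plain C# reference calculation that does not use the Accelerator library; `ReferenceDemo` and `Compute` are hypothetical names.

```csharp
using System;

static class ReferenceDemo
{
    // Computes d[i] * (c[i] - (a[i] + b[i])) for every element.
    public static int[] Compute(int[] a, int[] b, int[] c, int[] d)
    {
        var result = new int[a.Length];
        for (int i = 0; i < a.Length; i++)
            result[i] = d[i] * (c[i] - (a[i] + b[i]));
        return result;
    }

    static void Main()
    {
        int[] a = { 1, 2, 3, 4, 5 };        // ipa1
        int[] b = { 10, 20, 30, 40, 50 };   // ipa2
        int[] c = { 21, 5, 7, 4, 8 };       // ipa3
        int[] d = { 4, 1, 7, 2, 5 };        // ipa4
        Console.WriteLine(string.Join(" ", Compute(a, b, c, d)));
        // prints: 40 -17 -182 -80 -235
    }
}
```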
