
Heiko Schröder, 2003




Presentation Transcript


  1. Routing? Sorting? Image Processing? Sparse Matrices? Reconfigurable Meshes! Heiko Schröder, 2003

  2. Reconfigurable architectures • FPGAs • reconfigurable multibus • reconfigurable networks (Transputers, PVM) • dynamically reconfigurable mesh Aim: efficiency special purpose --> general purpose architectures

  3. contents 1.) Motivation for the reconfigurable mesh 2.) Routing (and sorting): • better than PRAM • better than mesh 3.) Image processing 4.) Sparse matrix multiplication 5.) Bounded bus length

  4. PRAM (figure: processors and memory cells 0–9, a cut through the sample permutation 8 2 0 6 3 7 9 4 5 1): diameter O(1), bisection width Θ(n). Variants: EREW, CRCW.

  5. Mesh/Torus (2D mesh): diameter Θ(√n), bisection width Θ(√n).

  6. Hypercube (0-D to 4-D, nodes labelled by bit strings 0, 1, 00, ..., 111): diameter O(log n), bisection width Θ(n).

  7. Low-cost reconfigurable mesh: reconfigurable mesh = mesh + interior connections (each processor can connect its four ports in any of 15 positions). Diameter 1!!

  8. Global OR of a bit vector (e.g. 0 1 0 0 1 0 0): time O(1) on RM -- Ω(log n) on EREW-PRAM.

  9. Prefix sum of the bit vector 0 1 1 0 1 0 0 1 1 1: time O(1), area Θ(n × n). Fast but expensive.
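Slide 9 computes running sums of a 0/1 vector; the RM does this in O(1) time at the cost of Θ(n×n) area. A minimal sequential sketch of the values each processor ends up holding (the function name is mine, not from the slides):

```python
def prefix_sums(bits):
    """Prefix sums of a 0/1 vector: the result the O(1) reconfigurable-mesh
    computation delivers, produced here by a plain sequential scan."""
    total = 0
    out = []
    for b in bits:
        total += b
        out.append(total)
    return out
```

On the slide's example vector this yields 0 1 2 2 3 3 3 4 5 6.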

  10. Modulo-3 counter over the bits 0 1 1 0 1 1 1: time O(1) on RM -- Ω(log n / log log n) on CRCW-PRAM.

  11. Modulo-k² counter (ranking): 2-digit numbers to the base k represent all numbers smaller than k². 1.) determine x mod k (= least significant digit); 2.) count the number of "wraps" (= most significant digit). --> modulo-k² counting in 2 steps on a k × k² array.
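The two steps of slide 11 are exactly a base-k digit decomposition of the count. A sequential stand-in (the RM obtains the digits by bus switching, not by summing; the function name is mine):

```python
def count_mod_k2(bits, k):
    """Count the ones in `bits` modulo k**2 via two base-k digits,
    mirroring slide 11's two-step scheme."""
    ones = sum(bits)          # on the RM this is never computed directly
    lsd = ones % k            # step 1: x mod k = least significant digit
    msd = (ones // k) % k     # step 2: number of mod-k "wraps" = most significant digit
    return msd * k + lsd      # equals ones mod k**2
```

The return value reassembles the two digits, so it always agrees with the count taken modulo k².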

  12. Enumeration / prefix sum by repeatedly merging blocks (1 2 | 1 2 --> 1 2 3 4 --> 1 2 3 4 5 6 7 8): time O(log n). Wire efficiency! -- (compared with tree) 1/2 the number of processors.

  13. n × n permutation routing -- 2 steps!!!

  14. Kunde's all-to-all mapping. Sorting: sort blocks, all-to-all (columns), sort blocks, all-to-all (rows), o-e-sort blocks.
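The all-to-all steps of slide 14 redistribute data so that every block sends an equal share to every block in its column (or row). A sketch of one such balanced redistribution for a single column of blocks, using a simple round-robin deal rather than Kunde's exact indexing:

```python
def vertical_all_to_all(blocks, k):
    """One column of k blocks, each holding m keys with k dividing m.
    Each block deals its keys round-robin over the column, so every
    block ends up with exactly m/k keys from every block (a sketch of
    the balanced all-to-all mapping, not Kunde's precise scheme)."""
    out = [[] for _ in range(k)]
    for block in blocks:
        for r, x in enumerate(block):
            out[r % k].append(x)   # key r of each block goes to block r mod k
    return out
```

The point of the mapping is exactly this balance: after it, each block holds a representative sample of the whole column.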

  15. Sort blocks in constant time: broadcast (1), rank (2), broadcast (1). Complete sort: sort blocks, all-to-all (2), sort blocks, all-to-all (2), o-e-sort blocks.

  16. better than PRAM --- but useless!!!

  17. Kunde’s all-to-all mapping n x n

  18. vertical all-to-all

  19. horizontal all-to-all

  20. Use of the bus -- no conflict: the exchange is scheduled in rounds of 1, 2, 3, 3, 2, 1 steps (k/2 steps each way); in total (k/2)² steps.

  21. All-to-all sorting in optimal time (Kunde / Schröder): (k/2)² steps per all-to-all; with k = n^(1/3), each step takes n^(1/3) time --> T = n/4 (×2 ×2 /2: T = n/2). Sorting: sort blocks (O(n^(2/3))), all-to-all (n/2), sort blocks (O(n^(2/3))), all-to-all (n/2), o-e-sort blocks (O(n^(2/3))) (snake-like order of blocks). Time: n + o(n).

  22. Sorter for n keys: a bisection of the data crossed by only k wires forces sorting time n/k. Why optimal?
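The n/k bound of slide 22 is a counting argument that can be written out explicitly: in the worst case half of the n keys must end up on the other side of the bisection, only k wires cross it, and each wire carries one key per step:

```latex
T \;\ge\; \frac{\#\{\text{keys that must cross the bisection}\}}{\#\{\text{wires crossing it}\}}
  \;=\; \frac{n/2}{k} \;=\; \Omega\!\left(\frac{n}{k}\right)
```

With n² keys on an n×n RM the same argument gives (n²/2)/n = Ω(n), which is the second case used on slide 23.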

  23. Use of the theorem: 1.) n keys on a k×k RM: time n/k. Proof: wherever the data is stored, there is always a bisection of length k -- this can be demonstrated by sweeping left to right through the array. Q.e.d. 2.) n×n keys on an n×n RM: time n. Proof: trivial.

  24. n + o(n) Optimal --- but ...

  25. Enumeration / prefix sum (recap of slide 12): repeatedly merging blocks (1 2 | 1 2 --> 1 2 3 4 --> 1 2 3 4 5 6 7 8): time O(log n). Wire efficiency! -- (compared with tree) 1/2 the number of processors.

  26. ABCD-routing • move and smooth. Row-major enumeration of the A, B, C and D packets within each quadrant in time 4 log n. Determine the destination position of each packet.

  27. Elementary steps: move, collect, smooth (figure: packets 1–10 rearranged within a row by each step).

  28. Time analysis (quadrants A B C D): collect, move, smooth -- time 3 × n/2; T = 3n + o(n).

  29. 4 destination squares: time 3n + 4 log n. 16 destination squares: time 2n + 16 log n. 64 destination squares: time 12/7 n + 64 log n. T < 2n (mesh diameter: 2n).

  30. Enough of routing/sorting -- a constant factor! Can we do better? What kind of problems? Image processing. Sparse problems!

  31. Image processing • Border following • Edge detection • Component labeling • Skeletons • Transforms

  32. Component labelling: define the border (candidates) of each object and set up the bus. While own label is not received: 1.) candidates break the bus and send their label a) clockwise b) anti-clockwise; 2.) candidates switch off and restore the bus if they see a smaller label. Time: O(1) -- O(log n).
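Slide 32's labelling rule is "smallest label wins" along the component's border bus. A naive sequential stand-in that converges to the same labels by repeated neighbour sweeps (names and the 4-connectivity choice are mine; the RM version achieves O(1)–O(log n) instead of this loop):

```python
def label_components(img):
    """Connected-component labelling on a binary image, smallest-label-wins:
    every object pixel starts with its own index as label and repeatedly
    adopts the smallest label among its 4-neighbours until nothing changes."""
    h, w = len(img), len(img[0])
    lab = {(i, j): i * w + j for i in range(h) for j in range(w) if img[i][j]}
    changed = True
    while changed:
        changed = False
        for (i, j) in lab:
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nb = (i + di, j + dj)
                if nb in lab and lab[nb] < lab[(i, j)]:
                    lab[(i, j)] = lab[nb]   # adopt the smaller label
                    changed = True
    return lab
```

Each component ends up labelled by the smallest initial index it contains, which is exactly the invariant the bus-based algorithm maintains.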

  33. Transforms • Wavelet transform: time log n on RM -- time n on mesh • FFT: time n on RM and mesh • Hough transform: time m × log n on RM -- time m × n on mesh

  34. Systolic matrix multiplication C = A B (A fed in along the rows, B along the columns, indices i, j): time n.
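The systolic schedule of slide 34 can be simulated step by step: with skewed inputs, at time t the processing element (i, j) sees a[i][k] and b[k][j] for k = t - i - j and accumulates into its stationary c[i][j]. A small simulation of that schedule (the function name is mine):

```python
def systolic_matmul(A, B):
    """Simulate an n x n systolic array: A streams right, B streams down
    with skewed timing, C stays in place; total schedule length ~ 3n."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for t in range(3 * n - 2):           # the last PE finishes at t = 3n - 3
        for i in range(n):
            for j in range(n):
                k = t - i - j            # the operand pair arriving at (i, j) now
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C
```

The result equals the ordinary matrix product; only the order of the multiply-accumulates follows the wavefront timing.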

  35. Sparse matrix multiplication C = A × B -- time n on an n×n mesh. Cases: A and B column-sparse (k²); A and B row-sparse (k²); A row-sparse, B column-sparse (k²); A column-sparse, B row-sparse.

  36. Unlimited bus length • ring broadcast (phases 1, 2, 3).

  37. A row-sparse, B column-sparse: Repeat r times Begin horizontal ring broadcast a_ik; Repeat c times vertical ring broadcast b_kj End. (A has at most r nonzeros per row, B at most c per column.)
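The r × c rounds of ring broadcasts on slide 37 bring every a_ik together with every b_kj that shares the index k. A sequential rendering of exactly those meetings (the dict-of-dicts representation and names are mine):

```python
def sparse_matmul(A_rows, B_cols):
    """A given as {row i: {k: a_ik}} (row-sparse, at most r nonzeros per row),
    B given as {col j: {k: b_kj}} (column-sparse, at most c per column).
    Forms the same products the r*c ring-broadcast rounds generate:
    each a_ik meets each b_kj with matching k exactly once."""
    C = {}
    for i, row in A_rows.items():
        for j, col in B_cols.items():
            s = sum(a * col[k] for k, a in row.items() if k in col)
            if s:
                C[(i, j)] = s        # only nonzero results are kept
    return C
```

Per output entry the work is bounded by the row/column sparsity, which is what makes the broadcast schedule cheap.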

  38. A and B column-sparse: Repeat c1 times Begin horizontal ring broadcast a_ik (meets b_kj); Repeat c2 times vertical ring broadcast the product to its final position End.

  39. Lower bound (c, r); here c = r = k, k = 3, n = 48.

  40. Splitting the problem: A = A^s + A^r, C = A^s B + A^r B, C = C^s + C^r. B-elements first. Repeat k times Begin vertical ring broadcast; Repeat s times horizontal ring broadcast End. Time: ks.

  41. A = A^s + A^r: A has nk non-zero elements --> A^r has at most nk/s non-zero rows --> for s = √n, A^r has at most k√n non-zero rows. A^s B is an RR-problem --> it takes time k√n.
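The split on slide 41 separates rows with at most s nonzeros (A^s) from the remaining "heavy" rows (A^r); the counting argument is that nk nonzeros admit at most nk/s heavy rows. A sketch of the split with that bound checked (names are mine):

```python
def split_rows(A_rows, s):
    """Split A = A^s + A^r: rows with more than s nonzeros form A^r,
    the rest form A^s. If A has nk nonzeros in total, A^r can contain
    at most nk/s rows; with s = sqrt(n) that is at most k*sqrt(n)."""
    A_s = {i: r for i, r in A_rows.items() if len(r) <= s}
    A_r = {i: r for i, r in A_rows.items() if len(r) > s}
    return A_s, A_r
```

A^s is row-sparse by construction, so the slide's RR-algorithm applies to it directly; A^r is small enough to handle separately.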

  42. Calculating the products A^r B: k A-elements meet k B-elements; time: k².

  43. Rout time: only the nonzero elements per column need to move; column sum for row i (combining rows i-1, i, i+1): time log n.

  44. Rout time: routing within columns.

  45. Reconfigurable architectures: does the reconfigurable mesh really give constant diameter? No!!! Physical laws!

  46. Physical limits: c = 300 000 km/sec, i.e. 30 cm/ns; on chip: 1 cm/ns --> bounded bus length is a good idea!

  47. Bounded broadcast (segments of bus length l relayed in phases 1, 2, 3): time k + n/l.

  48. Creating main stations 1, 2, 3: time k.

  49. A row-sparse, B column-sparse: Create main stations 1, ..., k for A and B (time: n/l + k). For i = 1, ..., k do Begin horizontal ring broadcast i of A; For j = 1, ..., k do vertical ring broadcast j of B End.

  50. A and B column-sparse: Create main stations 1, ..., k for A (time: n/l + k). For i = 1, ..., k do Begin horizontal ring broadcast i; k bounded vertical broadcasts of products; merging new products End.
