
FPGA Logic Cluster Design

FPGA Logic Cluster Design. Dr. Philip Brisk, Department of Computer Science and Engineering, University of California, Riverside, CS 223. How Much Logic Should Go in an FPGA Logic Block? Vaughn Betz, Jonathan Rose, IEEE Design & Test of Computers 15(1): 10-15 (1998).


Presentation Transcript


  1. FPGA Logic Cluster Design Dr. Philip Brisk Department of Computer Science and Engineering University of California, Riverside CS 223

  2. How Much Logic Should Go in an FPGA Logic Block? Vaughn Betz, Jonathan Rose IEEE Design & Test of Computers 15(1): 10-15 (1998)

  3. Three Questions • How many inputs should the FPGA routing provide to a cluster of LUTs? (I) • Routing flexibility vs. area • As the number of LUTs in a logic cluster changes, how should the FPGA’s routing architecture change? (Fc) • How many LUTs should be included in a cluster? (N)

  4. Experimental Methodology • 20 MCNC Benchmarks • Well-established • A bit old, even by 1998 standards • Sadly, still in use • 4-LUT Architecture • Fs = 3 • Vary other parameters to see what works best

  5. Area Model • Count the number of min-width transistors required to implement a benchmark circuit in an FPGA architecture • Normalized area = (number of min-width transistors used) / (number of BLEs used)
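
The normalized-area metric on this slide is a single division. A minimal sketch, assuming the two counts are already available from the CAD flow; the function name and the example counts are hypothetical and chosen only to show the arithmetic:

```python
def normalized_area(min_width_transistors: int, bles_used: int) -> float:
    """Area per BLE, measured in minimum-width transistor areas."""
    if bles_used <= 0:
        raise ValueError("bles_used must be positive")
    return min_width_transistors / bles_used


# Hypothetical counts, not taken from the paper.
print(normalized_area(1_200_000, 800))  # 1500.0
```

Dividing by the number of BLEs lets architectures with different cluster sizes be compared on the same benchmark circuit.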

  6. How many cluster inputs do we need? • Input sharing and output re-use within a logic cluster • Utilization reaches nearly 100% when I is 50-60% of the total number of BLE inputs • BLEs can be packed together to share common inputs • Locally generated outputs can be re-used • This works because the packing algorithm was effective!

  7. Visual Depiction • Fanout • Use the feedbacks! • I ≈ 0.6KN is pretty good

  8. The Packer was Effective! • It packed BLEs together to share common inputs • It re-used locally generated outputs via the feedbacks
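
Input sharing and feedback re-use can be made concrete by counting the distinct nets a cluster must receive from the routing. This is an illustrative sketch, not the packer from the paper: the data layout (each BLE as a pair of input-net set and output net) and the net names are assumptions.

```python
def external_inputs(cluster):
    """Return the nets a cluster needs from outside.

    cluster: list of (input_nets, output_net) tuples, one per BLE.
    Nets produced inside the cluster arrive via local feedback,
    and a net shared by several BLEs is only counted once.
    """
    local_outputs = {output for _, output in cluster}
    needed = set()
    for inputs, _ in cluster:
        needed.update(inputs)
    return needed - local_outputs


# Three 4-input BLEs (hypothetical netlist): 12 BLE input pins in total.
cluster = [
    ({"a", "b", "c", "d"}, "x"),
    ({"a", "b", "x", "e"}, "y"),
    ({"x", "y", "c", "f"}, "z"),
]
print(len(external_inputs(cluster)))  # 6
```

Here 12 BLE input pins are served by 6 cluster inputs, which is the kind of reduction that lets I be well below K·N.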

  9. Cluster inputs vs. Cluster size • Approx. I = 2N + 2 • N = 1: the BLE uses 3.5 of 4 inputs (on average) • N = 16: the BLEs use 19.7 of 64 inputs (on average)

  10. Commercial FPGAs • Altera Flex 8000 FPGA uses a cluster of size N=8 with I=24 • Results suggest reducing I to 18 (saves area) • Xilinx 5200 FPGA uses a cluster of size N=4 with I=16 • Results suggest reducing I to 10 (saves area)
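
The suggested values of 18 and 10 follow from applying the approximate I = 2N + 2 rule to each device's cluster size. A short sketch of that arithmetic; the function and dictionary names are illustrative:

```python
def suggested_inputs(n: int) -> int:
    """Approximate cluster input count for 4-LUT clusters: I = 2N + 2."""
    return 2 * n + 2


# (cluster size N, inputs I actually provided), as listed on the slide.
devices = {"Altera Flex 8000": (8, 24), "Xilinx 5200": (4, 16)}

for name, (n, actual_i) in devices.items():
    i = suggested_inputs(n)
    print(f"{name}: N={n}, I={actual_i} -> {i} (saves {actual_i - i} inputs)")
```

Both devices end up with 6 fewer cluster inputs under this rule.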

  11. Routing Flexibility vs. Cluster Size • Set Fc = W/N • Each routing track is driven by one LUT output pin in the cluster
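
With Fc = W/N, the N output pins of a cluster together cover all W tracks of a channel exactly once. The sketch below shows one way to satisfy that property; the contiguous assignment pattern, the assumption that W is divisible by N, and the example values W = 40, N = 4 are all illustrative and not taken from the paper.

```python
def assign_output_tracks(channel_width: int, cluster_size: int) -> dict[int, list[int]]:
    """Give each of the N output pins Fc = W/N tracks, with no track shared."""
    if channel_width % cluster_size != 0:
        raise ValueError("this sketch assumes W is divisible by N")
    fc = channel_width // cluster_size
    return {pin: list(range(pin * fc, (pin + 1) * fc)) for pin in range(cluster_size)}


tracks = assign_output_tracks(channel_width=40, cluster_size=4)
driven = sorted(t for pin_tracks in tracks.values() for t in pin_tracks)
print(len(tracks[0]), driven == list(range(40)))  # 10 True
```

As N grows, each pin connects to fewer tracks, so the per-pin switch count drops while every track is still reachable from the cluster.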

  12. Area Efficiency vs. Cluster Size • I is set to achieve 98% logic utilization • N = 2 BLEs introduces intra-cluster routing • Area efficiency rapidly degrades beyond this point • Reduce routing between logic blocks

  13. Conclusions • I = 2N + 2 for N < 16 • Slow, linear growth • Reduce Fc • Works because LUT inputs are equivalent • Cluster area efficiency is within 10% for 1 < N < 8 • Large clusters reduce the size of the placement problem and increase FPGA speed

  14. The Effect of LUT and Cluster Size on Deep-Submicron FPGA Performance and Density Elias Ahmed, Jonathan Rose IEEE Transactions on VLSI Systems 12(3): 288-298 (2004)

  15. Contributions • Vary LUT size (K) from 2 to 7 • Vary cluster size (N) from 1 to 10 LUTs • Experimentally determine the number of cluster inputs (I) as a function of K and N • Clustering small LUTs (K=2,3) produces good area results, but bad performance (~2x worse) • LUTs of size (K=4,5,6), clusters of size (N=3…10) yield the best area-delay product

  16. CAD Flow

  17. Inputs Required for 98% Area Utilization • I = ½K(N+1)
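
For K = 4 the formula I = ½K(N+1) reduces to 2N + 2, matching the rule from the first paper. A small sketch that checks this and evaluates other LUT sizes; rounding up when K(N+1) is odd is an assumption made here, since the slide does not say how fractional values are handled.

```python
def cluster_inputs(k: int, n: int) -> int:
    """I = K(N + 1) / 2, rounded up to a whole pin (rounding is an assumption)."""
    return (k * (n + 1) + 1) // 2


# K = 4 reproduces I = 2N + 2 across the studied cluster sizes.
print(all(cluster_inputs(4, n) == 2 * n + 2 for n in range(1, 11)))  # True
print(cluster_inputs(6, 8))  # 27
print(cluster_inputs(5, 4))  # 13 (12.5 rounded up)
```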

  18. Total Area • Intra-cluster routing area is 25-35% of the total area • LUT sizes of K = 4,5 are the most area efficient for all cluster sizes • Reduction in total area as cluster size increases from 1-3 for all LUT sizes • As clusters are made larger (N > 4) there is little impact on total FPGA area

  19. Total Intra-cluster Routing Area The increase in cluster size far outweighs the rate of decrease in the number of clusters: hence the upward trend

  20. #Clusters and Area/Cluster vs. K 25-35% N = 1 BLE per Cluster

  21. LUT area vs. Intra-cluster Mux Area • LUT area dominates • Intra-cluster routing area is 25-35% of logic cluster area

  22. Intra-cluster Routing Area as a Function of LUT Size Total intra-cluster routing area decreases near-linearly from K = 3 to 7

  23. Total Intra-cluster Routing Area • Routing area decreases linearly with LUT size • Increasing LUT sizes decreases the number of clusters used faster than the rate of increase in routing area per cluster • Depends on good CAD tools • The product of these two curves gives the total inter-cluster routing area.

  24. Critical Path Delay vs. LUT Size • As N and K increase • LUT delay and the delay through a single cluster increases • The number of LUTs and clusters in series on the critical path decreases • Reduced global routing delay • Increasing both N and K has a positive effect • Benefits saturate as N and K get large

  25. Intra-cluster Delay vs. LUT Size • Intra-cluster delay decreases as K increases • Reduction in number of BLE levels on critical path • Intra-cluster delay increases as N increases • Larger intra-cluster muxes are slower • The delay through these muxes is still much faster than global routing delay

  26. BLE Delay vs. K • BLE delay increases linearly as K increases (intuitive) • Number of BLEs on the critical path decreases quadratically as K increases • Fewer, but larger, BLEs

  27. Global Routing Delay vs. K • As K increases • Fewer LUTs on the critical path • Fewer global routing links • As N increases • More opportunities to use faster intra-cluster routing

  28. Critical Path Delay (K = 4) • K remains constant • No reduction in number of BLEs on critical path • N increases • BLE and intra-cluster routing delay increase • More logic implemented internally within clusters • Can use faster intra-cluster routing instead of global routing

  29. Critical Path Delay vs. LUT Size (Recap) • Increasing N beyond 3 has minimal effects • Limited effectiveness of clustering • Architectural weakness? • Semi-effective CAD tools?

  30. Number of Logic Clusters on Critical Path • The number of logic levels decreases with increasing N and K • For a given K, most of the reduction is from N = 1 to 3 • The majority of the critical path delay was reduced in this range • Increasing N is less effective when K is large

  31. BLE Fanout vs. LUT Size • Larger LUTs have larger average fanout • Harder to ensure that increasing N will result in fewer cluster levels on the critical path • Smaller LUTs respond better to increasing N because each LUT has a relatively small fanout • Adding an extra BLE to the cluster guaranteed some reduction in the number of logic levels

  32. Area-Delay Product • Large Delays • Many BLEs on critical path • Slightly larger area requirement • Large area cost for K=7 outweighs marginal delay improvement
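
The area-delay product ranks architectures by multiplying total area by critical path delay and preferring the smallest result. The sketch below shows the selection step only; the three data points are hypothetical normalized values invented for illustration and are not results from the paper.

```python
def best_by_area_delay(points):
    """Return the architecture point with the smallest area * delay."""
    return min(points, key=lambda p: p["area"] * p["delay"])


# Hypothetical, normalized numbers used only to exercise the function.
points = [
    {"K": 2, "area": 1.0, "delay": 2.0},  # small LUT: long critical path
    {"K": 4, "area": 1.0, "delay": 1.1},
    {"K": 7, "area": 1.6, "delay": 1.0},  # large LUT: area cost dominates
]
print(best_by_area_delay(points)["K"])  # 4
```

With these made-up values the K = 7 point loses despite having the lowest delay, which mirrors the qualitative point on the slide.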

  33. Caveats • Quality of CAD tools • Mix of benchmark circuits • Limited exploration of routing parameter design space • Parameters were derived from N = K = 4

  34. Best Overall Results and Summary • To achieve 98% LUT utilization, set I = ½K(N+1) • Small LUT sizes are not area efficient and have poor performance characteristics • Future challenges • Reduce number of BLEs on critical path without resorting to larger LUTs • Reduce intra-cluster routing delays
