Application Performance through Hardware Acceleration

Dan Legorreta, Moshe Looks, Shobana Padmanabhan
CSE 560, Oct 2005

[Hierarchical] Clustering [in Hardware]
• Clustering
  • Assign points in a space to non-overlapping clusters
  • Minimize intra-cluster distances
  • Maximize inter-cluster distances
• Hierarchical clustering
  • Cluster the clusters; generates a tree (dendrogram) showing the hierarchical structure of the data
  • Agglomerative (bottom-up) or partitioning (top-down)
• Why do it in hardware?
  • Clustering is often applied to biology or internet data with millions of items to cluster and thousands of dimensions
  • Clustering may be applied to high-volume data streams
  • Clustering algorithms are slow: roughly O(n^2 d) or worse for n items in d dimensions
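To make the O(n^2 d) cost concrete, here is a minimal C sketch of one merge step of naive agglomerative clustering: finding the closest pair among n points in d dimensions. The function names (`sq_dist`, `closest_pair`), the row-major layout, and the use of squared Euclidean distance are assumptions chosen for illustration; they are not taken from the project's code.

```c
#include <assert.h>
#include <stddef.h>

/* Squared Euclidean distance between two d-dimensional points: O(d). */
double sq_dist(const double *a, const double *b, size_t d)
{
    double sum = 0.0;
    for (size_t k = 0; k < d; k++) {
        double diff = a[k] - b[k];
        sum += diff * diff;
    }
    return sum;
}

/* Find the closest pair among n points stored row-major in pts (n rows, d columns).
 * Every pair is compared once, so the scan costs O(n^2 d).
 * Returns the squared distance and writes the pair's indices (i < j). */
double closest_pair(const double *pts, size_t n, size_t d,
                    size_t *best_i, size_t *best_j)
{
    assert(n >= 2 && d >= 1);
    double best = sq_dist(pts, pts + d, d);
    *best_i = 0;
    *best_j = 1;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            double dist = sq_dist(pts + i * d, pts + j * d, d);
            if (dist < best) {
                best = dist;
                *best_i = i;
                *best_j = j;
            }
        }
    }
    return best;
}
```

A full agglomerative run would repeat a step like this after each merge, which is why the total cost can exceed O(n^2 d) unless distances are cached.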

What’s Been Done?
• K-means, the most popular flat clustering algorithm, has been implemented in hardware:
  • M. Estlick, M. Leeser, J. Theiler, and J. J. Szymanski, “Algorithmic Transformations in the Implementation of K-means Clustering on Reconfigurable Hardware” (FPGA 2001).
  • 17 citations, including other hardware implementations of flat clustering algorithms
• Hierarchical clustering
  • M.Y. Niamat, D. Bitter, and M.M. Jamali, “FPGA Implementation of Hierarchical Clustering Algorithms” (ISCAS 1998).
  • Simple agglomerative clustering on 8 Xilinx 4003APC84 FPGAs
  • They only coded it in VHDL and simulated it; no results given!
  • No other papers found
• No known experimental results or implementations of top-down hierarchical clustering in hardware!

Liquid architecture platform
[Figure: block diagram of the platform. Labeled components: workstation, gcc, program, clustering application, control S/W interface, FPX, FPGA, LEON core, I-cache, D-cache, cache controller, AHB address/data bus, memory controller, SRAM/SDRAM, command controller.]
• LEON: SPARC V8 compatible, open soft core

Application runtime
• Non-intrusive, cycle-accurate profiling from the hardware implementation
• Profile result shown: dot product, 70%
[Figure: the platform block diagram with a statistics module added on the FPGA. Additional labels: request, timings.]
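The profile bounds what acceleration can achieve. As a worked example, assume the 70% label is the dot product's share of total application runtime (p = 0.70), and let s be a hypothetical speedup of the dot product alone. Amdahl's law, which is a standard estimate and not something stated on the slide, then gives the overall speedup S:

```latex
\[
S(s) = \frac{1}{(1 - p) + \dfrac{p}{s}}, \qquad p = 0.70
\]
\[
S(10) = \frac{1}{0.30 + 0.07} \approx 2.70,
\qquad
\lim_{s \to \infty} S(s) = \frac{1}{0.30} \approx 3.33
\]
```

Under these assumptions, even an infinitely fast dot product unit yields at most about a 3.3x overall speedup, and per-call overhead for reaching the unit lowers that further.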

Improve performance through hardware implementation
[Figure: the platform block diagram with a dot product unit ("+ dot product") and an APB bus added on the FPGA alongside the LEON core, caches, AHB address/data bus, memory controller, and command controller.]

Hardware acceleration
[Figure: the platform block diagram with the dot product unit and APB bus added. Labeled components: workstation, gcc, program, control S/W interface, FPX, FPGA, LEON core, I-cache, D-cache, cache controller, AHB address/data bus, APB, memory controller, SRAM/SDRAM, command controller.]

Dot product implementation
[Figure: the dot product circuit attached to the LEON over the APB. Memory-mapped registers shown: 0x800000D0, 0x800000D4, and 0x800000D8 (each labeled #1, #2, #3, bitV) and 0x800000DC (labeled stat, re, result).]
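From software, memory-mapped registers like these are normally reached through `volatile` pointers. The C sketch below shows that access pattern only. The four addresses come from the slide, but everything else is an assumption made for illustration: that the first three registers take input words, that the fourth combines status and result, that bit 31 is a "done" flag, and the names `dotp_run`, `DOTP_DONE_BIT`, and `DOTP_REG_RESULT`. The slide does not document the actual register layout.

```c
#include <assert.h>
#include <stdint.h>

/* Base address of the register block, taken from the slide. */
#define DOTP_BASE_ADDR   0x800000D0u

/* Everything below is a hypothetical layout, not the documented one. */
#define DOTP_NUM_INPUTS  3            /* 0x...D0, 0x...D4, 0x...D8: assumed inputs   */
#define DOTP_REG_RESULT  3            /* 0x...DC: assumed status + result word       */
#define DOTP_DONE_BIT    0x80000000u  /* assumed: bit 31 set when the result is ready */

/* Write three input words, wait for the done flag, and return the result field.
 * regs points at the first register; volatile keeps the compiler from caching
 * or reordering the device accesses. */
uint32_t dotp_run(volatile uint32_t *regs, const uint32_t in[DOTP_NUM_INPUTS])
{
    assert(regs != 0);
    for (int i = 0; i < DOTP_NUM_INPUTS; i++) {
        regs[i] = in[i];              /* one bus write per input register */
    }
    while ((regs[DOTP_REG_RESULT] & DOTP_DONE_BIT) == 0) {
        /* busy-wait: each poll is another bus read */
    }
    return regs[DOTP_REG_RESULT] & ~DOTP_DONE_BIT;
}
```

On the target, the call would look like `dotp_run((volatile uint32_t *)DOTP_BASE_ADDR, in)`; passing the base pointer as a parameter also lets the routine be exercised against a plain array on a workstation. Each register access is a separate bus transaction, which is one plausible source of per-call overhead with this approach.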

Plan
• Changes:
  • APB device with memory-mapped registers, instead of changing the compiler
  • Because of the APB overhead, we also plan to look at the co-processor interface
• New schedule:
  • APB implementation, including dot product, this week
  • Co-processor interface, as much as possible, starting next week
