Customizable Domain-Specific Computing Proposal for NSF “Expedition in Computing” Program

Point of Contact: Prof. Jason Cong, cong@cs.ucla.edu
Participating Universities: UCLA (lead), Rice, Ohio State, and UC Santa Barbara
(Complete list of PIs/Co-PIs available inside)

Focus: Power/Energy-Efficient Computation
Current Solution: Parallelization
[Figure: Parallelization. Source: Shekhar Borkar, Intel]

Our Proposal: Beyond Parallelization – Customizable Domain-Specific Computing
• Parallelization
• Customization: adapt the architecture to the application
[Figure source: Shekhar Borkar, Intel]

Motivation and Vision
A few facts
• We have sufficient computing power for most applications
• Each user/enterprise needs high computing power for only a limited set of tasks in his/her application domain
• Application-specific integrated circuits (ASICs) can lead to 10,000x+ better power-performance efficiency, but are too expensive to design and manufacture
Our vision and approach
• A general, customizable platform for the given domain(s)
  • Can be customized to a wide range of applications in the domain with novel compilation and runtime systems
  • Can be mass-produced cost-efficiently
  • Can be programmed efficiently
Goal: A “supercomputer-in-a-box” with 100x performance/power improvement via customization for the intended domain(s)
Analogy: Advance of civilization via specialization/customization

Application Domains: Medical Image Processing & Hemodynamic Simulation
Medical imaging has transformed healthcare
• An in vivo method for understanding disease development and patient condition
• Estimated to be $100 billion/year
• More powerful & efficient computation can help
  • Less exposure, using compressive sensing with a lower sampling frequency
  • Better clinical assessment, using improved registration and segmentation algorithms to provide quantitative measures of disease (e.g., cancer)
Hemodynamic simulation
• Very useful for surgical procedures involving blood flow and vasculature
Both may take hours to days to construct
• Clinical requirement: 1-2 min
[Figures: Magnetic resonance (MR) angiography of an aneurysm; intracranial aneurysm reconstruction with hemodynamics]

Application Domains: Medical Image Processing Pipeline
• Reconstruction: compressive sensing
• Denoising: total variational algorithm
• Registration: fluid registration
• Segmentation: level set methods
• Analysis: Navier-Stokes equations

Application Domains: Medical Image Processing Pipeline
Pipeline stages and their computation & communication patterns
• Reconstruction (compressive sensing): iterative; local or global communication; dense and sparse linear algebra, optimization methods
• Denoising (total variational algorithm): non-iterative, highly parallel; local & global communication; sparse linear algebra, structured grid, optimization methods
• Registration (fluid registration): parallel; global communication; dense linear algebra, optimization methods
• Segmentation (level set methods): local communication; dense linear algebra, spectral methods, MapReduce
• Analysis (Navier-Stokes equations): local communication; sparse linear algebra, n-body methods, graphical models
Observations
• These algorithms have diverse computation & communication patterns
• A single, homogeneous system cannot perform very well on all of these algorithms
• Need architecture customization and hardware-software co-optimization
• Include many common computation kernels (“motifs”)
• Applicable to other domains
Bi-harmonic registration (using the same algorithm on all platforms); speedup relative to CPU, power
• CPU (Xeon 2.0 GHz): 1x, ~100 W
• GPU (Tesla C1060): 93x, ~150 W
• FPGA (xc4vlx100): 11x, ~5 W
3D median filter (for each voxel, compute the median of the 3 x 3 x 3 neighboring voxels); algorithm, speedup relative to CPU, power
• CPU (Xeon 2.0 GHz): quick select, 1x, ~100 W
• GPU (Tesla C1060): median of medians, 70x, ~140 W
• FPGA (xc4vlx100): bit-by-bit majority voting, 1200x, ~3 W
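The slide names "bit-by-bit majority voting" as the FPGA algorithm for the 3D median filter but gives no details. The sketch below shows one common bit-serial formulation of that idea, and it is an assumption about what the slide means, not the authors' implementation. The function names, the 8-bit voxel width, and the clamp-at-the-border policy are all hypothetical choices made for illustration.

```python
def median_by_bit_voting(values, bits=8):
    """Median of an odd number of unsigned ints, resolved one bit at a time.

    Assumed formulation: going from MSB to LSB, the median's bit is the
    majority of the bits voted by all values. A value whose bit disagrees
    with the majority is now known to be above (bit 1) or below (bit 0)
    the median, so it keeps casting that same vote for all lower bits.
    """
    n = len(values)
    assert n % 2 == 1, "needs an odd number of values"
    forced = [None] * n          # None = still undecided, else fixed vote 0/1
    median = 0
    for b in range(bits - 1, -1, -1):
        votes = [f if f is not None else (v >> b) & 1
                 for v, f in zip(values, forced)]
        majority = 1 if sum(votes) > n // 2 else 0
        median = (median << 1) | majority
        for i in range(n):
            if forced[i] is None and votes[i] != majority:
                forced[i] = votes[i]
    return median


def median_filter_3d(vol, bits=8):
    """3 x 3 x 3 median filter over vol[z][y][x] (nested lists of ints).

    Border voxels reuse the nearest in-range neighbor (clamping); the
    slide does not specify a border policy, so this is an assumption.
    """
    nz, ny, nx = len(vol), len(vol[0]), len(vol[0][0])

    def clamp(i, n):
        return min(max(i, 0), n - 1)

    out = [[[0] * nx for _ in range(ny)] for _ in range(nz)]
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                window = [vol[clamp(z + dz, nz)][clamp(y + dy, ny)][clamp(x + dx, nx)]
                          for dz in (-1, 0, 1)
                          for dy in (-1, 0, 1)
                          for dx in (-1, 0, 1)]
                out[z][y][x] = median_by_bit_voting(window, bits)
    return out
```

The per-bit step uses only a population count and a comparison, with no data-dependent sorting. That regularity is a plausible reason this style of algorithm maps well to programmable fabric, whereas the CPU and GPU rows on the slide use selection-based algorithms.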

Overview of the Proposed Research
• Domain-specific modeling (healthcare applications): application modeling, domain characterization, architecture modeling
• CHP creation (design once): customizable computing engines, customizable interconnects
• CHP mapping (invoke many times): source-to-source CHP mapper, reconfiguring & optimizing backend, adaptive runtime
[Figure: Customizable Heterogeneous Platform (CHP) with fixed cores, custom cores, caches ($), and programmable fabric, connected by a reconfigurable RF-I bus and a reconfigurable optical bus (transceiver/receiver, optical interface), with DRAM and I/O; the customization setting is applied to the CHP]

CHP Creation – Design Space Exploration
Core parameters
• Frequency & voltage
• Datapath bit width
• Instruction window size
• Issue width
• Cache size & configuration
• Register file organization
• # of thread contexts
• …
NoC parameters
• Interconnect topology
• # of virtual channels
• Routing policy
• Link bandwidth
• Router pipeline depth
• Number of RF-I enabled routers
• RF-I channel and bandwidth allocation
• …
Custom instructions & accelerators
• Amount of programmable fabric
• Shared vs. private accelerators
• Custom instruction selection
• Choice of accelerators
• …
Key questions
• Optimal trade-off of efficiency & customizability
• Which options to fix at CHP creation? Which to be set by the CHP mapper?
[Figure: Customizable Heterogeneous Platform (CHP) with fixed cores, custom cores, caches ($), programmable fabric, reconfigurable RF-I bus, reconfigurable optical bus, transceiver/receiver, and optical interface]
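To make the two key questions concrete, here is a minimal sketch of how such a design space could be enumerated, with some options pinned at CHP creation and the rest left open for the CHP mapper. The parameter subset, the option values, and the names `DesignPoint`, `OPTIONS`, and `enumerate_design_space` are all hypothetical, chosen only for illustration; the proposal does not specify any of them.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class DesignPoint:
    """One candidate CHP configuration (hypothetical subset of parameters)."""
    issue_width: int
    cache_kb: int
    noc_topology: str
    rf_i_routers: int
    shared_accelerators: bool


# Hypothetical option lists; real ranges would come from domain characterization.
OPTIONS = {
    "issue_width": [1, 2, 4],
    "cache_kb": [32, 64],
    "noc_topology": ["mesh", "torus"],
    "rf_i_routers": [0, 4, 8],
    "shared_accelerators": [False, True],
}


def enumerate_design_space(options, fixed=None):
    """Yield every DesignPoint, holding the parameters in `fixed` constant.

    `fixed` models options frozen at CHP creation; whatever remains is the
    space the CHP mapper could still explore per application.
    """
    fixed = fixed or {}
    names = list(options)
    choices = [[fixed[n]] if n in fixed else options[n] for n in names]
    for combo in product(*choices):
        yield DesignPoint(**dict(zip(names, combo)))


if __name__ == "__main__":
    full = list(enumerate_design_space(OPTIONS))
    mapper_space = list(enumerate_design_space(
        OPTIONS, fixed={"issue_width": 2, "noc_topology": "mesh"}))
    print(len(full))          # 72
    print(len(mapper_space))  # 12
```

Even this toy space shows the trade-off: every option fixed at creation shrinks what the mapper can tune, while every option left open multiplies the configurations that the compiler and runtime must be able to handle.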

CHP Mapping – Compilation and Runtime Software Systems for Customization
Goal
• Efficient compiler and runtime support to map domain-specific specifications to customizable hardware
• Adapt the CHP to a given application for drastic performance/power efficiency improvement
Flow (from the figure)
• Domain-specific applications; programmer; abstract execution
• Domain-specific programming model (domain-specific coordination graph and domain-specific language extensions)
• Application characteristics and CHP architecture models
• Source-to-source CHP mapper
• C/C++ code, analysis annotations, C/SystemC behavioral spec
• C/C++ front-end; RTL synthesizer (xPilot); performance feedback
• Reconfiguring and optimizing back-end
• Binary code for fixed & customized cores; customized target code; RTL for programmable fabric
• Adaptive runtime: lightweight threads and adaptive configuration
• CHP architectural prototypes (CHP hardware testbeds, CHP simulation testbed, full CHP)

Center for Domain-Specific Computing (CDSC) Organization
A diversified & highly accomplished team: 8 in CS&E; 1 in EE; 2 in medical school; 1 in applied math
Aberle, Baraniuk, Bui, Chang, Cheng, Cong (Director), Palsberg, Potkonjak, Reinman, Sadayappan, Sarkar (Associate Director), Vese

Milestones

Milestones for Experimental Platforms
• Prototype 1a: Heterogeneous integration of off-the-shelf CMPs + GPUs + FPGAs
  • E.g., Intel Xeon CPU + Xilinx V5 FPGA (via FSB) + Nvidia Tesla GPU (via PCI Express 2.0)
  • Initial HW platform for CHP compilation and runtime system development
• Prototype 1b: RF-interconnect prototype
  • RF-I implementation at 45nm CMOS with multiple digital cores/traffic generators
  • Performance, power, and reliability study
• Prototype 2: Final CHP implementation for the proposed healthcare domains
  • Single-chip integration or 3D integration
[Figures: RF-I tape-out at IBM 90nm CMOS; 3D CHP with programmable fabric, fine-grain cores, DCT unit, fixed core, customizable core, and shared cache across Layer 1 and Layer 2]

Integrated Research and Education
New courses planned based on the research
• “Architecture and Compilation for Domain-specific Computing”
• “Computational Techniques for Medical Imaging”
• “Programming Models and Application Development for Domain-specific Computing”
  • With projects for new domains, e.g., scientific computing, VLSI CAD, and digital entertainment
• May be jointly taught (multi-disciplinary)
• Developed and shared via Connexions (cnx.org), an open-access education platform now with over 1M users/month (based at Rice)
Graduate student training
• Estimated around 18 students in total across four campuses
• Seminars and workshops on interdisciplinary research, career development, ethics, entrepreneurship …
Undergraduate student training
• 10 summer research fellowships each year, via UCLA FOCUS, Rice AGEP, and similar programs
Outreach to high-school students
• 5-7 high-school summer scholarships each year, via the UCLA SMARTS program

Outreach Partner: Frontier Opportunities in Computing for Underrepresented Students (FOCUS)
• Aims to increase the number of underrepresented minorities interested in computing disciplines
• Currently has 50 underrepresented undergraduates: 23 in CS, 27 in CSE
• http://ceed.ucla.edu
[Photo: 2007 summer research poster competition, the first-prize winner]

Outreach Partner: Science Mathematics Achievement and Research Technology for Students (SMARTS)
• A six-week summer college preparation program at UCLA
• Engages underrepresented students in science, technology, engineering, and math training
SMARTS activities
• Course-related activities
  • Math courses (Intro to Statistics and AP Calculus Readiness)
  • SAT preparation
• Research activities
  • CDSC faculty and graduate students will serve as mentors and provide projects
This year, the SMARTS program has over 80 applicants
• 30-35 will be admitted (due to limited funding)

Knowledge Transfer
Main outcomes of the project
1. CHP prototypes
2. Compilation and runtime system for CHP mapping
3. Application drivers: original source code & modified code with domain-specific modeling
4. General methodology for customizable computing (mainly through publications)
Outcomes 1-3 will be shared with the research community via the web as they become available
Industrial partners
• Altera, IBM, Intel, Magma, Mentor Graphics, Nvidia, Xilinx
• More will be contacted and included if the project is officially funded
Campus partners
• UCLA Institute of Digital Research and Education (IDRE)
• Institute of Pure and Applied Mathematics (IPAM)
• UCLA Wireless Health Institute (WHI)
Technology transfer experience
• Impact via industrial partners: IBM, Intel, Xilinx …
• Startups: Aplus (acquired by Magma in 2003), AutoESL (Magma and Xilinx were investors)

Why an Expedition
• Address a fundamental problem: energy-efficient computing
  • What’s beyond parallelization?
  • Our proposal: a transformative approach using customization
• Many challenging research topics
  • Domain-specific modeling/specification
  • Novel architecture & microarchitecture for customization
  • Compilation and runtime software to support intelligent customization
  • New research in testing, verification, reliability, etc. in customizable computing
• Integrated effort in modeling, HW, SW, & application development
• Demonstration in a critical application domain
  • Healthcare has a significant impact on the economy and society
  • Can greatly benefit from customizable domain-specific computing
