
Advanced Computer Architecture ML Accelerators: Why?

This course explores the concept of ML accelerators and why they are essential in modern computer architecture. Topics include project proposals, literature survey, Flynn's taxonomy of computers, systolic architectures, multi-core processors, and the advantages and disadvantages of each.


Presentation Transcript


  1. ADVANCED COMPUTER ARCHITECTURE ML Accelerators: Why? Samira Khan, University of Virginia, Feb 4, 2019. The content and concept of this course are adapted from CMU ECE 740.

  2. AGENDA • Review from last lecture • Single core->multi core->accelerator • ML accelerators: why?

  3. LOGISTICS • Project list • Posted in Piazza • Be prepared to spend time on the project • Sample project proposals from many different years • Posted in Piazza • Project Proposal Due on Feb 11, 2019 • Project Proposal Presentations: Feb 13, 2019 • You can present using your own laptop • Groups: 1 or 2 students

  4. Project Proposal • Problem: Clearly define the problem you are trying to solve • Novelty: Did any other work try to solve the problem? How did they solve it? What are the shortcomings? • Key Idea: What is the initial idea? Why do you think it will work? How is your approach different from the prior work? • Methodology: How will you test and evaluate your idea? What tools or simulators will you use? What are the experiments you need to do to prove/disprove your idea? • Plan: Describe the steps to finish your project. What will you accomplish at each milestone? What are the things you absolutely must finish? Can you do more? If you finish, can you submit it to a conference? Which conference do you think is the best fit for the work?

  5. LITERATURE SURVEY • Goal: Critically analyze work related to your project • Pick 2-3 papers related to your project • Use the same format as the reviews • What is the problem the paper is solving? • What is the key insight? • What are the advantages and disadvantages? • How can you do better? • Will become the related work section in your proposal

  6. FLYNN’S TAXONOMY OF COMPUTERS • Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966 • SISD: Single instruction operates on single data element • SIMD: Single instruction operates on multiple data elements • Array processor • Vector processor • MISD: Multiple instructions operate on single data element • Closest form: systolic array processor, streaming processor • MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams) • Multiprocessor • Multithreaded processor
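To make the SISD vs. SIMD distinction concrete, here is a minimal Python/NumPy sketch (an illustration, not from the slides): the scalar loop issues one multiply per data element, while the NumPy call applies a single vectorized operation across the whole array, which is the SIMD pattern.

```python
import numpy as np

# SISD-style: one instruction stream operates on one data element at a time.
def scale_sisd(xs, a):
    out = []
    for x in xs:               # each iteration is a separate scalar multiply
        out.append(a * x)
    return out

# SIMD-style: a single (vector) operation is applied to many data elements.
def scale_simd(xs, a):
    return a * np.asarray(xs)  # NumPy dispatches one vectorized multiply

print(scale_sisd([1, 2, 3], 10))           # [10, 20, 30]
print(scale_simd([1, 2, 3], 10).tolist())  # [10, 20, 30]
```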

  7. WHY SYSTOLIC ARCHITECTURES? • Idea: Data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory • Similar to an assembly line of processing elements • Different people work on the same car • Many cars are assembled simultaneously • Why? Special-purpose accelerators/architectures need: • Simple, regular design (keep # unique parts small and regular) • High concurrency → high performance • Balanced computation and I/O (memory) bandwidth
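As an illustration only (not part of the lecture), the sketch below simulates an output-stationary systolic array for matrix multiplication: operands are injected with a one-cycle skew per row/column so that A[i, d] and B[d, j] meet in PE (i, j) at cycle t = i + j + d, and each PE just multiplies and accumulates.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle sketch of an output-stationary systolic array.

    Rows of A stream in from the left and columns of B stream in from the
    top, each skewed by one cycle per row/column, so matching operands
    arrive at PE (i, j) together. Each PE only multiplies and accumulates.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for t in range(n + m + k - 2):     # cycles until the array drains
        for i in range(n):
            for j in range(m):
                d = t - i - j          # operand pair reaching PE (i, j) this cycle
                if 0 <= d < k:
                    C[i, j] += A[i, d] * B[d, j]
    return C

A = np.random.rand(3, 4)
B = np.random.rand(4, 2)
assert np.allclose(systolic_matmul(A, B), A @ B)
```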

  8. SYSTOLIC ARRAYS: PROS AND CONS • Advantage: • Specialized (computation needs to fit PE organization/functions) → improved efficiency, simple design, high concurrency/performance → good to do more with less memory bandwidth requirement • Downside: • Specialized → not generally applicable because computation needs to fit the PE functions/organization

  9. MULTI-CORE • Idea: Put multiple processors on the same die. • Technology scaling (Moore’s Law) enables more transistors to be placed on the same die area • What else could you do with the die area you dedicate to multiple processors? • Have a bigger, more powerful core • Have larger caches in the memory hierarchy • Integrate platform components on chip (e.g., network interface, memory controllers)

  10. WHY MULTI-CORE? • Alternative: Bigger, more powerful single core • Larger superscalar issue width, larger instruction window, more execution units, large trace caches, large branch predictors, etc. + Improves single-thread performance transparently to programmer, compiler - Very difficult to design (scalable algorithms for improving single-thread performance are elusive) - Power hungry – many out-of-order execution structures consume significant power/area when scaled. Why? - Diminishing returns on performance - Does not significantly help memory-bound application performance (scalable algorithms for this are elusive)

  11. MULTI-CORE VS. LARGE SUPERSCALAR • Multi-core advantages + Simpler cores → more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures) + Higher system throughput on multiprogrammed workloads → reduced context switches + Higher system throughput in parallel applications • Multi-core disadvantages - Requires parallel tasks/threads to improve performance (parallel programming) - Resource sharing can reduce single-thread performance - Shared hardware resources need to be managed - Number of pins limits data supply for increased demand

  12. WHY MULTI-CORE? • Alternative: Bigger caches + Improves single-thread performance transparently to programmer, compiler + Simple to design - Diminishing single-thread performance returns from cache size. Why? - Multiple levels complicate memory hierarchy

  13. CACHE VS. CORE

  14. WHY MULTI-CORE? • Alternative: Integrate platform components on chip instead + Speeds up many system functions (e.g., network interface cards, Ethernet controller, memory controller, I/O controller) - Not all applications benefit (e.g., CPU intensive code sections)

  15. Multicore Decade? We have relied on multicore scaling for over five years. How much longer will it be our primary performance scaling technique? [Timeline figure, 2000–2015: Pentium Extreme Dual-Core, Core 2 Quad-Core, i7 980x Hex-Core]

  16. Finding Optimal Multicore Designs For the next 5 technology generations, find the best-performing multicore from a comprehensive design-space search for each of the PARSEC benchmarks. Comprehensive design space: • Fixed area budget • Fixed power budget • Two sets of CMOS scaling projections • Optimal core and diverse multicore organizations • Parallel benchmarks

  17. Symmetric Multicore Projections [Figure: 3.4x speedup in 10 years, far short of the 18x reference line] Symmetric multicores alone will not sustain the multicore era.

  18. Multicore Solutions: Asymmetric Topologies [Figure: 3.5x speedup]

  19. Multicore Solutions: Dynamic Topologies [Figure: 3.5x speedup] [Chakraborty (2008), Suleman et al. (2009)]

  20. Multicore Solutions: Composed/Fused Topologies [Figure: 3.7x speedup] [Ipek et al. (2007), Kim et al. (2007)]

  21. Multicore Solutions [Figure: 2.7x speedup]

  22. Multicore Era Projections [Figure: best designs reach 3.7x vs. an 18x reference] The best designs speed up 14% per year rather than the recent trend of 34% per year.

  23. WITH MULTIPLE CORES ON CHIP • What we want: • N times the performance with N times the cores when we parallelize an application on N cores • What we get: • Amdahl’s Law (serial bottleneck) • Bottlenecks in the parallel portion

  24. CAVEATS OF PARALLELISM • Amdahl’s Law: Speedup = 1 / ((1 - f) + f / N) • f: Parallelizable fraction of a program • N: Number of processors • Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” AFIPS 1967. • Maximum speedup limited by serial portion: Serial bottleneck • Parallel portion is usually not perfectly parallel • Synchronization overhead (e.g., updates to shared data) • Load imbalance overhead (imperfect parallelization) • Resource sharing overhead (contention among N processors)
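A quick way to see the serial bottleneck is to plug numbers into the formula; the snippet below (illustrative only) shows that even a 95%-parallel program tops out far below linear speedup.

```python
def amdahl_speedup(f, n):
    """Speedup = 1 / ((1 - f) + f / n), where f is the parallelizable fraction."""
    return 1.0 / ((1.0 - f) + f / n)

for n in (4, 16, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 1))
# 4 cores -> 3.5x, 16 -> 9.1x, 64 -> 15.4x, 1024 -> 19.6x (the limit is 1/0.05 = 20x)
```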

  25. THE PROBLEM: SERIALIZED CODE SECTIONS • Many parallel programs cannot be parallelized completely • Causes of serialized code sections • Sequential portions (Amdahl’s “serial part”) • Critical sections • Barriers • Serialized code sections • Reduce performance • Limit scalability • Waste energy

  26. Why Diminishing Returns? • Transistor area is still scaling • Voltage and capacitance scaling have slowed • Result: designs are power, not area, limited

  27. Dark Silicon [Figure: dark-silicon fraction at 22 nm and at 8 nm; values shown range from 17% to 71%] Sources of dark silicon: power + limited parallelism

  28. Conclusions • Multicore performance gains are limited • Need at least 18%–40% per generation from architecture alone, without additional power [Figure: Unicore Era → Multicore Era → ?]

  29. Efficiency Innovation Specialization

  30. NN Accelerators

  31. How Does the Brain Work? • The basic computational unit of the brain is a neuron • 86B neurons in the brain • Neurons are connected with nearly 10^14 – 10^15 synapses • Neurons receive input signals from dendrites and produce an output signal along the axon, which interacts with the dendrites of other neurons via synaptic weights • Synaptic weights – learnable & control influence strength Image Source: Stanford

  32. Neural Networks: Weighted Sum Image Source: Stanford

  33. Many Weighted Sums Image Source: Stanford

  34. What is Deep Learning? [Figure: an image classified as “Volvo XC90”] Image Source: [Lee et al., Comm. ACM 2011]

  35. Why is Deep Learning Hot Now? • Big Data Availability: 350M images uploaded per day; 2.5 Petabytes of customer data hourly; 300 hours of video uploaded every minute • GPU Acceleration • New ML Techniques

  36. ImageNet Challenge • Image Classification Task: 1.2M training images, 1000 object categories • Object Detection Task: 456k training images, 200 object categories

  37. ImageNet: Image Classification Task [Figure: Top-5 classification error (%), 2010–2015: large error-rate reduction due to deep CNNs; hand-crafted feature-based designs vs. deep CNN-based designs vs. human] [Russakovsky et al., IJCV 2015]

  38. GPU Usage for ImageNet Challenge

  39. Established Applications • Image • Classification: image to object class • Recognition: same as classification (except for faces) • Detection: assigning bounding boxes to objects • Segmentation: assigning object class to every pixel • Speech & Language • Speech Recognition: audio to text • Translation • Natural Language Processing: text to meaning • Audio Generation: text to audio • Games

  40. Deep Learning on Games Google DeepMind AlphaGo

  41. Emerging Applications • Medical (Cancer Detection, Pre-Natal) • Finance (Trading, Energy Forecasting, Risk) • Infrastructure (Structure Safety and Traffic) • Weather Forecasting and Event Detection http://www.nextplatform.com/2016/09/14/next-wave-deep-learning-applications/

  42. Deep Learning for Self-Driving Cars

  43. DNN Terminology 101 • Neurons Image Source: Stanford

  44. DNN Terminology 101 • Synapses Image Source: Stanford

  45. DNN Terminology 101 • Each synapse has a weight for neuron activation: Y_j = activation( sum_{i=1..3} W_ij × X_i ) [Figure: inputs X1–X3, outputs Y1–Y4, weights W11 … W34] Image Source: Stanford
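The weighted sum on this slide maps directly to a matrix-vector product. A minimal NumPy sketch follows; ReLU is chosen here only as an example activation, since the slide does not specify one.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

X = np.random.randn(3)     # 3 input neurons X1..X3
W = np.random.randn(3, 4)  # W[i, j] is the weight from input i to output j
Y = relu(X @ W)            # Y_j = activation(sum_i W_ij * X_i), 4 outputs Y1..Y4
print(Y.shape)             # (4,)
```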

  46. DNN Terminology 101 • Weight Sharing: multiple synapses use the same weight value • Y_j = activation( sum_{i=1..3} W_ij × X_i ) [Figure: inputs X1–X3, outputs Y1–Y4, weights W11 … W34] Image Source: Stanford
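Weight sharing is easiest to see in a convolution, where the same small filter is reused at every position. The sketch below is an illustration (not from the slides): one 3-tap filter slides across an input, so many synapses share the same three weight values.

```python
import numpy as np

x = np.arange(8, dtype=float)    # input activations
w = np.array([0.25, 0.5, 0.25])  # one 3-tap filter, reused at every position

# Each output reuses the same three weights: classic weight sharing.
y = np.array([np.dot(w, x[i:i + 3]) for i in range(len(x) - 2)])
print(y)
```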

  47. DNN Terminology 101 • Layer 1 • L1 neuron outputs, a.k.a. activations • L1 neuron inputs, e.g. image pixels Image Source: Stanford

  48. DNN Terminology 101 • Layer 2 • L2 input activations • L2 output activations Image Source: Stanford

  49. DNN Terminology 101 • Fully-Connected: all input neurons connected to all output neurons • Sparsely-Connected Image Source: Stanford
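One way to picture the difference is a dense weight matrix versus the same matrix with most entries forced to zero; a small illustrative sketch (the 30% connection density is an arbitrary choice for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
W_dense = rng.standard_normal((3, 4))    # fully-connected: every input feeds every output

mask = rng.random((3, 4)) < 0.3          # keep roughly 30% of the connections
W_sparse = np.where(mask, W_dense, 0.0)  # sparsely-connected: most weights are zero

x = rng.standard_normal(3)
print(x @ W_dense)   # all 12 connections contribute
print(x @ W_sparse)  # only the surviving connections contribute
```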

  50. DNN Terminology 101 • Feed-Forward • Feedback Image Source: Stanford
