1 / 22

Technology Impacts from the New Wave of Architectures for Media-rich Workloads

VLSI Technology Symposium 2011. Technology Impacts from the New Wave of Architectures for Media-rich Workloads. Samuel Naffziger AMD Corporate Fellow June 14 th , 2011. Outline. Introduction The new workloads and demands on computation Characteristics of serial and parallel computation

boone
Download Presentation

Technology Impacts from the New Wave of Architectures for Media-rich Workloads

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. VLSI Technology Symposium 2011 Technology Impacts from the New Wave of Architectures for Media-rich Workloads Samuel Naffziger AMD Corporate Fellow June 14th, 2011

  2. Outline • Introduction • The new workloads and demands on computation • Characteristics of serial and parallel computation • The Accelerated Processing Unit (APU) architecture • APU architecture implications for technology • Summary

  3. The Big Experience/Small Form Factor Paradox Immersive and interactive performance Workloads Standard-definition Internet FormFactors Early Internet and Multimedia Experiences *Resting battery life as measured with industry standard tests.

  4. Focusing on the experiences that matter Consumer PC Usage NewExperiences Email Web browsing Office productivity Listen to music Online chat Watching online video Photo editing Personal finances Taking notes Online web-based games Social networking Calendar management Locally installed games Educational apps Video editing Internet phone Accelerated Internet and HD Video Simplified Content Management ImmersiveGaming 80% 0% 60% 20% 40% 100% Source: IDC's 2009 Consumer PC Buyer Survey

  5. People Prefer Visual Communications Visual Perception Verbal Perception Words are processedat only 150 wordsper minute Pictures and videoare processed 400 to2000 times faster • Rich visual experiences • Multiple content sources • Multi-Display • Stereo 3D Augmenting Today’s Content:

  6. The Emerging World of New Data Rich Applications The Ultimate Visual Experience™ Fast Rich Web content, favorite HD Movies, games with realistic graphics CyberLink Power Director 9 CyberLink Media Espresso 6 ArcSoft TotalMedia®  Theatre 5 ArcSoft Media Converter® 7 • Communicating • IM, Email, Facebook • Video Chat, NetMeeting • Using photos • Viewing& Sharing • Search, Recognition, Labeling? • Advanced Editing Corel VideoStudio Pro Corel Digital Studio 2010 Microsoft® PowerPoint® 2010  Nuvixa Be Present Windows Live Essentials Internet Explorer 9 • Using video • DVD, BLU-RAY™, HD • Search, Recognition, Labeling • Advanced Editing & Mixing • Gaming • Mainstream Games • 3D games • Music • Listening and Sharing • Editing and Mixing • Composing and compositing ViVu Desktop Telepresence Codemasters F1 2010 Viewdle Uploader

  7. New Workload Examples: Changing Consumer Behavior Approximately 9 billion video files owned are high-definition 24 hours of video uploaded to YouTube every minute 50 million + digital media files added to personal content libraries every day 1000 images are uploaded to Facebook every second 7

  8. What Are the Implications for Computation? • Insatiable demand for high bandwidth processing • Visual image processing • Natural user interfaces • Massive data mining for associative searches, recognition • Some of these compute needs can be offloaded to servers, some must be done on the mobile device • Similar compute needs and massive growth in both spaces How must CPU architecture change to deal with these trends?

  9. Parallel and Serial Computation Data Parallel Code … i=0 i++ load x(i) fmul store cmp i (1000000) bc Serial Code Loop 1M times for 1M pieces of data … … i=0 i++ load x(i) fmul store cmp i (16) bc i,j=0 i++ j++ load x(i,j) fmul store cmp j (100000) bc cmpi (100000) bc 2D array representing very large dataset … Loops, branches and conditional evaluation … Conditional branches

  10. GPU/CPU Design Differences CPU (Serial compute) GPU (parallel compute)

  11. Three Eras of Processor Performance Heterogeneous Systems Era Multi-Core Era • Enabled by: • Moore’s Law • Desire for Throughput • 20 years of SMP arch • Constrained by: • Power • Parallel SW availability • Scalability • Enabled by: • Moore’s Law • Abundant data parallelism • Power efficient GPUs • Temporarily constrained by: • Programming models • Communication overheads • Workloads o Targeted Application Performance we are here Throughput Performance o we are here Time (Data-parallel exploitation) Time (# of Processors) Single-Core Era • Enabled by: • Moore’s Law • Voltage & Process Scaling • Micro Architecture • Constrained by: • Power • Complexity o ? Single-thread Performance we are here Time

  12. Heterogeneous Computing with an APU Architecture 2010 IGP-based (“Danube”) Platform 2011 APU-based (“Llano”) Platform ~17 GB/sec ~17 GB/sec CPU Cores DDR3 DIMM Memory CPU Chip CPU Cores MC DDR3 DIMM Memory UNB APU Chip UVD UNB / MC ~7 GB/sec GPU Graphics requires memory BW to bring full capabilities to life GPU UVD ~27 GB/sec FCH Chip SB Functions PCIe ~27 GB/sec Optional GPU PCIe® Bandwidth pinch points and latency hold back the GPU capabilities Integration Provides Improvement • Eliminate power and latency of extra chip crossing • 3X bandwidth between GPU and Memory! • Same sized GPU is substantially more effective • Power efficient, advanced technology for both CPU and GPU

  13. The Challenges of Integration GPU CPU Performance Density Thick, fast metal Big devices Dense, thin metal, small devices Flop count for Llano GPU =3.5M Flop count for 4 Llano CPU cores=0.66M CPU flop area = 2.14 GPU flop area = 1.0

  14. How to Balance the Metal Stack? Cu Resistivity without barrier With barrier 2.5 2.4 GPU 2.3 2.2 CPU 2.1 Performance 2 Resistivity (uohm-cm) 1.9 Add metal layers? • Thin, dense layers for the GPU • Thick, low resistance layers for the CPU • Cost issues? • Via resistance? Technology improvements in BEOL are required 1.8 1.7 1.6 1.5 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Line Width (um) Density With the 20nm node, even local metal will be seeing large RC increase  compromises more difficult

  15. Device Optimization CPU GPU Performance Speed vs. Leakage Device Ioff To achieve breakthrough APU performance, the Llano GPU has ~5X the flops and ~5X the device count of the CPUs Broader span of devices required RO speed desired device range A broader device suite is required CPU RVT LC-RVT HVT LC-HVT LVT GPU

  16. Power Transfers Temperature GPU-centric data parallel workload CPU-centric serial workload Balanced workload Voltage range is critical to enabling the efficient power transfers that make for compelling APU performance

  17. Operating Voltage Range Operating voltage requirements: • Low voltage necessary for power efficiency • High voltage necessary for a snappy user experience enabled by turbo mode

  18. Operating Voltage Challenges • To maintain cost effective performance growth with technology node, the GPU must: • Hold power density constant • Exploit density gains to add compute units • This necessarily drives operating voltage down • This would be good for energy efficiency except … • Variation impacts are much greater at low voltage

  19. The Operating Voltage Challenge FD devices should enable maintaining the functional range for a generation or two Will turbo modes be too compromised? What’s next? D Many barriers to maintaining both high and low voltage as technology scales • TDDB vs. SCE control • ULK breakdown vs. denser pitches • Variation control Poly Current Flow -> S Fin BOX

  20. 3D Integration to the Rescue? DRAM • Stacking offers many attractive benefits • Higher bandwidth to local memory • Enables parallel and serial compute die to be in their own separate optimized technology – interconnect speed vs. density, device optimization etc. • Allows IO and southbridge content to remain in older, more analog-friendly technology Heat Sink TIM (Thermal Interface Material) DRAM Micro-bumps Metal Layers Analog Die (SB, Power) Metal Layers Through Silicon Vias (TSVs) GPU Die Metal Layers CPU Die Metal Layers South Bridge Package Substrate

  21. 3D Integration Challenges • Economical 3D stacking in high volume manufacturing presents many challenges • Benefits must exceed the additional costs of TSVs, and yield fallout • Logistics of testing and assembling die from multiple sources can be immense • Countless mechanical and thermal issues to solve in high volume mfg Clearly 3D provides compelling solutions to many problems, but the barriers to entry mean heavy R&D $$ and partnerships required

  22. Summary • Insatiable demand for high bandwidth computation • Visual image processing • Natural user interfaces • Massive data mining for associate searches, recognition • Some of these compute needs can be offloaded to servers, some must be done on the mobile device • Similar compute needs and massive growth in both spaces • Combined serial and parallel computation architectures are key in both spaces • Huge technology challenges to meeting this opportunity • Interconnect scaling is hitting a wall that must be overcome • A broad device suite is necessary that operates efficiently at low voltage while enabling high speed for response time • 3D integration offers a promising long term solution

More Related