
Multicore: Commercial Processors


garan

Presentation Transcript


  1. Multicore: Commercial Processors

  2. Some Examples • Desktop and Server/Enterprise Space • Intel • AMD • SUN Microsystems • The Embedded Space: Freescale Semiconductor

  3. Focus • The Chip Level Architecture • What do we have on chip? • The Core Architecture • Note the presence/absence/configuration of concepts studied earlier in class • Rationalize the design decisions that led to the preceding • What can/should we expect next? • Building systems using multicore chips

  4. The Intel Core Duo Processor Series

  5. Intel Core Duo • Homogeneous cores • Bus-based on-chip interconnect • Shared memory • Traditional I/O • Cores: classic OOO (reservation stations, issue ports, schedulers, etc.) • L2 cache: large, shared, set associative, with prefetch, etc. Source: Intel Corp.

  6. Intel Core Duo: Vital Stats • 151 million transistors; shared 2 MB L2 cache • Each core has a 12-stage pipeline (Yonah) • Low-power (less than 25 watts) dual-core microprocessor • Supports Intel’s Vanderpool virtualization technology • EM64T (Intel x86-64 extensions) is not supported • Desktop market: not a severe limitation, because 64-bit OS and software support is still lacking • Sossaman processor for servers, which is based on Yonah, also lacks EM64T support → severe disadvantage • Communication between the L2 cache and both execution cores is handled by an arbitration bus unit • Eliminates cache coherency traffic over the FSB • Raises the core-to-L2 latency • The increase in clock frequency offsets the impact • Core processors communicate with the system chipset over a 667 MT/s front side bus (FSB), up from 533 MT/s used by the fastest Pentium M • Intel Core Solo uses the same two-core die as the Core Duo, but features only one active core • Chips failing quality control on one core can still be sold • Core 2 Duo processors will also include the ability to disable one core to conserve power
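
As a rough sanity check on the FSB figures in this slide, peak bandwidth is the transfer rate multiplied by the bus width. The sketch below assumes a 64-bit (8-byte) data bus, which the slide does not state; the function name is illustrative only.

```python
def fsb_peak_bandwidth_gbs(mega_transfers_per_s, bus_width_bytes=8):
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes).

    bus_width_bytes=8 is an assumption (64-bit data bus), not a figure
    taken from the slides.
    """
    return mega_transfers_per_s * 1e6 * bus_width_bytes / 1e9


core_duo = fsb_peak_bandwidth_gbs(667)   # 667 MT/s FSB (Core Duo)
pentium_m = fsb_peak_bandwidth_gbs(533)  # 533 MT/s FSB (fastest Pentium M)
print(f"Core Duo: {core_duo:.2f} GB/s, Pentium M: {pentium_m:.2f} GB/s")
```

Under that assumption the faster bus gives roughly 5.3 GB/s of peak bandwidth against roughly 4.3 GB/s; sustained bandwidth would be lower.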

  7. The Core™ micro-architecture Source: Ars Technica

  8. The Core Execution core Source: Ars Technica

  9. Intel Core Duo • High memory latency due to the lack of an on-die memory controller (further aggravated by the system chipset's use of DDR2 RAM) • Main-memory transactions have to pass through the Northbridge of the chipset • Higher latency than AMD's Turion platform • Weakness shared by the entire line of Pentium processors • L2 cache is quite effective at hiding main-memory latency • Execution units • Three 64-bit integer execution units • One CIU (complex) + two SIU (simple) • Two FPUs • Poor floating-point unit (FPU) throughput • Little improvement in performance per watt in single-threaded applications compared to its predecessor

  10. Core 2 Duo and Core Duo Source: Intel Corp. • Very similar architectures • Bump in the processor speed • Increase in Level 2 cache (2 MB to 4 MB) • Both chips are built on a 65 nm process and support a 667 MHz front-side bus (FSB) • 14-stage pipeline in Core 2 Duo (12 stages in Core Duo)

  11. Intel® Core™2 Duo Processor

  12. Intel Core 2 Duo Source: Hard Core Hardware

  13. Wide Dynamic Execution Source: Bit Tech

  14. Wide Dynamic Execution Source: Bit Tech

  15. Wide Dynamic Execution • Pipeline width of 4 instructions per cycle in each core (Pentium M and Pentium 4 NetBurst are 3 wide) • Delivery of more instructions per clock cycle • Pipeline depth of 14 stages vs. 31 in the Pentium 4 Prescott • Compromise between efficient execution of short instructions and long instructions • Ops fusion • Less work for the processor pipeline to run • Micro-ops fusion • Fuses micro-ops derived from x86 code so they travel through the pipeline as one • Macro-ops fusion • Works on the x86 instructions themselves, not just their micro-op derivatives • Instruction loads and micro-ops can be reduced by approximately 15% and 10%, respectively
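
The macro-ops fusion bullet can be illustrated with a toy model. This is a minimal sketch, assuming that only a compare or test immediately followed by a conditional jump is fused; the mnemonic strings and the `macro_fuse` helper are hypothetical and do not describe Intel's actual decoder rules.

```python
# Toy model: instructions are represented by their mnemonics only.
COMPARES = {"cmp", "test"}


def is_conditional_jump(mnemonic):
    # In this sketch, any mnemonic starting with 'j' other than the
    # unconditional 'jmp' counts as a conditional jump (je, jne, jl, ...).
    return mnemonic.startswith("j") and mnemonic != "jmp"


def macro_fuse(instructions):
    """Fuse each cmp/test that is directly followed by a conditional jump."""
    fused = []
    i = 0
    while i < len(instructions):
        cur = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if cur in COMPARES and nxt is not None and is_conditional_jump(nxt):
            fused.append(f"{cur}+{nxt}")  # the pair occupies one pipeline slot
            i += 2
        else:
            fused.append(cur)
            i += 1
    return fused


stream = ["mov", "add", "cmp", "jne", "mov", "test", "je", "jmp"]
fused_stream = macro_fuse(stream)
print(len(stream), "->", len(fused_stream), fused_stream)
```

In this made-up stream, eight instructions occupy six pipeline slots after fusion; the reduction on real code depends on how often such pairs occur.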

  16. Intelligent Power Capability Source: Bit Tech

  17. Intelligent Power Capability • SpeedStep technology • Dynamic clock speed reduction • Intel mobile processors include this already • Enhanced SpeedStep used in Core 2 Duo • Controller that turns on sections of the processor as needed. One core can be shut down for single-threaded applications • Power consumption decreased by enhancements to Intel's 65nm process node • Use low-k dielectrics and strained silicon • Use low-leakage and "sleep" transistors
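
To see why reducing clock speed (together with supply voltage) saves power, the standard first-order model of dynamic power, P proportional to C·V²·f, is sketched below. The ratios used are illustrative values, not SpeedStep operating points from the slides.

```python
def relative_dynamic_power(voltage_ratio, frequency_ratio):
    """Dynamic power relative to the nominal point, using P ~ C * V^2 * f.

    Both arguments are ratios to the nominal voltage and frequency;
    switched capacitance C is assumed unchanged.
    """
    return voltage_ratio ** 2 * frequency_ratio


# Illustrative only: half the clock at 80% of the nominal voltage.
scaled = relative_dynamic_power(voltage_ratio=0.8, frequency_ratio=0.5)
print(f"Relative dynamic power: {scaled:.2f}")
```

Because voltage enters quadratically, lowering it along with frequency cuts power far more than halving the clock alone would.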

  18. Advanced Smart Cache Source: Bit Tech

  19. Advanced Smart Cache Source: Bit Tech • Both cores share data stored in the L2 cache via an arbitration bus unit embedded in the cache • Dynamically allocates cache space between the two cores, minimising bus traffic by allowing both cores to access one copy of data • Does a larger L2 cache matter? • Studies point out that going from 2 MB to 4 MB improves execution time only slightly (2-4%) for most applications

  20. Smart Memory Access Source: Bit Tech

  21. Smart Memory Access • Improved prefetch units • Memory disambiguation • Allows re-ordering instructions more efficiently [Figures: memory aliasing; execution with and without memory disambiguation. Example from http://arstechnica.com/articles/paedia/cpu/core.ars/8] Source: Ars Technica
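
The idea behind memory disambiguation can be sketched as follows: a load may be moved ahead of earlier stores only if it does not alias any of them. The toy model below assumes all addresses are already known, whereas real hardware has to predict aliasing and recover from mispredictions; `hoistable_loads` is a hypothetical helper.

```python
def hoistable_loads(ops):
    """Return indices of loads that do not alias any earlier store.

    ops is a list of (kind, address) tuples in program order, where kind
    is 'store' or 'load'. A load that reads an address written by an
    earlier store must stay behind that store.
    """
    earlier_store_addrs = set()
    result = []
    for index, (kind, addr) in enumerate(ops):
        if kind == "store":
            earlier_store_addrs.add(addr)
        elif kind == "load" and addr not in earlier_store_addrs:
            result.append(index)
    return result


program = [
    ("store", 0x100),
    ("load", 0x200),   # different address: safe to reorder
    ("store", 0x300),
    ("load", 0x100),   # aliases the first store: must wait
]
print(hoistable_loads(program))
```

Without disambiguation, both loads would wait behind every earlier store, including the ones they do not depend on.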

  22. Advanced Digital Media Boost Source: Bit Tech

  23. Advanced Digital Media Boost • Streaming SIMD Extension (SSE) instructions • SSE instructions are an extension of the standard x86 instruction set • Utilized in multimedia encoding, decoding, image manipulation and encryption • 128-bit SSE instructions now execute on a full 128-bit datapath • Up from 64 bits • Double the SSE performance over the previous generation
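
The effect of the wider SSE datapath can be pictured with a small model: a 128-bit register holds four 32-bit lanes, and a narrower datapath has to process them in more than one pass. This is a sketch, and `packed_add` is a hypothetical helper rather than an SSE intrinsic.

```python
def packed_add(a, b, datapath_bits, lane_bits=32):
    """Add two packed vectors lane by lane; return (result, passes).

    The datapath processes datapath_bits // lane_bits lanes per pass.
    """
    lanes_per_pass = datapath_bits // lane_bits
    result = []
    passes = 0
    for start in range(0, len(a), lanes_per_pass):
        end = start + lanes_per_pass
        result.extend(x + y for x, y in zip(a[start:end], b[start:end]))
        passes += 1
    return result, passes


a = [1.0, 2.0, 3.0, 4.0]      # four 32-bit lanes = one 128-bit register
b = [10.0, 20.0, 30.0, 40.0]
sum_64, passes_64 = packed_add(a, b, datapath_bits=64)
sum_128, passes_128 = packed_add(a, b, datapath_bits=128)
print(passes_64, passes_128, sum_128)
```

Both datapaths produce the same result, but the 64-bit one needs two passes per 128-bit operation, which is where the doubling on the slide comes from.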

  24. Comparison of SSE to prior processors Source: Ars Technica

  25. Intel Conroe vs. Presler [Figure: Conroe and Presler side by side] • What is the major difference? • Shared L2 (Conroe) versus separate caches (Presler) Source: Bit Tech

  26. Intel’s Roadmap for Multicore [Figure: roadmap chart for 2006-2008 covering mobile, enterprise, and desktop processors, with single-core (SC), dual-core (DC), quad-core (QC), and eight-core (8C) parts and cache sizes from 512 KB to 16 MB, 45 nm parts among them. Source: Adapted from Tom’s Hardware] • Drivers are • Market segments • More cache • More cores • 80 core processor prototype has been designed!

  27. Intel Chipset Example Source: Extreme Tech

  28. References and Links • http://www.intel.com/products/processor/coreduo/ • http://en.wikipedia.org/wiki/Intel_Core • http://www.hothardware.com/viewarticle.aspx?articleid=845&cid=1 • http://www.bit-tech.net/hardware/2006/03/10/intel_core_microarchitecture/ • http://www.bit-tech.net/hardware/2006/05/19/intel_core_duo_t2600_on_the_desktop • http://www.bit-tech.net/hardware/2006/07/14/intel_core_2_duo_processors/ • http://www.hardcoreware.net/reviews/review-347-1.htm • http://www.trustedreviews.com/cpu-memory/review/2006/08/28/Intel-Core-2-Duo-Merom-Notebooks/p1 • http://www.trustedreviews.com/cpu-memory/review/2006/07/14/Intel-Core-2-Duo-Conroe-E6400-E6600-E6700-X6800/p1 • http://techreport.com/reviews/2006q2/core-duo/index.x?pg=1 • http://arstechnica.com/articles/paedia/cpu/core.ars/1 • http://www.anandtech.com/mobile/showdoc.aspx?i=2663&p=4 • http://www.extremetech.com/article2/0,1697,1988794,00.asp • http://www.coreduoinfo.com/blog/about-intel-core-duo/ • http://67.91.114.164/intel_c2d_info.htm • http://www.pcper.com/article.php?aid=272&type=expert

  29. AMD MultiCore Processors

  30. Dual Core AMD Opteron Source: AMD

  31. AMD Multicore (Dual-Core) Opteron [Diagram: Core 0 with 1 MB L2, Northbridge, Core 1 with 1 MB L2] • Two AMD Opteron CPU cores on a single die • Each has 1MB L2 cache • 90nm, ~205 million transistors • Approximately same die size as 130nm single-core AMD Opteron processor • 95 watt power envelope • Fits into 90nm power infrastructure • Introduced with “K8” Revision E core in April 2005 Source: AMD

  32. Opteron Core Pipeline Source: Chip Architect

  33. AMD Opteron Processor Core Architecture [Block diagram: L1 Icache 64KB; Fetch and Branch Prediction; Scan/Align/Decode; Fastpath and Microcode Engine; µops; Instruction Control Unit (72 entries); Int Decode & Rename and FP Decode & Rename; three reservation stations (Res) with AGUs and ALUs, plus MULT; 36-entry FP scheduler with FADD, FMUL, FMISC; 44-entry Load/Store Queue; L1 Dcache 64KB] Source: The 3D shop

  34. Dual Core AMD Opteron • AMD64 technology • Runs 32-bit applications and is 64-bit capable • Compatible with the x86 software infrastructure • Enables a single architecture across 32- and 64-bit environments • Direct Connect Architecture • NUMA system • Each processor shares its memory with other processors in the system • Integrated Memory Controller on-die • DDR2 DRAM memory controller offers memory BW up to 10.7 GB/s per processor • HyperTransport • Point-to-point interconnect can be used to build a mesh of multiple-processor Opteron systems • Scalable bandwidth interconnect between processors, I/O subsystems, and other chipsets • 24.0 GB/s peak bandwidth per processor
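
The two bandwidth figures on this slide can be reproduced with simple arithmetic. The sketch below assumes dual-channel DDR2 at 667 MT/s with 8 bytes per transfer, and three HyperTransport links at 8 GB/s each; none of these parameters is stated on this slide, so treat them as assumptions.

```python
def dram_peak_gbs(mega_transfers_per_s, channels, bytes_per_transfer=8):
    """Peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


def hypertransport_peak_gbs(links, gbs_per_link):
    """Aggregate peak bandwidth over all HyperTransport links."""
    return links * gbs_per_link


# Assumptions: dual-channel DDR2-667; three links at 8 GB/s each.
memory_bw = dram_peak_gbs(667, channels=2)
ht_bw = hypertransport_peak_gbs(links=3, gbs_per_link=8.0)
print(f"Memory: {memory_bw:.1f} GB/s, HyperTransport: {ht_bw:.1f} GB/s")
```

Under these assumptions the results match the quoted 10.7 GB/s and 24.0 GB/s per processor.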

  35. Dual Core AMD Opteron • Not a simple aggregation of K8 cores • Integrated the cores for efficiency • Dual-core Opteron acts very much like an SMP system • Compatible with existing single-threaded and multi-threaded (hyperthreaded) software • MOESI coherency protocol (O: “Owned”) • Updates through system request interface • SSE3 support with 10 new instructions • Quad-core upgradeability • Hardware-assisted AMD Virtualization • Optimized power management
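
The MOESI protocol named on this slide can be sketched as a per-line state machine. This is a textbook-style simplification for one cache line in one cache, not AMD's implementation; the event names and the `next_state` helper are hypothetical.

```python
STATES = {"M", "O", "E", "S", "I"}  # Modified, Owned, Exclusive, Shared, Invalid


def next_state(state, event, others_have_copy=False):
    """Return the next MOESI state of one cache line in one cache.

    Events: 'local_read', 'local_write' (from this core) and
    'remote_read', 'remote_write' (snooped from another cache).
    """
    if state not in STATES:
        raise ValueError(f"unknown state: {state}")
    if event == "local_write":
        return "M"                      # gain ownership; other copies are invalidated
    if event == "remote_write":
        return "I"                      # another cache takes the line
    if event == "local_read":
        if state == "I":                # miss: fetch the line
            return "S" if others_have_copy else "E"
        return state                    # hit: no change
    if event == "remote_read":
        # A dirty line is supplied to the reader and stays dirty as Owned.
        return {"M": "O", "O": "O", "E": "S", "S": "S", "I": "I"}[state]
    raise ValueError(f"unknown event: {event}")


trace = ["I"]
for event in ["local_read", "local_write", "remote_read", "remote_write"]:
    trace.append(next_state(trace[-1], event))
print(" -> ".join(trace))
```

The Owned state is what lets a modified line be shared with another cache without first writing it back to memory.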

  36. Dual Core AMD Opteron Source: Elec Design

  37. AMD Opteron (SOI) Source: Chip Architect

  38. AMD 64 bit Core • 1MB L2 Cache • Detailed discussion of the 64-bit core architecture at: • http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html

  39. Multiprocessor Systems using AMD Opteron [Diagrams: a legacy x86 system in which the CPUs share a bus to a Memory Controller Hub and I/O Hub (PCI-E bridges, USB, PCI), and an AMD64 system in which the CPUs connect through SRQ, crossbar, memory controller, and HyperTransport (HT) links at 8 GB/s to PCI-E bridges and I/O hubs] • Legacy x86 Architecture • CPUs, memory, and I/O all share a bus • Major bottleneck to performance • Faster CPUs or more cores for performance • Symmetric Multiprocessing • AMD64 Direct Connect Architecture • Eliminates FSB bottleneck • HyperTransport™ Technology interconnect for high bandwidth and low latency • Each CPU has its own memory • Each CPU can access the main memory of another processor, transparent to the programmer → Different from SMP Source: AMD
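
Because each CPU has its own memory but can reach every other CPU's memory, access cost depends on how many HyperTransport hops separate the requester from the memory's home node. The sketch below counts hops with a breadth-first search over a hypothetical four-socket square topology; the `LINKS` table is an assumption, not a layout taken from the slides.

```python
from collections import deque

# Hypothetical topology: four sockets connected in a square (0-1, 1-3, 3-2, 2-0).
LINKS = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}


def ht_hops(src, dst, links=LINKS):
    """Minimum number of point-to-point hops from socket src to socket dst."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for neighbor in links[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    raise ValueError("destination is not reachable")


for home in range(4):
    print(f"CPU 0 -> memory on node {home}: {ht_hops(0, home)} hop(s)")
```

Local memory costs zero hops, a neighbor's memory one hop, and the diagonal socket two hops, which is the non-uniformity that distinguishes this design from a bus-based SMP.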

  40. Multiprocessor Systems using AMD Opteron Source: XBitlabs

  41. Cache coherency Source: Chip Architect

  42. AMD Athlon 64 X2 Source: AMD

  43. References and Links • http://techreport.com/reviews/2005q2/opteron-x75/index.x?pg=1 • http://www.tomshardware.com/2005/06/03/dual_core_stress_test/index.html • http://www.a1-electronics.net/AMD_Section/CPUs/2005/AMD_Athlon64x2_Apr.shtml • http://en.wikipedia.org/wiki/Opteron • http://en.wikipedia.org/wiki/Athlon_64_X2 • http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_8796_14309,00.html • http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html • http://firingsquad.com/hardware/amd_dual-core_opteron_875/page2.asp • http://www.xbitlabs.com/articles/cpu/display/opteron-ws_4.html • http://www.extremetech.com/article2/0,1697,1675784,00.asp • http://www.elecdesign.com/Articles/Index.cfm?AD=1&ArticleID=11991 • http://www.the3dshop.com/userimages/amd_systems/opteron_dualcore.htm • http://www.nextcomputing.com/advantages/thruadv.shtml • http://arstechnica.com/news.ars/post/20060817-7535.html • http://www.bit-tech.net/hardware/2005/05/09/amd_a64x2_4800/1.html

  44. SUN – UltraSPARC Multicore

  45. SUN – UltraSPARC T1 • Eight cores, each 4-way threaded • 1.2 GHz • Cache • 16 KB, 4-way, 32 B line L1-I • 8 KB, 4-way, 16 B line L1-D • 3 MB on-chip L2 cache partitioned into four banks, served by four memory controllers • Data moves between the L2 and the cores through an integrated crossbar switch to provide high throughput Source: Sun
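
The L1 parameters on this slide determine the number of sets in each cache: sets = capacity / (ways × line size). The sketch below reads the 32 B and 16 B figures as line sizes, which is an interpretation of the slide's shorthand.

```python
def cache_sets(capacity_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache."""
    sets, remainder = divmod(capacity_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a multiple of ways * line size")
    return sets


l1i_sets = cache_sets(16 * 1024, ways=4, line_bytes=32)  # 16 KB L1-I
l1d_sets = cache_sets(8 * 1024, ways=4, line_bytes=16)   # 8 KB L1-D
print(l1i_sets, l1d_sets)
```

Both caches come out at 128 sets: the instruction cache is twice as large, but its lines are also twice as long.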

  46. SUN – UltraSPARC T1 Source: Sun

  47. SUN – UltraSPARC T1 Pipeline • T1's integer pipeline • Fetch, Thread Selection, Decode, Execute, Memory Access, Writeback Source: Sun
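
The Thread Selection stage can be sketched as picking, each cycle, one ready thread out of the four that share a core. The model below assumes plain round-robin among ready threads, which is a simplification chosen for illustration; the slides do not specify the selection policy, and `select_thread` is a hypothetical helper.

```python
def select_thread(last_selected, ready, num_threads=4):
    """Pick the next ready thread after last_selected in cyclic order.

    Returns None when no thread is ready (the pipeline idles that cycle).
    """
    for offset in range(1, num_threads + 1):
        candidate = (last_selected + offset) % num_threads
        if candidate in ready:
            return candidate
    return None


# Ready threads per cycle; a thread drops out while it waits, e.g. on a cache miss.
ready_per_cycle = [{0, 1, 2, 3}, {0, 2, 3}, {0, 2, 3}, {3}, set()]
schedule = []
last = 3  # start so that thread 0 is considered first
for ready in ready_per_cycle:
    chosen = select_thread(last, ready)
    schedule.append(chosen)
    if chosen is not None:
        last = chosen
print(schedule)
```

Switching threads every cycle lets the core keep issuing instructions while other threads wait on memory, which is the point of a 4-way threaded core.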

  48. SUN UltraSPARC T2 – Niagara 2 Source: Sun

  49. SUN UltraSPARC T2 • UltraSPARC T2 has 8 SPARC cores with 8 threads per core • 8-stage integer pipeline (as opposed to 6 for T1) • Twice the performance of T1 on a transactional workload (under the same power envelope) • Clock speed increased to 1.4 GHz from 1.2 GHz • One PCI Express port (x8 1.0) • Two 10 Gigabit Ethernet ports with packet classification and filtering • L2 cache size increased to 4 MB shared (8 banks, 16-way associative) • 1 floating point unit per core • Eight encryption engines • Four dual-channel FBDIMM memory controllers • 711 signal I/O pins, 1831 pins total

  50. UltraSparc T2 Core Microarchitecture Source: Realworld Tech
