Heterogeneous CPU Cores

Heterogeneous CPU Cores. March 11, 2014. Kevin Stewart, Derrik Huey, Shuai Xu. Outline: Introduction to Multi-cores; ARM big.LITTLE Technology; Multi-thread Programming. ECE 570 W14 – Heterogeneous CPU Cores.


Presentation Transcript


  1. Heterogeneous CPU Cores • March 11, 2014 • Kevin Stewart, Derrik Huey, Shuai Xu

  2. Outline • Introduction to Multi-cores • ARM big.LITTLE Technology • Multi-thread Programming • ECE 570 W14 – Heterogeneous CPU Cores

  3. Multi-Cores • Multi-cores and why they are needed • Cost and Power Benefits • Heterogeneous Cores and Homogeneous Cores • Who are the players in this field?

  4. Multi-cores and why they are needed

  5. Multi-cores and why they are needed • Multi-core designs emerged once frequency scaling stopped paying off • Higher clock rates ran into physical barriers: power consumption and heat dissipation • Adding a second core is easier than doubling the frequency

  6. Cost and Power Benefits

  7. Cost and Power Benefits • Active Power Dissipation (switching power) • Standby Power Dissipation
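
The slide names the two power components without giving their expressions in text. As a reference point, the standard first-order CMOS power model can be written as follows. This is the textbook form and not necessarily the exact expressions the authors showed; every symbol here is an assumption of this sketch.

```latex
P_{\text{active}} = \alpha \, C \, V_{DD}^{2} \, f
\qquad\qquad
P_{\text{standby}} = V_{DD} \, I_{\text{leak}}
```

Here $\alpha$ is the switching activity factor, $C$ the switched capacitance, $V_{DD}$ the supply voltage, $f$ the clock frequency, and $I_{\text{leak}}$ the leakage current. Because a higher $f$ usually also requires a higher $V_{DD}$, active power tends to grow faster than linearly with clock frequency, which is the usual argument for two slower cores over one core at twice the clock.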

  8. Cost and Power Benefits

  9. Heterogeneous and Homogeneous cores • Homogeneous: all cores are the same • Symmetric Multi-Processing (SMP) • Heterogeneous: the cores differ • Heterogeneous Multi-Processing (HMP) • Application Specific Processing (ASP) • System on Chip (SOC or SoC)

  10. Who are the players in the field? • The usual cast: • The mobile arena:

  11. Outline • Introduction to big.LITTLE • big and LITTLE cores • The challenge of cache coherency • Pairing big and LITTLE cores • Software challenges • Benchmarks and market overview

  12. Heterogeneous CPU cores • Dynamically adapt to computing needs • Combination of small and large core(s) • Large core(s) active: high performance • Small core(s) active: low power • ARM's proprietary implementation is called big.LITTLE

  13. Requirements for cores • Caches need to be compatible • Same fundamental architecture (code compatible) • Can have different micro-architectures • LITTLE core: Cortex-A7 • big core: Cortex-A15

  14. The LITTLE core: A7 • ARM Cortex-A7 micro-architecture • In-order execution • Dual issue • 8 to 10 stage pipeline • Figure from Ref. [1]

  15. The big core: A15 • ARM Cortex-A15 micro-architecture • Out-of-order execution • Triple issue • 15 to 24 stage pipeline • Figure from Ref. [1]

  16. The big core: A15 • Compared with the A7: • 4x larger area • 4x higher power consumption • 2-3x higher performance
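
The ratios quoted for the A15 imply an energy cost per task, which is what makes the LITTLE core attractive for light workloads. A rough worked estimate follows, assuming energy is power times run time and run time scales inversely with the quoted performance ratio $s$ (both simplifications; the symbols are introduced here for illustration only):

```latex
\frac{E_{\text{A15}}}{E_{\text{A7}}}
  = \frac{P_{\text{A15}}}{P_{\text{A7}}} \cdot \frac{t_{\text{A15}}}{t_{\text{A7}}}
  = \frac{4}{s}, \qquad s \in [2, 3]
  \;\Longrightarrow\;
  \frac{E_{\text{A15}}}{E_{\text{A7}}} \in \left[\tfrac{4}{3},\, 2\right]
```

Under these assumptions the big core finishes a task 2 to 3 times sooner but spends roughly 1.3 to 2 times the energy doing so.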

  17. Communication between cores • big and LITTLE cores need to be able to talk to each other • Cache coherency!

  18. Cache coherency • Figure from Ref. [7]

  19. Switching between cores • How does switching between cores work?

  20. Switching between cores • A task migrates in less than 20,000 cycles, i.e. under 20 µs at a 1 GHz clock • Figure from Ref. [9]

  21. Pairing of big.LITTLE cores • Switching threshold • Figure from Ref. [4]

  22. Pairing of big.LITTLE - Summary • Cluster Switching Mode • All tasks are assigned to one cluster while the other one is inactive • CPU Migration Mode (In-Kernel Switcher) • big and LITTLE cores are grouped in pairs • Heterogeneous Multi-Processing Mode (Global Task Scheduling) • Tasks are assigned to cores independently
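
The switching threshold in CPU Migration Mode can be pictured as a small decision function for one big/LITTLE pair. The sketch below is illustrative only: the load metric, the names `next_core` and `core_kind`, and both threshold values are assumptions, not taken from the slides or from any kernel source. It uses two thresholds (hysteresis) so that a load hovering around a single value does not cause constant migration.

```c
/* Hypothetical thresholds in percent load; real systems tune these per SoC. */
#define UP_THRESHOLD   80
#define DOWN_THRESHOLD 30

typedef enum { CORE_LITTLE, CORE_BIG } core_kind;

/* Decide which core of a big/LITTLE pair should run next.
 * Move up only above UP_THRESHOLD, move down only below DOWN_THRESHOLD,
 * and otherwise stay on the current core. */
core_kind next_core(core_kind current, int load_percent)
{
    if (current == CORE_LITTLE && load_percent > UP_THRESHOLD)
        return CORE_BIG;
    if (current == CORE_BIG && load_percent < DOWN_THRESHOLD)
        return CORE_LITTLE;
    return current;
}
```

With these values, a pair running on the LITTLE core at 50% load stays there, and a pair running on the big core at 50% load also stays there; only crossing 80% upward or 30% downward triggers a switch.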

  23. Software challenges • Task scheduling • The operating system needs to assign tasks to specific cores • Can reuse the drivers already available for Dynamic Voltage and Frequency Scaling (DVFS)
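
On Linux, a user-space program can restrict where it runs with `sched_setaffinity`. The sketch below is a minimal illustration of assigning a task to a specific core; the helper name `pin_to_cpu` is hypothetical, and which CPU numbers correspond to big or LITTLE cores is platform-specific and must be looked up on the target system.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>

/* Pin the calling thread to one logical CPU.
 * Returns 0 on success, -1 on error with errno set.
 * The mapping from CPU number to big/LITTLE core is an assumption
 * the caller has to supply; it differs between SoCs. */
int pin_to_cpu(int cpu)
{
    if (cpu < 0 || cpu >= CPU_SETSIZE) {
        errno = EINVAL;
        return -1;
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling thread */
}
```

In the Heterogeneous Multi-Processing mode described earlier, the kernel scheduler normally makes this placement decision itself; explicit pinning like this is mainly useful for experiments and benchmarks.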

  24. Software challenges • Cluster Switching and CPU Migration are implemented in the Linux kernel and Android • Heterogeneous Multi-Processing support was still in development as of 2013

  25. Benchmarks • Geekbench 3 • Higher performance with similar power consumption • Figures from Ref. [1], [8]

  26. Applications

  27. Conclusion • ARM big.LITTLE technology • Can be combined with other power-saving techniques such as DVFS or power/clock gating • Cluster Switching Mode • CPU Migration • Heterogeneous Multi-Processing

  28. Multi-thread Operating System • Multi-thread Core • Multi-thread Programming • Pthread • GPU Programming • C++ AMP

  29. Multi-thread Operating System • Thread • A thread is essentially a single sequence of instructions • Single-thread OS • Only one task can run at a time • Example: DOS • Low CPU utilization • Multi-thread OS • Several threads can exist at the same time, which makes multitasking possible • Higher CPU utilization

  30. Multi-thread Core • Intel Hyper-Threading (HT) Technology • A form of simultaneous multithreading (SMT) • According to Intel's report, Hyper-Threading used only 5% more die area than the comparable non-hyperthreaded processor, while performance was 15–30% better • In some situations the technology can reduce the performance of a physical processor or increase power consumption

  31. Multi-thread Programming

  32. Multi-thread Programming • Single-thread code

      #include <stdio.h>
      #include <time.h>

      /* M, N, matrixA (M x N) and matrixB (N x M) are defined elsewhere. */
      int main(void)
      {
          clock_t start = clock();
          int res[M][M] = {0};                 /* M x M result matrix */
          int i, j, k;
          for (i = 0; i < M; i++)
              for (j = 0; j < M; j++)
                  for (k = 0; k < N; k++)
                      res[i][j] += matrixA[i][k] * matrixB[k][j];   /* accumulate the dot product */
          clock_t finish = clock();
          printf("Time use: %.2f s\n", (double)(finish - start) / CLOCKS_PER_SEC);
          return 0;
      }

      It takes about 0.07 s to multiply two random matrices of size (200,300) and (300,200).

  33. Multi-thread Programming • Multi-thread code (Pthread)

      /* func, tids, arr, res, num_p, M and N are defined elsewhere. */
      int ids[num_p];                          /* one stable argument per thread */
      for (i = 0; i < num_p; i++) {
          ids[i] = i;                          /* passing &i would race with the loop counter */
          int err = pthread_create(&tids[i], NULL, func, &ids[i]);   /* create a thread */
          if (err) {                           /* pthread_create returns the error code */
              fprintf(stderr, "pthread_create: %s\n", strerror(err));
              exit(1);
          }
      }
      for (i = 0; i < num_p; i++)
          pthread_join(tids[i], NULL);         /* wait for all threads to finish */
      for (i = 0; i < M; i++)
          for (j = 0; j < M; j++)
              for (k = 0; k < N; k++)
                  res[i][j] += arr[i][j][k];   /* add the partial results together */

      It takes about 0.02 s to multiply two random matrices of size (200,300) and (300,200) when using 4 threads.
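
The slide does not show the worker function `func`, so the following is a self-contained sketch of one common way to split the same multiplication across Pthreads: each thread computes a contiguous band of result rows and writes directly into the output, which avoids the serial summation step. It is not the authors' implementation; the names `band_job`, `band_worker` and `matmul_pthreads` and the row-major flat-array layout are assumptions made for this example.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical job descriptor: one thread computes rows [row_begin, row_end). */
typedef struct {
    const int *a;            /* m x n, row-major */
    const int *b;            /* n x p, row-major */
    int *c;                  /* m x p, row-major */
    int n, p;
    int row_begin, row_end;
} band_job;

static void *band_worker(void *arg)
{
    const band_job *job = arg;
    for (int i = job->row_begin; i < job->row_end; i++)
        for (int j = 0; j < job->p; j++) {
            int sum = 0;
            for (int k = 0; k < job->n; k++)
                sum += job->a[i * job->n + k] * job->b[k * job->p + j];
            job->c[i * job->p + j] = sum;   /* rows are disjoint, so no locking is needed */
        }
    return NULL;
}

/* Computes c = a * b using num_threads threads.
 * Returns 0 on success, -1 if allocation or thread creation fails. */
int matmul_pthreads(const int *a, const int *b, int *c,
                    int m, int n, int p, int num_threads)
{
    if (num_threads < 1)
        return -1;
    pthread_t *tids = malloc(num_threads * sizeof *tids);
    band_job *jobs = malloc(num_threads * sizeof *jobs);
    if (!tids || !jobs) {
        free(tids);
        free(jobs);
        return -1;
    }
    int started = 0, status = 0;
    for (int t = 0; t < num_threads; t++) {
        /* Split the m rows into num_threads nearly equal bands. */
        jobs[t] = (band_job){ a, b, c, n, p,
                              t * m / num_threads, (t + 1) * m / num_threads };
        if (pthread_create(&tids[t], NULL, band_worker, &jobs[t]) != 0) {
            status = -1;
            break;
        }
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(tids[t], NULL);        /* wait for every started thread */
    free(tids);
    free(jobs);
    return status;
}
```

Each thread receives its own `band_job`, so no argument is shared between threads, and a band may be empty when there are more threads than rows.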

  34. GPU Programming • CUDA (Compute Unified Device Architecture) • Introduced by NVIDIA in 2006 and presented as the world's first solution for general-purpose computing on GPUs • Requires a special hardware architecture • Supports NVIDIA GPUs only • Supports C/C++, C#, Python, Fortran • OpenCL (Open Computing Language) • Introduced by Apple in 2008 • Supports almost all GPUs • Supports C only (kernels are written in OpenCL C)

  35. GPU Programming

      /* Host-side wrapper; MatrixMulKernel and TILE_WIDTH are defined elsewhere. */
      extern "C" void MatrixMultiplication_CUDA(const float* M, const float* N, float* P, int Width)
      {
          cudaSetDevice(0);
          float *Md, *Nd, *Pd;
          size_t size = (size_t)Width * Width * sizeof(float);
          cudaMalloc((void**)&Md, size);
          cudaMalloc((void**)&Nd, size);
          cudaMalloc((void**)&Pd, size);
          // Copy the input matrices from host memory to device memory
          cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
          cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
          // Launch Width x Width threads; assumes Width is a multiple of TILE_WIDTH
          dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
          dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
          MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
          // Copy the result back to host memory
          cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
          cudaFree(Md);
          cudaFree(Nd);
          cudaFree(Pd);
      }
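
The grid size on this slide is computed as `Width / TILE_WIDTH`, which is an integer division and therefore only covers the whole matrix when `Width` is a multiple of `TILE_WIDTH`. A common way to handle other sizes is to round the division up. The helper below is a plain C sketch of that arithmetic; the name `tiles_needed` is hypothetical and not part of CUDA.

```c
/* Number of tiles needed to cover `width` elements (ceiling division).
 * tile_width must be greater than zero. */
unsigned tiles_needed(unsigned width, unsigned tile_width)
{
    return (width + tile_width - 1) / tile_width;
}
```

For example, a width of 200 with 16-wide tiles needs 13 tiles per dimension, whereas plain division gives 12 and leaves the last 8 rows and columns uncomputed. A kernel launched on the rounded-up grid then has to check that its row and column indices are below `Width` before reading or writing.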

  36. C++ AMP • C++ Accelerated Massive Parallelism • Introduced by Microsoft in 2012 • Supported only by Visual Studio 2012 (VS 11) or later • True heterogeneous programming: can use both CPU and GPU • Automatically controls how many threads run in parallel

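  37. C++ AMP

      // Needs <amp.h> and "using namespace concurrency;"; M, W, N, vA, vB and vC are defined elsewhere.
      array_view<const int, 2> a(M, W, vA), b(W, N, vB);   // read-only views of the inputs
      array_view<int, 2> c(M, N, vC);                      // view of the output
      c.discard_data();                                    // old contents of vC need not be copied to the accelerator
      parallel_for_each(c.extent, [=](index<2> idx) restrict(amp) {
          int row = idx[0];
          int col = idx[1];
          int sum = 0;
          for (int i = 0; i < b.extent[0]; i++)
              sum += a(row, i) * b(i, col);
          c[idx] = sum;
      });
      c.synchronize();                                     // copy the result back into vC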

  38. Summary • Heterogeneous CPU Cores • Circumvent the “Power Wall” • Next step after Homogeneous Multi-Cores • ARM big.LITTLE is one heterogeneous solution • Many of the challenges of Homogeneous Multi-Cores still apply • Finding ILP • Writing parallel programs

  39. Questions?

  40. References
      [1] P. Greenhalgh, “big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7,” ARM White Paper, Sep. 2011.
      [2] MediaTek, “MediaTek Enables ARM big.LITTLE™ Heterogeneous Multi-Processing Technology in Mobile SoCs,” MediaTek White Paper, 2013.
      [3] “ARM Processors: Combining large and small compu... | ARM Connected Community.” [Online]. Available: http://community.arm.com/groups/processors/blog/2011/10/19/combining-large-and-small-compute-engines--arm-cortex-a7. [Accessed: 17-Feb-2014].
      [4] “ARM Processors: Ten Things to Know About big.LI... | ARM Connected Community.” [Online]. Available: http://community.arm.com/groups/processors/blog/2013/06/18/ten-things-to-know-about-biglittle. [Accessed: 18-Feb-2014].
      [5] “big.LITTLE Processing - ARM.” [Online]. Available: http://www.arm.com/products/processors/technologies/biglittleprocessing.php. [Accessed: 15-Feb-2014].
      [6] “ARM Processors: big.LITTLE and AMBA 4 ACE keep ... | ARM Connected Community.” [Online]. Available: http://community.arm.com/groups/processors/blog/2011/11/10/biglittle-and-amba-4-ace-keep-your-cache-warm-and-avoid-flushes. [Accessed: 17-Feb-2014].
      [7] “CoreLink CCI-400 Cache Coherent Interconnect - ARM.” [Online]. Available: http://www.arm.com/products/system-ip/interconnect/corelink-cci-400.php. [Accessed: 15-Feb-2014].
      [8] H. Chung, M. Kang, and H.-D. Cho, “Heterogeneous Multi-Processing Solution of Exynos 5 Octa with ARM® big.LITTLE™ Technology,” Samsung White Paper, Nov. 2013.
      [9] A. Stevens, “Introduction to AMBA® 4 ACE™ and big.LITTLE™ Processing Technology,” ARM White Paper, http://www.arm.com, Jun. 2011.
      [10] “Software Techniques for ARM big.LITTLE Systems | ARM Connected Community.” [Online]. Available: http://community.arm.com/docs/DOC-2875. [Accessed: 17-Feb-2014].
      [11] “Linux support for ARM big.LITTLE [LWN.net].” [Online]. Available: http://lwn.net/Articles/481055/. [Accessed: 17-Feb-2014].
      [12] “A big.LITTLE scheduler update [LWN.net].” [Online]. Available: http://lwn.net/Articles/501501/. [Accessed: 17-Feb-2014].
