Advanced Computing Techniques & Applications

Dr. Bo Yuan

E-mail: [email protected]


Course profile

Course Profile

  • Lecturer: Dr. Bo Yuan

  • Contact

    • Phone: 2603 6067

    • E-mail: [email protected]

    • Room: F-301B

  • Time: 10:25 am – 12:00 pm, Friday

  • Venue: CI-208

  • Teaching Assistant

    • Mr. Shiquan Yang


We will study ...

  • MPI

    • Message Passing Interface

    • API for distributed memory parallel computing (multiple processes)

    • The dominant model used in cluster computing

  • OpenMP

    • Open Multi-Processing

    • API for shared memory parallel computing (multiple threads)

  • GPU Computing with CUDA

    • Graphics Processing Unit

    • Compute Unified Device Architecture

    • API for shared memory parallel computing in C (multiple threads)

  • Parallel Matlab

    • A popular high-level technical computing language and interactive environment
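
To give a first concrete taste of the MPI item above, here is a minimal sketch of an MPI "hello world" in C; the file name and build command (e.g. mpicc hello.c, then mpirun -np 4 ./a.out) are illustrative assumptions, not course material.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);                  /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();                          /* shut down the MPI runtime */
    return 0;
}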


Aims & Objectives

  • Learning Objectives

    • Understand the main issues and core techniques in parallel computing.

    • Obtain first-hand experience in Cloud Computing.

    • Be able to develop MPI-based parallel programs.

    • Be able to develop OpenMP-based parallel programs.

    • Be able to develop GPU-based parallel programs.

    • Be able to develop Matlab-based parallel programs.

  • Graduate Attributes

    • In-depth Knowledge of the Field of Study

    • Effective Communication

    • Independence and Teamwork

    • Critical Judgment


Learning Activities

  • Lecture (10)

    • Introduction (3)

    • MPI and OpenMP (3)

    • GPU Computing (3)

    • Invited Talk (1)

  • Practice (3)

    • GPU Programming (1)

    • Cloud Computing (1)

    • Parallel Matlab (1)

  • Others (2)

    • Industry Tour (1)

    • Final Exam (1)


Assessment

  • Final Exam

    • Weight: 50%

  • Assignment 1

    • Weight: 20%

    • Task: Parallel Programming using MPI

    • Type: Individual

  • Assignment 2

    • Weight: 10%

    • Task: Parallel Programming using OpenMP

    • Type: Individual

  • Assignment 3

    • Weight: 20%

    • Task: Parallel Programming using CUDA

    • Type: Individual


Learning Resources

  • Books

    • http://www.mcs.anl.gov/~itf/dbpp/

    • https://computing.llnl.gov/tutorials/parallel_comp/

    • http://www-users.cs.umn.edu/~karypis/parbook/

  • Journals

    • http://www.computer.org/tpds

    • http://www.journals.elsevier.com/parallel-computing/

    • http://www.journals.elsevier.com/journal-of-parallel-and-distributed-computing/

  • Amazon Cloud Computing Services

    • http://aws.amazon.com

  • CUDA

    • http://developer.nvidia.com


Rules & Policies

  • Plagiarism

    • Plagiarism is the act of misrepresenting as one's own original work the ideas, interpretations, words or creative works of another.

    • Direct copying of paragraphs, sentences, a single sentence or significant parts of a sentence.

    • Presenting work done in collaboration with others as one's own independent work.

    • Copying ideas, concepts, research results, computer codes, statistical tables, designs, images, sounds or text or any combination of these.

    • Paraphrasing, summarizing or simply rearranging another person's words or ideas without changing the basic structure and/or meaning of the text.

    • Copying or adapting another student's original work into a submitted assessment item.


Rules & Policies

  • Late Submission

    • Late submissions will incur a penalty of 10% of the total marks for each day that the submission is late (including weekends). Submissions more than 5 days late will not be accepted.

  • Assumed Background

    • Acquaintance with C language is essential.

    • Knowledge of computer architecture is beneficial.

  • We have CUDA-capable GPU cards available!


Half Adder

A: Augend, B: Addend

S: Sum, C: Carry
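
The adder circuit itself appears only as a figure in the original slides; as a standard recap (not from the slide), the half adder computes S = A ⊕ B and C = A · B.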


Full Adder
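
Likewise, the full adder (again a standard recap, the figure is not reproduced) takes a carry-in Cin and computes S = A ⊕ B ⊕ Cin and Cout = A·B + Cin·(A ⊕ B).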


SR Latch


Address Decoder



ENIAC: Electronic Numerical Integrator And Computer

  • Programming

    • Programmable

    • Switches and Cables

    • Reprogramming usually took days.

    • I/O: Punched Cards

  • Speed (10-digit decimal numbers)

    • Machine Cycle: 5000 cycles per second

    • Multiplication: 357 times per second

    • Division/Square Root: 35 times per second


Stored-Program Computer


Personal Computers in the 1980s

(Figures: BASIC; IBM PC/AT.)


Top 500 Supercomputers

(Figure: performance chart, measured in GFLOPS.)


Cost of Computing


Complexity of Computing

  • A: 10×100, B: 100×5, C: 5×50

  • (AB)C vs. A(BC)

  • A: N×N, B: N×N, C = AB

  • Time Complexity: O(N³)

  • Space Complexity: O(1)
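
To make the (AB)C vs. A(BC) comparison concrete (a worked count of scalar multiplications, not from the slide): computing AB costs 10·100·5 = 5,000 and multiplying the 10×5 result by C costs 10·5·50 = 2,500, so (AB)C needs 7,500 multiplications; computing BC costs 100·5·50 = 25,000 and multiplying A by the 100×50 result costs 10·100·50 = 50,000, so A(BC) needs 75,000 — ten times the work for the same product.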


Why Parallel Computing?

  • Why we need ever-increasing performance:

    • Big Data Analysis

    • Climate Modeling

    • Gaming

  • Why we need to build parallel systems:

    • Increasing the clock speed of integrated circuits → overheating

    • Increasing the number of transistors → multi-core

  • Why we need to learn parallel programming:

    • Running multiple instances of the same program is unlikely to help.

    • Need to rewrite serial programs to make them parallel.


Sum Example

(Figure: eight values 8, 19, 7, 15, 7, 13, 12, 14 held by cores 0–7; core 0 collects them and computes the total 95.)


Sum Example

(Figure: the same eight values summed by a tree reduction. Step 1: cores 0, 2, 4, 6 hold the pairwise sums 27, 22, 20, 26. Step 2: cores 0 and 4 hold 49 and 46. Step 3: core 0 holds the total 95 — log₂ 8 = 3 combining steps instead of 7 sequential additions.)
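
A minimal C sketch of the tree-style combination in the figure (written serially here for clarity; in the parallel version each stride step would be carried out by the surviving cores simultaneously):

#include <stdio.h>

int main(void) {
    int x[8] = {8, 19, 7, 15, 7, 13, 12, 14};

    /* Tree reduction: at each step, element i absorbs element i + stride. */
    for (int stride = 1; stride < 8; stride *= 2)
        for (int i = 0; i + stride < 8; i += 2 * stride)
            x[i] += x[i + stride];

    printf("sum = %d\n", x[0]);   /* prints 95 */
    return 0;
}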


Levels of Parallelism

  • Embarrassingly Parallel

    • No dependency or communication between parallel tasks

  • Coarse-Grained Parallelism

    • Infrequent communication, large amounts of computation

  • Fine-Grained Parallelism

    • Frequent communication, small amounts of computation

    • Greater potential for parallelism

    • More overhead

  • Not Parallel

    • Giving life to a baby takes 9 months.

    • Can this be done in 1 month by having 9 women?


Data Decomposition

(Figure: the data divided between 2 cores.)


Granularity

(Figure: the data divided among 8 cores.)


Coordination

  • Communication

    • Sending partial results to other cores

  • Load Balancing

    • Wooden Barrel Principle: the slowest core limits overall progress.

  • Synchronization

    • Race Condition (see the sketch below)
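
As a minimal illustration of a race condition (a sketch assuming a POSIX threads environment, not course code): two threads increment a shared counter without synchronization, so the final value is usually well below the expected 2,000,000.

#include <pthread.h>
#include <stdio.h>

long counter = 0;                    /* shared, unprotected */

void *work(void *arg) {
    for (int i = 0; i < 1000000; i++)
        counter++;                   /* read-modify-write: not atomic */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, work, NULL);
    pthread_create(&t1, NULL, work, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}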


Data Dependency

  • Bernstein's Conditions: two program segments can run in parallel if neither one reads what the other writes and they do not write to the same locations.

  • Examples

Flow Dependency, Output Dependency

function Dep(a, b)
    c = a·b
    d = 3·c        // flow dependency: d needs the c written on the previous line
end function

function NoDep(a, b)
    c = a·b
    d = 3·b
    e = a+b        // no dependencies: these statements can run in any order
end function


What is not parallel?

Loop-Carried Dependence

for (k=5; k<N; k++) {
    b[k] = DoSomething(k);
    a[k] = b[k-5] + MoreStuff(k);
}

Recurrences

for (i=1; i<N; i++)
    a[i] = a[i-1] + b[i];

Atypical Loop-Carried Dependence

wrap = a[0]*b[0];
for (i=1; i<N; i++) {
    c[i] = wrap;
    wrap = a[i]*b[i];
    d[i] = 2*wrap;
}

Solution (each iteration recomputes wrap, so the iterations become independent)

for (i=1; i<N; i++) {
    wrap = a[i-1]*b[i-1];
    c[i] = wrap;
    wrap = a[i]*b[i];
    d[i] = 2*wrap;
}


What is not parallel?

Induction Variables

i1 = 4;
i2 = 0;
for (k=1; k<N; k++) {
    B[i1++] = function1(k,q,r);
    i2 += k;
    A[i2] = function2(k,r,q);
}

Solution (replace the induction variables with closed-form expressions in k)

i1 = 4;
i2 = 0;
for (k=1; k<N; k++) {
    B[k+3] = function1(k,q,r);
    i2 = (k*k+k)/2;
    A[i2] = function2(k,r,q);
}


Types of Parallelism

  • Instruction Level Parallelism

  • Task Parallelism

    • Different tasks on the same/different sets of data

  • Data Parallelism

    • Similar tasks on different sets of the data

  • Example

    • 5 TAs, 100 exam papers, 5 questions

    • How to make it task parallelism?

    • How to make it data parallelism?


Assembly Line

  • How long does it take to produce a single car?

  • How many cars can be operated at the same time?

  • How long is the gap between producing the first and the second car?

  • The longest stage on the assembly line determines the throughput.

(Figure: an assembly line whose stages take 15, 20 and 5 time units; the 20-unit stage sets the throughput.)


Instruction Pipeline

1: Add 1 to R5.

2: Copy R5 to R6.

(Instruction 2 needs the result of instruction 1, so the pipeline must stall or forward the value.)

  • IF: Instruction fetch

  • ID: Instruction decode and register fetch

  • EX: Execute

  • MEM: Memory access

  • WB: Register write back


Superscalar


Computing Models

  • Concurrent Computing

    • Multiple tasks can be in progress at any instant.

  • Parallel Computing

    • Multiple tasks can be run simultaneously.

  • Distributed Computing

    • Multiple programs on networked computers work collaboratively.

  • Cluster Computing

    • Homogenous, Dedicated, Centralized

  • Grid Computing

    • Heterogeneous, Loosely Coupled, Autonomous, Geographically Distributed


Concurrent vs. Parallel

(Figure: concurrent execution interleaves Job 1 and Job 2 on a single core; parallel execution runs Job 1 and Job 2 at the same time on Core 1 and Core 2; with four jobs on two cores, Jobs 1–4 are both distributed across the cores and interleaved on each.)


Process & Thread

  • Process

    • An instance of a computer program being executed.

  • Threads

    • The smallest units of execution scheduled by the OS

    • Exist as a subset of a process.

    • Share the same resources from the process.

    • Switching between threads is much faster than switching between processes.

  • Multithreading

    • Better use of computing resources

    • Concurrent execution

    • Makes the application more responsive

(Figure: two threads within one process.)


Parallel Processes

(Figure: one program launched as Process 1, Process 2 and Process 3 on Node 1, Node 2 and Node 3 — Single Program, Multiple Data.)


Parallel Threads
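
The slide's diagram is not reproduced; as a minimal sketch of thread-level parallelism with OpenMP (the array size and the values computed are illustrative; assumes an OpenMP-enabled compiler, e.g. gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];

    /* The loop iterations are divided among the threads in the team. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;

    printf("computed %d elements with up to %d threads\n",
           N, omp_get_max_threads());
    return 0;
}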


Graphics Processing Unit


CPU vs. GPU


CUDA
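
The CUDA slides themselves are figures; as a minimal sketch of the CUDA programming model (the array size, block size and kernel name are illustrative assumptions), a kernel is launched over a grid of threads, each handling one array element:

#include <stdio.h>

__global__ void add_one(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (i < n)
        x[i] += 1.0f;
}

int main(void) {
    const int n = 1024;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));   /* unified (managed) memory */
    for (int i = 0; i < n; i++) x[i] = i;

    add_one<<<(n + 255) / 256, 256>>>(x, n);    /* 4 blocks of 256 threads */
    cudaDeviceSynchronize();                    /* wait for the kernel */

    printf("x[10] = %f\n", x[10]);              /* prints 11.0 */
    cudaFree(x);
    return 0;
}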



GPU Computing Showcase


MapReduce vs. GPU

  • Pros of MapReduce:

    • Runs on clusters of hundreds or thousands of commodity computers.

    • Can handle massive amounts of data with fault tolerance.

    • Minimal effort required from programmers: Map & Reduce

  • Cons of MapReduce:

    • Intermediate results are stored on disk and transferred over network links.

    • Only suitable for independent or loosely coupled jobs.

    • High upfront hardware cost and operational cost

    • Low efficiency: GFLOPS per Watt, GFLOPS per Dollar


Parallel Computing in Matlab

Serial version:

for i=1:1024
    A(i) = sin(i*2*pi/1024);
end
plot(A);

Parallel version (parfor with a pool of 3 workers):

matlabpool open local 3
parfor i=1:1024
    A(i) = sin(i*2*pi/1024);
end
plot(A);
matlabpool close


GPU Computing in Matlab

http://www.mathworks.cn/discovery/matlab-gpu.html


Cloud Computing


Everything is Cloud …


Five Attributes of Cloud Computing

  • Service Based

    • What the service needs to do is more important than how the technologies are used to implement the solution.

  • Scalable and Elastic

    • The service can scale capacity up or down as the consumer demands, in a fully automated way.

  • Shared

    • Services share a pool of resources to build economies of scale.

  • Metered by Use

    • Services are tracked with usage metrics to enable multiple payment models.

  • Uses Internet Technologies

    • The service is delivered using Internet identifiers, formats and protocols.


Flynn’s Taxonomy

  • Single Instruction, Single Data (SISD)

    • von Neumann System

  • Single Instruction, Multiple Data (SIMD)

    • Vector Processors, GPU

  • Multiple Instruction, Single Data (MISD)

    • Generally used for fault tolerance

  • Multiple Instruction, Multiple Data (MIMD)

    • Distributed Systems

    • Single Program, Multiple Data (SPMD)

    • Multiple Program, Multiple Data (MPMD)



Von Neumann Architecture

Harvard Architecture


Inside a PC ...

Front-Side Bus (Core 2 Extreme): 8 B × 400 MHz × 4/cycle = 12.8 GB/s

Memory (DDR3-1600): 8 B × 200 MHz × 4 × 2/cycle = 12.8 GB/s

PCI Express 3.0 (×16): 1 GB/s per lane × 16 lanes = 16 GB/s


Shared Memory System

(Figure: several CPUs connected through an interconnect to a single shared memory.)


Non-Uniform Memory Access

(Figure: two nodes, each with two cores, an interconnect and its own memory; each node reaches its own memory by local access and the other node's memory by remote access.)


Distributed Memory System

(Figure: each CPU has its own private memory; the CPU–memory pairs are connected by communication networks.)


Crossbar Switch

(Figure: a crossbar connecting processors P1–P4 to memory modules M1–M4, so any processor can reach any memory module through its own switch point.)


Cache

  • Component that transparently stores data so that future requests for that data can be served faster.

    • Compared to main memory: smaller, faster, more expensive

    • Spatial Locality

  • Cache Line

    • A block of data that is accessed together

  • Cache Miss

    • Failed attempts to read or write a piece of data in the cache

    • Main memory access required

    • Read Miss, Write Miss

    • Compulsory Miss, Capacity Miss, Conflict Miss


Writing Policies: Write-Through vs. Write-Back


Cache Mapping

Direct Mapped

2-Way Associative


Cache Miss

Row-Major Access

#define MAX 4
double A[MAX][MAX], x[MAX], y[MAX];

/* Initialize A and x, assign y=0 */
for (i=0; i<MAX; i++)
    for (j=0; j<MAX; j++)
        y[i] += A[i][j]*x[j];

Column-Major Access

/* Assign y=0 */
for (j=0; j<MAX; j++)
    for (i=0; i<MAX; i++)
        y[i] += A[i][j]*x[j];

(Figure: cache and main memory.)

How many cache misses does each version incur?


Cache Coherence

(Figure: Core 0 with Cache 0 and Core 1 with Cache 1 are connected to shared memory through an interconnect; a shared variable x = 2 is cached by both cores, which compute y0, y1 and then z1 from it.)

What is the value of z1?

With the write-through policy …

With the write-back policy …


False Sharing

Serial version:

int i, j, m, n;
double y[m];

/* Assign y=0 */
for (i=0; i<m; i++)
    for (j=0; j<n; j++)
        y[i] += f(i, j);

Two-core version:

/* Private variables */
int i, j, iter_count;

/* Shared variables */
int m, n, core_count;
double y[m];

iter_count = m/core_count;

/* Core 0 does this */
for (i=0; i<iter_count; i++)
    for (j=0; j<n; j++)
        y[i] += f(i, j);

/* Core 1 does this */
for (i=iter_count; i<2*iter_count; i++)
    for (j=0; j<n; j++)
        y[i] += f(i, j);

With m = 8 and two cores, the eight doubles of y (64 bytes) fit in a single 64-byte cache line, so every update by one core invalidates that line in the other core's cache even though the cores never touch the same element.
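
One common remedy (a sketch under an assumed OpenMP setup, not from the slides) is to pad each core's accumulator to its own cache line and combine the partial results once at the end:

#include <stdio.h>
#include <omp.h>

#define CACHE_LINE 64

/* Each thread accumulates into its own cache-line-sized slot, so no two
   threads write to the same cache line; the slots are merged afterwards. */
struct padded {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
};

int main(void) {
    enum { M = 8, N = 1000 };
    struct padded partial[2] = {{0.0}, {0.0}};

    #pragma omp parallel num_threads(2)
    {
        int t = omp_get_thread_num();
        for (int i = t * (M / 2); i < (t + 1) * (M / 2); i++)
            for (int j = 0; j < N; j++)
                partial[t].value += i + j;   /* stands in for f(i, j) */
    }

    double total = partial[0].value + partial[1].value;
    printf("total = %f\n", total);
    return 0;
}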


Virtual Memory

  • Virtualization of various forms of computer data storage into a unified address space

    • Logically increases the capacity of main memory (e.g., DOS can only access 1 MB of RAM).

  • Page

    • A block of continuous virtual memory addresses

    • The smallest unit to be swapped in/out of main memory from/into secondary storage.

  • Page Table

    • Used to store the mapping between virtual addresses and physical addresses.

  • Page Fault

    • The accessed page is not in the physical memory.


Interleaving Statements

(Figure: two threads T0 and T1 each execute statement s1 followed by s2; the figure enumerates the ways the scheduler can interleave the four statements while preserving each thread's own s1-before-s2 order.)


Critical Region

  • A portion of code where shared resources are accessed and updated.

  • Resources: data structures (variables), devices (e.g., a printer)

  • Threads are disallowed from entering the critical region when another thread is occupying the critical region.

  • A means of mutual exclusion is required.

  • If a thread is not executing within the critical region, that thread must not prevent another thread seeking entry from entering the region.

  • We consider two threads and one core in the following examples.


First Attempt

  • Q1: Can T1 enter the critical region more times than T0?

  • Q2: What would happen if T0 terminates (by design or by accident)?

int threadNumber = 0;

void ThreadZero()
{
    while (TRUE) do {
        while (threadNumber == 1)
            do {}  // spin-wait
        CriticalRegionZero;
        threadNumber = 1;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) do {
        while (threadNumber == 0)
            do {}  // spin-wait
        CriticalRegionOne;
        threadNumber = 0;
        OtherStuffOne;
    }
}


Second Attempt

int Thread0inside = 0;
int Thread1inside = 0;

  • Q1: Can T1 enter the critical region multiple times when T0 is not within the critical region?

  • Q2: Can T0 and T1 be allowed to enter the critical region at the same time?

void ThreadZero()
{
    while (TRUE) do {
        while (Thread1inside) do {}
        Thread0inside = 1;
        CriticalRegionZero;
        Thread0inside = 0;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) do {
        while (Thread0inside) do {}
        Thread1inside = 1;
        CriticalRegionOne;
        Thread1inside = 0;
        OtherStuffOne;
    }
}


Third Attempt

int Thread0WantsToEnter = 0;
int Thread1WantsToEnter = 0;

void ThreadZero()
{
    while (TRUE) do {
        Thread0WantsToEnter = 1;
        while (Thread1WantsToEnter)
            do {}
        CriticalRegionZero;
        Thread0WantsToEnter = 0;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) do {
        Thread1WantsToEnter = 1;
        while (Thread0WantsToEnter)
            do {}
        CriticalRegionOne;
        Thread1WantsToEnter = 0;
        OtherStuffOne;
    }
}


Fourth Attempt

int Thread0WantsToEnter = 0;
int Thread1WantsToEnter = 0;

void ThreadZero()
{
    while (TRUE) do {
        Thread0WantsToEnter = 1;
        while (Thread1WantsToEnter)
            do {
                Thread0WantsToEnter = 0;
                delay(someRandomCycles);
                Thread0WantsToEnter = 1;
            }
        CriticalRegionZero;
        Thread0WantsToEnter = 0;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) do {
        Thread1WantsToEnter = 1;
        while (Thread0WantsToEnter)
            do {
                Thread1WantsToEnter = 0;
                delay(someRandomCycles);
                Thread1WantsToEnter = 1;
            }
        CriticalRegionOne;
        Thread1WantsToEnter = 0;
        OtherStuffOne;
    }
}


Dekker’s Algorithm

int Thread0WantsToEnter = 0, Thread1WantsToEnter = 0, favored = 0;

void ThreadZero()
{
    while (TRUE) do {
        Thread0WantsToEnter = 1;
        while (Thread1WantsToEnter)
            do {
                if (favored == 1) {
                    Thread0WantsToEnter = 0;
                    while (favored == 1) do {}
                    Thread0WantsToEnter = 1;
                }
            }
        CriticalRegionZero;
        favored = 1;
        Thread0WantsToEnter = 0;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) do {
        Thread1WantsToEnter = 1;
        while (Thread0WantsToEnter)
            do {
                if (favored == 0) {
                    Thread1WantsToEnter = 0;
                    while (favored == 0) do {}
                    Thread1WantsToEnter = 1;
                }
            }
        CriticalRegionOne;
        favored = 0;
        Thread1WantsToEnter = 0;
        OtherStuffOne;
    }
}


Parallel Program Design

  • Foster’s Methodology

  • Partitioning

    • Divide the computation to be performed and the data operated on by the computation into small tasks.

  • Communication

    • Determine what communication needs to be carried out among the tasks.

  • Agglomeration

    • Combine tasks that communicate intensively with each other or must be executed sequentially into larger tasks.

  • Mapping

    • Assign the composite tasks to processes/threads to minimize inter-processor communication and maximize processor utilization.


Parallel Histogram

(Figure: Find_bin() maps each element data[i-1], data[i], data[i+1], … to a bin b, and the shared counters bin_counts[b-1], bin_counts[b], … are incremented; bins 0–5 are shown.)


Parallel Histogram

(Figure: each thread processes its own slice of the data and increments private counters loc_bin_cts[b]; the private counters are then added into the shared bin_counts[b].)
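
A minimal sketch of this local-count scheme in C with OpenMP (the bin count, data size and binning rule are illustrative assumptions, not the course's code):

#include <stdio.h>
#include <omp.h>

#define N 1000000
#define BINS 6

int main(void) {
    static float data[N];
    int bin_counts[BINS] = {0};

    for (int i = 0; i < N; i++)               /* toy data in [0, 6) */
        data[i] = (i % 600) / 100.0f;

    #pragma omp parallel
    {
        int loc_bin_cts[BINS] = {0};          /* private to each thread */

        #pragma omp for
        for (int i = 0; i < N; i++) {
            int b = (int)data[i];             /* stands in for Find_bin() */
            loc_bin_cts[b]++;
        }

        #pragma omp critical                  /* merge local counts once */
        for (int b = 0; b < BINS; b++)
            bin_counts[b] += loc_bin_cts[b];
    }

    for (int b = 0; b < BINS; b++)
        printf("bin %d: %d\n", b, bin_counts[b]);
    return 0;
}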


Performance

  • Speedup

  • Efficiency

  • Scalability

    • Problem Size, Number of Processors

  • Strongly Scalable

    • Efficiency is maintained as the number of processors grows, with the total problem size fixed.

  • Weakly Scalable

    • Efficiency is maintained as the number of processors grows, with the problem size per processor fixed.
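
The formulas on the slide are not reproduced; the usual definitions are speedup S = T_serial / T_parallel and efficiency E = S / p, where T_serial and T_parallel are the serial and parallel run times and p is the number of processors.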


Amdahl's Law
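
The slide shows the standard speedup curves; the law itself states that if a fraction p of a program can be parallelized and the remaining 1 − p is serial, the speedup on N processors is S(N) = 1 / ((1 − p) + p/N), which can never exceed 1 / (1 − p) no matter how large N becomes.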


Gustafson's Law

  • Linear speedup can be achieved when:

    • Problem size is allowed to grow monotonically with N.

    • The sequential part is fixed or grows slowly.

  • Is it possible to achieve super linear speedup?

(Figure: run time divided into a sequential part and a parallel part.)
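
Stated as a formula: if s is the serial fraction of the time spent on the parallel machine and N is the number of processors, the scaled speedup is S(N) = s + N(1 − s) = N − (N − 1)s, which grows roughly linearly in N as long as s stays small.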


Review

  • Why is parallel computing important?

  • What is data dependency?

  • What are the benefits and issues of fine-grained parallelism?

  • What are the three types of parallelism?

  • What is the difference between concurrent and parallel computing?

  • What are the essential features of cloud computing?

  • What is Flynn’s Taxonomy?


Review

  • Name the four categories of memory systems.

  • What are the two common cache writing policies?

  • Name the two types of cache mapping strategies.

  • What is a cache miss and how to avoid it?

  • What may cause the false sharing issue?

  • What is a critical region?

  • How to verify the correctness of a concurrent program?


Review

  • Name three major APIs for parallel computing.

  • What are the benefits of GPU computing compared to MapReduce?

  • What is the basic procedure of parallel program design?

  • What are the key performance factors in parallel programming?

  • What is a strongly/weakly scalable parallel program?

  • What is the implication of Amdahl's Law?

  • What does Gustafson's Law tell us?

