
Unit-5: Concurrent Processors



  1. Unit-5: Concurrent Processors. Vector & Multiple Instruction Issue Processors

  2. Concurrent Processors • Processors that can execute multiple instructions at the same time (concurrently). • Concurrent processors can make simultaneous accesses to memory and can execute multiple operations simultaneously. • These processors execute from one program stream and have a single instruction counter, but instructions are rearranged so that concurrent instruction execution is achieved.

  3. Concurrent Processors • Processor performance depends on compiler ability, execution resources, and memory system design. • Sophisticated compilers can detect the various types of instruction-level parallelism that exist within a program; then, depending on the type of concurrent processor, they can restructure the code to exploit the available concurrency.

  4. Concurrent Processors • There are two main types of concurrent processors. • Vector processors: a single vector instruction replaces multiple scalar instructions. This approach depends on the compiler's ability to vectorize the code, transforming loops into sequences of vector operations. • Multiple-issue processors: instructions whose effects are independent of each other are executed concurrently.

  5. Vector Processors • A vector computer or vector processor is a machine designed to efficiently handle arithmetic operations on elements of arrays, called vectors. Such machines are especially useful in high-performance scientific computing, where matrix and vector arithmetic are quite common. Supercomputers like the Cray Y-MP are examples of vector processors.

  6. Vector Processors • Vector processors are based on the premise that the original program either explicitly declares many of the data operands to be vectors or arrays, or implicitly uses loops whose data references can be expressed as references to a vector of operands (detected by compilers). • Vector processors achieve considerable speedup in performance over simple pipelined processors.

  7. Vector Processors • To achieve concurrency in operations and the resulting speedup in performance, vector processors extend the instruction set and architecture to support the concurrent execution of commonly used vector operations in hardware. • Directly supporting vector operations in hardware reduces or eliminates the overhead of loop control that would otherwise be necessary.

  8. Vectors and vector arithmetic • A vector, v, is a list of elements: v = (v1, v2, v3, ..., vn). • The length of a vector is defined as the number of elements in that vector, so the length of v is n. • When mapping a vector to a computer program, we declare the vector as a one-dimensional array.

  9. Vectors and vector arithmetic • Arithmetic operations may be performed on vectors. Two vectors are added by adding corresponding elements: s = x + y = (x1+y1, x2+y2, ..., xn+yn), where s is the vector representing the final sum; in a program, s, x, and y would be declared as arrays of dimension n. This operation is called elementwise addition. Similarly, the subtraction of two vectors, x - y, is an elementwise operation.
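A minimal sketch of this elementwise operation (the vector names follow the slide; the code itself is only illustrative):

```python
# Elementwise vector addition: s = x + y, as described above.
x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
s = [xi + yi for xi, yi in zip(x, y)]
print(s)  # [5.0, 7.0, 9.0]
```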

  10. Vector Computing Architectural Concepts • A vector computer contains a set of special arithmetic units called pipelines. • These pipelines overlap the execution of the different parts of an arithmetic operation on the elements of the vector. • There can be different sets of arithmetic pipelines to perform vector additions and vector multiplications.

  11. The stages of a floating-point operation • The steps or stages involved in a floating-point addition s = x + y on a sequential machine with normal floating-point arithmetic hardware: • [A:] The exponents of the two floating-point numbers to be added are compared to find the number with the smaller magnitude. • [B:] The significand of the number with the smaller magnitude is shifted so that the exponents of the two numbers agree. • [C:] The significands are added.

  12. The stages of a floating-point operation • [D:] The result of the addition is normalized. • [E:] Checks are made to see if any floating-point exceptions occurred during the addition, such as overflow. • [F:] Rounding occurs. • Consider an example of such an addition: the numbers to be added are x = 1234.00 and y = -567.8.

  13. Stages of a Floating-point Addition

      Step    A            B             C            D            E            F
      x       0.1234E4     0.12340E4
      y      -0.5678E3    -0.05678E4
      s                                  0.06662E4    0.66620E3    0.66620E3    0.6662E3
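A small Python sketch of the six stages A through F, using base-10 (significand, exponent) pairs. The stage labels and the example values come from the slides; the helper function and its number representation are assumptions made for illustration:

```python
def fp_add(x_sig, x_exp, y_sig, y_exp, digits=4):
    # Stage A: compare exponents to find the operand with the smaller magnitude.
    if x_exp < y_exp:
        (x_sig, x_exp), (y_sig, y_exp) = (y_sig, y_exp), (x_sig, x_exp)
    # Stage B: shift the smaller significand so the exponents agree.
    y_sig *= 10.0 ** (y_exp - x_exp)       # -0.5678E3 -> -0.05678E4
    # Stage C: add the significands.
    s_sig, s_exp = x_sig + y_sig, x_exp    # 0.06662E4
    # Stage D: normalize so the significand lies in [0.1, 1).
    while s_sig != 0 and abs(s_sig) < 0.1:
        s_sig, s_exp = s_sig * 10, s_exp - 1   # -> 0.6662E3
    # Stage E: exception checks (overflow, etc.) are omitted in this sketch.
    # Stage F: round to the working precision.
    return round(s_sig, digits), s_exp

print(fp_add(0.1234, 4, -0.5678, 3))  # (0.6662, 3), i.e. 666.2
```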

  14. Stages of a Floating-point Addition • Consider this scalar addition performed on all the elements of a pair of vectors (arrays) of length n. • Each of the six stages needs to be executed for every pair of elements. • If each stage of the execution takes tau units of time, then each addition takes 6*tau units of time (not counting the time required to fetch and decode the instruction itself or to fetch the two operands).

  15. Stages of a Floating-point Addition • So the number of time units required to add all the elements of the two vectors in a serial fashion would be Ts = 6*n*tau.

  16. An Arithmetic Pipeline • Suppose the addition operation described previously is pipelined; that is, one of the six stages of the addition for a pair of elements is performed at each stage in the pipeline. • Each stage of the pipeline has a separate arithmetic unit designed for the operation to be performed at that stage. • It still takes 6*tau units of time to complete the sum of the first pair of elements, but the sum of the next pair is ready in only tau more units of time.

  17. An Arithmetic Pipeline • So the time, Tp, to do the pipelined addition of two vectors of length n is Tp = 6*tau + (n-1)*tau = (n + 5)*tau. • Thus, this pipelined version of addition is faster than the serial version by almost a factor of the number of stages in the pipeline. • This is an example of what makes vector processing more efficient than scalar processing.
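The two timing formulas can be checked with a short sketch (the stage count, tau, and n = 64 follow the discussion above; the function names are ours):

```python
def serial_time(n, stages=6, tau=1.0):
    return stages * n * tau           # Ts = 6*n*tau

def pipelined_time(n, stages=6, tau=1.0):
    return (n + stages - 1) * tau     # Tp = (n + 5)*tau for a 6-stage pipeline

n = 64
print(serial_time(n), pipelined_time(n))   # 384.0 69.0
print(serial_time(n) / pipelined_time(n))  # ~5.57, approaching the 6x stage bound
```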

  18. An Arithmetic Pipeline • The operations at each stage of a pipeline for floating-point multiplication are slightly different from those for addition. • A multiplication pipeline may even have a different number of stages than an addition pipeline. • There may also be pipelines for integer operations. • Some vector architectures provide greater efficiency by allowing the output of one pipeline to be chained directly into another pipeline.

  19. Vector Functional Units • All modern vector processors use the vector-register-to-vector-register instruction format. • So all vector processors contain vector register sets. • A vector register set consists of eight or more registers, with each register containing from 16 to 64 vector elements. • Each vector element is a floating-point word (64 bits).

  20. Vector Functional Units • Vector registers access memory with special load and store instructions. • There are separate and independent functional units to manage the load/store function. • There are separate vector execution units for each instruction class. • These execution units are segmented (pipelined) to support the highest possible execution rate.
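A toy model of this organization, assuming eight registers of 64 elements each and simple strided VLD/VST paths (all names and sizes here are illustrative, not a real ISA):

```python
VRNUM, VLEN = 8, 64                    # eight vector registers, 64 elements each
vregs = [[0.0] * VLEN for _ in range(VRNUM)]

def vld(vr, memory, base, stride=1):
    # Strided vector load: memory -> vector register, bypassing any cache.
    vregs[vr] = [memory[base + i * stride] for i in range(VLEN)]

def vst(vr, memory, base, stride=1):
    # Strided vector store: vector register -> memory.
    for i, v in enumerate(vregs[vr]):
        memory[base + i * stride] = v

mem = [float(i) for i in range(256)]
vld(0, mem, base=0, stride=2)
print(vregs[0][:4])                    # [0.0, 2.0, 4.0, 6.0]
```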

  21. Primary Storage Facilities • [Diagram: vector registers connect to memory through VLD/VST paths; the scalar floating-point registers and the integer general-purpose registers connect to memory through the data cache.]

  22. Vector Functional Units • Vector operations involve operations on a large number of operands (a vector of operands), so pipelining helps achieve execution at the cycle rate of the system. • Vector processors also contain scalar floating-point registers, integer (general-purpose) registers, and scalar functional units. • Scalar registers and their contents can interface with the vector execution units.

  23. Vector Functional Units • Vectors as a data structure are not well managed by a data cache, so vector load/store operations avoid the data cache and are implemented directly between memory and the vector registers. • Vector load/store operations can be overlapped with other vector instruction executions, but vector loads must complete before their results can be used.

  24. Vector Functional Units • The ability of a processor to concurrently execute multiple (independent) vector instructions is limited by the number of vector register ports and vector execution units. • Each concurrent load or store requires a vector register port; vector ALU operations require multiple ports.

  25. Vector Instructions / Operations • Vector instructions are effective in several ways. • They significantly improve code density. • They reduce the number of instructions required to execute a program (reducing I-bandwidth). • They organize data arguments into regular sequences that can be efficiently handled by the hardware. • They can represent a simple loop construct, thus removing the control overhead for loop execution.

  26. Types of Vector Operations • (a) Vector arithmetic and logical operations: VADD, VSUB, VMPY, VDIV, VAND, VOR, VEOR. The format is VOP VR1, VR2, VR3, where VR2 and VR3 are the vector registers containing the source operands on which the vector operation is performed, and the result is stored in vector register VR1.

  27. Types of Vector Operations • (b) Compare (VCOMP): VR1 VCOMP VR2 -> S. The result of a vector compare is stored in a scalar register S. Bit Si of the scalar register is set to 1 if v1.i > v2.i (comparison of the i-th elements). Test (VTEST): V1 VTEST CC -> S. Bit Si is 1 if v1.i satisfies CC (the condition code specified in the instruction).
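A sketch of the compare semantics, with the result packed one bit per element into a scalar (the function name and the bit-packing convention are assumptions):

```python
def vcomp(v1, v2):
    # Set bit i of the scalar result to 1 if v1[i] > v2[i].
    s = 0
    for i, (a, b) in enumerate(zip(v1, v2)):
        if a > b:
            s |= 1 << i
    return s

print(bin(vcomp([3, 1, 5], [2, 4, 5])))  # 0b1: only element 0 satisfies >
```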

  28. Types of Vector Operations • (c) Accumulate (VACC): Σ(V1 * V2) -> S. Accumulate the sum of the products of corresponding elements of two vectors into a scalar register. • (d) Expand / Compress (VEXP / VCPRS): VR OP S -> VR. Take logical vectors and apply them to elements in vector registers to create a new vector value. • Vector load (gather) and vector store (scatter) instructions access memory asynchronously.
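Illustrative sketches of VACC and of gather/scatter-style vector load/store (the function names are ours, not an actual instruction set):

```python
def vacc(v1, v2):
    # Accumulate the sum of elementwise products into a scalar.
    return sum(a * b for a, b in zip(v1, v2))

def vgather(memory, indices):
    # Indexed vector load: collect scattered elements into a vector.
    return [memory[i] for i in indices]

def vscatter(memory, indices, values):
    # Indexed vector store: spread vector elements to scattered addresses.
    for i, v in zip(indices, values):
        memory[i] = v

mem = [0.0] * 8
vscatter(mem, [0, 3, 6], [1.0, 2.0, 3.0])
print(vgather(mem, [0, 3, 6]))     # [1.0, 2.0, 3.0]
print(vacc([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```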

  29. Types of Vector Operations • When a vector operation is done on two vector registers of unequal length, we need some convention for producing the result. • All entries in a vector register that are not explicitly stored are given an invalid-content symbol, NaN, and any operation using NaN will also produce NaN regardless of the contents of the other register.
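A sketch of that convention, padding the shorter operand with NaN so that undefined elements propagate (the padding helper is an assumption for illustration):

```python
import math

def vadd_padded(v1, v2):
    # Entries not explicitly stored are treated as NaN; NaN propagates.
    n = max(len(v1), len(v2))
    p1 = v1 + [math.nan] * (n - len(v1))
    p2 = v2 + [math.nan] * (n - len(v2))
    return [a + b for a, b in zip(p1, p2)]

print(vadd_padded([1.0, 2.0, 3.0], [10.0, 20.0]))  # [11.0, 22.0, nan]
```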

  30. Vector Processor Implementation • Vector processor implementation requires a considerable amount of additional control and hardware. • The vector registers used to store vector operands generally bypass the data cache. • The data cache is used solely to store scalar values. • Since a particular value stored in a vector may be aliased as a scalar and stored in the data cache, all vector references to memory must be checked against the contents of the data cache.

  31. Vector Processor Implementation • If there is a hit, the current value contained in the data cache is invalidated and a memory update is forced. • Additional hardware or software control is required to ensure that scalar references from the data cache to memory do not inadvertently reference a value contained in a vector register. • Earlier vector processors used a memory-to-memory instruction format, but due to severe memory congestion and contention problems, most recent vector processors use vector registers to load/store vector operands.

  32. Vector Processor Implementation • Vector registers generally consist of a set of eight registers, each containing from 16 to 64 entries of the size of a floating-point word. • The arithmetic pipeline may be shared with the scalar part of the processor. • Under some conditions, it is possible to execute more than one arithmetic operation per cycle. • The result of one arithmetic operation can be directly used as an operand in a subsequent vector instruction.

  33. Vector Processor Implementation • This is called chaining. • For the two instructions VADD VR3, VR1, VR2 and VMPY VR5, VR3, VR4: [Diagram: element sums VR1.i + VR2.i leave the VADD unit and flow into VR3 and, via chaining, directly into the VMPY unit, which computes VR3.i * VR4.i.]

  34. Vector Processor Implementation • The illustrated chained ADD-MPY, with each functional unit having 4 stages, saves 64 cycles. • If unchained, it would have taken 4 (startup) + 64 (elements/VR) = 68 cycles for each function, a total of 136 cycles. • With chaining this is reduced to 4 (add startup) + 4 (multiply startup) + 64 (elements/VR) = 72 cycles, a saving of 64 cycles.
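The cycle counts above can be reproduced directly (4-stage units and 64-element registers, as in the slide):

```python
stages, n = 4, 64
unchained = (stages + n) + (stages + n)  # 68 cycles per function: 136 in total
chained = stages + stages + n            # 4 + 4 + 64 = 72 cycles
print(unchained, chained, unchained - chained)  # 136 72 64
```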

  35. Vector Processor Implementation • Another important aspect of vector implementation is the management of references to memory. • Memory must have sufficient bandwidth to support a minimum of two, and preferably three, references per cycle (two reads and one write). • This bandwidth allows two vector reads and one vector write to be initiated and executed concurrently with the execution of a vector arithmetic operation.

  36. Vector Processor Implementation • The major data paths in a generic vector processor are shown in Figure 7.14 on page 438 of Computer Architecture by Michael Flynn.

  37. Vector Memory • The simple low-order interleaving used in normal pipelined processors is not suitable for vector processors. • Vector access is non-sequential but systematic; if the array dimension or stride (the address distance between adjacent elements) is the same as the interleaving factor, then all references will concentrate on the same module.

  38. Vector Memory • It is quite common for these strides to be of the form 2^k or other even dimensions. • So vector memory designs use address remapping and a prime number of memory modules. • Hashed addressing is a technique for dispersing addresses. • Hashing is a strict 1:1 mapping of the bits in X to form a new address X' based on simple manipulations of the bits in X.
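A quick sketch of why a prime module count helps: counting how many of m modules a strided access stream actually touches (the helper function is illustrative):

```python
def modules_touched(m, stride, n=64):
    # Element i lives in module (i * stride) mod m; count distinct modules.
    return len({(i * stride) % m for i in range(n)})

print(modules_touched(16, 16))  # 1: every reference hits the same module
print(modules_touched(16, 8))   # 2: only two modules are used
print(modules_touched(17, 16))  # 17: a prime module count spreads references
```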

  39. Vector Memory • A memory system used in vector / matrix accessing consists of the following units: • Address hasher • 2^k + 1 memory modules • Module mapper. This may add a certain overhead and extra cycles to memory access, but since the purpose of the memory is to access vectors, the cost can be overlapped in most cases.

  40. Vector Memory • [Diagram: an address X passes through the address hasher to produce X'; the module mapper then computes the module index as X' mod (2^k + 1) and the address within the module as X' / 2^k; data is distributed across the 2^k + 1 modules.]

  41. Modeling Vector Memory Performance • Vector memory is designed for multiple simultaneous requests to memory. • Operand fetching and storing is overlapped with vector execution. • Three concurrent operand accesses to memory are a common target, but the increased cost of the memory system may limit this to two. • Chaining may require even more accesses. • Another issue is the degree of bypassing, or out-of-order requests, that a source can make to the memory system.

  42. Modeling Vector Memory Performance • In case of a conflict, i.e., a request being directed to a busy module, the source can continue to make subsequent requests only if the unserviced requests are held in a buffer. • Assume each of the s access ports to memory has a buffer of size TBF/s which holds requests that are delayed due to a conflict. • For each source, the degree of bypassing is defined as the allowable number of requests waiting before stalling of subsequent requests occurs.

  43. Modeling Vector Memory Performance • If Qc is the expected number of denied requests per module and m is the number of modules, then the buffer size must be large enough to hold the denied requests: TBF > m * Qc. • If n is the total number of requests made and B is the bandwidth achieved, then m * Qc = n - B (denied requests).

  44. Gamma (γ) – Binomial Model • Assume that each vector source issues a request each cycle (δ = 1) and each physical requestor has the same buffer capacity and characteristics. • If the vector processor can make s requests per cycle and there are t cycles per Tc, then the total requests per Tc = t * s = n. This is the same as n requests per Tc in the simple binomial model.

  45. Gamma (γ) – Binomial Model • If γ is the mean queue size of bypassed requests awaiting service, then each of the γ buffered requests also makes a request. • From a memory-modeling point of view, this is equivalent to the buffer requesting service each cycle until the module is free: total requests per Tc = t*s + t*s*γ = t*s(1 + γ) = n(1 + γ).

  46. Gamma (γ) – Binomial Model • Substituting this value into the simple binomial equation: B(m, n, γ) = m + n(1+γ) - 1/2 - sqrt( (m + n(1+γ) - 1/2)^2 - 2nm(1+γ) )
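A direct transcription of this formula as a sketch (treat it as the model above, not a validated implementation; the example arguments are arbitrary):

```python
import math

def bandwidth(m, n, gamma):
    # B(m, n, gamma) from the gamma-binomial model above.
    a = m + n * (1 + gamma) - 0.5
    return a - math.sqrt(a * a - 2 * n * m * (1 + gamma))

print(bandwidth(17, 4, 0.0))  # gamma = 0 reduces to the simple binomial model
```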

  47. Calculating γopt • γ is the mean expected bypassed-request queue per source. • If we continue to increase the number of bypass buffer registers, we can achieve a γopt which totally eliminates contention. • No contention occurs when B = n, i.e., B(m, n, γ) = n. • This occurs when ρa = ρ = n/m. • The MB/D/1 queue size is given by Q = (ρa^2 - p·ρa) / (2(1 - ρa)) = (n(1+γ) - B) / m

  48. Calculating γopt • Substituting ρa = ρ = n/m and p = 1/m, we get: Q = (n^2 - n) / (2(m^2 - nm)) = (n/m)(n - 1) / (2m - 2n) • Since Q = (n(1+γ) - B) / m, we have mQ = n(1+γ) - B. • For γopt, (n - B) = 0, so γopt = (m/n)·Q, giving γopt = (n - 1) / (2m - 2n). • The mean total buffer size is TBF = n·γopt. • To avoid overflow, the buffer may be considerably larger, perhaps 2 x TBF.
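The closed forms above can be evaluated directly (the values of m and n here are arbitrary illustrative choices):

```python
def gamma_opt(m, n):
    # gamma_opt = (n - 1) / (2m - 2n), valid for m > n.
    return (n - 1) / (2 * m - 2 * n)

m, n = 17, 4
g = gamma_opt(m, n)
print(g)          # ~0.115 mean bypassed requests per source
print(n * g)      # TBF = n * gamma_opt, the mean total buffer size
print(2 * n * g)  # sizing the buffer at ~2x TBF guards against overflow
```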

  49. Vector Processor Speedup: Performance Relative to a Pipelined Processor Vector processor performance depends on • The fraction of the program that can be expressed in a vectorizable form. • Vector startup costs (length of the pipeline). • The number of execution units and support for chaining. • The number of operands that can be simultaneously accessed / stored. • The number of vector registers.

  50. Vector Processor Speedup • The overall speedup possible for a vector processor over a pipelined processor is generally limited to about 4. • This assumes concurrent execution of 2 loads, 1 store, and 1 arithmetic operation. • If chaining is allowed and the memory system can accommodate an additional concurrent load instruction, the speedup can extend to <6 (3 LDs, 2 arithmetic, and 1 ST).
