Block A Introduction

Embedded Computer Architecture 2. Block A Introduction. The organisation of the course: 8 sessions; examination. Lecturers: Jan Kuper (Zilverling 4102, telephone: 3785), j.kuper@utwente.nl; André Kokkeler (Zilverling 4096, telephone: 4291). Course material: PowerPoint presentations on Teletop.



  1. Embedded Computer Architecture 2 Block A Introduction

  2. The organisation of the course • 8 sessions • Examination • Lecturers • Jan Kuper (Zilverling 4102, telephone: 3785) j.kuper@utwente.nl • André Kokkeler (Zilverling 4096, telephone: 4291) • Course Material • PowerPoint presentations on Teletop • Presentations will differ from last year

  3. The aims of the course • Show the relation between the algorithm and the architecture. • Derive the architecture from the algorithm (if possible) by means of transformations • Derive architectural requirements from the algorithm (by means of transformations)

  4. The aims of the course [Figure: course overview. Labels: Applications, Algorithms, subset ECA2, Transformations, Analysis methods, Design, Architecture, Platform (HW/SW). Two parts of +/- 4 sessions each.]

  5. The design process A design description may express: • Behavior: Expresses the relation between the input and the output value-streams of the system • Structure: Describes how the system is decomposed into subsystems and how these subsystems are connected • Geometry: Describes where the different parts are located.

  6. Abstraction levels [Figure: abstraction levels along three axes: Behavior, Geometry and Structure. Labels: Application, Algorithm, Basic operator, Boolean logic, Physical level; Board level, Layout, Cell; Block level, Processing element, Basic block, Transistor.]

  7. Specification overloading Specification overloading means that the specification gives a possibly unwanted implementation suggestion, i.e. the behavioral specification expresses structure. In practice: a behavioral specification always contains structure.

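  8. Example: same function, same behavior; different expressions, different structure, different designs. [Figure: two operator graphs over the labels 2, x, +, a, b and z; the first expression suggests one graph and the second expression suggests the other.]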

  9. Our focus • Array processors. • Systolic arrays. • Architectures for embedded algorithms such as digital signal processing algorithms.

  10. Array processor An array processor is a structure in which identical processing elements (PEs) are arranged regularly. [Figure: PEs arranged in 1 dimension and in 2 dimensions.]

  11. Array processor [Figure: PEs arranged in 3 dimensions.]

  12. Systolic array In a systolic array processor all communication paths contain at least one unit delay (register). Delay constraints are local. Therefore the array can be extended without limit and without changing the cells. [Figure: array of PEs in which every connection carries a register (delay) symbol.]

  13. Array Processors • Can be approached from: • Application • Algorithm • Architecture • Technology • We will focus on • Algorithm Architecture • Derive the architecture from the algorithm

  14. Array processors: Application areas • Speech processing • Image processing (video, medical, ...) • Radar • Weather • Medical signal processing • Geology • ... Many simple calculations on a lot of data in a short time. General purpose processors do not provide sufficient processing power.

  15. Example video processing • 1000 operations per pixel (is not that much) • 1024 x 1024 pixels per frame (high density TV) • 50 frames per second (100 Hz TV) • about 50 G operations per second • < 1 Watt available • Pentium 2 GHz: 2 G operations per second • > 30 Watt • required: 25 Pentiums, 750 Watt
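
The arithmetic on this slide can be checked with a short Python sketch. The constant names are chosen here for illustration only; the values are the ones given on the slide, and the Pentium count uses the slide's rounded figure of 50 G operations per second.

```python
# Values from the slide; names are illustrative assumptions.
OPS_PER_PIXEL = 1000
PIXELS_PER_FRAME = 1024 * 1024
FRAMES_PER_SECOND = 50
PENTIUM_OPS_PER_SECOND = 2e9   # "2 G operations per second"
PENTIUM_WATT = 30              # the slide says "> 30 Watt", so this is a lower bound

# Exact operation rate before rounding.
required_ops = OPS_PER_PIXEL * PIXELS_PER_FRAME * FRAMES_PER_SECOND

# The slide rounds the rate to 50 G operations per second.
pentiums_needed = 50e9 / PENTIUM_OPS_PER_SECOND
total_watt = pentiums_needed * PENTIUM_WATT

print(required_ops)     # 52428800000
print(pentiums_needed)  # 25.0
print(total_watt)       # 750.0
```

The exact rate is about 52.4 G operations per second, so 25 processors and 750 Watt are slightly optimistic lower bounds, against a budget of less than 1 Watt.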

  16. Description of the algorithms • In practice the algorithms are described (specified) in: • some programming language. • In our (toy) examples we use: • programming languages • algebraic descriptions

  17. Examples of algorithms we will use: • Filter • Matrix algebra • Transformations like the Fourier transform and the Z transform • Sorting • ...

  18. Graphs • Graphs are applicable for describing • behavior • structure • Dependency graphs • consist of: • nodes expressing operations or functions • edges expressing data dependencies or • the flow of data • Graphs are our vehicles to describe the design flow from • Algorithm to architecture

  19. Design flow: idea → program (imperative) → single assignment code (functional) → recurrent relations → dependency graph → signal flow graph

  20. Example: Sorting: the idea [Figure: a new element (8) is compared (<) with the elements of the sorted sequence 12 10 9 8 5 3 2 1. An empty place is needed, so the smaller elements are shifted one position.]

  21. [Figure: step-by-step insertion of an element x into the cells mj-1, mj, mj+1. Each swap is carried out as y := mj; mj := x; x := y.]

  22. Sorting: inserting one element
      Identical descriptions of swapping:
        if (x >= m[j]) { y = m[j]; m[j] = x; x = y; }
        if (x >= m[j]) swap(m[j],x);
        m[j],x = MaxMin(m[j],x);
      Inserting an element into a sorted array of i elements such that the order is preserved:
        m[i] = -infinite;
        for(j = 0; j < i+1; j++) { m[j],x = MaxMin(m[j],x); }
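
A minimal Python sketch of the insertion step above. The function names `max_min` and `insert` are chosen here for illustration, and `-math.inf` stands in for "-infinite"; the sorted part of `m` is assumed to be in descending order, as the swap condition `x >= m[j]` implies.

```python
import math

def max_min(a, b):
    """Return the pair (larger, smaller) of the two arguments."""
    return (a, b) if a >= b else (b, a)

def insert(m, i, x):
    """Insert x into m, whose first i entries are sorted in descending order."""
    m[i] = -math.inf                 # the empty place
    for j in range(i + 1):
        # m[j] keeps the larger value, x carries the smaller one onwards.
        m[j], x = max_min(m[j], x)
    return m

m = [9, 6, 3, None]
insert(m, 3, 8)
print(m)  # [9, 8, 6, 3]
```

After the loop, `x` holds the `-infinite` marker that was pushed out of the last cell, so no value is lost.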

  23. Sorting: The program Sorting N elements in an array is composed from N times inserting an element into a sorted array of N elements such that the order is preserved. An empty array is ordered.
      declaration: int in[0:N-1], x[0:N-1], m[0:N-1], out[0:N-1];
      input:  for(int i = 0; i < N; i++) { x[i] = in[i]; m[i] = -infinite; }
      body:   for(int i = 0; i < N; i++) { for(int j = 0; j < i+1; j++) { m[j],x[i] = MaxMin(m[j],x[i]); } }
      output: for(int j = 0; j < N; j++) { out[j] = m[j]; }
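
The program on this slide can be sketched in Python as follows. This is an illustration under assumptions: the name `sort_descending` is not from the slides, `-math.inf` replaces "-infinite", and the input, body and output parts are marked with comments.

```python
import math

def max_min(a, b):
    """Return the pair (larger, smaller) of the two arguments."""
    return (a, b) if a >= b else (b, a)

def sort_descending(values):
    n = len(values)
    # input
    x = list(values)
    m = [-math.inf] * n
    # body: insert x[i] into the sorted prefix m[0..i-1]
    for i in range(n):
        for j in range(i + 1):
            m[j], x[i] = max_min(m[j], x[i])
    # output
    return list(m)

print(sort_descending([3, 1, 5, 2]))  # [5, 3, 2, 1]
```

Note that `m[j]` and `x[i]` are overwritten many times here, which is what the single assignment transformation later removes.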

  24. Sorting: Towards ‘Single assignment’ • Single assignment: • Each scalar variable is assigned only once • Why? • Goal is a data dependency graph • - nodes expressing operations or functions • - edges expressing data dependencies or • the flow of data

  25. Sorting: Towards ‘Single assignment’ Single assignment: Each scalar variable is assigned only once. Why? Code: x=a+b; x=c*d; [Figure: one node + with inputs a and b and output x, and one node * with inputs c and d and output x.] How do you connect these in a graph?

  26. Sorting: Towards ‘Single assignment’ Single assignment: Each scalar variable is assigned only once. Why? Code: x=a+b; x=c*d; This description is already optimized towards implementation: memory optimization. But fundamentally you produce two different values, e.g. x1 and x2.
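
Why single assignment matters for the graph can be illustrated with a small Python sketch. The statement encoding (target, operator, operands) and the function name `dependency_edges` are assumptions made for this example; x1 and x2 are the renamed values from the slide.

```python
# Each statement is (target, operator, operands).
statements = [
    ("x1", "+", ("a", "b")),   # x1 = a + b
    ("x2", "*", ("c", "d")),   # x2 = c * d
]

def dependency_edges(stmts):
    """Return data-dependency edges (operand, target); reject reassignment."""
    producer = {}
    for target, op, _ in stmts:
        if target in producer:
            # Two producers for one name: the edge source would be ambiguous.
            raise ValueError(f"{target} is assigned more than once")
        producer[target] = op
    return [(operand, target) for target, _, operands in stmts
            for operand in operands]

print(dependency_edges(statements))
# [('a', 'x1'), ('b', 'x1'), ('c', 'x2'), ('d', 'x2')]
```

With the original code, where both statements assign to x, the function raises an error because a consumer of x could not be connected to a unique producing node.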

  27. Sorting: The program Sorting N elements in an array is composed from N times inserting an element into a sorted array of N elements such that the order is preserved. An empty array is ordered.
      declaration: int in[0:N-1], x[0:N-1], m[0:N-1];
      input:  for(int i = 0; i < N; i++) { x[i] = in[i]; m[i] = -infinite; }
      body:   for(int i = 0; i < N; i++) { for(j = 0; j < i+1; j++) { m[j],x[i] = MaxMin(m[j],x[i]); } }

  28. Sorting: Towards ‘Single assignment’ Single assignment: Each scalar variable is assigned only once. Start with m[j]: m[j] at loop index i depends on the value at loop index i-1.
      for(int i = 0; i < N; i++) { for(j = 0; j < i+1; j++) { m[j],x[i] = MaxMin(m[j],x[i]); } }
      hence,
      for(int i = 0; i < N; i++) { for(j = 0; j < i+1; j++) { m[i,j],x[i] = MaxMin(m[i-1,j],x[i]); } }

  29. Sorting: Towards ‘Single assignment’ x[i] at loop index j depends on the value at loop index j-1.
      for(int i = 0; i < N; i++) { for(j = 0; j < i+1; j++) { m[i,j],x[i] = MaxMin(m[i-1,j],x[i]); } }
      hence,
      for(int i = 0; i < N; i++) { for(j = 0; j < i+1; j++) { m[i,j],x[i,j] = MaxMin(m[i-1,j],x[i,j-1]); } }

  30. Sorting: The algorithm in ‘single assignment’
      declaration: int in[0:N-1], x[0:N-1,-1:N-1], m[-1:N-1,0:N-1];
      input:  for(int i = 0; i < N; i++) { x[i,-1] = in[i]; m[i-1,i] = -infinite; }
      body:   for(int i = 0; i < N; i++) { for(j = 0; j < i+1; j++) { m[i,j],x[i,j] = MaxMin(m[i-1,j],x[i,j-1]); } }
      output: for(int j = 0; j < N; j++) { out[j] = m[N-1,j]; }
      All scalar variables are assigned only once. The algorithm satisfies the single assignment property.
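
The single assignment version can be sketched in Python with dictionaries indexed by (i, j), so that negative indices such as -1 need no offset. The helper `assign` is an assumption added for this example: it fails if any element is written twice, which checks the single assignment property while the algorithm runs.

```python
import math

def max_min(a, b):
    """Return the pair (larger, smaller) of the two arguments."""
    return (a, b) if a >= b else (b, a)

def assign(table, key, value):
    """Write table[key] once; a second write violates single assignment."""
    assert key not in table, f"{key} assigned twice"
    table[key] = value

def sort_single_assignment(values):
    n = len(values)
    m, x = {}, {}
    # input
    for i in range(n):
        assign(x, (i, -1), values[i])
        assign(m, (i - 1, i), -math.inf)
    # body
    for i in range(n):
        for j in range(i + 1):
            larger, smaller = max_min(m[i - 1, j], x[i, j - 1])
            assign(m, (i, j), larger)
            assign(x, (i, j), smaller)
    # output
    return [m[n - 1, j] for j in range(n)]

print(sort_single_assignment([3, 1, 5, 2]))  # [5, 3, 2, 1]
```

The input writes only elements m[i-1,i] above the diagonal and the body writes only elements m[i,j] with j <= i, so the two never collide.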

  31. Sorting: Recurrent relation A description in single assignment can be directly translated into a recurrent relation:
      declaration: in[0:N-1], out[0:N-1], x[0:N-1, -1:N-1], m[-1:N-1, 0:N-1]
      input:  x[i,-1] = in[i]; m[i-1,i] = -infinite
      body:   m[i,j],x[i,j] = MaxMin(m[i-1,j],x[i,j-1])
      output: out[j] = m[N-1,j]
      area:   0 <= i < N; 0 <= j < i+1
      Notice that the order of these relations is arbitrary.

  32. Sorting: Body in two dimensions body: m[i,j],x[i,j] = MaxMin(m[i-1,j],x[i,j-1]) The body is executed for all i and j. Hence two dimensions. [Figure: a MaxMin node at position (i, j) with inputs m[i-1,j] and x[i,j-1] and outputs m[i,j] and x[i,j]; axes i and j.]

  33. Variable naming and index assignment A variable associated to an arrow gets the indices of the processing element that delivers its value. Local constants get the indices of the processing element that they are in. [Figure: PE(i,j) at position (i, j) with arrows labelled a(i,j-1), b(i-1,j-1), c(i-1,j), a(i,j), b(i,j) and c(i,j), and a local constant v(i,j); axes i and j.]

  34. Sorting: Body implementation body: m[i,j],x[i,j] = MaxMin(m[i-1,j],x[i,j-1])
      if (m[i-1,j] <= x[i,j-1]) { m[i,j] = x[i,j-1]; x[i,j] = m[i-1,j]; }
      else { m[i,j] = m[i-1,j]; x[i,j] = x[i,j-1]; }
      [Figure: a comparator on m[i-1,j] and x[i,j-1] selects, through two multiplexers with inputs 0 and 1, which value goes to m[i,j] and which to x[i,j]; axes i and j.]
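
The processing element above can be sketched in Python as one comparison that steers two selections, mirroring the if/else on the slide. The function name `pe` and its parameter names are assumptions for this illustration.

```python
def pe(m_in, x_in):
    """One MaxMin cell: m_in is m[i-1,j], x_in is x[i,j-1]."""
    swap = m_in <= x_in                 # the comparator
    m_out = x_in if swap else m_in      # first selector: the larger value
    x_out = m_in if swap else x_in      # second selector: the smaller value
    return m_out, x_out                 # (m[i,j], x[i,j])

print(pe(3, 8))  # (8, 3)
print(pe(9, 8))  # (9, 8)
```

Both selections use the same comparison result, so a single comparator per cell is enough.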

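  35. Sorting: Implementation N = 4, PE = MaxMin [Figure: triangular array of PEs at positions (i, j) with 0 <= j <= i < 4, so row i contains i+1 PEs. The inputs x[0,-1], x[1,-1], x[2,-1] and x[3,-1] enter the rows; the initial values m[-1,0], m[0,1], m[1,2] and m[2,3] enter along the diagonal; the outputs are m[3,0], m[3,1], m[3,2] and m[3,3].]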

  36. Sorting: Example N = 4. Input: 3 1 5 2. Output: 5 3 2 1.

  37. Dependency Graphs and Signal Flow Graphs • The array processor describes: • the way in which the processors are arranged and • the way in which the data is communicated between the processing elements. Hence, the graph describes the dependencies of the data that is communicated, or said differently: the graph describes the way in which the data values at the outputs of a processing element depend on the data at the outputs of the other processing elements. So we may consider it as a Dependency Graph. [Figure: array of PEs.]

  38. Recurrent relations For simple algorithms the transformation from single assignment code to a recurrent relation is simple. • Questions to answer: • How do recurrent relations influence the dependency graph? • How can recurrent relations be manipulated such that the behavior remains the same and the structure of the dependency graph is changed? We will answer these questions by means of an example: Matrix-Vector multiplication

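  39. Matrix Vector multiplication Recurrent relations: $s_{i,-1} = 0$, $s_{i,j} = s_{i,j-1} + a_{i,j} \cdot b_j$, $c_i = s_{i,N-1}$. Alternative (because addition is associative): $s_{i,N} = 0$, $s_{i,j} = s_{i,j+1} + a_{i,j} \cdot b_j$, $c_i = s_{i,0}$.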

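  40. Matrix Vector multiplication The basic cell is described by: $s_{i,j} = s_{i,j-1} + a_{i,j} \cdot b_j$. We have two indices i and j, so the dependency graph can be described as a two-dimensional array. [Figure: PE at position (i, j) containing a multiplier and an adder; it holds $a_{i,j}$, receives $b_j$ and $s_{i,j-1}$, passes $b_j$ on and delivers $s_{i,j}$; axes i and j.]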

  41. DG-1 of the Matrix Vector multiplication (K = 4) (N = 3) [Figure: array of 4 x 3 PEs with axes i (rows) and j (columns). $b_0$, $b_1$ and $b_2$ are supplied to the columns; row i starts with $s_{i,-1} = 0$, passes $s_{i,0}$ and $s_{i,1}$ between its PEs, and ends with $s_{i,2} = c_i$.] $b_0$, $b_1$ and $b_2$ are global dependencies. Therefore this graph is called a Globally recursive Graph.

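  42. DG-2 of the Matrix Vector multiplication (K = 4) (N = 3) [Figure: array of 4 x 3 PEs with axes i (rows) and j (columns). $b_0$, $b_1$ and $b_2$ are supplied to the columns; row i starts with $s_{i,3} = 0$, passes $s_{i,2}$ and $s_{i,1}$ between its PEs, and ends with $c_i = s_{i,0}$.]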
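
The two dependency graphs can be compared with a Python sketch. The recurrences used here are read off the boundary labels in the two figures (s(i,-1) = 0 and c(i) = s(i,N-1) for DG-1; s(i,N) = 0 and c(i) = s(i,0) for DG-2), so they are an interpretation rather than a quotation; the function names and the test matrix are assumptions.

```python
def matvec_dg1(a, b):
    """DG-1: s[i,j] = s[i,j-1] + a[i][j]*b[j], starting from s[i,-1] = 0."""
    k, n = len(a), len(b)
    s = {}
    for i in range(k):
        s[i, -1] = 0
        for j in range(n):
            s[i, j] = s[i, j - 1] + a[i][j] * b[j]
    return [s[i, n - 1] for i in range(k)]

def matvec_dg2(a, b):
    """DG-2: s[i,j] = s[i,j+1] + a[i][j]*b[j], starting from s[i,N] = 0."""
    k, n = len(a), len(b)
    s = {}
    for i in range(k):
        s[i, n] = 0
        for j in range(n - 1, -1, -1):
            s[i, j] = s[i, j + 1] + a[i][j] * b[j]
    return [s[i, 0] for i in range(k)]

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]   # K = 4, N = 3
b = [1, 0, 2]
print(matvec_dg1(a, b))  # [7, 16, 25, 34]
print(matvec_dg2(a, b))  # [7, 16, 25, 34]
```

Both functions have the same behavior; only the direction in which the partial sums travel through a row differs, which is the structural difference between the two graphs.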

  43. Recurrent relations: Conclusion An associative operation such as the summation results in two different recurrent relations and thus in two different dependency graphs. Other associative operations are for example ‘AND’ and ‘OR’. [Figure: each of the two equations shown next to the dependency graph it results in.]

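  44. Changing global data dependencies into local data dependencies $c_i = \sum_{j=0}^{N-1} a_{i,j} \cdot b_j$ Global data dependencies resist manipulating the dependency graph. [Figure: two versions of the PE at position (i, j). With global data dependencies, $b_j$ is supplied directly to the PE. With local data dependencies, the PE receives $b_j$ as $d_{i-1,j}$ from its neighbour. Both deliver $s_{i,j}$ towards $c_i$.]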

  45. Changing global data dependencies into local data dependencies $c_i = \sum_{j=0}^{N-1} a_{i,j} \cdot b_j$ So the matrix-vector multiplication becomes a locally recursive graph (K = 4) (N = 3). [Figure: the relations and the array of 4 x 3 PEs. $b_0 = d_{-1,0}$, $b_1 = d_{-1,1}$ and $b_2 = d_{-1,2}$ enter the first row; $d_{0,0}$, $d_{0,1}$, $d_{1,0}$, ... connect each PE to the PE in the next row; row i starts with $s_{i,-1} = 0$ and ends with $s_{i,2} = c_i$.]
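
A Python sketch of the locally recursive graph. The relations themselves are not preserved in this transcript, so the ones used here are inferred from the figure labels (b(j) = d(-1,j), with each PE handing its d value to the next row) and should be read as an assumption; the function name is illustrative.

```python
def matvec_local(a, b):
    """Each PE(i,j) reads b[j] from its neighbour as d[i-1,j] and passes it on."""
    k, n = len(a), len(b)
    d, s = {}, {}
    for j in range(n):
        d[-1, j] = b[j]                    # b enters at the first row only
    for i in range(k):
        s[i, -1] = 0
        for j in range(n):
            d[i, j] = d[i - 1, j]          # local propagation of b[j]
            s[i, j] = s[i, j - 1] + a[i][j] * d[i - 1, j]
    return [s[i, n - 1] for i in range(k)]

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]   # K = 4, N = 3
b = [1, 0, 2]
print(matvec_local(a, b))  # [7, 16, 25, 34]
```

No PE refers to `b` directly after the first row is fed, so every dependency connects neighbouring positions only.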

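  46. Alternative transformation from global data dependencies to local data dependencies $c_i = \sum_{j=0}^{N-1} a_{i,j} \cdot b_j$ [Figure: two versions of the PE at position (i, j). With global data dependencies, $b_j$ is supplied directly to the PE. With local data dependencies, $b_j$ travels through the column in the opposite direction and the PE delivers $d_{i,j}$. Both deliver $s_{i,j}$ towards $c_i$.]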

  47. Changing global data dependencies into local data dependencies $c_i = \sum_{j=0}^{N-1} a_{i,j} \cdot b_j$ So the alternative locally recursive graph becomes (K = 4) (N = 3): [Figure: the relations and the array of 4 x 3 PEs. $b_0 = d_{4,0}$, $b_1 = d_{4,1}$ and $b_2 = d_{4,2}$ enter at the last row; $d_{2,0}$, $d_{1,0}$, $d_{1,1}$, ... connect each PE to the PE in the previous row; row i starts with $s_{i,-1} = 0$ and ends with $s_{i,2} = c_i$.]
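
A Python sketch of this alternative graph, in which b enters at the last row and is handed upwards. As the relations are not preserved in this transcript, the recurrence d(i,j) = d(i+1,j) with d(K,j) = b(j) is inferred from the figure labels and is an assumption; the function name is illustrative.

```python
def matvec_local_alt(a, b):
    """Each PE(i,j) reads b[j] from the row below as d[i+1,j] and passes it up."""
    k, n = len(a), len(b)
    d, s = {}, {}
    for j in range(n):
        d[k, j] = b[j]                     # b enters below the last row
    for i in range(k - 1, -1, -1):         # rows are visited from bottom to top
        s[i, -1] = 0
        for j in range(n):
            d[i, j] = d[i + 1, j]          # local propagation of b[j]
            s[i, j] = s[i, j - 1] + a[i][j] * d[i + 1, j]
    return [s[i, n - 1] for i in range(k)]

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]   # K = 4, N = 3
b = [1, 0, 2]
print(matvec_local_alt(a, b))  # [7, 16, 25, 34]
```

The result equals that of the global version; only the direction in which b travels through the columns has changed.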

  48. Dependency Graphs Conclusions: • Associative operations give two alternative DGs. • Transformation from global to local dependencies gives two alternative DGs. • Input, output and intermediate edges will be treated separately.