Research at the Computer Engineering Laboratory of Delft University of Technology

Ben Juurlink

Outline • General Information • Group Location • Group Formation • Group Funding • Group Interests • Group Projects • Molen • D-Iliad • MOVE • Pamela • PUB library • Concluding Remarks

Group Location
• Delft University of Technology: 7 faculties, 13,000 students, 2,100 researchers
• Faculties: Aerospace Engineering; Applied Sciences; Architecture; Civil Engineering and Geosciences; Design, Engineering and Production; Information Technology and Systems; Technology, Policy and Management
• Information Technology and Systems: Computer Science; Electrical Engineering; Mathematics
• Further units shown on the chart: Telecommunication; Software Engineering; Microelectronics; Energy; Mediamatica; Mathematical Analysis; Control, Risk, Optimization, Stochastics, and Systems

Group Formation

Group Funding 94-98 (in Kfl) Total financing: 6000 Kfl

Group Output (‘94-’98) • Degrees: • PhD theses: 9 • Eng. degrees: 5 • MSc: 87 • Publications: • Books/chapters: 7 • Journal articles: 47 • Conference papers: 165 • Patents: 50 • Five start-ups

Computer Engineering • Computer Engineering: Analysis of data processing requirements for electronic data processing units and systems and the design (synthesis) of their architecture, implementation, and realization • Architecture: Determine the function to perform • Implementation: Establish a method to achieve the function • Realization: Use available means to materialize the method

Computer Engineering Interests

Group Projects • MOLEN: Embedded system architecture, multimedia, Java • MOVE: Embedded system synthesis, compilers, hardware/software co-design • PAMELA: Performance analysis and languages • D-ILIAD: Computer architecture, implementation, computer arithmetic, switches

MOLEN: Embedded System Design • Topics: • Embedded Processor Architectures • Multimedia • Java • Embedded System Tools • Embedded Agents • Current Contributions: • Java Processor • Multimedia Instructions • Specialized Units • FPGA Units • Future Directions: • Reconfigurable embedded processors

Molen Multimedia Instruction and Functional Unit
• Published at EUROMICRO’98
• Motion estimation, sum of absolute differences (SAD):
    s = 0;
    for (j = 0; j < h; j++) {
        if ((v = p1[0] - p2[0]) < 0) v = -v;
        s += v;
        if ((v = p1[1] - p2[1]) < 0) v = -v;
        s += v;
        ...
        if ((v = p1[15] - p2[15]) < 0) v = -v;
        s += v;
        if (s >= distlim) break;
        p1 += lx;
        p2 += lx;
    }
• Formula: $\mathrm{SAD} = \sum_i |A_i - B_i|$, summed over all pixel pairs $(A_i, B_i)$ of the two blocks

Efficient Implementation of the SAD Operation
• Straightforward approach:
  • Compute Ai-Bi for all pairs of pixels
  • Take absolute values
  • Accumulate absolute values
  • Cost: 4 cycles
• MOLEN solution:
  • Observation: |Ai-Bi| = max{Ai,Bi}-min{Ai,Bi}
  • Problem: determining and negating min(Ai,Bi) takes > 1 cycle
  • Solution: pass min(Ai,Bi) to the accumulate stage and correct
  • Cost: 3 cycles

MOVE: Semi-automatic generation of application specific processors
[Figure: design flow with user interaction, optimizer, architecture parameters, parametric compiler (parallel object code), hardware generator, and feedback paths; solution space plotted as exec. time versus cost]

MOVE • Current Contributions: • Transport triggered architecture • Operational design framework (add any unit you like, no restrictions) • Several cheap designs (data logger, video-enhancer, MPEG-decoder, wireless communications) • Future Directions: • Tune your application to suit your processor • System design • Multiprocessor TTA • Low-power processors

Transport Triggered Architecture
• Published in e.g. Jnl. of Systems Architecture ‘99
• Transport triggered architecture:
  • Only one instruction: MOVE!
  • FU operations are triggered by moving data to their input ports
• Example:
    add r1,r2,r3
    sub r4,r2,r6
    st  r4,r1
• TTA code:
    r2->O1add.alu1; r3->O2add.alu1; r2->O1sub.alu2; r6->O2sub.alu2
    Radd.alu1->r1; Rsub.alu2->r4
    r1->O1st.ls; r4->O2st.ls
• After bypassing:
    r2->O1add.alu1; r3->O2add.alu1; r2->O1sub.alu2; r6->O2sub.alu2
    Radd.alu1->r1; Rsub.alu2->r4; Radd.alu1->O1st.ls; Rsub.alu2->O2st.ls

PAMELA: Performance Analysis of Computer Systems
[Figure labels: Architecture, Implementation, Analytic Evaluation, Simulation Evaluation]
• Current Contributions: • Specialized Languages • Simulation Tools & Methodology • Parallel Algorithms • Delft Architecture Workbench • Future Directions: • Complete the Delft Architecture Workbench

Static Branch Prediction
• Data dependent branches:
    for (i = 0; i < n-1; i++) {
        minIndex = i;
        for (j = i+1; j < n; j++)
            if (a[j] < a[minIndex])    /* branch B */
                minIndex = j;
        swap(&a[i], &a[minIndex]);
    }
• Oblivious static branch predictor: B is taken 50% of the time
• Bernoulli model with truth probability p (profiling): large variance in the prediction error
• New model based on alternating renewal processes reduces the variance of the prediction error by an order of magnitude
• Let D (U) = number of consecutive 0’s (1’s)
• Then
    $E[PA] = \frac{E[U]}{E[D] + E[U]}$
    $\mathrm{Var}[PA] = \frac{E[D]^2 \, \mathrm{Var}[U] + E[U]^2 \, \mathrm{Var}[D]}{(E[D] + E[U])^2}$
• Example: 110011001100
• Then E[PA] = 0.5, Var[PA] = 0

D-Iliad: High Performance General Purpose Computers • Topics: • Uni & Multiprocessors • Internet Processing • Computer Design • High Speed Switches • Current Contributions: • Instruction level parallel machines (Superscalar, SCISM) • New “Complex” Instructions • New Designs of Arithmetic Processing • New Switch Design • Future Directions: • New Architectural paradigm

Complex Streamed Instructions
• See PACT’01, EuroPar’01
• Drawbacks of MMX-like extensions:
  • Multimedia (MM) register size is architecturally visible and fixed. Ways out:
    • add MM FUs and increase issue width
      • expensive
    • increase MM register size
      • existing codes have to be recompiled/rewritten
      • not beneficial due to small sub-matrices
  • overhead for converting between packed data types and for alignment
• Proposed solution: Complex Streamed Instructions (CSI)
  • two-dimensional vector (stream) architecture, streams of arbitrary length
  • stream is specified by a set of stream control registers
  • conversion between data types in hardware
  • no loop control and address generation overhead

The Need for a Parallel Computation Model • Parallel computing has not been very successful • One reason: lack of a standard parallel computation model • Properties that a suitable parallel computation model should possess: • Scalability • Portability • Predictability • Model proposed by Valiant (1990) • Bulk-Synchronous Parallel (BSP) model

BSP Model
• BSP architectural model
  • set of p processors communicating by sending point-to-point messages
• BSP programming model
  • computations proceed in phases (supersteps), separated by barrier synchronizations
• BSP cost model
  • superstep takes time w + g · h + L, where
    • w: max. work
    • h: max. messages (h-relation)
    • g: bandwidth reciprocal
    • L: latency/synchronization cost
[Figure: processors (P) with memories (M) attached to a communication network; supersteps separated by barrier syncs]

PUB Library • Paderborn University BSP (PUB) library (IPDPS’99) basics: • SPMD • no receive operation; barrier synchronization signifies end of all communication operations • only non-blocking communication primitives • buffered and unbuffered communication • message is placed in buffer associated with destination processor from which it can be retrieved after the next barrier sync • Additional features: • (non-blocking) collective communication primitives • ability to partition the processors • running different BSP computations on the same system (in different threads)

PUB Example: Parallel Binary Multisearch
• Search butterfly:
[Figure: search butterfly across Proc 0, Proc 1, Proc 2, and Proc 3, each holding a local search tree; key values not reproduced]

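Parallel Binary Multisearch Using PUB

    /* Route m queries through level d of the search butterfly, then recurse. */
    void bin_search(int d, int m)
    {
        int i, new_m;

        /* Queries that belong to the other half go to the partner processor. */
        for (i = new_m = 0; i < m; i++)
            if ((query[i] <= gkey[d] && inRight(d, me)) ||
                (query[i] >  gkey[d] && inLeft(d, me)))
                bsp_send(&bsp, Opposite(d, me), &query[i], sizeof(int));
            else
                query[new_m++] = query[i];

        bsp_sync(&bsp);   /* barrier: sent queries can be retrieved afterwards */

        /* Append the queries received from the partner. */
        for (i = 0; i < bsp_nmsgs(&bsp); i++) {
            msg = bsp_getmsg(&bsp, i);
            query[new_m++] = *(int *)bspmsg_data(msg);
        }

        if (d == 0)
            local_search(new_m, query, n, key);
        else
            bin_search(d - 1, new_m);
    }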

Concluding Remarks • Not discussed: • testing • ISA extensions for sparse matrix computations • computer arithmetic using single-electron technology • reconfigurable processors • network processors • low power • ... • For further information, please contact me (benj@ce.et.tudelft.nl) or see • ce.et.tudelft.nl • ce.et.tudelft.nl/~benj • www.upb.de/~pub Thank You
