Matrix multiplication implemented in data flow technology

Aleksandar Milinković

Belgrade University, School of Electrical Engineering

[email protected]


Introduction

  • Problem with big data

  • Need to change computing paradigm

  • Data flow instead of control flow

  • Achieved by constructing a dataflow graph

  • Graph nodes (vertices) perform computations

  • Each node is one deep pipeline
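
As a loose software analogy for the graph described above (not the hardware design from the slides), the idea can be sketched with Python generators: each node consumes input streams and emits results, and the computation is defined by how the nodes are wired together. The node names `source`, `multiply`, and `running_sum` are hypothetical.

```python
from typing import Iterable, Iterator


def source(values: Iterable[float]) -> Iterator[float]:
    """Input node: streams values into the graph one at a time."""
    yield from values


def multiply(a: Iterator[float], b: Iterator[float]) -> Iterator[float]:
    """Compute node: emits the product of each pair of incoming values."""
    for x, y in zip(a, b):
        yield x * y


def running_sum(stream: Iterator[float]) -> Iterator[float]:
    """Compute node: emits the accumulated sum after every incoming value."""
    total = 0.0
    for x in stream:
        total += x
        yield total


def dot_product_graph(row: Iterable[float], col: Iterable[float]) -> float:
    """Wire the nodes into a graph and return the last value it produces."""
    products = multiply(source(row), source(col))
    result = 0.0
    for result in running_sum(products):
        pass  # intermediate sums flow through; only the final one is kept
    return result


print(dot_product_graph([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

The dot product is used here because it is the inner operation of matrix multiplication; in hardware each node would be a pipeline rather than a generator.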


Dataflow computation

  • Dependencies are resolved at compile time

  • No new dependencies are created at run time

  • The whole mechanism is one deep pipeline

  • Pipeline stages perform computations in parallel

  • The data flow produces one result per cycle
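
The "one result per cycle" behaviour can be illustrated with a toy cycle-by-cycle model. This is a minimal sketch, assuming a pipeline whose stages each add 1 to the value passing through; the function name `simulate_pipeline` and the stage operation are hypothetical and are not taken from the FPGA kernel.

```python
from typing import List, Optional


def simulate_pipeline(inputs: List[int], depth: int) -> List[Optional[int]]:
    """Return what leaves the pipeline on each clock cycle (None = nothing yet).

    Each stage adds 1, so an input x leaves the last stage as x + depth.
    """
    stages: List[Optional[int]] = [None] * depth
    outputs: List[Optional[int]] = []
    pending = list(inputs)
    # The pipeline needs depth + N - 1 cycles to drain N inputs.
    total_cycles = depth + len(inputs) - 1 if inputs else 0
    for _ in range(total_cycles):
        # On every clock tick each value moves one stage forward ...
        incoming = pending.pop(0) if pending else None
        stages = [incoming] + stages[:-1]
        # ... and all stages do their work in the same cycle.
        stages = [None if v is None else v + 1 for v in stages]
        outputs.append(stages[-1])
    return outputs


print(simulate_pipeline([10, 20, 30, 40], depth=3))
# [None, None, 13, 23, 33, 43]
```

After the initial latency of `depth - 1` empty cycles, a result leaves the pipeline on every cycle, so the fill time becomes negligible for long input streams.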


Matrix multiplication

  • Data flow doesn’t suit all situations

  • However, it is applicable in a lot of cases:

    • Partial differential equations

    • 3D finite differences

    • Finite element method

    • Problems in bioinformatics, etc.

  • Most of them contain matrix multiplications

  • Goal: implementation on an FPGA, using data flow
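
For reference, the operation being accelerated is the standard triple-loop product. The sketch below is a plain control-flow baseline, not the code that was benchmarked; the name `matmul_reference` is hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul_reference(x: Matrix, y: Matrix) -> Matrix:
    """Multiply an n x k matrix x by a k x m matrix y with three nested loops."""
    n, k, m = len(x), len(y), len(y[0])
    if any(len(row) != k for row in x):
        raise ValueError("inner dimensions of x and y do not match")
    result = [[0.0] * m for _ in range(n)]
    for i in range(n):          # row of x
        for j in range(m):      # column of y
            total = 0.0
            for p in range(k):  # one multiply-add per step
                total += x[i][p] * y[p][j]
            result[i][j] = total
    return result


print(matmul_reference([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19.0, 22.0], [43.0, 50.0]]
```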


Project realizations

  • Two solutions:

    • Maximal utilization of the part of the matrix held on chip

      • Matrices with small dimensions

      • Matrices with large dimensions

    • Multiplication using parallel pipelines


Good chip utilization A

  • A set of columns is kept on the chip until it is fully used

  • Every pipe calculates 48 sums at a time

  • Equivalent to 2 processors with 48 cores

  • Additional parallelization is possible
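
One way to read the bullets above, sketched in software: a tile of 48 columns is loaded once (standing in for on-chip memory), every row of the other matrix is streamed past it, and 48 running sums are updated per step. The tile width of 48 comes from the slide; the loop order and the names `matmul_column_tiles` and `TILE` are assumptions, not the actual kernel.

```python
from typing import List

Matrix = List[List[float]]

TILE = 48  # sums computed per pipe at a time, as stated on the slide


def matmul_column_tiles(x: Matrix, y: Matrix, tile: int = TILE) -> Matrix:
    """Compute x * y while keeping one tile of columns of y resident at a time."""
    n, k, m = len(x), len(y), len(y[0])
    result = [[0.0] * m for _ in range(n)]
    for j0 in range(0, m, tile):
        j1 = min(j0 + tile, m)
        # "Load" a set of columns of y; it stays resident until fully used.
        resident = [row[j0:j1] for row in y]
        for i in range(n):
            sums = [0.0] * (j1 - j0)  # up to `tile` running sums at once
            for p in range(k):
                x_ip = x[i][p]
                for c in range(j1 - j0):  # performed in parallel in hardware
                    sums[c] += x_ip * resident[p][c]
            result[i][j0:j1] = sums
    return result


x = [[1, 2, 3], [4, 5, 6]]
y = [[1, 0, 2], [0, 1, 3], [1, 1, 1]]
print(matmul_column_tiles(x, y, tile=2))
# [[4.0, 5.0, 11.0], [10.0, 11.0, 29.0]]
```

Each column tile is read from the large memory only once, which is the point of keeping it on chip until it is fully used.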



Good chip utilization A

  • Chip utilization and acceleration

  • LUTs: 195345/297600 (65.64%)

  • FFs: 290689/595200 (48.83%)

  • BRAMs: 778/1064 (73.12%)

  • DSPs: 996/2016 (49.40%)

  • Matrix: 2304 x 2304

    • Intel: 42.5 s

    • MAX3: 2.38 s

  • Acceleration at kernel clock 75 MHz: ≈ 18x


Good chip utilization B

  • Part of matrix Y is on chip during computation

  • Each pipe calculates 48 sums at a time

  • Equivalent to 2 processors with 48 cores



Good chip utilization B

  • Chip utilization and acceleration

  • LUTs: 201237/297600 (67.62%)

  • FFs: 302742/595200 (50.86%)

  • BRAMs: 782/1064 (73.50%)

  • DSPs: 1021/2016 (50.64%)

  • Matrix: 2304 x 2304

    • Intel: 42.5 s

    • MAX3: 2.38 s

  • Acceleration at kernel clock 75 MHz: ≈ 18x

  • Matrix: 4608 x 4608

    • Intel: 1034 s

    • MAX3: 58.41 s


Multiple parallel pipelines

  • Matrices reside exclusively in the large memory

  • Each pipe calculates one sum at a time

  • Equivalent to 48 processors with one core
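
A minimal software sketch of this arrangement, assuming the result elements are handed out to the pipes in row-major batches of 48 (the slides do not say how work is distributed). The pipe count of 48 comes from the slide; the names `matmul_parallel_pipes` and `PIPES` are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]

PIPES = 48  # number of parallel pipelines, from the slide


def matmul_parallel_pipes(x: Matrix, y: Matrix, pipes: int = PIPES) -> Matrix:
    """Compute x * y with `pipes` lanes, each accumulating one result element."""
    n, k, m = len(x), len(y), len(y[0])
    result = [[0.0] * m for _ in range(n)]
    # All (row, column) positions of the result, in row-major order.
    targets: List[Tuple[int, int]] = [(i, j) for i in range(n) for j in range(m)]
    for start in range(0, len(targets), pipes):
        batch = targets[start:start + pipes]  # one output element per pipe
        sums = [0.0] * len(batch)
        for p in range(k):  # one multiply-accumulate step per pipe
            for lane, (i, j) in enumerate(batch):  # lanes run concurrently in hardware
                sums[lane] += x[i][p] * y[p][j]
        for lane, (i, j) in enumerate(batch):
            result[i][j] = sums[lane]
    return result


x = [[1, 2, 3], [4, 5, 6]]
y = [[1, 0, 2], [0, 1, 3], [1, 1, 1]]
print(matmul_parallel_pipes(x, y, pipes=4))
# [[4.0, 5.0, 11.0], [10.0, 11.0, 29.0]]
```

Unlike the first solution, nothing is held on chip between batches, so operands are fetched from the large memory for every sum.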



Multiple parallel pipelines

  • Chip utilization and acceleration

  • LUTs: 166328/297600 (55.89%)

  • FFs: 248047/595200 (41.67%)

  • BRAMs: 430/1064 (40.41%)

  • DSPs: 489/2016 (24.26%)

  • Matrix: 2304 x 2304

    • Intel: 42.5 s

    • MAX3: 4.08 s

  • Acceleration at kernel clock 150 MHz: > 10x

  • Matrix: 4608 x 4608

    • Intel: 1034 s

    • MAX3: 98.48 s


Comparison of solutions

  • First solution:

    • Good chip utilization

    • Shorter execution time

  • Drawback: matrices up to 8 GB

  • Second solution: matrices up to 12 GB

  • Drawback: longer execution time
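
The execution-time side of this comparison can be checked directly from the timings on the result slides. The short script below only divides the reported numbers; the dictionary layout and the name `speedup` are illustrative, and the 4608 x 4608 timing for the first solution is the one reported for variant B.

```python
# Execution times in seconds (Intel, MAX3), copied from the result slides.
TIMINGS = {
    ("good chip utilization", 2304): (42.5, 2.38),
    ("good chip utilization", 4608): (1034.0, 58.41),
    ("parallel pipelines", 2304): (42.5, 4.08),
    ("parallel pipelines", 4608): (1034.0, 98.48),
}


def speedup(cpu_seconds: float, dfe_seconds: float) -> float:
    """Ratio of CPU time to dataflow-engine time."""
    return cpu_seconds / dfe_seconds


for (solution, n), (cpu, dfe) in TIMINGS.items():
    print(f"{solution:22s} n={n}: {speedup(cpu, dfe):.1f}x")
```

This gives roughly 17.9x and 17.7x for the first solution and 10.4x and 10.5x for the second, consistent with the "≈ 18x" and "> 10x" figures on the slides.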


Conclusions

  • Matrix multiplication is an operation with complexity O(n^3)

  • Part of the complexity is moved from time to space

  • That produces acceleration (shorter execution time)

  • Achieved by applying data flow technology

  • Developed using the tool chain from Maxeler Technologies

  • Calculations are an order of magnitude faster than on an Intel Xeon
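
The time-to-space trade can be made concrete with a back-of-the-envelope model. This is a sketch under stated assumptions, not a measurement: every lane is assumed to complete one multiply-add per clock cycle, the first solution is read as 2 pipes x 48 sums = 96 lanes at 75 MHz, and the second as 48 lanes at 150 MHz. The function name `ideal_seconds` is hypothetical.

```python
def ideal_seconds(n: int, lanes: int, clock_hz: float) -> float:
    """Lower bound on run time if every lane finishes one multiply-add per cycle."""
    multiply_adds = n ** 3  # O(n^3) work for an n x n product
    return multiply_adds / (lanes * clock_hz)


n = 2304
print(f"1 lane   @ 75 MHz:  {ideal_seconds(n, 1, 75e6):.1f} s")
print(f"96 lanes @ 75 MHz:  {ideal_seconds(n, 96, 75e6):.2f} s")   # assumed: 2 pipes x 48 sums
print(f"48 lanes @ 150 MHz: {ideal_seconds(n, 48, 150e6):.2f} s")  # assumed: 48 pipes x 1 sum
```

Under these assumptions a single lane would need about 163 s, while spreading the same work over 96 lanes brings the bound down to about 1.7 s. The measured MAX3 times (2.38 s and 4.08 s) lie above this bound, which is expected because the model ignores data movement and pipeline fill.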

