# Transforming Linear Algebra Libraries: From Abstraction to Parallelism

Ernie Chan

Motivation

Statically

Outline

• Inversion of a Triangular Matrix

• Requisite Semantic Information

• Static Generation of a Directed Acyclic Graph

• Performance

• Conclusion

Inversion of a Triangular Matrix

• Formal Linear Algebra Methods Environment (FLAME)

• High-level abstractions for expressing linear algebra algorithms

• Triangular Inversion (Trinv)

$R := U^{-1}$

Inversion of a Triangular Matrix

• LAPACK-style Implementation

      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DTRSM( 'Left', 'Upper', 'No transpose', 'Non-unit',
     $               JB, N-J-JB+1, -ONE, A( J, J ), LDA,
     $               A( J, J+JB ), LDA )
         CALL DGEMM( 'No transpose', 'No transpose',
     $               J-1, N-J-JB+1, JB, ONE, A( 1, J ), LDA,
     $               A( J, J+JB ), LDA, ONE, A( 1, J+JB ), LDA )
         CALL DTRSM( 'Right', 'Upper', 'No transpose', 'Non-unit',
     $               J-1, JB, ONE, A( J, J ), LDA,
     $               A( 1, J ), LDA )
         CALL DTRTI2( 'Upper', 'Non-unit',
     $                JB, A( J, J ), LDA, INFO )
      ENDDO
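The four updates in the loop above can be checked on a small case. The following is a minimal pure-Python sketch, assuming a block size of 1 so that every "block" is a single scalar; the names `trinv_upper` and `matmul` are illustrative and do not come from LAPACK or FLAME.

```python
def trinv_upper(U):
    """Invert an upper-triangular matrix given as a list of rows.

    Follows the same four updates as the blocked loop, with block size 1.
    """
    n = len(U)
    A = [row[:] for row in U]              # work on a copy
    for j in range(n):
        d = A[j][j]                        # A11
        # Trsm (left):  A12 := -inv(A11) * A12
        for c in range(j + 1, n):
            A[j][c] = -A[j][c] / d
        # Gemm:         A02 := A02 + A01 * A12
        for r in range(j):
            for c in range(j + 1, n):
                A[r][c] += A[r][j] * A[j][c]
        # Trsm (right): A01 := A01 * inv(A11)
        for r in range(j):
            A[r][j] /= d
        # Trinv:        A11 := inv(A11)
        A[j][j] = 1.0 / d
    return A


def matmul(X, Y):
    """Dense product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


U = [[2.0, 1.0, 3.0],
     [0.0, 4.0, 5.0],
     [0.0, 0.0, 8.0]]
R = trinv_upper(U)
P = matmul(U, R)                           # should be the identity
```

Multiplying `U` by the result `R` gives the identity matrix, which is a quick way to confirm that the order of the updates (Gemm before the right-hand Trsm) matters and is preserved.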

Inversion of a Triangular Matrix

• FLASH

• Matrix of matrices

Inversion of a Triangular Matrix

    FLA_Part_2x2( A,    &ATL, &ATR,
                        &ABL, &ABR,     0, 0, FLA_TL );

    while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
    {
      FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                          /* ************* */   /* ******************** */
                                                  &A10, /**/ &A11, &A12,
                             ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                             1, 1, FLA_BR );
      /*-------------------------------------------------------*/
      FLASH_Trsm( FLA_LEFT, FLA_UPPER_TRIANGULAR,
                  FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
                  FLA_MINUS_ONE, A11, A12 );
      FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
                  FLA_ONE, A01, A12, FLA_ONE, A02 );
      FLASH_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
                  FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
                  FLA_ONE, A11, A01 );
      FLASH_Trinv( FLA_UPPER_TRIANGULAR, FLA_NONUNIT_DIAG, A11 );
      /*-------------------------------------------------------*/
      FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                       A10, A11, /**/ A12,
                              /* ************** */  /* ****************** */
                                &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                                FLA_TL );
    }

Inversion of a Triangular Matrix

• Extensible Markup Language (XML)

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <Function name="FLA_Trinv" type="blk" variant="3">
      <Option type="uplo">FLA_UPPER_TRIANGULAR</Option>
      <Declaration>
        <Operand type="matrix" direction="TL->BR" inout="both">A</Operand>
      </Declaration>
      <Loop>
        <Guard>A</Guard>
        <Update>
          <Statement name="FLA_Trsm">
            <Option type="side">FLA_LEFT</Option>
            <Option type="uplo">FLA_UPPER_TRIANGULAR</Option>
            <Option type="trans">FLA_NO_TRANSPOSE</Option>
            <Option type="diag">FLA_NONUNIT_DIAG</Option>
            <Parameter>FLA_MINUS_ONE</Parameter>
            <Parameter partition="11">A</Parameter>
            <Parameter partition="12">A</Parameter>
          </Statement>
          <Statement name="FLA_Gemm">
            <Option type="trans">FLA_NO_TRANSPOSE</Option>
            <Option type="trans">FLA_NO_TRANSPOSE</Option>
            <Parameter>FLA_ONE</Parameter>
            <Parameter partition="01">A</Parameter>
            <Parameter partition="12">A</Parameter>
            <Parameter>FLA_ONE</Parameter>
            <Parameter partition="02">A</Parameter>
          </Statement>
          <Statement name="FLA_Trsm">
            <Option type="side">FLA_RIGHT</Option>
            <Option type="uplo">FLA_UPPER_TRIANGULAR</Option>
            <Option type="trans">FLA_NO_TRANSPOSE</Option>
            <Option type="diag">FLA_NONUNIT_DIAG</Option>
            <Parameter>FLA_ONE</Parameter>
            <Parameter partition="11">A</Parameter>
            <Parameter partition="01">A</Parameter>
          </Statement>
          <Statement name="FLA_Trinv">
            <Option type="uplo">FLA_UPPER_TRIANGULAR</Option>
            <Option type="diag">FLA_NONUNIT_DIAG</Option>
            <Parameter partition="11">A</Parameter>
          </Statement>
        </Update>
      </Loop>
    </Function>
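To see how a description like this could be consumed by a tool, here is a hedged sketch that reads an abridged copy of the XML with Python's standard `xml.etree.ElementTree` module and lists each statement with its operands. The helper name `statements` and the convention of joining the operand name with its `partition` attribute (giving `A11`, `A12`, ...) are assumptions made for illustration, not the presenter's code generator.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the description above (XML declaration omitted).
XML = """<Function name="FLA_Trinv" type="blk" variant="3">
  <Loop>
    <Guard>A</Guard>
    <Update>
      <Statement name="FLA_Trsm">
        <Option type="side">FLA_LEFT</Option>
        <Parameter>FLA_MINUS_ONE</Parameter>
        <Parameter partition="11">A</Parameter>
        <Parameter partition="12">A</Parameter>
      </Statement>
      <Statement name="FLA_Trinv">
        <Option type="uplo">FLA_UPPER_TRIANGULAR</Option>
        <Parameter partition="11">A</Parameter>
      </Statement>
    </Update>
  </Loop>
</Function>"""


def statements(xml_text):
    """Return [(statement name, [operand, ...]), ...] in document order."""
    root = ET.fromstring(xml_text)
    result = []
    for stmt in root.iter("Statement"):
        operands = []
        for p in stmt.findall("Parameter"):
            part = p.get("partition")
            # "A" with partition "11" becomes the submatrix name "A11".
            operands.append(p.text + part if part else p.text)
        result.append((stmt.get("name"), operands))
    return result


calls = statements(XML)
```

Each entry of `calls` pairs a routine name with the scalar constants and submatrix names it is applied to, which is the information a generator would need to emit the corresponding FLASH call.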

Requisite Semantic Information

• Partitioning Scheme

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <Function name="FLA_Trinv" type="blk" variant="3">
      <Option type="uplo">FLA_UPPER_TRIANGULAR</Option>
      <Declaration>
        <Operand type="matrix" direction="TL->BR" inout="both">A</Operand>
      </Declaration>
      <Loop>
        <Guard>A</Guard> <!-- while m( ATL ) < m( A ) -->
        <Update>
          <Statement name="FLA_Trsm">
            <!-- 'Left', 'Upper', 'No transpose', 'Non-unit', -ONE, A11, A12 -->
          </Statement>
          <Statement name="FLA_Gemm">
            <!-- 'No transpose', 'No transpose', ONE, A01, A12, ONE, A02 -->
          </Statement>
          <Statement name="FLA_Trsm">
            <!-- 'Right', 'Upper', 'No transpose', 'Non-unit', ONE, A11, A01 -->
          </Statement>
          <Statement name="FLA_Trinv">
            <!-- 'Upper', 'Non-unit', A11 -->
          </Statement>
        </Update>
      </Loop>
    </Function>

Requisite Semantic Information

• Problem Size*

Requisite Semantic Information

• Input and Output Parameters

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <Function name="FLA_Trsm">
      <Declaration>
        <Operand type="scalar" inout="in">alpha</Operand>
        <Operand type="matrix" inout="in">A</Operand>
        <Operand type="matrix" inout="both">B</Operand>
      </Declaration>
    </Function>
    <Function name="FLA_Gemm">
      <Declaration>
        <Operand type="scalar" inout="in">alpha</Operand>
        <Operand type="matrix" inout="in">A</Operand>
        <Operand type="matrix" inout="in">B</Operand>
        <Operand type="scalar" inout="in">beta</Operand>
        <Operand type="matrix" inout="both">C</Operand>
      </Declaration>
    </Function>
    <Function name="FLA_Trinv">
      <Declaration>
        <Operand type="matrix" inout="both">A</Operand>
      </Declaration>
    </Function>
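The `inout` annotations are what allow a tool to decide which operands a call reads and which it writes. The sketch below illustrates one way to do that with the standard `xml.etree.ElementTree` module. The `<Library>` wrapper element, the helper names, and the handling of an `"out"` value are assumptions added for illustration; the declarations shown on the slide use only `"in"` and `"both"`.

```python
import xml.etree.ElementTree as ET

# The <Library> root is hypothetical; it only groups the three declarations.
DECLS = """<Library>
  <Function name="FLA_Trsm"><Declaration>
    <Operand type="scalar" inout="in">alpha</Operand>
    <Operand type="matrix" inout="in">A</Operand>
    <Operand type="matrix" inout="both">B</Operand>
  </Declaration></Function>
  <Function name="FLA_Gemm"><Declaration>
    <Operand type="scalar" inout="in">alpha</Operand>
    <Operand type="matrix" inout="in">A</Operand>
    <Operand type="matrix" inout="in">B</Operand>
    <Operand type="scalar" inout="in">beta</Operand>
    <Operand type="matrix" inout="both">C</Operand>
  </Declaration></Function>
  <Function name="FLA_Trinv"><Declaration>
    <Operand type="matrix" inout="both">A</Operand>
  </Declaration></Function>
</Library>"""


def load_signatures(xml_text):
    """Map each function name to its ordered list of (type, inout) pairs."""
    sigs = {}
    for fn in ET.fromstring(xml_text).iter("Function"):
        sigs[fn.get("name")] = [(op.get("type"), op.get("inout"))
                                for op in fn.iter("Operand")]
    return sigs


def access_sets(sigs, name, args):
    """Return (reads, writes) for one call, matching args by position."""
    reads, writes = set(), set()
    for (kind, inout), arg in zip(sigs[name], args):
        if kind != "matrix":
            continue                    # scalars carry no block dependencies
        if inout in ("in", "both"):
            reads.add(arg)
        if inout in ("out", "both"):    # "out" is assumed, not on the slide
            writes.add(arg)
    return reads, writes


SIGS = load_signatures(DECLS)
gemm_reads, gemm_writes = access_sets(
    SIGS, "FLA_Gemm", ["FLA_ONE", "A01", "A12", "FLA_ONE", "A02"])
```

For the Gemm update this yields reads of `A01`, `A12` and `A02` and a write to `A02`, which is the input a dependence analysis would start from.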

Static Generation of a DAG

• Code Generation

• Convert XML representation to FLASH code generation intermediary

• Annotated with input and output information

• Create directed acyclic graph (DAG) by statically unrolling the loop

• Operations on submatrix blocks (tasks) are vertices

• Data dependencies between tasks are edges

Static Generation of a DAG

• Data Dependencies

• Flow (read-after-write)

S1: A = B + C;

S2: D = A + E;

• Anti (write-after-read)

S3: F = A + G;

S4: A = H + I;

• Output (write-after-write)

S5: A = J + K;

S6: A = L + M;
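The statement pairs above can be classified mechanically from what each statement writes and reads. The following is a small sketch of that classification, using the three standard dependence kinds (flow, anti, output); reading S1/S2 as the flow pair and S3/S4 as the anti pair is an interpretation of the slide, and the function name `dependencies` is illustrative.

```python
# Each statement: (label, variable written, variables read).
S1 = ("S1", "A", ("B", "C"))
S2 = ("S2", "D", ("A", "E"))
S3 = ("S3", "F", ("A", "G"))
S4 = ("S4", "A", ("H", "I"))
S5 = ("S5", "A", ("J", "K"))
S6 = ("S6", "A", ("L", "M"))


def dependencies(first, second):
    """Classify the dependencies of `second` on the earlier statement `first`."""
    _, w1, r1 = first
    _, w2, r2 = second
    kinds = []
    if w1 in r2:
        kinds.append("flow")     # read-after-write
    if w2 in r1:
        kinds.append("anti")     # write-after-read
    if w1 == w2:
        kinds.append("output")   # write-after-write
    return kinds


flow_pair = dependencies(S1, S2)
anti_pair = dependencies(S3, S4)
output_pair = dependencies(S5, S6)
```

Any of the three kinds forces the second statement to wait for the first, so each becomes an edge when the statements are tasks in a DAG.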

HIPS 2010

Static Generation of a DAG

• Problem Size

• Problem size cannot be determined a priori

• Fix the block size or loop unrolling factor

• Balance between instruction footprint and data granularity of tasks

• Example

• Trinv on 3x3 matrix of blocks
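Once the number of blocks is fixed, the loop can be unrolled ahead of time into a flat list of tasks. The sketch below does this for the Trinv loop on a matrix of `nb` x `nb` blocks. The function name `unroll_trinv`, the `(row, column)` block indices, and the row-major order in which a multi-block operation is split into per-block tasks are assumptions chosen for illustration.

```python
def unroll_trinv(nb):
    """Unroll the Trinv loop over an nb x nb matrix of blocks.

    Returns a list of (task name, operand blocks); blocks are (row, col)
    index pairs and the last block of each task is the one it updates.
    """
    tasks = []

    def emit(op, *blocks):
        tasks.append((f"{op}{len(tasks)}", blocks))

    for k in range(nb):
        # In iteration k: A11 = (k, k), A12 = row k right of the diagonal,
        # A01 = column k above the diagonal, A02 = the blocks above A12.
        for j in range(k + 1, nb):          # Trsm (left):  A12 := -inv(A11) A12
            emit("Trsm", (k, k), (k, j))
        for i in range(k):                  # Gemm:         A02 := A02 + A01 A12
            for j in range(k + 1, nb):
                emit("Gemm", (i, k), (k, j), (i, j))
        for i in range(k):                  # Trsm (right): A01 := A01 inv(A11)
            emit("Trsm", (k, k), (i, k))
        emit("Trinv", (k, k))               # Trinv:        A11 := inv(A11)
    return tasks


tasks = unroll_trinv(3)
names = [name for name, _ in tasks]
```

For three blocks per dimension this produces ten tasks, numbered in the order they are generated, with one Gemm appearing only in the second iteration.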

Static Generation of a DAG

• Trinv

• Iteration 1: Trsm0, Trsm1, Trinv2

• Iteration 2: Trsm3, Gemm4, Trsm5, Trinv6

• Iteration 3: Trsm7, Trsm8, Trinv9

• Complete DAG (figure): vertices Trsm0, Trsm1, Trinv2, Trsm3, Gemm4, Trsm5, Trinv6, Trsm7, Trsm8, Trinv9
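The edges between these ten tasks follow from which blocks each task reads and writes. The sketch below hard-codes read and write sets for the 3x3 case and derives the edges by tracking, per block, the last writer and the readers since that write. The read/write sets are an interpretation of the four updates (each Trsm reads the diagonal block and updates its other operand in place, and so on), and `build_dag` is an illustrative name rather than part of FLAME.

```python
# (task, blocks read, blocks written); blocks are (row, col) indices.
TASKS = [
    ("Trsm0",  [(0, 0), (0, 1)],         [(0, 1)]),
    ("Trsm1",  [(0, 0), (0, 2)],         [(0, 2)]),
    ("Trinv2", [(0, 0)],                 [(0, 0)]),
    ("Trsm3",  [(1, 1), (1, 2)],         [(1, 2)]),
    ("Gemm4",  [(0, 1), (1, 2), (0, 2)], [(0, 2)]),
    ("Trsm5",  [(1, 1), (0, 1)],         [(0, 1)]),
    ("Trinv6", [(1, 1)],                 [(1, 1)]),
    ("Trsm7",  [(2, 2), (0, 2)],         [(0, 2)]),
    ("Trsm8",  [(2, 2), (1, 2)],         [(1, 2)]),
    ("Trinv9", [(2, 2)],                 [(2, 2)]),
]


def build_dag(tasks):
    """Return {(src, dst): set of dependence kinds} for tasks in program order."""
    edges = {}
    last_writer = {}   # block -> task that last wrote it
    readers = {}       # block -> tasks that read it since that write
    for name, reads, writes in tasks:
        for b in reads:
            if b in last_writer:
                edges.setdefault((last_writer[b], name), set()).add("flow")
        for b in writes:
            for r in readers.get(b, []):
                edges.setdefault((r, name), set()).add("anti")
            if b in last_writer:
                edges.setdefault((last_writer[b], name), set()).add("output")
        for b in writes:
            last_writer[b] = name
            readers[b] = []
        for b in reads:
            if b not in writes:
                readers.setdefault(b, []).append(name)
    return edges


edges = build_dag(TASKS)
# Tasks with no incoming edge can start immediately.
roots = [t[0] for t in TASKS if all(dst != t[0] for _, dst in edges)]
```

Under these assumptions three tasks have no predecessors and can run at once, and the edge into each Trinv task is an anti-dependence: the Trsm tasks must read the diagonal block before it is overwritten with its inverse.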

Performance

• LabVIEW

• Graphical, data flow programming language (G)

• Anti-dependencies cannot exist in G

• Copies are made when wire is split
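A rough analogy in plain Python (not LabVIEW or G code) shows why copying removes anti-dependencies. It reuses the pair `F = A + G` followed by `A = H + I`: with one shared variable `A` the order matters, whereas if the reader holds its own copy of the old value (modeled here as separate wires `A1` and `A2`) the two statements can run in either order. All names and values are made up for the illustration.

```python
def run_shared(order):
    """Both statements use the same storage location for A."""
    mem = {"A": 1, "G": 10, "H": 100, "I": 1000}
    for s in order:
        if s == "S3":
            mem["F"] = mem["A"] + mem["G"]      # F = A + G
        else:
            mem["A"] = mem["H"] + mem["I"]      # A = H + I
    return mem["F"]


def run_wires(order):
    """Each value lives on its own wire; the new A does not replace the old."""
    wires = {"A1": 1, "G": 10, "H": 100, "I": 1000}
    for s in order:
        if s == "S3":
            wires["F"] = wires["A1"] + wires["G"]    # reads the old value
        else:
            wires["A2"] = wires["H"] + wires["I"]    # produces a new value
    return wires["F"]


shared_results = (run_shared(["S3", "S4"]), run_shared(["S4", "S3"]))
wire_results = (run_wires(["S3", "S4"]), run_wires(["S4", "S3"]))
```

With shared storage the swapped order changes `F`; with separate wires it does not, at the cost of the extra copy.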

Performance

• Target Architecture

• 16-core AMD processor

• 1.9 GHz

• 4 GB of RAM per socket

• LabVIEW 8.6

• Windows XP

• Basic Linear Algebra Subprograms (BLAS)

• MKL 7.2

Performance

• Results

• Parallelism

• Exploit parallelism inherent within DAG

• Hierarchical matrix storage

• Spatial locality

• Copy matrix from flat row-major storage to hierarchical matrix and back
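The copy between flat and hierarchical storage can be sketched as follows. This assumes a square `n` x `n` matrix, a block size `b` that divides `n`, and row-major order inside each block; the function names `to_blocks` and `from_blocks` are illustrative and are not the FLASH API.

```python
def to_blocks(flat, n, b):
    """Copy an n x n row-major list into a grid of contiguous b x b blocks."""
    assert n % b == 0
    nb = n // b
    return [[[flat[(bi * b + i) * n + bj * b + j]
              for i in range(b) for j in range(b)]
             for bj in range(nb)]
            for bi in range(nb)]


def from_blocks(blocks, n, b):
    """Copy a grid of b x b blocks back into one n x n row-major list."""
    flat = [0] * (n * n)
    for bi, block_row in enumerate(blocks):
        for bj, blk in enumerate(block_row):
            for i in range(b):
                for j in range(b):
                    flat[(bi * b + i) * n + bj * b + j] = blk[i * b + j]
    return flat


flat = list(range(16))            # 4 x 4 matrix, entries 0..15
blocks = to_blocks(flat, 4, 2)    # 2 x 2 grid of 2 x 2 blocks
round_trip = from_blocks(blocks, 4, 2)
```

Each block ends up contiguous in memory, which is where the spatial locality comes from; the round trip reproduces the original flat matrix.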

Conclusion

• Instantiate linear algebra algorithm using a code generation intermediary

• Statically produce a directed acyclic graph by fixing block size or loop unrolling factor

XML → FLASH → DAG

Acknowledgments

• Jim Nagle, Robert van de Geijn

• We thank the other members of the FLAME team for their support

• Funding

• National Instruments

• NSF Grants

• CCF-0540926

• CCF-0702714
