
Domain-Specific Languages for Ubiquitous Parallelism





Presentation Transcript


  1. Domain-Specific Languages for Ubiquitous Parallelism Calvin Lin University of Texas at Austin September 6, 2007

  2. Domain-Specific Languages
  • Sam Midkiff: “Domain-specific languages are good”
  • Jim Larus: “We don’t want to rely on people like Sam to build domain-specific languages”
  • How can we simplify the creation of domain-specific languages?

  3. Libraries are Domain-Specific Languages
  • Two issues: no new syntax, no compiler support
  • Libraries encapsulate domain-specific semantics
  • This semantic information provides many opportunities for analysis and optimization
  • This information is unavailable to conventional compilers

      c = a * b;            /* language primitive */
      bnMultiply(c, a, b);  /* library call       */

  4. Our Solution
  • The Broadway Compiler extends the power of compilers to library operations
  • Inputs: the integrated application and library, plus annotations carrying domain-specific information

  5. Separation of Concerns
  • Mortal programmers write the integrated application and library; a compiler writer builds the Broadway Compiler; a domain expert writes the annotations
  • One compiler for all libraries; one set of annotations per library
  • The hard parts are reused many times and hidden from the mortals

  6. Outline
  • Motivation
  • Our Solution
  • Example: Optimizing PLAPACK applications
  • Results
  • Looking to the Future

  7. Bird’s-Eye View of PLAPACK
  PLAPACK: dense parallel linear algebra library
  • Developed by van de Geijn, et al. [van de Geijn 1997]
  • Designed for high performance
  • ~40,000 lines of C code
  Layered design (top to bottom): Applications (LU, QR, Cholesky, ...); Parallel BLAS 1/2/3; local BLAS 1/2/3; MPI; utilities

  8. Typical PLAPACK Application

      while (1) {
        PLA_Obj_global_length(ABR, &length);
        if (length == 0) break;
        PLA_Obj_split_4(ABR, nb, nb, &A11, &A12,
                                     &A21, &A22);
        Cholesky(A11);
        PLA_Trsm(PLA_SIDE_RIGHT, PLA_LOW_TRIAN, PLA_TRANS,
                 PLA_NONUNIT_DIAG, one, A11, A21);
        PLA_Syrk(PLA_LOW_TRIAN, PLA_NO_TRANS, minus_one,
                 A21, one, ABR);
      }

  The objects A11, A12, A21, A22, and ABR are “views” of the data.

  9. Views in PLAPACK
  • The notion of views can be used to perform optimizations
  • Views can have special properties (e.g., local vs. distributed)
  • These properties can be reasoned about by programmers
  • These properties can be exploited by using special algorithms
  • These properties are an example of domain-specific information

      PLA_Trsm_local(PLA_SIDE_RIGHT, PLA_LOW_TRIAN, PLA_TRANS,
                     PLA_NONUNIT_DIAG, one, A11, A21);

  10. View-Based Optimizations
  Given the original program:

      PLA_Obj_view_all(A, &ABR);
      while (1) {
        PLA_Obj_length(ABR, &b);
        b = min(b, nb);
        if (b == 0) break;
        PLA_Obj_split_4(ABR, b, b, &A11, .., &A21, &ABR);
        Cholesky(A11);
        PLA_Trsm(PLA_SIDE_RIGHT, ...);
        PLA_Syrk(PLA_LOW_TRIAN, ...);
      }

  • PLA_Trsm() and PLA_Syrk() are overly general: they work for any distribution
  • The compiler analyzes the flow of view information through the program
  • The compiler determines when specialized routines can be used

  11. What Information Is Needed?
  • Define special properties: “Views can be local or distributed”
  • Specify how the library routines affect these properties: “Which routines create views, shrink views, etc.”
  • Specify when special routines can be used: “How can view information be used to invoke specialized routines?”

  12. How Do We Convey This Information?
  • Define special properties
  • Specify how the library routines affect these properties
  • Specify when special routines can be used

      property Distribution = {Local, Distributed, Empty};

      procedure PLA_Obj_split_4(obj, length, ...) {
        analyze Distribution {
          (view == Distributed) ==> view11 = Local;
        }
      }

      procedure PLA_Trsm(...) {
        specialize {
          (view == Local) ==> replace "PLA_Trsm_Local";
        }
      }

  13. Other Annotations
  • Basic annotations convey dependence information: defs and uses of procedure parameters, pointer relationships
  • These annotations are not domain-specific

      modify {};
      access {view};
      on_entry {obj --> view};
      on_exit  {A11 --> view11, A12 --> view12,
                A21 --> view21, A22 --> view22};

  14. Does It Work?
  • Comparison against a guru-optimized version written by the PLAPACK development team [Baker, et al. 1998]
  [Chart: MFLOPS (0–3000) vs. processors (0–40) for Cholesky (3072×3072), comparing Broadway, baseline, and guru-optimized versions] [Guyer and Lin 1999]

  15. A Closer Look at Performance (Cray T3E)
  • Improvement over clean, high-quality PLAPACK programs
  • PLA_Trsm() and PLA_Gemm() are specialized for their specific calling contexts
  [Chart: % improvement (0–400) vs. problem size (250–2500) for Cholesky, Lyapunov, Trsm, and Gemm]

  16. A Closer Look at Gemm (Cray T3E)
  • The Gemm algorithm is specialized (rank-k algorithm)
  • The broadcast is specialized (pipelined broadcast)
  • MPI_Send is specialized (asynchronous send)
  • Scalability is improved
  [Chart: MFLOPS (0–4000) vs. number of processors (4–36) for PLA_Gemm with and without the rank-k algorithm]

  17. Optimizing at Multiple Levels
  Levels of abstraction in PLAPACK (programmers should program at the top level):
  • Global: explicitly parallel, global matrix operations — e.g., PLA_Gemm(), optimized by Broadway (rank-k algorithm)
  • Local: matrices + high-level communication, MPI + local BLAS — e.g., MPI_Send()
  • C language: C primitives, optimized by the C compiler
  There is great benefit to optimizing at multiple levels of abstraction.

  18. What About Other Domains? • Is our approach general?

  19. Security Analysis as Data Flow Analysis
  • Example:

      int sock;
      char buffer[100];
      sock = socket(AF_INET, SOCK_STREAM, 0);
      read(sock, buffer, 100);
      execl(buffer);

  • Vulnerability: executes any remote command — what if this program runs as root?
  • Requirement: data from an Internet socket should not specify a program to execute
  • This is a domain-specific analysis
  • We can use the same annotation language and compiler

  20. Generating Tainted Data
  • Any external input is tainted
  • Examples: read(), fscanf(), readdir()
  • Taintedness is a property of the buffer, not the surface variable

      procedure read(fd, buffer_ptr, size) {
        on_entry { buffer_ptr --> buffer }
        analyze Taint { buffer <- Tainted }
      }

  21. Transmitting Taintedness
  • String manipulation can transmit taintedness
  • Examples: strcpy(), strdup(), strcat(), sprintf()

      procedure strcpy(dest, src) {
        on_entry { src  --> src_string,
                   dest --> dest_string }
        analyze Taint { dest_string <- src_string }
      }

  22. Reporting Vulnerabilities
  • Test the flow values
  • Tainted strings should not be passed to execl()
  • Reports the exact location of the problem

      procedure execl(buffer_ptr) {
        on_entry { buffer_ptr --> buffer }
        report if (Taint : buffer is-exactly Tainted)
          "Vulnerability at " ++ @context ++ ": Argument " ++
          [ buffer_ptr ] ++ " is tainted.\n";
      }

  23. Security Analysis with Broadway
  • Tested on actual programs that were distributed with the bug
  [Table: results for the format string vulnerability]

  24. Beyond Static Analysis
  • The compiler can dynamically enforce security policies (“data from an Internet socket should not specify a program to execute”)

  Original code:

      int sock;
      char buffer[100];
      sock = socket(AF_INET, SOCK_STREAM, 0);
      read(sock, buffer, 100);
      execl(buffer);

  Inserted code:

      int vs, vb;
      vs = Tainted;
      vb = Tainted;
      if (vb != Tainted) {
        execl(buffer);
      }

  25. Looking to the Future
  From a sequential app and library to a parallel app and library: the Broadway Compiler, guided by annotations, transforms a sequential library into a parallel library
  • Key questions:
  • What information is needed?
  • How do we express this information?
  • How should libraries be structured?

  26. What Information Is Needed?
  • Dependence information (e.g., FLAME pointers are well-behaved; which operations commute)
  • Hints on what to parallelize
  • Hints on granularity of parallelism
  • Machine information
  • Guidance for empirical tuning
  • ...

  27. Possible Test Cases
  • Parallelize van de Geijn’s FLAME library
  • Parallelize a graphics rendering engine
  • Support Galois-style optimistic dynamic parallelization [Pingali, et al.]

  28. Conclusions
  • Long-term vision: create tools and frameworks that help domain experts create parallel DSLs
  • Use Broadway to move us toward this goal
  • The Broadway approach works with existing libraries, new libraries, program generators, ...
  • Ignore issues of syntax; focus on parallelization issues
