
Data Intensive Computing at Sandia

Data Intensive Computing at Sandia. September 15, 2010 Andy Wilson Senior Member of Technical Staff Data Analysis and Visualization Sandia National Laboratories.



Presentation Transcript


  1. Data Intensive Computing at Sandia September 15, 2010 Andy Wilson Senior Member of Technical Staff Data Analysis and Visualization Sandia National Laboratories Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

  2. The Question What is Data-Intensive Computing?

  3. My Answer: What is Data-Intensive Computing? Parallel computing where you design your algorithms and your software around efficient access and traversal of a data set, and where hardware requirements are dictated by data size as much as by desired run times. Usually this means distilling compact results from massive data.

  4. Outline • What is Data-Intensive Computing? • Data-Intensive Computing at Sandia • Physics • Informatics • Architectures • Into the Future

  5. Spaghetti Plot (2)

  6. Traditional Visualization Workflow Solver Full Mesh Disk Storage Visualization

  7. Traditional In-Situ Visualization Solver Solver Visualization Full Mesh Disk Storage Images Disk Storage Visualization

  8. Coprocessing Solver Solver Solver Features & Statistics Visualization Full Mesh Disk Storage Salient Data Images Disk Storage Disk Storage Visualization Visualization
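The "Features & Statistics" branch of the coprocessing workflow on slide 8 can be illustrated with a one-pass accumulator that a solver could update each time step instead of writing the full mesh to disk. This is a minimal sketch using Welford's running mean and variance algorithm; the class name `RunningStats` and its fields are hypothetical and are not taken from the presentation or from any Sandia code.

```python
import math


class RunningStats:
    """One-pass summary of a stream of values (Welford's algorithm).

    Hypothetical illustration: only a handful of numbers are kept,
    no matter how many values pass through update().
    """

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, value):
        """Fold one new value into the summary."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.count if self.count else 0.0


stats = RunningStats()
for cell_value in [1.0, 2.0, 3.0, 4.0]:  # stand-in for one field on a mesh
    stats.update(cell_value)
```

The point of the design is that the stored state is constant in size, which matches the "distilling compact results from massive data" description on slide 3.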

  9. Collision Movie

  10. Outline • What is Data-Intensive Computing? • Data-Intensive Computing at Sandia • Physics • Informatics • Architectures • Into the Future

  11. Community Detection in Networks • Find many small groups of vertices and/or edges • O(n) communities • overlaps may be allowed • Hundreds of papers in physics and computer science Lancichinetti, Fortunato, Radicchi 2008

  12. Analysis of Massive Graphs • Finding communities: a kernel of social network analysis • “Dunbar’s number” from sociology: there is a size limit (~150) on stable social group size (from Neolithic farming village to academic sub-discipline) Figure: Twitter social network (|V|≈200M) [Akshay Java, 2007]

  13. Collapsed Dendrograms and Statistical Confidence: wCNM The wCNM partitioning is much deeper, resolving smaller communities The statistically significant variation is visually close, but does not reproduce ground truth as well The (much better) wCNM solution also has a statistically significant variation. Image credit: Titan
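The community-detection slides (11 to 13) can be made concrete with the quantity that CNM-style methods score partitions by. CNM (Clauset-Newman-Moore) is a greedy agglomerative method that merges communities to increase modularity; the sketch below assumes that wCNM is a weighted variant of it, which the slides do not spell out, and it only evaluates unweighted modularity for a given partition rather than implementing either algorithm. The function name `modularity` and the toy graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges        -- list of (u, v) pairs, each undirected edge listed once
    community_of -- dict mapping every vertex to its community label
    """
    m = len(edges)
    intra = defaultdict(int)       # edges with both endpoints in community c
    degree_sum = defaultdict(int)  # total degree of the vertices in community c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Toy graph (an assumption for illustration): two triangles joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "all" for v in range(6)}
```

Splitting the toy graph into its two triangles scores higher than lumping every vertex together, which is the kind of comparison a greedy merge step makes at each level of the dendrogram.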

  14. LSA and LDA from 5 miles up (LDA) Image credit: Dave Robinson

  15. LSA/LDA: Increasing Data Size, Single Processor (Straight Line = Linear Scaling, Lower = Faster)

  16. LSA/LDA: Weak Scaling (Bigger Problem, Same Time). Flat Lines = Perfect Scaling
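The "flat lines = perfect scaling" reading of slide 16 corresponds to a weak-scaling efficiency of 1.0: the problem grows in proportion to the processor count, and the run time stays the same. A minimal sketch of that calculation follows. The timings are made-up placeholders, not measurements from the presentation, and the function name is hypothetical.

```python
def weak_scaling_efficiency(timings):
    """Return {processors: efficiency} for a weak-scaling study.

    timings -- dict mapping processor count to wall-clock seconds, where the
               problem size was grown in proportion to the processor count.
    Efficiency is the time on the smallest run divided by the time on each run,
    so a perfectly flat timing curve gives 1.0 everywhere.
    """
    base = timings[min(timings)]
    return {p: base / t for p, t in sorted(timings.items())}


# Placeholder numbers for illustration only.
example = weak_scaling_efficiency({1: 10.0, 4: 10.0, 16: 12.5})
```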

  17. Outline • What is Data-Intensive Computing? • Data-Intensive Computing at Sandia • Physics • Informatics • Architectures • Into the Future

  18. NGC System Diagram “This project seeks to bring these two strengths – a solid reputation for excellence in computing, and our niche expertise in specific classes of intelligence analysis – to bear on a thorny problem: developing advanced informatics capabilities that are both usable and useful to analysts who are drowning in data.” NGC project proposal Architectures Algorithms Data Web Services Applications (Clients) Titan, browser Trilinos Algebraic Methods Clustering, Ranking, High Dimensional Mapping Titan Analysis Pipelines, Capability Integration, Data Access, Lightweight analysis MTGL Graph Methods Subgraph searches, Connection sg’s, Shortest Path, etc. Titan Analysis Pipelines, Capability Integration, Data Access, Lightweight analysis Specialized Distributed Data Operations Highly optimized Iterative, flexible

  19. SQL Service: Enables Remote Access to Data Warehouse Appliances (DWA) SQL Service* • Provides “bridge” between parallel apps and external DWA • Runs on Red Storm network nodes • Titan applications communicate with service through Portals • External resources (Netezza) communicate through standard interfaces (e.g. ODBC over TCP/IP) Additional Modifications for Multilingual • Tokenization support on Netezza (goal is to count unique words) • Developed a custom UTF-8 word splitter for SPU (snippet processing unit) • Allows parallel tokenization and counting at storage device The SQL service enables an HPC application to access a remote DWA. Diagram labels: Analyst; HPC System (Red Storm); Compute Nodes (Titan Analysis Code); Service Nodes (GUI and Database Services); High-Speed Network (Portals); TCP/IP; SQL; DWA (Netezza, LexisNexis, Other ODBC DWA); locations: Anywhere, Tech Area 1, CSRI. * Results of SQL access from parallel statistics code presented at CUG’2009.
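The tokenize-and-count idea on slide 19 (split UTF-8 text into words on each storage unit, count locally, then combine) can be sketched in a few lines. This is an illustration of the pattern only: it does not use any Netezza or SPU interface, the function names are hypothetical, and splitting on Unicode word characters with `\w+` is an assumption standing in for the custom word splitter the slide mentions.

```python
import re
from collections import Counter

WORD = re.compile(r"\w+")  # Unicode-aware for str patterns in Python 3


def count_partition(rows):
    """Count words in one partition of UTF-8 encoded rows.

    Stands in for the work done next to the data on a single storage unit.
    """
    counts = Counter()
    for raw in rows:
        text = raw.decode("utf-8", errors="replace")
        counts.update(word.lower() for word in WORD.findall(text))
    return counts


def merge_counts(partials):
    """Combine per-partition counts into one global table."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Two toy partitions with mixed-language text (illustrative data only).
partitions = [
    ["data données".encode("utf-8"), b"Data"],
    ["données 数据".encode("utf-8")],
]
total = merge_counts(count_partition(rows) for rows in partitions)
unique_words = len(total)
```

Because each partition returns only a small table of counts, the traffic back to the compute nodes scales with the vocabulary size rather than with the raw text.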

  20. Outline • What is Data-Intensive Computing? • Data-Intensive Computing at Sandia • Physics • Informatics • Architectures • Into the Future

  21. Into the Future • I don’t care about flops anymore. I care about mops. • I want to send more complex requests to the storage system. • There is no one perfect architecture.
