
SCICOMP, IBM, and TACC: Then, Now, and Next

Jay Boisseau, Director, Texas Advanced Computing Center, The University of Texas at Austin. August 10, 2004.


Presentation Transcript


  1. SCICOMP, IBM, and TACC: Then, Now, and Next. Jay Boisseau, Director, Texas Advanced Computing Center, The University of Texas at Austin. August 10, 2004

  2. Precautions • This presentation contains some historical recollections from over 5 years ago. I can’t usually recall what I had for lunch yesterday. • This presentation contains some ideas on where I think things might be going next. If I can’t recall yesterday’s lunch, it seems unlikely that I can predict anything. • This presentation contains many tongue-in-cheek observations, exaggerations for dramatic effect, etc. • This presentation may cause boredom, drowsiness, nausea, or hunger.

  3. Outline • Why Did We Create SCICOMP 5 Years Ago? • What Did I Do with My Summer (and the Previous 3 Years)? • What is TACC Doing Now with IBM? • Where Are We Now? Where Are We Going?

  4. Why Did We Create SCICOMP 5 Years Ago?

  5. The Dark Ages of HPC • In the late 1990s, most supercomputing was done on proprietary systems from IBM, HP, SGI (including Cray), etc. • User environments were not very friendly • Limited development environments (debuggers, optimization tools, etc.) • Very few cross-platform tools • Difficult programming models (MPI, OpenMP… some things haven't changed)

  6. Missing Cray Research… • Cray was no longer the dominant company, and it showed • The trend towards commoditization had begun • Systems were not balanced • Cray T3Es were used longer than any other production MPP • HPC software was limited and less reliable • Who doesn't miss real checkpoint/restart, automatic performance monitoring, no weekly PM downtime, etc.? • Companies were not as focused on HPC/research customers as on larger markets

  7. 1998-99: Making Things Better • John Levesque was hired by IBM to start the Advanced Computing Technology Center (ACTC) • Goal: ACTC should provide to customers what Cray Research used to provide • Jay Boisseau became the first Associate Director of Scientific Computing at SDSC • Goal: ensure SDSC helped users migrate from the Cray T3E to the IBM SP and do important, effective computational research

  8. Creating SCICOMP • John and Jay hosted a workshop at SDSC in March 1999, open to users and center staff, to discuss the current state, issues, techniques, and results in using IBM systems for HPC • SP-XXL already existed, but was exclusive and more systems-oriented • Its success led to the first IBM SP Scientific Computing User Group meeting (SCICOMP) in August 1999 in Yorktown Heights, with Jay as first director • The second meeting was held in early 2000 at SDSC • In late 2000, John & Jay invited international participation in SCICOMP at the IBM ACTC workshop in Paris

  9. What Did I Do with My Summer (and the Previous 3 Years)?

  10. Moving to TACC? • In 2001, I accepted the job as director of TACC • Major rebuilding task: • Only 14 staff • No R&D programs • Outdated HPC systems • No visualization, grid computing, or data-intensive computing • Little funding • Not much profile • Past political issues

  13. Moving to TACC! • TEXAS-SIZED opportunities • Talented key staff in HPC, systems, and operations • Space for growth • IBM Austin across the street • Almost every other major HPC vendor has a large presence in Austin • UT Austin has both quality and scale in sciences, engineering, CS • UT and Texas have unparalleled internal & external support (pride is not always a vice) • Austin is a fantastic place to live (and recruit) • I got the chance to build something else good and important

  14. TACC Mission To enhance the research & education programs of The University of Texas at Austin and its partners through research, development, operation & support of advanced computing technologies.

  15. TACC Strategy To accomplish this mission, TACC: • Evaluates, acquires & operates advanced computing systems • Provides training, consulting, and documentation to users • Collaborates with researchers to apply advanced computing techniques • Conducts research & development to produce new computational technologies [Slide graphic labels: Resources & Services; Research & Development]

  19. TACC Advanced Computing Technology Areas • High Performance Computing (HPC): numerically intensive computing that produces data • Scientific Visualization (SciVis): rendering data into information & knowledge • Data & Information Systems (DIS): managing and analyzing data for information & knowledge • Distributed and Grid Computing (DGC): integrating diverse resources, data, and people to produce and share knowledge

  20. TACC Activities & Scope [Slide graphic labels: Since 2001! Since 1986]

  21. TACC Applications Focus Areas • TACC advanced computing technology R&D must be driven by applications • TACC Applications Focus Areas • Chemistry -> Biosciences • Climate/Weather/Ocean -> Geosciences • CFD

  22. TACC HPC & Storage Systems
  • LONESTAR: Cray-Dell Xeon Linux cluster, 1028 CPUs (6.3 Tflops), 1 TB memory, 40+ TB disk
  • LONGHORN: IBM Power4 system, 224 CPUs (1.16 Tflops), ½ TB memory, 7.1 TB disk
  • TEJAS: IBM Linux Pentium III cluster, 64 CPUs (64 Gflops), 32 GB memory, ~1 TB disk
  • ARCHIVE: STK PowderHorns (2), 2.8 PB max capacity, managed by Cray DMF
  • SAN: Sun SANs (2), 8 TB / 4 TB, to be expanded

  23. ACES VisLab • Front and Rear Projection Systems • 3x1 cylindrical immersive environment, 24’ diameter • 5x2 large-screen, 16:9 panel tiled display • Full immersive capabilities with head/motion tracking • High end rendering systems • Sun E25K: 128 processors, ½ TB memory, > 3 Gpoly/sec • SGI Onyx2: 24 CPUs, 6 IR2 Graphics Pipes, 25 GB Memory • Matrix switch between systems, projectors, rooms

  24. TACC Services • TACC resources and services include: • Consulting • Training • Technical documentation • Data storage/archival • System selection/configuration consulting • System hosting

  25. TACC R&D – High Performance Computing • Scalability, performance optimization, and performance modeling for HPC applications • Evaluation of cluster technologies for HPC • Portability and performance issues of applications on clusters • Climate, weather, and ocean modeling collaboration with, and support of, DoD • Starting CFD activities

  26. TACC R&D – Scientific Visualization • Feature detection / terascale data analysis • Evaluation of performance characteristics and capabilities of high-end visualization technologies • Hardware accelerated visualization and computation on GPUs • Remote interactive visualization / grid-enabled interactive visualization

  27. TACC R&D – Data & Information Systems • Newest technology group at TACC • Initial R&D focused on creating/hosting scientific data collections • Interests / plans • Geospatial and biological database extensions • Efficient ways to collect/create metadata • DB clusters / parallel DB I/O for scientific data

  28. TACC R&D – Distributed & Grid Computing • Web-based grid portals • Grid resource data collection and information services • Grid scheduling and workflow • Grid-enabled visualization • Grid-enabled data collection hosting • Overall grid deployment and integration

  29. TACC R&D - Networking • Very new activities: • Exploring high-bandwidth (OC-12, GigE, OC-48, OC-192) remote and collaborative grid-enabled visualization • Exploring network performance for moving terascale data on 10 Gbps networks (TeraGrid) • Exploring GigE aggregation to fill 10 Gbps networks (parallel file I/O, parallel database I/O) • Recruiting a leader for TACC networking R&D activities
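The aggregation idea above can be sanity-checked with simple arithmetic: how many GigE streams does it take to fill a 10 Gbps pipe at a given per-stream efficiency? The sketch below is illustrative only; the efficiency figure is an assumption, not a TACC measurement.

```python
# Back-of-the-envelope check of GigE aggregation: streams needed to
# saturate a target link, given per-stream rate and efficiency.
# The 0.8 efficiency default is an illustrative assumption.
import math

def streams_needed(target_gbps=10.0, per_stream_gbps=1.0, efficiency=0.8):
    """Number of parallel streams required to fill the target link."""
    return math.ceil(target_gbps / (per_stream_gbps * efficiency))
```

At an assumed 80% per-stream efficiency, filling a 10 Gbps link takes 13 GigE streams rather than 10, which is why parallel file and database I/O matter for this goal.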

  30. TACC Growth • New infrastructure provides UT with comprehensive, balanced, world-class resources: • 50x HPC capability • 20x archival capability • 10x network capability • World-class VisLab • New SAN • New comprehensive R&D program with focus on impact • Activities in HPC, SciVis, DIS, DGC • New opportunities for professional staff • 40+ new, wonderful people in 3 years, adding to the excellent core of talented people who have been at TACC for many years

  31. Summary of My Time with TACC Over the Past 3 Years • TACC provides terascale HPC, SciVis, storage, data collection, and network resources • TACC provides expert support services: consulting, documentation, and training in HPC, SciVis, and Grid • TACC conducts applied research & development in these advanced computing technologies • TACC has become one of the leading academic advanced computing centers in 3 years • I have the best job in the world, mainly because I have the best staff in the world (but also because of UT and Austin)

  32. And one other thing kept me busy the past 3 years…

  33. What is TACC Doing Now with IBM?

  34. UT Grid: Enable Campus-wide Terascale Distributed Computing • Vision: provide high-end systems, but move from ‘island’ to hub of campus computing continuum • provide models for local resources (clusters, vislabs, etc.), training, and documentation • develop procedures for connecting local systems to campus grid • single sign-on, data space, compute space • leverage every PC, cluster, NAS, etc. on campus! • integrate digital assets into campus grid • integrate UT instruments & sensors into campus grid • Joint project with IBM

  35. Building a Grid Together • UT Grid: joint between UT and IBM • TACC wants to be a leader in e-science • IBM is a leader in e-business • UT Grid enables both to • Gain deployment experience (IBM Global Services) • Have an R&D testbed • Deliverables/Benefits • Deployment experience • Grid Zone papers • Other papers

  36. UT Grid: Initial Focus on Computing • High-throughput parallel computing • Project Rodeo • Use CSF to schedule to LSF, PBS, SGE clusters across campus • Use Globus 3.2 -> GT4 • High-throughput serial computing • Project Roundup uses United Devices software on campus PCs • Also interfacing to Condor flock in CS department

  37. UT Grid: Initial Focus on Computing • Develop CSF adapters for popular resource management systems through collaboration: • LSF: done by Platform Computing • Globus: done by Platform Computing • PBS: partially done • SGE • LoadLeveler • Condor
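The adapter approach on this slide amounts to giving the metascheduler one common interface per local resource manager (LSF, PBS, SGE, LoadLeveler, Condor). The sketch below is a hypothetical illustration of that pattern, not the actual CSF plugin API; all class and method names are invented.

```python
# Hypothetical sketch of the metascheduler-adapter pattern: CSF-style
# code talks to every local resource manager through one interface.
# Names are illustrative, not real CSF or Platform Computing APIs.

class SchedulerAdapter:
    """Common interface a metascheduler could use for any local RM."""
    def submit(self, job_script):
        """Submit a job; return a local job id."""
        raise NotImplementedError
    def status(self, job_id):
        """Return a coarse state, e.g. 'queued', 'running', 'done'."""
        raise NotImplementedError

class PBSAdapter(SchedulerAdapter):
    """Stand-in for an adapter that would shell out to qsub/qstat."""
    def submit(self, job_script):
        return "pbs-" + str(hash(job_script) % 10000)  # placeholder id
    def status(self, job_id):
        return "queued"  # placeholder for parsed qstat output
```

The point of the pattern is that adding SGE or LoadLeveler support means writing one more adapter class, with no change to the scheduling logic above it.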

  38. UT Grid: Initial Focus on Computing • Develop CSF capability for flexible job requirements: • Serial vs parallel: no diff, just specify Ncpus • Number: facilitate ensembles • Batch: whenever, or by priority • Advanced reservation: needed for coupling, interactive • On-demand: needed for urgency • Integrate data management for jobs into CSF • SAN makes it easy • GridFTP is somewhat simple, if crude • Avaki Data Grid is a possibility
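The unified job description above (serial vs. parallel collapsing into an Ncpus count, plus a scheduling mode and staged data) can be pictured as a simple record. This is an illustrative sketch only; the field names are invented, not actual CSF syntax, and the GridFTP URL is a placeholder.

```python
# Illustrative sketch (not real CSF job syntax) of the flexible job
# requirements described above: ncpus unifies serial and parallel,
# and "mode" selects batch, advance reservation, or on-demand.

job = {
    "ncpus": 64,                 # 1 => serial; >1 => parallel
    "ensemble_count": 10,        # member jobs to launch, for ensembles
    "mode": "reservation",       # "batch" | "reservation" | "on_demand"
    "start_not_before": "2004-08-11T09:00:00",   # reservations only
    "stage_in": ["gridftp://storage.example/input.dat"],  # placeholder URL
}

def is_parallel(job):
    """Serial vs. parallel is just a property of ncpus, not a job type."""
    return job["ncpus"] > 1
```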

  39. UT Grid: Initial Focus on Computing • Completion time in a compute grid is a function of • data transfer times • Use NWS for network bandwidth predictions, file transfer time predictions (Rich Wolski, UCSB) • queue wait times • Use new software from Wolski for prediction of start of execution in batch systems • application performance times • Use Prophesy (Valerie Taylor) for applications performance prediction • Develop CSF scheduling module that is data, network, and performance aware
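The completion-time model above is additive: predicted transfer time (NWS-style bandwidth prediction), plus predicted queue wait, plus predicted run time (Prophesy-style). A minimal sketch of how a data- and performance-aware scheduling module could rank resources, with invented names and illustrative numbers:

```python
# Hypothetical sketch of the completion-time model described above.
# A scheduler would feed in predictions (from tools like NWS and
# Prophesy) per candidate resource and pick the smallest total.

def predicted_completion_time(data_bytes, bandwidth_bps, queue_wait_s, run_time_s):
    """Total predicted time on one candidate resource, in seconds."""
    transfer_s = (data_bytes * 8) / bandwidth_bps  # bytes -> bits / (bits/s)
    return transfer_s + queue_wait_s + run_time_s

# Illustrative candidates: a fast-network/busy cluster vs. a
# slow-network/idle one. All numbers are made up.
candidates = {
    "cluster_a": predicted_completion_time(10e9, 1e9, 600.0, 3600.0),
    "cluster_b": predicted_completion_time(10e9, 100e6, 60.0, 3000.0),
}
best = min(candidates, key=candidates.get)
```

Note the trade-off the model captures: the cluster with the slower network can still win when its queue is shorter, which is exactly why the scheduling module must weigh all three terms together.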

  40. UT Grid: Full Service! • UT Grid will offer a complete set of services: • Compute services • Storage services • Data collections services • Visualization services • Instruments services • But this will take 2 years—focusing on compute services now

  41. UT Grid Interfaces • Grid User Portal • Hosted, built on GridPort • Augment developers by providing info services • Enable productivity by simplifying production usage • Grid User Node • Hosted, software includes GridShell plus client versions of all other UT Grid software • Downloadable version enables configuring a local Linux box into the UT Grid (eventually, Windows and Mac)

  42. UT Grid: Logical View • Integrate distributed TACC resources first (Globus, LSF, NWS, SRB, United Devices, GridPort) [Diagram: TACC HPC, Vis, Storage (actually spread across two campuses)]

  43. UT Grid: Logical View • Next add other UT resources in one bldg. as a spoke using the same tools and procedures [Diagram: TACC HPC, Vis, Storage; ICES Data; ICES Cluster; ICES Cluster]

  44. UT Grid: Logical View • Next add other UT resources in one bldg. as a spoke using the same tools and procedures [Diagram: PGE Data; PGE Cluster; PGE Cluster; TACC HPC, Vis, Storage; ICES Cluster; ICES Cluster; ICES Cluster]

  45. UT Grid: Logical View • Next add other UT resources in one bldg. as a spoke using the same tools and procedures [Diagram: PGE Data; BIO Instrument; BIO Cluster; PGE Cluster; GEO Data; TACC HPC, Vis, Storage; PGE Cluster; GEO Instrument; ICES Cluster; ICES Cluster; ICES Cluster]

  46. UT Grid: Logical View • Finally negotiate connections between spokes for willing participants to develop a P2P grid. [Diagram: PGE Data; BIO Instrument; BIO Cluster; PGE Cluster; GEO Data; TACC HPC, Vis, Storage; PGE Cluster; GEO Instrument; ICES Data; ICES Cluster; ICES Cluster]

  47. UT Grid: Physical View (TACC Systems) [Diagram: Ext nets; Research campus NOC; GAATN; CMS NOC; Switch; TACC Storage; TACC PWR4; ACES; TACC Cluster; Switch; TACC Vis; Main campus]

  48. UT Grid: Physical View (Add ICES Resources) [Diagram: Ext nets; Research campus NOC; GAATN; CMS NOC; Switch; TACC Storage; TACC PWR4; ACES; TACC Cluster; Switch; ICES Cluster; TACC Vis; ICES Data; ICES Cluster; Main campus]

  49. UT Grid: Physical View (Add Other Resources) [Diagram: Ext nets; Research campus NOC; GAATN; CMS NOC; Switch; TACC Storage; PGE; TACC PWR4; ACES; TACC Cluster; Switch; ICES Cluster; PGE Cluster; Switch; TACC Vis; ICES Data; PGE Cluster; PGE Data; ICES Cluster; Main campus]

  50. Texas Internet Grid for Research & Education (TIGRE) • Multi-university grid: Texas, A&M, Houston, Rice, Texas Tech • Build-out in 2004-5 • Will integrate additional universities • Will facilitate academic research capabilities across Texas using Internet2 initially • Will extend to industrial partners to foster academic/industrial collaboration on R&D
