Ten Years of Software Sustainability at The Infrared Processing and Analysis Center
G. Bruce Berriman and John Good, NASA Exoplanet Science Institute, Infrared Processing and Analysis Center, Caltech, USA
Ewa Deelman, Information Sciences Institute, University of Southern California, USA
Anastasia Alexov, Astronomical Institute Anton Pannekoek, Amsterdam, Netherlands
Presentation at AHM 2010, Cardiff, September 2010.
The Role of IPAC in Astronomy • Long-term archive • Curation of data • Dissemination to the community http://www.ipac.caltech.edu
Size and Usage Have Grown • Archives contain data from 30 missions and projects • Space-based, ground-based and knowledge-based • 85 million queries • 3 TB/month downloaded • Archives are built on a common hardware and software architecture
A Common Software Architecture • Applications are generally simple web forms or web services that search for data; the application itself is usually a CGI program • The "smarts" are on the server side, which optimizes complex queries on large data sets • Component-based architecture enables strong re-use and adaptation • Optimized for astronomical spatial searches and for complex, general queries, regardless of wavelength or type of mission • All services are integrated into the Infrared Science Information System (ISIS) • Each component is a module with a standard interface that communicates with other components and fulfills one general function • Modules are stand-alone, portable ANSI-C tools • Components are generic, with minimal dependencies on third-party software or environments; shared memory and system calls are avoided, and all database queries are performed in one module • Components are plugged together and controlled by an executive library; the executive starts components as child processes and parses their return values (see the sketch below) • About 300 KLOC in total • New projects automatically inherit functionality, which supports efficient development and controls maintenance costs
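To make the executive pattern concrete, here is a minimal ANSI-C sketch of starting a component as a child process and parsing a one-line structured return value. The module name mExampleSearch, its arguments, and the exact return format are illustrative assumptions, not the actual ISIS executive API.

/*
 * Sketch: run a component as a child process and parse its
 * structured return string. Module name and return format
 * are assumptions for illustration only.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[1024];

    /* Start the component as a child process; it writes its status
       and results as a single structured line on stdout. */
    FILE *child = popen("./mExampleSearch -ra 180.0 -dec 45.0", "r");
    if (child == NULL) {
        fprintf(stderr, "[struct stat=\"ERROR\", msg=\"cannot start child\"]\n");
        return 1;
    }

    /* Read and parse the return value, e.g. [struct stat="OK", count=42] */
    if (fgets(line, sizeof(line), child) != NULL) {
        if (strstr(line, "stat=\"OK\"") != NULL)
            printf("component succeeded: %s", line);
        else
            printf("component failed: %s", line);
    }

    pclose(child);
    return 0;
}

Because each module reports through the same stdout convention, the executive can plug components together without compile-time coupling, which is what makes the strong re-use described above practical.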
Engage Your Users! • Concerted program of user engagement to attract new users and build a user community • Number of end users has increased to 18,000 • 12% of peer-reviewed papers cited IPAC archives or data • Actively seek feedback, e.g. • Watch users as they try services; see where they get stuck • User surveys ask respondents to write down their views rather than simply answer set questions
Speed Is King In An Archive • Image data sets are becoming very large: the Spitzer Space Telescope will deliver over 100 million images, with varying footprints on the sky • Searches for spatially extended images are slow: a scan of Spitzer images can take 2,000 s • … and results pages are becoming more complex • What matters more: fast access or interactivity? Speed won hands down.
R-tree Indexing • Uses hierarchically nested minimum bounding boxes • Performance scales as log(N) • Performance gain of x1000 over a table scan • Memory-mapped files: a segment of virtual memory is assigned a byte-for-byte correlation with part of a file (see the sketch below) • Parallelization / cluster processing • REST-based web services
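As an illustration of the memory-mapped file technique, here is a minimal POSIX C sketch that maps an index file into memory so its records can be read as an ordinary array. The file name rtree.idx and the fixed-size node layout are assumptions for the example; the real R-tree index format differs.

/*
 * Sketch: memory-map an index file (POSIX mmap). File name and
 * record layout are illustrative assumptions.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

typedef struct {            /* hypothetical bounding-box node */
    double lon_min, lon_max;
    double lat_min, lat_max;
} Node;

int main(void)
{
    int fd = open("rtree.idx", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    /* Map the whole file; pages are faulted in on demand, so even a
       very large index costs little until it is actually touched. */
    Node *nodes = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (nodes == MAP_FAILED) { perror("mmap"); return 1; }

    size_t count = st.st_size / sizeof(Node);
    if (count > 0)
        printf("mapped %zu nodes; first box: %f..%f, %f..%f\n",
               count, nodes[0].lon_min, nodes[0].lon_max,
               nodes[0].lat_min, nodes[0].lat_max);

    munmap(nodes, st.st_size);
    close(fd);
    return 0;
}

Demand paging is why this approach scales: only the index pages a query actually visits are pulled into memory, rather than the whole file.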
Modernization of Scanpi • Written in 1983, Scanpi co-adds scans from the far-infrared IRAS survey; used in an average of 15 papers per year by 2007 • Sensitivity gain of x5 over the survey data products • Improves the spatial resolution of extended or confused sources • But it was a classic legacy program on its last legs • Written in Fortran 66, it had become a patchwork of scripts and bug fixes and was a maintenance nightmare • Dependent modules, e.g. for data compression, were no longer supported • Stranded on Solaris 2.8 • Its developer was retiring • A user panel strongly recommended modernization because of Scanpi's value in supporting interpretation of data from the current IR missions Spitzer and Herschel.
Scanpi Workflow • Input: source info → Get scans → Co-register scans → Background fitting → Source fitting → Co-add all scans → Output: results and files on the web • Re-usable components: bulk download, plotting, table manipulation, background fitting, coordinate transformation • Rewritten from the ground up in C • Developed as a workflow application that gives visibility into the processing steps • Calls existing components, reducing the code base to 21 KLOC cf. 102 KLOC • 1.25 FTE for development cf. 0.5 FTE for maintenance
The Montage Image Mosaic Engine • Montage (http://montage.ipac.caltech.edu) creates science-grade image mosaics from multiple input images • Scalable, modular design • ANSI-C code (300 MB) runs on all common *nix platforms: desktops, clusters, grids and supercomputers • Processes 40 million 2MASS pixels in 32 min on 128 nodes of a 1.2 GHz Linux cluster • Montage workflow: input images → reprojection → background rectification → co-addition → output mosaic (a sketch of driving these steps from C follows below) • Other workflow applications mentioned for comparison: Broadband simulates and compares seismograms from earthquake simulation codes; Epigenome maps short DNA segments collected using high-throughput gene sequencing machines to a reference genome
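The sketch below shows one way to drive the Montage pipeline from C by invoking its command-line modules as child processes. It assumes the Montage tools are on the PATH; the directory names, table names, and exact arguments shown are illustrative, and the full pipeline includes further steps (overlap analysis and background fitting) omitted here.

/*
 * Sketch: run Montage modules as child processes. Arguments and
 * file names are illustrative assumptions; background-rectification
 * steps are omitted for brevity.
 */
#include <stdio.h>
#include <stdlib.h>

static int run(const char *cmd)
{
    printf("running: %s\n", cmd);
    int status = system(cmd);
    if (status != 0)
        fprintf(stderr, "step failed (status %d): %s\n", status, cmd);
    return status;
}

int main(void)
{
    /* 1. Build a metadata table of the raw input images. */
    if (run("mImgtbl rawdir images.tbl") != 0) return 1;

    /* 2. Reproject every image to the footprint in template.hdr. */
    if (run("mProjExec -p rawdir images.tbl template.hdr projdir stats.tbl") != 0) return 1;

    /* 3. Co-add the reprojected images into the final mosaic. */
    if (run("mImgtbl projdir pimages.tbl") != 0) return 1;
    if (run("mAdd -p projdir pimages.tbl template.hdr mosaic.fits") != 0) return 1;

    return 0;
}

Because each stage is a stand-alone executable, the same steps can equally be expressed as a workflow for a grid or cloud scheduler, which is how Montage is used in the cyberinfrastructure studies described later.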
How Is It Used? • Science analysis • Support production of data sets, data products and preview products • Incorporation into workflows and pipelines, e.g. by Spitzer Space Telescope teams • Quality assurance of data products • 5,000 downloads by bona fide astronomers • Users now contribute to the project: scripts for generating mosaics, Python front ends, an MPI version • [Figure: contributed script by Dr. Inseok Song]
Development of Cyberinfrastructure • Task scheduling in distributed environments (performance-focused) • Designing job schedulers for the grid • Designing fault-tolerance techniques for job schedulers • Exploring issues of data provenance in scientific workflows • Exploring the applicability of scientific applications running on clouds • Developing high-performance workflow restructuring techniques • Developing application performance frameworks • Developing workflow orchestration techniques • [Figure: cost of running workflows on the Amazon EC2 cloud]
Best Practices for Software Sustainability • Design for sustainability, extensibility, re-use and portability • Build an engaged user community that encourages users to contribute to sustainability • Be careful about new technologies: do a cost-benefit analysis before adopting them • Use rigorous software engineering practices to ensure well-organized and well-documented code • Control and manage your interfaces • Make source code and test and validation data available