
UT Research Data Repository

Presentation Transcript


  1. UT Research Data Repository Chris Jordan UT Research Cyberinfrastructure Storage Committee Chair

  2. Outline • UTRC Introduction/Current Status • Research Data Requirements • Current TACC storage infrastructure (Corral) • New UTRC capabilities • External services and partnerships • Research and UTRC future

  3. UT Research Cyberinfrastructure • Collaborative effort initiated by Dr. Ken Shine, Vice Chancellor for Health • Jay Boisseau (TACC), Brian Herman (UTHSCSA) co-chairs • Assessment of research CI needs across system campuses • Data Storage emerged as highest priority/biggest unmet need

  4. UTRC Proposal • Approved by the UT Regents in November 2010 • Expand Lonestar 4 for HPC needs • Establish a dedicated 10Gb research network to all campuses • Develop a replicated, 5PB Research Data Repository

  5. Storage Committee Activities • Proposed iterative approach with pilot deployment in late 2011 • 1st half of 2011 spent on requirements and architecture development • Released RFP in June • Vendor selected in August • Installation in October • Initial users ~December

  6. Sidebar: Why “The Cloud” is not the answer • Cloud storage costs = $1000s/TB/year • Often not as reliable as advertised (Google and Amazon have both had major outages) • Restrictive interfaces, lack of high-performance access • Issues with institutional control, security integration, etc.

  7. Pilot UTRDR Deployment • 5PB raw storage in each of two installations • Main installation at TACC added to existing data infrastructure • Mirror installation at Arlington for replication • High level of redundancy within each installation, from power supplies to storage controllers and servers

  8. Research Data Requirements • Persistent Storage is just the beginning • High reliability/availability is key • Complex, evolving security needs • Importance of Collaboration • Data Applications and Services • Data Management and Analysis • Also, it has to be cheap (or free)

  9. Research Data Security • HIPAA Compliance is a major goal of the UTRDR effort • But HIPAA is just the beginning • Intellectual property and research confidentiality issues are more fine-grained • Long-term issues of availability/usability • Tiers of access, change over time

  10. Example Application Areas • Biology • Biodiversity (natural history collections) • Phylogenetics • Health Sciences • Medical Imaging • High-throughput sequencing • Social Sciences • Economic and social analysis

  11. TACC Corral Architecture • Emphasis on large-scale storage, highly flexible service infrastructure • Fast networks and heterogeneous systems = malleable service and storage platform • Allows integration of UTRC hardware into an existing infrastructure • Near-transparent migration for existing users • Expansion improves reliability and availability

  12. Corral Hardware and Services • 1.2 Petabytes of DataDirect SATA disk • 16 Dell Servers • ~300 TB of heterogeneous disks and servers • High-Performance Parallel File System, multiple databases, iRODS data management, replication to tape archive • Multiple levels of access control • Supports almost any imaginable data need

  13. iRODS at TACC • Distributed/Replicated data management • Corral, Ranch, and offsite storage systems • Extensible metadata support • Policy/Rule-based automation and enforcement • Used for sophisticated data management needs • Provides wide variety of interfaces
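
  As a concrete illustration of the extensible metadata and ingest interfaces mentioned above, the sketch below uses the python-irodsclient package to upload a file and tag it with key/value metadata. The hostname, zone, credentials, and paths are placeholders for illustration only, not TACC's actual Corral or UTRDR endpoints.

```python
# Minimal sketch (assumed setup): ingest a local file into an iRODS zone and
# attach searchable metadata via the python-irodsclient package. Hostname,
# zone, user, and paths are hypothetical placeholders.
from irods.session import iRODSSession

LOCAL_FILE = "scan_0042.nii"                               # hypothetical local file
IRODS_PATH = "/exampleZone/home/researcher/scan_0042.nii"  # hypothetical collection path

with iRODSSession(host="irods.example.edu", port=1247,
                  user="researcher", password="********",
                  zone="exampleZone") as session:
    # Upload (ingest) the file into the data grid
    session.data_objects.put(LOCAL_FILE, IRODS_PATH)

    # Attach extensible key/value metadata to the new data object
    obj = session.data_objects.get(IRODS_PATH)
    obj.metadata.add("project", "pilot-imaging")
    obj.metadata.add("modality", "MRI")
```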

  14. Current Corral Usage • >30 Data Allocations & Collections • 350 Users at TACC and UT • >500 External users accessing collections • >500TB Research and Reference Data • Data of all types and disciplines: • Plant specimens and ‘omics, MRI, GIS, Simulations, Fish and Pottery, Economics and Medicine

  15. Added Capabilities w/ UTRDR • Synchronous replication • Very high availability (weather, comet strikes) • Tiers of storage and data management • Huge performance boost (>80GB/sec) • Accessibility from all UT System campuses • HIPAA Compliance
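
  The replication itself happens in the storage layer, but an end-to-end fixity check is a simple way for a collection owner to confirm that a mirrored copy still matches the primary. The sketch below is purely illustrative and uses hypothetical mount paths; it does not describe UTRDR's internal verification mechanism.

```python
# Illustrative fixity check: compare SHA-256 checksums of a primary file and
# its mirrored copy. Paths are hypothetical; UTRDR's internal replica
# verification is not claimed to work this way.
import hashlib

def sha256sum(path, chunk_size=1024 * 1024):
    """Stream the file in chunks so arbitrarily large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

primary = "/primary/projects/demo/dataset.tar"   # hypothetical TACC-side path
mirror = "/mirror/projects/demo/dataset.tar"     # hypothetical Arlington-side path

if sha256sum(primary) == sha256sum(mirror):
    print("Replicas match")
else:
    print("Replica mismatch: investigate before trusting the mirror copy")
```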

  16. UTRDR Pilot Access • Accelerated access for early adopters • Allows us to shake out bugs, assess readiness for production • Helps to develop requirements present and future • Research network performance assessment • Expect to open to all UT System researchers early 2012

  17. UTRDR Long-term sustainability • After the pilot phase, storage will be free to all PIs up to some small limit (5TB?) • Additional storage will be available for a cost-recovery fee per TB • Currently only trying to recoup costs on an annual basis • Long-term preservation costs are TBD but are of major interest

  18. Fee-based Research Storage • 2 major types of service: • Simple storage (iSCSI, SCP/FTP) based on per-TB/year costs • Application services (databases, web applications, data management, etc.) • Provides fixed, relatively low costs that can be written into grant proposals • Can include both disk and tape + offsite storage • Long-term model for UTRDR
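
  For grant writers, the value of a fixed per-TB/year fee is that the storage line item becomes simple arithmetic. The sketch below shows that calculation; the free allocation echoes the 5TB figure floated on slide 17, and the per-TB rate is a made-up placeholder, not a published TACC price.

```python
# Back-of-the-envelope storage budget for a proposal. The rate below is a
# hypothetical placeholder, not TACC's actual cost-recovery fee.
FREE_ALLOCATION_TB = 5      # free tier suggested on slide 17 (tentative)
RATE_PER_TB_YEAR = 100.0    # hypothetical USD per TB per year

def storage_budget(total_tb, years=1):
    """Cost of storage beyond the free allocation over the project lifetime."""
    billable_tb = max(0, total_tb - FREE_ALLOCATION_TB)
    return billable_tb * RATE_PER_TB_YEAR * years

# Example: a 3-year project expecting 40 TB of imaging data
print(f"${storage_budget(40, years=3):,.2f}")  # (40 - 5) * 100 * 3 = $10,500.00
```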

  19. Existing/Upcoming Partnerships • University of Alaska • UC Berkeley • University of North Texas Libraries • Texas Digital Library • University of Florida • Indiana University • NSF XSEDE – 15 Institutions

  20. UTRC Plan 2012-2013 • Initial production in early 2012 • Design assessment and adjustment based on initial experiences • Expansion proposal mid-2012 • Significant expansion likely late 2012/early 2013 • Ongoing assessment and design adjustments integral to the process

  21. TACC Storage Research • Data upload and ingest processes • Storage reliability and management • Data Integrity/Long-term planning • Automated data management applications • Wide-area storage and replication efforts in the NSF XSEDE project
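
  As a toy example of the kind of automated data-management policy under study, the sketch below stages files untouched for 180 days from an active area toward an archive area. Paths and the age threshold are hypothetical, and in practice such a policy would more likely be expressed as iRODS rules than as a standalone script.

```python
# Toy age-based tiering policy: move files not accessed in 180 days from a
# hypothetical active tier to a hypothetical archive staging area.
import os
import shutil
import time

ACTIVE_TIER = "/data/projects/demo/active"     # hypothetical fast-disk area
ARCHIVE_TIER = "/data/projects/demo/archive"   # hypothetical archive staging area
AGE_LIMIT_SECONDS = 180 * 24 * 3600

now = time.time()
for name in os.listdir(ACTIVE_TIER):
    src = os.path.join(ACTIVE_TIER, name)
    if os.path.isfile(src) and now - os.path.getatime(src) > AGE_LIMIT_SECONDS:
        shutil.move(src, os.path.join(ARCHIVE_TIER, name))
        print(f"Archived {name}")
```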

  22. Acknowledgements • Dr. Ken Shine – UT System • Dr. Patricia Hurn – UT System • Jay Boisseau and Brian Herman • Jerry York – UTHSCSA • UTRC Storage Committee • Brian Grimm, Kevin Granhold, Huapei Chen, Wayne Mueller, Bill Sanns • And many, many others

  23. Q&A
