
Genome Sequencing Center Site Report - HEPiX Fall 2007

This report provides an overview of the genome sequencing center's data flow and storage system, including information on instrument-specific raw data, DNA sample tracking, data collection, database growth, storage growth, data processing, and recent changes.


Presentation Transcript


  1. Genome Sequencing Center Site Report - HEPiX Fall 2007 Gary Stiehr garystiehr@wustl.edu

  2. General Data Flow: instrument-specific raw data (various formats) -> attached computer system or cluster -> our disk arrays -> various disk arrays and the cluster for analysis. garystiehr@wustl.edu

  3. Sample Tracking: DNA sample preparation and movement is carefully tracked. 75+ Debian Linux systems with touch screens and barcode scanners for lab technician input. OLTP schema on Oracle 10g RAC running across four Infiniband-connected Sun X4100 servers, each with four cores and 16 GB of memory, using a 15 TB NetApp FAS980. (A minimal sketch of recording one scan event follows below.) garystiehr@wustl.edu
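To make the tracking flow concrete, here is a minimal sketch of recording one barcode-scan event as a sample moves between prep steps. It uses SQLite purely as a stand-in for the production OLTP schema on Oracle 10g RAC; the table, column, and station names are hypothetical, not the GSC's actual schema.

```python
# Minimal sketch of a barcode-scan tracking event, using SQLite as a
# stand-in for the production Oracle 10g RAC OLTP schema.  Table and
# column names here are hypothetical, not the GSC's actual schema.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE sample_event (
           barcode      TEXT NOT NULL,   -- scanned sample barcode
           station_id   TEXT NOT NULL,   -- touch-screen Linux station
           step         TEXT NOT NULL,   -- prep/movement step name
           scanned_at   TEXT NOT NULL    -- UTC timestamp
       )"""
)

def record_scan(barcode: str, station_id: str, step: str) -> None:
    """Insert one scan event as a technician moves a sample between steps."""
    conn.execute(
        "INSERT INTO sample_event VALUES (?, ?, ?, ?)",
        (barcode, station_id, step, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

record_scan("SAMP-000123", "lab-station-42", "library-prep")
```

Each touch-screen station would issue one such insert per scan, giving a timestamped audit trail for every sample.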

  4. Data Collection: The bulk of our data comes from DNA sequencing instruments scattered throughout the lab. In most cases, the data's first stop is a locally attached, vendor-provided system or cluster (235+ Windows systems, some Linux). Previously, data produced by the sequencers was stored mostly in Oracle databases. With newer sequencers, we store only tracking data in Oracle; raw data is kept on the file system. garystiehr@wustl.edu

  5. Database Growth: DW currently 15.3 TB, growing 360 GB per month. OLTP currently 1.6 TB, growing 27 GB per month. OLAP currently 183 GB, growing 7.5 GB per month. (A simple 12-month projection follows below.) garystiehr@wustl.edu
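As a quick illustration of what these rates mean, the snippet below projects each database forward 12 months assuming the quoted per-month growth stays constant; it is a back-of-the-envelope straight-line extrapolation, not a capacity plan.

```python
# Straight-line 12-month projection from the figures on this slide.
# Assumes growth stays linear at the quoted per-month rates.
rates_gb = {            # (current size in GB, growth in GB/month)
    "DW":   (15.3 * 1024, 360.0),
    "OLTP": (1.6 * 1024, 27.0),
    "OLAP": (183.0, 7.5),
}

months = 12
for name, (current, per_month) in rates_gb.items():
    projected = current + months * per_month
    print(f"{name}: {current / 1024:.2f} TB now -> {projected / 1024:.2f} TB in {months} months")
```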

  6. Database Servers: Over the last year, we have been migrating database instances to Sun X4100 servers (16 GB RAM, two dual-core Opteron 285 processors). Oracle 10g RAC is used for DW (2 nodes) and OLTP (4 nodes), connected via Infiniband. The servers run Red Hat to stay within the Oracle and Cisco support matrices. garystiehr@wustl.edu

  7. Incoming Data Growth: 5000% increase since last year, counting only production analysis (i.e., excluding user analysis). garystiehr@wustl.edu

  8. Storage Growth: storage space available for production data storage and archiving. garystiehr@wustl.edu

  9. Current Storage: Currently utilizing NetApp and BlueArc as NAS. Older SAN infrastructure utilizes EMC, Hitachi, and StorageTek. Two 700-slot StorageTek L700 tape libraries with SDLT drives (on their way out) and T10K drives (may test T10K-B drives). NetBackup used for backups. garystiehr@wustl.edu

  10. Data Processing: Some sequencers ship with software to run on in-house clusters, which we need to customize to fit the local environment: Makefile-based parallelism with many small independent jobs, where finding the right granularity matters (see the sketch below). Other sequencers ship with a four- or five-node cluster and tens of TB of disk, raising power/cooling issues when multiple instruments are installed. garystiehr@wustl.edu
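The sketch below illustrates the fan-out/granularity idea in Python rather than in the vendor Makefiles: the work is split into many small independent jobs, and the chunk size is the granularity knob. process_chunk is a hypothetical stand-in for one unit of analysis work, not the actual pipeline code.

```python
# Sketch of the "many small independent jobs" pattern: the same fan-out the
# vendor Makefiles express with independent targets, here as a process pool.
from multiprocessing import Pool

def process_chunk(chunk):
    """Stand-in for one independent analysis job (e.g., one batch of reads)."""
    return sum(len(read) for read in chunk)   # trivial placeholder work

def chunks(items, size):
    """Split the work list into jobs of `size` items each."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

if __name__ == "__main__":
    reads = ["ACGT" * 9] * 10_000          # fake input data
    granularity = 250                      # the knob: items per job
    with Pool() as pool:
        results = pool.map(process_chunk, chunks(reads, granularity))
    print(f"{len(results)} jobs, total bases = {sum(results)}")
```

Too small a chunk and scheduling overhead dominates; too large and a few long jobs leave the rest of the cluster idle.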

  11. Processing Capacity: Platform LSF HPC manages the compute nodes (a sample job submission sketch follows below). garystiehr@wustl.edu
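For context, here is a minimal sketch of handing such independent jobs to LSF from a wrapper script. The queue name, job names, and command are placeholders rather than the GSC's actual configuration, and only basic bsub options (-J, -q, -o) are used.

```python
# Hedged sketch: submit independent analysis jobs to Platform LSF via bsub.
# Queue, job names, and the analyze_tile.sh command are placeholders.
import shlex
import subprocess

def submit(job_name: str, command: str, queue: str = "normal") -> None:
    """Wrap a shell command in a bsub submission (-J name, -q queue, -o log)."""
    subprocess.run(
        ["bsub", "-J", job_name, "-q", queue, "-o", f"{job_name}.out"]
        + shlex.split(command),
        check=True,
    )

for tile in range(1, 5):
    submit(f"analysis_tile_{tile}", f"./analyze_tile.sh {tile}")
```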

  12. Large Memory Systems: 96 GB of memory, four Itanium processors. Large datasets are utilizing more memory, and new applications are asking for even more memory in some cases. garystiehr@wustl.edu

  13. Large Memory Systems (continued): Looking at Sun X4600 servers: 8 server boards each, currently with one dual-core Opteron processor and 64 GB each. Configuration: 16 Opteron cores, 256 GB of memory, four 146 GB SAS drives. At full CPU utilization, Sun's Power Calculator puts the draw at 1.14 kW (around 9.5 A @ 120 V); a quick check of that current figure follows below. garystiehr@wustl.edu
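The current figure follows directly from the quoted power draw; the tiny check below assumes the simple relation I = P / V and ignores power factor.

```python
# Sanity check of the figure quoted from Sun's Power Calculator:
# current (A) = power (W) / voltage (V), power factor ignored.
power_w = 1140.0      # 1.14 kW at full CPU utilization
voltage_v = 120.0
print(f"{power_w / voltage_v:.1f} A")   # ~9.5 A, matching the slide
```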

  14. Recent Changes: Wider deployment of Ganglia for monitoring. Migration to LDAP-based authentication (from NIS); a sketch of the client-side lookup this change replaces follows below. Out of physical space and cooling in the current data center (as well as weight limitations). Creating a disaster recovery environment at Washington University School of Medicine's Business Continuity Center. Nightly offsite backups of Oracle databases. garystiehr@wustl.edu
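As an illustration of the NIS-to-LDAP change on the client side, the sketch below performs the directory lookup that roughly replaces an NIS `ypmatch <user> passwd` query, using the python-ldap module. The server URI, search base, and username are hypothetical placeholders, not the GSC's directory.

```python
# Hedged sketch: look up a user's posixAccount entry over LDAP, roughly the
# client-side equivalent of the old NIS `ypmatch <user> passwd` lookup.
# Server URI, search base, and uid are placeholders.
import ldap  # python-ldap

conn = ldap.initialize("ldap://ldap.example.edu")
conn.simple_bind_s()  # anonymous bind for a read-only lookup

results = conn.search_s(
    "ou=People,dc=example,dc=edu",        # hypothetical search base
    ldap.SCOPE_SUBTREE,
    "(&(objectClass=posixAccount)(uid=jdoe))",
    ["uid", "uidNumber", "gidNumber", "homeDirectory", "loginShell"],
)
for dn, attrs in results:
    print(dn, attrs)
```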

  15. Recent Changes (continued): Polycom 9101 video conferencing system being installed this week. For internal documentation and collaboration, migrated from PHPwiki to MediaWiki. Upgraded Linux systems from Debian Sarge to Debian Etch. Wireless network migrated to Cisco 4402 wireless LAN controllers (from a previous set of individually managed APs). garystiehr@wustl.edu

  16. Recent Changes (continued): The Division of Statistical Genomics is evaluating SASGrid; the GSC is preparing to install and manage the DSG's purchase of 400 cores and 50 TB of disk to support new studies. New data center construction has started and is to be completed in May 2008 (more on this Wednesday). garystiehr@wustl.edu
