Data from Far and Wide: Finding IT, Managing IT, Using IT

Data from Far and Wide: Finding IT, Managing IT, Using IT. Professor Robert Hollebeek NSCP - University of Pennsylvania 7th International Conference on High Performance Computing, December 18, 2000 Bangalore, India. Outline. The importance of Data Intensive Computing Data and Medicine

Presentation Transcript


  1. Data from Far and Wide: Finding IT, Managing IT, Using IT Professor Robert Hollebeek NSCP - University of Pennsylvania 7th International Conference on High Performance Computing, December 18, 2000 Bangalore, India

  2. Outline • The importance of Data Intensive Computing • Data and Medicine • Data and Maps • Data Infrastructure Conclusions R. Hollebeek

  3. [Slide graphic: the word "data" repeated across the entire slide]

  4. Data Intensive Computing: Particularly Interesting (hard) when • Data comes from distributed sensors • is controlled or stored in distributed databases or caches • is secure or semi-private • is large scale (terabyte to petabyte) • is made of multi-component data R. Hollebeek

  5. Difficulty Increases with data diversity, size, speed requirements • Current Projects explore all three dimensions [diagram labels: Diversity and Complexity, Size, Speed, Govt Data, Medical Data, NSCP3-parallel hardware] R. Hollebeek

  6. The Power of Data Mining: Network Traffic on a 500 node LAN [plot axes: Source Computer, Destination Computer]

  7. The network data shown here contains a lot of information but, displayed this way, yields little insight or knowledge about the underlying activity. [plot axes: Source Node, Destination Node] R. Hollebeek

  8. NSCP BlockNess Algorithm: Once the data is rearranged, sorted, and clustered, we see that there are several major groups of processors with joint activities.
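
The slides do not describe how BlockNess works, so the following is only a minimal sketch of the general idea of rearranging a traffic matrix so that communicating groups become visible. It swaps in a plain connected-components grouping (union-find), which is not the NSCP algorithm; the function names and the 6-node matrix are hypothetical.

```python
def group_nodes(traffic):
    """Group nodes that exchange any traffic (connected components via union-find)."""
    n = len(traffic)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    for src in range(n):
        for dst in range(n):
            if traffic[src][dst] > 0:
                parent[find(src)] = find(dst)

    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    # Largest groups first, so the major blocks lead the reordered matrix.
    return sorted(groups.values(), key=lambda g: (-len(g), g[0]))


def reorder(traffic, groups):
    """Permute rows and columns so each group occupies a contiguous block."""
    order = [node for group in groups for node in group]
    return [[traffic[r][c] for c in order] for r in order]


# Hypothetical traffic counts: row = source node, column = destination node.
traffic = [
    [0, 0, 5, 0, 0, 0],
    [0, 0, 0, 7, 0, 2],
    [4, 0, 0, 0, 0, 0],
    [0, 3, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
]
groups = group_nodes(traffic)
blocked = reorder(traffic, groups)
for row in blocked:
    print(row)
```

After reordering, all nonzero entries sit in square blocks along the diagonal, one block per group. Real traffic on a 500 node LAN would need a clustering method that tolerates weak links between groups, which this sketch does not attempt.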

  9. Data Mining Prerequisites • Finding IT: Find Interesting Data • Data Intensive Applications • Social Science, Economics, Medicine, Science • Managing IT: Data Infrastructure and Data Organization • Parallel Storage above the Terabyte Level • Using IT: Finally you get to do Mining • Data Intensive -> Semi-automated R. Hollebeek

  10. Talk Will Highlight Examples of Data Intensive Applications from NSCP@PENN (http://nscp.upenn.edu) • NDMA: National Digital Mammography Archive • NIS-P: Neighborhood Information System for Philadelphia • Parallel Data Infrastructure: NSCP [side labels: Massive, Distributed, Secure, Diverse, Web enabled, Secure, Ultra high speeds for massive data] R. Hollebeek

  11. Outline - Data and Medicine • The importance of Data Intensive Computing • Data and Medicine • Finding IT • Managing IT • Using IT • Data and Maps • Data Infrastructure Conclusions R. Hollebeek

  12. Finding IT • Hospitals (X-rays, mammograms, MRI, CAT scans, endoscopies, …..) • Very large data sources - great clinical value to digital storage and manipulation and significant cost savings • 7,000 Gigabytes per hospital per year • dominated by digital images • Why we chose Mammography • clinical need for film recall • large volume (4,000 GB/year) • standards exist • great clinical value to this application R. Hollebeek

  13. Managing IT R. Hollebeek

  14. Major Components • Hospital Portal Systems • “RadAR” Large Scale Storage and Indexing • Network Infrastructure R. Hollebeek

  15. RadAR: NSCP@PENN • High capacity radiology storage developed by NSCP 1996-1999 • Radiology Active Repository R. Hollebeek

  16. RadAR Components: Large Disks, Parallel CPU, Control (MA R), Hi-speed Interconnect R. Hollebeek

  17. Large Disks RadAR MetaData MetaData R. Hollebeek

  18. RadAR Contents (not to scale): Large Disks, MetaData, Logs, Records, DICOM SR, BI-RADS, Images R. Hollebeek

  19. RadAR + Portals: Large Disks, Parallel CPU, Control (MA R), Images, MetaData, Logs, Records, Hi-speed Interconnect • Portal Systems at HUP, UNC, UC, SWH • MAP/MAQ • NDMA/NSCP R. Hollebeek

  20. Map - MA system portal: Hospital Network, VPN, Win 2000, Linux • Two Dual Processor IBM/Netfinity 5100 systems R. Hollebeek

  21. R. Hollebeek

  22. R. Hollebeek

  23. Portals + RadAR: Large Disks, Parallel CPU, Hospital Network, VPN, Control (MA R), Win 2000, Linux, Hi-speed Interconnect R. Hollebeek

  24. R. Hollebeek

  25. NSCP High Capacity Archive: 100 TB, million-record-per-day pilot system developed by NSCP and demonstrated at SC98 • RadAR R. Hollebeek

  26. NSCP – IBM/SP2 Hardware Components [diagram labels: Control, spcw, sp01, sp02, MAR, Serial Ports, High Performance Switch, ATM, Primary Node, Backup Primary Node, Disk Pool 1, Disk Pool 2; repeated "Data" labels omitted]

  27. [Diagram labels: Status, Node, sp03 (four times), Serial, HPS, ATM, Node (four times), Disk Pool (four times); repeated "Data" labels omitted]

  28. Lab Tour R. Hollebeek

  29. Scale of the Problem: Recent FDA approval, together with cost and other advantages of digital devices, will encourage conversion to digital radiology • 2000 Hospitals x 7 TB per year x 2 • 28 Petabytes per year • (1 Petabyte = 1 Million Gigabytes) • Pilot Problem scale in NDMA • 4 x 7 x 2 = 56 Terabytes / year R. Hollebeek
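
The arithmetic on this slide can be checked with a short sketch. The constant and function names are illustrative; the slide does not explain what the "x 2" factor stands for, so it is kept here as an unnamed multiplier.

```python
TB_PER_HOSPITAL_PER_YEAR = 7
FACTOR = 2          # the slide's "x 2"; its meaning is not stated on the slide
TB_PER_PB = 1000    # 1 PB = 1 million GB = 1,000 TB, as on the slide


def annual_volume_tb(hospitals):
    """Yearly volume in terabytes for a given number of hospitals."""
    return hospitals * TB_PER_HOSPITAL_PER_YEAR * FACTOR


national_pb = annual_volume_tb(2000) / TB_PER_PB
pilot_tb = annual_volume_tb(4)
print(f"National: {national_pb:g} PB/year")  # National: 28 PB/year
print(f"Pilot: {pilot_tb} TB/year")          # Pilot: 56 TB/year
```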

  30. Storage Hierarchy • 7 R @ 4,000 TB/yr • 20 A @ 100 TB/yr • 15 H (Hospital / Clinic) @ 7 TB/yr • Goal: Distribute Storage Load and Balance Network and Query Loads R. Hollebeek
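
One possible reading of these figures, offered as an assumption rather than as something the slide states, is a fan-out: 7 regional archives (R), each fed by 20 area archives (A), each fed by 15 hospitals (H). Under that reading the per-level loads roughly reproduce the slide's numbers:

```python
# Assumed fan-out; the slide lists only counts and per-unit volumes.
HOSPITAL_TB = 7
HOSPITALS_PER_AREA = 15
AREAS_PER_REGION = 20
REGIONS = 7

area_tb = HOSPITALS_PER_AREA * HOSPITAL_TB                   # load on one area archive
region_tb = AREAS_PER_REGION * area_tb                       # load on one regional archive
hospitals = REGIONS * AREAS_PER_REGION * HOSPITALS_PER_AREA  # hospitals covered

print(f"Area archive:     {area_tb} TB/yr")
print(f"Regional archive: {region_tb} TB/yr")
print(f"Hospitals:        {hospitals}")
```

The area figure (105 TB/yr) is close to the slide's 100 TB/yr, and the hospital count (2,100) is close to the 2000 hospitals on the Scale of the Problem slide. The slide's 4,000 TB/yr per regional archive is roughly twice the 2,100 TB/yr computed here, which would fit if that slide's "x 2" factor applies at the regional level; that too is an assumption.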

  31. Networks • 7 TB / yr in each hospital is ~2% of an OC3 • Typical T1 to DS-3 connections at clinics today are almost sufficient • Study size and transmission time to a remote reader are a more important constraint, requiring higher speeds • 1.5 Minutes at DS-3 • 2 sec at OC48 R. Hollebeek
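
The bandwidth figures above can be explored with a rough sketch. The line rates are the standard nominal rates for each link type; the 500 MB study size is an assumption chosen because it approximately reproduces the slide's 1.5 minutes at DS-3 (the slide does not state a study size). Decimal units are assumed, and protocol overhead is ignored.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

# Nominal line rates in megabits per second.
LINE_RATE_MBPS = {"T1": 1.544, "DS-3": 44.736, "OC3": 155.52, "OC48": 2488.32}


def average_mbps(tb_per_year):
    """Sustained rate needed to move a yearly volume, in Mbit/s."""
    bits = tb_per_year * 1e12 * 8
    return bits / SECONDS_PER_YEAR / 1e6


def transfer_seconds(size_mb, link):
    """Time to send size_mb megabytes over the named link at full line rate."""
    return size_mb * 8 / LINE_RATE_MBPS[link]


avg = average_mbps(7)
print(f"7 TB/yr averages {avg:.2f} Mbit/s "
      f"({100 * avg / LINE_RATE_MBPS['OC3']:.1f}% of an OC3)")

STUDY_MB = 500  # assumed study size, not given on the slide
for link in ("T1", "DS-3", "OC48"):
    print(f"{link:5s} {transfer_seconds(STUDY_MB, link):8.1f} s per study")
```

The average rate comes out slightly above a T1, consistent with "almost sufficient". The computed share of an OC3 depends on whether the "x 2" factor from the Scale of the Problem slide and protocol overhead are included, so the sketch prints the value rather than assuming the slide's ~2%.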

  32. NDMA • NSCP@Penn: • Digital Storage, Search and Retrieval • Oak Ridge National Lab: • Network (VPN) and Security • Hospitals of • University of Pennsylvania • University of Chicago • University of North Carolina • University of Toronto

  33. R. Hollebeek

  34. Large scale radiology testbed Regional and Area Archives (A) R. Hollebeek

  35. Layout matches growth pattern of national networks R. Hollebeek

  36. Portal Systems in the test lab at NSCP/PENN R. Hollebeek

  37. First Hospital portal systems being installed at the Hospital of the University of Pennsylvania

  38. Portal NDMA01 in place in the communications closet

  39. Construction of the remaining Portal systems R. Hollebeek

  40. Systems Undergoing network tests in the server room

  41. 1200 Gigabyte fast disk under test in a joint program with Lucent and CyberStorage Systems.

  42. Using IT • Store Records for retrieval • typical request would retrieve 3-4 yrs • Audit and log transmissions • Parse, Index and Store incoming information • Support Computer Assisted Diagnostics • Support Radiologist Training and Evaluation R. Hollebeek

  43. Training, Teaching, Evaluation R. Hollebeek

  44. R. Hollebeek

  45. Network and Data Security • Virtual Private Network • used to assure system security • User Authentication • password + token or biometric • Roles • Doctor, Administrator, Assistant, ... • Client Authorization • required for Medical Records
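
The authentication and role bullets above could be combined along the following lines. This is a minimal sketch, not the NDMA implementation: the action names, the role-to-action table, and the `Login` type are all hypothetical, and the slide's "Client Authorization" step is not modeled because the slide does not say what it involves.

```python
from dataclasses import dataclass

# Hypothetical mapping from role to permitted actions.
ROLE_ACTIONS = {
    "doctor": frozenset({"view_study", "view_record"}),
    "administrator": frozenset({"manage_users", "view_audit_log"}),
    "assistant": frozenset({"schedule_exam"}),
}


@dataclass(frozen=True)
class Login:
    role: str
    password_ok: bool
    token_ok: bool = False
    biometric_ok: bool = False


def is_authenticated(login):
    """Password plus a second factor: either a token or a biometric."""
    return login.password_ok and (login.token_ok or login.biometric_ok)


def is_allowed(login, action):
    """An action is allowed only for an authenticated user whose role permits it."""
    return is_authenticated(login) and action in ROLE_ACTIONS.get(login.role, frozenset())


print(is_allowed(Login("doctor", True, token_ok=True), "view_study"))         # True
print(is_allowed(Login("doctor", True), "view_study"))                        # False
print(is_allowed(Login("assistant", True, biometric_ok=True), "view_study"))  # False
```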

  46. NDMA Data Mining Challenges • Fuzzy matching for records • feature matching in images • clustering - outcomes, other variables • outlier search in many dimensions • computer assisted diagnosis R. Hollebeek
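
As an illustration of the first challenge, fuzzy matching for records, here is a minimal sketch using the Python standard library's `difflib.SequenceMatcher`. The slides do not say which matching method NDMA uses; the names, the threshold, and the helper functions are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(query, candidates, threshold=0.85):
    """Return the most similar candidate, or None if nothing reaches the threshold."""
    scored = [(similarity(query, c), c) for c in candidates]
    score, name = max(scored, default=(0.0, None))
    return name if score >= threshold else None


# Hypothetical record names; a misspelled query should still find its record.
records = ["John Smith", "Jane Smyth", "Joan Smithers"]
print(best_match("Jon Smith", records))
print(best_match("Xyz Qrw", records))
```

A production matcher for medical records would compare several fields (name, date of birth, identifiers) and would need a carefully validated threshold, since a false match between patients is far more costly than a missed one.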

  47. NDMA - http://nscp.upenn.edu/ndma

  48. NSCP with Children’s Hospital • To provide fast parallel processing over high speed nets so that functional MRI can be used in real time clinically • On the right: an individual noisy frame of a human brain R. Hollebeek

  49. Functional MRI • J. Yu, graduate student; degree in 2000; now on Wall Street R. Hollebeek
