
NSF Workshop on Cyberinfrastructure for Environmental Observatories Introduction to CI Topics

Chaitan Baru, SDSC/NLADR; Bertram Ludaescher, UC Davis/SDSC; Michael Welge, NCSA/NLADR


Presentation Transcript


  1. NSF Workshop on Cyberinfrastructure for Environmental Observatories: Introduction to CI Topics • Chaitan Baru, SDSC/NLADR • Bertram Ludaescher, UC Davis/SDSC • Michael Welge, NCSA/NLADR

  2. Outline • A nexus of CI projects • CI project “principles” • CI technical focus areas/topics • CI organizational issues

  3. CI Projects • Biomedical • BIRN-CC (Ellisman PI, Papadopoulos, Gupta, Baru, …) • National Biomedical Computational Resource, NBCR (Arzberger PI, Ellisman, Papadopoulos, Gupta, Baru, …) ... • Geosciences • GEON (Baru PI, Ludaescher, Papadopoulos, Helly, …) • SCEC (Jordan PI, Moore, …) • LEAD (Droegemeier PI, Wilhelmson, Welge, …) • Chronos (Cervato PI, Baru, …) • CUAHSI-HIS (Maidment PI, Helly, Zaslavsky, …) • LOOKING (Smarr/Orcutt PIs, Welge, Fountain, …) ... • Bio/Eco/Environmental • SEEK (Michener PI, Ludaescher, Jones, Rajasekar, …) • LTER (Michener PI, SDSC partner (Arzberger, Baru, Fountain, Rajasekar), …) • NEON (Hayden/Michener Lead PIs, Krishtalka, Baru, Welge, …) • ROADNet (Orcutt PI, Vernon, Rajasekar, Ludaescher, Fountain, …) • NSF/BDI Lake Metabolism (Arzberger/Kratz PIs, Fountain, …) ... • Engineering • Monitoring Health of Civil Infrastructure (El Gamal PI, Fountain, …) • CLEANER (Minsker, Welge, Zaslavsky, Fountain, Pancake, …) • CISE • OptIPuter (Smarr PI, Ellisman, Orcutt, Papadopoulos, Welge, …) • NMI, GRIDs Center • Data Intensive Grid Benchmarking (Baru PI, Snavely, Casanova) • MPS • NVO, GriPhyN, …

  4. CI Project Principles • Use state-of-the-art IT, and develop advanced IT where needed, to support the “day-to-day” conduct of science (e-science) • (not just “hero” computations) • Based on a Web/Grid services-based distributed environment • The “two-tier” approach • Use best practices, including commercial tools, • while developing advanced technology in open source, and doing CS research • An equal partnership • IT works in close conjunction with science to create CI, i.e., the best practices, data sharing frameworks, and useful and usable capabilities and tools • Create the “science IT infrastructure” • Online databases with advanced search engines • Robust tools and applications, etc. • Leverage other intersecting projects • Much commonality in the technologies, regardless of science discipline • Constantly work towards eliminating (or, at least, minimizing) the “NIH” syndrome • And, importantly, try not to reinvent what industry already knows how to do…

  5. Important Focus Areas / Topics • Security • Authentication, access control, controls for data publication… • Grid middleware • WSRF implementations, architecting “core” services (e.g. for metadata management, versioning, …) • Data integration and ontologies • Data interoperability, schema and semantic integration • Workflow systems • “system-level” and science workflows (ingestion and analysis) • Sensor network and sensor data management • Extensible, scalable, autonomic software; intelligent sensor management • Data mining • Online analysis, large-scale data, novel algorithms, advanced triggering and notification • Visualization • Large-scale, multi-model (data viz, GIS, info viz)

  6. Example: GEON “Software Stack” [Layered diagram, top to bottom] • Portal (myGEON) and other service “consumers” • Service interfaces: Registration, GEONsearch, GEONworkbench • Registration Services; Data Integration Services; GIS Mapping Services; Computational and Modeling Services • Core Grid Services: Data, Metadata, Indexing, Logging, Other Systems Services • “Physical” Grid: RedHat Linux, ROCKS, OGSI, Internet, I2, OptIPuter (planned)

  7. Antelope WSRF Extensions (Courtesy: Tony Fountain, SDSC and LOOKING project) [Architecture diagram; labels only] • Portal, sending a SOAP request over SOAP/HTTP: SOAP Header (Proxy Cert), SOAP Body (Request Params) • Antelope Web Services: WSRF Authentication & Authorization, Lookup Service, Service Invoker, Event Coordinator, Services Subscriber, Other Services • WS-Resources: ORB Manager, Database Operator, ORB Commander, Data Analyzer, ORB Monitor • Services Repository (name, definition, others); Proxy Repository (certs, username, password, others) • Antelope: field digitizers, Field Interface Modules, Object Ring Buffer (ORB), Databases, Antelope Executive Module • ORB Operations: Orb Import, Orb Export, Processing, Archiving

  8. CI Organizational Issues • How to foster development of common infrastructure (based upon science needs/input), across multiple science domains • Not just at hardware level (e.g. supercomputers, high-speed networks) or OS and system services level • But, at the database, data integration, data mining levels • How to deal with the continuum of activities from basic CS research to production IT systems • NLADR – created with above issues in mind • Prototype for a CI organization

  9. NLADR—National Lab for Advanced Data Research • Joint activity between SDSC and NCSA, started October 1, 2004 • Formed based on NSF’s requirement that SDSC and NCSA collaborate on CI activities • Collaborative R&D activity focused on advanced data technologies • Guided by real applications from science communities • …to assemble expertise and a “knowledge base” of data technologies • And, also develop a broad data architecture framework • …within which to develop, integrate, test, and benchmark data-related technologies • …in the context of national-scale physical infrastructure

  10. NLADR Services Architecture [Layered diagram, top to bottom] • Applications: NSF – LEAD, GEON, LTERGrid, CLEANER, LOOKING; NIH/NCRR – BIRN; NASA – Space & Earth Sciences; Strategic Industrial Partners – … • nladrSearch, DataWorkbench • NLADR Query, Analysis, and Visualization Services: Data Registration and Indexing; Database Federation & Integration; Workflow Authoring, Execution; Data Analysis and Mining; Data and Information Visualization; Collaboration; Benchmarking • NLADR Data Management Services: management and archiving of large simulation outputs, streaming data, databases, data collections • Grid and Web Middleware (Globus/WSRF/WebServices/J2EE) • Node Operating Systems (Linux, …) • Internet2, LambdaGrids, SDSC/NCSA testbed, OptIPuter

  11. Some Core IT Areas • Data integration and ontologies • Data interoperability, schema and semantic integration • Scientific Workflows

  12. Schema Integration (“registering” local schemas to a global schema) [Figure: source schemas and an integration schema] • Sources: Arizona, Idaho, Colorado, Utah, Nevada, Wyoming, Montana West, Montana E., New Mexico • Attribute names appearing across the source schemas: ABBREV, NAME, TYPE, FMATN, FORMATION, PERIOD, AGE, TIME_UNIT, LITHOLOGY • Example values shown: Livingston formation; Tertiary-Cretaceous; andesitic sandstone
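
A minimal sketch of what "registering" a local schema to a global schema could look like, assuming each source declares a mapping from its own attribute names to global ones. The global attribute names, the source identifiers `source_a` and `source_b`, and the pairing of attribute names with sources are all hypothetical; the slide text does not preserve which attributes belong to which state map.

```python
# Hypothetical global schema; the slide does not spell out the real one.
GLOBAL_SCHEMA = ("formation", "age", "lithology")

# Each source "registers" by mapping its local attribute names to global ones.
# Source names and attribute pairings are illustrative assumptions.
REGISTRATIONS = {
    "source_a": {"FMATN": "formation", "TIME_UNIT": "age"},
    "source_b": {"NAME": "formation", "PERIOD": "age", "LITHOLOGY": "lithology"},
}


def to_global(source, record):
    """Rewrite one local record into the global schema.

    Global attributes the source does not provide are set to None;
    local attributes with no registered mapping are dropped.
    """
    mapping = REGISTRATIONS[source]
    out = {attr: None for attr in GLOBAL_SCHEMA}
    for local_name, value in record.items():
        global_name = mapping.get(local_name)
        if global_name is not None:
            out[global_name] = value
    return out


row_a = to_global("source_a", {"FMATN": "Livingston formation",
                               "TIME_UNIT": "Tertiary-Cretaceous"})
row_b = to_global("source_b", {"NAME": "Livingston formation",
                               "PERIOD": "Tertiary-Cretaceous",
                               "LITHOLOGY": "andesitic sandstone"})
```

Once every source is registered this way, a single query over `formation`, `age`, and `lithology` can be answered from all sources without the user knowing the local column names.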

  13. Multihierarchical Rock Classification “Ontology” (Taxonomies) for “Thematic Queries” (GSC) Genesis Fabric Composition Texture

  14. Ontology-Enabled Application Example: Geologic Map Integration • Domain knowledge, captured via knowledge representation as a Geologic Age ontology • Show formations where AGE = ‘Paleozoic’ (without age ontology) • Show formations where AGE = ‘Paleozoic’ (with age ontology) • Nevada • +/- a few hundred million years
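
The difference between the two queries above can be illustrated with a small sketch, assuming the age ontology is a simple "contains" hierarchy of geologic time units. The period names under the Paleozoic are standard geology; the formation rows, the dictionary layout, and the function names are hypothetical and not taken from GEON.

```python
# A tiny fragment of a geologic age hierarchy (era -> periods -> subperiods).
AGE_ONTOLOGY = {
    "Paleozoic": ["Cambrian", "Ordovician", "Silurian",
                  "Devonian", "Carboniferous", "Permian"],
    "Carboniferous": ["Mississippian", "Pennsylvanian"],
}


def expand(term):
    """Return the term plus every age it contains, recursively."""
    terms = {term}
    for child in AGE_ONTOLOGY.get(term, []):
        terms |= expand(child)
    return terms


def formations_with_age(rows, age, use_ontology=True):
    """Select formation names whose recorded age matches the query."""
    wanted = expand(age) if use_ontology else {age}
    return [r["formation"] for r in rows if r["age"] in wanted]


# Hypothetical map records; real maps label ages at different granularities.
rows = [
    {"formation": "F1", "age": "Paleozoic"},
    {"formation": "F2", "age": "Devonian"},
    {"formation": "F3", "age": "Pennsylvanian"},
    {"formation": "F4", "age": "Jurassic"},
]

without_ontology = formations_with_age(rows, "Paleozoic", use_ontology=False)
with_ontology = formations_with_age(rows, "Paleozoic", use_ontology=True)
```

Without the ontology only the literal match is found; with it, formations recorded at finer granularity (Devonian, Pennsylvanian) are also returned.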

  15. Different views on State Geological Maps

  16. Sedimentary Rocks: BGS Ontology

  17. Sedimentary Rocks: GSC Ontology

  18. Example: Domain Knowledge to “glue” SYNAPSE & NCMIR Data • Domain expert knowledge: • Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. • Dendritic spines are ion (calcium) regulating components. • Spines have ion binding proteins. • Neurotransmission involves ionic activity (release). • Ion-binding proteins control ion activity (propagation) in a cell. • Ion-regulating components of cells affect ionic activity (release). • Made usable for the system using Description Logic, formalized as a domain map/ontology

  19. “Semantic Source Browsing”: Domain Maps/Ontologies (left) & conceptually linked data (right)

  20. A Semantic Mediation Result View

  21. In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map... Source Contextualization through Ontology Refinement • sources can register new concepts at the mediator ... • increase your data usability

  22. What is a Scientific Workflow (SWF)? • Aims: • automate a scientist’s repetitive data management and analysis tasks • typical phases: • data access, scheduling, generation, transformation, aggregation, analysis, mining, visualization • design, test, share, deploy, execute, reuse, … SWFs
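
To make the idea of automating repetitive phases concrete, here is a minimal sketch of a workflow as an ordered list of named steps, each consuming the previous step's output. This is a plain linear pipeline written for illustration; it is not Kepler's actor-based dataflow model, and the step names and sample data are hypothetical.

```python
def run_workflow(steps, data):
    """Run named steps in order, passing each step's output to the next.

    Returns the final result and a trace of executed step names,
    which gives a very simple form of provenance.
    """
    trace = []
    for name, step in steps:
        data = step(data)
        trace.append(name)
    return data, trace


# Hypothetical phases mirroring the list above: access, transformation,
# aggregation, analysis.
steps = [
    ("access", lambda _: [3.0, None, 5.0, 4.0]),                 # fetch raw readings
    ("transform", lambda xs: [x for x in xs if x is not None]),  # drop missing values
    ("aggregate", lambda xs: {"n": len(xs), "total": sum(xs)}),  # summarize
    ("analysis", lambda agg: agg["total"] / agg["n"]),           # compute the mean
]

result, trace = run_workflow(steps, None)
```

Because the workflow is just data (a list of steps), it can be stored, shared, and re-executed, which is the reuse aspect the slide lists.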

  23. Promoter Identification Workflow Source: Matt Coleman (LLNL)

  24. KEPLER/CSP: Contributors, Sponsors, Projects (or loosely coupled Communicating Sequential Persons ;-) • Ilkay Altintas (SDM, Resurgence) • Kim Baldridge (Resurgence, NMI) • Chad Berkley (SEEK) • Shawn Bowers (SEEK) • Terence Critchlow (SDM) • Tobin Fricke (ROADNet) • Jeffrey Grethe (BIRN) • Christopher H. Brooks (Ptolemy II) • Zhengang Cheng (SDM) • Dan Higgins (SEEK) • Efrat Jaeger (GEON) • Matt Jones (SEEK) • Werner Krebs (EOL) • Edward A. Lee (Ptolemy II) • Kai Lin (GEON) • Bertram Ludaescher (SDM, SEEK, GEON, BIRN, ROADNet) • Mark Miller (EOL) • Steve Mock (NMI) • Steve Neuendorffer (Ptolemy II) • Jing Tao (SEEK) • Mladen Vouk (SDM) • Xiaowen Xin (SDM) • Yang Zhao (Ptolemy II) • Bing Zhu (SEEK) • … • Ptolemy II • www.kepler-project.org

  25. Scientific Workflows as a Melting Pot: Example: The Kepler SWF System • A grass-roots project • collaboration at the level of developers • Intra-project links • e.g. in SEEK: AMS  SMS  EcoGrid • Inter-project links • SEEK ITR, GEON ITR, ROADNet ITRs, DOE SciDAC SDM, Ptolemy II, NIH BIRN (coming we hope …), UK eScience myGrid, … • Inter-technology links • Globus, SRB, JDBC, web services, soaplab services, command line tools, R, GRASS, XSLT, … • Interdisciplinary links • CS, IT, domain sciences, …

  26. Promoter Identification Workflow in KEPLER

  27. Promoter Identification Workflow in KEPLER

  28. Web Services → Actors (WS Harvester) • “Minute-made” (MM) WS-based application integration • Similarly: MM workflow design & sharing w/o implemented components

  29. Job Management (here: NIMROD) • Job management infrastructure in place • Results database: under development • Goal: 1000’s of GAMESS jobs (quantum mechanics)

  30. Some Recent Actor Additions

  31. in KEPLER (w/ editable script) Source: Dan Higgins, Kepler/SEEK

  32. Blurring Design (ToDo) and Execution

  33. Towards Real-time Analysis Pipelines: Combining Simulations, Models, and Observations

  34. A Briefing on Data Mining to the NSF Planning Meeting Discussion Group on Cyberinfrastructure for Environmental Observatories, December 6 & 7, Arlington, VA • Michael Welge, University of Illinois/NCSA, welge@ncsa.uiuc.edu

  35. Modern Discovery and Problem Solving • Team-oriented and collaborative • Information-based, decision focused • Requires large-scale data fusion and analysis • All data is not under user’s control • Geographically distributed experts • Geographically distributed data and applications • Multiple stakeholders – multiple objectives

  36. Enabling Scientist [Layered diagram, top to bottom] • Scientists, Engineers, Decision Makers, Policy Makers, Media and Citizens: engaging in discovery, analysis, discussion, deliberation, decisions, policy formulation and communication • Collaboration Framework: facilitates idea and knowledge sharing, eLearning and multi-objective decision support processes • Analysis Framework: facilitates data and model discovery, exploration, and analysis, via the Collaboration Framework • Data Management Framework: builds logical maps of distributed, heterogeneous information resources (data, models, tools, etc.) and facilitates their use via the Analysis and Collaboration Frameworks • Physical Infrastructure

  37. Data Streams – large number of applications • Sensor networks • Massive Simulation data sets (stored but random access is too expensive) • Monitoring & surveillance: video streams • Network monitoring and traffic engineering • Text based systems • RFID tags • Web logs and Web page click streams • Credit card transaction flows • Telecommunication calling records • Engineering & industrial processes: power supply & manufacturing

  38. Support for Large Data-Driven Problems • Streaming Data • Continuous, unbounded, rapid, time-varying • Huge volumes of continuous data, possibly infinite • Unpredictable arrival • Fast changing and requires real-time response • Random access is expensive, so an application can only have one look at the data • May require methods to detect rare events • Large Static Data • Databases involving many terabytes can exceed reasonable processing capacity • Thousands of files create problems of management and version control • Thousands of fields create problems with model building • May require auxiliary models to support data quality issues • May require methods to detect rare events • Distributed data store necessary for some application domains
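
The "one look at the data" constraint can be sketched with a single-pass computation: running mean and variance are updated per item (Welford's method), and a value is flagged as a rare event when it lies far from the running mean. This is a generic illustration, not a method described in the slides; the threshold, warm-up length, and sample stream are assumptions.

```python
import math


class RunningStats:
    """Single-pass mean and sample standard deviation (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def rare_events(stream, threshold=3.0, warmup=5):
    """Yield values more than `threshold` standard deviations from the
    running mean. Each item is seen exactly once and never stored."""
    stats = RunningStats()
    for x in stream:
        sd = stats.std()
        if stats.n >= warmup and sd > 0 and abs(x - stats.mean) > threshold * sd:
            yield x
        stats.update(x)


# Hypothetical sensor readings with one obvious outlier.
flagged = list(rare_events([10, 11, 9, 10, 11, 9, 10, 50, 10]))
```

Memory use is constant regardless of stream length, which is what makes the approach applicable to unbounded streams.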

  39. Managing and Mining Data Streams

  40. Event Federation I: Data Sources • Connect with data sources. • Parse source data to form (composite) events according to type definitions. • Collect and stage events for retrieval. [Diagram labels: data sources 1, 2, …, N; Parse and Compose; Type Info; Event Interface; Event Collector; Persistence; Buffering; Stream Clients; EQL]

  41. Event Federation 2: EventWorks • Monitors are event expression recognition agents. • Recognize Event • Evaluate Conditions • Act • EQL (Event Query Language) implements a compositional semantics for event expressions. • Composite events are “first order” events. • Monitors can monitor monitors. • Clock events are part of the language implementation. • Easy to write queries with temporal constraints. • Monitors are generated by users or programmatically. [Diagram labels: Streams; Event Router; Monitor 1 … Monitor N, each with EQL; New Events]
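
The recognize / evaluate / act cycle, and the idea that monitors can monitor monitors, can be sketched as follows. This is not EQL or the EventWorks API; the `Event`, `Monitor`, and `Router` classes and the event type names are hypothetical, and the sketch assumes no monitor recognizes its own output type (otherwise routing would not terminate).

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    type: str
    time: float
    data: dict = field(default_factory=dict)


class Monitor:
    """Recognize an event, evaluate a condition, then act by emitting a new event."""

    def __init__(self, name, recognize, condition, out_type):
        self.name = name
        self.recognize = recognize  # Event -> bool
        self.condition = condition  # Event -> bool
        self.out_type = out_type    # type of the event emitted on a match

    def on_event(self, event):
        if self.recognize(event) and self.condition(event):
            return Event(self.out_type, event.time,
                         {"source": self.name, "cause": event})
        return None


class Router:
    """Deliver every event, including monitor-generated ones, to all monitors."""

    def __init__(self, monitors):
        self.monitors = list(monitors)

    def publish(self, event):
        emitted, queue = [], [event]
        while queue:
            current = queue.pop(0)
            for monitor in self.monitors:
                new_event = monitor.on_event(current)
                if new_event is not None:
                    emitted.append(new_event)
                    queue.append(new_event)  # emitted events are routed like any other
        return emitted


hot = Monitor("hot", lambda e: e.type == "temperature",
              lambda e: e.data["value"] > 30, "hot")
alert = Monitor("alert", lambda e: e.type == "hot",
                lambda e: True, "alert")  # a monitor watching another monitor's output
router = Router([hot, alert])

fired = [e.type for e in router.publish(Event("temperature", 1.0, {"value": 35}))]
quiet = router.publish(Event("temperature", 2.0, {"value": 20}))
```

Feeding emitted events back through the router is what lets derived events be treated the same as source events.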

  42. D2K : A Framework For Building Data-Driven Apps – Persistent Stream Data Analytics Foundation Designed for Building and Maintaining Complex Persistent and Stream Data-Driven Applications http://alg.ncsa.uiuc.edu

  43. D2K/T2K/I2K: Data, Text, and Image Analysis http://alg.ncsa.uiuc.edu

  44. LOOKING: Stream Data Analytics/Information Visualization • Scientific “dashboard” • Online Stream Query Engine and Online Stream Classification: use novel methods to do real-time stream data analysis; adaptable to the changes and evolution of data streams • Online Frequent Pattern Mining: discovers association and correlation rules in a data stream environment • Online Clustering of Data Streams: detects outliers and finds evolution of clusters in data streams

  45. CI Issues Architecture – NEON/CLEANER

  46. Real-time Visualization of RFID people location sensors: Supercomputing IntelliBadge™

  47. Atmospheric Science: Analytic Feature Extraction Scientific Visualization Techniques

  48. LOOKING: Scientist Analytical/Spatial-temporal Visualization Techniques

  49. LOOKING/OptIPuter/Planetary Collaboratory • U of W • 1024 Processor Altix • 3 TB Shared Memory • >300 TeraBytes Disk • 8 X 8 Processor, 4 Pipe, 16 gig Memory each, Prisms coupled with InfiniBand for On-demand, Interactive

  50. NLADR Tier 1 Architecture
