
Kepler, Opal and Gemstone



  1. Kepler, Opal and Gemstone
  Amarnath Gupta, University of California San Diego

  2. Changing Needs for Scientific Process
  Observe → Hypothesize → Conduct experiment → Analyze data → Compare results and conclude → Predict
  The traditional scientific process (before computers). Yesterday… at least for some of us!

  3. What's different in today's science?
  Observe → Hypothesize → Conduct experiment → Analyze data → Compare results and conclude → Predict
  • Observing / Data: microscopes, telescopes, particle accelerators, X-rays, MRIs, microarrays, satellite-based sensors, sensor networks, field studies…
  • Analysis, Prediction / Models and model execution: potentially large computation and visualization
  Today's scientific process, with more to add to this picture: network, Grid, portals, and more.

  4. A Brief Recap
  What are scientific workflow systems trying to achieve?
  • Creation of a problem-solving environment over distributed and mostly autonomous platforms
  • Seamless access to resources and services
  • Service composition and reuse
  • Scalability
  • Detached execution that still allows user interaction
  • Reliability and fault tolerance
  • "Smart" re-runnability
  • Reproducibility of execution
  • Information discovery as an aid to workflow design

  5. What is Kepler?
  • Derived from an earlier scientific dataflow system called Ptolemy II, which is
    • designed to model heterogeneous, concurrent systems for engineering applications
    • an actor-based workflow paradigm
  • Kepler adds to Ptolemy II:
    • new components for scientific workflows
    • structural and semantic type management
    • semantic annotation and annotation-propagation mechanisms
    • distributed execution capabilities
    • execution in a grid framework
    • …

  6. Promoter Identification Workflow (source: Matt Coleman, LLNL)

  7. Promoter Identification Workflow

  8. Enter initial inputs, Run and Display results

  9. Custom Output Visualizer

  10. Kepler System Architecture
  (Architecture diagram: Ptolemy at the base; Kepler core extensions above it, including the Kepler Object Manager, provenance framework, smart re-run / failure recovery, SMS, type system extensions, actor and data search, authentication, and documentation; the Vergil GUI and Kepler GUI extensions on top.)

  11. What is an Actor-based Workflow?
  An actor-based workflow is a graph with three kinds of components:
  • Actors: passive (parameterized) programs specified by their input and output signatures
    • Ports: an actor has a set of input and output ports, each specified by the signature of the data tokens passing through it; there are no call semantics
    • Attributes
  • Dataflow connections: a connectivity specification that designates the flow of data from one actor to another
    • Relation: an intermediate data-holding station
  • Director: an execution-control model that coordinates the execution behavior of a workflow
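
To make these terms concrete, here is a minimal, self-contained sketch of actors with ports, dataflow connections backed by token queues, and a director that coordinates firing. It is an illustration of the concepts only, not the Ptolemy II / Kepler API; all class and method names are invented for this sketch.

```python
# Illustrative only: a toy model of actors, ports, dataflow connections, and a
# director. Not the Ptolemy II / Kepler API.
from collections import deque

class Actor:
    """A parameterized component that communicates only through ports (no call semantics)."""
    def __init__(self, name):
        self.name = name
        self.inputs = {}    # input port name -> token queue (the "relation" holding data)
        self.outputs = {}   # output port name -> list of downstream input queues

    def connect(self, out_port, other, in_port):
        """Dataflow connection: tokens written to out_port flow to other's in_port."""
        q = other.inputs.setdefault(in_port, deque())
        self.outputs.setdefault(out_port, []).append(q)

    def ready(self):
        # Dataflow firing rule: ready when every input port has a token waiting.
        return all(self.inputs[p] for p in self.inputs)

    def fire(self):
        raise NotImplementedError

class Ramp(Actor):
    """Source actor: emits the integers 0 .. n-1 on its 'out' port."""
    def __init__(self, name, n):
        super().__init__(name)
        self.pending = list(range(n))

    def ready(self):
        return bool(self.pending)

    def fire(self):
        token = self.pending.pop(0)
        for q in self.outputs.get("out", []):
            q.append(token)

class Doubler(Actor):
    """Transformer actor: reads a token on 'in', writes twice its value on 'out'."""
    def fire(self):
        token = self.inputs["in"].popleft()
        for q in self.outputs.get("out", []):
            q.append(2 * token)

class Display(Actor):
    """Sink actor: prints whatever arrives on 'in'."""
    def fire(self):
        print(self.name, "received", self.inputs["in"].popleft())

class Director:
    """Execution-control model: here, a simple repeated sweep over ready actors."""
    def __init__(self, actors):
        self.actors = actors

    def run(self):
        fired = True
        while fired:
            fired = False
            for actor in self.actors:
                if actor.ready():
                    actor.fire()
                    fired = True

# Wire a three-actor workflow (Ramp -> Doubler -> Display) and run it.
ramp, doubler, display = Ramp("ramp", 3), Doubler("doubler"), Display("display")
ramp.connect("out", doubler, "in")
doubler.connect("out", display, "in")
Director([ramp, doubler, display]).run()
```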

  12. Composite Actors
  • Composite actor AW
    • A pair (W, ΣW) comprising a subworkflow W and a set of distinguished ports ΣW ⊆ freeports(W), the i/o-signature of W
    • The i/o-signatures of the subworkflow W and of the composite actor AW containing W match, i.e., ΣW = ports(AW)
  • An actor can be "refined" by treating it as a workflow and adding other restrictions around it
  • Workflow abstraction
    • One can substitute a subworkflow as a single actor
    • The subworkflow may have a different director than the higher-level workflow
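
As a small, self-contained illustration of this abstraction (again not Kepler's API), the sketch below represents a subworkflow as an ordered list of single-port "actors" and exposes only the composite's distinguished ports to the outside:

```python
# Illustrative sketch of workflow abstraction: a composite actor pairs a
# subworkflow W with a set of distinguished ports, and from the outside it
# looks like any other actor with that i/o-signature.

def double(tokens):            # atomic "actor": one input port, one output port
    return {"out": 2 * tokens["in"]}

def add_one(tokens):
    return {"out": tokens["in"] + 1}

class CompositeActor:
    """A subworkflow W plus its distinguished (free) ports Sigma_W."""
    def __init__(self, steps, free_inputs, free_outputs):
        self.steps = steps                      # ordered subworkflow stages
        self.ports = (free_inputs, free_outputs)  # the i/o-signature seen from outside

    def fire(self, tokens):
        # Internally the subworkflow may even use its own director/schedule;
        # externally only the distinguished ports are visible.
        for step in self.steps:
            tokens = {"in": step(tokens)["out"]}
        return {"out": tokens["in"]}

# The composite computing 2*x + 1 can be substituted wherever a single actor fits.
double_plus_one = CompositeActor([double, add_one], ["in"], ["out"])
print(double_plus_one.fire({"in": 5}))   # {'out': 11}
```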

  13. Mineral Classification Workflow

  14. PointInPolygon algorithm
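
The slide names the PointInPolygon algorithm used in the mineral classification workflow without showing it. Below is a hedged sketch of the standard ray-casting variant, which may differ from the implementation the workflow actually uses.

```python
# Standard ray-casting point-in-polygon test (illustrative; the workflow's
# actual implementation is not shown in the slides).

def point_in_polygon(x, y, polygon):
    """polygon: list of (x, y) vertices in order. Returns True if (x, y) is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count crossings of a horizontal ray extending to the right of (x, y).
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Example: a unit square.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(point_in_polygon(0.5, 0.5, square))  # True
print(point_in_polygon(1.5, 0.5, square))  # False
```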

  15. Execution Model
  • Actors
    • Asynchronous: many actors can be ready to fire simultaneously
    • Execution ("firing") of an actor starts when (matching) data is available at its input ports
    • Locally controlled events: events correspond to the "firing" of an actor
    • An actor may be a single instruction or a sequence of instructions
    • Actors fire when all their inputs are available
  • Directors are the workflow engines that
    • implement different computational models
    • define the semantics of execution of actors and workflows, and of interactions between actors
  • Process Network (PN) Director
    • Each actor executes as a separate thread or process
    • Data connections represent queues of unbounded size
    • Actors can always write to output ports, but may get suspended (blocked) on input ports without a sufficient number of data tokens
    • Performs buffer management and deadlock detection, and allows data forks and merges
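
The sketch below illustrates the PN style of execution described above: each actor runs in its own thread and data connections are FIFO queues on which the consumer blocks. It is a toy sketch, not Kepler's PN director; the end-of-stream marker is an assumption of this sketch.

```python
# Toy Process Network (PN)-style execution: one thread per actor, blocking
# FIFO channels between them.
import queue
import threading

def producer(out_q):
    for i in range(5):
        out_q.put(i)          # writes to output ports never block here
    out_q.put(None)           # end-of-stream marker (an assumption of this sketch)

def doubler(in_q, out_q):
    while True:
        token = in_q.get()    # blocks until a token is available
        if token is None:
            out_q.put(None)
            break
        out_q.put(2 * token)

def consumer(in_q):
    while True:
        token = in_q.get()
        if token is None:
            break
        print("consumed", token)

q1, q2 = queue.Queue(), queue.Queue()   # unbounded FIFO channels
threads = [
    threading.Thread(target=producer, args=(q1,)),
    threading.Thread(target=doubler, args=(q1, q2)),
    threading.Thread(target=consumer, args=(q2,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```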

  16. The Director
  Execution phases:
  • pre-initialize (method of all actors)
    • Run once per workflow execution
    • Are the data types of all actor ports known? Are transport protocols known?
  • type-check
    • Are connected actors type compatible?
  • run*
    • initialize (executed per run)
      • Are all the external services (e.g., web services) working? Replace dead services with live ones…
    • iteration*
      • pre-fire: are all data in place?
      • fire*
      • post-fire: any updates for local state management?
  • wrap-up
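
A minimal sketch of this phased lifecycle follows. The method names mirror the slide, not Kepler's actual API, and the toy actor exists only to make the example runnable.

```python
# Hedged sketch of the phased execution described above: preinitialize and a
# type check once per workflow execution, then per-run initialize and repeated
# prefire/fire/postfire iterations, finally wrapup.

class ToyDirector:
    def __init__(self, actors):
        self.actors = actors

    def execute(self, runs=1):
        for a in self.actors:
            a.preinitialize()                 # once per workflow execution
        self.type_check()                     # are connected actors type compatible?
        for _ in range(runs):
            for a in self.actors:
                a.initialize()                # e.g., check external services are alive
            while True:
                ready = [a for a in self.actors if a.prefire()]   # all data in place?
                if not ready:
                    break
                for a in ready:
                    a.fire()
                for a in ready:
                    a.postfire()              # update local state
        for a in self.actors:
            a.wrapup()

    def type_check(self):
        pass  # placeholder: verify the port types of connected actors match

class CountDown:
    """Trivial actor: fires until its counter reaches zero."""
    def __init__(self, n):
        self.n = n
    def preinitialize(self): print("preinitialize")
    def initialize(self):    self.remaining = self.n
    def prefire(self):       return self.remaining > 0
    def fire(self):          print("fire", self.remaining)
    def postfire(self):      self.remaining -= 1
    def wrapup(self):        print("wrapup")

ToyDirector([CountDown(3)]).execute()
```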

  17. Polymorphic Actors: Components Working Across Data Types and Domains
  • Actor data polymorphism:
    • Add numbers (int, float, double, Complex)
    • Add strings (concatenation)
    • Add complex types (arrays, records, matrices)
    • Add user-defined types
  • Actor behavioral polymorphism:
    • In dataflow, add when all connected inputs have data
    • In a time-triggered model, add when the clock ticks
    • In discrete-event, add when any connected input has data, and add in zero time
    • In process networks, execute an infinite loop in a thread that blocks when reading empty inputs
    • In CSP, execute an infinite loop that performs rendezvous on input or output
    • In push/pull, ports are push or pull (declared or inferred) and behave accordingly
    • In real-time CORBA*, priorities are associated with ports and a dispatcher determines when to add
  By not choosing among these when defining the component, we get a huge increment in component reusability. But how do we ensure that the component will work in all these circumstances?
  Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/
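
Data polymorphism is easy to illustrate: one "Add" actor whose behavior is resolved by the token types it receives. The sketch below uses plain Python dispatch, not Ptolemy's type lattice, and is only meant to convey the idea.

```python
# One polymorphic "add" that works across the data types listed on the slide.

def add(a, b):
    if isinstance(a, list) and isinstance(b, list):
        return [add(x, y) for x, y in zip(a, b)]   # element-wise for arrays
    return a + b                                   # numbers add, strings concatenate

print(add(1, 2))              # 3
print(add(1.5, 2.5))          # 4.0
print(add("foo", "bar"))      # foobar
print(add([1, 2], [3, 4]))    # [4, 6]
```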

  18. GEON: Geosciences Network
  • Multi-institution collaboration between IT and Earth Science researchers
  • Funded by the NSF "large" ITR program
  • The GEON cyberinfrastructure provides:
    • authenticated access to data and Web services
    • registration of data sets and tools, with metadata
    • search for data, tools, and services, using ontologies
    • a scientific workflow environment
    • data and map integration capability
    • visualization and GIS mapping
  www.geongrid.org

  19. LiDAR Introduction
  (Diagram: Survey → Point Cloud (x, y, z, …) → Interpolate / Grid → Process & Classify → Analyze / "Do Science". Image credits: R. Haugerud, U.S.G.S.; D. Harding, NASA.)

  20. LiDAR Difficulties
  • Massive volumes of data: 1000s of ASCII files
  • Hard to subset
  • Hard to distribute and interpolate
  • Analysis requires high-performance computing
  • Traditionally: popularity > resources

  21. A Three-Tier Architecture
  GOAL: Efficient LiDAR interpolation and analysis using GEON infrastructure and tools
  • GEON Portal
  • Kepler Scientific Workflow System
  • GEON Grid
  • Use scientific workflows to glue/combine the different tools and the infrastructure

  22. Lidar Workflow Process
  (Diagram: subset → interpolate/analyze → visualize/render → display, spanning the portal (configuration, monitoring/translation) and the grid (scheduling, output processing).)
  • Configuration phase
    • Subset: DB2 query on DataStar
    • Interpolate: Grass RST, Grass IDW, GMT…
    • Visualize: Global Mapper, FlederMaus, ArcIMS

  23. Lidar Processing Workflow (using Fledermaus)
  (Dataflow diagram: a subset is extracted from IBM DB2 on DataStar, moved via NFS-mounted disks to the Arizona cluster for processing into a grid file, rendered into a scene file with Fledermaus CreateScene, and displayed in iView3D or a browser.)

  24. Lidar Workflow Portlet
  • User selections from the GUI
  • Translated into a query and a parameter file
  • Uploaded to the remote machine
  • Workflow description created on the fly
  • Workflow response redirected back to the portlet
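
"Workflow description created on the fly" corresponds to the workflow template and filled template shown a few slides below: user selections become a parameter set that is substituted into a template. The sketch below illustrates that step only; the element and parameter names are made up for illustration, and the real template would be the Kepler workflow description the portlet uses.

```python
# Hedged sketch: turn portlet GUI selections into a filled workflow description.
from string import Template

WORKFLOW_TEMPLATE = Template("""
<workflow name="lidar-postprocessing">
  <parameter name="region"    value="$region"/>
  <parameter name="algorithm" value="$algorithm"/>
  <parameter name="spacing"   value="$spacing"/>
</workflow>
""")

# Hypothetical user selections from the portlet GUI.
user_selection = {"region": "-117.1,32.7,-116.9,32.9",
                  "algorithm": "spline",
                  "spacing": "5m"}

workflow_description = WORKFLOW_TEMPLATE.substitute(user_selection)
print(workflow_description)   # uploaded to the remote machine alongside the query
```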

  25. LiDAR Post-Processing Workflow Portlet
  (Diagram of the Kepler workflow behind the portlet: the client/GEON Portal submits map parameters and a parameter XML; a DB2 spatial query extracts the x, y, z and attribute raw data; a workflow description is created and the processing is mapped onto the grid with Pegasus; Grass surfacing algorithms (spline, IDW, block mean, …) run on a compute cluster, producing binary/ASCII grids and text files; results are rendered as maps via ArcSDE/ArcInfo/ArcIMS or returned as Tiff/Jpeg/Gif and downloadable data.)

  26. Portlet User Interface - Main Page

  27. Portlet User Interface - Parameter Entry 1

  28. Portlet User Interface - Parameter Entry 2

  29. Portlet User Interface - Parameter Entry 3

  30. Behind the Scenes: Workflow Template

  31. Filled Template

  32. Example Outputs

  33. With Additional Algorithms

  34. Kepler System Architecture
  (Kepler system architecture diagram, repeated from slide 10.)

  35. The Hybrid Type System
  • Every port of an actor has a type signature
  • Structural types
    • Any type system admitted by the actor: DBMS data types, XML Schema, the Hindley-Milner type system, …
  • Semantic types
    • An expression in a logical language specifying what a data object means
    • In the SEEK project, such a statement is expressed in a description logic over an ontology, e.g., MEASUREMENT ⊓ ∃ITEM_MEASURED.SPECIES_OCCURRENCE
  • A workflow is well-typed if, for every pair of connected ports,
    • the structural type of the output port is a subtype of that of the input port, and
    • the semantic type of the output port is logically subsumed by that of the input port
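
A minimal sketch of the well-typedness rule follows. It is not Kepler's SMS implementation: the structural subtype relation and the ontology are toy lookup tables invented for this example.

```python
# Toy well-typedness check: for every connection, the output port's structural
# type must be a subtype of the input port's, and its semantic concept must be
# subsumed by the input port's concept.

STRUCTURAL_SUBTYPE = {("int", "int"), ("int", "float"), ("float", "float")}

# Toy ontology: child concept -> parent concept.
ONTOLOGY = {"SpeciesOccurrence": "Measurement", "Measurement": "Observation"}

def subsumed_by(concept, ancestor):
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ONTOLOGY.get(concept)
    return False

def well_typed(connections):
    for out_port, in_port in connections:
        if (out_port["struct"], in_port["struct"]) not in STRUCTURAL_SUBTYPE:
            return False
        if not subsumed_by(out_port["sem"], in_port["sem"]):
            return False
    return True

out_p = {"struct": "int",   "sem": "SpeciesOccurrence"}
in_p  = {"struct": "float", "sem": "Measurement"}
print(well_typed([(out_p, in_p)]))   # True
```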

  36. Hybridization Constraints
  • A hybridization constraint is a logical expression connecting instances of a structural type with instances of the corresponding semantic type for a port
    • Example: for a relational type r(site, day, spp, occ), having a tuple in r implies that there is a measurement y of the type SpeciesOccurrence corresponding to x_occ
  • I/O constraint: a constraint relating the input and output port signatures of an actor
  • Propagating hybridization constraints
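
The transcript does not preserve the formula itself; one plausible rendering of the constraint described in the example, with hasValue as an assumed predicate name, is:

```latex
\forall x_{site}, x_{day}, x_{spp}, x_{occ}\;
  \Big( r(x_{site}, x_{day}, x_{spp}, x_{occ}) \;\rightarrow\;
    \exists y \,\big( \mathit{SpeciesOccurrence}(y) \wedge \mathit{hasValue}(y, x_{occ}) \big) \Big)
```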

  37. How can my (grid) application become a Kepler actor?
  • By making it a web service
  • For applications that have a command-line interface, Opal can convert the application into a web service
  • What is Opal?
    • A Web services wrapper toolkit
    • Pros: generic, rapid deployment of new services
    • Cons: less flexible implementation; weak data typing due to the use of generic XML schemas

  38. Opal is an Application-Wrapping Service
  (Diagram: clients such as Gemstone, PMV/Vision, and Kepler invoke application services (with state management), secured by GAMA security services, which run over Globus on back-end resources: a PBS cluster, a Condor pool, and an SGE cluster.)

  39. The Opal Toolkit: Overview
  • Enables rapid deployment of scientific applications as Web services (< 2 hours)
  • Steps:
    • Application writers create configuration file(s) for a scientific application
    • The application is deployed as a Web service using Opal's simple deployment mechanism (via Apache Ant)
    • Users can now access the application as a Web service via a unique URL

  40. Opal Architecture
  (Diagram: Opal Web services are deployed in an Axis engine inside a Tomcat container; each service configuration specifies the binary, metadata, and arguments; container properties cover scheduler, security, and database setup; jobs execute on cluster/grid resources.)

  41. Service Operations
  • Get application metadata: returns the metadata specified in the application configuration
  • Launch job: accepts a list of arguments and input files (Base64 encoded), launches the job, and returns a jobID
  • Query job status: returns the status of a running job using the jobID
  • Get job outputs: returns the locations of the job outputs using the jobID
  • Get output as Base64: returns an output file in Base64-encoded form
  • Destroy job: uses the jobID to destroy a running job
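
The sketch below shows how a client might drive these operations over SOAP using the generic zeep library. The service URL is a placeholder, and the operation and field names (getAppMetadata, launchJob, queryStatus, getOutputs, destroy, argList, inputFile) are assumptions that mirror the slide's list; the WSDL of a deployed Opal service defines the actual signatures.

```python
# Hedged client sketch for the operations listed above (SOAP via zeep).
import base64
import time

from zeep import Client  # generic SOAP client, not an Opal-specific library

client = Client("http://example.org/opal2/services/MyAppService?wsdl")  # placeholder URL

print(client.service.getAppMetadata())          # application metadata

job = client.service.launchJob(
    argList="-i input.txt",
    inputFile=[{"name": "input.txt",
                "contents": base64.b64encode(b"...data...").decode()}],
)
job_id = job.jobID

# Poll until the job reaches a terminal state; the terminal status codes here
# are an assumption of this sketch and may differ in a real deployment.
while client.service.queryStatus(job_id).code not in (4, 8):
    time.sleep(10)

print(client.service.getOutputs(job_id))        # locations of the job outputs
client.service.destroy(job_id)                  # clean up the finished job
```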

  42. MEME+MAST Workflow using Kepler

  43. Kepler Opal Web Services Actor

  44. Opal and Gemstone

  45. Opal Summary
  • Opal enables rapidly exposing legacy applications as Web services
  • Provides features such as job management, scheduling, security, and persistence
  • More information, downloads, documentation: http://nbcr.net/services/

  46. Kepler System Architecture
  (Kepler system architecture diagram, repeated from slide 10.)

  47. Joint Authentication Framework
  Requirements:
  • Coordinating between the different security architectures
    • GEON uses GAMA, which requires a single certificate authority
    • SEEK uses LDAP, which has a centralized certificate authority with distributed subordinate CAs
    • LDAP needs to be connected with GAMA
  • Coordinating between two different GAMA servers
  • Single sign-on/authentication at the initialize step of the run for multiple actors that use authentication
    • This raises issues of a single GAMA repository vs. multiple ones, and requires users to have accounts on all servers
  • Kepler needs to be able to handle expired certificates for long-running workflows and/or for users who use it for a long time
  • A trust relation between the different GAMA servers must be established to allow single authentication
