UK e-Science Future Infrastructure for Scientific Data Mining, Integration and Visualisation

Malcolm Atkinson, Director of National e-Science Centre, www.nesc.ac.uk. 25th October 2002, SDMIV workshop, e-Science Institute, Edinburgh.


Presentation Transcript


  1. UK e-Science Future Infrastructure for Scientific Data Mining, Integration and Visualisation Malcolm Atkinson Director of National e-Science Centre www.nesc.ac.uk 25th October 2002 SDMIV workshop, e-Science Institute, Edinburgh

  2. Overview • UK e-Science • Reminder of Investment and Infrastructure • International e-Science • Examples and Collaboration • Data Access and Integration • Lego Bricks for Scientific Application Developers • Tailored: Application and Computing Scientists • A Computer Scientist’s Christmas List • Diversity and Opportunity • The Way Ahead

  3. e-Science • Fundamentally about Collaboration • Sharing • Ideas • Thought processes and Stimuli • Effort • Resources • Requires • Communication • Common understanding & Framework • Mechanisms for sharing fairly • Organisation and Infrastructure Requires Trust Scientists (Biologists) have done this for Centuries

  4. e-Science (take 2) Text, digital media, structured, organised & curated data, computable models, visualisation, shared instruments, shared systems, shared administration, … • Fundamentally about Collaboration • Sharing • Ideas • Thought processes and Stimuli • Effort • Resources • Requires • Communication • Common understanding & Framework • Mechanisms for sharing fairly • Organisation and Infrastructure Changing the ways Science is done Nationally & Internationally Distributed, … Routine, Daily, Automated, … That Requires very Significant Investment in Digital Systems and their Support

  5. e-Science (take 3) • Fundamentally about Collaboration • Sharing • Ideas • Thought processes and Stimuli • Effort • Resources • Requires • Communication • Common understanding & Framework • Mechanisms for sharing fairly • Organisation and Infrastructure Digital networks, digital work-places, digital instruments, … Metadata, ontologies, standards, shared curated data, shared codes, … Common platforms, shared software, shared training, … Citation, Authentication, Authorisation, Accounting, Provenance, Policies, … Shared Provision of Platform, The Grid SHOULD make this much easier by providing a common, supported high-level of Software and Organisational infrastructure

  6. Grid Expectations • Persistence • Always there, Always Working, Always Supported • Stability • You can build on foundations that don’t move • Trustworthy & Predictable • Honours commitments • Digital policies, digital contracts, security, … • Data integrity, longevity and accessibility • Performance • High-level & Extensible • The capabilities you need are already there • Ubiquitous • Your collaborators use it

  7. Grid Reality Political, Economic & Technical issues to Solve • Persistence • Always there, Always Working, Always Supported • Stability • You can build on foundations that don’t move • Trustworthy & Predictable • Honours commitments • Digital policies, digital contracts, security, … • Data integrity, longevity and accessibility • Performance • High-level & Extensible • The capabilities you need are already there • Ubiquitous • Your collaborators use it Early days but Open Grid Services link with Web Services + GGF standardisation Only Show in Town Not yet but very substantial global effort to achieve this Good basis for extension Commitment to basic functionality WS + Community effort Global & Industrial Rallying Cry Must work with Web Services

  8. UK Grid Network (map) • HPC(x) • National e-Science Centre • Access Grid always-on video walls • Sites: Edinburgh, Glasgow, Newcastle, Belfast, Manchester, Daresbury Lab, Cambridge, Oxford, Hinxton, RAL, Cardiff, London, Southampton

  9. SuperJanet4, June 2002 20Gbps 10Gbps Scotland via Glasgow Scotland via Edinburgh 2.5Gbps 622Mbps WorldCom Glasgow WorldCom Edinburgh 155Mbps NNW NorMAN YHMAN WorldCom Manchester WorldCom Leeds Northern Ireland EMMAN MidMAN WorldCom Reading WorldCom London EastNet TVN External Links WorldCom Bristol WorldCom Portsmouth South Wales MAN LMN SWAN & BWEMAN Kentish MAN Tony Hey July 2001 LeNSE

  10. Events Workshops Research Meetings International Meetings History of Events GGF5 HPDC11 Summer school > 50 workshops held > 1000 people in total Many return often Planned Events 25 workshops Conferences to 2005 Visitors 3 arrived 4 arranged International collaboration, visits & visitors China Argonne National Lab SDSC NCSA … Centre Projects Pilot Projects Regional Support Research Projects EPSRC, MRC, WT, SHEFC National e-Science Centre Please use this Facility

  11. UCSF UIUC From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign

  12. HEP sites ESA sites DataGrid Testbed Testbed Sites (>40) Dubna Moscow Lund Estec KNMI RAL Berlin IPSL Prague Paris Brno CERN Lyon Santander Milano Grenoble PD-LNL Torino Madrid Marseille BO-CNAF Pisa Lisboa Barcelona ESRIN Roma Valencia Catania Francois.Etienne@in2p3.fr - Antonia.Ghiselli@cnaf.infn.it

  13. Scientific Users Distributed Scheduling Monitoring Accounting Diagnosis Authorisation Logging Data & Compute Resources A Simplified Grid Anatomy Scientific Application Application Developers Grid Plumbing & Security Infrastructure Operations Team Owners

  14. Scientific Users Distributed Scheduling Monitoring Accounting Diagnosis Authorisation Logging Data & Compute Resources The Crux Keep all the (pink) groups HAPPY Scientific Application Application Developers Grid Plumbing & Security Infrastructure Operations Team Owners

  15. Data Integration Distributed Data Access Scheduling Monitoring Accounting Diagnosis Authorisation Logging Data & Compute Resources Structured Data A SDMIV Grid Anatomy SDMIV Users Scientific Application Grid Plumbing & Security Infrastructure Data Providers Data Curators

  16. Database Growth PDB protein structures

  17. Data Mining: Science vs Commerce • Data in files • FTP a local copy / subset, ASCII or Binary • Each scientist builds own analysis toolkit • Analysis is tcl script of toolkit on local data • Some simple visualization tools: x vs y • Data in a database • Standard reports for standard things • Report writers for non-standard things • GUI tools to explore data • Decision trees • Clustering • Anomaly finders Jim Gray UCSC April 2002

  18. You can GREP 1 MB in a second • You can GREP 1 GB in a minute • You can GREP 1 TB in 2 days • You can GREP 1 PB in 3 years • Oh!, and 1 PB ~10,000 disks • At some point you need indices to limit search, parallel data search and analysis • This is where databases can help • You can FTP 1 MB in 1 sec • You can FTP 1 GB / min (= 1 $/GB) • … 2 days and 1K$ • … 3 years and 1M$ • But… some science is hitting a wall: FTP and GREP are not adequate • 50,000 Kg, 250 KW, 60 Racks = 120 m² Jim Gray UCSC April 2002
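The scan times on this slide follow from simple arithmetic: a sequential search reads every byte, so the time grows linearly with the data size. The sketch below is a minimal illustration, assuming a sustained scan rate of 10 MB/s. That rate is an assumption chosen for illustration and does not come from the slide, whose own figures are rounded and imply somewhat different rates at each scale.

```python
# Sequential scan time grows linearly with data size.
# ASSUMPTION: a sustained rate of 10 MB/s, chosen for illustration only.
SCAN_RATE_BYTES_PER_SEC = 10 * 10**6

SIZES = {"1 MB": 10**6, "1 GB": 10**9, "1 TB": 10**12, "1 PB": 10**15}


def scan_seconds(size_bytes, rate=SCAN_RATE_BYTES_PER_SEC):
    """Time in seconds to read every byte once at a fixed rate."""
    return size_bytes / rate


def humanise(seconds):
    """Render a duration in the largest unit that fits."""
    for unit, length in (("years", 365 * 86400), ("days", 86400), ("minutes", 60)):
        if seconds >= length:
            return f"{seconds / length:.1f} {unit}"
    return f"{seconds:.1f} seconds"


for label, size in SIZES.items():
    print(f"{label}: {humanise(scan_seconds(size))}")
```

Under this assumed rate a petabyte scan takes about three years, which is the order of magnitude the slide quotes and the reason it argues for indices and parallel search.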

  19. Grid Services OGSA & OGSI Grid Technology Web Services www.gridforum.org/ogsi-wg www.gridforum.org/ogsa-wg www.gridforum.org/

  20. Web Services • Rapid Integration • Dynamic binding • Commercial Power • Financial & Political • Independence • Client from Service • Service from Client • Separation • Function from Delivery • Description • WSDL, WSC, WSEF, … • Tools & Platforms • Java ONE, Visual .NET • WebSphere, Oracle, … www.w3c.org/TR/SOAP or TR/wsdl

  21. Grid Technology • Virtual Organisations • Sharing & Collaboration • Security • Single Sign in, delegation • Distribution & fast FTP • But Various Protocols • Resource Management • Discovery • Process Creation • Scheduling • Monitoring • Portability • Ubiquitous APIs & Modules • Gov’nm’t Agency Buy in • Industrial Buy in Foster, I., Kesselman, C. and Tuecke, S., The Anatomy of the Grid: Enabling Virtual Organisations, Intl. J. Supercomputer Applications, 15(3), 2001 http://www.gridforum.org/ogsi-wg

  22. Applications Using operations Virtual Grid Services Implemented by Multiple implementations of Grid Services OGS infrastructure Open Grid Services Architecture Industrial Commitment Foster, I., Kesselman, C., Nick, J. and Tuecke, S., The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration

  23. Scientific Data • Deluge of Data • Exponential growth • Doubling times • Astronomy 12 months • Bio-Sequences 9 months • Functional Genomics 6 months • Bytes/dollar 12 to 18 months • Not How big it is but

  24. Scientific Data • Deluge of Data • Exponential growth • Doubling times • Astronomy 12 months • Bio-Sequences 9 months • Functional Genomics 6 months • Bytes/dollar 12 to 18 months • Not How big it is but • What you do with it • Sharing • Curation • Metadata • Automated movement, access & integration • Computational Access

  25. Scientific Data • Deluge of Data • Exponential growth • Doubling times • Astronomy 12 months • Bio-Sequences 9 months • Functional Genomics 6 months • Bytes/dollar 12 to 18 months • Not How big it is but • How you Embrace & Manage Change • The Database is a Knowledge chest • The Database is a Communication Hub • Autonomously Managed (Curated) change • An Essential part of e-BioMedical, Astronomical, …, Science & Engineering Data Federation & Integration is Hard
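The doubling times quoted on the Scientific Data slides can be turned into growth factors with the relation factor = 2^(period / doubling time). The sketch below is illustrative only: the three-year horizon is an assumption, not a figure from the talk.

```python
# Growth factor over a period, given a doubling time:
#   factor = 2 ** (period / doubling_time)
# Doubling times (in months) are the ones quoted on the slide.
DOUBLING_MONTHS = {
    "Astronomy": 12,
    "Bio-Sequences": 9,
    "Functional Genomics": 6,
    "Bytes/dollar (fast end)": 12,
    "Bytes/dollar (slow end)": 18,
}

# ASSUMPTION: a three-year horizon, chosen for illustration only.
PERIOD_MONTHS = 36


def growth_factor(doubling_months, period_months=PERIOD_MONTHS):
    """How many times larger a quantity becomes over the period."""
    return 2 ** (period_months / doubling_months)


for name, months in DOUBLING_MONTHS.items():
    print(f"{name}: x{growth_factor(months):.0f} in {PERIOD_MONTHS} months")
```

Over the assumed three years, functional genomics data grows 64-fold while bytes per dollar improve only 4-fold to 8-fold, so the data outpaces the storage budget.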

  26. Public curated data Shared data Glasgow Edinburgh Leicester Oxford Netherlands London Wellcome Trust: Cardiovascular Functional Genomics

  27. Data Access & Integration • Central to e-Science: Astronomy, Earth Sciences, Ecology, Biology, Medicine, … • Collaboration • Shared Databases • Curated Knowledge • Accumulated Observations • Accumulated Simulations • Computation • Data mining • Input to models • Calibration of models • Presentation • Publication of results • Visualisation

  28. GGF DAIS WG • Chairs • Norman Paton (Manchester Uni.) • Leanne Guy (CERN) • Dave Pearson (Oracle UK) • Activity • BoF GGF4 Toronto • WG Meeting GGF5 Edinburgh • Papers for GGF6 • Workshops & Mail lists • Goals • Agree Standards for Database Access & Integration • Freely available reference implementations • OGSA-DAI one source & focus for discussions Norman Paton, Inderpal Narang, Leanne Guy, Susan Maliaka, Greg Ricardi, … http://www.cs.man.ac.uk/grid-db/

  29. OGSA-DAI project • Lego kit for Data Access & Integration • Components for e-Science Applications • Accelerated Application Development • Multiple Data Models • Distributed Data • Access via Grid & Proxies • Integration, Translation & Transformation • Open Source Reference Implementation • For DAIS-WG standard • Trigger for Component Construction • Start a community

  30. OGSA-DAI Partners (map) • EPCC & NeSC • IBM UK • IBM USA • Manchester e-SC • Newcastle e-SC • Oracle • Map labels: Cambridge, Hinxton, Glasgow, Newcastle, Belfast, Manchester, Daresbury Lab, Oxford, RAL, Cardiff, London, IBM Hursley, Southampton • £3 million, 18 months, started February 2002

  31. Primary Components

  32. Advanced Components

  33. Composed Components

  34. Composing Components OGSA-DAI Component Data Transport OGSA-DAI Component Data Transport OGSA-DAI Component Data Transport Data Transport

  35. DAI Key Components • GridDataService (GDS): Access to data & DB operations • GridDataServiceFactory (GDSF): Makes GDS & GDSF • GridDataServiceRegistry (GDSR): Discovery of GDS(F) & Data • GridDataTranslationService (GDTS): Translates or Transforms Data • GridDataTransportDepot (GDTD): Data transport with persistence • Relational & XML models supported • Role-based Authorisation • Binary structured files
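The division of roles among registry, factory and data service can be illustrated with a toy model. The sketch below is not the OGSA-DAI API: every class and method name is hypothetical, an in-memory SQLite database stands in for a data resource, and the toy factory only creates services (the slide says a GDSF can also make further factories).

```python
import sqlite3


class ToyDataService:
    """Hypothetical stand-in for a GDS: access to one data resource."""

    def __init__(self, connection):
        self._conn = connection

    def perform(self, sql, params=()):
        """Run a query against the resource and return all rows."""
        return self._conn.execute(sql, params).fetchall()


class ToyFactory:
    """Hypothetical stand-in for a GDSF: creates data services on request."""

    def __init__(self, connection):
        self._conn = connection

    def create_service(self):
        return ToyDataService(self._conn)


class ToyRegistry:
    """Hypothetical stand-in for a GDSR: discovery of factories by keyword."""

    def __init__(self):
        self._factories = {}

    def register(self, keyword, factory):
        self._factories.setdefault(keyword, []).append(factory)

    def find(self, keyword):
        return list(self._factories.get(keyword, []))


# A data provider publishes a resource (made-up sample rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE structures (id TEXT, residues INTEGER)")
conn.executemany("INSERT INTO structures VALUES (?, ?)", [("A", 120), ("B", 340)])

registry = ToyRegistry()
registry.register("protein-structures", ToyFactory(conn))

# A client discovers a factory, asks it for a service, then queries the data.
factory = registry.find("protein-structures")[0]
service = factory.create_service()
rows = service.perform("SELECT id FROM structures WHERE residues > ?", (200,))
print(rows)
```

The point of the indirection is that the client needs to know only the registry; where the data lives and how the service is created stay on the provider's side.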

  36. Class GridService Registry NotificationConsumer NotificationProducer GDS Mandatory Optional Normal GDSF Mandatory Optional Normal GDSR Mandatory Mandatory Normal GDTS Mandatory GDTD Mandatory Optional Normal OGSA Relationship

  37. Class GridDataService DataTransport Factory GDS Mandatory Normal GDSF Optional Normal Mandatory GDSR Optional GDTS Optional Mandatory GDTD Optional Mandatory DAI portType Usage

  38. Distributed Query

  39. OGSA-DAI Time Line WS + GSI UK support ( > 100 downloads) XML + OGSA Prototypes for Early Adopters Design Documents & Demos for DAIS WG @ GGF5 XML + OGSA Prototype Available RDB + GT2 / OGSA Prototypes Available GGF6 WG Papers & Prototypes Ship Alpha Release for GT3 Integration Presentation & Beta @ GGF7 Productisation, RAMPS & Extension Feb ’02 May ’02 Jul ’02 Sep ’02 Dec ’02 Feb ’03 May ’03 Sep ’03 Phase 2 Starts Phase 1 Starts

  40. OGSA-DAI Summary • On Schedule & Going Well • Contributions via DAIS-WG @ GGF5 & 6 • Releases with GT3 Releases scheduled • Status: Early Days • Released prototypes • Tested Architectural Design • Using OGSA • Working with Early Adopter Pilot Projects • AstroGrid & MyGrid • First PRODUCT release Dec ‘02 • Influence OGSA-DAI direction • Via DAIS-WG & Direct messages to us

  41. Archive Archive Reference Data Instrument Raw Data Multi-stage Processing Processed Data In Silico Data Processing • Processing Characteristics • Well defined work flow • Correction, calibration, transformation, filtering, merging • Relatively static reference data • Stable processing functions (audited changes) • Periodic reprocessing from archive Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago

  42. Archive Summarisation Summarised Data Processed Data Analysis and Interpretation Analysis Characteristics - Variable workflow - Standard functions - Standard and personal filtering and summarisation - Retain drill down capability Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago

  43. Personalised Database Summarised Data Result data Retrieval & Update Processed Data Analysis and Interpretation • Conclusions/Inferences • Descriptions • Trends • Correlations • Relationships • Analysis and Interpretation Characteristics • Highly dynamic work flow • Multiple data types • Volatile data • Annotations, inferences, conclusions • Evidential reasoning • Shared multiple versions of truth • Periodic version consolidation Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago

  44. Metadata Requirements Technical Metadata • Direct referencing - Physical location and data schema/structure • Data currency/status – version, time stamping • Accreditation/Access permissions - Ownership (Dublin Core) • Query time/Governance - data volume, no. of records, access paths Contextual Metadata • Logical referencing physical data – semantic/syntactic ontologies • Lexical translation – Thesaurus, ontological mapping • Named derivations (summarisations) Scope of Requirements • All science communities • Related to provenance Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago

  45. Metadata Requirements Data Versioning • Distinguish latest/agreed version of data • Maintain history record of change • Synchronise and mirror replicated data • Distinguish shared personal interpretations and/or annotations Provenance • Record of data processing – calibration, filtering, transformation • Record of workflow – methods, standards and protocols • Reasoning – evidential justification for inferences & conclusions Scope of Requirements • All science communities • Includes Technical and Contextual Metadata Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago
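The versioning and provenance requirements above can be made concrete with a small data structure. The sketch below is one hypothetical way to keep a history of processing steps alongside each data version. The class names, field names and example values are all assumptions made for illustration and are not taken from the talk.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProcessingStep:
    """One provenance entry: what was done, by which method, to which input."""
    operation: str    # e.g. calibration, filtering, transformation
    method: str       # the standard or protocol that was followed
    inputs: tuple     # identifiers of the data versions consumed
    recorded_at: str  # ISO 8601 timestamp


@dataclass
class DataVersion:
    """A version of a dataset together with its history of change."""
    dataset: str
    version: int = 1
    history: list = field(default_factory=list)

    def derive(self, operation, method):
        """Return the next version, leaving this one untouched."""
        step = ProcessingStep(
            operation=operation,
            method=method,
            inputs=(f"{self.dataset}@v{self.version}",),
            recorded_at=datetime.now(timezone.utc).isoformat(),
        )
        return DataVersion(self.dataset, self.version + 1, self.history + [step])


# Hypothetical example: raw data is calibrated, then filtered.
raw = DataVersion("survey")
calibrated = raw.derive("calibration", "hypothetical protocol A")
filtered = calibrated.derive("filtering", "hypothetical protocol B")

for step in filtered.history:
    print(step.operation, step.inputs, step.method)
```

Because each derivation returns a new object, earlier versions stay available, which matches the requirement to distinguish the latest version while maintaining a history record of change.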

  46. Provenance Issues • Schema evolution • Granularity of record • Processed v Derived • Inheritance • Lack of structured annotations, ontologies • Interactive analysis = dynamic workflow • Multiple derived data sources • Context of usage • Best practice can change • Multiple versions of the truth • Evidential reasoning • Existing data & applications • Where is the provenance record stored Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago

  47. Collaborative Annotation • See DAS • Distributed Annotation Service • Challenges • Autonomy • Selective viewing • Identification • Provenance • Derivation

  48. Biomedical e-Scientists • Is this one species? • Understanding bird energy • Understanding a river / ocean interaction • Understanding a biochemical pathway • Understanding a cell • Understanding a Heart or Brain • Understanding Rhododendra • Understanding Evolution • … • No One-Size fits all solutions • But sharable re-usable components

  49. Opportunities • Many, many … • More than we can address • Compute needs • Data management needs • Data integration needs • … • Must choose some pioneers • To meet a range of common requirements • To provoke rich & high-level platform • To generate re-usable components • A Long-Term Commitment Needed

  50. Data Integration Distributed Data Access Scheduling Monitoring Accounting Diagnosis Authorisation Logging Data & Compute Resources Structured Data Advancing SDMIV Grid SDMIV Users Scientific Application SDMIV (Grid) Application Component Library Grid Plumbing & Security Infrastructure
