
Distributed Data Mining in Discovery Net

Distributed Data Mining in Discovery Net. Dr. Moustafa Ghanem Department of Computing Imperial College London. What is Discovery Net Distributed Data Mining for Compute Intensive Tasks Distributed Data Mining for Sensor Grids Knowledge Discovery from Naturally Distributed Data Sources

Presentation Transcript


  1. Distributed Data Mining in Discovery Net Dr. Moustafa Ghanem Department of Computing Imperial College London

  2. • What is Discovery Net
     • Distributed Data Mining for Compute Intensive Tasks
     • Distributed Data Mining for Sensor Grids
     • Knowledge Discovery from Naturally Distributed Data Sources
     • What Do Scientists Really Want?

  3. 1. What is Discovery Net

  4. What is Discovery Net?
     Funding: One of the eight UK national e-Science Pilot Projects funded by EPSRC (£2.2M). Start: Oct 2001. End: March 2005.
     Goal: Construct the world's first infrastructure for global knowledge discovery services.
     Key Technologies:
     • Open Service Computing
     • High Throughput Devices and Real Time Data Mining
     • Real Time Data Integration & Information Structuring
     • Cross Domain Knowledge Discovery and Management
     • Discovery Workflow and Discovery Planning

  5. Discovery Net Applications
     • Life Sciences: High throughput genomics and proteomics; Distributed Databases and Applications
     • Environmental Modelling: High throughput dispersed air sensing technology; Sensor Grids
     • Real time geo-hazard modelling: Earthquake modelling through satellite imagery; High performance Distributed Computation

  6. Discovery Net Architecture
     Goal: Plug & Play Data Sources, Analysis Components & Knowledge Discovery Processes
     • D-Net Clients: End-user applications and user interface allowing scientists to construct and drive knowledge discovery activities
     • D-Net Middleware: Provides services and execution logic for distributed knowledge discovery and access to distributed resources and services
     • Computation & Data Resources: Distributed databases, compute servers and scientific devices
     Diagram labels: DPML; Web/Grid Services; OGSA; High Performance Communication Protocol (GridFTP, DSTP..); Grid Infrastructure (GSI)

  7. Discovery Net Data Mining Components
     • Generic Data Mining: Classification, Clustering, Associations, ..
     • Unstructured-Data Mining: Text Mining, Image Mining
     • Domain-specific Mining: Bioinformatics, Cheminformatics, ..

  8. 2. Distribution of Compute Intensive Tasks a. Distributed Data Mining for Geo-hazard Prediction

  9. Grid-based Geo-hazard Data Mining
     Diagram labels: Data Warehousing & Modelling; Co-registration & geo-rectification; Image feature extraction; Clustering & classification
     • Grid-based HPC Computation
     • Workflow to Co-ordinate Grid Computation
     • Automatically co-register a stack of imagery layers at high precision and speed
     • Grid-based Data Access and Integration

  10. Normalised cross-correlation (NCC) template algorithm
      Flowchart labels: Image "before"; Image "after"; Reading Data set (×2); Setting search window; Setting comparing window (×2); Correlation coefficient; Significant correlation coefficient? (N / Y); Delta X, Delta X
      Operates on a remotely accessed MPI UNIX parallel computer over a fast network through the DNet interface. Slow but highly accurate: 24 processors take 10 hours for one scene of Landsat-7 ETM+ Pan imagery data. The algorithm also runs on the Grid.
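The slide names normalised cross-correlation (NCC) template matching but shows no code. The following is a minimal serial sketch of the idea, not the Discovery Net implementation: the function names `ncc`, `patch` and `best_offset`, the `threshold` value of 0.8 and the list-of-lists image representation are all assumptions made for illustration. It takes a template from the "before" image, slides a comparing window over a search area in the "after" image, and reports the offset with the highest correlation coefficient if that coefficient is significant.

```python
import math


def ncc(template, window):
    """Normalised cross-correlation coefficient of two equal-sized 2-D patches."""
    t = [v for row in template for v in row]
    w = [v for row in window for v in row]
    if len(t) != len(w):
        raise ValueError("patches must have the same number of pixels")
    mean_t = sum(t) / len(t)
    mean_w = sum(w) / len(w)
    num = sum((a - mean_t) * (b - mean_w) for a, b in zip(t, w))
    den = math.sqrt(sum((a - mean_t) ** 2 for a in t)
                    * sum((b - mean_w) ** 2 for b in w))
    # A flat patch has zero variance; treat it as uncorrelated.
    return num / den if den else 0.0


def patch(img, top, left, height, width):
    """Cut a height x width patch out of a list-of-lists image."""
    return [row[left:left + width] for row in img[top:top + height]]


def best_offset(before, after, top, left, size, search, threshold=0.8):
    """Find the (dx, dy) shift of a template between two images.

    The template is the size x size patch of `before` at (top, left).
    Comparing windows in `after` are tried at every shift within
    +/- `search` pixels. Returns (dx, dy, score), or None when the best
    score is below `threshold` (hypothetical significance cut-off).
    """
    template = patch(before, top, left, size, size)
    best = (None, None, -1.0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r, c = top + dy, left + dx
            # Skip comparing windows that fall outside the image.
            if r < 0 or c < 0 or r + size > len(after) or c + size > len(after[0]):
                continue
            score = ncc(template, patch(after, r, c, size, size))
            if score > best[2]:
                best = (dx, dy, score)
    return best if best[2] >= threshold else None
```

Every template position and every candidate shift is scored independently of the others, which is why a computation like this can be spread across many processors; how the work was actually partitioned on the MPI machine is not stated on the slide.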

  11. 2. Distribution of Compute Intensive Tasks b. Distributed Clustering

  12. Workflows for Distributed Data Clustering
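The slide gives only a title, so the clustering method used in these workflows is not known from the transcript. As one illustration of what distributing a clustering task can look like, here is a sketch of k-means in which each data partition computes local per-cluster sums and counts, and a coordinator merges them into new centroids. The choice of k-means, the function names `local_stats`, `merge_stats` and `distributed_kmeans`, and the in-process loop standing in for remote workers are all assumptions.

```python
def local_stats(points, centroids):
    """Per-partition step: assign points to the nearest centroid and
    return per-cluster coordinate sums and point counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        nearest = min(
            range(len(centroids)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
        )
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += p[d]
    return sums, counts


def merge_stats(partials, centroids):
    """Coordinator step: combine partial sums and counts into new centroids."""
    merged = []
    for k, old in enumerate(centroids):
        n = sum(counts[k] for _, counts in partials)
        if n == 0:
            # No partition assigned a point to this cluster: keep it in place.
            merged.append(list(old))
            continue
        merged.append([sum(sums[k][d] for sums, _ in partials) / n
                       for d in range(len(old))])
    return merged


def distributed_kmeans(partitions, centroids, iterations=10):
    """Run a fixed number of rounds; each round touches every partition once."""
    for _ in range(iterations):
        # In a real deployment each call would run where its partition lives.
        partials = [local_stats(part, centroids) for part in partitions]
        centroids = merge_stats(partials, centroids)
    return centroids
```

Only the small sums and counts cross the network in each round, never the raw points, which is the usual reason for structuring the computation this way.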

  13. 3. Distributed Mining over Sensor Grid Data Distributed Spatial Data Mining for Air Pollution Modelling

  14. The GUSTO Project - Update (Generic UV Sensors Technologies & Observations): Sensor Specification
      • High throughput open path spectrometer system
      • Robust algorithm for pollutant concentration retrievals
      • Measures SO2, NO, NO2, O3 & Benzene to ppb levels every few seconds
      • Geared for networking of multiple GUSTO units within a GRID Infrastructure
      • Can support Remote Sensing data for (contour) mapping of pollutants
      www.gusto-systems.com

  15. Networking of Multiple GUSTO Units
      Diagram labels: GUSTO unit 1; GUSTO unit 2; GUSTO unit 3; GUSTO unit 4; Wireless connectivity; Monitoring and control software; Sensor registry & control service; SensorML; HTTP, SOAP, GSI; Data access service; Warehouse; Data upload service; Public access; Web visualizer; Archived weather data; Archived health data; Visualisation and Data Mining; GRID Infrastructure
      www.gusto-systems.com

  16. Pollution analysis

  17. 4. Knowledge Discovery from Naturally Distributed Data Sources Distributed Data Mining in Life Sciences

  18. Distributed Data Mining for Life Sciences
      Diagram labels: secondary structure; tertiary structure; polymorphism; patient records; epidemiology; expression patterns; physiology; sequences; alignments; receptors; signals; pathways; ATGCAAGTCCCT AAGATTGCATAA GCTCGCTCAGTT; linkage maps; cytogenetic maps; physical maps

  19. Information Integration
      Given a collection of microarray-generated gene expression data, what kinds of questions do users wish to pose? Design an integration schema?
      Diagram labels: Gene Expression Warehouse; ExPASy SwissProt; PDB; ExPASy Enzyme; OMIM; Enzyme; Disease; Protein; Affy Fragment; LocusLink; Known Gene; MGD; Sequence; Pathway; SNP; Metabolite; SPAD; Sequence Cluster; Genbank; KEGG; NCBI dbSNP; NMR; UniGene

  20. From Data Integration to Knowledge Unification
      Diagram labels: In Silico Experiment; D-World; I-World; K-World

  21. Life Science Application: SC2002 HPC Challenge. D-Net based Global Collaborative Real-Time Genome Annotation (15 DBs, 21 Applications)
      • Nucleotide-level Annotation: Identify Organism; Chromosomes; Organism's DNA; Identify Genes; tRNAs, rRNAs; High Throughput Sequencers; Gene markers; Non-translated RNAs; EMBL; NCBI; genscan; blast; Regulatory Regions; Repetitive Elements; TIGR; SNP; grail; Repeat Masker; Segmental Duplication; SNP Variations; E-PCR; Literature References …..
      • Protein-level Annotation: Identify Proteins; Classify into Protein Families; Inter Pro; blast; 3D-PSSM; Functional Characterisation; Homologues; SMART; SWISS PROT; Domain; 3-D Structure; PFAM; Motif Search; Fold Prediction; Secondary structure; predator; DSC; Literature References …..
      • Process-level Annotation: Relate Pathway Maps; Ontologies; Cell Cycle; Metabolism; GO; CSNDB; Drugs; Biological Process…..; AmiGO; GeneMaps; KEGG; GK; Cell death; Embryogenesis; virtual chip; GenNav; Literature References …..

  22. HPC Challenge SC2002: Nucleotide Annotation Workflows
      Diagram labels: Interactive Editor & Visualisation; Download sequence from Reference Server; Save to Distributed Annotation Server; Execute distributed annotation workflow; Distributed data and computation (KEGG, Inter Pro, SMART, SWISS PROT, EMBL, NCBI, TIGR, SNP, GO); Real-time sequencing in London
      • 1800 clicks
      • 500 Web accesses
      • 200 copy/paste operations
      • 3 weeks of work in 1 workflow and a few seconds of execution

  23. Discovery Net in Action: China SARS Virtual Lab
      D-Net: Integration, interpretation, and discovery
      Diagram labels: Homology search against viral genome DB; Homology search against protein DB; Homology search against motif DB; Annotation using Artemis and GenSense; Genbank; Predicted genes; Gene prediction; Exon prediction; Key word search; Protein localization site prediction; Relationship between SARS and other virus; Splice site prediction; Protein interaction prediction; Multiple sequence alignment; GeneSense Ontology; Relationship between SARS virus and human receptors prediction; Phylogenetic analysis; Mutual regions identification; Immunogenetics; Classification and secondary structure prediction; Microarray analysis; SARS patients diagnosis; Epidemiological analysis; Bibliographic databases

  24. Discovery Net in Action: SARS Virus Mutation Analysis

  25. 5. What Do Scientists Really Want? Does it really work?

  26. Workflow Deployment: Grid Service and Portal
      Towards Compositional Grid Services
      Diagram labels: Native MPI; OGSA-service; Condor-G; Web Service; Web Wrapper; Sun Grid Engine; Unicore; Oracle 10g; Resource Mapping; Service Browsing; Workflow Execution; A compositional GRID; Workflow Authoring; Composing services; Workflow Warehousing; Service Abstraction; Workflow Management; Collaborative Knowledge Management

  27. Discovery Net Service Composition

  28. Full Workflow

  29. Executing Protein Annotation Workflow

  30. Deployment of Node

  31. Deploying Protein Annotation Workflow

  32. Executing Deployed Service

  33. Locating & Executing Deployed Service from Discovery Net

  34. Workflow Provenance

  35. Workflow Warehousing

  36. Discovery Net Snapshot
      Diagram labels: Real Time Data Integration; Discovery Services; Service Workflow; Operational Data; Literature; Instrument Data; Databases; Integrative Knowledge Management; Dynamic Application Integration; Using Distributed Resources; Images; Scientific Information; Scientific Discovery In Real Time
