IU ORE-Chem Update

Marlon Pierce, Geoffrey Fox

Indiana University

slide2

IU to lead New US NSF Track 2d $10M Award

See http://www.futuregrid.org for more information.

What We Said We Would Do
  • Apply data-centric workflow technologies (Dryad)
    • Significant effort
  • Install and run triple store
    • Done locally.
    • Need to do this in Azure.
  • Design alternative formats for ORE (JSON, Microformats)
    • Nothing to report yet
  • Design secure services, compositions, mash-ups
    • OAuth piece done.
    • Significant effort on social network interfaces
    • Nothing to report on ORE-Chem-enabled services yet
  • Investigate clouds for ORE-Chem
    • Infrastructure and runtime
    • Significant effort on virtual data stores, overheads of virtualization.
Layer Cake of IU Activities
  • Web 2.0 Research: Security for REST Services
  • Cloud Computing: Infrastructure and Runtimes
  • Infrastructure: Windows HPC Testbeds

Cloud Infrastructure
  • Tempest: HP distributed shared memory cluster with 768 processor cores and 1.5 TB total memory capacity. The cluster includes 13.7 TB of local spinning disk.
    • Tempest can be dynamically reconfigured to act as either a Windows HPC or Linux cluster.
    • Smaller versions Madrid and Barcelona
  • Other machines:
    • The IBM iDataPlex system is an IBM e1350 distributed shared memory cluster with 1024 processor cores and 3 TB total memory capacity.
    • Cray XT5m distributed shared memory cluster with 672 processor cores and 1.3 TB total memory capacity.
    • A shared memory system with at least 480 cores and 640 GB of RAM will also be installed at IU as part of the FutureGrid award.
Triple Store: Intellidimension
  • This has been installed on IU servers.
  • We are ready for data.
  • Efforts to install this on MS Azure were not successful.
    • Inadequate documentation earlier in the year.
    • We will revisit this.
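
For illustration, once ORE-Chem RDF is loaded, the store can be queried over a SPARQL HTTP endpoint. A minimal C# sketch is below; the endpoint URL is a placeholder, and only the standard ORE vocabulary term ore:aggregates is assumed.

using System;
using System.Collections.Specialized;
using System.Net;
using System.Text;

// Sketch: POST a SPARQL query to a triple store's SPARQL endpoint.
// The endpoint URL below is a placeholder, not the deployed IU service.
class TripleStoreQuerySketch
{
    static void Main()
    {
        string endpoint = "http://triplestore.example.indiana.edu/sparql";  // placeholder

        string query =
            "PREFIX ore: <http://www.openarchives.org/ore/terms/>\n" +
            "SELECT ?aggregation ?resource WHERE {\n" +
            "  ?aggregation ore:aggregates ?resource .\n" +
            "} LIMIT 10";

        using (var client = new WebClient())
        {
            var form = new NameValueCollection();
            form.Add("query", query);
            byte[] response = client.UploadValues(endpoint, form);  // HTTP POST
            Console.WriteLine(Encoding.UTF8.GetString(response));
        }
    }
}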
Open Elastic Block Store
  • Amazon EBS is a way to mount virtual disks in cloud-space.
    • Empty disk space or archived data stores
      • ORECHEM enabled data sets, for example.
    • Clone-able, so keep your own version of community data.
  • We are implementing an open version of this.
    • Contribute to Nimbus, an open-source EC2-style cloud toolkit
    • But independent of Xen, etc.
    • Would be interesting to do this for Windows
  • Eventual backbone: IU has over a petabyte of Lustre file system storage.
    • Can be used to load and store VMs.
  • X. Gao won best student poster award at TG09.
    • Paper accepted to E-Science 2009
Block Store Architecture

[Architecture diagram] A VBS Client calls the VBS Web Service, which drives two delegates: a Volume Delegate on the Volume Server (Create Volume, Export Volume, Create Snapshot, etc.) and a VMM Delegate on the Virtual Machine Manager (Xen Dom 0) (Import Volume, Attach Device, Detach Device, etc.). Volumes are exported from the Volume Server over iSCSI and appear inside the VM instance (Xen Dom U) as virtual block devices (VBDs).
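
To make the flow concrete, here is a hypothetical client-side sketch of the create/attach/snapshot sequence. The IVbsClient interface and its method names are illustrative only; they are not the actual VBS Web service API.

using System;

// Illustrative interface only -- the real VBS Web service API is not shown on the slide.
public interface IVbsClient
{
    string CreateVolume(int sizeInGB);                          // returns a volume id
    string CreateSnapshot(string volumeId);                     // snapshot/clone of a volume
    void AttachVolume(string volumeId, string vmInstanceId, string device);
    void DetachVolume(string volumeId, string vmInstanceId);
}

class VbsUsageSketch
{
    static void Run(IVbsClient vbs)
    {
        // 1. Create an empty volume (or clone a community data set from a snapshot).
        string volId = vbs.CreateVolume(20);

        // 2. Attach it to a running VM instance.  Behind the scenes the VBS Web service
        //    asks the Volume Delegate to export the volume over iSCSI and the VMM
        //    Delegate to import it and attach it to the guest as a virtual block device.
        vbs.AttachVolume(volId, "vm-instance-001", "/dev/xvdb");

        // ... use the volume inside the VM, e.g., to hold an ORE-CHEM data set ...

        // 3. Detach and snapshot so the data set can be cloned by others.
        vbs.DetachVolume(volId, "vm-instance-001");
        string snapId = vbs.CreateSnapshot(volId);
        Console.WriteLine("snapshot: " + snapId);
    }
}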

Integration with Cloud Computing Systems

[Architecture diagram] The VBS Client issues attach-volume <volId> <Nimbus Instance Id> <device> to a VBS_Nimbus Web Service, which queries the Nimbus Workspace Service for the Xen Dom0 host and DomU id corresponding to the Nimbus instance id. The VBS Web Service then uses the Volume Delegate on the Volume Server (Create Volume, Export Volume, Create Snapshot, etc.) and the Xen Delegate on Xen Dom 0 (Import Volume, Attach Device, Detach Device, etc.); the volume is exported over iSCSI and appears as a VBD in Xen Dom U.

Multicore and Cloud Technologies to Support Data-Intensive Applications
  • Using Dryad (Microsoft) and MPI to study the structure of gene sequences on the Tempest cluster. We are working on PubChem.

See http://www.infomall.org/salsa for lab projects (X. Qiu).

slide15

Courtesy of Jong Y. Choi.

The PubChem dataset consists of binary 166-bit MACCS keys (fingerprints), which indicate whether or not each chemical compound contains a particular structural feature. We have a total of 26,466,421 chemical compounds (i.e., the full PubChem dataset has 166 dimensions and about 26M records). We randomly selected 50K chemicals to produce a 3D GTM map; GTM is an algorithm that finds a lower-dimensional structure (3D in this case) in higher-dimensional data.

http://www.youtube.com/watch?v=nylgjKgnSLg
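
The selection step above amounts to drawing a uniform random sample of 50K fingerprints from the 26M-record set. A small sketch of doing this in one pass with reservoir sampling; the input file name and its one-fingerprint-per-line 0/1 format are assumptions.

using System;
using System.Collections.Generic;
using System.IO;

// Sketch: reservoir-sample 50,000 of the ~26M MACCS fingerprints in a single pass
// so the subset can be fed to GTM.  Assumes a text file with one 166-character
// 0/1 string per line (hypothetical file name and format).
class FingerprintSampler
{
    static void Main()
    {
        const int sampleSize = 50000;
        var rand = new Random();
        var reservoir = new List<string>(sampleSize);

        long seen = 0;
        foreach (string line in File.ReadLines("pubchem_maccs_fingerprints.txt"))
        {
            seen++;
            if (reservoir.Count < sampleSize)
            {
                reservoir.Add(line);
            }
            else
            {
                // Keep each of the 'seen' lines with equal probability sampleSize/seen.
                long j = (long)(rand.NextDouble() * seen);
                if (j < sampleSize) reservoir[(int)j] = line;
            }
        }

        File.WriteAllLines("gtm_input_50k.txt", reservoir);
        Console.WriteLine("Sampled " + reservoir.Count + " of " + seen + " fingerprints.");
    }
}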

IU’s ORE-CHEM Pipeline

[Flow diagram] Harvest NIH PubChem for 3D structures → Convert PubChem XML to CML → Convert CML to Gaussian input → Submit jobs to the TeraGrid with Swarm → Convert Gaussian output to CML → Convert CML to RDF/ORE-Chem → Insert RDF into the RDF triple store.

The goal is to create a public, searchable triple store populated with ORE-CHEM data on drug-like molecules.

Conversions are done with the Jumbo/CML tools from Peter Murray-Rust’s group at Cambridge. Swarm is a Web service capable of managing tens of thousands of jobs on the TeraGrid. We are developing a Dryad version of the pipeline.

Iterative MapReduce: K-means Clustering and Matrix Multiplication

Overhead of parallel runtimes – Matrix Multiplication

  • Compute-intensive application: O(n^3)
  • Higher data transfer requirements: O(n^2)
  • CGL-MapReduce shows minimal overheads, close to MPI

Iterative MapReduce algorithm for Matrix Multiplication

Overhead of parallel runtimes – K-means Clustering

  • O(n) calculations in each iteration
  • Small data transfer requirements: O(1)
  • With large data sets, CGL-MapReduce shows negligible overheads
  • Much higher overheads in Hadoop and Dryad

K-means clustering implemented as an iterative MapReduce application (see the sketch below)

Jaliya Ekanayake {jekanaya@cs.indiana.edu}
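
For reference, the structure of K-means as an iterative MapReduce computation looks roughly like the sketch below. Plain LINQ stands in for the distributed map and reduce phases, and the data, dimensionality, and iteration count are made up; this is not the CGL-MapReduce implementation.

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: K-means expressed in MapReduce style -- each iteration "maps" points to
// their nearest centroid and "reduces" each group to a new centroid, then iterates.
class KMeansSketch
{
    static double Distance(double[] a, double[] b)
    {
        return Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());
    }

    static void Main()
    {
        var rand = new Random(42);
        // Synthetic 2-D points; in the real application the points are partitioned across nodes.
        double[][] points = Enumerable.Range(0, 10000)
            .Select(_ => new[] { rand.NextDouble(), rand.NextDouble() })
            .ToArray();

        int k = 3;
        double[][] centroids = points.Take(k).Select(p => (double[])p.Clone()).ToArray();

        for (int iter = 0; iter < 20; iter++)   // fixed iteration count for the sketch
        {
            // "Map": assign each point the index of its nearest centroid.
            int[] assignments = points
                .Select(p => Enumerable.Range(0, k).OrderBy(c => Distance(p, centroids[c])).First())
                .ToArray();

            // "Reduce": average the points in each cluster to obtain new centroids.
            var grouped = assignments.Zip(points, (c, p) => new { c, p })
                .GroupBy(x => x.c)
                .ToDictionary(g => g.Key,
                              g => new[] { g.Average(x => x.p[0]), g.Average(x => x.p[1]) });

            centroids = Enumerable.Range(0, k)
                .Select(c => grouped.ContainsKey(c) ? grouped[c] : centroids[c])  // keep old centroid if a cluster is empty
                .ToArray();
        }

        Console.WriteLine("Final centroids: " +
            string.Join("; ", centroids.Select(c => "(" + c[0] + ", " + c[1] + ")")));
    }
}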

slide18

High Performance Parallel Computing on Cloud

  • Performance of MPI on virtualized resources
    • Evaluated using a dedicated private cloud infrastructure
    • Exactly the same hardware and software configurations in bare-metal and virtual nodes
    • Applications with different communication-to-computation ratios
    • Different virtual machine (VM) allocation strategies (1 VM per node to 8 VMs per node)

Performance of matrix multiplication under different VM configurations:

  • O(n^2) communication (n = dimension of the matrix)
  • More susceptible to bandwidth than latency
  • Minimal overheads under virtualized resources

Overhead under different VM configurations for the Concurrent Wave Equation Solver:

  • O(1) communication (smaller messages)
  • More susceptible to latency
  • Higher overheads under virtualized resources

Jaliya Ekanayake {jekanaya@cs.indiana.edu}

Conclusions: Dryad for Scientific Computing
  • Investigated several applications with various computation, communication, and data access requirements
  • All DryadLINQ applications work, and in many cases perform better than Hadoop
  • We can definitely use DryadLINQ (and Hadoop) for scientific analyses
  • We did not find (or implement) applications that can only be implemented with DryadLINQ but not with typical MapReduce
  • Current release of DryadLINQ has some performance limitations
  • DryadLINQ hides many aspects of parallel computing from user
  • Coding is much simpler in DryadLINQ than Hadoop (provided that the performance issues are fixed)
  • Key issue is support of inhomogeneous data

Architecture of Swarm Service

[Architecture diagram] The Swarm-Analysis service exposes a standard Web service interface and a large task load optimizer, backed by a local RDBMS, with connectors for Swarm-Grid, Swarm-Dryad, and Swarm-Hadoop. These connectors route jobs to the corresponding back ends: Grid HPC/Condor clusters, cloud computing clusters, and Windows server clusters.

Swarm-Grid
  • Swarm considers traditional Grid HPC clusters suitable for high-throughput jobs:
    • Parallel jobs (e.g., MPI jobs)
    • Long-running jobs
  • Resource Ranking Manager
    • Prioritizes resources using QBETS and INCA
  • Fault Manager
    • Fatal faults
    • Recoverable faults

[Architecture diagram] Swarm-Grid exposes a standard Web service interface in front of a Request Manager, Resource Ranking Manager, Data Model Manager, and Fault Manager, with per-user job boards, a job queue, and a job distributor backed by a local RDBMS. The Resource Ranking Manager consults the QBETS Web service hosted by UCSB, credentials are obtained from a MyProxy server, and a Grid HPC/Condor pool resource connector submits work via Condor (Grid/Vanilla universes) with Birdbath to Grid HPC clusters and Condor clusters hosted by the TeraGrid project.

Some Details
  • We can submit jobs to 3 different TeraGrid machines
    • Abe, Mercury, Cobalt (all at NCSA)
    • IU’s BigRed has some technical problems
  • Can do about 100-200 molecules per day in tests.
  • The approach is fragile because application/system administrators tend to change things every few months.
    • They don't validate Globus invocations of the codes
Dryad Data Partitioning
  • Two methods:
    • Manually place the files on every node, or
    • Write C# code that uses DryadLINQ partitioning operators such as HashPartition<T,K> or RangePartition<T,K>
  • A partitioned data set consists of two types of files:
    • A metadata file (.pt extension) that describes the partitions
    • A set of partition files, one for each data partition

Example .pt metadata file (annotations in parentheses):

\DryadData\UserName\InputData          (file path and name)
4                                      (number of partitions, depending on the number of nodes available)
0,2000,NODE01                          (per-partition lines: partition number, size in bytes, node name(s)[:file path])
1,2000,NODE02,NODE03:FilePath
2,2000,NODE03,NODE04
3,2000,NODE04
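
As an illustration of the manual method, the sketch below writes a .pt metadata file in the format shown above for partition files that have already been copied to the compute nodes. The paths and node names are hypothetical, and it assumes the partition files are reachable (e.g., over a share) so their sizes can be read.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Sketch: generate a DryadLINQ-style .pt metadata file for existing partition files.
class PartitionMetadataWriter
{
    static void Main()
    {
        string dataPathPrefix = @"\DryadData\UserName\InputData";   // file path and name

        // (partition file, node that holds it) pairs -- hypothetical layout.
        var partitions = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>(@"\\NODE01\DryadData\UserName\InputData.00000000", "NODE01"),
            new KeyValuePair<string, string>(@"\\NODE02\DryadData\UserName\InputData.00000001", "NODE02"),
            new KeyValuePair<string, string>(@"\\NODE03\DryadData\UserName\InputData.00000002", "NODE03"),
            new KeyValuePair<string, string>(@"\\NODE04\DryadData\UserName\InputData.00000003", "NODE04"),
        };

        var lines = new List<string>
        {
            dataPathPrefix,                 // line 1: path prefix of the partitioned data set
            partitions.Count.ToString()     // line 2: number of partitions
        };

        // One line per partition: partition number, size in bytes, node name.
        lines.AddRange(partitions.Select((p, i) =>
            string.Format("{0},{1},{2}", i, new FileInfo(p.Key).Length, p.Value)));

        File.WriteAllLines(dataPathPrefix + ".pt", lines);   // metadata file with the .pt extension
    }
}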

Programming the Pipeline
  • IQueryable<T> represents a query over the data.
  • Input data is represented by a PartitionedTable<T> object.
  • DryadLINQ programs apply LINQ query operations to PartitionedTable<T> objects.
  • LINQ queries on the PartitionedTable object are executed on the cluster.
  • Jobs are executed on different nodes, and the output is collected in outputDirectory.
  • IQueryable<LineRecord> filenames = PartitionedTable.Get<LineRecord>(filepathuri);
  • IQueryable<outputinfo> outputs = filenames.Select(s => XML2CML(execFileName, programSwitches, s.line, outputDirectory));
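
The XML2CML call above is a user-defined method that runs on each Dryad vertex; the slides do not show its body. Below is a minimal sketch of what such a helper might look like, shelling out to java.exe with jumboconverters.jar as described later in the deck. The outputinfo fields, command-line switches, and path handling are assumptions.

using System.Diagnostics;
using System.IO;

// Hypothetical record type returned to the DryadLINQ query for each converted file.
[System.Serializable]
public class outputinfo
{
    public string InputFile;
    public string OutputFile;
    public int ExitCode;
}

public static class Converters
{
    // Sketch only: invoke the Jumbo converters (java.exe + jumboconverters.jar) to turn
    // one PubChem XML file into CML.  The input line is assumed to be a file path.
    public static outputinfo XML2CML(string execFileName, string programSwitches,
                                     string inputFile, string outputDirectory)
    {
        string outputFile = Path.Combine(outputDirectory,
            Path.GetFileNameWithoutExtension(inputFile) + ".cml");

        var startInfo = new ProcessStartInfo
        {
            FileName = execFileName,                                   // e.g., java.exe
            Arguments = programSwitches + " " + inputFile + " " + outputFile,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return new outputinfo
            {
                InputFile = inputFile,
                OutputFile = outputFile,
                ExitCode = process.ExitCode
            };
        }
    }
}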
OAuth: REST Security
  • This is actually a Year 2 deliverable but we made progress in Year 1.
  • OAuth is essentially security for REST.
    • Provides authentication and authorization
    • Relevant to ORE-CHEM services
  • Use REST URL and HTTP method
    • Resources are identified by URLs
    • Access privileges are identified by HTTP methods (GET, POST, PUT, DELETE)
  • Extend OAuth
    • Add finer-grained authorization information in requests and responses.
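
For reference, OAuth 1.0 binds the signature to exactly these two things, the HTTP method and the request URL, plus the request parameters. Below is a minimal sketch of building the signature base string and signing it with HMAC-SHA1 using the .NET crypto classes; the URL, keys, and parameter values are made up, and Uri.EscapeDataString only approximates the RFC 3986 encoding a production implementation must match exactly.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Sketch: compute an OAuth 1.0 HMAC-SHA1 signature over METHOD & URL & parameters.
class OAuthSignatureSketch
{
    static string Encode(string s) { return Uri.EscapeDataString(s); }   // approximate RFC 3986 encoding

    static void Main()
    {
        // Made-up request: the HTTP method and resource URL identify the privilege being exercised.
        string httpMethod = "GET";
        string url = "http://example.org/orechem/molecules";
        long timestamp = (long)(DateTime.UtcNow - new DateTime(1970, 1, 1)).TotalSeconds;

        var parameters = new SortedDictionary<string, string>
        {
            { "oauth_consumer_key", "demo-key" },
            { "oauth_nonce", Guid.NewGuid().ToString("N") },
            { "oauth_signature_method", "HMAC-SHA1" },
            { "oauth_timestamp", timestamp.ToString() },
            { "oauth_version", "1.0" }
        };

        // Signature base string: METHOD & encoded URL & encoded, sorted parameter string.
        string paramString = string.Join("&",
            parameters.Select(kv => Encode(kv.Key) + "=" + Encode(kv.Value)));
        string baseString = httpMethod + "&" + Encode(url) + "&" + Encode(paramString);

        // Signing key: consumer secret & token secret (token secret empty here).
        string key = Encode("demo-consumer-secret") + "&";
        using (var hmac = new HMACSHA1(Encoding.ASCII.GetBytes(key)))
        {
            string signature = Convert.ToBase64String(
                hmac.ComputeHash(Encoding.ASCII.GetBytes(baseString)));
            Console.WriteLine("oauth_signature=" + Encode(signature));
        }
    }
}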
OAuth Security Status
  • The OAuth Core code provides the fundamental pieces of the OAuth 1.0 specification.
  • Includes a minimal webapp example
    • The sample web apps support only a shared secret.
    • We extended it to support PKI (see the RSA-SHA1 sketch below).
    • Also fixed some bugs in the code.
    • To support OAuth extensions, more code is needed in the OAuth core.
  • For OpenID, we use the OpenID4Java library, and it seems to offer enough functionality so far.
  • Tutorial given at TeraGrid09
  • Slides: http://www.collab-ogce.org/ogce/images/3/39/OAuthOverview-TG09.ppt
  • Code: http://ogce.svn.sourceforge.net/viewvc/ogce/incubator/OGCE-OAuth/
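
A minimal sketch of what the PKI extension amounts to: signing the OAuth signature base string with RSA-SHA1 instead of HMAC-SHA1, using the .NET crypto classes. The base string literal and the freshly generated key pair are placeholders; a real deployment loads the consumer's stored key.

using System;
using System.Security.Cryptography;
using System.Text;

// Sketch: RSA-SHA1 (PKI) signing and verification of an OAuth signature base string.
class RsaSha1SignatureSketch
{
    static void Main()
    {
        // Placeholder base string; in OAuth this is METHOD&encoded-URL&encoded-parameters.
        string baseString = "GET&http%3A%2F%2Fexample.org%2Forechem%2Fmolecules&oauth_consumer_key%3Ddemo-key";
        byte[] data = Encoding.ASCII.GetBytes(baseString);

        using (var rsa = new RSACryptoServiceProvider(2048))   // demo key pair
        {
            // Consumer signs; the signature is base64-encoded into oauth_signature.
            byte[] signature = rsa.SignData(data, new SHA1CryptoServiceProvider());
            Console.WriteLine(Convert.ToBase64String(signature));

            // Service provider verifies with the consumer's registered public key.
            bool verified = rsa.VerifyData(data, new SHA1CryptoServiceProvider(), signature);
            Console.WriteLine("verified: " + verified);
        }
    }
}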
Acknowledgments
  • Geoffrey Fox
  • Judy Qiu and SALSA team: data mining
    • www.infomall.org/salsa
  • Jaliya Ekanayake: Dryad and Cloud performance
  • Sangmi Pallickara: Swarm service
  • Xiaoming Gao: Virtual Block Store
  • Zhenhua Guo: OAuth, OpenID, and Social Networks

Dryad and DryadLINQ

Dryad is a high-performance, general-purpose distributed computing engine that simplifies the task of implementing distributed applications on clusters of computers running a Windows operating system.

DryadLINQ allows us to implement Dryad applications in managed code by using an extended version of the LINQ programming model and API. LINQ was introduced with Microsoft .NET Framework version 3.5.

The DryadLINQ provider translates the application’s LINQ queries into a Dryad job and runs the job as a distributed application on a Windows HPC cluster.

Source: "Dryad and DryadLINQ: An Introduction," version 1.0, June 30, 2009, Microsoft Research.

slide34

  • Client Workstation: runs the DryadLINQ application.
  • DryadLINQ Provider: creates a Windows HPC job on the cluster to handle the Dryad processing, receives the results, and returns them to the application.
  • Job Manager: a Windows HPC task that manages execution of the associated Dryad job on the cluster.
  • Head Node: manages the cluster and hosts the Windows HPC Administration Console and the Dryad management service.
  • Vertices: perform the data processing on the compute nodes.

slide35

[Diagram] On the local machine, java.exe with jumboconverters.jar performs the XML-to-CML and CML-to-Gaussian-input conversions. The DryadLINQ Provider (LinqToDryad.dll) then distributes the Gaussian input files across the Dryad cluster and runs gaussian.exe on every file at every node in the cluster.

slide36

[Diagram] All the initial XML files are distributed over the Dryad cluster. Stage 1 performs the XML-to-CML conversion, Stage 2 performs the CML-to-Gaussian conversion, and Stage 3 runs Gaussian on every file.

Drilling Through Data Clouds
  • Applications
    • Cheminformatics: mapping PubChem data into low dimensions to aid drug discovery
    • Biology: Expressed Sequence Tag (EST) sequence assembly (CAP3)
    • Biology: pairwise Alu sequence alignment (SW)
    • Health: correlating childhood obesity with environmental factors

[Technology stack diagram] Data mining algorithms: clustering (pairwise, vector), MDS, GTM, PCA, CCA. Visualization: PlotViz. Cloud technologies: MapReduce, Dryad, Hadoop. Classic HPC: MPI. FutureGrid / VM / virtual storage. Bare metal: computer, network, storage.

slide38

Architecture and Performance of Runtime Environments for Data Intensive Scalable Computing

CAP3 – Gene Assembly Program
  • Compute-intensive application
  • Embarrassingly parallel operation
  • All runtimes perform equally well

High Energy Physics Data Analysis
  • Data-intensive application
  • MapReduce-style parallel operation
  • Both runtimes perform comparably well

Measured using 32 compute nodes, each with 8 cores and 16 GB of memory. Data/compute-intensive applications are implemented as MapReduce "filters". [Charts: performance vs. number of reads processed; architecture of CGL-MapReduce]

Jaliya Ekanayake {jekanaya@cs.indiana.edu}