Grids and the School of Informatics at Indiana University

Grids and the School of Informatics at Indiana University. Sun Yat-sen University Guangzhou China November 4 2006 Geoffrey Fox Computer Science, Informatics, Physics Pervasive Technology Laboratories Indiana University Bloomington IN 47401 gcf@indiana.edu http://www.infomall.org.


Grids and the School of Informatics at Indiana University

Sun Yat-sen University

Guangzhou China November 4 2006

Geoffrey Fox

Computer Science, Informatics, Physics

Pervasive Technology Laboratories

Indiana University Bloomington IN 47401

gcf@indiana.edu

http://www.infomall.org


The Central Goal of Informatics

data

information

knowledge


What is Informatics?

  • Informatics is the integration of the art, science, and the human dimensions of information technology to provide solutions to discipline-specific problems

  • Informatics is a response to the data/information/knowledge gaps (data deluge) caused by “billions and billions of bits”

    • Grids are technology supporting this in distributed research


Bioinformatics Data Deluge: Challenge and Opportunity

1985: 1 experiment, 1 gene, 10 data

2000: 1 experiment, 10,000 genes, 10,000,000 data

OPPORTUNITY

CHALLENGE


Tech Centered Informatics: Computer & Information Science, including Web, Text, Data Mining

Domain Centered Informatics: Bio-, Health-, Chemical-, Music-, etc. Informatics, e-Science, Complex Systems, Modeling, Simulation

Human Centered Informatics: Human Computer Interaction, New Media, Social/Organizational Informatics, Security


School of Informatics Programs

B.S.

  • Computer Science (IUB)

  • Informatics (IUB/IUPUI/IUSB)

  • New Media: Media Arts and Science (IUPUI)

  • Health Information Administration (IUPUI)

M.S.

  • Computer Science (IUB)

  • New Media: Media Arts and Science (IUPUI)

  • Human Computer Interaction (IUB/IUPUI)

  • Bioinformatics (IUB/IUPUI)

  • Chemical Informatics (IUB/IUPUI)

  • Music Informatics (IUB)

  • Laboratory Informatics (IUPUI)

  • Health Informatics (IUPUI)

  • Cybersecurity (IUB)

Ph.D.

  • Computer Science (IUB)

  • Informatics (IUB/IUPUI)

  • Indiana University has 8 separate campuses

  • School currently at 3 of 8 campuses

  • Largest Campuses:

    • IUB Bloomington

    • IUPUI Indianapolis


IUB Faculty with One or More of Degrees Listed -- undergrad or grad -- of 65 total faculty

CS 40 Journalism 1

Math 7 Library/Info Science 2

Chemistry 4 Linguistics 1

Hist.of Sci./Tech. 3 Physics 5

Philosophy 2 Psychology 2

EE 3 Mathematics 5

Biology 2 Design 1

Comp. Lit. 1 Cog. Sci. 2

Anthropology 1 Aero. Engineering 1

Music 2 Public Policy 1


Undergraduate Profile – Bloomington

  • Informatics Majors (BS):............................. 382 students

  • Computer Science (BS and BA): .................. 135 students

  • Women: ...................................................... 13%

  • International Students: .............................. 8%

  • Number of Undergraduates Statewide: ....... 1,250

  • Average Starting Salary : ............................ $42,000

  • Placement Rate: ........................................ 90%

  • Note BA in Computer Science administered by the College of Arts and Sciences


e-moreorlessanything and the Grid

  • ‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’ from its inventor John Taylor, Director General of Research Councils UK, Office of Science and Technology

  • e-Science is about developing tools and technologies that allow scientists to do ‘faster, better or different’ research

  • Similarly e-Business captures an emerging view of corporations as dynamic virtual organizations linking employees, customers and stakeholders across the world.

    • The growing use of outsourcing is one example

  • The Grid provides the information technology e-infrastructure for e-moreorlessanything.

  • A deluge of data of unprecedented and inevitable size must be managed and understood.

  • People, computers, data and instruments must be linked.

  • On demand assignment of experts, computers, networks and storage resources must be supported


Why Grids/Cyberinfrastructure Are Useful

  • Supports distributed science – data, people, computers

  • Exploits Internet technology (Web2.0) adding management, security, supercomputers etc.

  • It has two aspects: parallel – low latency (microseconds) between nodes and distributed – highish latency (milliseconds) between nodes

  • Parallel needed to get high performance on individual 3D simulations, data analysis etc.; must decompose problem

  • Distributed aspect integrates already distinct components

  • Cyberinfrastructure is in general a distributed collection of parallel systems

  • Grids are made of services that are “just” programs or data sources packaged for distributed access


TeraGrid: Integrating NSF Cyberinfrastructure

Buffalo

Wisc

UC/ANL

Cornell

Utah

Iowa

PU

NCAR

PSC

IU

NCSA

Caltech

ORNL

USC-ISI

UNC-RENCI

SDSC

TACC

TeraGrid is a facility that integrates computational, information, and analysis resources at the San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications, Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh Supercomputing Center, and the National Center for Atmospheric Research.

Today 100 Teraflop; tomorrow a petaflop; Indiana 20 teraflop today.


Virtual Observatory Astronomy Grid: Integrate Experiments

Radio

Far-Infrared

Visible

Dust Map

Visible + X-ray

Galaxy Density Map


Grid Capabilities for Science

  • Open technologies for any large scale distributed system that is adopted by industry, many sciences and many countries (including UK, EU, USA, Asia)

    • Security, Reliability, Management and state standards

  • Service and messaging specifications

  • User interfaces via portals and portlets virtualizing to desktops, email, PDA’s etc.

    • ~20 TeraGrid Science Gateways (their name for portals)

    • OGCE Portal technology effort led by Indiana

  • Uniform approach to access distributed (super)computers supporting single (large) jobs and spawning lots of related jobs

  • Data and meta-data architecture supporting real-time and archives as well as federation

    • Links to Semantic web and annotation

  • Grid (Web service) workflow with standards and several successful instantiations (such as Taverna and MyLead)

  • Many grids including Bioinformatics, Chemistry and Earth Science

  • http://www.nsf.gov/od/oci/ci-v7.pdf


APEC Cooperation for Earthquake Simulation

  • ACES is a seven year-long collaboration among scientists interested in earthquake and tsunami prediction

    • iSERVO is Infrastructure to support work of ACES

    • SERVOGrid is (completed) US Grid that is a prototype of iSERVO

    • http://www.quakes.uq.edu.au/ACES/

  • Chartered under APEC – the Asia Pacific Economic Cooperation of 21 economies


Grid of Grids: Research Grid and Education Grid

[Figure: SERVOGrid as a Grid of Grids. Labels: Field Trip Data, Streaming Data, Sensors, Database, Sensor Grid, Database Grid, GIS Grid, Compute Grid, Discovery Services, Customization Services, Data Filter Services, Repositories / Federated Databases, Research Simulations, Analysis and Visualization Portal, Research, Education, Education Grid, Computer Farm, From Research to Education]


SERVOGrid and Cyberinfrastructure

  • Grids are the technology based on Web services that implement Cyberinfrastructure i.e. support eScience or science as a team sport

    • Internet scale managed services that link computers, data repositories, sensors, instruments and people

  • There is a portal and services in SERVOGrid for

    • Applications such as GeoFEST, RDAHMM, Pattern Informatics, Virtual California (VC), Simplex, mesh generating programs …..

    • Job management and monitoring web services for running the above codes.

    • File management web services for moving files between various machines.

    • Geographical Information System services

    • Quaketables earthquake specific database

    • Sensors as well as databases

    • Context (dynamic metadata) and UDDI system long term metadata services

    • Services support streaming real-time data


[Figure: examples of Earth science measurements. Labels: Site-specific Irregular Scalar Measurements; Constellations for Plate Boundary-Scale Vector Measurements; Ice Sheets; Volcanoes; PBO; Greenland; Long Valley, CA; Topography; 1 km; Stress Change; Northridge, CA; Earthquakes; Hector Mine, CA]


Some Grid Concepts I

  • Services are “just” (distributed) programs sending and receiving messages with well defined syntax

  • Interfaces (input-output) must be open; innards can be open source (allowing you to modify) or proprietary

    • Services can be any language from Fortran, Shell scripts, C, C#, C++, Java, Python, Perl – your choice!!

    • Web Services supported by all vendors (IBM, Microsoft …)

  • Service overhead will be just a few milliseconds (more now) which is < typical network transit time

    • Any program that is distributed can be a Web service

    • Any program taking execution time ≥ 20ms can be an efficient Web service
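The rule of thumb above can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes a fixed per-call service overhead of 2 ms (a hypothetical stand-in for "a few milliseconds") and defines efficiency as the fraction of total time spent in the program itself. The function name is invented for this example.

```python
def service_efficiency(exec_ms: float, overhead_ms: float) -> float:
    """Fraction of total time spent on useful work when a program
    with run time exec_ms is invoked as a service with a fixed
    per-call overhead of overhead_ms."""
    return exec_ms / (exec_ms + overhead_ms)


ASSUMED_OVERHEAD_MS = 2.0  # hypothetical value for "a few milliseconds"

for exec_ms in (1.0, 20.0, 1000.0):
    eff = service_efficiency(exec_ms, ASSUMED_OVERHEAD_MS)
    print(f"{exec_ms:7.1f} ms of work -> efficiency {eff:.3f}")
```

Under this assumption a 20 ms program keeps about 91% efficiency, while a 1 ms program loses most of its time to the service wrapper, which is the sense in which only sufficiently coarse-grained programs make efficient Web services.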


Web Services

  • Web Services build loosely-coupled, distributed applications (wrapping existing codes and databases) based on the SOA (service oriented architecture) principles.

  • Web Services interact by exchanging messages in SOAP format

  • The contracts for the message exchanges that implement those interactions are described via WSDL interfaces.
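To make the idea of a SOAP message concrete, here is a minimal sketch that builds a SOAP 1.1 request envelope with Python's standard library. The service namespace `urn:example:chem`, the operation name `getMolecularWeight`, and its `smiles` parameter are hypothetical and chosen only for illustration; a real service would take these from its WSDL description.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:chem"  # hypothetical service namespace


def build_soap_request(operation: str, params: dict) -> bytes:
    """Return a SOAP 1.1 envelope whose Body holds one operation element
    with one child element per parameter."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        child = ET.SubElement(op, f"{{{SERVICE_NS}}}{name}")
        child.text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


# Hypothetical operation and parameter, for illustration only.
message = build_soap_request("getMolecularWeight", {"smiles": "CCC"})
print(message.decode("utf-8"))
```

The message is plain XML, which is why the implementation language of the sender and the receiver does not matter as long as both agree on the interface.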


A typical Web Service

  • In principle, services can be in any language (Fortran .. Java .. Perl .. Python) and the interfaces can be method calls, Java RMI Messages, CGI Web invocations, totally compiled away (inlining)

  • The simplest implementations involve XML messages (SOAP) and programs written in net friendly languages like Java and Python

[Figure labels: Portal Service, Security, Catalog, Payment / Credit Card, Warehouse, Shipping control; Web Services connected through WSDL interfaces]


Some Grid Concepts II

  • Systems are built from contributions from many different groups – you do not need one “vendor” for all components as Web services allow interoperability between components

    • One reason DoD likes Grids (called Net-Centric computing)

  • Grids are distributed in services and data allowing anybody to store their data and to produce “their” view

    • Some think that University Library of future will curate/store data of their faculty

  • “2 level programming model”: Classic programming of services and services are composed using workflow consistent with industry standards (BPEL)

  • Grid of Grids (System of Systems): Realistically, Grid-like systems will be built using multiple technologies and “standards”; integrate separate Grids for Sensors, GIS, Visualization, computing etc. with the OGSA (Open Grid Service Architecture from OGF) system Grid (Security, registry) into a single Grid

  • Existing codes UNCHANGED; wrap as a service with metadata
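The "2 level programming model" and the "wrap as a service with metadata" idea can be sketched in a few lines. This is a toy, in-process illustration only: real Grid workflows would use Web services and an engine such as BPEL, and every name below (`wrap_as_service`, `run_workflow`, `scale`, `offset`) is hypothetical.

```python
from typing import Callable, Dict, List

Message = Dict[str, object]
Service = Callable[[Message], Message]


def wrap_as_service(name: str, existing_code: Callable[[float], float]) -> Service:
    """Level 1: wrap an unchanged function so that it consumes and
    produces messages, recording its name as simple metadata."""
    def service(message: Message) -> Message:
        result = existing_code(message["value"])
        history = list(message.get("history", [])) + [name]
        return {"value": result, "history": history}
    return service


def run_workflow(services: List[Service], message: Message) -> Message:
    """Level 2: compose services by passing each output message to the next."""
    for service in services:
        message = service(message)
    return message


# Two hypothetical "existing codes", left unchanged.
def scale(x: float) -> float:
    return x * 2.0


def offset(x: float) -> float:
    return x + 1.0


workflow = [wrap_as_service("scale", scale), wrap_as_service("offset", offset)]
print(run_workflow(workflow, {"value": 3.0}))
```

The point of the split is that the wrapped functions never change; only the workflow list decides how they are combined.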


TeraGrid User Portal


LEAD Gateway Portal

NSF Large ITR and Teragrid Gateway

- Adaptive Response to Mesoscale weather events

- Supports Data exploration, Grid Workflow


6 ≤ M

5 ≤ M ≤ 6

Background: Earthquake Forecast – Published Feb 19, 2002, in PNAS.

(JB Rundle et al., PNAS, v99, Suppl. 1, 2514-2521, Feb 19, 2002; KF Tiampo et al., Europhys. Lett., 60, 481-487, 2002; JB Rundle et al., Rev. Geophys. Space Phys., 41(4), DOI 10.1029/2003RG000135, 2003. http://quakesim.jpl.nasa.gov)

Color Scale  Decision Threshold

D.T. => “false alarms” vs. “failures to predict”

Eighteen significant earthquakes (M > 4.9; blue circles) have occurred in Central or Southern California. Margin of error of the anomalies is +/- 11 km; Data from S. CA. and N. CA catalogs:

After the work was completed

1. Big Bear I, M = 5.1, Feb 10, 2001

2. Coso, M = 5.1, July 17, 2001

After the paper was in press ( September 1, 2001 )

3. Anza I, M = 5.1, Oct 31, 2001

After the paper was published ( February 19, 2002 )

4. Baja, M = 5.7, Feb 22, 2002

5. Gilroy, M=4.9 - 5.1, May 13, 2002

6. Big Bear II, M=5.4, Feb 22, 2003

7. San Simeon, M = 6.5, Dec 22, 2003

8. San Clemente Island, M = 5.2, June 15, 2004

9. Bodie I, M=5.5, Sept. 18, 2004

10. Bodie II, M=5.4, Sept. 18, 2004

11. Parkfield I, M = 6.0, Sept. 28, 2004

12. Parkfield II, M = 5.2, Sept. 29, 2004

13. Arvin, M = 5.0, Sept. 29, 2004

14. Parkfield III, M = 5.0, Sept. 30, 2004

15. Wheeler Ridge, M = 5.2, April 16, 2005

16. Anza II, M = 5.2, June 12, 2005

17. Yucaipa, M = 4.9 - 5.2, June 16, 2005

18. Obsidian Butte, M = 5.1, Sept. 2, 2005

CL#03-2015

Plot of Log10(Seismic Potential)

Increase in Potential for significant events, ~ 2000 to 2010


ACES Components


Grid Workflow Datamining in Earth Science

  • Work with Scripps Institute

  • Grid services controlled by workflow process real time data from ~70 GPS Sensors in Southern California

[Figure labels: NASA GPS, Earthquake, Streaming Data Support, Transformations, Data Checking, Hidden Markov Datamining (JPL), Display (GIS)]


Grid Workflow Data Assimilation in Earth Science

  • Grid services triggered by abnormal events and controlled by workflow process real time data from radar and high resolution simulations for tornado forecasts

Use a Portlet-based user portal to access and control services and workflow


China National Grid

[Map labels: Beijing, Xi’an, Shanghai, Hong Kong, Hefei, Changsha]

From Qian Depei, Beihang


New Drug Discovery Grid

  • Undertaken by Shanghai Institute of Materia Medica CAS

  • Compound screening for new drug discovery

    • Speed up the process by computer simulation

    • Higher accuracy

  • Using HPC in P2P mode

  • New drug for diabetes is under development and will enter clinical testing by the end of 2005


New Drug Discovery Grid Platform

[Figure labels: Dawning 4000A (曙光 4000A), Sunway (神威), PC cluster (PC 机群), PC, Beijing Medical Institute, Shanghai SCC, Shanghai Institute of Materia Medica CAS, main server (主服务)]


DDG Portal


Bio-informatics Grid

  • Undertaken by Genomics & Bioinformatics Institute, CAS

  • Provide computing, data, and information grids for bio-information research in the country



ChinaGrid in a Nutshell

  • China Education and Research Grid

  • Funded by Ministry of Education

  • As the pilot grid application supported by China National Grid (CNGrid)

  • Based on CERNET (China Education and Research Network)

  • First Phase

    • From 2003-2005

    • 12 key universities as initiative

    • 20 key universities now


Architecture of Medical Image Processing Grid


Bioinformatics Grid


BioGrid Applications

  • Protein target selection for rice genome

  • Multi-sequence alignment for ganoderma family

  • Gene joint for white mice

  • Cardiovascular disease research


Chemical Informatics and Cyberinfrastructure Collaboratory (CICC) Grid Vision

  • Drug discovery and other academic chemistry and pharmacology research will be aided by powerful modern information technology: ChemBioGrid, set up as distributed cyberinfrastructure in the eScience model

  • ChemBioGrid will provide portals (user interfaces) to distributed databases, results of high throughput screening instruments, results of computational chemical simulations and other analyses

  • ChemBioGrid will provide services to manipulate this data and combine in workflows; it will have convenient ways to submit and manage multiple jobs

  • ChemBioGrid will include access to PubChem, PubMed, PubMed Central, the Internet and its derivatives like Microsoft Academic Live and Google Scholar

  • The services include open-source software like CDK, commercial code from vendors including BCI, OpenEye, Gaussian and Google, and any user contributed programs

  • ChemBioGrid will define open interfaces to use for a particular type of service allowing plug and play choice between different implementations

  • http://www.chembiogrid.org


Formal Cheminformatics Courses

  • I571 Chemical Information Technology (3 cr.)

    • Distance Ed section had 10 students in Fall 2005, from California to Connecticut

  • I572 Computational Chemistry and Molecular Modeling (3 cr.)

  • I573 Programming Techniques for Chemical and Life Science Informatics (3 cr.)

  • I553 Independent Study in Chemical Informatics (3 cr.)

  • Above courses required for the new Graduate Certificate Program in Chemical Informatics

  • I533 Seminar in Chemical Informatics

    • Spring 2006 Topic: Molecular Informatics, the Data Grid, and an Introduction to eScience

    • http://www.indiana.edu/~cheminfo/I533/533home.html

  • I647 Seminar in Chemical Informatics

    • Fall 2006 Topic: Bridging Bioinformatics and Chemical Informatics

    • http://www.indiana.edu/~cheminfo/I647/647home.html


Related Courses

  • L519 Bioinformatics: Theory and Application (3 cr.) (at IUPUI: CSCI 548)

  • L529 Bioinformatics in Molecular Biology and Genetics: Practical Applications (4 cr.) (not offered at IUPUI)

  • I619 Structural Bioinformatics (3 cr.)

  • I617 Informatics in Life Sciences and Chemistry (3 cr.) (for non-majors)

  • B649 Topics in Systems: Service Architectures and Science (3 cr.)

  • I590 Topics in Informatics: Scientific Applications of XML (IUPUI)



CICC Prototype Web Services

Basic cheminformatics

  • Molecular weights

  • Molecular formulae

  • Tanimoto similarity

  • 2D Structure diagrams

  • Molecular descriptors

  • 3D structures

  • InChi generation/search

  • CMLRSS

Application based services

  • Compare (NIH)

  • Toxicity predictions (ToxTree)

  • Literature extraction (OSCAR3)

  • Clustering (BCI Toolkit)

  • Docking, filtering, ... (OpenEye)

  • Varuna simulation

Key Ideas

  • Add value to PubChem with additional distributed services and databases

  • Wrapping existing code in web services is not difficult

  • Provide “core” (CDK) services and exemplars of typical tools

  • Provide access to key databases via a web service interface

  • Provide access to major Compute Grids

Next steps?

  • Define WSDL interfaces to enable global production of compatible Web services; refine CML

  • Ready to try “Prototype Production”

  • Develop more training material

  • Refine/go into production with key services including both tools, workflows and TeraGrid style simulations in capacity and capability modes

  • In-house algorithm work for new services in clustering, diversity analysis, QSAR methodologies


Web Service Locations

Cambridge University

  • InChi generation / search

  • CMLRSS

  • OpenBabel

Indiana University

  • Clustering

  • VOTables

  • OSCAR3

  • Toxicity classification

  • Database services

SDSC

Typical TeraGrid Site

InfoChem

  • SPRESI database

NIH

PubChem …..

Compare …..

Penn State University

CDK based services

  • Fingerprints

  • Similarity calculations

  • 2D structure diagrams

  • Molecular descriptors
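The fingerprint and similarity services listed above, and the "Tanimoto similarity" service named among the prototype Web services, rest on a simple formula: for two fingerprints treated as sets of "on" bits, the Tanimoto coefficient is the size of their intersection divided by the size of their union. The sketch below is a plain-Python illustration, not the CDK implementation; the bit positions are made up, and returning 1.0 for two empty fingerprints is a convention assumed here.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| for two fingerprints
    represented as sets of 'on' bit positions."""
    if not fp_a and not fp_b:
        return 1.0  # assumed convention for two empty fingerprints
    return len(fp_a & fp_b) / len(fp_a | fp_b)


# Made-up bit positions, for illustration only.
molecule_1 = {1, 2, 3, 4}
molecule_2 = {3, 4, 5, 6}
print(tanimoto(molecule_1, molecule_2))  # 2 shared bits out of 6 distinct bits
```

A value of 1.0 means identical fingerprints and 0.0 means no bits in common, which is what makes the coefficient convenient for "find similar molecules" searches.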


Workflows Using Chemical Literature

All of PubMed “just” takes about a day to run through OSCAR3 on 2048 node Big Red

[Figure: literature workflow. Labels: Bulk download of PubMed abstracts; OSCAR3 program; OSCAR3 Service; Extract chemical structures; Find similar documents; Find similar molecules; PubChem; PDBBind; Local DTP database; Searchable (structure/similarity) Grid database]

SMILES    NAME       Pubmed ID
CCC       propane    1425356
CC        ethane     3546453
.....     .......    .......

Clustering of documents linked to clustering of chemicals


Large Scale Calculations on “All of PubChem/Med”

  • TeraGrid: 100 Teraflop now to 1000 Teraflop next year

  • IU 2048 node Big Red supercomputer: 20 Teraflop today

  • The CDK can currently calculate approx. 107 Descriptors

    • Whole of PubChem (6M compounds) – 276 hours, 1 CPU

    • On IU's Big Red, 2048 CPU's, 20 TF: < 7 minutes

    • Even increasing the descriptor count by 5 times gives us < 35 minutes of compute time on Big Red

  • OSCAR3 takes a few seconds per abstract to text-mine all compounds in it

    • All of PubMed would take < a day on Big Red

    • Cleanup and Iteration would take some time

  • Can pre-calculate properties of smaller compounds using CDK (logP, BCUT, CPSA, …) and programs like GAMESS

    • 100,000 compounds take < a week each on a single CPU and would be a practical computation over next year
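As a rough cross-check of the descriptor timing above, the sketch below assumes perfect linear speedup and that each Big Red CPU is as fast as the CPU used for the 276-hour timing. Both are assumptions made here, not statements from the slides, and the function name is illustrative. Under them the whole-PubChem run comes to about eight minutes, the same order as the "< 7 minutes" quoted above; faster Big Red processors would account for the difference, though the slides do not say.

```python
def ideal_parallel_minutes(serial_hours: float, n_cpus: int) -> float:
    """Wall-clock minutes for a job of serial_hours CPU-hours spread over
    n_cpus, assuming perfect linear speedup (an idealization)."""
    return serial_hours * 60.0 / n_cpus


base = ideal_parallel_minutes(276, 2048)           # whole of PubChem, figures from the slide
five_times = ideal_parallel_minutes(276 * 5, 2048)  # five times as many descriptors
print(f"descriptors: {base:.1f} min; 5x descriptors: {five_times:.1f} min")
```

Either way the conclusion of the slide stands: a calculation that takes more than eleven days on one CPU drops to minutes on a machine of this size.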


Prototype CICC Project: Controlling the TGFb pathway

Collaboration between Baik & Zhang at IU

  • Questions:

    • What molecular feature controls inhibitor binding?

    • How do mutations impact binding?

Can afford few ms overhead!

[Figure labels: Web Service to generate custom force fields; AutoGeFF; QM Database; in-house Molecules in Varuna; VARUNA; Simulations; TeraGrid Supercomputers; “Flocks”; PubChem; PDB; Experiments in the Zhang Lab; Conceptual Understanding of TGFb Inhibition; Inactive TGFb; Active TGFb; With inhibitor; 1IAS]


MLSCN Post-HTS Biology Decision Support

Percent Inhibition or IC50 data is retrieved from HTS

Grids can link data analysis (e.g. image processing developed in existing Grids), traditional Chem-informatics tools, as well as annotation tools (Semantic Web, del.icio.us) and enhance lead ID and SAR analysis

A Grid of Grids linking collections of services at PubChem, ECCR centers and MLSCN centers

Workflows encoding plate & control well statistics, distribution analysis, etc

Question: Was this screen successful?

Workflows encoding distribution analysis of screening results

Question: What should the active/inactive cutoffs be?

Question: What can we learn about the target protein or cell line from this screen?

Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding with activity, literature search of active compounds, etc

Compounds submitted to PubChem

PROCESS

CHEMINFORMATICS

GRIDS


MLSCN Data - How services and workflows are used

PubChem interfaces to workflows via SOAP

Data is stored in Pubchem

MLSCN submits HTS data to Pubchem and/or sends directly to workflow for real-time feedback

Workflows perform different kinds of analysis on the MLSCN data, including SAR, clustering, literature searching, protein searching, toxicity testing, etc…

End-user applications and interfaces utilize the information streams from the workflows for human interaction with the data and analysis


Example HTS workflow: finding cell-protein relationships

A protein implicated in tumor growth with known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex)

The screening data from a cellular HTS assay is similarity searched for compounds with similar 2D structures to the ligand.

Docking results and activity patterns fed into R services for building of activity models and correlations

Least Squares Regression

Random Forests

Neural Nets

Similar structures are filtered for drugability, are converted to 3D, and are automatically passed to the OpenEye FRED docking program for docking into the target protein.

Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the JMOL applet.

Similar structures to the ligand can be browsed using client portlets.


Protein Function

ubiquitination site

Automated functional annotation

Prediction of global functional class

molecular function

biological process

cellular localization

Prediction of residue based annotation

post-translational modifications

binding sites

active sites

deleterious mutations (disease implications)

Inferences made from

amino acid sequence

protein 3D structure

evolutionary data

protein-interaction (network) data

Molecular function:

transcription regulator activity

(GO:0030528)

Predrag Radivojac

www.informatics.indiana.edu/predrag


Proteomics

Approaches based on MS/MS

Peptide identification

using machine learning

using novel scoring for database searching

post-translationally modified peptides

de novo

Protein identification and quantification

label-free methodology

based on machine learning

Glycomics and glycoproteomics

Glycan sequencing

Mapping of site-specific glycosylations

Predrag Radivojac and Haixu Tang

www.informatics.indiana.edu/predrag

www.informatics.indiana.edu/hatang


Comparative Genomics

From bacterial to eukaryotic genomes

Platcom: a comparative genomics platform

a web based system for comparing genomes on the web at

several systems are developed on top of Platcom:

A pathway analysis system, ComPath,

A comparative genome annotation system, CGAS

Non-coding sequences in eukaryotic genomes

segmental duplications in human genome

LTR retrotransposons

RNA regulatory elements

Sun Kim and Haixu Tang

bio.informatics.indiana.edu/sunkim

www.informatics.indiana.edu/hatang


Motif Discovery in Proteins

From unaligned and aligned sequences

iGibbs: An improved Gibbs Motif Sampler for Proteins by Sequence Clustering and Iterative Pattern Refinement.

ADBG : Motif Discovery Using Approximate De Bruijn Graphs

ARCS : An Aggregated Related Column Scoring Scheme for

For a list of current algorithms:

http://bio.informatics.indiana.edu/bioalgo/

Mehmet Dalkilic, Sun Kim and Haixu Tang

www.informatics.indiana.edu/dalkilic

bio.informatics.indiana.edu/sunkim

www.informatics.indiana.edu/hatang


Systems Biology & Disease

Integrated Discovery in Gene Networks

Using Drosophila m. data to discover disease-related protein interactions in humans

Mehmet Dalkilic, James Costello, John Colbourne, Brian Eads

In collaboration with Dept. Biology, CGB, & DGRC

http://www.informatics.indiana.edu/dalkilic


Curation and Alignment Tool for Protein Annotation: www.catpa.org

Provide a database system that allows annotation at the residue level of protein families (multiple and/or non-contiguous) with images and text. Deletions and insertions can be displayed too. Motifs can be automatically brought in from any motif discovery system provided a simple XML format is used.

Annotate any collection of residues

High level search

Mehmet Dalkilic, Andrew Albrecht