
Summary of SDM ETC Kickoff for the Data Integration Task



Presentation Transcript


  1. Summary of SDM ETC Kickoff for the Data Integration Task
     Terence Critchlow, Calton Pu, Ling Liu, David Buttler, Bertram Ludaescher, Amarnath Gupta, Mladen Vouk, Tom Potok

  2. People involved
     • People
       • Terence Critchlow (LLNL)
       • Calton Pu (GT)
       • Ling Liu (GT)
       • David Buttler (GT)
       • Bertram Ludaescher (UCSD)
       • Amarnath Gupta (UCSD)
       • TBD: Ph.D. student at Georgia Tech; developer at UCSD
       • Mladen Vouk (NCSU) / Tom Potok (ORNL)
     • Commitment per institution
       • LLNL: 0.25 (likely) – 1.0 FTE
       • Georgia Tech: 2 Ph.D. students, X months of Calton's time, Y months of Ling's time
       • UCSD: 1 FTE, 1 month of Bertram's time, 1 month of Gupta's time
       • Agent team: 2-4 months over the course of the year

  3. Application ties
     • Primary domain: bioinformatics
     • Secondary domains:
       • Material science
       • Air / water quality
     • Scientists (early adopters)
       • Matt Coleman (LLNL)
       • Allen Christian (LLNL)
       • Phil Bourne (PDB)
     • Contacted by Terence
     • Contacted by Bertram / Gupta

  4. Use Case 1: Finding out everything about a sequence
     • Bob starts with one or several DNA or protein sequences that he wants to analyze
       • OR: Bob finds protein or gene sequences of interest by querying databases / web sites for metabolic pathways / cell signaling pathways (e.g., KEGG)
       • OR: Bob looks at a database of microarray experiments and chooses those genes that exhibit specified patterns of co-occurrence (what subsets of genes "go hand in hand" across a large number of experiments)
     • The relevant sequences are submitted to one or more sequence databases for a BLAST search
     • The homologous sequences found in the searched database(s) are
       • directly returned to the user, sorted by score
       • OR: post-processed by the mediator (duplicate elimination, groupings, links to additional contextual data)
     • The resulting sequences can be queried for their associated information
     • Bob can use these sequences for new similarity searches
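The two ways of returning homologous sequences in Use Case 1 (sorted by score, or post-processed with duplicate elimination and grouping) can be sketched as follows. This is only an illustration: the `Hit` record, its fields, and both function names are assumptions made for this example, not part of the design described in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one homologous sequence returned by a source."""
    source: str      # label of the database that returned the hit (assumed)
    accession: str   # identifier of the matching sequence (assumed)
    score: float     # similarity score reported by the source


def merge_hits(hits):
    """Eliminate duplicates (keep the best score per accession), then sort by score."""
    best = {}
    for hit in hits:
        current = best.get(hit.accession)
        if current is None or hit.score > current.score:
            best[hit.accession] = hit
    # Highest score first; accession breaks ties so the order is deterministic.
    return sorted(best.values(), key=lambda h: (-h.score, h.accession))


def group_by_source(hits):
    """Group hits by the source that reported them, preserving input order."""
    grouped = {}
    for hit in hits:
        grouped.setdefault(hit.source, []).append(hit)
    return grouped
```

Calling `merge_hits` on the combined output of several sources corresponds to the mediator-style post-processing; skipping it and sorting the raw list corresponds to the "directly returned" path.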

  5. Use Case 1: Additional scenarios
     • Helpful features for users
       • Multiple sequences entered through a single file
       • Ability to tie in other programs to preprocess data before passing it to wrappers / mediator
     • Follow-up searches may be more than just BLAST searches
       • Select / project / join queries through the interface
       • Tie in other tools such as RasMol
       • Other types of search such as PHI-BLAST, PSI-BLAST, or other structural similarity searches

  6. Data Integration Architecture
     [Architecture diagram; only the component labels are recoverable]
     • External Program: if invoked, pre-processes query parameters and post-processes results
     • Query Dispatch and Collection (QDaC)
     • Integration component / KB-Mediator (KBM)
     • XML Wrappers and CM Wrappers
     • Sources: Medline (via VIPAR), PDB
     • Source / Agent MetaData Registry
     • XQuery (subsets, e.g. Sel/Proj) API; XQuery interface, select/project only
     • XWRAP Wrapper Generator

  7. Architecture comments
     • Communication protocol:
       • Use agent technology to communicate between components
       • Don't use full capabilities when on the same machine
       • Between QDaC and wrappers, QDaC and mediator, mediator and CMs, CMs and wrappers
       • NOT expected between wrappers and source
     • Embedded representation:
       • XML sources are queried using a subset of XQuery (fragments)
       • Primarily concerned with selection and projection, not join
       • Query results are returned in XML
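To make "selection and projection, with results returned in XML" concrete, here is a minimal sketch. It uses the limited XPath support in Python's standard `xml.etree.ElementTree` as a stand-in for the XQuery subset, which is a substitution made for this example. The element names (`hits`, `hit`, `accession`, `organism`, `score`) and the function `select_project` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical source-side XML; the element names are assumptions.
SOURCE_XML = """
<hits>
  <hit><accession>P1</accession><organism>human</organism><score>70</score></hit>
  <hit><accession>P2</accession><organism>mouse</organism><score>60</score></hit>
  <hit><accession>P3</accession><organism>human</organism><score>40</score></hit>
</hits>
"""


def select_project(xml_text, field, value, keep):
    """Select <hit> elements whose child `field` equals `value`,
    project each onto the child elements named in `keep`,
    and return the result as an XML string."""
    root = ET.fromstring(xml_text)
    result = ET.Element("result")
    # Selection: ElementTree supports the predicate form [tag='text'].
    for hit in root.findall(f"./hit[{field}='{value}']"):
        row = ET.SubElement(result, "hit")
        # Projection: copy only the requested child elements.
        for name in keep:
            child = hit.find(name)
            if child is not None:
                ET.SubElement(row, name).text = child.text
    return ET.tostring(result, encoding="unicode")
```

For instance, `select_project(SOURCE_XML, "organism", "human", ["accession", "score"])` keeps the two human hits and drops their `organism` element. No join across documents is attempted, matching the restriction stated on the slide.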

  8. Architecture comments
     • Meta-data repository (= metadata server)
       • Contains:
         • Location, schema
         • Query capabilities (BLAST, keyword, XPath) of sources
       • May be duplicated / shared between QDaC and KBM
       • Eventually may be treated as an agent
     • External programs
       • Will be included as preprocessing steps
       • May need wrappers to handle translations properly
       • Will be tied in to the interface where possible
       • Gives users access to tools they need / want / are familiar with
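A metadata repository holding location, schema, and query capabilities per source could look roughly like the sketch below. The class names, field names, and capability strings are all assumptions chosen for this example; the slides do not specify a data model.

```python
from dataclasses import dataclass, field


@dataclass
class SourceEntry:
    """Hypothetical metadata record for one wrapped source."""
    name: str
    location: str                                     # placeholder address of the wrapper
    schema: list                                      # element names the source exposes
    capabilities: set = field(default_factory=set)    # e.g. {"blast", "keyword", "xpath"}


class MetadataRegistry:
    """In-memory registry; a stand-in for the metadata server."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Re-registering a name replaces the previous entry.
        self._entries[entry.name] = entry

    def lookup(self, name):
        return self._entries[name]

    def sources_with(self, capability):
        """Return the names of all sources that support `capability`, sorted."""
        return sorted(
            e.name for e in self._entries.values() if capability in e.capabilities
        )
```

A component such as QDaC or KBM would call `sources_with("blast")` to decide where a BLAST query can be sent; sharing or duplicating the registry between the two is left open here, as it is on the slide.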

  9. Architecture comments
     • Expect most wrappers to be generated by XWrap in practice, but it shouldn't matter as long as they follow the specified protocol and representation
       • VIPAR used to wrap publication sources
       • Simple SQL wrapper for direct database access
     • Definitions:
       • CM (conceptual mapping): a wrapper that translates source-specific XML into

  10. Year 1 deliverables
     • Send XQuery command to BLAST sources, combine results, and return to user interface
     • Interact with at least 4 sources
       • Integration component will have at least 2 sources
       • QDaC will directly query NCBI and at least one other
     • Operate QDaC and mediator in a distributed environment
       • Interface / QDaC at LLNL and mediator at UCSD
     • Have agent stubs at UCSD and LLNL passing text strings within 3 months

  11. Detailed tasks
     • Interface (LLNL)
       • Extended to handle BLAST against new sources
         • Some of which are not integrated
     • QDaC (LLNL)
       • Identify available wrappers from meta-data
         • This includes the SDSC component
       • Query wrappers using XQuery
       • Collect and sort responses
       • Adopt agent protocol
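The QDaC steps listed above (query the available wrappers, then collect and sort the responses) can be sketched as a simple fan-out. Everything here is hypothetical: wrappers are modeled as plain Python callables returning `(sequence_id, score)` pairs, and a thread pool stands in for whatever agent protocol is eventually adopted.

```python
from concurrent.futures import ThreadPoolExecutor


def dispatch_and_collect(query, wrappers):
    """Send `query` to every wrapper concurrently and return all rows
    as (score, source_name, sequence_id), best score first."""
    with ThreadPoolExecutor(max_workers=max(1, len(wrappers))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in wrappers.items()}
        rows = []
        for name, future in futures.items():
            # result() blocks until that wrapper has answered.
            for seq_id, score in future.result():
                rows.append((score, name, seq_id))
    # Sort by descending score; source name and id break ties deterministically.
    return sorted(rows, key=lambda r: (-r[0], r[1], r[2]))


# Stand-in wrappers with canned answers (illustrative data only).
def fake_wrapper_a(query):
    return [("P1", 70.0), ("P2", 40.0)]


def fake_wrapper_b(query):
    return [("P9", 55.0)]
```

In a fuller version the `wrappers` mapping would come from the metadata lookup rather than being passed in by hand, and error handling for unreachable sources would be needed.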

  12. Detailed tasks
     • XWrap (GT)
       • Accept XPath / XQuery input
       • Handle complex BLAST interfaces
       • Adopt agent protocol
     • Mediator (UCSD)
       • Model of pathways, gene and protein expressions ==> ontology to be used for driving BLAST queries and interpreting their results
       • Accept XQuery queries
       • Identify available sources from meta-data
       • Modify CM wrappers to generate XQuery commands
     • Agent technology (ORNL, LLNL, UCSD)
       • Use VIPAR to wrap Medline database
       • Use protocols to communicate between LLNL and SDSC components

  13. Administrative
     • Reports
       • Quarterly reports
         • To be collected by Terence, (possibly) summarized, and forwarded on to Arie
         • Short, bulleted form (Word file or plain text preferred)
     • Center-wide communications
       • Telecon 1st Monday of the month, 11:00 – 12:00 PST
         • It is OK to miss this
       • Semi-annual meetings
         • Next at ORNL in mid-March
       • Center web site will point to individual task sites
       • Shared CVS repository at NC State
         • Primarily for major releases / sharing code between tasks

  14. Administrative
     • Advisory committee
       • Potential names from bioinformatics area
         • Carole Goble (Univ of Manchester), Tom Slezak (LLNL), ???
       • Unclear who pays travel for members
       • This is for us, so they will not be generating reports

  15. Task specific
     • Mail list
       • For our task ONLY
       • sdmctr-integrate@llnl.gov is being set up
       • Will be archived
     • Site contacts
       • Terence (LLNL)
       • Bertram (UCSD)
       • Calton (GT)
       • Tom (Agents)
     • Web site
       • Being set up at GT
     • Use main CVS repository for major releases
     • Code sharing option 1
       • Task-only CVS repository for day-to-day work
       • Unlikely LLNL could host this service
     • Code sharing option 2
       • Site-specific CVS repositories for day-to-day work
       • Alexandria repository for inter-task code sharing
         • https://www-casc.llnl.gov/alexandria/
         • Disadv: tar-balls
         • Adv: we don't all need an account on the repository machine
