Managing Biological Data and Data Sources on the Desktop PC

Greg Quinn, Ph.D. SDSC Notebook Project, Integrative Biosciences Group, San Diego Supercomputer Center. Microsoft Scientific Data Intensive Computing Workshop 2004, May 26th 2004.



  1. Managing Biological Data and Data Sources on the Desktop PC Greg Quinn, Ph.D. SDSC Notebook Project Integrative Biosciences Group San Diego Supercomputer Center Microsoft Scientific Data Intensive Computing Workshop 2004 May 26th 2004

  2. ::Notebook Project Overview • Data overload • Re-purposing web data • Interface issues • Overview of the goals of the notebook project

  3. ::Notebook Project The Encyclopedia of Life Project • Currently more than 800 genomes completely or partially sequenced • This number is increasing exponentially • A need to provide data analysis for biological researchers • Each genome yields roughly 15,000 to 25,000 putative protein sequences

  4. ::Notebook Project The Encyclopedia of Life Project • For each putative protein sequence derived from genomic data, EOL attempts to locate structural domains and correlate this data with other publicly available sequence annotation • A large amount of information for end-users to collate

  5. ::Notebook Project NCBI BLAST updated nightly Protein NR (1.8 million sequences) XML

  6. ::Notebook Project Information formatted to be human-readable • This is formatted text, not data • How can we manipulate it, e.g. 100 residues per line • How to store? • How to re-purpose this information? • How to annotate this information?
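
The manipulation question on this slide (for example, showing 100 residues per line) becomes simple once the sequence is held as data rather than as formatted text. A minimal Python sketch, assuming the sequence is a plain string of one-letter residue codes; the function name `wrap_sequence` and the sample text are illustrative and not part of the Notebook project:

```python
def wrap_sequence(sequence: str, width: int = 100) -> list[str]:
    """Strip layout whitespace and re-break a sequence into fixed-width lines."""
    if width < 1:
        raise ValueError("width must be a positive integer")
    # Remove spaces, tabs and line breaks left over from the human-readable layout.
    residues = "".join(sequence.split())
    return [residues[i:i + width] for i in range(0, len(residues), width)]


if __name__ == "__main__":
    # Hypothetical text copied from a page that grouped residues in blocks of 10.
    page_text = "MSILKIHARE IFDSRGNPTV\nEVDLFTSKGL"
    for line in wrap_sequence(page_text, width=15):
        print(line)
```

The same function covers any line width, which is the point of the slide: the layout is chosen at display time instead of being baked into the stored text.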

  7. ::Notebook Project Many excellent online data sites

  8. ::Notebook Project but… the large amounts of data that can be collected by the end user during a web session can be overwhelming, leading to… DATA OVERLOAD! • No simple way to use data from web sites • No mechanism to share data • No mechanism to locate data sources • Difficult to keep track of data acquired during a web session • Resource-intensive searches are often repeated in their entirety, tying up server resources • When data is retrieved, it can be difficult to manipulate and/or search • No simple way to annotate downloaded data

  9. ::Notebook Project At Issue • How to locally store web-based data • How to re-purpose information • How to better present data

  10. ::Notebook Project 1. How to locally store web-based data Currently, no meaningful way to store scientific web pages • How to search? • PC has no understanding of data

  11. ::Notebook Project 2. How to re-purpose information The Semantic Web "The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation." Tim Berners-Lee, James Hendler and Ora Lassila, Scientific American, May 17, 2001

  12. ::Notebook Project 2. How to re-purpose information HTML <html> <head> <title>Sample Data</title> </head> <body> <b>Protein function:</b> Enolase<br> <b>Source Organism:</b> Human<br> <b>Protein ID#:</b> P000925<br> <b>Data Source:</b> SwissProt<br> <b>Sequence Length:</b> 435 residues </body> </html> XSLT transformation / CSS XML styling RDF XML data islands <?xml version="1.0"?> <protein idVal="P000925" source="Human"> <function>Enolase</function> <sourceDB>SwissProt</sourceDB> <length>435</length> </protein> XML

  13. ::Notebook Project 2. How to re-purpose information HTML <html> <head> <title>Sample Data</title> </head> <body> <b>Protein function:</b> Enolase<br> <b>Source Organism:</b> Human<br> <b>Protein ID#:</b> P000925<br> <b>Data Source:</b> SwissProt<br> <b>Sequence Length:</b> 435 residues </body> </html> Parser <?xml version="1.0"?> <protein idVal="P000925" source="Human"> <function>Enolase</function> <sourceDB>SwissProt</sourceDB> <length>435</length> </protein> XML
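
The parser step on this slide can be sketched with the Python standard library. This is only an illustration of the idea, assuming the page always uses the `<b>Label:</b> value<br>` layout of the sample; the class and function names are hypothetical, and the slides do not say how the Notebook parser was actually written.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

SAMPLE_HTML = """<html> <head> <title>Sample Data</title> </head> <body>
<b>Protein function:</b> Enolase<br>
<b>Source Organism:</b> Human<br>
<b>Protein ID#:</b> P000925<br>
<b>Data Source:</b> SwissProt<br>
<b>Sequence Length:</b> 435 residues
</body> </html>"""


class LabelValueParser(HTMLParser):
    """Collect '<b>Label:</b> value' pairs from a page laid out like the sample."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_label = False
        self._label = None

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._in_label = True
            self._label = ""

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_label = False

    def handle_data(self, data):
        if self._in_label:
            self._label += data
        elif self._label is not None and data.strip():
            # First non-blank text after a bold label is taken as its value.
            self.fields[self._label.strip().rstrip(":")] = data.strip()
            self._label = None


def html_to_xml(html: str) -> str:
    """Convert the sample page layout into the <protein> XML shown on the slide."""
    parser = LabelValueParser()
    parser.feed(html)
    parser.close()
    fields = parser.fields
    protein = ET.Element(
        "protein", idVal=fields["Protein ID#"], source=fields["Source Organism"]
    )
    ET.SubElement(protein, "function").text = fields["Protein function"]
    ET.SubElement(protein, "sourceDB").text = fields["Data Source"]
    # Keep only the number so the length is data rather than display text.
    ET.SubElement(protein, "length").text = fields["Sequence Length"].split()[0]
    return ET.tostring(protein, encoding="unicode")


if __name__ == "__main__":
    print(html_to_xml(SAMPLE_HTML))
```

A scraper like this is tied to one page layout, which is why the slides treat scraping as a stopgap until sites publish structured data directly.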

  14. ::Notebook Project 3. How to better present data • Web display paradigm (thin client): Web server sends text/markup; leverages HTML/DHTML, scripting and any plugins • Smart/fat client model: SOAP server sends data; leverages full power of the operating system

  15. ::Notebook Project The SDSC Notebook A desktop application that helps scientific researchers and knowledge workers utilize network information resources and manage data Feature List • Leverages features of Windows and the .Net development paradigm • Powerful local db with search functionality • “Knowledge” of data types • Ability to annotate stored data • Peer-to-peer querying of stored data and annotations • Data export capability to popular formats • Unattended/automatic data updates via background use of web services & HTTP • User notification of new data • Plugin API for data visualization components, complete with basic data viewers for popular bio-data types, e.g. text, protein sequences, molecules etc. • Smart client framework for SOAP-based, data-intensive web services • Point-and-click interface to support the new breed of Tablet PCs and ink data types

  16. ::Notebook Project Notebook Overview Data source XML doc Local datastore XML doc personal database Data presentation and Smart client for network data services personal database personal database Toolbar to support web data integration

  17. ::Notebook Project Prototype design of the Notebook Application Data display area

  18. ::Notebook Project Prototype design of the Notebook Application Data browser

  19. ::Notebook Project Prototype design of the Notebook Application Smart client availability

  20. ::Notebook Project Prototype design of the Notebook Application P2P collaboration group

  21. ::Notebook Project Prototype design of the Notebook Application Fast search options

  22. ::Notebook Project Interface SOAP server Web server Smart client XML doc XML doc XML XML doc personal database personal database

  23. ::Notebook Project Interface Web server Unattended data updates from web HTML scraping <data> <proteinSequence> adddsfsttggeyyygggdd </proteinSequence> </data> XML doc personal database personal database custom scripted data input/output Scripting environment
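
As a rough illustration of the path from an XML document into the personal database, the sketch below stores each child element of a `<data>` document as a typed record and searches it by substring. SQLite, the table layout, and the function names are assumptions made for this example; the slides do not say which database engine or schema the Notebook uses.

```python
import sqlite3
import xml.etree.ElementTree as ET

# The XML document shown on the slide, written on one line.
XML_DOC = "<data><proteinSequence>adddsfsttggeyyygggdd</proteinSequence></data>"


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical personal database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "id INTEGER PRIMARY KEY, data_type TEXT NOT NULL, "
        "content TEXT NOT NULL, source_xml TEXT NOT NULL)"
    )
    return conn


def import_xml(conn: sqlite3.Connection, xml_text: str) -> int:
    """Store every child of the root element; the tag name acts as the data type."""
    root = ET.fromstring(xml_text)
    rows = [(child.tag, (child.text or "").strip(), xml_text) for child in root]
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT INTO records (data_type, content, source_xml) VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)


def search(conn: sqlite3.Connection, data_type: str, fragment: str):
    """Return (id, content) for records of one type that contain the fragment."""
    cursor = conn.execute(
        "SELECT id, content FROM records "
        "WHERE data_type = ? AND instr(content, ?) > 0 ORDER BY id",
        (data_type, fragment),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    store = open_store()
    import_xml(store, XML_DOC)
    print(search(store, "proteinSequence", "ttgg"))
```

Keeping the element name alongside the content is what gives the store the "knowledge" of data types mentioned in the feature list: a search can be restricted to protein sequences instead of matching arbitrary text.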

  24. ::Notebook Project Data import from Web browser using toolbar control Toolbar recognizes URL & page signature “Import data” button enabled – user presses button Toolbar activates a specific script from a collection of local PERL scripts PERL script parses HTML page and scrapes data from it, converting it to XML format and importing into the data store personal database
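
The toolbar flow above (recognize the URL and page signature, enable the button, run the matching script) amounts to a lookup table of scrapers. The slides name Perl for the scripts themselves; the sketch below uses Python purely to illustrate the dispatch logic. The registry, the `example.org` URL pattern, and the use of a page title as the "signature" are all assumptions for this example.

```python
import re
from typing import Callable, Optional
from xml.sax.saxutils import quoteattr

Scraper = Callable[[str], str]

# Each entry: (compiled URL pattern, text that must appear in the page, scraper).
_REGISTRY: list[tuple[re.Pattern, str, Scraper]] = []


def register(url_pattern: str, signature: str):
    """Decorator that adds a scraper to the hypothetical local script collection."""
    def decorator(func: Scraper) -> Scraper:
        _REGISTRY.append((re.compile(url_pattern), signature, func))
        return func
    return decorator


def find_scraper(url: str, page_html: str) -> Optional[Scraper]:
    """Return the scraper for this page, or None (button stays disabled)."""
    for pattern, signature, func in _REGISTRY:
        if pattern.search(url) and signature in page_html:
            return func
    return None


@register(r"^https?://example\.org/protein/", "<title>Sample Data</title>")
def scrape_sample_protein(page_html: str) -> str:
    """Pull the protein ID out of the sample page layout and emit XML."""
    match = re.search(r"<b>Protein ID#:</b>\s*([^<\s]+)", page_html)
    if match is None:
        raise ValueError("page does not contain a protein ID")
    return f"<protein idVal={quoteattr(match.group(1))}/>"


if __name__ == "__main__":
    page = (
        "<html><head><title>Sample Data</title></head>"
        "<body><b>Protein ID#:</b> P000925<br></body></html>"
    )
    scraper = find_scraper("http://example.org/protein/P000925", page)
    if scraper is not None:  # the point at which "Import data" would be enabled
        print(scraper(page))
```

Checking both the URL and a signature inside the page guards against running a scraper on a page whose layout has changed even though its address has not.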

  25. ::Notebook Project XAML-based smart client development framework Embedded XAML interpreter Data service broker Service manager Data services Service wrapper OntologyMapper XAML + C# GUIObserver Service wrapper ServerObserver Service wrapper ResourceManager Service wrapper Data viz. controls DNA sequence display Local DB Collaboration subsystem Protein sequence display 3D structure display Graphing component

  26. ::Notebook Project Notebook runtime services New system service A Windows system service is created to manage notebook application features and functions SDSC Notebook

  27. ::Notebook Project Unattended data updates and job retrieval Web server SOAP server Notebook runtime services Scheduler scripts Scripting environment personal database

  28. ::Notebook Project Flashing icon signifying availability of new data User notification of new data A flashing system tray notebook icon notifies end-user of new data availability

  29. ::Notebook Project Data annotation Example annotation: "Check whether this protein is also found in plants" personal database

  30. ::Notebook Project Data export Excel Word XSLT personal database Analytical programs
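
One simple way to reach Excel and analytical programs is to flatten stored XML records into CSV, which both can open. The sketch below assumes records shaped like the `<protein>` element from the earlier slides; the column list and function name are illustrative, and the slides themselves indicate XSLT rather than this approach as the export mechanism.

```python
import csv
import io
import xml.etree.ElementTree as ET

COLUMNS = ["idVal", "source", "function", "sourceDB", "length"]


def proteins_to_csv(xml_docs) -> str:
    """Flatten <protein> documents into CSV text with one row per record."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(COLUMNS)
    for doc in xml_docs:
        protein = ET.fromstring(doc)
        writer.writerow([
            # Attributes and child elements both become plain columns.
            protein.get("idVal", ""),
            protein.get("source", ""),
            protein.findtext("function", default=""),
            protein.findtext("sourceDB", default=""),
            protein.findtext("length", default=""),
        ])
    return buffer.getvalue()


if __name__ == "__main__":
    record = (
        '<protein idVal="P000925" source="Human">'
        "<function>Enolase</function>"
        "<sourceDB>SwissProt</sourceDB>"
        "<length>435</length>"
        "</protein>"
    )
    print(proteins_to_csv([record]))
```

Missing attributes or elements become empty cells rather than errors, so partially annotated records can still be exported.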

  31. ::Notebook Project The connected research environment Web Interface SOAP Services SOAP-based method calls to access and update search data Web access to data Report and paper preparation P2P Group collaboration Research annotation

  32. ::Notebook Project Important support subprojects within the notebook project Data viz. components Example smart clients SOAP server development A variety of basic data viewers are needed for Notebook developers to utilize Necessary because we cannot rely on third-party SOAP service support yet Essential to demonstrate a proof of concept DNA sequence display Protein sequence display 3D structure display Graphing component Scientific datasets PDB smart client Legacy data integration Flatfile, SQL, XPath Crucial to hook ordinary users into using the system, as semantic constructs & SOAP are not yet prevalent for data Scientific tools e.g. EMBOSS TeraGrid resources MSIE/Netscape toolbar e.g. iGAP EOL smart client

  33. ::Notebook Project Acknowledgements • Dan Fay & Microsoft Research • Blair Jennings, Software Lead • Mark Miller, project co-manager Support for pen-based input and ink data types http://www.notebookproject.org
