Managing Biological Data and Data Sources on the Desktop PC
Greg Quinn, Ph.D.
SDSC Notebook Project, Integrative Biosciences Group, San Diego Supercomputer Center
Microsoft Scientific Data Intensive Computing Workshop 2004, May 26th 2004

::Notebook Project Overview
• Data overload
• Re-purposing web data
• Interface issues
• Overview of the goals of the notebook project

::Notebook Project The Encyclopedia of Life Project
• Currently more than 800 genomes are completely or partially sequenced
• This number is increasing exponentially
• There is a need to provide data analysis for biological researchers
• Each genome yields roughly 15,000 to 25,000 putative protein sequences

::Notebook Project The Encyclopedia of Life Project
• For each putative protein sequence derived from genomic data, EOL attempts to locate structural domains and correlate this data with other publicly available sequence annotation
• A large amount of information for end-users to collate

::Notebook Project NCBI BLAST updated nightly Protein NR (1.8 million sequences) XML

::Notebook Project Information formatted to be human-readable
• This is formatted text, not data
• How can we manipulate it, e.g. display 100 residues per line?
• How to store it?
• How to re-purpose this information?
• How to annotate this information?
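The "100 residues per line" question becomes simple once a sequence is held as data rather than as formatted text. A minimal sketch in Python, assuming the sequence is already available as a plain string; the function name `wrap_sequence` and the sample residues are illustrative and not part of the Notebook project:

```python
def wrap_sequence(sequence: str, width: int = 100) -> str:
    """Return the sequence re-wrapped to `width` residues per line."""
    if width < 1:
        raise ValueError("width must be at least 1")
    # Drop spaces and line breaks left over from the original page layout.
    residues = "".join(sequence.split())
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up residues, laid out the way a web page might present them.
    sample = "ACDEFGHIKL MNPQRSTVWY\n" * 12
    print(wrap_sequence(sample, width=100))
```

The same string could equally be re-wrapped at 60 or 80 residues, which is not possible when only the rendered page is kept.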

::Notebook Project Many excellent online data sites

::Notebook Project but… the large amounts of data that can be collected by the end user during a web session can be overwhelming, leading to… DATA OVERLOAD!
• No simple way to use data from web sites
• No mechanism to share data
• No mechanism to locate data sources
• Difficult to keep track of data acquired during a web session
• Resource-intensive searches are often repeated in their entirety, tying up server resources
• When data is retrieved, it can be difficult to manipulate and/or search
• No simple way to annotate downloaded data

::Notebook Project At Issue
• How to locally store web-based data
• How to re-purpose information
• How to better present data

::Notebook Project 1. How to locally store web-based data
Currently, there is no meaningful way to store scientific web pages
• How to search?
• The PC has no understanding of the data

::Notebook Project 2. How to re-purpose information
The Semantic Web
"The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation."
Tim Berners-Lee, James Hendler and Ora Lassila, Scientific American, May 17, 2001

::Notebook Project 2. How to re-purpose information
HTML:
<html> <head> <title>Sample Data</title> </head> <body>
Protein function: Enolase
Source Organism: Human
Protein ID#: P000925
Data Source: SwissProt
Sequence Length: 435 residues
</body> </html>
Diagram labels: XSLT transformation / CSS XML styling, RDF, XML data islands
XML:
<?xml version="1.0"?>
<protein idVal="P000925" source="Human"> <function>Enolase</function> <sourceDB>SwissProt</sourceDB> <length>435</length> </protein>

::Notebook Project 2. How to re-purpose information
HTML:
<html> <head> <title>Sample Data</title> </head> <body>
Protein function: Enolase
Source Organism: Human
Protein ID#: P000925
Data Source: SwissProt
Sequence Length: 435 residues
</body> </html>
Diagram label: Parser
XML:
<?xml version="1.0"?>
<protein idVal="P000925" source="Human"> <function>Enolase</function> <sourceDB>SwissProt</sourceDB> <length>435</length> </protein>
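The parser step on this slide can be sketched with the Python standard library. This is only an illustration under stated assumptions: each "label: value" pair is assumed to sit on its own line inside the page body, and the class and function names (`BodyText`, `html_to_protein_xml`) are hypothetical rather than taken from the Notebook project.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SAMPLE_HTML = """<html> <head> <title>Sample Data</title> </head> <body>
Protein function: Enolase
Source Organism: Human
Protein ID#: P000925
Data Source: SwissProt
Sequence Length: 435 residues
</body> </html>"""


class BodyText(HTMLParser):
    """Collect only the text that appears inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.chunks.append(data)


def html_to_protein_xml(html: str) -> str:
    """Turn the human-readable page into the <protein> XML shown on the slide."""
    parser = BodyText()
    parser.feed(html)
    parser.close()
    # Split each "label: value" line into a dictionary entry.
    fields = {}
    for line in "".join(parser.chunks).splitlines():
        label, sep, value = line.partition(":")
        if sep:
            fields[label.strip()] = value.strip()
    protein = ET.Element(
        "protein",
        idVal=fields["Protein ID#"],
        source=fields["Source Organism"],
    )
    ET.SubElement(protein, "function").text = fields["Protein function"]
    ET.SubElement(protein, "sourceDB").text = fields["Data Source"]
    # Keep only the number from "435 residues".
    ET.SubElement(protein, "length").text = fields["Sequence Length"].split()[0]
    return ET.tostring(protein, encoding="unicode")


if __name__ == "__main__":
    print(html_to_protein_xml(SAMPLE_HTML))
```

A parser like this is tied to one page layout, which is why a separate script per data source is a natural design.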

::Notebook Project 3. How to better present data
• Web display paradigm (thin client): web server delivers text/markup; leverages HTML/DHTML, scripting and any plugins
• Smart/fat client model: SOAP server delivers data; leverages the full power of the operating system

::Notebook Project The SDSC Notebook
A desktop application that better enables the scientific researcher and knowledge worker to utilize network information resources and manage data
Feature List
• Leverages features of Windows and the .NET development paradigm
• Powerful local db with search functionality
• "Knowledge" of data types
• Ability to annotate stored data
• Peer-to-peer querying of stored data and annotations
• Data export capability to popular formats
• Unattended/automatic data updates via background use of web services & HTTP
• User notification of new data
• Plugin API for data visualization components, complete with basic data viewers for popular bio-data types, e.g. text, protein sequences, molecules etc.
• Smart client framework for SOAP-based, data-intensive web services
• Point-and-click interface to support the new breed of Tablet PCs and ink data types

::Notebook Project Notebook Overview [Diagram labels: data source, XML doc, local datastore, personal database, data presentation and smart client for network data services, toolbar to support web data integration]

::Notebook Project Prototype design of the Notebook Application Data display area

::Notebook Project Prototype design of the Notebook Application Data browser

::Notebook Project Prototype design of the Notebook Application Smart client availability

::Notebook Project Prototype design of the Notebook Application P2P collaboration group

::Notebook Project Prototype design of the Notebook Application Fast search options

::Notebook Project Interface [Diagram labels: SOAP server, web server, smart client, XML docs, personal database]

::Notebook Project Interface
• Web server: unattended data updates from the web, HTML scraping
• Scripting environment: custom scripted data input/output
• XML doc stored in the personal database, e.g. <data> <proteinSequence> adddsfsttggeyyygggdd </proteinSequence> </data>

::Notebook Project Data import from Web browser using toolbar control
• Toolbar recognizes URL & page signature
• "Import data" button is enabled; user presses the button
• Toolbar activates a specific script from a collection of local Perl scripts
• Perl script parses the HTML page and scrapes data from it, converting it to XML format and importing it into the data store (personal database)
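The dispatch step, where a URL and page signature select one scraper from a local collection, might look roughly like the following. The slides describe Perl scripts; this sketch uses Python purely for illustration, and the `Importer` record, the `example.org` URL pattern, the signature string and the `<pre>`-based scraper are all hypothetical.

```python
import re
from typing import Callable, List, NamedTuple, Optional
from xml.sax.saxutils import escape


class Importer(NamedTuple):
    url_pattern: str               # regular expression matched against the page URL
    page_signature: str            # text that must appear in the page for a match
    scrape: Callable[[str], str]   # turns page HTML into an XML string


def scrape_sequence_page(html: str) -> str:
    """Hypothetical scraper: take the sequence from the first <pre> block."""
    match = re.search(r"<pre>(.*?)</pre>", html, flags=re.DOTALL | re.IGNORECASE)
    sequence = "".join(match.group(1).split()) if match else ""
    return f"<data><proteinSequence>{escape(sequence)}</proteinSequence></data>"


# The local collection of scrapers, one entry per recognized kind of page.
IMPORTERS: List[Importer] = [
    Importer(r"^https?://example\.org/protein/", "Protein sequence", scrape_sequence_page),
]


def find_importer(url: str, html: str) -> Optional[Importer]:
    """Return the first importer whose URL pattern and signature both match."""
    for importer in IMPORTERS:
        if re.search(importer.url_pattern, url) and importer.page_signature in html:
            return importer
    return None


def import_page(url: str, html: str) -> Optional[str]:
    """Scrape the page if it is recognized; return None if no importer applies."""
    importer = find_importer(url, html)
    return importer.scrape(html) if importer else None
```

In a toolbar, `find_importer` returning a value would correspond to enabling the "Import data" button, and `import_page` to the button press.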

::Notebook Project XAML-based smart client development framework
• Framework components: embedded XAML interpreter, XAML + C#, data service broker, service manager, data services, service wrappers, OntologyMapper, GUIObserver, ServerObserver, ResourceManager, local DB, collaboration subsystem
• Data viz. controls: DNA sequence display, protein sequence display, 3D structure display, graphing component

::Notebook Project Notebook runtime services
New system service: a Windows system service is created to manage SDSC Notebook application features and functions

::Notebook Project Unattended data updates and job retrieval [Diagram labels: web server, SOAP server, Notebook runtime services, scheduler scripts, scripting environment, personal database]
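One way to picture the scheduler behind unattended updates is a list of jobs, each with a polling interval and a fetch function. This is a sketch under assumptions, not the Notebook runtime service itself: `UpdateJob`, `run_due_jobs` and the idea of a per-job interval are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class UpdateJob:
    name: str
    interval_seconds: float
    fetch: Callable[[], str]          # hypothetical: retrieves data from a web or SOAP server
    last_run: float = float("-inf")   # never run yet


def run_due_jobs(jobs: List[UpdateJob], now: float) -> Dict[str, str]:
    """Run every job whose interval has elapsed; return fetched data keyed by job name."""
    results = {}
    for job in jobs:
        if now - job.last_run >= job.interval_seconds:
            results[job.name] = job.fetch()
            job.last_run = now
    return results
```

A background service would call `run_due_jobs(jobs, time.time())` periodically and write any returned data into the personal database; a non-empty result is also the natural trigger for notifying the user.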

::Notebook Project User notification of new data
A flashing Notebook icon in the system tray notifies the end user that new data is available

::Notebook Project Data annotation
Example annotation: "Check whether this protein is also found in plants" [Diagram label: personal database]
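Annotating stored data and searching those annotations can be sketched with a small SQLite store. The schema and function names below are assumptions made for illustration; the slides do not say which database engine or layout the Notebook uses.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a personal database with records and their annotations."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS record (
            id INTEGER PRIMARY KEY,
            protein_id TEXT UNIQUE NOT NULL,
            xml TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS annotation (
            id INTEGER PRIMARY KEY,
            record_id INTEGER NOT NULL REFERENCES record(id),
            note TEXT NOT NULL
        );
    """)
    return conn


def add_record(conn: sqlite3.Connection, protein_id: str, xml: str) -> int:
    """Store one XML document and return its row id."""
    cur = conn.execute(
        "INSERT INTO record (protein_id, xml) VALUES (?, ?)", (protein_id, xml)
    )
    conn.commit()
    return cur.lastrowid


def annotate(conn: sqlite3.Connection, protein_id: str, note: str) -> None:
    """Attach a free-text note to an existing record."""
    row = conn.execute(
        "SELECT id FROM record WHERE protein_id = ?", (protein_id,)
    ).fetchone()
    if row is None:
        raise KeyError(protein_id)
    conn.execute(
        "INSERT INTO annotation (record_id, note) VALUES (?, ?)", (row[0], note)
    )
    conn.commit()


def search_annotations(conn: sqlite3.Connection, text: str):
    """Return (protein_id, note) pairs whose note contains `text`."""
    rows = conn.execute(
        "SELECT r.protein_id, a.note FROM annotation a "
        "JOIN record r ON r.id = a.record_id "
        "WHERE a.note LIKE ? ORDER BY a.id",
        (f"%{text}%",),
    )
    return rows.fetchall()
```

Keeping annotations in their own table leaves the imported XML untouched, so a later automatic update of the record does not overwrite the researcher's notes.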

::Notebook Project Data export [Diagram labels: personal database, XSLT, Excel, Word, analytical programs]
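The slide names XSLT as the export route. As a simpler stand-in, the sketch below flattens stored `<protein>` documents into CSV, which Excel and most analytical programs can open; the function name `proteins_to_csv` and the column names are assumptions, and the element names follow the sample XML from the earlier slides.

```python
import csv
import io
import xml.etree.ElementTree as ET
from typing import Iterable


def proteins_to_csv(xml_docs: Iterable[str]) -> str:
    """Flatten <protein> XML documents into CSV text, one row per protein."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "source", "function", "sourceDB", "length"])
    for doc in xml_docs:
        protein = ET.fromstring(doc)
        writer.writerow([
            protein.get("idVal"),
            protein.get("source"),
            protein.findtext("function"),
            protein.findtext("sourceDB"),
            protein.findtext("length"),
        ])
    return out.getvalue()


if __name__ == "__main__":
    doc = (
        '<protein idVal="P000925" source="Human">'
        "<function>Enolase</function><sourceDB>SwissProt</sourceDB>"
        "<length>435</length></protein>"
    )
    print(proteins_to_csv([doc]))
```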

::Notebook Project The connected research environment
• Web interface: web access to data
• SOAP services: SOAP-based method calls to access and update search data
• Report and paper preparation
• P2P group collaboration
• Research annotation
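A SOAP-based method call is, on the wire, an XML envelope wrapping the method name and its parameters. The sketch below builds a SOAP 1.1 envelope with the Python standard library; the method name, the `urn:example:notebook` namespace and the parameter are hypothetical, and qualifying the parameter elements with the method namespace is an assumption that a real service's WSDL may not share.

```python
import xml.etree.ElementTree as ET
from typing import Mapping

# Namespace of the SOAP 1.1 envelope.
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"


def build_soap_request(method: str, namespace: str, params: Mapping[str, object]) -> str:
    """Build a SOAP 1.1 envelope for a single method call."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{namespace}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{namespace}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical call; no real Notebook or EOL service is implied.
    print(build_soap_request("getProtein", "urn:example:notebook", {"proteinId": "P000925"}))
```

The resulting string would be sent as the body of an HTTP POST to the service endpoint, and the response arrives as a similar envelope carrying data rather than page markup.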

::Notebook Project Important support subprojects within the notebook project
• Data viz. components: a variety of basic data viewers are needed for Notebook developers to utilize (DNA sequence display, protein sequence display, 3D structure display, graphing component)
• Example smart clients: essential to demonstrate a proof of concept (PDB smart client, EOL smart client)
• SOAP server development: necessary because we cannot rely on 3rd party SOAP service support yet (scientific datasets; legacy data integration: flatfile, SQL, XPath; scientific tools, e.g. EMBOSS; TeraGrid resources, e.g. iGAP)
• MSIE/Netscape toolbar: crucial to hook ordinary users into using the system, since semantic constructs and SOAP are not yet prevalent for data

::Notebook Project Acknowledgements
• Dan Fay & Microsoft Research
• Blair Jennings, Software Lead
• Mark Miller, project co-manager
Support for pen-based input and ink data types
http://www.notebookproject.org
