
Speeding Science Solutions for Data Curation from Microsoft (Research)

Speeding Science Solutions for Data Curation from Microsoft (Research). Lee Dirks, Director, Education & Scholarly Communication, External Research Division, Microsoft Corporation. Microsoft External Research.


Presentation Transcript


  1. Speeding Science: Solutions for Data Curation from Microsoft (Research) Lee Dirks, Director, Education & Scholarly Communication, External Research Division, Microsoft Corporation

  2. Microsoft External Research • Division within Microsoft Research focused on partnerships between academia, industry and government to advance computer science, education, and research in fields that rely heavily upon advanced computing • Supporting groundbreaking research to help advance human potential and the wellbeing of our planet • Developing advanced technologies and services to support every stage of the research process • Microsoft External Research is committed to interoperability and to providing open access, open tools, and open technology

  3. Mission • Optimize and extend Microsoft software to meet the specific needs of the academic community • Our approach: • Conduct applied projects to enhance academic productivity by evolving Microsoft’s scholarly communication offerings • Microsoft External Research is uniquely positioned to drive this initiative across Microsoft

  4. The Scholarly Communication Lifecycle (diagram labels): Excel 2010; Windows Server HPC; “Astoria” / “Pop Fly”; Collaboration; SharePoint; LiveMeeting; Office Live • Office 2010: • Word • PowerPoint • Excel • OneNote • Tablet PC/UMPC; Office OpenXML; XPS Format; SQL Server & Entity Framework; Rights Management; Data Protection Manager; Discoverability; FAST; MSR Academic Search; “Bookweb”; SharePoint 2010; Word 2010 + PowerPoint 2010; WPF & Silverlight; “Sea Dragon” / “PhotoSynth” / “Deep Zoom”

  5. Goal: Transform Scholarly Communication • Interoperability is essential • Actively lobby and drive for consensus around technical standards and standardized protocols proactively adopted by the community; enable broad community engagement • Customers have told Microsoft that interoperability is OUR responsibility • Leverage existing community protocols, practices, guidelines, etc. • Example – metadata conventions / taxonomies / ontologies: a traditional strength for libraries – and a critical component in enabling Web 2.0 • Optimize for data-driven research • To both data (scientific) and information (scholarly publications) • Reproducible research + computational science • Properly document / annotate scholarly output • Data preservation (and provenance) should be a baseline • Documentation of the data’s provenance • Preservation needs to be like “accessibility” features – i.e., assumed as required • Semantic knowledge discovery & social networking • Harnessing collective intelligence must be a consideration, since accessing research is a core step in the life-cycle. Enable knowledge discovery • Optimize for Web 2.0 scenarios and allow end-users/experts to find things more easily

  6. Open Access Open Source Open Data Open Science “In order to help catalyze and facilitate the growth of advanced CI, a critical component is the adoption of open access policy for data, publications and software.” NSF Advisory Committee on Cyberinfrastructure (ACCI) • Microsoft Interoperability Principles • Open Connections to Microsoft Products • Support for Standards • Data Portability • Open Engagement http://www.microsoft.com/interop/

  7. Membership / Participation DataCite is an international consortium to establish easier access to scientific research data on the Internet, to increase acceptance of research data as legitimate, citable contributions to the scientific record, and to support data archiving that will permit results to be verified and re-purposed for future study. The Open Planets Foundation has been established to provide practical solutions and expertise in digital preservation, building on the €15 million investment made by the European Union and Planets consortium. OPF members benefit from the Planets results, new developments and the growing OPF community that includes experts at some of the most prestigious research, technology and memory institutions in Europe. The Confederation of Open Access Repositories (COAR) is a not-for-profit association of repository initiatives launched in October 2009. It aims to enhance the visibility and application of research outputs through global networks of Open Access digital repositories. The Coalition for Networked Information (CNI) is an organization dedicated to supporting the transformative promise of networked information technology for the advancement of scholarly communication and the enrichment of intellectual productivity. Membership includes some 200 institutions representing higher education, publishing, network and telecommunications, information technology, and libraries and library organizations. ICSTI, the International Council for Scientific and Technical Information, offers a unique forum for interaction between organizations that create, disseminate and use scientific and technical information. ICSTI's mission cuts across scientific and technical disciplines, as well as international borders, to give member organizations the benefit of a truly global community. CrossRef is a not-for-profit membership association whose mission is to enable easy identification and use of trustworthy electronic content by promoting the cooperative development and application of a sustainable infrastructure. CrossRef's general purpose is to promote the development and cooperative use of new and innovative technologies to speed and facilitate scholarly research.

  8. Who we work with

  9. GenePattern Reproducible Research Add-in Services: Connects to GenePattern database Relationships: Inline graphics are synchronized to dataset Data: Control and execute query pipelines into GenePattern Data: Resulting data (and provenance) stored within Word document Source code and binary: http://GenepatternWordAddin.codeplex.com

  10. Creative Commons Add-in for Office 2007 Intent: Insert Creative Commons licenses from within Office 2007 Services: Integrates with Creative Commons Web API to create new licenses Relationships: license information stored as RDF XML within the document OOXML Source code and binary: http://ccaddin2007.codeplex.com

  11. Ontology Add-in for Word 2007 Services: Ontology download web service • John Wilbanks • Phil Bourne • Lynn Fink Intent: Term recognition & disambiguation Relationships: Ontology browser Source code and binary: http://research.microsoft.com/ontology/

  12. Article Authoring Add-in for Word 2007 Services: repository deposit via SWORD Structure: Read, convert, and author NLM XML documents Relationships: ORE Resource Map creation Relationships: Citation lookup and reference management Structure: Client-side XML validation Binary (version 2.0): http://research.microsoft.com/authoring/ This work is licensed under a Creative Commons Attribution 3.0 United States License.

  13. Chem4Word - Chemistry Drawing in Word Author/edit 1D and 2D chemistry. Change chemical layout styles. • Peter Murray-Rust • Joe Townsend • Jim Downing Intent: Recognizes chemical dictionary and ontology terms Relationships: Navigate and link referenced chemistry Data: Semantics stored in Chemistry Markup Language <?xml version="1.0" ?> <cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema"> <molecule id="m1"> <atomArray> <atom id="a1" elementType="C" x2="-2.9149999618530273" y2="0.7699999809265137" /> <atom id="a2" elementType="C" x2="-1.5813208400249916" y2="1.5399999809265137" /> <atom id="a3" elementType="O" x2="-0.24764171819695613" y2="0.7699999809265134" /> <atom id="a4" elementType="O" x2="-1.5813208400249912" y2="3.0799999809265137" /> <atom id="a5" elementType="H" x2="-4.248679083681063" y2="1.5399999809265137" /> <atom id="a6" elementType="H" x2="-2.914999961853028" y2="-0.7700000190734864" /> <atom id="a7" elementType="H" x2="-4.248679083681063" y2="-1.907348645691087E-8" /> <atom id="a8" elementType="H" x2="1.0860374036310796" y2="1.5399999809265132" /> </atomArray> <bondArray> <bond atomRefs2="a1 a2" order="1" /> <bond atomRefs2="a2 a3" order="1" /> <bond atomRefs2="a2 a4" order="2" /> <bond atomRefs2="a1 a5" order="1" /> <bond atomRefs2="a1 a6" order="1" /> <bond atomRefs2="a1 a7" order="1" /> <bond atomRefs2="a3 a8" order="1" /> </bondArray> </molecule> </cml> Intelligence: Verifies validity of authored chemistry Available soon: http://research.microsoft.com/chem4word/
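
As an illustration of what "semantics stored in Chemistry Markup Language" makes possible, the sketch below reads a copy of the slide's CML fragment with Python's standard library and derives a molecular formula and bond list from it. This is not Chem4Word code: the function names are hypothetical, the 2D coordinates are omitted for brevity, and only the element and attribute names shown on the slide are assumed.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace taken from the xmlns attribute in the slide's CML fragment.
NS = {"cml": "http://www.xml-cml.org/schema"}

# Copy of the slide's molecule with the x2/y2 layout coordinates omitted.
CML_DOC = """<?xml version="1.0" ?>
<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" />
      <atom id="a2" elementType="C" />
      <atom id="a3" elementType="O" />
      <atom id="a4" elementType="O" />
      <atom id="a5" elementType="H" />
      <atom id="a6" elementType="H" />
      <atom id="a7" elementType="H" />
      <atom id="a8" elementType="H" />
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1" />
      <bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a2 a4" order="2" />
      <bond atomRefs2="a1 a5" order="1" />
      <bond atomRefs2="a1 a6" order="1" />
      <bond atomRefs2="a1 a7" order="1" />
      <bond atomRefs2="a3 a8" order="1" />
    </bondArray>
  </molecule>
</cml>"""


def element_counts(root):
    """Count atoms per element symbol (hypothetical helper)."""
    return Counter(atom.get("elementType") for atom in root.findall(".//cml:atom", NS))


def bond_list(root):
    """Return (atom_id, atom_id, order) tuples for every bond (hypothetical helper)."""
    bonds = []
    for bond in root.findall(".//cml:bond", NS):
        first, second = bond.get("atomRefs2").split()
        bonds.append((first, second, int(bond.get("order"))))
    return bonds


def hill_formula(counts):
    """Format element counts in Hill order: C, then H, then the rest alphabetically."""
    if "C" in counts:
        rest = sorted(e for e in counts if e not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        order = sorted(counts)
    return "".join(f"{e}{counts[e] if counts[e] > 1 else ''}" for e in order)


root = ET.fromstring(CML_DOC)
print(hill_formula(element_counts(root)))
print(bond_list(root))
```

Because the chemistry is stored as structured markup and not as a picture, the same document can be checked, searched, or re-rendered by any tool that understands the schema.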

  14. Project Trident: Scientific Workflow Workbench Author, Execute and Monitor Workflows View data products, performance metrics, and provenance data Compose and modify workflows via drag & drop canvas Organize collection of individual workflow activities Available now: http://research.microsoft.com/collaboration/tools/trident.aspx

  15. Other relevant projects

  16. The Windows Azure platform offers a flexible, familiar environment for developers to create cloud applications and services. With Windows Azure, you can shorten your time to market and adapt as demand for your service grows. Windows Azure offers a platform that is easily implemented alongside your current environment. • Offerings: • Windows Azure: operating system as an online service • Microsoft SQL Azure: fully relational cloud database solution • Windows Azure platform AppFabric: connects cloud services and on-premises applications • Microsoft Codename “Dallas”: information marketplace for data and web services

  17. Azure – Project “Dallas” • Microsoft "Dallas" is a service allowing developers and information workers to easily discover, purchase, and manage premium data subscriptions in the Windows Azure platform. • Dallas is an information marketplace that brings data, imagery, and real-time web services from leading commercial data providers and authoritative public data sources together into a single location, under a unified provisioning and billing framework. • Dallas APIs allow developers and information workers to consume this premium content with virtually any platform, application or business workflow. • More: http://www.microsoft.com/windowsazure/dallas/

  18. Excel Services & Excel Web Access • Excel Calculation Services (ECS) is the "engine" of Excel Services that loads the workbook, calculates in full fidelity with Microsoft Office Excel 2007, refreshes external data, and maintains sessions. • Excel Web Access (EWA) is a Web Part that displays and enables interaction with the Microsoft Office Excel workbook in a browser by using Dynamic HTML (DHTML) and JavaScript, without the need to download ActiveX controls to your client computer, and can be connected to other Web Parts on dashboards and other Web Part Pages. • Excel Web Services (EWS) is a Web service hosted in Microsoft Office SharePoint Services that provides several methods that a developer can use as an application programming interface (API) to build custom applications based on the Excel workbook. • More: http://msdn.microsoft.com/en-us/library/ms546696.aspx

  19. Microsoft’s “OData” Initiative • What is it? • The Open Data Protocol (OData) is a Web protocol for querying and updating data that provides a way to unlock your data and free it from silos that exist in applications today. OData does this by applying and building upon Web technologies such as HTTP, Atom Publishing Protocol (AtomPub) and JSON to provide access to information from a variety of applications, services, and stores. The protocol emerged from experiences implementing AtomPub clients and servers in a variety of products over the past several years. • OData is being used to expose and access information from a variety of sources including, but not limited to, relational databases, file systems, content management systems and traditional Web sites. • OData is consistent with the way the Web works - it makes a deep commitment to URIs for resource identification and commits to an HTTP-based, uniform interface for interacting with those resources (just like the Web). This commitment to core Web principles allows OData to enable a new level of data integration and interoperability across a broad range of clients, servers, services, and tools. • OData is released under the Open Specification Promise to allow anyone to freely interoperate with OData implementations. • Find out more • http://odata.org & http://msdn.com/data • Contact Pablo Castro (pablo.castro@microsoft.com) / Blog: http://blogs.msdn.com/pablo
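
To make the "URIs for resource identification" point concrete, the sketch below assembles an OData query URI from the protocol's standard system query options ($filter, $orderby, $top, $format). The service root, entity set name, and property names are placeholders invented for illustration and do not refer to a real service; the helper function is likewise hypothetical.

```python
from urllib.parse import quote


def odata_url(service_root, entity_set, filter_expr=None, orderby=None, top=None, fmt=None):
    """Build an OData query URI (hypothetical helper, not part of any Microsoft library)."""
    options = []
    if filter_expr is not None:
        options.append(("$filter", filter_expr))
    if orderby is not None:
        options.append(("$orderby", orderby))
    if top is not None:
        options.append(("$top", top))
    if fmt is not None:
        options.append(("$format", fmt))
    # Percent-encode option values; the "$" in option names is left literal.
    query = "&".join(f"{name}={quote(str(value), safe='')}" for name, value in options)
    url = f"{service_root.rstrip('/')}/{entity_set}"
    return f"{url}?{query}" if query else url


# Placeholder service and entity set: ask for ten datasets published after 2005, as JSON.
print(odata_url(
    "https://example.org/research.svc",
    "Datasets",
    filter_expr="Year gt 2005",
    top=10,
    fmt="json",
))
```

The resulting URI can be issued as a plain HTTP GET from any client, which is the sense in which the protocol is independent of the platform consuming the data.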

  20. Microsoft’s Open Government Data Initiative • The Open Government Data Initiative (OGDI) is a cloud-based collection of software assets that enables publicly available government data to be easily accessible. Using open standards and application programming interfaces (API), developers and government agencies can retrieve the data programmatically for use in new and innovative online applications, or mash-ups that can help: • Improve citizen services • Enhance collaboration between government agencies and private organizations • Increase government transparency • OGDI promotes the use of this data by capturing and publishing re-usable software assets, patterns, and practices. The data repository already holds over 60 different government datasets that are readily available for use in new applications, and is continuously updated with additional government datasets. • More: http://www.microsoft.com/industry/government/opengovdata/

  21. Data Curation Add-in for Microsoft Excel PROPOSED • In partnership with the California Digital Library’s Curation Center • In collaboration with Tricia Cruse & John Kunze • Part of DataONE (an NSF DataNet project)

  22. Data Curation Add-in for Microsoft Excel PROPOSED Proposed functionality under consideration: • Support for versioning, so that revision history and the original raw data can be easily protected and recovered, • Standardized date/time stamps so that researchers can easily determine when the data were created and last updated, • A “workbook builder” allowing researchers to select from globally shared standardized layouts for capturing data, • Ability to export metadata in a standard format (e.g., a DataCite citation or an EML document that describes the dataset(s) in a workbook) so that researchers can readily share their data, • Ability to select from a globally shared vocabulary of terms for data descriptions (e.g., column names), and as needed to add new terms to the globally shared vocabulary, to enable wide collaboration between researchers, • Ability to import term descriptions from the shared vocabulary and annotate them locally to refine their definitions as used in the dataset, • “Speed bumps” to discourage use of macros and customizations that would impede interoperation of data imported from Excel into other applications, and • Ability to deposit data and metadata directly into a data archive to enable compliance with funding agency requirements to preserve and publish research data.
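
Two of the proposed features, standardized date/time stamps and metadata export as a citation, can be sketched in a few lines. The example below is an assumption-laden illustration and not the add-in's design: the record fields, the function names, and the placeholder identifier are hypothetical, the timestamp format is ISO 8601 in UTC, and the citation loosely follows a "Creator (PublicationYear): Title. Publisher. Identifier" pattern.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical minimal metadata for one workbook's dataset."""
    creators: List[str]
    title: str
    publisher: str
    publication_year: int
    identifier: str      # placeholder identifier, not a registered DOI
    created: datetime    # timezone-aware
    updated: datetime    # timezone-aware


def iso_stamp(moment):
    """Render a timezone-aware datetime as an ISO 8601 UTC timestamp."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def citation(record):
    """Format the record as 'Creator (Year): Title. Publisher. Identifier' (assumed pattern)."""
    authors = "; ".join(record.creators)
    return (f"{authors} ({record.publication_year}): {record.title}. "
            f"{record.publisher}. {record.identifier}")


record = DatasetRecord(
    creators=["Example, A.", "Sample, B."],
    title="Stream temperature observations",
    publisher="Example Data Archive",
    publication_year=2010,
    identifier="doi:10.0000/example",
    created=datetime(2010, 5, 1, 12, 30, tzinfo=timezone.utc),
    updated=datetime(2010, 6, 15, 8, 0, tzinfo=timezone.utc),
)
print(citation(record))
print("created:", iso_stamp(record.created), "updated:", iso_stamp(record.updated))
```

Storing the timestamps in one unambiguous format and generating the citation from the same record is what would let the deposit step hand consistent metadata to an archive.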

  23. Questions? Lee Dirks Director—Education & Scholarly Communication Microsoft External Research ldirks@microsoft.com http://research.microsoft.com/people/ldirks URL – http://www.microsoft.com/scholarlycomm/ Facebook: Scholarly Communication at Microsoft
