Building Infrastructure for Data Management

25 April 2014

Larry Lannom
Corporation for National Research Initiatives
http://www.cnri.reston.va.us/

Corporation for National Research Initiatives

Three Part Talk
  • Organizing for Infrastructure: RDA
  • Building Infrastructure: Data Type Registries
  • Using Infrastructure: Deep Carbon Observatory, Handles/DOIs

The Information Age – Extraordinary Potential for Driving Science and Bettering Society

More Efficient Physical Infrastructure

Contribution to a safer and more secure world

Transformative strategies for disease treatment and well-being

More goods and services

More Research Insights

Data Sharing is a Global Issue

Libraries, Archives, Repositories, Museums

Science, Humanities, Arts Communities

Cyberinfrastructure professionals, data analysts, data center staff, …

Data Scientists

Key Driver 2: Community Effort Accelerating Impact

Development of public access shared data collection enabling new results for Alzheimer’s

Creation / adoption of data sharing policies have accelerated research innovation

Development and adoption of shared parallel communication protocols through the MPI Forum drove a generation of advances

Now 25 years old, the Internet Engineering Task Force’s mission “to make the Internet work better” has resulted in key specifications of Internet common community standards that support innovation

MPI Forum photo by ErezHeba, PDB molecule of the month at http://www.rcsb.org/pdb/home/home.do

“Just do it”: Focused efforts help communities drive tangible progress

Enabling Technologies

[Diagram. Labels: ID; Scientists, Data Curators, End Users, Applications; Datasets.]

Enabling Technologies

[Diagram. Labels: ID; Scientists, Data Curators, End Users, Applications; Datasets Accessed via Repositories.]

Enabling Technologies

[Diagram. Labels: ID; Discovery; Scientists, Data Curators, End Users, Applications; Datasets Accessed via Repositories.]

Discovery & Evaluation
  • Search
    • Metadata registries
      • Subject
      • Parties
      • Dates
      • Etc
    • Crawlers – more ad hoc
  • Citation
    • Formats
  • Permissions
    • Can I see it?
    • Can I use it?
  • Trust

Enabling Technologies

[Diagram. Labels: ID; Discovery; Access; Scientists, Data Curators, End Users, Applications; Datasets Accessed via Repositories.]

Access
  • ID / reference resolution
  • Access Protocols
    • How to get it
    • Protocol registries
    • Bootstrapping into new protocols
  • Authentication & Authorization
    • Proof of identity (tradeoff: usability vs security)
    • Permissions: with the object or in some external system?

Enabling Technologies

[Diagram. Labels: ID; Discovery; Access; Interpretation; Scientists, Data Curators, End Users, Applications; Datasets Accessed via Repositories.]

Interpretation
  • Registries
    • Schemas
    • Vocabularies
    • Formats
    • Available services
    • Useful client-side tools
  • Trust
    • Who did this?
    • Who owns this?
  • Provenance
    • Data Source
    • Processing steps
    • Computing environment
      • what is needed to trust the numbers?
      • Domain specific?

Enabling Technologies

[Diagram. Labels: ID; Discovery; Access; Interpretation; Reuse; Scientists, Data Curators, End Users, Applications; Datasets Accessed via Repositories.]

Reuse
  • Everything from Interpretation slide + Permissions
    • Example: I need to understand a data set for peer review but that doesn’t give me permission to use the data
  • Validation
  • Education & Training
    • Integrate ‘live’ data into education and training
  • Repurpose data

The Research Data Alliance (RDA)
  • Global community-driven organization launched in March 2013 to accelerate data-driven innovation
  • RDA focus is on building the social, organizational and technical infrastructure to
    • reduce barriers to data sharing and exchange
    • accelerate the development of coordinated global data infrastructure

RDA Vision and Mission
  • Research Data Alliance Mission: RDA builds the social and technical bridges that enable data sharing.
  • Research Data Alliance Vision: Researchers and innovators openly share data across technologies, disciplines, and countries to address the grand challenges of society.

Goal of RDA Infrastructure: Support Data Sharing and Interoperability Across Cultures, Scales, Technologies
  • Common data types for data interoperability
  • Persistent identifiers
  • Domain-focused portals
  • Harmonized standards
  • Data access and preservation policy and practice
  • Tools for data discoverability, …

CREATE  ADOPT  USE

RDA Members come together as

  • Working Groups – 12-18 month efforts to build, adopt, and use specific pieces of infrastructure
  • Interest Groups – longer-lived discussion forums that spawn Working Groups as specific pieces of needed infrastructure are identified.
  • Working Group efforts focus on the development and use of data sharing infrastructure:
    • Code, policy, infrastructure, standards, or best practices that are adopted and used by communities to enable data sharing
    • “Harvestable” efforts for which 12-18 months of work can eliminate a roadblock
    • Efforts that have substantive applicability to groups within the data community, but may not apply to everyone
    • Efforts for which working scientists and researchers can start today
RDA Plenaries: Venue for community building and WG / IG progress

  • RDA Plenary 1 / Launch: March 2013 in Gothenburg, Sweden
    • 240 participants
    • 3 WG, 9 IG
  • RDA Plenary 2: September 2013 in Washington, DC
    • 380 participants
    • 6 WG, 17 IG, 5 BOF
  • RDA Plenary 3: March 2014 in Dublin, Ireland
    • 497 participants
    • 12 WG, 22 IG, 14 BOF
    • 6 co-located events
  • RDA Plenary 4: September 2014 in Amsterdam

Fran Berman

RDA Plenaries Emerging as a Data Community “Town Square”

Emerging Plenary Format:

  • All-hands sessions: Place for community networking and exchange of information (funding agencies, data organizations, key stakeholders)
  • Working sessions: Face-to-face opportunities for global Interest Groups, Working Groups, and BOFs to meet and advance their agendas
  • Neutral meeting place: Place for multiple groups to meet and form a common agenda and action plan (e.g. Plenary 2 Data Citation Harmonization Summit)
Precipitous Growth

  • RDA Launch / First Plenary, March 2013
    • 240 participants
    • First Working Groups and Interest Groups
  • RDA Second Plenary, September 2013
    • 380 participants from 22 countries
    • First BOFs
    • First Org. Partner Meet-up
    • First “neutral space” community meeting (Data Citation Summit)
  • RDA Third Plenary, March 2014
    • 497 participants
    • 14 BOF, 12 Working Groups, 22 Interest Groups
    • 6 co-located events
    • First Org. Assembly
  • RDA Fourth Plenary, September 2014
    • Amsterdam

RDA Community Evolving Rapidly: Over 1500 members from 70+ countries (as of 3/15/14)

[Map of membership by region. Recoverable labels:]

  • Africa: 2%
  • South America: 1%
  • Asia: 4%
  • Austral-Pacific: 4%

Map courtesy traveltip.org

RDA Interest (IG) and Working Groups (WG) effectively doubling each Plenary (Groups as of 1/14)

Community Needs - focused

  • Community Capability Model IG
  • Engagement IG
  • Clouds in Developing Countries IG

Domain Science - focused

  • Toxicogenomics Interoperability IG
  • Structural Biology IG
  • Biodiversity Data Integration IG
  • Agricultural Data Interoperability IG
  • Digital History and Ethnography IG
  • Defining Urban Data Exchange for Science IG
  • Marine Data Harmonization IG
  • Materials Data Management IG

Reference and Sharing - focused

  • Data Citation IG
  • Data Categories and Codes WG
  • Legal Interoperability IG

Data Stewardship - focused

  • Research Data Provenance IG
  • Certification of Digital Repositories IG
  • Preservation e-infrastructure
  • Long-tail of Research Data IG
  • Publishing Data IG
  • Domain Repositories IG
  • Global Registry of Trusted Data Repositories and Services IG

Base Infrastructure - focused

  • Data Foundations and Terminology WG
  • Metadata Standards WG
  • Practical Policy WG
  • PID Information Types WG
  • Data Type Registries WG
  • Metadata IG
  • Big Data Analytics IG
  • Data Brokering IG
RDA Organizational Framework nearly at Steady State

  • RDA Council: Responsible for overarching mission, vision, impact of RDA
  • Technical Advisory Board: Responsible for technical roadmap and interactions
  • Secretary-General and Secretariat: Responsible for administration and operations
  • Organizational Advisory Board and Organizational Assembly: Responsible for organizational and strategic advice
  • RDA Membership
  • Working Groups: Responsible for impactful, outcome-oriented efforts
  • Interest Groups: Responsible for defining and refining common issues
  • RDA Colloquium (Research Funders): Operational and community sponsorship

Coming in Fall: First RDA Infrastructure Deliverables

Scheduled to Complete Summer 2014

Data Type Registries WG

  • Deliverables: System of data type registries, formal model for describing types, working model of a registry.
  • Initial Adopters and Users: CNRI, International DOI Foundation, Deep Carbon Observatory

Practical Code Policies

  • Deliverables: Survey of policies in production use, testbed of machine actionable policies, deployment of 5 policy sets, policy starter kits
  • Initial Adopters and Users: RENCI, DataNet Federation Consortium, CESNET, Odum Institute

Persistent Identifier Information Types

  • Deliverables: Minimal set of PID types, API
  • Initial Adopters and Users: Data Conservancy, DKRZ

Scheduled to Complete Fall 2014

Language Codes

  • Deliverables: Operationalization of ISO language categories for repositories.
  • Initial Adopters and Users: Language Archive, Paradisec

Data Foundations and Terminology

  • Deliverables: Common vocabulary for data terms, formal definitions and open registry for data terms
  • Initial Adopters and Users: EUDAT, DKRZ, Deep Carbon Observatory, CLARIN, EPOS

Metadata Standards

  • Deliverables: Use cases and prototype directory of current metadata standards, starting from the DCC directory
  • Initial Adopters and Users: JISC, DataOne
RDA Medium Term (3-5 year) Goals
  • Create a pipeline of data sharing infrastructure efforts
    • that are adopted and used by communities during their development
    • that increase their impact through greater adoption over time
  • Build and expand the research data community for effective impact
    • globally, regionally, and within constituent groups
  • Evolve as a useful, relevant, and agile organization
    • that helps the community capitalize on opportunity and respond to challenges within the data community
RDA as an Accelerant of Existing Projects
  • This is already the case
  • RDA is helping expand the impact of at least two Sloan-funded projects.
    • CNRI Interoperability Platform
      • LEI Prototype
      • Type Registry
    • Deep Carbon Observatory (DCO)
      • Data science infrastructure (RPI)
    • DCO now working with CNRI in the context of the RDA Data Type Registries Working Group

What are Data Types?

  • Characterize data structures at multiple levels of granularity
    • Serve as macro or shortcut for understanding and processing data
  • File formats & mime types are examples of solved problems at the container level but don’t solve finer grained interpretation
    • It’s a number in cell A3 but what does it mean
  • Other structures with more limited use, e.g., many sci. data sets, may need multiple levels of typing
  • Data types enable humans and machines to discover, process, and reason about data

Data Type Registries
  • Each type registered with unique identifier
  • Common data model and expression
  • Associate with services, tools, format registries, etc.
  • Common API for machine consumption

RDA Data Type Registries WG

  • Goal: Interoperable set of Type Registries
  • Approved as RDA WG at Plenary 1
  • Co-chairs
    • Larry Lannom – CNRI
    • Daan Broeder - Max Planck Institute for Psycholinguistics
  • Membership
    • 44 participants
    • U.S., UK, Netherlands, Germany, Italy, Australia, Finland, Canada, Kenya, Japan
    • Various scientific fields, Practitioners, Librarians, Publishers
  • Schedule
    • 3/2013 – 9/2013: gather use cases, begin design, including data model
    • 10/2013 – 12/2013: refine model, begin prototyping
    • 1/2014 – 5/2014: finalize data model & functional specs, deploy functional registry for Handle types, release turnkey registry

DTR Use Cases

  • Broad Functional Classification
    • Repos hold widely varying levels of data & metadata
    • High-level functional classification of the identified object needed to make sense of what is available, e.g., data object, metadata, repo description, contact info, etc.
  • Simple License Information via PID Resolution
    • Data set access conditions cannot be predicted based on ID
    • For DataCite DOIs, a handle/type/value triple could be used to provide access information, probably through a level of indirection, resulting in a pop-up or intervening page or open linked data
  • Object Types as a Short-cut for Dependent Services to Match Processing Requirements to Data Objects
    • Using data acquisition as an example
      • Determine object type you are trying to build
      • Consult registry to index into an ontology to dynamically define required and optional properties
      • Does the input data have what is needed?
  • Registration of PID Types (in ID/Type/Value triples) for Data Processing and Interpretation
    • Distinguish pointers to objects from pointers to metadata from pointers to services
    • Enable complex client interactions as opposed to simple one-to-one re-direction
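
The data acquisition use case above could be sketched roughly as follows. This is an illustration only, not the working group's design: the type identifier, the property names, and the `check_input` helper are hypothetical, and a plain dictionary stands in for the registry and ontology lookup.

```python
# Hypothetical stand-in for a registry/ontology lookup: maps a type
# identifier to the properties an object of that type must or may carry.
TYPE_PROPERTIES = {
    "example/rock-sample": {
        "required": {"location", "collected_on", "carbon_content"},
        "optional": {"collector", "notes"},
    },
}


def check_input(type_id, record):
    """Return (missing required properties, properties unknown to the type)."""
    spec = TYPE_PROPERTIES[type_id]
    present = set(record)
    missing = spec["required"] - present
    unknown = present - spec["required"] - spec["optional"]
    return sorted(missing), sorted(unknown)


# Does the input data have what is needed to build the target object type?
incoming = {"location": "47.6N 122.3W", "carbon_content": 0.12, "depth": 30}
missing, unknown = check_input("example/rock-sample", incoming)
print("missing:", missing)
print("unknown:", unknown)
```

Here the check reports that `collected_on` is absent and that `depth` is not part of the type, which is the kind of answer an acquisition pipeline would need before accepting the data.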

Discovery Use Case

[Diagram. Labels: Users; Repositories and Metadata Registries; Federated Set of Type Registries; ID; Type; Payload. Numbered callouts 1-4 correspond to the steps below.]

  1. Clients (processes or people) look for types that match their criteria for data, e.g., types that combine location, temperature, and date-time stamp.
  2. The Type Registry returns matching types.
  3. Clients look in repositories and metadata registries for data sets matching those types.
  4. Appropriately typed data is returned.
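
A minimal sketch of this discovery flow, assuming toy in-memory lists in place of real type registries and repositories. All identifiers, class names, and property names below are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataType:
    type_id: str            # hypothetical persistent identifier for the type
    properties: frozenset   # property names the type combines


@dataclass(frozen=True)
class Dataset:
    dataset_id: str
    type_id: str            # identifier of the type the data set declares


# Toy contents standing in for a federated set of type registries.
TYPE_REGISTRY = [
    DataType("type/1", frozenset({"location", "temperature", "timestamp"})),
    DataType("type/2", frozenset({"location", "salinity"})),
]

# Toy contents standing in for repositories and metadata registries.
REPOSITORY = [
    Dataset("ds/a", "type/1"),
    Dataset("ds/b", "type/2"),
    Dataset("ds/c", "type/1"),
]


def find_types(wanted):
    """Return registered types whose properties include everything wanted."""
    return [t for t in TYPE_REGISTRY if set(wanted) <= t.properties]


def find_datasets(types):
    """Return data sets whose declared type is among the given types."""
    ids = {t.type_id for t in types}
    return [d for d in REPOSITORY if d.type_id in ids]


# The client asks for types, then for data sets of those types.
matches = find_types({"location", "temperature", "timestamp"})
datasets = find_datasets(matches)
print([d.dataset_id for d in datasets])
```

The two lookups are deliberately separate: the type query never touches the data, so it can be answered by a registry that holds no data sets at all.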

Process Use Case

[Diagram. Labels: Users; Federated Set of Type Registries; ID; Type; Payload; Typed Data; Data Set; Rights ("Terms:… I Agree"); Visualization; Dissemination; Data Processing; Services. Numbered callouts 1-4 correspond to the steps below.]

  1. Client (process or people) encounters unknown type.
  2. Resolved to Type Registry.
  3. Response includes type definitions, relationships, properties, and possibly service pointers. Response can be used locally for processing, or, optionally:
  4. Typed data or reference to typed data can be sent to service provider.
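
The process flow above might look like this in code. It is a sketch under stated assumptions: the registries are plain dictionaries, the record fields (`description`, `services`) and the placeholder service URL are hypothetical, and "sending" is reduced to returning a string.

```python
# Hypothetical federation: each registry maps type identifiers to records.
REGISTRY_A = {
    "type/temp-series": {
        "description": "Time series of temperature readings",
        "services": ["https://example.org/plot"],  # placeholder service URL
    },
}
REGISTRY_B = {
    "type/license": {
        "description": "Terms of use for a data set",
        "services": [],
    },
}
FEDERATION = [REGISTRY_A, REGISTRY_B]


def resolve_type(type_id):
    """Ask each registry in turn; return the first record found, else None."""
    for registry in FEDERATION:
        record = registry.get(type_id)
        if record is not None:
            return record
    return None


def handle_typed_data(type_id, payload):
    """Decide what to do with typed data once its type has been resolved."""
    record = resolve_type(type_id)
    if record is None:
        return "unknown type: cannot interpret payload"
    if record["services"]:
        # Optional path: hand the data (or a reference to it) to a service.
        return f"send to {record['services'][0]}"
    # Default path: use the type description locally.
    return f"process locally as: {record['description']}"


print(handle_typed_data("type/temp-series", b"\x01\x02"))
print(handle_typed_data("type/license", b"terms"))
print(handle_typed_data("type/never-registered", b""))
```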

Deep Carbon Observatory Data Science and Data Management Infrastructure Overview

Global research program to transform our understanding of carbon in Earth

  • Community of scientists (biologists, physicists, geoscientists, chemists, and many others) whose work crosses these disciplinary lines, forging a new, integrative field of deep carbon science
  • 10-year initiative to intensify global attention and scientific effort in the burgeoning field of deep carbon science
  • DCO infrastructure includes: public engagement and education, online and offline community support, innovative data management, and novel instrumentation

deepcarbon.net

Alfred P. Sloan Foundation pledged $50 million over the duration to fund: infrastructure development, scientific workshops, novel technology development, and preliminary research and fieldwork.

  • “Seed funding” awarded to catalyze collaborative scientific efforts around the world, increase public and private sector spending in deep carbon science, and leave a thriving community of international scientists as its legacy.
  • DCO will synthesize 10 years of scientific research to generate unique and unprecedented views of Earth, looking at both scientific and human societal issues through a new, sharper lens.

DCO-Data Science World View: Everything is a first-class (science) object

RDA Brings Together DCO & DTR
  • Benefits to DTR
    • DCO brought the data acquisition use case – no one else thought of it
    • DCO as early adopter will benefit testing and use of RDA result
  • Benefits to DCO
    • Needed facility specified and prototyped with DCO use case in mind
    • Turn-key DTR will be available to DCO
    • DCO data science approaches and accomplishments presented to wide multi-disciplinary audience
  • Benefits to Sloan
    • Two funded projects each augmented through interaction in RDA

Types and the Handle System

  • Typing makes sense of data, which is just bits
  • Handles resolve to type/value pairs – all other functions reside in the applications
  • Handles identify digital entities which are implicitly or explicitly typed
  • So – to develop Handle-based applications
    • Must understand the types of returned values
    • Will at some point need to understand the downstream data identified by handles
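
To illustrate the point that resolution returns only type/value pairs and the application supplies the meaning, here is a small sketch. The record layout, the handle, and every type name other than `URL` are invented for this example; no real resolver is contacted.

```python
# A resolved handle record modeled as a list of typed values.
# The handle and the EXAMPLE_* type names are hypothetical.
RECORD = {
    "handle": "example-prefix/example-suffix",
    "values": [
        {"index": 1, "type": "URL", "value": "https://example.org/data/42"},
        {"index": 2, "type": "EXAMPLE_LICENSE", "value": "CC-BY-4.0"},
        {"index": 3, "type": "EXAMPLE_CHECKSUM", "value": "abc123"},
    ],
}

# The application, not the resolver, decides what each type means.
HANDLERS = {
    "URL": lambda value: f"fetch {value}",
    "EXAMPLE_LICENSE": lambda value: f"show license {value}",
}


def interpret(record):
    """Apply a handler to every value whose type the application understands.

    Returns the list of resulting actions and the list of types it skipped.
    """
    actions, skipped = [], []
    for entry in record["values"]:
        handler = HANDLERS.get(entry["type"])
        if handler is None:
            skipped.append(entry["type"])
        else:
            actions.append(handler(entry["value"]))
    return actions, skipped


actions, skipped = interpret(RECORD)
print(actions)
print(skipped)
```

The skipped list is where a type registry would come in: an application that meets a type it does not understand has somewhere to look it up.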

What do Data Type Records contain?

  • Data type records contain
    • textual description for human understanding
    • provenance information (who created when and what)
  • Records could contain
    • structured metadata about types for machines to process
    • encoding information (think file formats)
    • service information (think APIs to systems or applications that can process typed data)
    • semantic information (think description or predicate logic, useful for reasoning)
  • Records do not enforce or define new ways to describe or represent data structures, but rely on existing frameworks and technologies
    • File formats (mime types), etc., may be used for describing encoding information
    • WSDL, REST APIs, etc., may be used for describing service information
    • OWL, KIF, etc., may be used for representing semantics and knowledge
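
One way to picture such a record is as a small structure with two mandatory parts (description and provenance) and optional lists for the rest. The class and field names below are assumptions made for this sketch, not the working group's data model, and the sample values are invented.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Provenance:
    creator: str   # who created the record
    created: str   # when, as an ISO 8601 date string


@dataclass
class DataTypeRecord:
    identifier: str          # hypothetical persistent identifier
    description: str         # textual description for human understanding
    provenance: Provenance
    # Optional sections; each simply points at existing frameworks.
    encoding: list = field(default_factory=list)    # e.g. MIME types
    services: list = field(default_factory=list)    # e.g. API endpoints
    semantics: list = field(default_factory=list)   # e.g. ontology terms

    def to_json(self):
        """Serialize the record for machine consumption."""
        return json.dumps(asdict(self), sort_keys=True)


record = DataTypeRecord(
    identifier="example/temperature-reading",
    description="A single temperature reading with location and time",
    provenance=Provenance(creator="A. Curator", created="2014-04-25"),
    encoding=["text/csv"],
)
print(record.to_json())
```

Keeping the optional sections as plain lists of references reflects the slide's point that records rely on existing technologies rather than defining new ones.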

Proposed Data Type Data Model

Proposed Use of Data Types

  • Multiple type registries will be deployed; perhaps one per community
  • Type registries federate across each other; local policies may restrict (the scope of) such federation
  • Users register data structures within a type registry and acquire a unique, persistent identifier (data type)
  • Data type identifiers are then associated with corresponding data
  • Registered type records are additionally disseminated by type registries as Linked Data compatible outputs
  • General Guidelines
    • Users decide what data structures to register or not. If a data structure is expected to play a global role, then users are encouraged to register that data structure
    • Users are encouraged to first search if the data structure is registered prior to registering to avoid duplicates
    • Users decide the encoding, service, and semantic technology or framework that best suits them
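
The registration guidelines above (search first, then register and receive an identifier, with federation across registries) can be sketched with a toy in-memory class. Everything here is hypothetical: the class, the use of random UUIDs as identifiers, and exact-equality matching of structures are simplifications chosen for the example.

```python
import uuid


class TypeRegistry:
    """Toy in-memory type registry; identifiers are random UUIDs (assumption)."""

    def __init__(self, name, peers=None):
        self.name = name
        self.peers = list(peers or [])   # registries this one federates with
        self._records = {}               # identifier -> structure description

    def search(self, structure, include_peers=True):
        """Return identifiers of identical structures already registered."""
        hits = [i for i, s in self._records.items() if s == structure]
        if include_peers:
            for peer in self.peers:
                # Ask each peer about its own records only, to avoid loops.
                hits += peer.search(structure, include_peers=False)
        return hits

    def register(self, structure):
        """Register a structure unless a match exists; return its identifier."""
        existing = self.search(structure)
        if existing:
            return existing[0]
        identifier = f"{self.name}/{uuid.uuid4()}"
        self._records[identifier] = structure
        return identifier


# One registry per community; "bio" federates with "geo".
geo = TypeRegistry("geo")
bio = TypeRegistry("bio", peers=[geo])

first = geo.register({"fields": ["location", "temperature"]})
second = bio.register({"fields": ["location", "temperature"]})
third = bio.register({"fields": ["species", "count"]})
print(first == second)   # the duplicate resolves to the existing identifier
print(third)
```

A local policy that restricts federation would correspond here to leaving a registry out of `peers`.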
