ResourceSync: A Modular Framework for Web-Based Resource Synchronization

This presentation introduces ResourceSync, a framework for synchronizing web-based resources. It discusses the problem domain, scope, and technology of the framework, and provides a demonstration of its capabilities.


Presentation Transcript


  1. ResourceSync: A Modular Framework for Web-Based Resource Synchronization • Herbert Van de Sompel, Los Alamos National Laboratory, @hvdsomp • Martin Klein, Los Alamos National Laboratory, @mart1nkle1n • http://www.openarchives.org/rs • #resourcesync • ResourceSync was funded by the Sloan Foundation & JISC

  2. ResourceSync • Collaboration between NISO and the Open Archives Initiative • Funded by the Sloan Foundation and JISC • Goal: Devise a specification for web-based resource synchronization

  3. This ResourceSync Presentation • Problem Domain • Scope • Framework - Overview • Framework – Technology • Demonstration • Status

  4. Background - OAI-PMH • Recurrent metadata exchange from a Data Provider to Service Providers • XML metadata only • Repository centric • Devised 1999-2002, prior to REST, prior to dominance of web search engines

  5. Revisit the Problem Domain - ResourceSync • Synchronization of resources from a Source to Destinations • Web resources, anything with an HTTP URI & representation • Resource centric • Devised 2012-2013, leverages key ingredients of web interoperability, existing specifications, existing Search Engine Optimization practice

  6. Problem Statement • Consideration: • Source (server) A has resources that change over time: they get created, modified, deleted • Destination (servers) X, Y, and Z leverage (some) resources of Source A • Problem: • Destinations want to keep in step with the resource changes at Source A

  7. A Source’s Resources

  8. A Source’s Resources Evolve over Time

  9. A Source’s Resources Evolve over Time

  10. A Source’s Resources Evolve over Time

  11. A Source’s Resources Evolve over Time

  12. A Source’s Resources Evolve over Time

  13. A Source’s Resources Evolve over Time

  14. Problem Statement • Consideration: • Source (server) A has resources that change over time: they get created, modified, deleted • Destination (servers) X, Y, and Z leverage (some) resources of Source A • Problem: • Destinations want to keep in step with the resource changes at Source A • Goal: • Design an approach for resource synchronization aligned with the Web Architecture that has a fair chance of adoption by different communities

  15. This ResourceSync Presentation • Problem Domain • Scope • Framework - Overview • Framework – Technology • Demonstration • Status

  16. Scope – Collection Size • Size of a Source’s resource collection: • A few resources - small web sites, repositories • Millions of resources – large repositories, datasets, linked data collections

  17. Scope – Change Frequency • Change frequency of a Source’s resources: • Low – daily, weekly, monthly • High – seconds, minutes

  18. Scope – Synchronization Latency • Destination’s requirements regarding synchronization latency: • High latency acceptable • Low latency essential

  19. Scope – Collection Coverage • Destination’s requirements regarding the coverage of a Source’s resources: • Partial coverage of the Source’s resources acceptable • Full coverage of the Source’s resources verifiable

  20. Scope – Bitstream Accuracy • Destination’s requirements regarding bitstream accuracy: • Unverifiable bitstream accuracy acceptable • Verifiable bitstream accuracy essential

  21. One to One Synchronization

  22. One to Many – Master Copy

  23. Many to One - Aggregator

  24. Selective Synchronization

  25. Metadata Harvesting

  26. This ResourceSync Presentation • Problem Domain • Scope • Framework - Overview • Framework – Technology • Demonstration • Status

  27. A Source’s Resources Evolve over Time

  28. Solution Perspective - Destination • Destination needs regarding synchronization: • Baseline synchronization: Initial catch-up operation to align with the Source’s resources • Incremental synchronization: Remain synchronized as the Source’s resources evolve • Audit: Destination determines whether it effectively is in sync with the Source • Bitstream accuracy • Coverage of resources

  29. Solution Perspective - Source • Source communicates about the state of its resources: • Publish inventory: snapshot of the state of resources at a moment in time • Publish changes: enumeration of resource changes that occurred during a temporal interval • Notify about changes: send notifications as changes occur • Communication payload: • Minimal, e.g. HTTP URI of resource • Additional, e.g. content-based hash of resource

  30. Resource List • In order to meet a Destination’s need for baseline synchronization, the Source may publish a Resource List • A Resource List is an inventory, a snapshot of existing resources • Per resource, it minimally provides the resource’s URI • Process: • Destination obtains the Resource List • Destination obtains listed resources by their URI • Optimization: Resource Dump, a list pointing to ZIP files that contain resource representations

  31. Publish Resource List: Inventory at Tx Resource List@Tx= { A ; B ; C }
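
The baseline synchronization process on the Resource List slide can be sketched roughly as follows. This is only an illustration, not part of the ResourceSync specification: the function name `baseline_sync`, the in-memory `source_store`, and the `fake_fetch` stand-in for an HTTP GET are all assumptions made for this example.

```python
from typing import Callable, Dict, Iterable


def baseline_sync(resource_list: Iterable[str],
                  fetch: Callable[[str], bytes]) -> Dict[str, bytes]:
    """Obtain every resource named in a Resource List by its URI.

    `fetch` is injected so the sketch runs without network access;
    a real Destination would issue an HTTP GET per URI instead.
    """
    local_copy: Dict[str, bytes] = {}
    for uri in resource_list:
        local_copy[uri] = fetch(uri)
    return local_copy


# Hypothetical Source state at time Tx: resources A, B and C.
source_store = {
    "http://example.com/A": b"content of A",
    "http://example.com/B": b"content of B",
    "http://example.com/C": b"content of C",
}


def fake_fetch(uri: str) -> bytes:
    """Stand-in for an HTTP GET against the Source (assumption)."""
    return source_store[uri]


# Resource List@Tx = { A ; B ; C }
resource_list_tx = sorted(source_store)
local_copy = baseline_sync(resource_list_tx, fake_fetch)
```

After the call, `local_copy` holds one entry per listed URI, which is the aligned starting point that later incremental synchronization builds on.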

  32. Change List • In order to meet a Destination’s need for incremental synchronization, the Source may publish a Change List • A Change List enumerates resource change events that occurred in a temporal interval • For each event, it minimally lists datetime, URI of the resource, the nature of the change • Process: • Destination obtains the Change List • Destination obtains created/updated resources, removes deleted resources • Optimization: Change Dump

  33. Publish Change List: Resource Changes During Interval Ty-Tz Change List[Ty,Tz] = { A updated @Tc ; B updated @Tc ; C created @Td ; D deleted @Te ; C updated @Tf }
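
A Destination's handling of the Change List above might look like the sketch below. The tuple layout, the function name `apply_change_list`, and the concrete timestamps standing in for Tc, Td, Te and Tf are assumptions for illustration; storing the event datetime is a placeholder for re-fetching the resource.

```python
from typing import Dict, List, Tuple

# (datetime, resource, nature of change): the minimal information per event.
Event = Tuple[str, str, str]


def apply_change_list(local: Dict[str, str], events: List[Event]) -> Dict[str, str]:
    """Replay change events in datetime order against a local copy."""
    synced = dict(local)
    # ISO 8601 UTC timestamps of equal format sort correctly as strings.
    for when, resource, change in sorted(events, key=lambda e: e[0]):
        if change == "deleted":
            synced.pop(resource, None)
        elif change in ("created", "updated"):
            synced[resource] = when  # placeholder for obtaining the resource
        else:
            raise ValueError(f"unknown change type: {change}")
    return synced


# Hypothetical datetimes standing in for Tc < Td < Te < Tf.
TC, TD, TE, TF = (f"2013-01-02T1{i}:00:00Z" for i in range(4))

change_list = [
    (TC, "A", "updated"),
    (TC, "B", "updated"),
    (TD, "C", "created"),
    (TE, "D", "deleted"),
    (TF, "C", "updated"),
]

before = {"A": "old", "B": "old", "D": "old"}
after = apply_change_list(before, change_list)
```

Replaying the five events leaves the Destination holding A, B and C, with D removed and C reflecting its latest update.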

  34. Change Notification • In order to meet a Destination’s need for incremental synchronization and low latency, the Source may send Change Notifications • A Change Notification conveys resource change events as they occur • For each event, it minimally lists datetime, URI of the resource, the nature of the change • Process: • Destination receives Change Notification • Destination obtains created/updated resources, removes deleted resources

  35. Send Change Notification – Resource Changes at Ta Change Notification @Ta = { A updated @Ta }

  36. Send Change Notification – Resource Changes at Tb Change Notification @Tb = { D updated @Tb }

  37. Send Change Notification – Resource Changes at Tc Change Notification @Tc = { A updated @Tc ; B updated @Tc }

  38. Send Change Notification – Resource Changes at Td Change Notification @Td = { C created @Td }

  39. Send Change Notification – Resource Changes at Te Change Notification @Te = { D deleted @Te }

  40. Send Change Notification – Resource Changes at Tf Change Notification @Tf = { C updated @Tf }

  41. Communication Payload – Metadata & Links • A Source may provide additional metadata and links pertaining to resources conveyed in Resource Lists, Change Lists, Change Notifications • Metadata about a resource: content encoding, content length, mime type, content-based hash • Linking to related resources: mirror copies, alternate representations, resource versions, diff between current and previous version, metadata-to-content link, content-to-metadata link, collection membership, etc.

  42. Communication Payload – Metadata – Hash • In order to meet a Destination’s need for audit, the Source may provide a content-based hash pertaining to a resource • Source computes the content-based hash for a resource • Source provides the hash as metadata pertaining to the resource in its communication payload • Destination processes communication payload, obtains the resource • Destination computes the content-based hash for the obtained resource, compares with the Source’s
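
The audit steps above can be sketched as a small check on the Destination side. The `algorithm:hexdigest` form follows the `md5:...` value shown on the Resource List slide; the function name `verify_hash` and the set of supported algorithm names are assumptions for this example.

```python
import hashlib

# Assumed mapping from hash labels to hashlib constructors.
HASHERS = {
    "md5": hashlib.md5,
    "sha-1": hashlib.sha1,
    "sha-256": hashlib.sha256,
}


def verify_hash(content: bytes, declared: str) -> bool:
    """Compare a locally computed hash with the one the Source declared.

    `declared` is expected in the form "<algorithm>:<hex digest>".
    """
    algorithm, _, expected = declared.partition(":")
    if algorithm not in HASHERS:
        raise ValueError(f"unsupported hash algorithm: {algorithm}")
    actual = HASHERS[algorithm](content).hexdigest()
    return actual == expected.lower()


# Source side: compute the hash and publish it as metadata.
resource_bytes = b"example resource content"
declared_hash = "md5:" + hashlib.md5(resource_bytes).hexdigest()

# Destination side: recompute after obtaining the resource and compare.
in_sync = verify_hash(resource_bytes, declared_hash)
corrupted = verify_hash(b"truncated transfer", declared_hash)
```

A mismatch tells the Destination only that its copy differs from what the Source described; whether to re-fetch or report the discrepancy is left to the Destination.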

  43. Communication Payload – Link – Interlink Metadata & Content • In order to allow a Destination to establish the relationship between a Source’s metadata and a Source’s content, the Source may provide appropriate links • Metadata resources and content resources are just resources identified by HTTP URIs • Both can independently be subject to synchronization and can be interlinked using appropriately typed links

  44. Communication Payload – Link – Link to Diff • In order to minimize content transfer, a Source may link to a diff between the previous and the new version of a resource • Destination can obtain the diff and patch its (previous) version of the resource • Connection between the resource and the diff is established by means of an appropriately typed link • Nature of the diff is established by means of its MIME type • Few diff MIME types exist. Communities can establish their own.

  45. Further Framework Characteristics • Modular: A Source does not have to implement all capabilities • Source decides which capabilities to support based on local and community requirements • Sets of Resources: Division of a Source’s resource collection into logical groupings • Supported capabilities can differ per set • Discovery: Mechanisms for Destinations to determine whether and how a Source supports ResourceSync • Based on conventions for web discovery and documents that detail the level of support

  46. This ResourceSync Presentation • Problem Domain • Scope • Framework - Overview • Framework – Technology • Demonstration • Status

  47. Sitemap Protocol • ResourceSync builds on the Sitemap protocol used by major search engines • Similarity between resource synchronization and resource discovery/indexing • Extends the Sitemap protocol to meet synchronization needs • Cf. Metadata and Links • Sitemap document format is used throughout the framework to express Resource Lists, Change Lists, etc. • Type of ResourceSync document can be determined through explicit declaration

  48. Common Sitemap <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> <url> <loc>http://example.com/res1</loc> <lastmod>2013-01-02T13:00:00Z</lastmod> </url> <url> <loc>http://example.com/res2</loc> <lastmod>2013-01-02T14:00:00Z</lastmod> </url> … </urlset>

  49. Resource List <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="resourcelist" at="2013-01-03T09:00:00Z" /> <url> <loc>http://example.com/res1</loc> <lastmod>2013-01-02T13:00:00Z</lastmod> <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="8876" type="application/pdf" /> </url> <url> … </url> </urlset>
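
A Destination could read such a Resource List with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`. The function name `parse_resource_list` and the dictionary keys are assumptions, and the sample document keeps only the one fully specified `<url>` entry from the slide.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the XPath-style lookups below.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

RESOURCE_LIST_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
           length="8876" type="application/pdf"/>
  </url>
</urlset>
"""


def parse_resource_list(xml_text: str):
    """Return (capability, resources) extracted from a Resource List."""
    root = ET.fromstring(xml_text)
    doc_md = root.find("rs:md", NS)  # document-level declaration
    capability = doc_md.get("capability") if doc_md is not None else None
    resources = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)  # per-resource metadata, optional
        attrs = md.attrib if md is not None else {}
        resources.append({
            "uri": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "hash": attrs.get("hash"),
            "length": int(attrs["length"]) if "length" in attrs else None,
            "type": attrs.get("type"),
        })
    return capability, resources


capability, resources = parse_resource_list(RESOURCE_LIST_XML)
```

Checking the `capability` value first lets a Destination confirm that the document really is a Resource List before treating its entries as an inventory.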

  50. Change List <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="changelist" from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z" /> <url> <loc>http://example.com/res2</loc> <lastmod>2013-01-02T13:00:00Z</lastmod> <rs:md change="updated" /> <rs:ln href="http://example.com/res2/meta" rel="describedby" /> </url> <url> … </url> </urlset>
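
The change type and typed links in a Change List can be extracted in much the same way. This sketch again relies on `xml.etree.ElementTree`. The function name `parse_change_list` and the returned structure are assumptions, and the sample document keeps only the fully specified entry from the slide.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

CHANGE_LIST_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
    <rs:ln href="http://example.com/res2/meta" rel="describedby"/>
  </url>
</urlset>
"""


def parse_change_list(xml_text: str):
    """Return (interval, events) extracted from a Change List."""
    root = ET.fromstring(xml_text)
    doc_md = root.find("rs:md", NS)
    if doc_md is None or doc_md.get("capability") != "changelist":
        raise ValueError("document does not declare itself a Change List")
    interval = (doc_md.get("from"), doc_md.get("until"))
    events = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        # Map each link relation type to its target URI.
        links = {ln.get("rel"): ln.get("href")
                 for ln in url.findall("rs:ln", NS)}
        events.append({
            "uri": url.findtext("sm:loc", namespaces=NS),
            "datetime": url.findtext("sm:lastmod", namespaces=NS),
            "change": md.get("change") if md is not None else None,
            "links": links,
        })
    return interval, events


interval, events = parse_change_list(CHANGE_LIST_XML)
```

Here the `describedby` link gives the Destination the URI of the metadata resource for `res2`, so metadata and content can be synchronized independently yet remain interlinked.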
