Introduction

Introduction • Zachary G. Ives, University of Pennsylvania • CIS 700 – Internet-Scale Distributed Computing • January 13, 2004

Welcome! • To the initial version of the Penn Systems Seminar • First of an ongoing series, focusing on systems research topics of general interest • Format: reading and discussion (no homework or exams) • Independent Study encouraged to supplement the seminar • Our focus: P2P and distributed ad hoc systems

What Is the Vision of Peer-to-Peer Computing? Loose coupling, auto configuration: • No central administration • Scalability • Adaptability/resiliency • Nodes contribute as well as consume resources • System continues as peers join and leave

How Does P2P Work? • P2P infrastructure forms an overlay network over the real Internet, which supports: • Schemes for distributing resources (data, computation) without a directory structure • Unstructured: query by flooding or over advertisements • Structured: query according to an algorithm that organizes the peers into a consistent structure (hash table, tree, …) • Graceful handling of loss or gain of nodes • Replication “where appropriate” • Provides reliability/availability • Improves performance (self-tuning) • More on this later, from Honghui
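As a rough illustration of the structured case, the sketch below assumes a consistent-hashing ring in the style of a distributed hash table. The slide does not name a particular scheme, so the `Ring` class, `node_for`, the SHA-1 choice, and the peer names are all hypothetical.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on the ring (SHA-1 is an arbitrary choice)."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent-hashing ring: a key is owned by the first peer clockwise."""

    def __init__(self):
        self._positions = []  # sorted ring positions of the current peers
        self._peers = {}      # ring position -> peer name

    def join(self, peer: str) -> None:
        pos = ring_position(peer)
        if pos not in self._peers:
            bisect.insort(self._positions, pos)
            self._peers[pos] = peer

    def leave(self, peer: str) -> None:
        pos = ring_position(peer)
        if pos in self._peers:
            self._positions.remove(pos)
            del self._peers[pos]

    def node_for(self, key: str) -> str:
        """Return the peer responsible for key, wrapping around the ring."""
        if not self._positions:
            raise LookupError("no peers in the ring")
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._peers[self._positions[idx % len(self._positions)]]
```

The point of the design is the "graceful handling of loss or gain of nodes": when a peer leaves, only the keys it owned move to another peer, and every other key keeps its owner.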

The Promise of P2P • Major challenge for applications is generally scalability • Traditional systems definition: • Scalability of systems to numbers of requests, clients, etc. • But we need “human” scalability as well: • Avoid human administration, tuning, oversight, custom code • Self-administering; auto-tuning • Providing the “right” abstractions • Human contributors often create heterogeneity among components, data, participation levels, etc. • Aspects of P2P should help with all of these

The Central Questions: Goals of this Seminar • “What is the killer app for a P2P substrate?” • Is there more to this P2P idea than pirating music and searching for little green men (and women)? • What applications can benefit from P2P-like techniques? • What are their key properties? • What programming models are most appropriate for building such applications? • How can P2P techniques be improved to better support the applications we want to build? • Security, trust, reliability, consistency, …

Some P2P Applications • Early in the semester: examining apps built over P2P overlay networks • We’ll start with two projects here at Penn • We’d like to talk with you if you’re interested in working or collaborating on these projects! • BRIEF overviews of the issues – more detailed talks later in the semester • Later: P2P games • First: Orchestra – P2P meets data integration…

Key Problem: Coordinating Efforts between Collaborators • Today, to collaboratively edit structured data, we centralize • For many applications, this isn’t a good model, e.g.: • Bioinformatics groups have multiple standard schemas and warehouses for genomic information – each group wants to incorporate the info of the others, but have it in their format, with their own unique information preserved, and the ability to override info from elsewhere • Different neuroscientists may have data from measuring electrical activity in the same part of the brain – they may want to share common information but maintain their specific local information; each scientist wants the ability to control when their updates are propagated
Work in progress with Nitin Khandelwal; other contributors: Murat Cakir, Charuta Joshi, Ivan Terziev

The Orchestra System: Infrastructure for Collaborative Data Sharing • Each participant is a logical peer, with some XML schema that is mapped to at least one other peer’s schema • Schemas’ contents are logically synchronized initially and then on demand
[Figure: three participants (Part1, Part2, Part3) with Schema 1, Schema 2, and Schema 3, connected by mappings between XML schemas. Updates (+ XML tree A, - XML tree B) reach the other participants as translated updates from 3 (+ XML tree A’, - XML tree B’ and + XML tree A’’, - XML tree B’’).]

Some Challenges in Orchestra • Mappings • How to express them • Using them to translate updates, queries • Inconsistency • How to represent conflicts • How to resolve them • Update propagation • Consistency with intermittent connectivity • Scaling • To many updates • To many queries
[Figure annotations: the challenges are grouped into logical and semantics-level issues and implementation-level (P2P-based) issues.]

Mappings • Some peers may be replicas • Others need mappings, expressed as “views” • Views: functions from one schema to another • Can be inverted (may lose some information) • Can be “chained” when there is no direct connection • (Much research in generating these automatically [DDH00][MB01], …) • Prior work on propagating updates through relational views [BD82][K85][C+96]… • Ensuring the mapping specifies a deterministic, side-effect-free translation • Algorithmically applying the translation • Ongoing work with Nitin Khandelwal: • Extending the model to handle (unordered) XML • Challenge: dealing with XML’s nesting and its repercussions
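To make "views as functions" concrete, here is a minimal sketch that assumes flat records and pure field-renaming mappings. The slide's setting is XML and relational views, so `rename_view`, `invert`, `chain`, and the field names are hypothetical simplifications.

```python
from typing import Callable, Dict

Record = Dict[str, str]
View = Callable[[Record], Record]


def rename_view(field_map: Dict[str, str]) -> View:
    """Build a view that copies mapped fields under new names and drops the rest."""
    def view(record: Record) -> Record:
        return {dst: record[src] for src, dst in field_map.items() if src in record}
    return view


def invert(field_map: Dict[str, str]) -> Dict[str, str]:
    """Reverse a renaming; fields the forward view dropped cannot be recovered."""
    return {dst: src for src, dst in field_map.items()}


def chain(*views: View) -> View:
    """Compose views left to right, for peers with no direct mapping."""
    def composed(record: Record) -> Record:
        for v in views:
            record = v(record)
        return record
    return composed


# Hypothetical mappings: schema 1 -> schema 2 -> schema 3.
m12 = {"name": "author", "year": "pub_year"}
m23 = {"author": "creator"}
v13 = chain(rename_view(m12), rename_view(m23))
```

Applying `v13` and then the chained inverses returns only the fields that survived every hop, which is the information loss the slide mentions.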

A Globally Consistent Model that Encodes Conflicts • Even in the presence of conflicts, want a “global state” (from perspective of some schema) when we synchronize • Allows us to determine what’s agreed-upon, what’s conflicting • Can define conflict resolution strategies • Goal: “union of all states” with a way of specifying conflicts • Define conditional XML tree based on a subset of c-tables [IM84] • Each peer p_i has a boolean flag P_i representing “perspective i”
[Figure: conditional XML tree with a root and two auth children guarded by the conditions “If P1” and “If P2”; the leaf values are Lee and Smith.]
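A conditional tree of this kind can be sketched as follows. This is a simplification under stated assumptions: each condition is reduced to a set of peer flags (a disjunction), whereas c-tables allow richer conditions; `CNode`, `project`, and `conflicting_labels` are hypothetical names; and which author value belongs to which peer is chosen arbitrarily.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass
class CNode:
    label: str
    # Peers whose perspective includes this node; None means unconditional.
    visible_to: Optional[FrozenSet[str]] = None
    children: List["CNode"] = field(default_factory=list)


def project(node: CNode, peer: str) -> Optional[CNode]:
    """Return the ordinary tree seen from one peer's perspective."""
    if node.visible_to is not None and peer not in node.visible_to:
        return None
    kept = [c for c in (project(ch, peer) for ch in node.children) if c is not None]
    return CNode(node.label, None, kept)


def conflicting_labels(node: CNode) -> List[str]:
    """List labels that occur as siblings under different conditions."""
    seen = {}
    for child in node.children:
        if child.visible_to is not None:
            seen.setdefault(child.label, set()).add(child.visible_to)
    found = [label for label, conds in seen.items() if len(conds) > 1]
    for child in node.children:
        found.extend(conflicting_labels(child))
    return found


# Assumed assignment of values to perspectives, for illustration only.
tree = CNode("root", None, [
    CNode("auth", frozenset({"P1"}), [CNode("Lee")]),
    CNode("auth", frozenset({"P2"}), [CNode("Smith")]),
])
```

The single tree holds the "union of all states"; `project(tree, "P1")` recovers one peer's view, and `conflicting_labels(tree)` points at where the perspectives disagree.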

Propagating Updates with Intermittent Connectivity • How to synchronize among n peers (even assuming the same schema)? • Not all are connected simultaneously • Usual approaches: • Locking (doesn’t scale) • Epidemic algorithms (only eventually consistent) • Approach: • “Shadow instance” of the schema, replicated within the other peers of the network • Everyone syncs with the shadow instance • Benefits: state is deterministic after each sync
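The slide does not spell out the synchronization protocol, so the following is only a sketch under assumptions: the shadow instance is a single in-memory key-value store (not replicated across peers), each key carries a version counter, and the first writer to sync wins while later conflicting writes are reported. `Shadow`, `Peer`, `write`, and `sync` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Shadow:
    values: Dict[str, str] = field(default_factory=dict)
    versions: Dict[str, int] = field(default_factory=dict)  # key -> accepted writes


@dataclass
class Peer:
    name: str
    local: Dict[str, str] = field(default_factory=dict)
    pending: Dict[str, str] = field(default_factory=dict)  # edits since last sync
    seen: Dict[str, int] = field(default_factory=dict)     # shadow versions at last sync

    def write(self, key: str, value: str) -> None:
        """Record a local edit; nothing is propagated until sync()."""
        self.local[key] = value
        self.pending[key] = value

    def sync(self, shadow: Shadow) -> List[Tuple[str, str, str]]:
        """Push pending edits, report (key, mine, theirs) conflicts, then pull."""
        conflicts = []
        for key, value in sorted(self.pending.items()):
            current = shadow.versions.get(key, 0)
            if current != self.seen.get(key, 0) and shadow.values[key] != value:
                # Someone else updated this key since our last sync.
                conflicts.append((key, value, shadow.values[key]))
            else:
                shadow.values[key] = value
                shadow.versions[key] = current + 1
        self.pending.clear()
        self.local = dict(shadow.values)
        self.seen = dict(shadow.versions)
        return conflicts
```

Because every peer reconciles against the same shadow state, a peer's local state after `sync` depends only on the order of syncs, not on which peers happened to be online together.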

Scaling, Using P2P Techniques • Update synchronization • Key problem: find values conflicting with “shadow instance” • Partition the “shadow instance” across the network • Query execution • Partition computation across multiple peers (PIER does this) • Query optimization • Optimization breaks the query into sub-problems, uses dynamic programming to build up estimates of the costs of applying operators • Can recast as recursion + memoization • Use P2P overlay to distribute each recursive step • Memoize results at every node • Why is this useful? Suppose 2 peers ask the same query!
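The recasting of dynamic-programming optimization as recursion plus memoization can be sketched as below. The cost model is an assumption made for illustration (left-deep plans only, a uniform join selectivity of 1/100, cost measured as intermediate rows produced), and the relation names and cardinalities are invented. The `memo` dictionary stands in for results stored in the P2P overlay.

```python
from typing import Dict, FrozenSet, Tuple

CARD = {"R": 1000, "S": 100, "T": 10}  # hypothetical relation cardinalities
SELECTIVITY_DENOM = 100                # assumed: each join keeps 1/100 of the cross product

Plan = Tuple[int, int, str]            # (cost, output rows, plan text)
Memo = Dict[FrozenSet[str], Plan]


def best_plan(rels: FrozenSet[str], memo: Memo) -> Plan:
    """Cheapest left-deep join plan for rels, reusing memoized sub-plans."""
    if rels in memo:
        return memo[rels]
    if len(rels) == 1:
        (name,) = rels
        plan = (0, CARD[name], name)
    else:
        plan = None
        for last in sorted(rels):
            # Recursive step: best plan for everything except the last relation.
            sub_cost, sub_rows, sub_text = best_plan(rels - {last}, memo)
            rows = sub_rows * CARD[last] // SELECTIVITY_DENOM
            candidate = (sub_cost + rows, rows, "(%s JOIN %s)" % (sub_text, last))
            if plan is None or candidate[0] < plan[0]:
                plan = candidate
    memo[rels] = plan
    return plan


shared_memo: Memo = {}
cost, rows, text = best_plan(frozenset(CARD), shared_memo)
```

In a distributed setting each recursive call could be routed to the peer responsible for that subset of relations. A second peer asking the same query, or one that shares a sub-join, then finds the answer already in the memo instead of recomputing it.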

Current Status • Have a basic strategy for addressing many of the problems in collaborative data sharing • Initial sketches of the core algorithms • Need to develop them further • … And to implement (and validate) them in a real system!
