
(The Case for) Methodology Research

Indranil Gupta

March 7, 2006

CS598IG Fall 2006

- Distributed systems with large numbers of processes…
- Grid, P2P systems, Web, …

- …require scalable and reliable distributed protocols inside
- Multicast, Replication, Voting, …

- Researchers design protocols to optimize message and time complexity, reliability, process overheads, etc.
- However, the only assistance for this design comes from research literature and experience. This is a laborious, almost “seat of the pants” approach.
- Leads to complex system internals, e.g., credit-card systems [Spec03], information systems [CRA], the Grid, the Internet,…
Efforts to understand existing systems, and design simple, effective systems.

- The research community generates thousands of ideas every month. How many of these are used? Reused? Preserved? When projects finish, papers go into archives. No reuse may lead to reinvention of the wheel.
- Today, there is minimal reuse of ideas from one research project in another
- exceptions: use of modified programming languages such as Cyclone (a variant of C) by the Security community

- This “barrier” is likely because of the inherent requirement that research projects maximize the percentage of “unique” contributions
- in the above example, the new PL was not known in the security community, hence it worked.

- A different kind of gap is the one between theory and systems.
- Other fields of science have already developed methodologies
- Synthesis in hardware design. [Ambrosio et al, Bluespec]
- Design patterns
- Methodologies are needed for maturity in a field of science.

Design Methodologies

- Simple Thesis
For any "project" or "problem", design (i) a solution, (ii) a methodology underlying the design of the solution(s), and (iii) (optionally) a tie between this methodology and at least one other methodology.

- Calls for a new layer of Methodology Research
- Does not solve the mentioned problems, but attacks them
- Is more powerful than meets the eye

Protocol Design Methodology =

An organized, documented set of building blocks, rules and/or guidelines for design of a class of protocols, possibly amenable to automated code generation.

[adapted from FOLDOC]

Composable Methodologies

- “Archival” of ideas and results.
- Systematic reuse of ideas and results.
- Help designer systematically design new protocols with provable properties.
- Theoreticians and Practitioners
- Methodologies are understood by both theoreticians and practitioners. E.g., Classes of survivable storage archs. [Wylie et al] and ones on [Probabilistic I/O automata].
- “Composability” is a term as familiar to both sides as “fault-tolerance” and “scalability” (with slightly varying interpretations)
- Methodologies also allow both theoreticians and practitioners to apply their solutions more “generally” and to exchange ideas in a systematic manner.

Innovative Methodologies

- Systematic Generalization of an Approach
- in a sense, a methodology captures the mode of thinking of the designer (without a psychological examination or MRI).

- Systematic tie-in with existing systems
- Shorten Life-span of research projects.
- These advantages are especially evident after a methodology has been discovered
- “If only I had realized there was an underlying design methodology, I might have designed these protocols much quicker”

- No! It’s been going on for decades (see papers in this session). The goal is to recognize this, encourage it and bring it to the surface.
- How is a methodology different from
- a design philosophy (e.g., end to end principle or localized algorithms)?
- a philosophy is more generally applicable and is a frame of mind for the designer. Methodologies have the power to build entire solutions, and already have multiple philosophies inherently embedded in them. Methodologies are closer to building the actual protocols and the system.

- a protocol family, a framework, or a paradigm? It's the same; methodology research encourages the development of these for specific problem areas.

- Should there be an overarching methodology?
- Probably not, too expansive and raises too much contention. Allow order to emerge.

- Should there be standard ways to express methodologies ?
- Not yet. Over time, disparate methodologies may merge. Standardization should be emergent through integration.

- Inherent Nature
- Innovative Methodologies: create the opportunity to design completely novel protocols
- Composable Methodologies: building blocks and composition rules
- We will see examples of both of these today

- Expression
- Formal Rules (restricted, but rigorous) – either formal rules or a high level code generation PL
- Informal (guidelines, larger # of interpretations)

- Discovery of Methodologies
- Retroactive: for existing systems
- Progressive: for novel protocols (e.g., innovative M’s)
- Auxiliary

- Are there systematic protocol design methodologies?
- Can we automate part of protocol design?
Marshall McLuhan: “Technology is an extension of our natural facilities”.

Bill Gates: “Automation of any activity will magnify both its efficiencies and inefficiencies”.

- How does one assist the innovative process of design?
- Scientific disciplines use differential equations to represent ideas, results, and phenomena
- Biology, Physics, Chemistry, Electrical Engg., Economics, Sociology..
- Many phenomena here are scalable and reliable

- Methodologies to translate differential equations into protocols.
- Potential to innovate protocols that inherit scale and reliability of original equations.
- We give rigorous design methodologies for this
- We show how to design practical protocols for real applications

- Differential Equations used to study algorithms for independent vertex sets [Worm.95], 3-SAT [Achl.01], load balancing [Mitz.01]
- Our focus is the opposite direction: converting differential equations into distributed protocols

- Distributed Computing with infinite number of processes, and relation to very large groups: [Kur.81, Mer.00, Mitz.01] – We analyze infinite groups
- We assume an asynchronous system with no clock drift
- [FLP85], Randomized protocols [Motwani text], Probabilistic I/O Automata [Lyn.97, Wu97]

- Endemic Diseases: e.g., Flu, Measles [in static populations]
x= fraction of receptives, y=stashers, z=averse

- translate into Migratory Replication
- E.g., Persistent Distributed Storage of Files.
- [R. Anderson] “Where a file once inserted, can never be deleted, even by a gun at your wife’s head”.

- E.g., Migrating leader committee membership, e.g., for multicast buffering

Mapping

- Differential Eqn. → State Machine
- Map
- Each Variable to a state
- Each Term to an Action
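The slides name two kinds of action, Flipping and One-Time-Sampling, but the transcript does not preserve their definitions. The sketch below is one hedged reading of the "each term to an action" rule: it assumes that a negative term which is linear in its own variable maps to a Flipping action (a local coin toss), and that any term involving a product of variables maps to a One-Time-Sampling action that samples `exp[owner] - 1` peers in its own state plus `exp[v]` peers in every other state `v`. That sample count agrees with the generated snippet later in the deck, which passes exponents 1 and 0 for `ST_y` and `ST_x`. All type and function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_VARS 3  /* variables x, y, z, one protocol state each */

enum var_id { VAR_X = 0, VAR_Y = 1, VAR_Z = 2 };
enum action_kind { ACTION_FLIPPING, ACTION_ONE_TIME_SAMPLING };

/* One negative term  -coeff * x^exp[0] * y^exp[1] * z^exp[2]
 * taken from the equation of variable `owner`. */
struct term {
    enum var_id owner;   /* equation (and state) the term belongs to */
    double coeff;        /* magnitude of the coefficient, > 0 */
    int exp[NUM_VARS];   /* exponent of each variable in the term */
};

/* Assumed rule: a process in state `owner` samples (exp[owner] - 1) peers
 * it expects to find in its own state, plus exp[v] peers it expects to
 * find in each other state v. */
int samples_needed(const struct term *t)
{
    int total = 0;
    assert(t != NULL && t->exp[t->owner] >= 1);
    for (int v = 0; v < NUM_VARS; v++)
        total += t->exp[v];
    return total - 1;
}

/* Assumed rule: no peers to sample means a purely local coin flip. */
enum action_kind classify_term(const struct term *t)
{
    return samples_needed(t) == 0 ? ACTION_FLIPPING
                                  : ACTION_ONE_TIME_SAMPLING;
}
```

Under these assumptions a term such as -0.1*y in the equation for y is a Flipping action, while -0.3*x^2*y^2 in the equation for x is a One-Time-Sampling action that samples three peers.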

[Figure: state machine over states x, y and z, annotated with a Flipping Action and a One-Time-Sampling Action]

“Endemic Protocol” for Migratory Replication

Analysis

- System analysis through
- Phase Portraits
- Behavior starting from different initial points

- Differential Equation (hence system) has a trivial and a non-trivial equilibrium point.
- The trivial point is a saddle point.
- The non-trivial point is a stable point.

Convergence Complexity: typically exponentially fast

Performance

- Set of stashers changes every 40.6 s (on average).
- No long horizontal lines, no vertical stripes: no temporal or hostid-wise correlation of the stasher set.
- Endemic Protocol under Massive Failures: 50% of computers in this 100,000-computer system fail at time t=5000 s. The file does not disappear.
- Endemic Protocol under Churn: Even under 25% churn (injected throughout), the file does not disappear.
- File Flux Rate (system-wide): Number of transfers of a given file per protocol period. Low, at 1-2 per second.

- Lotka-Volterra Model of Competition
x=#rabbits, y=#sheep

- “Two species competing for the same resource typically cannot co-exist”.

“LV Protocol” for Majority Selection

e.g., “Voting” on good and bad replicas of a file, required in digital libraries [LOCKSS 03]

All One-Time-Sampling Actions

Phase Portrait of the LV Protocol

- Four equilibrium points
- X=Y=0: unstable point
- X=N and Y=N: two stable points
- X=Y=Z(=N/3): saddle point

- Initial points with X<Y converge to Y=N
- Initial points with X>Y converge to X=N
- Initial points with X=Y converge to X=Y=Z(=N/3)
- (the last is disturbed by small perturbations)
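The convergence claims above can be tried numerically. The LV equations themselves are not preserved in this transcript, so the sketch below uses, as a stand-in, the three-variable system given in Mathematica format on the Automatic Code Generation slide, and works with fractions x, y, z (summing to 1) rather than counts out of N. The forward-Euler integrator, the step size, and all names are assumptions made for illustration; they are not part of the slides or of the LV Protocol.

```c
#include <assert.h>

/* Fractions of processes in each state; x + y + z stays 1. */
struct fractions { double x, y, z; };

/* One forward-Euler step of size dt for the system
 *   D[x] = 0.3*x^2*z^2 - 0.3*x^2*y^2
 *   D[y] = 0.3*y^2*z^2 - 0.3*x^2*y^2
 *   D[z] = -(D[x] + D[y])
 * which is the same D[z] as on the slide, written compactly. */
struct fractions euler_step(struct fractions s, double dt)
{
    double x2 = s.x * s.x, y2 = s.y * s.y, z2 = s.z * s.z;
    double dx = 0.3 * x2 * z2 - 0.3 * x2 * y2;
    double dy = 0.3 * y2 * z2 - 0.3 * x2 * y2;
    struct fractions next = {
        s.x + dt * dx,
        s.y + dt * dy,
        s.z - dt * (dx + dy)
    };
    return next;
}

/* Apply `steps` Euler steps starting from s. */
struct fractions integrate(struct fractions s, double dt, long steps)
{
    assert(dt > 0.0 && steps >= 0);
    for (long i = 0; i < steps; i++)
        s = euler_step(s, dt);
    return s;
}
```

Starting from x=0.5, y=0.3, z=0.2 and integrating for a long horizon (for example `integrate(start, 0.05, 100000L)`), x ends up holding almost all of the mass, which is the "X>Y converges to X=N" behavior. Convergence for this particular system is slow because its terms are quartic, so short horizons show only the trend.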

- Methodology
f(X) may be

- Complete: all right-hand sides sum to zero [a further condition on the slide is not preserved]
- Completely Partitionable: (a) complete and (b) negative and positive terms are matched
- Polynomial: all terms are polynomials
- Restricted polynomial: (a) polynomial and (b) for each x in X, each negative term in the equation for x contains x as a factor
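Two of these conditions are easy to check mechanically. The sketch below tests only the "right-hand sides sum to zero" part of completeness (by checking that the coefficients of each monomial cancel across all equations) and the restricted-polynomial condition. The example data is the system shown in Mathematica format on the Automatic Code Generation slide; the struct layout and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_VARS 3  /* x, y, z */

struct eq_term {
    int equation;        /* index of the variable whose derivative holds this term */
    double coeff;        /* signed coefficient */
    int exp[NUM_VARS];   /* exponents of x, y, z */
};

static int same_monomial(const struct eq_term *a, const struct eq_term *b)
{
    for (int v = 0; v < NUM_VARS; v++)
        if (a->exp[v] != b->exp[v])
            return 0;
    return 1;
}

/* Sum-to-zero check: for every monomial, coefficients over all equations cancel. */
int is_complete(const struct eq_term *terms, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            if (same_monomial(&terms[i], &terms[j]))
                sum += terms[j].coeff;
        if (sum > 1e-9 || sum < -1e-9)
            return 0;
    }
    return 1;
}

/* Restricted polynomial: each negative term in the equation for v has v as a factor. */
int is_restricted(const struct eq_term *terms, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (terms[i].coeff < 0.0 && terms[i].exp[terms[i].equation] < 1)
            return 0;
    return 1;
}

/* D[x], D[y], D[z] from the Automatic Code Generation slide. */
const struct eq_term EXAMPLE_SYSTEM[] = {
    {0,  0.3, {2, 0, 2}}, {0, -0.3, {2, 2, 0}},
    {1,  0.3, {0, 2, 2}}, {1, -0.3, {2, 2, 0}},
    {2, -0.3, {2, 0, 2}}, {2, -0.3, {0, 2, 2}},
    {2,  0.3, {2, 2, 0}}, {2,  0.3, {2, 2, 0}},
};
const size_t EXAMPLE_SYSTEM_LEN =
    sizeof EXAMPLE_SYSTEM / sizeof EXAMPLE_SYSTEM[0];
```

Both checks pass for the example system: every monomial (x^2*z^2, y^2*z^2, x^2*y^2) cancels across the three equations, and every negative term contains the variable of its own equation.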

- Theorem: Flipping and One-Time-Sampling suffice to map a differential equation system that is completely partitionable and restricted polynomial into an equivalent protocol.
- E.g.,

- This class includes many interesting and useful processes, e.g., endemic replication, majority selection protocol and epidemic multicast

Not Completely Partitionable…

- Equation Rewriting into equivalent forms
- To rewrite as complete equations, introduce new variable z and set to
- Rewrite equation to have
- Massage Terms to be completely partitionable
e.g., LV equations:

- Rabbits and Sheep (LV Model) → Voting protocol
- Distributed digital libraries

- Bees (D’Silva Model) → Adaptive Grid Computing
- Grids and clusters

- Spread of Epidemics and Rumors → Epidemic protocols (retroactive!)
- Used in Kelips

- Methodologies for mapping Differential Equation systems into equivalent Distributed Protocols
- Flipping and One-Time-Sampling Actions for restricted polynomial equation systems
- Equation Rewriting

- Generated Protocols: Endemics for migratory replication, LV protocol for majority selection, epidemic multicast
- Folklore file system based on endemic protocol

- Many more details in PODC paper.

- Equation Rewriting techniques, e.g., Is complete == completely partitionable?
- Mapping Equations with implicit t variable, or no t variable
- Do methodologies for these other differential equation types make sense?

- Building file and web caching systems using these protocols

Differential equations

Automatic Code Generation

D[x] = 0.3*x^2*z^2 - 0.3*x^2*y^2

D[y] = 0.3*y^2*z^2 - 0.3*x^2*y^2

D[z] = -0.3*x^2*z^2 - 0.3*y^2*z^2 + 0.3*x^2*y^2 + 0.3*x^2*y^2

C code over Berkeley sockets

    void schedule_timer_event (int nodeid,
                               struct pp_payload* payload)
    {
        int curr_term, to_state, prev_state;
        int* curr_state;
        float p;

        curr_state = get_state();
        prev_state = *curr_state;

        /* Ignore a timer event that was scheduled for a different state. */
        if (*curr_state != payload->state) return;

        curr_term = payload->term;

        if (*curr_state == ST_x && curr_term == 0)
        {
            int num_states;
            int *states, *exponents;

            num_states = 2;
            states = (int*)malloc(num_states*sizeof(int));
            exponents = (int*)malloc(num_states*sizeof(int));

            states[0] = ST_y; states[1] = ST_x;
            exponents[0] = 1; exponents[1] = 0;

            /* Issue the One-Time-Sampling (ots) action for this term. */
            ots (ST_z, 0.5, num_states, states, exponents);
        }

        if (*curr_state == ST_y && curr_term == 0)
        …
    }

[Figure: DIFFGEN Toolkit. Diagram labels: differential system, differential equation, equation variable, positive terms, negative terms, match, equation term, constant, variable, exp. Output shown: schedule_timer_event snippet from fixedclient.c]

- Appealing exercise but ridden with potholes
- Is the phenomenon a good match?
- Does the distributed system protocol behave exactly as the original phenomenon does?
- Are there any side-effects because we have PCs and not bees interacting?

- Design methodologies are a simple answer to these quandaries
- Derive the protocol from a model of the phenomenon, not from the phenomenon itself
- Run the “stupid test”: can I design a simpler, more efficient algorithm without using the natural analogy?

- Setting:
- Federated Storage Architectures: Federated Array of Bricks (FAB) from HP, or Collective Intelligent Bricks (CIB) from IBM.
- Clients make requests to collection of servers (fopen(), fwrite(), fread(), fclose())

- Need to support different “assumptions” (application- and deployment- dependent) from one piece of software
- How?

(very very (very) brief)

- Develop a family of protocols, each for a specific system model. Allow the application flexibility of choosing right mix at install-time or run-time
- Possible models from combinations of:
- Timing: synchronous or asynchronous
- Server: crash-stop, omission, crash-recovery, Byzantine, hybrid
- Client: same choices as above
- Repair by clients: allow client to repair or not
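To make the size of this design space concrete, the sketch below encodes a system model as four enumerated choices and maps each combination to a slot in a hypothetical table of protocols. It assumes the client has the same five failure choices as the server, as the list says, and it does not claim that every combination is meaningful or was implemented; the enum and function names are illustrative only.

```c
#include <assert.h>

enum timing  { TIMING_SYNCHRONOUS, TIMING_ASYNCHRONOUS, TIMING_COUNT };
enum failure { FAIL_CRASH_STOP, FAIL_OMISSION, FAIL_CRASH_RECOVERY,
               FAIL_BYZANTINE, FAIL_HYBRID, FAIL_COUNT };
enum repair  { REPAIR_DISALLOWED, REPAIR_ALLOWED, REPAIR_COUNT };

/* One system model: a choice along each of the four dimensions. */
struct system_model {
    enum timing  timing;
    enum failure server;
    enum failure client;
    enum repair  repair;
};

/* Number of distinct combinations: 2 * 5 * 5 * 2. */
enum { MODEL_COUNT = TIMING_COUNT * FAIL_COUNT * FAIL_COUNT * REPAIR_COUNT };

/* Dense index in [0, MODEL_COUNT), usable to pick a protocol from a table
 * at install time or run time. */
int model_index(struct system_model m)
{
    assert(m.timing < TIMING_COUNT && m.server < FAIL_COUNT &&
           m.client < FAIL_COUNT && m.repair < REPAIR_COUNT);
    return ((m.timing * FAIL_COUNT + m.server) * FAIL_COUNT + m.client)
               * REPAIR_COUNT + m.repair;
}
```

With these dimensions there are 100 combinations, which is one way to see why a family of protocols with explicit composition rules is attractive compared with writing each protocol by hand.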

- Relevance: synchronous → LAN; crash-stop → closely controlled environment; Byzantine → untrusted environment

- Evident: reuse of protocols from broad distributed systems literature
- A lot from theory too!

- Lacking/needed:
- How are the protocols (for different system models) composed?
- What are the building blocks?
- What composition rules are used?

- Variety of overlays have been designed
- Chord, Pastry, Kelips – DHTs
- Narada, SRM, RMTP, Bimodal Multicast – multicasts

- Question: can we specify the design of each of these systems (or a class of them) in a declarative language?
- Specify the goals of the system rather than the lower-level implementation
- P2: declarative logic language for overlay design
- Prolog-like rules

- materialize(succ, 120, infinity, keys(2))
- Each node maintains table succ, whose tuples retained for 120 s, unbounded size; keys specifies position in tuple of primary key
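For comparison with the one-line declaration, here is a rough C sketch of what such a table has to do by hand: keep tuples for a fixed lifetime, grow without a size bound, and treat the field at position 2 as the primary key. The three-field layout (NI, S, SI) is taken from the succ rules shown later in the deck; everything else (names, the use of seconds as doubles, expiry on access) is an assumption and not how P2 itself is implemented.

```c
#include <assert.h>
#include <stdlib.h>

struct succ_tuple { int ni, s, si; double inserted_at; };

struct soft_table {
    struct succ_tuple *rows;
    size_t len, cap;
    double lifetime;          /* seconds a tuple is retained, e.g. 120 */
};

void table_init(struct soft_table *t, double lifetime)
{
    t->rows = NULL; t->len = 0; t->cap = 0; t->lifetime = lifetime;
}

/* Drop every tuple whose age has reached the lifetime. */
void table_expire(struct soft_table *t, double now)
{
    size_t kept = 0;
    for (size_t i = 0; i < t->len; i++)
        if (now - t->rows[i].inserted_at < t->lifetime)
            t->rows[kept++] = t->rows[i];
    t->len = kept;
}

/* Insert a tuple, or refresh the tuple with the same primary key (field s).
 * Returns 0 on success and -1 if memory could not be allocated. */
int table_insert(struct soft_table *t, int ni, int s, int si, double now)
{
    table_expire(t, now);
    for (size_t i = 0; i < t->len; i++)
        if (t->rows[i].s == s) {
            t->rows[i] = (struct succ_tuple){ ni, s, si, now };
            return 0;
        }
    if (t->len == t->cap) {   /* unbounded size: grow on demand */
        size_t new_cap = t->cap ? 2 * t->cap : 8;
        struct succ_tuple *grown = realloc(t->rows, new_cap * sizeof *grown);
        if (grown == NULL)
            return -1;
        t->rows = grown;
        t->cap = new_cap;
    }
    t->rows[t->len++] = (struct succ_tuple){ ni, s, si, now };
    return 0;
}

/* Number of live tuples at time `now`. */
size_t table_count(struct soft_table *t, double now)
{
    table_expire(t, now);
    return t->len;
}

void table_free(struct soft_table *t)
{
    free(t->rows);
    table_init(t, t->lifetime);
}
```

A tuple inserted at time 0 is visible at 119 s and gone at 121 s unless it is re-inserted in between, which refreshes its timestamp.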

- stabilize (X) :- periodic(X,E,3)
- stabilize is a table that has a row for X if periodic has a row (X,E,3) for some E
- In reality, stabilize is an event that gets invoked according to the stream periodic, i.e., once every 3 seconds

- Our DHT is organized as a (surprise!) logical ring
- Each object in the p2p system lies somewhere along the ring
- lookupResults@R(R,K,S,SI,E) :- node@NI(NI,N), lookup@NI(NI,K,R,E), bestSucc@NI(NI,S,SI), K in (N,S]
- returns a successful lookup result if the key K sought by the received lookup lies between the receiving node's identifier and that of its successor

- sendSuccessors@SI(SI,NI) :- stabilize@NI(NI,_), succ@NI(NI,_,SI)
- a node asks its successors (all if there are multiple successors) to send it their own successors, whenever the stabilize event is issued at that node

- succ@PI(PI,S,SI) :- sendSuccessors@NI(NI,PI), succ@NI(NI,S,SI)
- installs the returned successor at the original node
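The condition "K in (N,S]" in the lookup rule is a test on a ring, so the interval may wrap past zero. A minimal C sketch of that test follows, assuming 32-bit unsigned identifiers (the slides do not state the identifier width) and a hypothetical function name.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if key k lies in the ring interval (n, s], i.e. strictly after
 * node identifier n and up to and including successor identifier s. */
int in_ring_interval(uint32_t k, uint32_t n, uint32_t s)
{
    if (n < s)
        return n < k && k <= s;   /* interval does not wrap past zero */
    /* Interval wraps past zero. When n == s (a single-node ring) this
     * accepts every key, so the lone node owns the whole ring. */
    return k > n || k <= s;
}
```

For example, with n=250 and s=4 the keys 251, 0 and 4 are inside the interval while 100 and 250 are not.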

- We’ve specified, using 5 rules, a ring-based DHT
- The P2 paper goes on to specify the entire Chord protocol in 47 rules! (compare to the 1000’s of lines of code that would need to be “hand-written”)
- Performance of P2-generated Chord is comparable to hand-coded Chord

- The P2 paper also specifies the Narada multicast protocol in a mere 16 rules!

Ease of Protocol Specification: A protocol designer no longer has to write a C/C++/Java program several thousand lines long to design a new system. Design is a matter of writing only a few rules.

Formal Verification: Any such declarative design can potentially be run through specially-built verification engines that find bugs in the design, or better still, analyze the scalability and fault-tolerance of the protocol.

On-line distributed debugging: Execution history can be exported as a set of relational tables, so distributed debugging of a deployed distributed system can be achieved by writing the appropriate P2 rules.

- Breadth: The same language P2 can be used to design other p2p overlays beyond Chord (e.g., the Narada overlay). This makes possible quantitative comparisons among these systems that are much more believable than mere simulation-based comparisons. In addition, hybrid designs can be explored.

- Yet another language

- Learning Curve
- When will all the get done (if ever)?
- What about optimizations – will P2-generated code create room for discovering as many optimizations as hand-coded Chord?

- RAML=metarouting framework for routing [Maltz et al, SIGCOMM 04]
- Useful for designing protocols such as BGP, etc.

- Goal: create new resources for the protocol designer
- beyond research literature and experience

- Approach:
- Methodologies: E.g., [Innovative] To translate differential equation systems into equivalent protocols. [Composable] to reuse protocol design (P2).
- Automation: E.g., DiffGen toolkit (PODC 2004 poster) that takes as input diff. eqns (Mathematica format), and spews out ready-to-deploy code.

[Distributed Protocols Research Group, UIUC] http://www-faculty.cs.uiuc.edu/~indy/rsrch.htm

(to be continued)