Broadcast Federation: An architecture for scalable inter-domain multicast/broadcast. Mukund Seshadri (mukunds@cs.berkeley.edu), with Yatin Chawathe (Yatin@research.att.com). http://www.cs.berkeley.edu/~mukunds/bfed/ Spring 2002
Motivation • One-to-many or many-to-many applications • e.g. Internet live audio/video broadcast • No universally deployed multicast protocol • IP Multicast • Limited scalability (due to router state or flooding) • Address scarcity • Need for administrative boundaries • SSM – better semantics and business model, but still requires a smart network • Overlays – application-level “routers” form an overlay network and perform multicast forwarding • Less efficient, but more easily deployable • May be used in CDNs (Content Distribution Networks) for pushing data to the edge • Heavy-duty edge servers replicate content
Goals • Design an architecture for the composition of different, non-interoperable multicast/broadcast domains to provide an end-to-end one-to-many packet delivery service. • Design and implement a high performance (clustered) broadcast gateway for the above architecture
Requirements • Intra-domain protocol independence (both app-layer and IP-layer) • Should be easily customizable for each specific multicast protocol. • Scalable (throughput, number of sessions) • Should not distribute info about sessions to entities not interested in those sessions. • Should use available multicast capability wherever possible.
Basic Design • Broadcast Network (BN) – any multicast-capable network/domain/CDN • Broadcast Gateway (BG) • Bridges between 2 BNs • Explicit BG peering • Overlay of BGs • Analogous to BGP routers • App-level • Works for both app-layer and IP-layer protocols • Less efficient link usage, and more delay • Commodity hardware • Easier customizability and deployability • Less efficient than dedicated hardware [Figure: a source, clients, BGs and BNs connected by peering and data links]
Naming • Session Owner BN • Facilitates shared-tree protocols • Address space limited only by individual BNs’ naming protocols • Session Description • Owner BN • Session name in owner BN • Options • Metrics – hop-count, latency, bandwidth, etc. • Transport – best-effort, reliable, etc. • Number of sources – single, multiple • URL-style: bin://Owner_BN/native_session_name?pmtr=value&pmtr2=value2…
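As an illustration of the URL-style naming above, here is a minimal C++ sketch of how a BG might split such a session name into its owner BN, native session name, and options; the struct and function names are hypothetical, not taken from the actual implementation.

// Hypothetical sketch: parsing a "bin://" session name into its parts.
// Field and function names are illustrative, not from the real implementation.
#include <iostream>
#include <map>
#include <string>

struct SessionDesc {
    std::string ownerBN;                        // BN that owns the session
    std::string nativeName;                     // session name inside the owner BN
    std::map<std::string, std::string> options; // metric, transport, #sources, ...
};

bool parseSessionURL(const std::string& url, SessionDesc& out) {
    const std::string scheme = "bin://";
    if (url.compare(0, scheme.size(), scheme) != 0) return false;
    std::string rest = url.substr(scheme.size());

    std::string::size_type slash = rest.find('/');
    if (slash == std::string::npos) return false;
    out.ownerBN = rest.substr(0, slash);

    std::string::size_type q = rest.find('?', slash + 1);
    out.nativeName = rest.substr(slash + 1,
        q == std::string::npos ? std::string::npos : q - slash - 1);

    if (q != std::string::npos) {                     // parse pmtr=value pairs
        std::string opts = rest.substr(q + 1);
        std::string::size_type pos = 0;
        while (pos < opts.size()) {
            std::string::size_type amp = opts.find('&', pos);
            std::string pair = opts.substr(pos,
                amp == std::string::npos ? std::string::npos : amp - pos);
            std::string::size_type eq = pair.find('=');
            if (eq != std::string::npos)
                out.options[pair.substr(0, eq)] = pair.substr(eq + 1);
            pos = (amp == std::string::npos) ? opts.size() : amp + 1;
        }
    }
    return true;
}

int main() {
    SessionDesc s;  // hypothetical example session name
    if (parseSessionURL("bin://bn1.example.net/concert42?metric=latency&transport=reliable", s))
        std::cout << s.ownerBN << " / " << s.nativeName
                  << " (" << s.options.size() << " options)\n";
}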
B-Gateway components 3 loosely coupled components – • Routing – for “shortest” unicast routes towards sources • Tree building – for “shortest” path distribution tree. • Data forwarding – to send data efficiently across tree edges. • NativeCast interface – interacts with local broadcast capability
Routing • Peer BGs exchange BN/BG-level reachability info • Path-vector algorithm • Different routes for different metrics/options • e.g. BN-hop-count + best-effort + multi-source, latency + reliable, etc. • Session-agnostic • Avoids all BNs knowing about all sessions • BG-level selectivity available using SROUTEs • Policy hooks can be applied to such a protocol
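To make the session-agnostic, per-metric routing state concrete, here is a hedged C++ sketch of what a path-vector route entry and its install check could look like; the types and field names are illustrative assumptions, not the paper's data structures.

// Hypothetical sketch of per-metric, session-agnostic route state kept by a BG.
// A route is advertised per (destination BN, metric/option class), carrying a
// BN-level path vector used for loop detection, as in BGP-style path-vector routing.
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Route {
    std::string destBN;               // reachable Broadcast Network
    std::string metricClass;          // e.g. "hopcount+besteffort+multisource"
    std::vector<std::string> bnPath;  // BN-level path vector (loop detection)
    unsigned cost;                    // cost under this metric
};

// Routing table: (destination BN, metric class) -> best known route.
typedef std::map<std::pair<std::string, std::string>, Route> RouteTable;

// Accept an advertisement from a peer BG if it is loop-free and improves the cost.
bool maybeInstall(RouteTable& table, const Route& adv, const std::string& myBN) {
    if (std::find(adv.bnPath.begin(), adv.bnPath.end(), myBN) != adv.bnPath.end())
        return false;  // our own BN already appears in the path: loop
    std::pair<std::string, std::string> key(adv.destBN, adv.metricClass);
    RouteTable::iterator it = table.find(key);
    if (it == table.end() || adv.cost < it->second.cost) {
        table[key] = adv;
        return true;
    }
    return false;
}

int main() {
    RouteTable table;
    Route r;
    r.destBN = "bn3"; r.metricClass = "hopcount+besteffort"; r.cost = 2;
    r.bnPath.push_back("bn2"); r.bnPath.push_back("bn3");
    maybeInstall(table, r, "bn1");  // installed: loop-free and table was empty
}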
Distribution Trees • One reverse shortest-path tree per session • Tree rooted at owner BN • “Soft” BG tree state: (Session : Upstream node : list of downstream nodes) • Can be bi-directional • Fine-grained selectivity using SROUTE messages before JOIN phase [Figure: distribution-tree example – as clients C1, C2, C3 JOIN, BGs B1–B5 build per-session state of the form (Session:Parent:Child1,Child2,…), e.g. (S1:B1:N,B3); legend: peering link, JOIN, BG and its tree state, N = NativeCast, client/mediator]
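A minimal C++ sketch of the soft (Session : Upstream : Downstream-list) tree state described above, with periodic expiry standing in for soft-state refresh; names and the timeout policy are illustrative, not the actual implementation.

// Hypothetical sketch of the per-session soft tree state a BG might keep:
// (session : upstream node : list of downstream nodes), refreshed by periodic
// JOINs and expired otherwise.  Names and timeout are illustrative.
#include <ctime>
#include <map>
#include <set>
#include <string>

struct TreeEntry {
    std::string upstream;              // parent BG (or "N" for local NativeCast)
    std::set<std::string> downstream;  // child BGs / local NativeCast
    std::time_t lastRefresh;           // soft state: entry expires if not refreshed
};

class TreeState {
public:
    // A JOIN from 'child' for 'session', forwarded towards 'parent'.
    void onJoin(const std::string& session, const std::string& parent,
                const std::string& child) {
        TreeEntry& e = entries_[session];
        e.upstream = parent;
        e.downstream.insert(child);
        e.lastRefresh = std::time(0);
    }
    // Drop entries that have not been refreshed recently (soft state).
    void expire(std::time_t now, std::time_t timeout) {
        for (std::map<std::string, TreeEntry>::iterator it = entries_.begin();
             it != entries_.end(); ) {
            if (now - it->second.lastRefresh > timeout) entries_.erase(it++);
            else ++it;
        }
    }
private:
    std::map<std::string, TreeEntry> entries_;
};

int main() {
    TreeState t;
    t.onJoin("S1", "B1", "B3");        // B3 joins S1 through us, towards B1
    t.expire(std::time(0) + 120, 60);  // entries older than 60s are dropped
}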
Mediator • How does a client tell a BG that it wants to join a session? • Client in owner BN: no interaction with the federation • Client not in owner BN: needs to send a JOIN to a BG in its BN • BNs are required to implement the Mediator abstraction, for sending JOINs for sessions to BGs • Modified clients which send JOINs to BGs • Well-known Mediator IP Multicast group • Routers or other BN-specific aggregators • Can be part of the NativeCast interface
Data Forwarding • Decouples control from data… • …and control nodes from data nodes • TRANSLATION messages carry data-path addresses per session • e.g. TCP/UDP/IP Multicast address + port • e.g. a transit SSM network might require 2+ channels to be set up for one session • Label negotiation, for fast forwarding • Can be piggy-backed on JOINs [Figure: JOIN and TRANSLATION messages between BGs P1, P2, P3 and client C1, exchanging per-session data-path addresses such as UDP:IP1,Port1 and IPM:IPm1,Portm1, with tree state like (S1:L:P2)]
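A rough sketch, in C++, of what a TRANSLATION message carrying per-session data-path addresses and a negotiated label might contain; the message layout shown is an assumption for illustration only.

// Hypothetical sketch of a TRANSLATION message: for one session it tells the
// upstream peer where (and how) to send the data stream, plus a negotiated
// label for fast forwarding.  Field names are illustrative.
#include <stdint.h>
#include <string>
#include <vector>

enum DataProto { PROTO_UDP, PROTO_TCP, PROTO_IP_MULTICAST };

struct DataPathAddr {
    DataProto proto;      // UDP / TCP / IP Multicast
    std::string address;  // unicast address of a data node, or a multicast group
    uint16_t port;
};

struct TranslationMsg {
    std::string session;                // session this applies to
    uint32_t label;                     // negotiated label for fast forwarding
    std::vector<DataPathAddr> targets;  // one or more data-path endpoints
                                        // (e.g. a transit SSM BN may need 2+)
};

int main() {
    TranslationMsg m;
    m.session = "S1";
    m.label = 42;
    DataPathAddr a;
    a.proto = PROTO_UDP; a.address = "10.0.0.5"; a.port = 9000;
    m.targets.push_back(a);  // advertise one UDP data-path endpoint
}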
Clustered BG design • 1 control node + ‘n’ data nodes • Control node: routing + tree-building • Independent data paths flow directly through data nodes • TRANSLATION messages contain IP addresses of data nodes in the cluster • Throughput bottlenecked only by the IP router/NIC • “Soft” data-forwarding state at data nodes [Figure: clustered BG – sources in BN1, BG1 (control node C1, data nodes D11, D12) peering with BG2 (C2, D21, D22); control messages flow between control nodes while IPMul/CDN data streams flow through the data nodes (Dxx, or Dnodes) to receivers in BN2]
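The slides do not spell out how the control node assigns sessions to data nodes; purely as an illustration, the following C++ sketch shows one plausible least-loaded placement whose chosen Dnode address would then be advertised in TRANSLATION messages.

// Purely illustrative sketch (not from the paper): the control node must map
// each session to one of the cluster's data nodes and advertise that Dnode's
// address in TRANSLATION messages; a least-loaded choice is shown here.
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Dnode {
    std::string ip;     // address advertised in TRANSLATION messages
    unsigned sessions;  // number of sessions currently assigned
};

class SessionPlacer {
public:
    explicit SessionPlacer(const std::vector<Dnode>& nodes) : nodes_(nodes) {}

    // Return the IP of the data node chosen for this session.
    std::string place(const std::string& session) {
        std::map<std::string, std::size_t>::iterator it = placement_.find(session);
        if (it != placement_.end()) return nodes_[it->second].ip;  // already placed
        std::size_t best = 0;
        for (std::size_t i = 1; i < nodes_.size(); ++i)
            if (nodes_[i].sessions < nodes_[best].sessions) best = i;
        nodes_[best].sessions++;
        placement_[session] = best;
        return nodes_[best].ip;
    }
private:
    std::vector<Dnode> nodes_;
    std::map<std::string, std::size_t> placement_;
};

int main() {
    std::vector<Dnode> nodes(2);
    nodes[0].ip = "10.0.0.11"; nodes[0].sessions = 0;
    nodes[1].ip = "10.0.0.12"; nodes[1].sessions = 0;
    SessionPlacer placer(nodes);
    placer.place("S1");  // -> "10.0.0.11"
    placer.place("S2");  // -> "10.0.0.12"
}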
NativeCast • Encapsulates all BN-specific customization • Interface to local broadcast capability • Send and Receive broadcast data • Allocate and reclaim local broadcast addresses • Subscribe to and unsubscribe from local broadcast sessions • Implement “Mediator” functionality – intercept and reply to local JOINs • Get SROUTE values. • Exists on control and data nodes.
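The NativeCast operations listed above could be captured as an abstract C++ interface roughly like the one below; the method names and signatures are illustrative guesses, not the real per-protocol implementations.

// A minimal sketch of the NativeCast interface described above as a C++
// abstract class; method names and signatures are illustrative assumptions.
#include <cstddef>
#include <string>
#include <vector>

class NativeCast {
public:
    virtual ~NativeCast() {}

    // Send and receive broadcast data on the local broadcast capability.
    virtual bool send(const std::string& localAddr,
                      const char* data, std::size_t len) = 0;
    virtual int receive(const std::string& localAddr,
                        char* buf, std::size_t maxLen) = 0;

    // Allocate and reclaim local broadcast addresses.
    virtual std::string allocateAddress(const std::string& session) = 0;
    virtual void reclaimAddress(const std::string& localAddr) = 0;

    // Subscribe to and unsubscribe from local broadcast sessions.
    virtual bool subscribe(const std::string& localAddr) = 0;
    virtual bool unsubscribe(const std::string& localAddr) = 0;

    // Mediator functionality: intercept local JOINs and reply to them.
    virtual void handleLocalJoin(const std::string& session,
                                 const std::string& client) = 0;

    // Get SROUTE values for a session.
    virtual std::vector<std::string> getSroutes(const std::string& session) = 0;
};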
Implementation • Linux/C++ event-driven program • Best-effort forwarding. • NativeCast implemented for IP Multicast, a simple HTTP-based CDN and SSM. • Each NativeCast implementation ~ 700 lines of code. • Tested scalability of clustered BG (throughput, sessions) using HTTP-based NativeCast. • Used Millennium cluster.
Experimental Setup • No. of sources = no. of sinks = no. of Dnodes (so that sources/sinks don’t become the bottleneck) • 440 Mbps raw TCP throughput • 500 MHz PIIIs; 1 Gbps NICs; >50 Gbps switch • Sources of two types – rate-limited and unlimited • Note: IPMul is based on UDP; CDN is based on HTTP (over TCP) [Figure: same clustered-BG topology as in the design slide]
Results • Vary the number of data nodes, with one session per data node • Near-linear throughput scaling • Gigabit speed achieved • Better with larger message sizes • [Chart: throughput vs. number of data nodes; the maximum (TCP-based) throughput achievable using different data message (framing) sizes is also shown]
Multiple Sessions • Total throughput as the number of sessions is increased to several sessions per Dnode; sources are rate-unlimited • High throughput is sustained when the number of sessions is large • [Charts: throughput vs. number of sessions, with 1 Dnode and with 5 Dnodes]
Multiple Sessions … • Rate-limited sources (<103Kbps). • 5 Dnodes, 1 KB message size. • No significant reduction in throughput.
Future Work • Achieve a large number of sessions + high throughput for large message sizes • Transport-layer modules (e.g. SRM local recovery) • Wide-area deployment?
Links • “Broadcast Federation: An Application Layer Broadcast Internetwork” – Yatin Chawathe, Mukund Seshadri (NOSSDAV’02) http://www.cs.berkeley.edu/~mukunds/bfed/nossdav02.ps.gz • This presentation: http://www.cs.berkeley.edu/~mukunds/bfed/bfed-retreat.ppt
SROUTEs… • …are session-specific routes to the source in the owner BN • All BGs in the owner BN know all SROUTEs for owned sessions • SROUTE-Response gives all SROUTEs • Downstream BGs can cache this value to reduce SROUTE traffic • Downstream BG(s) compute the best target BG in the owner BN and send JOINs towards that BG • JOINs contain SROUTEs received earlier • Session info sent only to interested BNs • Increases initial setup latency [Figure: two-phase join (Phase 1, Phase 2) showing Client JOIN, SROUTE-Request/SROUTE-Response, REDIRECT, JOIN and TRANSLATION messages over peering links]
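As a hedged illustration of the downstream side of this exchange, the following C++ sketch caches an SROUTE-Response and picks the best target BG in the owner BN to send the JOIN towards; the types and cost model are hypothetical.

// Hypothetical sketch: cache the SROUTEs returned by the owner BN's BGs, pick
// the best target BG for the session, and attach the chosen SROUTE to the JOIN.
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Sroute {
    std::string targetBG;  // BG in the owner BN that can reach the source
    unsigned cost;         // cost of the session-specific route
};

class SrouteCache {
public:
    // Store the full SROUTE-Response for a session (cached to cut SROUTE traffic).
    void onSrouteResponse(const std::string& session,
                          const std::vector<Sroute>& routes) {
        cache_[session] = routes;
    }
    // Pick the cheapest target BG in the owner BN; returns false if unknown.
    bool bestTarget(const std::string& session, Sroute& out) const {
        std::map<std::string, std::vector<Sroute> >::const_iterator it =
            cache_.find(session);
        if (it == cache_.end() || it->second.empty()) return false;
        out = it->second[0];
        for (std::size_t i = 1; i < it->second.size(); ++i)
            if (it->second[i].cost < out.cost) out = it->second[i];
        return true;
    }
private:
    std::map<std::string, std::vector<Sroute> > cache_;
};

int main() {
    SrouteCache cache;
    std::vector<Sroute> routes(2);
    routes[0].targetBG = "bgA"; routes[0].cost = 3;
    routes[1].targetBG = "bgB"; routes[1].cost = 1;
    cache.onSrouteResponse("S1", routes);
    Sroute best;
    cache.bestTarget("S1", best);  // best.targetBG == "bgB"
}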
More Results • Varied data message size from 64 bytes to 64 KB • 1 Dnode • Clearly, larger message sizes are better • Due to forwarding overhead – memcpys, syscalls, etc.
Some More Results • Used 796 MHz PIIIs as Dnodes • Varied no. of Dnodes, single session per Dnode • Achieved Gigabit-plus speeds with 4 Dnodes.