EtherRake is a centralized structure for monitoring and diagnosing large networks of switches in data centers and enterprise networks. It collects information such as neighbor data and forwarding tables, constructs a logical topology, and detects problems like STP errors and end-to-end connectivity issues.
EtherRake: Diagnosis and Monitoring in Data Center & Enterprise Networks • Lab for Internet and Security Technology (LIST), Northwestern Univ.
General Idea of EtherRake • Problem statement: Emerging data center (DC) and enterprise networks consist mainly of a large number of switches, which need monitoring and diagnosis.
General Idea of EtherRake • A centralized structure. • Collector at each switch • Collect neighbor information • Collect port information • Collect forwarding tables • Monitor Plane • Transmit collected information • Processing Center • Link the frames • Construct the logical topology • Find the problems
Collector at each switch • Take Cisco switches as an example • Port information • show port status (display interface ethernet0/1 for Huawei) • Neighbor information • show CDP neighbors • Forwarding tables (aka switch tables) • show MAC – interface mapping
Collector at each switch • Port information • Port number: 2 bytes • Status: 4 bits • Total: 3 bytes * 100 ports = 300 bytes < 0.4 KB per switch • Neighbor information • MAC address: 48 bits • Total: 6 bytes * 100 ports = 600 bytes < 0.6 KB per switch • Forwarding tables • To be decided. We are not using them in our approach now. We can transfer updates only, which means that normally we don’t need to transfer anything. • Total: 1 KB * 1024 (number of switches) = 1 MB in one round.
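The size estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the constant and function names are illustrative, and the 100 ports per switch and 1024 switches are the figures used on the slide, not measured values.

```python
# Record sizes taken from the slide; all names here are illustrative.
PORTS_PER_SWITCH = 100
NUM_SWITCHES = 1024

PORT_NUMBER_BITS = 16   # 2 bytes
PORT_STATUS_BITS = 4
MAC_ADDRESS_BITS = 48


def bytes_for(bits: int) -> int:
    """Round a bit count up to a whole number of bytes."""
    return (bits + 7) // 8


# 16 + 4 = 20 bits, which rounds up to 3 bytes per port record.
port_record = bytes_for(PORT_NUMBER_BITS + PORT_STATUS_BITS)
# One 48-bit MAC address per neighbor record.
neighbor_record = bytes_for(MAC_ADDRESS_BITS)

port_bytes = port_record * PORTS_PER_SWITCH
neighbor_bytes = neighbor_record * PORTS_PER_SWITCH
per_switch = port_bytes + neighbor_bytes

# The slide rounds each switch up to 1 KB before multiplying.
per_round = 1024 * NUM_SWITCHES

print(port_bytes, neighbor_bytes, per_switch, per_round)
```

The per-switch payload stays under the 1 KB budget, so one full collection round is bounded by 1 MB when forwarding tables are not transferred.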
Collector at each switch • Synchronization • Cristian's algorithm (P is the processing center, and S is a collector) • P requests the time from S • After receiving the request from P, S prepares a response and appends the time T from its own clock. • P then sets its time to T + RTT/2 • Multiple measurements can reduce the error. • Accuracy: when the response arrives, the clock at S reads between (T + min) and (T + RTT - min), where min is the minimum one-way time.
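The synchronization steps above can be sketched as follows. This is a minimal illustration, not EtherRake's implementation: `request_time` stands in for the real exchange between P and S, the function names are hypothetical, and keeping the round with the smallest RTT is one common way to use multiple measurements.

```python
import time
from typing import Callable, Tuple


def cristian_estimate(request_time: Callable[[], float],
                      local_clock: Callable[[], float] = time.monotonic,
                      rounds: int = 5) -> Tuple[float, float]:
    """Run several request/response rounds and keep the one with the
    smallest RTT. Returns (T + RTT/2, RTT) for that round."""
    best = None
    for _ in range(rounds):
        sent = local_clock()
        remote_t = request_time()      # time T reported by S
        received = local_clock()
        rtt = received - sent
        if best is None or rtt < best[1]:
            best = (remote_t + rtt / 2, rtt)
    return best


def error_bound(rtt: float, min_one_way: float) -> float:
    """Half-width of the interval [T + min, T + RTT - min] around T + RTT/2."""
    return rtt / 2 - min_one_way
```

With an RTT of 0.1 s and a minimum one-way time of 0.02 s, `error_bound` gives a worst-case error of about 0.03 s on either side of the estimate.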
Monitor Plane • The monitor plane co-exists with the data plane and the control plane in the same channel. It is used to transfer monitoring data. • [Figure: Monitor Plane, Control Plane, and Data Plane, with arrows labeled Assist, Adjust, Monitor, and Control]
Monitor Plane • The monitor plane collects data for monitoring the data plane. • Switching in the monitor plane uses two methods. • Normally, the control plane assists monitor plane forwarding. • Under errors, the monitor plane falls back to flooding.
Processing Center • Collect port information, forwarding tables and neighbor information from all the switches. • Construct the logical topology of switches based on the port & neighbor info • Detect loops in the logical topology for STP loop problems • Check for any missing/dead switches
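The processing-center steps above can be sketched with a small graph routine. This is a sketch under assumptions: the input format (switch name mapped to the neighbors it reports over forwarding ports) and all function names are hypothetical, and parallel links between the same two switches collapse into one edge.

```python
from typing import Dict, FrozenSet, Iterable, Set


def build_topology(reports: Dict[str, Iterable[str]]) -> Set[FrozenSet[str]]:
    """Turn per-switch neighbor reports into a set of undirected links."""
    edges = set()
    for switch, neighbors in reports.items():
        for neighbor in neighbors:
            if neighbor != switch:
                edges.add(frozenset((switch, neighbor)))
    return edges


def has_loop(edges: Set[FrozenSet[str]]) -> bool:
    """Union-find cycle check: a link whose ends are already connected
    closes a loop in the logical topology."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]   # path halving
            node = parent[node]
        return node

    for edge in edges:
        a, b = tuple(edge)
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return True
        parent[root_a] = root_b
    return False


def missing_switches(expected: Iterable[str],
                     reports: Dict[str, Iterable[str]]) -> Set[str]:
    """Switches that should exist but sent no report this round."""
    return set(expected) - set(reports)
```

If only forwarding ports are reported, a healthy spanning tree yields `has_loop(...) == False`, and any loop found points to an STP problem.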
Problems to Solve • STP Error Detection • End-to-end Error Detection • Other Hardware/Software Errors of Switches and Their Detection • TRILL Potential Problems
End-to-end Connectivity Monitoring • Based on the neighbor and port information, check whether all switches and end hosts are on a connected spanning tree (ST). • End hosts are also neighbors of the leaf switches. • The forwarding table also records information about past connectivity.
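The connectivity check above amounts to a graph traversal. The sketch below assumes a hypothetical input format in which each switch maps to the switches and end hosts it reports as neighbors; the function name is illustrative.

```python
from collections import deque
from typing import Dict, Iterable, Set


def unreachable_nodes(reports: Dict[str, Iterable[str]], root: str) -> Set[str]:
    """Breadth-first search from `root`; returns every switch or end host
    that is not connected to it."""
    # Make the reported links symmetric so hosts that send no report
    # still appear as neighbors of their leaf switch.
    adjacency: Dict[str, Set[str]] = {}
    for node, neighbors in reports.items():
        adjacency.setdefault(node, set())
        for neighbor in neighbors:
            adjacency[node].add(neighbor)
            adjacency.setdefault(neighbor, set()).add(node)
    adjacency.setdefault(root, set())

    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return set(adjacency) - seen
```

An empty result means every reported switch and end host sits on one connected component.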
Other Software Errors of Switches and Their Detection • One-way link problem. No backward frames. • From EtherRake’s view, the interface in the other direction is dead. • Deferred frames. The buffer is full, so frames have to be dropped. • Encode the buffer status (e.g., full) in the status bits • Links between switches and routers disabled or inactive. • Detected by the port status bits or the lack of a heartbeat • Switches down, e.g., unbootable IOS problems • Same as above
Limitations in Detecting Other Switch Software Errors • Some errors have to be detected at the data plane or application plane. • VLAN problems: hosts in the same VLAN cannot communicate with each other.
Hardware Errors of Switches and Their Detection • Switch port errors. • Switch module errors. • Both will be detected by the port status reports
STP Errors (1) • Count to Infinity when removing the root
STP Errors (2) • Forwarding Loops • BPDU Loss Induced Forwarding Loops. If the blocked port fails to receive BPDUs from its peer bridge for an extended period of time, it may start forwarding data.
STP Errors (3) • Forwarding Loops • MaxAge Induced Forwarding Loops (MaxAge = 6)
STP Errors (4) • Forwarding Loops • Count to Infinity Induced Forwarding Loops • Pollution of Forwarding Tables
Previous STP Error Detection • EtherFuse (SIGCOMM '07) • Plug a fuse into the Ethernet • Remaining problems • Where to plug it? • How many do we need?
Previous STP Error Detection • Cisco prevention methods • Loop Guard. Prevents loops induced by BPDU loss.
Some Existing Solutions • Cisco Discovery Protocol (CDP) • Discovers Cisco devices in the neighborhood • Monitors the liveness of neighboring nodes • Limitations • No detailed status report for diagnosis • Limited to one hop. • Cisco Unidirectional Link Detection (UDLD). • Detects the one-way link problem.
General Monitoring Metrics for Detection • Connectivity. Based on the tree of linked frames, EtherRake can determine the connectivity of a path. • Delay. EtherRake can link frames and calculate the time spent at each switch. • Throughput. EtherRake can calculate throughput from the collected frames.
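The delay and throughput metrics above could be computed roughly as follows. This is a sketch under assumptions: a linked trace is modeled as a list of (switch, timestamp) observations of one frame on synchronized clocks, each difference covers one switch plus its outgoing link, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def per_hop_delays(trace: Sequence[Tuple[str, float]]) -> List[Tuple[str, str, float]]:
    """Delay between consecutive observations of the same frame.

    Each result is (from_switch, to_switch, seconds).
    """
    return [
        (src, dst, t_dst - t_src)
        for (src, t_src), (dst, t_dst) in zip(trace, trace[1:])
    ]


def throughput_bps(frame_sizes_bytes: Sequence[int], window_seconds: float) -> float:
    """Bits per second carried by the collected frames in a time window."""
    return 8 * sum(frame_sizes_bytes) / window_seconds
```

For example, a frame seen at s1 at 0.0 s, s2 at 0.5 s, and s3 at 2.0 s spent 0.5 s on the first hop and 1.5 s on the second.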
TRILL Potential Problems • Routing loops • Caused by inconsistent views of network topology. • Mitigated using hop count • Scalability issue: • No clear idea on how much TRILL can scale
Detection of STP Errors by EtherRake • Find STP errors with EtherRake. • Link collected frames into traces • Detect frame forwarding loops • Leverage the switch and ARP table information • Challenges • Scalability: optimize the collection of traces • Ambiguity and accuracy: frame linking
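Once frames are linked into a trace, a forwarding loop shows up as a switch that the same frame visits twice. The sketch below assumes a trace is simply the ordered list of switches that observed the frame; the function name is hypothetical, and frame linking itself (the hard part named under Challenges) is not shown.

```python
from typing import List, Optional, Sequence


def find_forwarding_loop(trace: Sequence[str]) -> Optional[List[str]]:
    """Return the first loop in a frame's trace, from the first visit of
    the repeated switch to its second visit, or None if there is none."""
    first_seen = {}
    for index, switch in enumerate(trace):
        if switch in first_seen:
            return list(trace[first_seen[switch]:index + 1])
        first_seen[switch] = index
    return None
```

A trace such as s1, s2, s3, s2 yields the loop s2, s3, s2, which identifies the switches involved.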
End-to-end Connectivity Monitoring • Diagnose Connectivity Problem from A to B by EtherRake • Find the frames that are on the way from A to B. • Link the frames and find a path. • Locate the problem.
IP Router Errors – OSPF (1) • Network Convergence Time. The time taken by all the OSPF routers in the network to go back to steady state operations after there is a change in the network state.
IP Router Errors – OSPF (2) • Routing Load on Processors
IP Router Errors – OSPF (3) • Route Flaps. Routing table changes in a router, usually in response to a network failure or a recovery.
Cisco Solution • Bidirectional Forwarding Detection (BFD) • Tries to speed up network convergence (three parts). • Failure detection: the speed with which a device on the network can detect and react to a failure of one of its own components, or the failure of a component in a routing protocol peer. • Information dissemination: the speed with which the failure in the previous stage can be communicated to other devices in the network. • Repair: the speed with which all devices on the network, having been notified of the failure, can calculate an alternate path through which data can flow.
IP Router Errors – DHCP • DHCP problem • Configuration problem. • Inability to acquire or renew a lease. • How to keep the same IP address in multi-boot machines?
EtherFuse (1) • An Ethernet fuse that is plugged into the network to monitor the status of the network.
EtherFuse (2) • Detection of Count to Infinity • Tracks the cost to the same root R announced in BPDUs
Detection of Forwarding Loops. • Combination of Passive Sniffing and Active Probing.
Packet View Switching • Forwarding packets from the point of view of the packets. • Each packet keeps a memory of the path it has already traversed and decides which way to go based on that memory. • Here are the steps. (Generally speaking, it is a depth-first search from the packet's point of view.)
Packet View Switching • Normally, when a packet arrives at a switch, it chooses the default port, which is the port that the control plane provides. • If the packet has already tried the default port, it randomly chooses a new port that it has never used. • If the packet has tried every port at this switch, it goes back through the port it came from. • The packet is discarded when it arrives back at its origin and finds no other way to go, or when it arrives at its destination, which is the monitor center.
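The forwarding steps above can be simulated as a depth-first search. This is a sketch under assumptions, not the authors' implementation: ports are identified by the neighbor they lead to, the packet's memory is modeled as local variables rather than a header field, a link is marked as tried in both directions once used, and all names are hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple


def forward_packet(links: Dict[str, List[str]],
                   default_port: Dict[str, str],
                   origin: str,
                   center: str,
                   rng: Optional[random.Random] = None) -> Tuple[bool, List[str]]:
    """Simulate one monitoring packet. Returns (delivered, hops)."""
    rng = rng or random.Random()
    tried = set()            # (switch, neighbor) ports already used
    stack = [origin]         # current path, used for going back
    hops = [origin]          # every switch visited, in order
    while stack[-1] != center:
        here = stack[-1]
        untried = [n for n in links.get(here, []) if (here, n) not in tried]
        if untried:
            # Prefer the control plane's default port, else pick at random.
            preferred = default_port.get(here)
            nxt = preferred if preferred in untried else rng.choice(untried)
            tried.add((here, nxt))
            tried.add((nxt, here))
            stack.append(nxt)
        else:
            # Every port tried: go back through the port we came from.
            stack.pop()
            if not stack:
                return False, hops   # back at the origin with no way out
            nxt = stack[-1]
        hops.append(nxt)
    return True, hops
```

With a wrong default port at the origin, the packet first follows it, backtracks from the dead end, and then reaches the monitor center over the remaining ports.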