
Internet2 Netflow data analysis


Presentation Transcript


  1. Internet2 Netflow data analysis Malathi Veeraraghavan & Zhenzhen Yan University of Virginia

  2. Outline • Problem statement • Software architecture • Solution approach • Findings

  3. Problem statement • Can long flows be identified automatically at PE routers and redirected to dynamic circuits across the SDN? • Implement prototype and demonstrate on DOE ANI Prototype • Hybrid Network Engineering Software (HYNES)

  4. Big picture: vision for how HYNES could be used (if it all works) ESnet Provider Edge (PE) router dynamic circuit (long flows)

  5. Outline • Problem statement • Software architecture • Solution approach • Findings

  6. Hybrid Network Engineering Software (HYNES) • MFDB: Monitored Flow Data Base • Some components can be centralized and rest distributed • Centralized: OFAT, IDC interface module, user-interface module, initialization module

  7. Components • Offline Analysis Tool (OFAT): statistical R programs to identify which flows are long • Most challenging component • Leverage “human knowledge” about large file transfer servers and applications • Populate MFDB • Monitored flow data base (MFDB)

  8. Components contd. • Packet header processing module • receives packets for flows in MFDB • initiates the reservation and provisioning of a circuit • initiates circuit release when packet flow “ends” • IDC interface • interfaces with ESnet’s Inter-Domain Controller (IDC) • User-interface module • supports human and programmatic interface to MFDB • Router control interface module • set PBR for MFDB flows to mirror packets to HYNES server • set PBR route to redirect packets from default IP-routed path to newly established circuits, and reset when done

  9. Coming back to the key problem • Can long flows be identified automatically at PE routers and redirected to dynamic circuits across the SDN?

  10. Outline • Problem statement • Software architecture • Solution approach • Findings

  11. Solution methodology I. Netflow data (analysis with R programs) II. Network requirements workshop reports (“human knowledge” mining) Long flows separated by apps IP addresses for scientific computing (data transfers) servers III. Understanding applications (with tcpdump, talking to developers: SCP, SFTP, GridFTP, BBCP) Goal: Identify suitable candidate flows for the MFDB

  12. Track I: Netflow data analysis • Methodology: • Download Netflow data from Internet2 • Use flow-export tools to get an ASCII file • Shows the 5-tuple, byte count, and timestamps of the first and last packet in each flow • Statistical package R programs: • Find flow lengths and isolate flows of length ≥ 59 sec from each 5-min file • Concatenate flows from all 5-minute files in one day (one week) • Gaps (1-in-100 sampling): 5-minute gaps acceptable • “Definition” of “long flow”: >= 10 minutes • Output: all flows longer than 10 minutes
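The per-file filtering and concatenation steps above were implemented as R programs; the following Python sketch illustrates the same logic. The record format, field ordering, and thresholds other than those stated on the slide (59 sec per file, 5-minute gap, 10-minute long-flow definition) are assumptions.

```python
from collections import defaultdict

MIN_PER_FILE_SEC = 59    # per-5-min-file length threshold (from the slides)
MAX_GAP_SEC = 300        # acceptable gap between sightings (5 minutes)
LONG_FLOW_SEC = 600      # "long flow" definition: >= 10 minutes

def concatenate(sightings):
    """Merge per-file sightings of the same 5-tuple into flows.

    `sightings` is a list of (flow_key, start_sec, end_sec, nbytes)
    tuples, one per 5-minute Netflow file (an assumed representation).
    Sightings separated by more than MAX_GAP_SEC start a new flow.
    Returns only the concatenated flows lasting >= LONG_FLOW_SEC.
    """
    flows = defaultdict(list)  # flow_key -> [(start, end, bytes), ...]
    for key, start, end, nbytes in sorted(sightings,
                                          key=lambda s: (s[0], s[1])):
        if end - start < MIN_PER_FILE_SEC:
            continue  # drop records below the per-file threshold
        runs = flows[key]
        if runs and start - runs[-1][1] <= MAX_GAP_SEC:
            # gap is acceptable: extend the previous run
            prev = runs[-1]
            runs[-1] = (prev[0], max(prev[1], end), prev[2] + nbytes)
        else:
            runs.append((start, end, nbytes))
    return [(k, s, e, b)
            for k, rs in flows.items() for (s, e, b) in rs
            if e - s >= LONG_FLOW_SEC]
```

For example, two sightings of the same 5-tuple 200 seconds apart are merged into one flow, while a 400-second gap splits them into two separate (and here, short) flows.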

  13. Methodology contd. • Sort long flows by protocol number and save only TCP, GRE, ESP, AH, and IP-in-IP flows (ICMP and UDP removed); print statistics • Sort on the IP protocol field and src/dst ports, and separate out flows for different applications into different files
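The protocol and application sorting step can be sketched as follows. The protocol numbers are the standard IANA values, and the GridFTP port range (50000-51000) is stated later in the deck, but the port-to-application map and flow field names are illustrative assumptions.

```python
# IANA protocol numbers for the protocols kept by the analysis
KEEP_PROTOCOLS = {6: "tcp", 47: "gre", 50: "esp", 51: "ah", 4: "ip-in-ip"}
# Assumed port-to-application map for the apps named in this deck
APP_PORTS = {22: "ssh", 873: "rsync", 388: "unidata-ldm"}

def classify(flow):
    """Return (protocol_name, app_name) for a flow dict with keys
    'proto', 'sport', 'dport', or None if the flow is discarded
    (e.g. ICMP, UDP)."""
    proto = KEEP_PROTOCOLS.get(flow["proto"])
    if proto is None:
        return None
    if proto != "tcp":
        return (proto, "encapsulated")
    for port in (flow["sport"], flow["dport"]):
        if port in APP_PORTS:
            return (proto, APP_PORTS[port])
        if 50000 <= port <= 51000:
            # GridFTP data channels in the range noted on slide 16
            return (proto, "gridftp")
    return (proto, "other")
```

Flows would then be appended to per-application output files keyed on the returned label.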

  14. Next steps • Check whether the n-tuple (n <= 5) used to identify a long flow also occurs in short flows; if it occurs often, this flow descriptor cannot be placed in the MFDB • Check for repeated occurrences of a flow on different days • Sensitivity analysis on the acceptable-gap parameter (currently 5 minutes) • Look for temporal patterns in flow arrival times for forecasting
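The first check above — whether a candidate descriptor also matches many short flows — can be sketched as a simple collision-rate computation. This is a hypothetical illustration; the field names and the idea of a rate threshold are assumptions, not the deck's stated implementation.

```python
def collision_rate(descriptor, flows, long_threshold=600):
    """Fraction of flows matching `descriptor` that are short.

    `descriptor` is a dict holding the n (<= 5) fields used for
    matching; `flows` is a list of flow dicts with a 'duration' key
    (an assumed representation). A high rate means the descriptor
    would misdirect many short flows and should not enter the MFDB.
    """
    matches = [f for f in flows
               if all(f.get(k) == v for k, v in descriptor.items())]
    if not matches:
        return 0.0
    short = sum(1 for f in matches if f["duration"] < long_threshold)
    return short / len(matches)
```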

  15. Track II: Mining “human knowledge” • Methodology: • For each report (NP, BES, BER, FES, ASCR) • For each project, • Determine whether users move data with file transfer applications or just take home the data collected from instruments on DVDs • For each participating institution (ESnet sites and major universities), access the high-performance computing web site • Look for servers dedicated to data transfers • Identify the applications run (scp, sftp, bbcp, GridFTP, RFT, etc.) • Use ping to find IP addresses • Use arin.net to find IP address space allocations for participating universities

  16. Track III: Understanding apps • SCP – learned about the HPN-SSH patch • receive buffer resizing • can disable payload encryption • GridFTP • Data flows: port numbers in the 50000-51000 range • obtained a tcpdump of a GridFTP session • obtained globus-url-copy (GridFTP client) output with debug enabled

  17. Outline • Problem statement • Software architecture • Solution approach • Findings

  18. Track I: Netflow analysis • CHIC and LOSA routers of Internet2 • One-day data analysis • 5-day (Mon-Fri) analysis

  19. Unidata one-day (HYNES): a 14-minute flow between NCAR and Michigan State University

  20. Top ten fat flows in one day: encapsulated: 3; ssh: 5; Unidata: 1; rsync: 1. Two long ssh flows to the University of Texas at Austin Texas Advanced Computing Center (129.114.48.0) from Fermilab (131.225.192.0). 141.142.24.0 corresponds to NCSA (National Center for Supercomputing Applications) for two other ssh flows. The Unidata LDM flow is from NCAR (National Center for Atmospheric Research), address 128.117.136.0.

  21. Data on a per-day basis for the five-weekday period (July 6-10, 2009). Fattest flow: 26882 sec ≈ 7.5 hours (remember this is 1-in-100 sampled data)

  22. Data for a five-weekday period (July 6-10, 2009)

  23. Repeat customers: ssh long flows

  24. A GridFTP Example (from CHIC) • Between University of Nebraska-Lincoln and Rutherford Appleton Laboratory • Appears to be a parallel transfer (though we cannot be sure it is not striped, because of the anonymized addresses)

  25. Correlation between SSH flows and ESP flows • Looked for correlation between long SSH flows and ESP flows (IP protocol number = 50) • Found none • Hypothesis: If HPN-SSH is used, then SSH flows (port 22) will likely be short and port 22 long flows could be long scp transfers • If regular ssh is used, the long flow from an scp transfer will be an ESP flow, but the SSH flow will be short
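The correlation test above — matching long SSH flows against ESP (IP protocol 50) flows between the same hosts at the same time — can be sketched as follows. The flow field names and the time-overlap criterion are assumptions; the deck does not state how correlation was computed.

```python
def correlated_pairs(ssh_flows, esp_flows):
    """Pair each long SSH flow with any ESP flow between the same two
    hosts that overlaps it in time. Flows are dicts with 'src', 'dst',
    'start', 'end' keys (an assumed representation)."""
    pairs = []
    for s in ssh_flows:
        for e in esp_flows:
            same_hosts = {s["src"], s["dst"]} == {e["src"], e["dst"]}
            overlaps = s["start"] < e["end"] and e["start"] < s["end"]
            if same_hosts and overlaps:
                pairs.append((s, e))
    return pairs
```

An empty result over the dataset would support the slide's finding that no such correlation exists.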

  26. Findings from track II • Teragrid servers: 15 sites (server names and IP addresses found) • ESG data grid servers found • So far: NP and BES reports studied • Number of servers found so far: 51 (BES: single) + 15 (BES: ranges) + 39 (NP) • Some IP address ranges used for participating institutions

  27. “Match” rate between Track I and Track II • Percent of flows for which the src or dst IP address matches one of the server addresses found from the Track II study of science projects • Number of long flows in CHIC is 33717 • Number of long flows in LOSA is 27279
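The match rate defined above can be computed directly: the percent of long flows whose source or destination address falls in the Track II server address set. This is a minimal sketch with assumed field names; in practice the comparison would be against address ranges, not exact addresses, given the anonymization noted later.

```python
def match_rate(long_flows, server_addrs):
    """Percent of long flows whose src or dst address is in the set of
    server addresses found from the Track II study."""
    if not long_flows:
        return 0.0
    matched = sum(1 for f in long_flows
                  if f["src"] in server_addrs or f["dst"] in server_addrs)
    return 100.0 * matched / len(long_flows)
```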

  28. Summary • Ready to provide ESnet the R programs and shell scripts • Track II: FES and BES report analysis • Track III application pattern recognition (scp, sftp, bbcp, GridFTP) • Run Internet2 Netflow analysis for more days and at all routers

  29. Thoughts for ESnet Netflow data analysis • qsub problem: large jobs are submitted in batch mode on a front-end; the actual servers involved are not advertised on science-project web sites • Ask sites for GridFTP/scp/bbcp/sftp server addresses (cluster nodes) • UVA provides ESnet with R programs that extract just these long flows; ESnet runs them and returns to UVA only the long-flow data for matched server addresses

  30. Contd. • Anonymization problem: it will make MFDB candidacy determination difficult. Need to know that the flow concatenation process is not accidentally merging two different flows (e.g., the GridFTP case) • Need to estimate the return on investment: the percent of bytes that are candidates for handling on the SDN
