
Nfsen + Hadoop

Nfsen + Hadoop. Vytautas Krakauskas (LITNET CERT, Swedbank SIRT). Problems: limited storage capacity; large data set processing time. Storage capacity: steadily increasing network traffic; up to six months of history for incident handling; I/O is the major bottleneck. Processing time.



  1. Nfsen + Hadoop
     Vytautas Krakauskas
     LITNET CERT, Swedbank SIRT

  2. Problems
     • Limited storage capacity
     • Large data set processing time

  3. Storage capacity
     • Steadily increasing network traffic
     • Up to six months of history for incident handling
     • I/O is the major bottleneck

  4. Processing time
     • Currently no SMP support in nfdump
     • Important if I/O bottleneck is resolved

  5. Processing with Nfdump

  6. Distributed processing

  7. The idea
     • Distribute nfcap files between multiple nodes
     • Process the files using nfdump
     • Combine the output and return to nfsen
     • Nfsen and nfdump usage should feel the same
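The three steps of the idea can be sketched end to end in a few lines of Python. This is only a toy model, not the nfdist code: threads stand in for cluster nodes, round-robin assignment stands in for HDFS placement, and a per-source-IP flow count stands in for nfdump. The file names, node names and flow records are all made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def assign_round_robin(files, nodes):
    """Spread capture files across nodes (placeholder for HDFS placement)."""
    plan = {node: [] for node in nodes}
    for i, name in enumerate(files):
        plan[nodes[i % len(nodes)]].append(name)
    return plan


def process_on_node(file_names, read_flows):
    """Stand-in for nfdump on one node: count flows per source IP."""
    counts = Counter()
    for name in file_names:
        for src_ip in read_flows(name):
            counts[src_ip] += 1
    return counts


def run_distributed(files, nodes, read_flows):
    """Distribute the files, process them in parallel, combine the output."""
    plan = assign_round_robin(files, nodes)
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(
            pool.map(lambda names: process_on_node(names, read_flows), plan.values())
        )
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds counts key by key
    return total


# Hypothetical capture files, each reduced to a list of source IPs.
TOY_FLOWS = {
    "nfcapd.201201010000": ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "nfcapd.201201010005": ["10.0.0.2", "10.0.0.3"],
    "nfcapd.201201010010": ["10.0.0.1"],
}

result = run_distributed(sorted(TOY_FLOWS), ["node-a", "node-b"], TOY_FLOWS.__getitem__)
```

The caller only sees `run_distributed`, which mirrors the goal that usage should feel the same as with a single nfdump instance.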

  8. 1. File distribution
     • nfcapd stores files on a temporary file system
       • due to "random" write of stat header
       • copy to HDFS at the end of each interval
       • bonus: limited backup while system is being tested
     • Redundant copies on multiple nodes
       • higher redundancy for faster processing and better reliability
       • lower redundancy for larger storage capacity
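A minimal sketch of the copy step and of the redundancy trade-off, under stated assumptions. The slides do not show how the copy is triggered, so the helper below only builds the standard `hdfs dfs -put` command line for a finished capture file; the paths are hypothetical. The capacity helper ignores HDFS metadata and other overhead, and the 12 TB figure is an arbitrary example.

```python
def hdfs_put_command(local_file, hdfs_dir):
    """Build the command that copies one finished capture file into HDFS."""
    return ["hdfs", "dfs", "-put", local_file, hdfs_dir]


def usable_capacity(raw_tb, replication):
    """Approximate usable space when every block is stored `replication` times."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return raw_tb / replication


# Hypothetical paths: a temporary file system on the collector, a flow
# directory in HDFS.
command = hdfs_put_command("/dev/shm/nfcapd/nfcapd.201201010000", "/flows/live")

# Hypothetical cluster with 12 TB of raw disk.
capacity_rf2 = usable_capacity(12, 2)  # more history, fewer local copies
capacity_rf3 = usable_capacity(12, 3)  # less history, more nodes hold each file
```

The command list could be passed to `subprocess.run(command, check=True)` at the end of each capture interval.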

  9. Modified architecture

  10. 2. Processing
     • Process using nfdump
       • I/O through stdin/stdout
     • Each node works only with locally stored files
       • Currently based on the first block
     • Aggregate when possible based on:
       • stats type, aggregation options, filters
     • Copy the results back to the HDFS for the combiner
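One way to read "each node works only with locally stored files, currently based on the first block" is as a scheduling rule: a file goes to a node that holds a replica of its first block. The sketch below illustrates that rule with a simple least-loaded tie-break. The tie-break, the host names and the block map are assumptions for illustration; the slides do not describe how nfdist picks among several candidate nodes.

```python
def assign_by_first_block(first_block_hosts, nodes):
    """Assign each file to a worker that stores its first block locally.

    first_block_hosts maps a file name to the hosts holding a replica of the
    file's first HDFS block. Among candidate workers the least loaded one
    wins; ties are broken by host name so the result is deterministic.
    """
    load = {node: 0 for node in nodes}
    plan = {node: [] for node in nodes}
    for name in sorted(first_block_hosts):
        candidates = [host for host in first_block_hosts[name] if host in load]
        if not candidates:
            raise ValueError(f"no worker holds the first block of {name}")
        target = min(candidates, key=lambda host: (load[host], host))
        plan[target].append(name)
        load[target] += 1
    return plan


# Hypothetical block locations for four capture files.
BLOCK_MAP = {
    "nfcapd.201201010000": ["node-a", "node-b"],
    "nfcapd.201201010005": ["node-a", "node-b"],
    "nfcapd.201201010010": ["node-b"],
    "nfcapd.201201010015": ["node-a"],
}

plan = assign_by_first_block(BLOCK_MAP, ["node-a", "node-b"])
```

With a higher replication factor more nodes qualify for each file, which gives the scheduler more room to balance the load.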

  11. 3. Combining
     • Combine the results as a single stream
       • a custom tool (nfcat)
       • some information is lost (e.g. ident)
     • nfdump does the final processing
       • single instance (a bottleneck)
     • Displays the results
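The need for a final processing pass can be illustrated with a Top-N query. In the sketch below, which uses made-up counts and is not how nfcat or nfdump are implemented, merging complete per-node counts gives the right answer, while merging per-node lists that were already cut to the top entry does not.

```python
from collections import Counter


def top_n(counts, n):
    """Return the n largest (key, count) pairs, ties broken by key."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


def combine(partials, n):
    """Sum per-node counts, then take the global top n."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return top_n(total, n)


# Hypothetical per-node flow counts per IP address.
node_a = {"10.0.0.1": 5, "10.0.0.2": 4, "10.0.0.3": 4}
node_b = {"10.0.0.4": 5, "10.0.0.3": 4, "10.0.0.2": 1}

# Correct: combine the complete partial results, rank once at the end.
exact = combine([node_a, node_b], 1)

# Incorrect: each node ranks and truncates first, so 10.0.0.3 is dropped
# on both nodes even though it has the most flows overall.
truncated = combine([dict(top_n(node_a, 1)), dict(top_n(node_b, 1))], 1)
```

This is one reason the nodes cannot always reduce their output much before the combiner, and why the final ranking stays with a single instance.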

  12. Modified architecture

  13. Comparison
     • Limited to nfdump
     • Additional delays when using nfsen
     • Original
       • single nfdump instance
       • files on a local file system
     • Distributed
       • Two nodes
       • processes per node: 2
       • HDFS replication factor: 2

  14. Comparison
     • Top 10 IPs, ordered by flows
     • 1-18 files (5-90 minute period)
     • Filter “proto icmp”
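The slides do not show the exact command line used for the test. A plausible equivalent with stock nfdump options (`-R` for a file range, `-s ip/flows` for an IP statistic ordered by flows, `-n` for the number of entries, and the filter as the last argument) could be built as below; the directory and file names are hypothetical. The second helper just restates the file count as a time period, assuming the five-minute capture interval implied by "1-18 files (5-90 minute period)".

```python
def top10_icmp_command(capture_dir, first_file, last_file):
    """Build an nfdump call: Top 10 IPs by flows, ICMP only, over a file range."""
    return [
        "nfdump",
        "-R", f"{capture_dir}/{first_file}:{last_file}",  # range of capture files
        "-s", "ip/flows",                                 # IP statistic, ordered by flows
        "-n", "10",                                       # keep the top 10 entries
        "proto icmp",                                     # filter expression
    ]


def period_minutes(file_count, interval_minutes=5):
    """Time span covered by a number of consecutive capture files."""
    return file_count * interval_minutes


# Hypothetical directory and file names.
command = top10_icmp_command(
    "/data/nfsen/profiles-data/live/router1",
    "nfcapd.201201010000",
    "nfcapd.201201010125",
)
shortest = period_minutes(1)
longest = period_minutes(18)
```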

  15. Comparison

  16. Conclusions
     • Overhead has a significant impact for short periods
       • Initialization
       • Job scheduling
       • Combining and re-processing
     • Limited speed gains due to aggregation
     • Filtering is essential for achieving good speed gains
     • Still needs some issues to be addressed
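The first conclusion can be made concrete with a toy cost model. This is an illustration, not a fit to the measurements in the talk: it assumes a fixed overhead per query (initialization, scheduling, combining) and perfect scaling of the remaining work across workers. The 30-second overhead and the single-instance run times are invented numbers; only the worker count (two nodes with two processes each) comes from the test setup.

```python
def distributed_time(single_time, workers, overhead):
    """Toy model: fixed overhead plus the single-instance time split evenly."""
    return overhead + single_time / workers


def speedup(single_time, workers, overhead):
    """Ratio of single-instance time to modelled distributed time."""
    return single_time / distributed_time(single_time, workers, overhead)


WORKERS = 2 * 2   # two nodes, two processes per node
OVERHEAD = 30.0   # hypothetical fixed cost in seconds

short_query = speedup(10.0, WORKERS, OVERHEAD)    # below 1: distribution is slower
long_query = speedup(600.0, WORKERS, OVERHEAD)    # above 1: distribution pays off
```

Under this model the break-even point moves with the overhead, which is why short periods benefit least.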

  17. Thank you!

  18. The code
     • https://github.com/vytautas/nfdist
     • Patches (nfdist branch)
       • https://github.com/vytautas/nfdump
       • https://github.com/vytautas/nfsen

  19. Comparison: bad case
