
BlueSSD: Distributed Flash Store for Big Data Analytics


Presentation Transcript


  1. BlueSSD: Distributed Flash Store for Big Data Analytics Sang Woo Jun, Ming Liu, Kermin Fleming, Arvind Computer Science and Artificial Intelligence Laboratory MIT

  2. Introduction – Flash Storage • Low latency, high density • Throughput per chip is fixed • Many chips are organized into multiple busses that can work concurrently • High throughput is achieved with more busses • Read/write speed difference, limited write lifetime • Not the main focus… yet
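  As a rough sanity check using numbers that appear later in the deck (and assuming the busses on a board contribute roughly equally): a custom flash board with 4 busses delivers about 80 MB/s, i.e. roughly 20 MB/s per bus, so aggregate throughput grows by adding busses, boards, and nodes rather than by making any single chip faster.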

  3. Flash Deployment Goals • High Capacity / Low Unit Cost • CORFU - shared distributed storage over a commodity network • TBs of storage at <1 ms latency, ~1 GB/s throughput when highly distributed • High Throughput / Low Latency • FusionIO - maximum performance using many busses/chips and PCIe • 100s of GB at 100s of µs latency, ~3 GB/s throughput

  4. BlueSSD – Best of Both Worlds • Shared distributed storage over a faster custom network to accelerate big data analytics • PCIe • PCIe 2.0 x8 (~1 GB/s) • Inter-FPGA SERDES • Low-latency sideband network (<1 µs, ~1 GB/s) • Automatic network/flow-control synthesis

  5. The Physical System (Old) [Photo labels: PCIe ~1 GB/s, sideband link ~1 GB/s, flash board ~80 MB/s]

  6. The Physical System (Now: 4 Nodes)

  7. System Configuration • 6 Xilinx ML605 Development Boards + Hosts • 4 Custom Flash Boards • 4 busses with 8 chips, 16 GB per board • 2 Xilinx XM104 Connector Expansion Boards • 5 SMA Connections • The ML605 only has one SMA port, requiring hub nodes [Diagram: the host PC connects over PCIe to a hub-node FPGA carrying two XM104 boards; SMA links fan out from the hub to storage nodes FPGA1–FPGA4, each with its own custom flash board]

  8. System Configuration • Single software host can access all nodes • All nodes have identical memory maps of the entire address space • Requests are redirected to the node that has the data (a sketch of the address-to-node lookup follows) [Diagram: host PC over PCIe to the hub FPGA with two XM104 boards; SMA links to FPGA1–FPGA4 and their custom flash boards]
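  A minimal sketch in C of the lookup implied by these bullets: every node applies the same striping function over the shared memory map, serves a request locally if it owns the stripe, and otherwise forwards it over the sideband network. The stripe size, node count, and the local_flash_read/remote_read helpers are hypothetical stand-ins, not the actual BlueSSD interfaces.

      /* Sketch: which node owns a given flash address under striping.
         NUM_NODES, STRIPE_BYTES, and the two helpers are illustrative only. */
      #include <stdint.h>
      #include <stdio.h>

      #define NUM_NODES    4
      #define STRIPE_BYTES (8 * 1024)   /* assume one 8 KB flash page per stripe */

      /* Every node holds the same map, so any node can resolve any address. */
      static int owner_node(uint64_t addr)
      {
          return (int)((addr / STRIPE_BYTES) % NUM_NODES);
      }

      /* Stubs standing in for the local flash controller and the SMA network. */
      static int local_flash_read(uint64_t addr, void *buf, uint32_t len)
      { (void)addr; (void)buf; return (int)len; }
      static int remote_read(int node, uint64_t addr, void *buf, uint32_t len)
      { (void)node; (void)addr; (void)buf; return (int)len; }

      /* Serve a read locally if we own the stripe, otherwise forward it. */
      static int blue_read(int my_node, uint64_t addr, void *buf, uint32_t len)
      {
          int node = owner_node(addr);
          return (node == my_node) ? local_flash_read(addr, buf, len)
                                   : remote_read(node, addr, buf, len);
      }

      int main(void)
      {
          char buf[4096];
          /* Node 0 reading an address owned by node 2 (16 KB / 8 KB = stripe 2). */
          printf("owner of 0x4000: node %d\n", owner_node(0x4000));
          return blue_read(0, 0x4000, buf, sizeof buf) == sizeof buf ? 0 : 1;
      }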

  9. Network Flash Controller [Diagram: requests and data flow from the host PC over PCIe into a client interface on the FPGA; an address-mapping stage sends each request either to the local flash controller and custom flash board, or over the SMA link (through the XM104) to the remote node that owns the data]

  10. Network Hub • Programmatically define high-level connections • An N-to-N crossbar-like network is generated (a software-analogy sketch follows) [Diagram: the hub's XM104/SMA ports link the ML605 boards so that FPGA1–FPGA4 are fully connected]
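  The real flow here synthesizes hardware from high-level connection declarations; as a software analogy only, the sketch below lists the declared endpoints and expands them into the full N-to-N set of routes a crossbar-like network provides. The names and output format are assumptions.

      /* Software analogy only: declare the endpoints, generate the full
         N-to-N set of routes between them. */
      #include <stdio.h>

      int main(void)
      {
          /* Declared high-level endpoints (illustrative names). */
          const char *node[] = { "FPGA1", "FPGA2", "FPGA3", "FPGA4" };
          const int n = sizeof node / sizeof node[0];

          /* The generated network gives every endpoint a route to every other. */
          for (int s = 0; s < n; s++)
              for (int d = 0; d < n; d++)
                  if (s != d)
                      printf("%s -> %s (one hop through the SMA hub)\n",
                             node[s], node[d]);
          return 0;
      }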

  11. Software • FUSE provides a file system abstraction • A custom FUSE module interfaces with the FPGA (a minimal sketch follows after this slide) • The entire storage can be accessed as a single regular file • Currently running SQLite off the shelf • How to benchmark? [Software stack: SQLite → stdio → file system → FUSE → PCIe driver → FPGA]
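  A minimal sketch, in C against the FUSE 2.x API, of the kind of module described here: it exports the whole store as one read-only regular file and turns read() offsets into requests to the FPGA. The exported file name, the 64 GB size (4 boards x 16 GB), the build line, and the fpga_read() stub standing in for the PCIe driver call are assumptions, not the actual BlueSSD module.

      /* Sketch: expose the distributed flash store as a single read-only file.
         FUSE 2.x API; build (assumed): gcc bluessd_fuse.c $(pkg-config fuse --cflags --libs) */
      #define FUSE_USE_VERSION 26
      #include <fuse.h>
      #include <sys/stat.h>
      #include <string.h>
      #include <errno.h>
      #include <stdint.h>

      #define STORE_PATH "/flash"          /* exported file name (assumed) */
      #define STORE_SIZE (64ULL << 30)     /* 4 boards x 16 GB */

      /* Stub standing in for the PCIe driver call that fetches data from the FPGA. */
      static int fpga_read(uint64_t off, char *buf, size_t size)
      {
          (void)off;
          memset(buf, 0, size);            /* real module issues a flash read request */
          return (int)size;
      }

      static int bs_getattr(const char *path, struct stat *st)
      {
          memset(st, 0, sizeof *st);
          if (strcmp(path, "/") == 0)             { st->st_mode = S_IFDIR | 0755; st->st_nlink = 2; }
          else if (strcmp(path, STORE_PATH) == 0) { st->st_mode = S_IFREG | 0444; st->st_nlink = 1;
                                                    st->st_size = STORE_SIZE; }
          else return -ENOENT;
          return 0;
      }

      static int bs_readdir(const char *path, void *buf, fuse_fill_dir_t fill,
                            off_t off, struct fuse_file_info *fi)
      {
          (void)off; (void)fi;
          if (strcmp(path, "/") != 0) return -ENOENT;
          fill(buf, ".", NULL, 0); fill(buf, "..", NULL, 0);
          fill(buf, STORE_PATH + 1, NULL, 0);      /* the single regular file */
          return 0;
      }

      static int bs_open(const char *path, struct fuse_file_info *fi)
      {
          (void)fi;
          return strcmp(path, STORE_PATH) == 0 ? 0 : -ENOENT;
      }

      static int bs_read(const char *path, char *buf, size_t size, off_t off,
                         struct fuse_file_info *fi)
      {
          (void)fi;
          if (strcmp(path, STORE_PATH) != 0) return -ENOENT;
          if ((uint64_t)off >= STORE_SIZE)   return 0;
          if ((uint64_t)off + size > STORE_SIZE) size = (size_t)(STORE_SIZE - (uint64_t)off);
          return fpga_read((uint64_t)off, buf, size);   /* read() becomes an FPGA request */
      }

      static struct fuse_operations bs_ops = {
          .getattr = bs_getattr, .readdir = bs_readdir,
          .open    = bs_open,    .read    = bs_read,
      };

      int main(int argc, char *argv[])
      {
          return fuse_main(argc, argv, &bs_ops, NULL);
      }

  Mounted at a directory of your choice, the store then appears as one ordinary file that SQLite or plain stdio can open, matching the "single regular file" bullet above.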

  12. Storage Structure • Focusing on read-intensive workloads • Writes are done offline, so there are no coherence issues • Addresses are striped across FPGAs • Concurrent writes will require more than coherence • SQLite assumes exclusive access to the storage • If we are to have more than one file, file system metadata will need to be synchronized

  13. Performance Measurement • Throughput is bottlenecked by the custom flash card • *CORFU performance is measured at 32 nodes [chart omitted]

  14. Scalability • The latency increase is small enough to accommodate 16+ FPGAs • A single SMA cable can accommodate the throughput of 10+ flash boards • More should be possible with a good topology • A different story if the flash boards get faster (link compression?)
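  A rough budget using the deck's own numbers (assuming the links and boards sustain the quoted rates): one SMA sideband link at ~1 GB/s divided by ~80 MB/s per flash board gives roughly 12 boards' worth of traffic per link, consistent with the 10+ figure above; if the flash boards were much faster, the link would become the bottleneck.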

  15. Future Work (1) • Bring up the 4-node system • Bring up the 8-node system • 8 more ML605 boards have been requested from Xilinx • More capacity + throughput

  16. Future Work (2) • Offload computation to the FPGA • Do computation near the storage • Relational algebra processor • Complex analytics? • Looking for interesting applications

  17. Future Work (3) • Multiple concurrent writers • Software-level transaction management • A hardware-level pseudo-filesystem is probably required

  18. The End • Thank you!
