
Clustering Very Large Multi-dimensional Datasets with MapReduce


Presentation Transcript


  1. Clustering Very Large Multi-dimensional Datasets with MapReduce 蔡跳

  2. INTRODUCTION • Goal: cluster a large dataset of moderate-to-high dimensional elements • Serial subspace clustering algorithms do not scale to such data • Dataset sizes reach terabytes and petabytes, e.g., Twitter crawl: > 12 TB; Yahoo! operational data: 5 PB • Approach: combine a fast, scalable serial algorithm with MapReduce and make it run efficiently in parallel

  3. INTRODUCTION • The bottleneck is either disk I/O or the network • Best of both Worlds (BoW): automatically spots the bottleneck and picks a good strategy • BoW can use most serial clustering methods as a plugged-in clustering subroutine

  4. RELATED WORK • MapReduce: a simplified distributed programming model for parallel computation over very large datasets • Two kinds of workers: mappers and reducers • Map stage: reads the input file and outputs (key, value) pairs • Shuffle stage: transfers the mappers' output to the reducers, grouped by key • Reduce stage: processes the received pairs and outputs the final result
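The three stages above can be sketched with an in-memory toy word count (function names here are illustrative, not the Hadoop API):

```python
from collections import defaultdict

def map_stage(documents):
    """Map: read input records, emit (key, value) pairs."""
    pairs = []
    for doc in documents:
        for word in doc.split():
            pairs.append((word, 1))
    return pairs

def shuffle_stage(pairs):
    """Shuffle: group the mappers' output by key for the reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_stage(groups):
    """Reduce: aggregate each key's values into the final result."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_stage(shuffle_stage(map_stage(["a b a", "b c"])))
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```

In a real cluster each stage runs on many machines and the shuffle moves data over the network, which is exactly the cost BoW tries to minimize.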

  5. BoW • ParC: partition the data, cluster each partition, and merge the results • SnI: sample first, paying extra I/O to reduce the network cost • The two strategies form a trade-off between I/O and network cost

  6. ParC -- Parallel Clustering • Partition the data and assign the partitions to different machines • Each machine clusters its own partition; the resulting clusters are called β-clusters • Merge the β-clusters to obtain the final clusters

  7. SnI -- Sample and Ignore • Draw a sample and cluster it to find the major clusters • Ignore (discard) the points that already fall inside those clusters • Run ParC on the remaining points
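The sample-and-ignore filtering can be sketched as follows (systematic sampling for determinism, whereas the paper samples randomly; the gap-based clusterer is again an illustrative stand-in):

```python
def gap_cluster(points, gap=1.0):
    """Stand-in serial clusterer: merge sorted 1-D points closer than gap."""
    clusters = []
    for p in sorted(points):
        if clusters and p - clusters[-1][1] < gap:
            clusters[-1][1] = p
        else:
            clusters.append([p, p])
    return clusters

def sni(points, step=3):
    """Sample-and-Ignore sketch: cluster a small sample, then drop every
    point already covered by a sampled cluster's extent, so only the
    leftovers go through the expensive parallel phase (ParC)."""
    sample = points[::step]                       # 1) sample
    major = gap_cluster(sample)                   # 2) cluster the sample
    leftovers = [p for p in points                # 3) ignore covered points
                 if not any(lo <= p <= hi for lo, hi in major)]
    return major, leftovers                       # leftovers go on to ParC

points = [0.0, 0.2, 0.4, 0.6, 0.8, 9.0, 9.2, 9.4, 9.6, 9.8, 42.0]
major, leftovers = sni(points)
print(major)      # [[0.0, 0.6], [9.2, 9.8]]
print(leftovers)  # [0.8, 9.0, 42.0]
```

Only 3 of the 11 points survive the ignore step, so the data volume hitting the shuffle phase shrinks dramatically, at the price of one extra read pass over the input.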

  8. COST-BASED OPTIMIZATION • ParC Cost = Map Cost + Shuffle Cost + Reduce Cost

  9. SnI Cost:
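The cost equations on slides 8 and 9 were rendered as images and are not reproduced in the transcript. As a hedged sketch only (all symbols below are illustrative assumptions, not the paper's exact notation), a wall-clock model of this kind decomposes as:

```latex
% Illustrative cost sketch -- symbols are assumptions, not the paper's notation.
% |D|: dataset size, r: number of reducers, D_r / N_t: disk-read and network
% transfer rates, t_s: per-task start-up time, P(.): cost of the plugged-in
% serial clusterer, s: sampling ratio, sigma: fraction surviving the ignore step.
\begin{align*}
\mathrm{cost}(\mathrm{ParC}) &\approx
  \underbrace{t_s}_{\text{start-up}}
  + \underbrace{\frac{|D|}{r\,D_r}}_{\text{map (read)}}
  + \underbrace{\frac{|D|}{r\,N_t}}_{\text{shuffle}}
  + \underbrace{P\!\left(\frac{|D|}{r}\right)}_{\text{reduce (cluster)}} \\
\mathrm{cost}(\mathrm{SnI}) &\approx
  \underbrace{\frac{|D|}{r\,D_r} + P\big(s\,|D|\big)}_{\text{sampling pass}}
  + \underbrace{\mathrm{cost}(\mathrm{ParC})\big|_{\,\sigma |D|}}_{\text{ParC on the leftovers}}
\end{align*}
```

The key structural point survives from the slides: SnI pays an extra read pass (I/O) but shuffles far less data (network), which is the trade-off BoW arbitrates.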

  10. BoW • Compute the ParC cost -> costC • Compute the SnI cost -> costCs • if costC > costCs then clusters = result of SnI • else clusters = result of ParC
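The slide's decision rule is simple enough to render directly; the toy cost functions below are illustrative assumptions (ParC shuffles all the data, SnI pays an extra read pass but shuffles only a small leftover fraction), not the paper's measured models:

```python
def bow(D, r, cost_parc, cost_sni, run_parc, run_sni):
    """BoW: evaluate both cost equations up front, then run whichever
    strategy is predicted to be cheaper."""
    if cost_parc(D, r) > cost_sni(D, r):
        return run_sni(D)
    return run_parc(D)

# Toy cost models (units arbitrary): ParC reads once and shuffles all
# |D|; SnI reads twice but shuffles only a ~10% leftover.
cost_parc = lambda D, r: D / r + D            # one read pass + full shuffle
cost_sni  = lambda D, r: 2 * D / r + 0.1 * D  # two read passes + small shuffle

choice = bow(1000.0, 10, cost_parc, cost_sni,
             run_parc=lambda D: "ParC", run_sni=lambda D: "SnI")
print(choice)  # SnI: the shuffle savings outweigh the extra read pass here
```

With a fast network (or a tiny dataset) the comparison flips and BoW runs plain ParC instead, which is exactly the "best of both worlds" behavior the name promises.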

  11. EXPERIMENTAL RESULTS • Experiments use Hadoop • M45: 1.5 PB storage, 1 TB memory • DISC/Cloud: 512 cores, 64 machines, 1 TB RAM, 256 TB disk storage

  12. Quality of results • Average precision and recall of the clustering • Evaluated on synthetic data

  13. Scale-up results • Increasing the number of reducers

  14. Scale-up results • Increasing the data size, with r = 128 and m = 700

  15. Accuracy of our cost equations

  16. Thanks for listening! Thanks for your time
