Utilizing Distributed Environment for Clustering Web Page Streams in Search Engines
Saeed Rahmani, Dr. Mohammad Hadi Sadroddini
Shiraz University
Overview
• Introduction
• Proposed Model
• FICA as Crawler Algorithm
• Incremental Clustering
• Distributed Environment
• MapReduce
• PowerGraph
• Reference
The Age of Big Data
Domains: Social Media, Web, Advertising, Science
• YouTube: 72 hours of video uploaded each minute
• Flickr: 6 billion photos
• Facebook: 1 billion users
• Wikipedia: 28 million pages
Powerful tools for tackling large-data problems
• Search engines
• Bioinformatics: DNA sequence assembly, protein-protein interaction networks
• Recommendation systems
• Text processing
• Machine translation (e.g. "How are you?")
Tools: MapReduce, PowerGraph, Spark, Storm, …
Proposed Model
[Figure: architecture of the proposed model. Components shown:]
• Web
• PowerGraph [2]
• Web Graph
• FICA [1] Crawler
• Temp Repository
• Web Page Pre-processing [3]
• Urls and Links
• MapReduce [5]
• Incremental Clustering [6]
• Important N-gram Detection Unit [4]
• Clustered Web Page Repository
• Application: Search Engine, News Analysis, …
Fast Intelligent Crawling Algorithm (FICA)
[Figure: Throughput of crawling algorithms where the benchmark ranking is PageRank [1]]
[Figure: Logarithmic distance in FICA [1]]
Incremental Clustering [6]
[Figure: documents D1, D2, D3, D4, D5, D6]
Main Concept
[Figure: documents D1, D2, D3, D4, D5]
• Document Representation
• Document Similarity
• Nearest Neighbor
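The three concepts above can be sketched together in a few lines of Python. This is a minimal illustration and not the authors' implementation: the bag-of-words representation, the cosine similarity measure, the `threshold` value, and the names `vectorize`, `cosine`, and `IncrementalClusterer` are all assumptions made for the example. Each arriving document is compared with the existing cluster centroids and joins its nearest neighbor if it is similar enough; otherwise it starts a new cluster.

```python
import math
from collections import Counter


def vectorize(text):
    """Document representation: a bag-of-words term-frequency vector (assumed)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Document similarity: cosine of the angle between two term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalClusterer:
    """Hypothetical single-pass clusterer: one decision per arriving document."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed cut-off, not taken from the slides
        self.centroids = []         # summed term vectors, one per cluster
        self.members = []           # document ids, one list per cluster

    def add(self, doc_id, text):
        vec = vectorize(text)
        # Nearest neighbor: find the most similar existing cluster.
        best, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            self.centroids[best].update(vec)  # fold the document into the cluster
            self.members[best].append(doc_id)
            return best
        # No cluster is close enough: open a new one.
        self.centroids.append(Counter(vec))
        self.members.append([doc_id])
        return len(self.centroids) - 1


clusterer = IncrementalClusterer(threshold=0.3)
stream = [
    ("D1", "stock market falls"),
    ("D2", "football match tonight"),
    ("D3", "stock market rises"),
    ("D4", "football match result"),
]
labels = [clusterer.add(doc_id, text) for doc_id, text in stream]
```

Because each document is handled once and only the centroids are kept, the sketch never revisits earlier documents, which is the property that makes this style of clustering suitable for streams.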
N-gram
• N-grams are sequences of tokens.
• The N stands for how many terms are used:
• Unigram: 1 term
• Bigram: 2 terms
• Trigram: 3 terms
[Figure: list of Persian n-grams]
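As a small illustration of the definition above, the following Python sketch extracts n-grams from a token list. Whitespace tokenization and the function name `ngrams` are assumptions made for the example; a real pipeline for Persian text would need a language-aware tokenizer.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the quick brown fox".split()  # whitespace tokenization is an assumption
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

A sequence of k tokens yields k - n + 1 n-grams, so the four-token example gives four unigrams, three bigrams, and two trigrams.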
MapReduce Framework [5]
Stages: Input → Map → Shuffle & Sort → Reduce → Output
Example: Word Count
Input:
• "the quick brown fox"
• "the fox ate the mouse"
• "how now brown cow"
Map output:
• Map 1: the, 1; quick, 1; brown, 1; fox, 1
• Map 2: the, 2; fox, 1; ate, 1; mouse, 1
• Map 3: how, 1; now, 1; brown, 1; cow, 1
Reduce output:
• Reduce 1: brown, 2; fox, 2; how, 1; now, 1; the, 3
• Reduce 2: ate, 1; cow, 1; mouse, 1; quick, 1
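The word-count flow can be imitated on a single machine to make the three stages concrete. The sketch below is plain Python, not the Hadoop or Google MapReduce API; the function names `map_phase`, `shuffle_and_sort`, `reduce_phase`, and `word_count` are hypothetical. The mapper here emits one pair per word occurrence, so any pre-aggregation inside a mapper (a combiner) is left out.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in line.split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle & Sort: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(pairs))


splits = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
counts = word_count(splits)
```

In a real cluster the map calls run in parallel on different splits and the grouped keys are partitioned across several reducers; the final counts are the same either way.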
Implementing FICA using Map-Reduce
Dataset: "uk-2002"; Nodes: 18,520,486; Edges: 298,113,762
[Figure: Execution time (sec) of Map-Reduce iterations using 100 random initial seeds, per iteration]
[Figure: Speedup of improved Map-Reduce execution using active and in-active nodes, per iteration]
[Figure: Execution time (sec) of Map-Reduce iterations using active and in-active nodes, with 100 random initial seeds, per iteration]
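The slide does not spell out how active and in-active nodes are used, so the following Python sketch is only one plausible reading. It assumes that the cost of following a link from page u is log10 of u's out-degree (an interpretation of the "logarithmic distance" in [1] that should be checked against the paper) and that a node is active in an iteration only if its distance improved in the previous one. The function name `propagate_distances` and the toy graph are hypothetical.

```python
import math


def propagate_distances(graph, seeds):
    """graph maps each node to its list of out-neighbors.

    Returns the logarithmic distance of every reached node from the seeds
    and the number of iterations that were executed.
    """
    dist = {seed: 0.0 for seed in seeds}
    active = set(seeds)
    iterations = 0
    while active:
        iterations += 1
        # "Map": only active nodes emit candidate distances to their neighbors.
        messages = {}
        for u in active:
            out = graph.get(u, [])
            if not out:
                continue
            cost = math.log10(len(out))  # assumed edge cost
            for v in out:
                candidate = dist[u] + cost
                if candidate < messages.get(v, math.inf):
                    messages[v] = candidate
        # "Reduce": keep the minimum per node; a node is active next round
        # only if its distance actually improved.
        active = set()
        for v, candidate in messages.items():
            if candidate < dist.get(v, math.inf):
                dist[v] = candidate
                active.add(v)
    return dist, iterations


toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
distances, rounds = propagate_distances(toy_graph, ["a"])
```

Skipping in-active nodes means later iterations touch only the part of the graph that is still changing, which is consistent with the speedup the slide reports for the improved execution.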
Natural Graphs
More than 10^8 vertices have one neighbor.
Top 1% of vertices are adjacent to 50% of the edges!
[Image from WikiCommons]
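A skew of this kind can be measured directly from an edge list. The Python sketch below computes the share of edges that touch the highest-degree vertices; the function name `top_vertex_edge_share` and the small hub-and-leaves graph are made up for illustration and are not data from the slides.

```python
from collections import Counter


def top_vertex_edge_share(edges, fraction=0.01):
    """Fraction of edges adjacent to the top `fraction` of vertices by degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    k = max(1, int(fraction * len(degree)))  # always keep at least one vertex
    top = {vertex for vertex, _ in degree.most_common(k)}
    touched = sum(1 for u, v in edges if u in top or v in top)
    return touched / len(edges)


# Toy graph: vertex 0 is a hub linked to nine leaves, plus one leaf-to-leaf edge.
toy_edges = [(0, i) for i in range(1, 10)] + [(1, 2)]
share = top_vertex_edge_share(toy_edges, fraction=0.1)
```

On this toy graph a single vertex out of ten is adjacent to nine of the ten edges, a miniature version of the imbalance that makes natural graphs hard to partition.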
Power-Law Graphs are Difficult to Partition
• Power-law graphs do not have low-cost balanced cuts.
• Traditional graph-partitioning algorithms perform poorly on power-law graphs.
[Figure: a graph partitioned between CPU 1 and CPU 2]
http://www.cs.cmu.edu/~ylow/vldb5.pptx
Curse of the Slow Job
[Figure: successive iterations in which CPU 1, CPU 2, and CPU 3 each process their data blocks, with a barrier after every iteration]
http://www.www2011india.com/proceeding/proceedings/p607.pdf
PowerGraph [2]
[Figure: "Program For This" versus "Run on This", with a vertex split across Machine 1 and Machine 2]
• Split high-degree vertices
• New abstraction: equivalence on split vertices
http://www.cs.cmu.edu/~ylow/vldb5.pptx
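Distributed Execution of a PowerGraph Vertex-Program
[Figure: vertex Y spread over Machines 1 to 4 as one master and three mirrors]
• Gather: each machine computes a partial sum (Σ1, Σ2, Σ3, Σ4), and the partial sums are added: Σ = Σ1 + Σ2 + Σ3 + Σ4
• Apply: the vertex value Y is updated to Y’
• Scatter: runs on each machine using the updated value Y’
http://www.cs.cmu.edu/~ylow/vldb5.pptx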
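The gather, apply, and scatter phases can be imitated in plain Python to show why a split vertex still computes the same result. The sketch below is not the PowerGraph API: the function `pagerank_gas`, the edge-to-machine `assignment` list, and the toy graph are hypothetical, and the update rule (1 - damping) + damping * Σ is the common unnormalized PageRank form, assumed here for illustration.

```python
from collections import defaultdict


def pagerank_gas(edges, assignment, num_machines, iterations=20, damping=0.85):
    """edges: list of (src, dst); assignment[i] is the machine holding edge i."""
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1
    rank = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        # Gather: every machine sums contributions over its local edges only.
        partial = [defaultdict(float) for _ in range(num_machines)]
        for i, (src, dst) in enumerate(edges):
            partial[assignment[i]][dst] += rank[src] / out_degree[src]
        # The partial sums of each vertex are added together at its master.
        total = defaultdict(float)
        for machine_sums in partial:
            for v, s in machine_sums.items():
                total[v] += s
        # Apply: the master computes the new value, which the mirrors then copy.
        rank = {v: (1 - damping) + damping * total[v] for v in vertices}
        # Scatter: a real engine would signal only the neighbors that need to
        # rerun; this sketch simply reruns every vertex in the next iteration.
    return rank


toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
single = pagerank_gas(toy_edges, assignment=[0, 0, 0, 0], num_machines=1)
split = pagerank_gas(toy_edges, assignment=[0, 1, 0, 1], num_machines=2)
```

Because the gather step is a sum, it can be computed in pieces on different machines and combined afterwards, so placing the in-edges of vertex "c" on two machines gives the same ranks as running everything on one.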
PageRank on the Twitter Follower Graph
Natural graph with 40M users, 1.4 billion links
[Figure: Communication, total network traffic (GB)] Reduces communication
[Figure: Runtime, seconds] Runs faster
Cluster: 32 nodes x 8 cores (EC2 HPC cc1.4x)
http://www.cs.cmu.edu/~ylow/vldb5.pptx
Reference
[1] Zareh Bidoki, Ali Mohammad, Nasser Yazdani, and Pedram Ghodsnia. "FICA: A fast intelligent crawling algorithm." Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence. IEEE Computer Society, 2007.
[2] Gonzalez, Joseph E., et al. "PowerGraph: Distributed graph-parallel computation on natural graphs." Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 2012.
[3] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Vol. 1. Cambridge: Cambridge University Press, 2008.
[4] Balkir, Atilla Soner, Ian Foster, and Andrey Rzhetsky. "A distributed look-up architecture for text mining applications using MapReduce." High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for. IEEE, 2011.
[5] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
[6] Khalilian, M., and N. Mustapha. "Data stream clustering: Challenges and issues." arXiv preprint arXiv:1006.5261, 2010.
[7] Chung, Seokkyung, Dennis McLeod, and Jongeun Jun. "Incremental Mining from News Streams." (2009).
[8] Power, R., and J. Li. "Piccolo: Building fast, distributed programs with partitioned tables." In OSDI (2010).
[9] Low, Yucheng, et al. "GraphLab: A new framework for parallel machine learning." arXiv preprint arXiv:1006.4990 (2010).