Utilizing Distributed Environment for Clustering Web Page Streams in Search Engines

Saeed Rahmani, Dr. Mohammad Hadi Sadroddini
Shiraz University

Overview
• Introduction
• Proposed Model
• FICA as Crawler Algorithm
• Incremental Clustering
• Distributed Environment
• MapReduce
• PowerGraph
• References

The Age of Big Data
• Domains: Social Media, Web, Advertising, Science
• YouTube: 72 hours of video uploaded each minute
• Flickr: 6 billion photos
• Facebook: 1 billion users
• Wikipedia: 28 million pages

Powerful tools for tackling large-data problems
• Search engines
• Bioinformatics: DNA sequence assembly, protein-protein interaction networks
• Recommendation systems
• Text processing
• Machine translation
Tools: MapReduce, PowerGraph, Spark, Storm, …

Proposed Model
[Figure: architecture of the proposed model. Components shown:]
• Web and Web Graph
• PowerGraph [2]
• FICA [1] Crawler
• Temp Repository
• Web Page Pre-processing [3]
• URLs and Links
• MapReduce [5]
• Incremental Clustering [6]
• Important N-gram Detection Unit [4]
• Clustered Web Page Repository
• Applications: Search Engine, News Analysis, …

Fast Intelligent Crawling Algorithm (FICA)
[Figure: Throughput of crawling algorithms where the benchmark ranking is PageRank [1].]
[Figure: Logarithmic distance in FICA [1].]
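The logarithmic distance named on this slide can be illustrated with a small crawl-ordering sketch. It assumes that the weight of a link leaving page i is log10 of i's out-degree and that pages are visited in increasing distance from the seeds; that is a simplified reading of [1], not the full algorithm. The toy graph and the function name crawl_order are hypothetical.

```python
import heapq
import math

def crawl_order(graph, seeds):
    """Visit pages in increasing logarithmic distance from the seeds.

    Assumption: a link leaving a page with out-degree k has weight log10(k).
    """
    dist = {s: 0.0 for s in seeds}
    heap = [(0.0, s) for s in sorted(seeds)]
    heapq.heapify(heap)
    done = set()
    visited = []
    while heap:
        d, page = heapq.heappop(heap)
        if page in done:
            continue  # stale queue entry
        done.add(page)
        visited.append(page)
        links = graph.get(page, [])
        if not links:
            continue
        step = math.log10(len(links))  # assumed link weight
        for target in links:
            nd = d + step
            if nd < dist.get(target, math.inf):
                dist[target] = nd
                heapq.heappush(heap, (nd, target))
    return visited, dist

# Hypothetical toy web: "hub" has many out-links, so pages reached
# only through it end up far from the seed.
toy_web = {
    "seed": ["a", "hub"],
    "a": ["b"],
    "hub": ["b", "c", "d", "e", "f", "g", "h", "i", "j", "k"],
    "b": [],
}
order, dist = crawl_order(toy_web, ["seed"])
```

Under this assumption, a page linked from a page with few out-links is crawled before pages that are only reachable through a page with many out-links.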

Incremental Clustering [6]
[Figure: documents D1 to D6.]

Main Concept
[Figure: documents D1 to D5.]
- Document Representation
- Documents Similarity
- Nearest Neighbor
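The three concepts above can be combined into a single-pass clustering sketch. It assumes a bag-of-words document representation, cosine similarity, and a fixed similarity threshold for joining the nearest cluster; these choices, the class name IncrementalClusterer and the sample documents are illustrative assumptions, not the method of [6].

```python
import math
from collections import Counter

def vectorize(text):
    """Document representation: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Document similarity: cosine of two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class IncrementalClusterer:
    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed cut-off for joining a cluster
        self.centroids = []         # summed term vectors, one per cluster
        self.members = []           # document ids, one list per cluster

    def add(self, doc_id, text):
        """Assign one arriving document to its nearest cluster, or open a new one."""
        vec = vectorize(text)
        best, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            self.centroids[best].update(vec)  # Counter.update adds counts
            self.members[best].append(doc_id)
            return best
        self.centroids.append(Counter(vec))
        self.members.append([doc_id])
        return len(self.centroids) - 1

stream = [
    ("D1", "search engine crawler web"),
    ("D2", "web crawler search index"),
    ("D3", "protein dna sequence assembly"),
    ("D4", "dna sequence protein network"),
    ("D5", "search engine web index"),
]
clusterer = IncrementalClusterer()
labels = [clusterer.add(doc_id, text) for doc_id, text in stream]
```

Each document is seen once and never reassigned, which is what makes the scheme suitable for a stream; the threshold controls how readily new clusters are opened.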

N-gram
• N-grams are sequences of tokens.
• The N stands for how many terms are used.
• Unigram: 1 term
• Bigram: 2 terms
• Trigram: 3 terms
[Figure: List of Persian n-grams.]
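The definitions above can be sketched in a few lines. The example assumes whitespace tokenization and an English sample sentence; the function name ngrams is illustrative.

```python
def ngrams(tokens, n):
    """Return all contiguous runs of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

A sequence shorter than n yields an empty list, so callers do not need a separate length check.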

MapReduce Framework [5]
Stages: Input, Map, Shuffle & Sort, Reduce, Output
[Figure: Word Count example.]
• Input splits: "the quick brown fox", "the fox ate the mouse", "how now brown cow"
• Map: each split is turned into (word, count) pairs
• Output of the first reducer: brown, 2; fox, 2; how, 1; now, 1; the, 3
• Output of the second reducer: ate, 1; cow, 1; mouse, 1; quick, 1
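The word count example can be reproduced with a single-process sketch of the three stages. This is plain Python rather than a real MapReduce runtime; the function names map_phase, shuffle and reduce_phase are illustrative.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle & sort: group the emitted values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

splits = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
mapped = [pair for line in splits for pair in map_phase(line)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
```

In a distributed run, each split would be mapped on a different worker and the keys would be divided among several reducers; the per-word totals are the same.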

Implementing FICA using Map-Reduce
Dataset: "uk-2002"
Nodes: 18,520,486
Edges: 298,113,762
[Figure: Execution time (sec) of Map-Reduce iterations using 100 random initial seeds.]
[Figure: Execution time (sec) of Map-Reduce iterations using active and inactive nodes, with 100 random initial seeds.]
[Figure: Speedup of improved Map-Reduce execution using active and inactive nodes, per iteration.]

Natural Graphs
• More than 10^8 vertices have one neighbor.
• Top 1% of vertices are adjacent to 50% of the edges!
[Image from WikiCommons]

Power-Law Graphs are Difficult to Partition
• Power-law graphs do not have low-cost balanced cuts.
• Traditional graph-partitioning algorithms perform poorly on power-law graphs.
[Figure: a graph partitioned between CPU 1 and CPU 2.]
http://www.cs.cmu.edu/~ylow/vldb5.pptx

Curse of the Slow Job
[Figure: over successive iterations, CPU 1, CPU 2 and CPU 3 each process their data blocks and then wait at a barrier.]
http://www.www2011india.com/proceeding/proceedings/p607.pdf

PowerGraph [2]
[Figure: "Program For This" and "Run on This", with a vertex split across Machine 1 and Machine 2.]
• Split high-degree vertices
• New abstraction: equivalence on split vertices
http://www.cs.cmu.edu/~ylow/vldb5.pptx

Distributed Execution of a PowerGraph Vertex-Program
[Figure: a vertex Y split across Machines 1 to 4 as one master and three mirrors.]
• Gather: partial sums Σ1, Σ2, Σ3 and Σ4 are computed per machine and added into Σ
• Apply: Y is updated to Y' and the mirrors receive Y'
• Scatter
http://www.cs.cmu.edu/~ylow/vldb5.pptx
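The gather and apply steps can be sketched with PageRank as the vertex-program. This is a single-process simulation, not the PowerGraph API: each list in edge_partitions stands for the edges held by one machine, and the names gather, apply_update and pagerank_gas, the damping constant and the toy edges are assumptions made for illustration. Scatter is omitted because every vertex is recomputed in every iteration.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed PageRank damping factor

def gather(rank, out_degree, src):
    """Gather: contribution of one in-edge to its target vertex."""
    return rank[src] / out_degree[src]

def apply_update(total):
    """Apply: new vertex value from the combined gather sum."""
    return (1.0 - DAMPING) + DAMPING * total

def pagerank_gas(edge_partitions, iterations=20):
    edges = [edge for part in edge_partitions for edge in part]
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1
    rank = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        # Each simulated machine computes a partial sum per target vertex.
        partials = []
        for part in edge_partitions:
            local = defaultdict(float)
            for src, dst in part:
                local[dst] += gather(rank, out_degree, src)
            partials.append(local)
        # The partial sums are added together, then apply runs once per vertex.
        rank = {
            v: apply_update(sum(p.get(v, 0.0) for p in partials))
            for v in vertices
        }
    return rank

# Hypothetical graph: the in-edges of "hub" are split over two machines.
machine_1 = [("a", "hub"), ("b", "hub")]
machine_2 = [("c", "hub"), ("hub", "a")]
split_ranks = pagerank_gas([machine_1, machine_2])
single_ranks = pagerank_gas([machine_1 + machine_2])
```

Because the gather step is a sum, splitting a high-degree vertex's edges over machines and adding the partial sums gives the same ranks as processing all edges in one place.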

PageRank on the Twitter Follower Graph
Natural graph with 40M users and 1.4 billion links
[Figure: two panels. Communication: total network traffic (GB), annotated "Reduces Communication". Runtime: seconds, annotated "Runs Faster".]
Setup: 32 nodes x 8 cores (EC2 HPC cc1.4x)
http://www.cs.cmu.edu/~ylow/vldb5.pptx

References
[1] Zareh Bidoki, Ali Mohammad, Nasser Yazdani, and Pedram Ghodsnia. "FICA: a fast intelligent crawling algorithm." Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence. IEEE Computer Society, 2007.
[2] Gonzalez, Joseph E., et al. "PowerGraph: Distributed graph-parallel computation on natural graphs." Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 2012.
[3] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to information retrieval. Vol. 1. Cambridge: Cambridge University Press, 2008.
[4] Balkir, Atilla Soner, Ian Foster, and Andrey Rzhetsky. "A distributed look-up architecture for text mining applications using MapReduce." High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for. IEEE, 2011.
[5] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
[6] Khalilian, M., and N. Mustapha. "Data stream clustering: Challenges and issues." arXiv preprint arXiv:1006.5261 (2010).
[7] Chung, Seokkyung, Dennis McLeod, and Jongeun Jun. "Incremental Mining from News Streams." (2009).
[8] Power, R., and J. Li. "Piccolo: building fast, distributed programs with partitioned tables." OSDI. 2010.
[9] Low, Yucheng, et al. "GraphLab: A new framework for parallel machine learning." arXiv preprint arXiv:1006.4990 (2010).
