
Utilizing Distributed Environment for Clustering Web Page Streams in Search Engines

Saeed Rahmani, Dr. Mohammad Hadi Sadroddini, Shiraz University. Overview: Introduction; Proposed Model; FICA as Crawler Algorithm; Incremental Clustering; Distributed Environment; MapReduce; PowerGraph.


Presentation Transcript


  1. Utilizing Distributed Environment for Clustering Web Page Streams in Search Engines. Saeed Rahmani, Dr. Mohammad Hadi Sadroddini, Shiraz University

  2. Overview • Introduction • Proposed Model • FICA as Crawler Algorithm • Incremental Clustering • Distributed Environment • MapReduce • PowerGraph • References

  3. The Age of Big Data • Domains: Social Media, Web, Advertising, Science • YouTube: 72 hours each minute • Flickr: 6 billion photos • Facebook: 1 billion users • Wikipedia: 28 million pages

  4. Powerful tools for tackling large-data problems • Search engines • Bioinformatics: DNA sequence assembly, protein-protein interaction networks • Recommendation systems • Text processing: machine translation (e.g., "How are you?") • Tools: MapReduce, PowerGraph, Spark, Storm, …

  5. Proposed Model (diagram components) • Web • PowerGraph [2] • Web Graph • FICA [1] Crawler • Temp Repository • URLs and Links • Web Page Pre-processing [3] • MapReduce [5] • Important N-gram Detection Unit [4] • Incremental Clustering [6] • Clustered Web Page Repository • Applications: Search Engine, News Analysis, …

  6. Fast Intelligent Crawling Algorithm (FICA) • Figure: throughput of crawling algorithms where the benchmark ranking is PageRank [1] • Figure: logarithmic distance in FICA [1]
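The slide names logarithmic distance without defining it. The sketch below is one simplified reading, not the authors' implementation: it assumes that following a link out of page i costs log10 of i's out-degree, and that the crawler always fetches the frontier page with the smallest accumulated distance from the seeds. The in-memory `links` dictionary and the function name `crawl_order` are hypothetical, and the learning-based updates described in [1] are omitted; consult [1] for the exact definition.

```python
import heapq
import math
from typing import Dict, List


def crawl_order(links: Dict[str, List[str]], seeds: List[str]) -> List[str]:
    """Return pages in the order a distance-prioritized crawler would fetch them.

    Assumption: the cost of a link i -> j is log10(out-degree of i).
    """
    dist = {s: 0.0 for s in seeds}
    frontier = [(0.0, s) for s in sorted(seeds)]
    heapq.heapify(frontier)
    crawled: List[str] = []
    seen = set()
    while frontier:
        d, page = heapq.heappop(frontier)
        if page in seen:
            continue  # stale queue entry for a page already fetched
        seen.add(page)
        crawled.append(page)
        out = links.get(page, [])
        if not out:
            continue
        # Out-degree is only known once the page has been fetched.
        weight = math.log10(len(out))
        for target in out:
            new_dist = d + weight
            if target not in seen and new_dist < dist.get(target, math.inf):
                dist[target] = new_dist
                heapq.heappush(frontier, (new_dist, target))
    return crawled


# Hypothetical toy web: "a" has one out-link, "b" has three.
toy_links = {"s": ["a", "b"], "a": ["c"], "b": ["d", "e", "f"]}
order = crawl_order(toy_links, ["s"])
```

Under this assumption, a page reached through a source with few out-links ("c" via "a") is fetched before pages reached through a source with many out-links ("d", "e", "f" via "b").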

  7. Incremental Clustering [6] • Figure: documents D1 to D6

  8. Main Concept • Figure: documents D1 to D5 • Document representation • Document similarity • Nearest neighbor
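The three ingredients on this slide (a document representation, a similarity measure, and a nearest-neighbor lookup) can be combined into a single-pass clusterer. The sketch below is illustrative only and is not claimed to be the authors' algorithm: it assumes bag-of-words term counts as the representation, cosine similarity, and a hypothetical similarity `threshold` below which a document starts a new cluster.

```python
import math
from collections import Counter
from typing import List


def represent(text: str) -> Counter:
    """Document representation: lowercase bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Document similarity: cosine of the angle between two count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalClusterer:
    """Assigns each arriving document to its nearest cluster or opens a new one."""

    def __init__(self, threshold: float = 0.3) -> None:
        self.threshold = threshold          # assumed tuning parameter
        self.centroids: List[Counter] = []  # summed term counts per cluster
        self.members: List[List[str]] = []

    def add(self, doc_id: str, text: str) -> int:
        vec = represent(text)
        best, best_sim = -1, 0.0
        # Nearest neighbor: find the most similar existing cluster.
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= self.threshold:
            self.centroids[best].update(vec)  # cosine ignores scale, so a sum suffices
            self.members[best].append(doc_id)
            return best
        self.centroids.append(vec)
        self.members.append([doc_id])
        return len(self.centroids) - 1


clusterer = IncrementalClusterer(threshold=0.3)
first = clusterer.add("D1", "search engine web crawler")
second = clusterer.add("D2", "web crawler for search")
third = clusterer.add("D3", "protein interaction network")
```

Each document is seen once and never reassigned, which is what makes the scheme suitable for a stream; the trade-off is that results depend on arrival order and on the chosen threshold.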

  9. N-gram • N-grams are sequences of tokens • The N stands for how many terms are used • Unigram: 1 term • Bigram: 2 terms • Trigram: 3 terms • Figure: list of Persian n-grams
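The definition above maps directly onto a sliding window over a token list. A minimal sketch, assuming whitespace tokenization (the function name `ngrams` is illustrative, and real Persian text would need a proper tokenizer):

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive tokens, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the quick brown fox".split()
unigrams = ngrams(tokens, 1)  # 4 one-term sequences
bigrams = ngrams(tokens, 2)   # 3 two-term sequences
trigrams = ngrams(tokens, 3)  # 2 three-term sequences
```

A sequence of k tokens yields k - n + 1 n-grams, and none when n exceeds k.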

  10. MapReduce Framework [5] • Stages: Input → Map → Shuffle & Sort → Reduce → Output • Example: Word Count
      • Input 1: "the quick brown fox" → Map output: the, 1; quick, 1; brown, 1; fox, 1
      • Input 2: "the fox ate the mouse" → Map output: the, 2; fox, 1; ate, 1; mouse, 1
      • Input 3: "how now brown cow" → Map output: how, 1; now, 1; brown, 1; cow, 1
      • Reduce output 1: brown, 2; fox, 2; how, 1; now, 1; the, 3
      • Reduce output 2: ate, 1; cow, 1; mouse, 1; quick, 1
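The word-count example on this slide can be simulated in a single process to make the three stages concrete. This is a sketch of the programming model only, not of Hadoop or any other real MapReduce API; the function names are illustrative. Unlike the slide, which shows "the, 2" leaving the second mapper (a local pre-aggregation), this sketch emits one pair per word occurrence and leaves all summing to the reducer.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in line.split()]


def shuffle_and_sort(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle & sort: group values by key and order the keys."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_and_sort(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
counts = word_count(lines)
```

The resulting totals agree with the reduce outputs listed on the slide (for example, "the" occurs 3 times and "brown" 2 times).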

  11. Implementing FICA using MapReduce • Dataset: "uk-2002" (nodes: 18,520,486; edges: 298,113,762) • Figure: execution time (sec) per iteration of MapReduce iterations using 100 random initial seeds • Figure: execution time (sec) per iteration of MapReduce iterations using active and inactive nodes, with 100 random initial seeds • Figure: speedup per iteration of the improved MapReduce execution using active and inactive nodes

  12. Natural Graphs • More than 10^8 vertices have one neighbor • The top 1% of vertices are adjacent to 50% of the edges [Image from WikiCommons]

  13. Power-Law Graphs are Difficult to Partition • Power-law graphs do not have low-cost balanced cuts • Traditional graph-partitioning algorithms perform poorly on power-law graphs • Figure: a graph partitioned across CPU 1 and CPU 2 (source: http://www.cs.cmu.edu/~ylow/vldb5.pptx)

  14. Curse of the Slow Job • Figure: over successive iterations, CPU 1, CPU 2, and CPU 3 each process their data partitions, with a barrier after every iteration (source: http://www.www2011india.com/proceeding/proceedings/p607.pdf)

  15. PowerGraph [2] • Figure: "Program for this" (the original graph) and "Run on this" (the same graph split across Machine 1 and Machine 2) • Split high-degree vertices • New abstraction: equivalence on split vertices (source: http://www.cs.cmu.edu/~ylow/vldb5.pptx)

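  16. Distributed Execution of a PowerGraph Vertex-Program • Figure: vertex Y is split across Machines 1 to 4 as one master and three mirrors • Gather: each machine computes a partial sum (Σ1, Σ2, Σ3, Σ4), and these are combined as Σ = Σ1 + Σ2 + Σ3 + Σ4 • Apply: the master updates Y to Y' and the new value is copied to the mirrors • Scatter: runs on each machine over the local edges of Y' (source: http://www.cs.cmu.edu/~ylow/vldb5.pptx)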
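The Gather, Apply, and Scatter phases named on this slide can be illustrated with a single-process PageRank. This is a sketch of the vertex-program idea only: it is plain Python, not PowerGraph's C++ API, it does not split vertices across machines, and the function name `pagerank_gas`, the damping factor, and the tolerance are assumptions chosen for the example.

```python
from typing import Dict, List


def pagerank_gas(out_edges: Dict[str, List[str]], damping: float = 0.85,
                 tol: float = 1e-10, max_rounds: int = 200) -> Dict[str, float]:
    """Synchronous PageRank written as gather / apply / scatter steps."""
    vertices = set(out_edges) | {v for targets in out_edges.values() for v in targets}
    in_edges: Dict[str, List[str]] = {v: [] for v in vertices}
    for u, targets in out_edges.items():
        for v in targets:
            in_edges[v].append(u)
    out_deg = {v: len(out_edges.get(v, [])) for v in vertices}

    rank = {v: 1.0 for v in vertices}
    active = set(vertices)
    for _ in range(max_rounds):
        if not active:
            break
        new_rank = dict(rank)
        next_active = set()
        for v in active:
            # Gather: sum contributions over in-edges (a commutative, associative
            # sum, which is what lets a real system compute it in partial sums).
            total = sum(rank[u] / out_deg[u] for u in in_edges[v])
            # Apply: update the vertex value from the gathered sum.
            new_rank[v] = (1.0 - damping) + damping * total
            # Scatter: if the value changed, activate out-neighbors for next round.
            if abs(new_rank[v] - rank[v]) > tol:
                next_active.update(out_edges.get(v, []))
        rank = new_rank
        active = next_active
    return rank


# Hypothetical toy graph with no dangling vertices.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_gas(toy_graph)
```

Because the gather step is a plain sum, it can be evaluated independently on each machine that holds part of a vertex's edges and then combined, which is the property the master/mirror figure relies on.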

  17. PageRank on the Twitter Follower Graph • Natural graph with 40M users and 1.4 billion links • Setup: 32 nodes x 8 cores (EC2 HPC cc1.4x) • Figure: communication, total network (GB): reduces communication • Figure: runtime (seconds): runs faster (source: http://www.cs.cmu.edu/~ylow/vldb5.pptx)

  18. References
      [1] Zareh Bidoki, Ali Mohammad, Nasser Yazdani, and Pedram Ghodsnia. "FICA: A fast intelligent crawling algorithm." Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence. IEEE Computer Society, 2007.
      [2] Gonzalez, Joseph E., et al. "PowerGraph: Distributed graph-parallel computation on natural graphs." Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 2012.
      [3] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Vol. 1. Cambridge: Cambridge University Press, 2008.
      [4] Balkir, Atilla Soner, Ian Foster, and Andrey Rzhetsky. "A distributed look-up architecture for text mining applications using MapReduce." High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for. IEEE, 2011.
      [5] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
      [6] Khalilian, M., and N. Mustapha. "Data stream clustering: Challenges and issues." arXiv preprint arXiv:1006.5261 (2010).
      [7] Chung, Seokkyung, Dennis McLeod, and Jongeun Jun. "Incremental mining from news streams." (2009).
      [8] Power, R., and J. Li. "Piccolo: Building fast, distributed programs with partitioned tables." OSDI. 2010.
      [9] Low, Yucheng, et al. "GraphLab: A new framework for parallel machine learning." arXiv preprint arXiv:1006.4990 (2010).

  19. Questions?
