
Large Scale Internet Search at Ask

Large Scale Internet Search at Ask.com. Tao Yang. Outline. Overview of the company and search products Core techniques for page ranking ExpertRank Challenges in building scalable search services Neptune clustering middleware. Fault isolation. Fast detection (TAMP) Communication.


Presentation Transcript


  1. Large Scale Internet Search at Ask.com Tao Yang

  2. Outline • Overview of the company and search products • Core techniques for page ranking • ExpertRank • Challenges in building scalable search services • Neptune clustering middleware. • Fault isolation. Fast detection (TAMP) • Communication

  3. Image/video search • Image search is very popular. • Video search is getting popular. • Driven by significant growth of broadband users • Popularity of video sharing: YouTube. • Major content providers continue to make significant investments • Disney, Fox, ABC/NBC/CNN/CBS, ESPN, AOL/Time Warner. • Traditional publishers are also moving to video content: NY Times, LA Times, Wall Street Journal, etc. • Significant growth in the online advertising market and

  4. Ask.com: Focused on Delivering a Better Search Experience • Innovative search technologies help people find what they’re looking for faster • For text, image, video, map, city search, etc. • #6 U.S. Web Property; #8 Global in terms of user coverage (formerly Ask Jeeves) • 28.5% reach - Active North American Audience with 48.8 million unique users • 133 million global unique users • 6% Share of North American Searches • A Division of IAC, a Fortune 500 company with over 60 brands, 28,000 employees.

  5. Sectors of IAC • Retailing • Services • Media & Advertising • Membership & Subscriptions • Emerging Businesses

  6. 7 Billion Videos Per Month

  7. Site Features: Smart Answer

  8. Topic Zooming with Search Suggestions

  9. AskCity

  10. Ask Competitive Strengths • Deeper topic view of the Internet • Query-specific link and text analysis with behavior analysis • Differentiated clustering technology • Natural Language Processing • Better understanding/analysis of queries and user behavior • Integration of Structured Data with Web search.

  11. Behind Ask.com: Data Indexing and Mining [Diagram labels: Internet, Web documents, Crawler (x3), Parsing (x3), Document DB (x3), Inverted index generation (x3), Online Database, Content classification, Link graph generation, Web graph generation, Spammer removal, Duplicate removal]

  12. Engine Architecture [Diagram labels: Client queries, Traffic load balancer, Frontend (x4), Neptune Clustering Middleware, Hierarchical Result Cache, Ranking (x6), Classification, Page index (x2), Document Abstract (x3), Document description, Structured DB]

  13. Concept: Link-based Popularity for Ranking • A is a connectivity matrix among web pages. A(i,j)=1 for an edge from i to j. • Query-independent popularity. • Query-specific popularity [Figure: example link graphs with numbered page nodes]

  14. Approaches for Page Ranking • PageRank [Brin/Page ’98]: offline, iterative computation of query-independent popularity. • HITS [Kleinberg ’98, IBM Clever]: • Build a query-based connectivity matrix A on the fly. H, R are hub and authority weights of pages. • Repeat until H, R converge: • R = A’H = A’A R (with H = A R); • Normalize H, R. • ExpertRank: Compute query-specific communities and ranking in real time. • Started from Teoma and evolved at Ask.com
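The HITS iteration on this slide can be sketched in a few lines of Python. This is a minimal illustration of the textbook update R = A'H, H = A R with normalization, not Ask.com's ExpertRank; the function names `hits` and `normalize`, the L2 normalization, the iteration cap, and the tolerance are all assumptions made for the example.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def hits(adj, max_iterations=100, tol=1e-9):
    """Return (hubs, authorities) for a 0/1 matrix where adj[i][j] == 1
    means page i links to page j."""
    n = len(adj)
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(max_iterations):
        # R = A' H: a page's authority is the sum of hub weights pointing to it.
        new_auths = normalize(
            [sum(adj[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        )
        # H = A R: a page's hub weight is the sum of authorities it points to.
        new_hubs = normalize(
            [sum(adj[i][j] * new_auths[j] for j in range(n)) for i in range(n)]
        )
        delta = max(
            max(abs(a - b) for a, b in zip(new_auths, auths)),
            max(abs(a - b) for a, b in zip(new_hubs, hubs)),
        )
        hubs, auths = new_hubs, new_auths
        if delta < tol:
            break
    return hubs, auths


# Hypothetical three-page graph: pages 0 and 1 both link to page 2.
example = [
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
]
example_hubs, example_auths = hits(example)
print("hubs:", example_hubs)
print("authorities:", example_auths)
```

In this toy graph page 2 collects all the authority weight, while pages 0 and 1 share the hub weight equally.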

  15. Steps of ExpertRank at Ask.com • 1: Search the index for a query • 2: Clustering for subject communities for matched results • 3: Local subject-specific mining • 4: Ranking with knowledge and classification [Figure label: Local Subject Community]

  16. Step 1: Index search and web graph generation • Search the index and identify relevant candidates for a given query. • Generate a query-specific link graph dynamically.
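The two bullets of this step can be illustrated with a toy sketch: look up candidates in an inverted index, then keep only the links that connect candidates to each other. Everything here is hypothetical (the names `search_index` and `query_link_graph`, the dictionary layouts, and the choice of requiring every query term to match); the slides do not describe how Ask.com selects candidates.

```python
def search_index(inverted_index, query):
    """Return the set of pages that contain every query term.

    inverted_index maps a term to the set of page ids containing it.
    Requiring all terms (AND semantics) is an assumption of this sketch.
    """
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [set(inverted_index.get(term, ())) for term in terms]
    return set.intersection(*postings)


def query_link_graph(candidates, links):
    """Build the link graph induced by the candidate pages.

    links maps a page id to the page ids it points to; only edges whose
    source and target are both candidates are kept.
    """
    return {
        page: sorted(t for t in links.get(page, ()) if t in candidates)
        for page in sorted(candidates)
    }


# Hypothetical toy data.
index = {
    "baseball": {"p1", "p2"},
    "bat": {"p1", "p2", "p3"},
}
web_links = {"p1": ["p2", "p3"], "p2": ["p1"], "p3": ["p1"]}

matched = search_index(index, "baseball bat")
graph = query_link_graph(matched, web_links)
print(graph)
```

The resulting graph is small enough to analyze per query, which is the property the link-analysis steps that follow rely on.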

  17. Step 2: Multi-stage Cluster Refinement with Integrated Link/Topic Analysis • Derive link-guided page communities. • Cluster refinement with topic purification • Decompose through text classification and NLP • Restructure through topic similarity analysis

  18. Step 3: Subject-specific ranking [Figure labels: Hub, Authority] • Examples: • “bat”: Flying mammals vs. baseball bat. • “microwave dish”: food recipes/cookware vs. satellite TV reception. • For each topic group, identify experts for page recommendation, and remove spamming links.

  19. Step 4: Integrated Ranking with User Intention Analysis [Figure labels: Hub, Authority, Local Subject Community] • Score weighting from multiple topic groups. • Authoritativeness and freshness assessment. • User intention analysis. • Result diversification.

  20. Scalability Challenges • Data scalability: • From millions of pages to billions of pages. • Clean datasets vs. datasets with lots of noise. • Infrastructure scalability: • Tens of thousands of machines. • Tens of millions of users. • Impact on response time, throughput, and availability, • data center power/space/networking. • People scalability: From a few people to many engineers with non-uniform experience.

  21. Examples of Scalability Problems • Mining question answers from the web. • Computing with irregular data structures. Level-1/2 cache. • Large-scale memory management: 32 bits vs. 64 bits. • Incremental cluster expansion and topology-aware management. • High-throughput write/read traffic: reliability vs. performance. Logging and checkpointing. • Fast and reliable data propagation across networks. • Architecture optimization for low power consumption. • Software engineering • Updating large software and data on a live platform. • Distributed debugging across thousands of machines.

  22. Some Lessons Learned • Data • Data methods can behave differently with different data sizes/noise levels. • Data-driven approaches with iterative refinement to track positive/negative effectiveness • Architecture & Software • Distributed service-oriented architectures • Middleware support.

  23. The Neptune Clustering Middleware • A simple/flexible programming model • Aggregating and replicating application modules with persistent data. • Shielding complexity of service discovery, load balancing, consistency, and failover management • Providing inter-service communication. • Providing quality-aware request scheduling for service differentiation • Started at UCSB. Evolved with Teoma, Ask.com.

  24. Programming Model and Cluster-level Parallelism/Redundancy in Neptune • Request-driven processing model. • SPMD model (single program/multiple data) while large data sets are partitioned and replicated. • Location-transparent service access with consistency support. [Diagram labels: Request, Service cluster, Provider module (x2), Service method, Data, …, Clustering by Neptune]
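The combination of partitioning and replication named on this slide can be sketched as a two-step routing decision: map a request key to a data partition, then pick any live replica of that partition. This is only an illustration under stated assumptions, not Neptune's interface: the function names, the MD5-based hashing, and the random replica choice are all hypothetical.

```python
import hashlib
import random


def partition_of(key, num_partitions):
    """Map a request key to a partition id with a stable hash (assumed scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def pick_replica(partition, replicas, alive, rng=random):
    """Choose one live node among the replicas that hold a partition.

    replicas maps partition id -> list of node names; alive is the set of
    nodes currently believed to be up. The caller never names a node
    directly, which is the location-transparency idea in miniature.
    """
    candidates = [node for node in replicas[partition] if node in alive]
    if not candidates:
        raise LookupError(f"no live replica for partition {partition}")
    return rng.choice(candidates)


# Hypothetical layout: 2 partitions, each replicated on 2 nodes.
layout = {0: ["node-a", "node-b"], 1: ["node-c", "node-d"]}
live_nodes = {"node-a", "node-b", "node-d"}

part = partition_of("example query", len(layout))
print("partition", part, "served by", pick_replica(part, layout, live_nodes))
```

Every node runs the same service code over its own slice of the data, so adding replicas raises throughput and availability without changing the caller.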

  25. Neptune architecture for cluster-based services • Symmetric and decentralized: • Each node can host multiple services, acting as a service provider. • Each node can also subscribe to internal services from other nodes, acting as a consumer. • Supports multi-tier or nested service architectures. [Diagram labels: Client requests, Service consumer/provider]

  26. Inside a Neptune Server Node [Diagram labels: Network to the rest of the cluster, Service Access Point, Service Availability Directory, Polling Agent, Service Load-balancing Subsystem, Service Availability Subsystem, Service Providers, Load Index Server, Service Availability Publishing, Service Runtime, Service Consumers, Service Handling Module]

  27. Impact of Component Failure in Multi-tier services • Failure of one replica: 7s - 12s • Service unavailable: 10s - 13s

  28. Problems that affect availability • Tradeoff: Bounded pools in multi-threaded services. • Threads are blocked with slow service dependency. • Fault detection speed. [Diagram labels: Requests, Queue, Thread Pool, Service A (from healthy to unresponsive), Service B Replica #1 (Unresponsive), Service B Replica #2 (Healthy)]

  29. Dependency Isolation • Per-dependency management with capsules. • Isolate their performance impact. • Maintain dependency-specific feedback information for QoS control. • Programming support with automatic recognition of dependency states.
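One way to picture per-dependency isolation is to give each downstream dependency its own small concurrency budget, so a slow dependency can exhaust only its own slots rather than the whole thread pool. The sketch below is an assumption-laden illustration, not the capsule mechanism from the talk: the `Capsule` class, the `CapsuleFull` exception, the fail-fast policy, and the counters are all hypothetical, and it does not attempt the automatic recognition of dependency states mentioned on the slide.

```python
import threading


class CapsuleFull(Exception):
    """Raised when a dependency has no free slot left (hypothetical name)."""


class Capsule:
    """Bound the number of in-flight calls to one dependency."""

    def __init__(self, name, max_in_flight):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        # Simple per-dependency feedback counters.
        self.completed = 0
        self.rejected = 0

    def call(self, fn, *args):
        # Fail fast instead of blocking a worker thread on a slow dependency.
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.rejected += 1
            raise CapsuleFull(f"{self.name}: all slots busy")
        try:
            result = fn(*args)
            with self._lock:
                self.completed += 1
            return result
        finally:
            self._slots.release()


# Usage: one capsule per dependency, each with its own budget.
service_b = Capsule("service-b", max_in_flight=2)
print(service_b.call(lambda x: x * 2, 21))
```

A caller that catches `CapsuleFull` can retry on another replica or degrade the response, and the counters give a crude per-dependency signal for QoS decisions.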

  30. Fast Fault Detection and Information Propagation for Large-Scale Cluster-Based Services • Complex 24x7 network topology in service clusters. • Frequent events: failures, structure changes, and new services. • Yellowpage directory • discovery of services and their attributes • Server aliveness

  31. TAMP: Topology-Adaptive Membership Protocol • Highly efficient: optimizes bandwidth and number of packets • Topology-aware: • Forms a hierarchical tree according to network topology • Localizes traffic within switches and adapts to changes of switch architecture. • Topology-adaptive: • Network changes: switches • Scalable: scales to tens of thousands of nodes. Easy to operate.
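To see why a topology-shaped hierarchy localizes traffic, consider a toy two-level tree: nodes are grouped by the switch they attach to, and one leader per switch exchanges membership information with the other leaders. This is not the TAMP protocol itself; the grouping rule, the lowest-name leader choice, and the pairwise heartbeat cost model are assumptions chosen only to make the arithmetic concrete.

```python
def build_membership_tree(node_to_switch):
    """Group nodes by switch and pick one leader per group.

    Choosing the lexicographically smallest node as leader is an
    assumption of this sketch.
    """
    groups = {}
    for node, switch in node_to_switch.items():
        groups.setdefault(switch, []).append(node)
    tree = {}
    for switch, members in groups.items():
        members.sort()
        tree[switch] = {"leader": members[0], "members": members}
    return tree


def heartbeat_counts(tree):
    """Compare pairwise heartbeats per round: flat all-to-all versus
    all-to-all inside each switch plus all-to-all among leaders."""
    sizes = [len(group["members"]) for group in tree.values()]
    total = sum(sizes)
    flat = total * (total - 1)
    local = sum(size * (size - 1) for size in sizes)
    leaders = len(tree)
    hierarchical = local + leaders * (leaders - 1)
    return flat, hierarchical


# Hypothetical cluster: 6 nodes spread over 3 switches.
cluster = {
    "n1": "sw1", "n2": "sw1",
    "n3": "sw2", "n4": "sw2",
    "n5": "sw3", "n6": "sw3",
}
tree = build_membership_tree(cluster)
flat_count, tree_count = heartbeat_counts(tree)
print(tree)
print("flat:", flat_count, "hierarchical:", tree_count)
```

Under this cost model most messages stay inside a switch, and only the leader-to-leader messages cross switch boundaries.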

  32. Reliable Communication for Large-Scale Data-Intensive Computing • Small messages or large files • Membership dynamicity and fault masking • Easy programming [Diagram labels: Senders, Receivers]

  33. Solution for large-scale NxM communication [Diagram labels: Senders, Mediation Layer, Receivers]

  34. Concluding Remarks • Ask.com is focused on leading-edge technology for Internet search. • Various solutions have been developed for ranking, mining, and infrastructure support. • There are still many open, challenging problems to be solved: exciting opportunities.
