Divide and Conquer: Challenges in Scaling Federated Search

Divide and Conquer: Challenges in Scaling Federated Search. Presented by Abe Lederman, President and CTO, Deep Web Technologies, LLC. SearchEngine Meeting, 24 April 2006, Boston, MA.

Presentation Transcript


  1. Divide and Conquer: Challenges in Scaling Federated Search. Presented by Abe Lederman, President and CTO, Deep Web Technologies, LLC. SearchEngine Meeting, 24 April 2006, Boston, MA

  2. SEARCH ALL OF THESE SOURCES ONE AT A TIME

  3. OR SEARCH THEM ALL AT ONCE

  4. Finding the Gold Hidden in the World Wide Web “Google-type” search engines “pan” the surface web for gold “Deep Web” search engines go mining for gold

  5. Finding the Gold Hidden in the World Wide Web “Google-type” search engines “pan” the surface web for gold “Deep Web” search engines go mining for gold

  6. Challenges Overview • Managing a large number of sources • Searching a large number of sources in parallel • Organizing and ranking the results returned

  7. Challenges of Managing Thousands of Data Sources • Locate Reliable Sources • Categorize Sources by Content • Configure Sources for Searching • Maintain Sources

  8. Challenges in Searching Thousands of Sources • Automatically Select Sources to Search • Retrieve Results from Cache • Perform Many Searches in Parallel • Bring Back Best Results
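
As a rough illustration of performing many searches in parallel, the Python sketch below fans one query out to several sources with a thread pool and skips sources that fail or miss the deadline. It is not taken from the presentation: `search_all`, the `sources` mapping of names to callables, and the timeout value are all hypothetical stand-ins for real source connectors.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout


def search_all(query, sources, timeout=5.0, max_workers=8):
    """Query every source concurrently and return {source_name: results}.

    `sources` is an assumed mapping of source name -> callable(query) -> list.
    Sources that raise or miss the overall deadline are left out.
    """
    results = {}
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = {pool.submit(fn, query): name for name, fn in sources.items()}
    try:
        for future in as_completed(futures, timeout=timeout):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception:
                # One broken source should not sink the whole federated search.
                pass
    except FuturesTimeout:
        # Deadline reached: return whatever has come back so far.
        pass
    finally:
        # Do not block the caller on sources that are still running.
        pool.shutdown(wait=False)
    return results
```

Returning partial results at the deadline is one possible design choice; a production system would also need per-source limits and retry policies.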

  9. Search Conductor • Source Selection Optimizer • Source Descriptions • Previous Results

  10. Caching of Search Results Reduces the load (cost) of accessing sources CHALLENGES • Requires a large database • Need to determine how often to update the cache • Works best with lots of users doing similar searches
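
To make the update-frequency question concrete, here is a minimal sketch of a result cache with a maximum entry age. It is an in-memory illustration only (the slide notes that a real cache requires a large database), and `ResultCache`, its parameters, and the one-hour default are assumptions, not details from the presentation.

```python
import time


class ResultCache:
    """In-memory cache of search results keyed by (source, normalized query)."""

    def __init__(self, max_age_seconds=3600.0, clock=time.monotonic):
        self.max_age = max_age_seconds  # assumed refresh policy: fixed maximum age
        self.clock = clock              # injectable so expiry can be tested
        self._entries = {}

    @staticmethod
    def _key(source, query):
        # Normalize case and whitespace so similar searches share one entry.
        return source, " ".join(query.lower().split())

    def get(self, source, query):
        """Return cached results, or None if missing or older than max_age."""
        key = self._key(source, query)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, results = entry
        if self.clock() - stored_at > self.max_age:
            del self._entries[key]  # stale: force a fresh search of the source
            return None
        return results

    def put(self, source, query, results):
        self._entries[self._key(source, query)] = (self.clock(), results)
```

The query normalization reflects the slide's point that caching pays off most when many users run similar searches: the more queries map to the same key, the higher the hit rate.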

  11. We Address Scalability Through a Grid-Based Solution • Uses open standards (Web Services, WSDL, SOAP, XML) • Runs on distributed nodes • Is platform independent (Java based) • Very flexible, providing a framework for integration of various filtering and analysis tools

  12. Distributing the Workload as Grid Services

  13. Search Conductor flow: Select sources to search → Perform Search → Enough good results? • YES: Deliver results to user • NO: Can I get more results from “good” sources? • YES: Get Next Results • NO: Select sources to search
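
The decision loop on this slide can be sketched roughly as follows. This is one reading of the flowchart, not the presenter's code: the `fetch(query, page)` source interface, the score threshold that defines a "good" result, and the batch size are all hypothetical.

```python
def conduct_search(query, ranked_sources, enough=10, batch=2, good_score=0.5):
    """Search sources in batches until enough good results are collected.

    `ranked_sources` is an assumed list of callables fetch(query, page) that
    return a list of {"title": ..., "score": ...} dicts (empty when exhausted).
    """
    good = []
    remaining = list(ranked_sources)
    while len(good) < enough and remaining:
        # "Select sources to search": take the next batch of sources.
        selected, remaining = remaining[:batch], remaining[batch:]
        page = 0
        while selected and len(good) < enough:
            still_good = []
            for fetch in selected:
                # "Perform Search" on page 0, "Get Next Results" on later pages.
                hits = [r for r in fetch(query, page) if r["score"] >= good_score]
                good.extend(hits)
                if hits:
                    # Treat a source as "good" while it keeps yielding good hits.
                    still_good.append(fetch)
            selected = still_good
            page += 1
    # "Deliver results to user": best results first.
    return sorted(good, key=lambda r: r["score"], reverse=True)[:enough]
```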

  14. Searching a large number of sources can lead to a flood of results

  15. Challenges in Organizing and Ranking Results • Multi-tier Relevance Ranking • User-driven Ranking • Clustering of Results

  16. Multi-tier Relevance Ranking • QuickRank – Ranks results based on occurrence of search terms in title, author, and snippet • MetaRank – Ranks results utilizing custom algorithms applied to meta-data • DeepRank – Downloads and indexes full-text documents HEAVY LIFTING REQUIRED!
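
As an illustration of the first tier, the sketch below scores each result by counting search-term occurrences in its title, author, and snippet. The slide does not say how QuickRank combines the three fields, so the per-field weights and the `quick_rank` function itself are assumptions.

```python
import re


def quick_rank(query, results, weights=None):
    """Order results by weighted search-term occurrences in three fields.

    `results` is an assumed list of dicts with optional "title", "author",
    and "snippet" strings. The default weights are illustrative only.
    """
    weights = weights or {"title": 3.0, "author": 2.0, "snippet": 1.0}
    terms = re.findall(r"\w+", query.lower())

    def score(result):
        total = 0.0
        for field, weight in weights.items():
            words = re.findall(r"\w+", result.get(field, "").lower())
            total += weight * sum(words.count(term) for term in terms)
        return total

    # Python's sort is stable, so ties keep the order the sources returned.
    return sorted(results, key=score, reverse=True)
```

Because this tier needs only the fields already present in the result list, it can run without fetching any documents, unlike the full-text download and indexing that DeepRank requires.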

  17. User-driven Ranking • Desired: blending (weighting) of the above criteria
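
One simple way to blend criteria is a user-weighted average of per-criterion scores, sketched below. The criteria the slide refers to are not in this transcript, so the names used here ("relevance", "recency") are purely hypothetical, as are `blend_rank` and the score layout.

```python
def blend_rank(results, user_weights):
    """Order results by a weighted average of their per-criterion scores.

    `results` is an assumed list of dicts with a "scores" mapping of
    criterion name -> value in [0, 1]; `user_weights` maps criterion -> weight.
    """
    total = sum(user_weights.values())
    if total <= 0:
        raise ValueError("at least one criterion needs a positive weight")

    def blended(result):
        scores = result["scores"]
        return sum(w * scores.get(c, 0.0) for c, w in user_weights.items()) / total

    return sorted(results, key=blended, reverse=True)
```

Changing the weights reorders the same result set without re-running the search, which is what makes the ranking user-driven.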

  18. Clustering

  19. A Grand Challenge for Federated Search Source: Walter Warnick, Ph.D., DOE OSTI. Global Discovery: Increasing the Pace of Knowledge Diffusion to Increase the Pace of Science. Presented at the Annual Meeting of the American Association for the Advancement of Science, February 16-20, 2006.

  20. Knowledge Diffusion in Action • Global Discovery Search Portal: Math Community, Biology Community, Physics Community • Math Databases (Research Papers, Correspondence, Conferences): Mathematician’s Scientific Discovery • Biology Databases (Research Papers, Correspondence, Conferences): Biology Researcher’s Scientific Discovery • Physics Databases (Research Papers, Correspondence, Conferences): Physics Scientific Discovery

  21. Grid of Grids: Scaling to the Next Level • Each circle = a portal with 10-100 sources • End result is thousands of sources in 2 hops

  22. Thank You! Abe Lederman, 122 Longview Drive, Los Alamos, NM 87544, abe@deepwebtech.com, www.deepwebtech.com