
Google Search Appliance

Google Search Appliance. November 2, 2010. Susan Fagan. Why Google Search Appliance? A different approach to search at EPA; smarter ranking; improved indexing; easier operations; a future. We’re going to call it GSA from here on in. How GSA ranks documents.


Presentation Transcript


  1. Google Search Appliance November 2, 2010 Susan Fagan

  2. Why Google Search Appliance? • A different approach to search at EPA • Smarter ranking • Improved indexing • Easier operations • A future (We’re going to call it GSA from here on in.)

  3. How GSA ranks documents • It’s a secret, but we know some things • Page rank • Self learning • We can control some things • Date biasing • Source biasing • Metadata biasing • Best bets • We’re going to let it do its thing before we tune it too much

  4. How GSA ranks documents: Page Rank • Who links to your pages? • Who links to pages that link to your pages? • How does everybody link? • What does it say in the link text? • Is the link always the primary URL (because if it isn’t, you don’t get any points)? Note: a primary URL is one that contains no non-primary aliases, where “primary” is defined by what you put in the TSSMS Alias Tool.
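The slides say the GSA ranking is a secret, so the following is only the textbook, simplified PageRank iteration, not the appliance's actual algorithm. It illustrates the "who links to pages that link to your pages" idea. The page names, damping factor, and iteration count are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank. `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page site: nothing links to "orphan".
links = {
    "home": ["about", "data"],
    "about": ["home"],
    "data": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(links)
```

In this toy graph the page that nobody links to ends up with the lowest rank, which is why links that point at a non-primary alias would earn the primary URL nothing.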

  5. How GSA Ranks Documents: Things We Can Control • Date biasing • Newer is better • We control how much better • Source biasing • Boost or decrease chunks of our website • Regions are slightly decreased for Agency search • Metadata biasing • We control how much each metadata field counts • We can turn up the bias as metadata quality improves
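How the appliance combines these biases internally is not described in the slides. As a rough sketch, assuming each bias acts as a multiplier on a base relevance score, the date and source controls could behave like this. The function name, the half-life decay, the URL prefixes, and every number are hypothetical.

```python
from datetime import date


def biased_score(base_score, doc, today, date_strength=0.5,
                 half_life_days=365, source_bias=None):
    """Apply hypothetical date and source multipliers to a base score."""
    age_days = max(0, (today - doc["date"]).days)
    # Freshness is 1.0 for a brand-new document and decays toward 0 with age.
    freshness = 0.5 ** (age_days / half_life_days)
    # "We control how much better": date_strength scales the boost for new pages.
    date_factor = 1.0 + date_strength * freshness
    # Source bias: boost or decrease whole chunks of the site by URL prefix.
    source_factor = 1.0
    for prefix, factor in (source_bias or {}).items():
        if doc["url"].startswith(prefix):
            source_factor = factor
            break
    return base_score * date_factor * source_factor


today = date(2010, 11, 2)
bias = {"http://example.gov/region": 0.9}  # regions slightly decreased (assumed value)
new_doc = {"url": "http://example.gov/water/a.html", "date": date(2010, 10, 1)}
old_doc = {"url": "http://example.gov/water/b.html", "date": date(2005, 1, 1)}
region_doc = {"url": "http://example.gov/region1/a.html", "date": date(2010, 10, 1)}

new_score = biased_score(1.0, new_doc, today, source_bias=bias)
old_score = biased_score(1.0, old_doc, today, source_bias=bias)
region_score = biased_score(1.0, region_doc, today, source_bias=bias)
```

Raising `date_strength` makes newer documents win by a wider margin. A metadata bias could be added as one more multiplier in the same way.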

  6. How GSA Ranks Documents: More Things We Can Control • Best Bets • Like buying keywords from Google.com • Specific pages for specific keywords or phrases • Always featured at the top • Take effect immediately
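Conceptually, Best Bets is a lookup from a keyword or phrase to hand-picked pages that are placed ahead of the ranked results. A minimal sketch, assuming exact matching on the lowercased query; the keywords, URLs, and function name are made up for illustration.

```python
# Hypothetical Best Bets table: exact keyword or phrase -> featured pages.
BEST_BETS = {
    "air quality": ["http://example.gov/air/"],
    "drinking water": ["http://example.gov/water/drinking/"],
}


def with_best_bets(query, ranked_results):
    """Put any Best Bets for the query ahead of the normally ranked results."""
    featured = BEST_BETS.get(query.strip().lower(), [])
    # Avoid listing a featured page a second time further down (an assumption).
    rest = [url for url in ranked_results if url not in featured]
    return featured + rest


results = with_best_bets(
    "Air Quality",
    ["http://example.gov/news/1.html", "http://example.gov/air/"],
)
```

Because the table is consulted at query time rather than at crawl time, a change to it would show up on the next search, which matches the "take effect immediately" point.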

  7. How GSA Indexes Documents • Continuous crawl • Learns by experience • Crawl rates tunable by host and time • Requires some starting points (seeds) • Restricted by the Do Not Crawl list (a manually maintained list, in the GSA Admin UI, of URL patterns that the crawler should not visit) • Respects robots.txt (in its own way)
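The two crawl restrictions on this slide can be pictured as a filter applied to each candidate URL. This sketch uses shell-style glob patterns as a stand-in for the appliance's own pattern syntax (which differs), and Python's standard `urllib.robotparser` for robots.txt. The patterns, robots.txt content, and crawler name are hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.robotparser import RobotFileParser

# Hypothetical Do Not Crawl entries, written as glob patterns for illustration.
DO_NOT_CRAWL = [
    "http://example.gov/oldalias/*",
    "*.tmp",
]

# Hypothetical robots.txt content for the same host.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""


def build_robots(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def may_crawl(url, robots, agent="example-crawler"):
    """A URL is crawlable only if no Do Not Crawl pattern and no robots rule blocks it."""
    if any(fnmatchcase(url, pattern) for pattern in DO_NOT_CRAWL):
        return False
    return robots.can_fetch(agent, url)


robots = build_robots(ROBOTS_TXT)
allowed = may_crawl("http://example.gov/water/index.html", robots)
blocked_by_list = may_crawl("http://example.gov/oldalias/index.html", robots)
blocked_by_robots = may_crawl("http://example.gov/private/notes.html", robots)
```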

  8. How EPA is implementing GSA • Same Java webapp on the same servers • Your search form will stay the same • Area search won’t change much • Your XML search application may change (most won’t) • Smart, fast indexing, with some help • Only indexing primary URLs

  9. Implementing GSA: Your search form will stay the same • Implemented Northern Light via an object-oriented Java application • We get to keep our code this time • 6 weeks to change it, instead of 6 months • Nothing changes for client pages • Two Model 7007 Google Search Appliances: • Primary • Hot spare for failover • Parallel indexes • 2,000,000 document license

  10. Implementing GSA: Your search form • URL is the same • All common elements work the same • Some obscure elements go away • weighted_search, search_crumbs • Custom result templates work the same • Advanced search works the same

  11. Implementing GSA: Area Search • Area search is here for now • If you search by TSSMS • We will translate it on the fly to URL • We will only translate TSSMS to primary alias • If you search by URL • Nothing changes… • … but aliases are your problem • Contact Peter to test your area search

  12. Implementing GSA: Your XML search app • Parameters and templates are unchanged • GSA response packet automatically transformed to original NL format • Only 1,000 results are available for a single query • 3 applications have been observed exceeding that limit
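For applications that page deep into a result set, the 1,000-result ceiling means any request window has to be clamped. A small sketch of that arithmetic follows; the function name and the return shape are assumptions, not part of the EPA application.

```python
MAX_RESULTS = 1000  # per-query ceiling stated on the slide


def page_window(start, page_size):
    """Return (start, count) clamped to the ceiling, or None if past it."""
    if start < 0 or page_size <= 0:
        raise ValueError("start must be >= 0 and page_size must be > 0")
    if start >= MAX_RESULTS:
        return None  # nothing is available beyond the ceiling
    return start, min(page_size, MAX_RESULTS - start)


first_page = page_window(0, 100)
last_page = page_window(950, 100)
beyond_limit = page_window(1000, 100)
```

An application that needs more than 1,000 documents would have to narrow its query (for example by date or area) rather than page further.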

  13. Implementing GSA: Smart, fast indexing • Continuous crawl – scans the website at least daily for new links • If it’s not linked, it won’t be found • Librarian looks daily for new content • If all this doesn’t work (quickly), tell the librarian • Notes databases do not require Verity Views

  14. Implementing GSA: Indexing your primary URL • Search engines think different URLs are different documents • This means duplicates in search results • All non-primary aliases are being placed in the Do Not Crawl list
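To see why aliases produce duplicates, consider results keyed by URL: two URLs for the same page look like two documents unless each is first mapped to its primary form. The alias table below is a hypothetical stand-in for what the TSSMS Alias Tool holds.

```python
# Hypothetical alias table: non-primary URL -> primary URL.
ALIASES = {
    "http://example.gov/h2o/": "http://example.gov/water/",
}


def primary_url(url):
    """Map an alias to its primary URL; primary URLs map to themselves."""
    return ALIASES.get(url, url)


def dedupe(results):
    """Collapse results that point at the same primary URL, keeping first-seen order."""
    seen = set()
    unique = []
    for url in results:
        primary = primary_url(url)
        if primary not in seen:
            seen.add(primary)
            unique.append(primary)
    return unique


unique_results = dedupe([
    "http://example.gov/water/",
    "http://example.gov/h2o/",
    "http://example.gov/air/",
])
```

Keeping non-primary aliases out of the crawl, as the slide describes, avoids the problem at index time instead of cleaning it up at query time.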

  15. What will our customers see? • The same thing… at first. • Breadcrumbs are gone… what were they, anyway? • Folders replaced by Related Searches • FAQ will come back • Best Bets for top documents • The document they’re looking for!

  16. What do we have to do? • Plan our November 19 public access implementation • Test (with your help) • Implement • Make it better

  17. What do you have to do? • Keep working on ROT • Keep working on metadata • Don’t change your search form… • … Area search will work, if you want it • Tell us what you think

  18. What are we leaving out … for now? • EPA thesaurus • Contains only general terms • We will add EPA vocabulary • Google’s spellchecker • We’ll use our own for now • We’ll compare and use the winner • RSS presentation – delivers only raw XML in search results, for now • Recent searches

  19. What’s in our future? • Marketplace of One Box modules • Faceted search? • Contextual search? • Business intelligence? • More social media • OneEPA integration • Web CMS integration • Advanced analytics • Special collections • Geographic search? • GSA for intranet

  20. Contact: Susan Fagan Fagan.Susan@epa.gov 202-566-2021
