Google Search Appliance November 2, 2010 Susan Fagan
Why Google Search Appliance? • A different approach to search at EPA • Smarter ranking • Improved indexing • Easier operations • A future. (We’re going to call it GSA from here on.)
How GSA ranks documents • It’s a secret, but we know some things • Page rank • Self-learning • We can control some things • Date biasing • Source biasing • Metadata biasing • Best bets • We’re going to let it do its thing before we tune it too much
How GSA ranks documents: Page Rank • Who links to your pages? • Who links to pages that link to your pages? • How does everybody link? • What does it say in the link text? • Is the link always the primary URL (because if it isn’t, you don’t get any points)? A primary URL is a URL that contains no non-primary aliases, where “primary” is defined by what you put in the TSSMS Alias Tool.
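GSA's actual ranking is not public, so the following is only a minimal sketch of the textbook PageRank idea (rank flows along links), not the appliance's implementation. The page names are hypothetical. It illustrates why links that point at an alias instead of the primary URL earn the primary URL no credit.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> rank; the ranks sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its outlinks.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical site: two pages link to the primary URL, one to an alias.
    links = {"a": ["primary"], "b": ["primary"], "c": ["alias"]}
    for page, score in sorted(pagerank(links).items()):
        print(page, round(score, 3))
```

In this toy graph the link from `c` is credited to `alias` as if it were a separate document, which is the effect the slide warns about.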
How GSA Ranks Documents: Things We Can Control • Date biasing • Newer is better • We control how much better • Source biasing • Boost or decrease chunks of our website • Regions are slightly decreased for Agency search • Metadata biasing • We control how much each metadata field counts • We can turn up the bias as metadata quality improves
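How the appliance combines these biases internally is not documented here, so the sketch below is purely illustrative: it assumes a base relevance score multiplied by a date factor and a source factor. The function name, the half-life decay, the weights, and the URL prefixes are all assumptions made for the example.

```python
from datetime import date


def biased_score(base_score, doc_date, url, today,
                 date_weight=0.5, half_life_days=365.0, source_bias=None):
    """Apply a hypothetical date bias and source bias to a base score."""
    # Date bias: newer is better; date_weight controls how much better.
    age_days = max((today - doc_date).days, 0)
    freshness = 0.5 ** (age_days / half_life_days)  # 1.0 today, 0.5 after one half-life
    date_factor = 1.0 + date_weight * freshness

    # Source bias: boost or decrease a chunk of the site by URL prefix.
    source_factor = 1.0
    for prefix, factor in (source_bias or {}).items():
        if url.startswith(prefix):
            source_factor = factor
            break

    return base_score * date_factor * source_factor


if __name__ == "__main__":
    today = date(2010, 11, 2)
    # Illustrative prefix: regional pages slightly decreased for Agency search.
    bias = {"http://www.epa.gov/region": 0.9}
    print(biased_score(1.0, date(2010, 11, 2), "http://www.epa.gov/air/", today, source_bias=bias))
    print(biased_score(1.0, date(2009, 11, 2), "http://www.epa.gov/air/", today, source_bias=bias))
    print(biased_score(1.0, date(2010, 11, 2), "http://www.epa.gov/region5/", today, source_bias=bias))
```

Setting `date_weight` to 0 turns date biasing off entirely, which mirrors the "we control how much better" knob on the slide.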
How GSA Ranks Documents: More Things We Can Control • Best Bets • Like buying keywords from Google.com • Specific pages for specific keywords or phrases • Always featured at the top • Take effect immediately
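As a rough illustration of the Best Bets behavior described above (specific pages for specific keywords or phrases, always featured at the top), here is a minimal sketch. The matching rule (whole-phrase match inside the query), the function names, and the URLs are assumptions for the example, not the appliance's actual matching logic.

```python
def best_bets_for(query, best_bets):
    """Return the Best Bet URLs whose phrase appears in the query.

    best_bets is a list of (phrase, url) pairs. A phrase matches when its
    words appear consecutively in the query, ignoring case.
    """
    words = query.lower().split()
    hits = []
    for phrase, url in best_bets:
        phrase_words = phrase.lower().split()
        if not phrase_words:
            continue  # ignore empty phrases
        n = len(phrase_words)
        if any(words[i:i + n] == phrase_words for i in range(len(words) - n + 1)):
            hits.append(url)
    return hits


def search_with_best_bets(query, best_bets, organic_results):
    """Put matching Best Bets first, then the remaining ranked results."""
    featured = best_bets_for(query, best_bets)
    rest = [url for url in organic_results if url not in featured]
    return featured + rest


if __name__ == "__main__":
    bets = [("air quality", "http://www.epa.gov/air/")]  # illustrative entry
    ranked = ["http://www.epa.gov/other/", "http://www.epa.gov/air/"]
    print(search_with_best_bets("Air Quality index", bets, ranked))
```

Because the lookup happens at query time rather than at index time, a new entry in the table applies to the very next search, consistent with "take effect immediately".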
How GSA Indexes Documents • Continuous crawl • Learns by experience • Crawl rates tunable by host and time • Requires some starting points (seeds) • Restricted by Do Not Crawl list (a manually maintained list, in the GSA Admin UI, of URL patterns that the crawler should not visit) • Respects robots.txt (in its own way)
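The two crawl restrictions above can be sketched as a single "may I fetch this URL?" check. This is a simplified illustration, not the appliance's logic: it assumes the Do Not Crawl list is a set of plain URL prefixes (GSA's own pattern syntax is richer), uses Python's standard `urllib.robotparser` for robots.txt, and treats the user-agent name as an assumption.

```python
from urllib.robotparser import RobotFileParser


def make_crawl_filter(robots_txt, do_not_crawl_prefixes, user_agent="gsa-crawler"):
    """Build a function that says whether a URL may be crawled.

    robots_txt is the text of the site's robots.txt file.
    do_not_crawl_prefixes is a list of URL prefixes to skip (an assumed,
    simplified stand-in for the Do Not Crawl patterns).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        # The manually maintained list is checked first.
        if any(url.startswith(prefix) for prefix in do_not_crawl_prefixes):
            return False
        # Then the site's own robots.txt rules apply.
        return parser.can_fetch(user_agent, url)

    return allowed


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    skip = ["http://www.epa.gov/air/index.html"]  # illustrative non-primary alias
    allowed = make_crawl_filter(robots, skip)
    for url in ("http://www.epa.gov/air/",
                "http://www.epa.gov/air/index.html",
                "http://www.epa.gov/private/notes.html"):
        print(url, allowed(url))
```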
How EPA is implementing GSA • Same Java webapp on the same servers • Your search form will stay the same • Area search won’t change much • Your XML search application may change (most won’t) • Smart, fast indexing, with some help • Only indexing primary URLs
Implementing GSA: Your search form will stay the same • Implemented Northern Light via an object-oriented Java application • We get to keep our code this time • 6 weeks to change it, instead of 6 months • Nothing changes for client pages • Two Model 7007 Google Search Appliances: • Primary • Hot spare for failover • Parallel indexes • 2,000,000 document license
Implementing GSA: Your search form • URL is the same • All common elements work the same • Some obscure elements go away • weighted_search, search_crumbs • Custom result templates work the same • Advanced search works the same
Implementing GSA: Area Search • Area search is here for now • If you search by TSSMS • We will translate it on the fly to URL • We will only translate TSSMS to primary alias • If you search by URL • Nothing changes… • … but aliases are your problem • Contact Peter to test your area search
Implementing GSA: Your XML search app • Parameters and templates are unchanged • GSA response packet automatically transformed to original NL format • Only 1,000 results are available for a single query • 3 applications have been observed exceeding that limit
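For applications that page through large result sets, the 1,000-result cap means paging has to stop early. The sketch below is a hypothetical helper (the names `page_requests`, `start`, and `num` are chosen for the example) that plans page requests without ever asking past the cap.

```python
MAX_RESULTS = 1000  # per-query cap stated on the slide


def page_requests(wanted, page_size=100, max_results=MAX_RESULTS):
    """Yield (start, num) pairs covering at most max_results results."""
    total = min(wanted, max_results)
    start = 0
    while start < total:
        num = min(page_size, total - start)
        yield start, num
        start += num


def exceeds_cap(wanted, max_results=MAX_RESULTS):
    """True if an application asks for more results than one query can return."""
    return wanted > max_results


if __name__ == "__main__":
    print(list(page_requests(250)))
    print(exceeds_cap(5000))
```

An application that genuinely needs more than the cap would have to narrow its query (for example by date range or site area) and issue several smaller queries instead.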
Implementing GSA: Smart, fast indexing • Continuous crawl – scans the website at least daily for new links • If it’s not linked, it won’t be found • Librarian looks daily for new content • If all this doesn’t work (quickly), tell the librarian • Notes databases do not require Verity Views
Implementing GSA: Indexing your primary URL • Search engines think different URLs are different documents • This means duplicates in search results • All non-primary aliases are being placed in the Do Not Crawl list
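To make the alias problem concrete, here is a small sketch that takes a table of primary URLs and their aliases, derives the list of URLs to keep out of the crawl, and collapses duplicate results to the primary URL. The table format, function names, and URLs are hypothetical; the real alias data lives in the TSSMS Alias Tool.

```python
def build_alias_index(primary_to_aliases):
    """Map every non-primary alias back to its primary URL."""
    alias_to_primary = {}
    for primary, aliases in primary_to_aliases.items():
        for alias in aliases:
            alias_to_primary[alias] = primary
    return alias_to_primary


def do_not_crawl_list(primary_to_aliases):
    """All non-primary aliases, sorted, ready to be excluded from the crawl."""
    return sorted(a for aliases in primary_to_aliases.values() for a in aliases)


def dedupe_results(urls, alias_to_primary):
    """Collapse aliases to their primary URL, keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        primary = alias_to_primary.get(url, url)
        if primary not in seen:
            seen.add(primary)
            unique.append(primary)
    return unique


if __name__ == "__main__":
    # Illustrative alias table: one document reachable under three URLs.
    table = {"http://www.epa.gov/air/": ["http://www.epa.gov/air/index.html",
                                         "http://www.epa.gov/oar/"]}
    print(do_not_crawl_list(table))
    print(dedupe_results(["http://www.epa.gov/oar/", "http://www.epa.gov/air/"],
                         build_alias_index(table)))
```

Excluding the aliases at crawl time, as the slide describes, avoids the duplicates in the first place; the `dedupe_results` step only shows what the search engine would otherwise have to undo.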
What will our customers see? • The same thing… at first. • Breadcrumbs are gone… what were they, anyway? • Folders replaced by Related Searches • FAQ will come back • Best Bets for top documents • The document they’re looking for!
What do we have to do? • Plan our November 19 public access implementation • Test (with your help) • Implement • Make it better
What do you have to do? • Keep working on ROT • Keep working on metadata • Don’t change your search form… • … Area search will work, if you want it • Tell us what you think
What are we leaving out … for now? • EPA thesaurus • Contains only general terms • We will add EPA vocabulary • Google’s spellchecker • We’ll use our own for now • We’ll compare and use the winner • RSS presentation – delivers only raw XML in search results, for now • Recent searches
What’s in our future? • Marketplace of One Box modules • Faceted search? • Contextual search? • Business intelligence? • More social media • OneEPA integration • Web CMS integration • Advanced analytics • Special collections • Geographic search? • GSA for intranet
Contact: Susan Fagan Fagan.Susan@epa.gov 202-566-2021