eTouch Systems

Presents

NASA Portal Search Implementation

September 17th 2003

Agenda
  • NASA Search Architecture
  • Indexing Statistics
  • Content Discovery
  • Relevance
  • Metadata
  • Robots Exclusion Standards
  • Browsable Categories
  • Recommendations
NASA Search Architecture
  • The Global Traffic Manager (GTM) will load-balance cross-data-center traffic to the Verity search boxes.
Content Discovery Statistics
  • Started with 2600 domains (provided by NASA).
  • Around 1250 *.nasa.gov domains were found to be inaccessible.
  • Exclusion criteria discovered during manual cleansing (a sketch of such an exclusion filter follows this list):
    • 70 domain-level exclusions (provided by NASA)

e.g. http://images.ksc.nasa.gov/

    • ~2500 specific URL / directory-level exclusions

e.g. http://heasarc.gsfc.nasa.gov/listserv* is a mailing list.

    • File-type exclusions

e.g. *.old, *.map, *.spec, *.mod, *.log, etc.

  • 50K–60K documents excluded because of robots restrictions
  • ~80 duplicate domains

e.g. aerospace.arc.nasa.gov & aeronautics.arc.nasa.gov

  • Text-only versions of documents excluded as duplicate content
  • Documents in binary formats excluded
    • Sample binary-format document
  • 67K documents are larger than 1 MB or are 0 bytes in size; of these, 11K fall within the included MIME types (i.e. PDFs, HTMLs, DOCs, XLSs, PPTs).
    • Sample 0-byte document
    • Sample 1 MB document
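
As an illustration of how exclusion rules of this kind might be applied during content discovery, the Python sketch below filters discovered documents by domain, URL pattern, file extension, size, and MIME type. The domain names, patterns, and thresholds are illustrative placeholders, not the actual crawl configuration.

# Illustrative exclusion filter; the lists and limits below are hypothetical examples.
import fnmatch
from urllib.parse import urlparse

EXCLUDED_DOMAINS = {"images.ksc.nasa.gov"}                            # domain-level exclusions
EXCLUDED_URL_PATTERNS = ["http://heasarc.gsfc.nasa.gov/listserv*"]    # URL / directory exclusions
EXCLUDED_EXTENSIONS = (".old", ".map", ".spec", ".mod", ".log")       # file-type exclusions
MAX_SIZE_BYTES = 1024 * 1024                                          # documents over 1 MB are excluded
INCLUDED_MIME_TYPES = {"text/html", "application/pdf", "application/msword",
                       "application/vnd.ms-excel", "application/vnd.ms-powerpoint"}

def should_index(url, size_bytes, mime_type):
    """Return True if a discovered document passes the exclusion rules."""
    host = urlparse(url).netloc.lower()
    if host in EXCLUDED_DOMAINS:
        return False
    if any(fnmatch.fnmatch(url, pattern) for pattern in EXCLUDED_URL_PATTERNS):
        return False
    if url.lower().endswith(EXCLUDED_EXTENSIONS):
        return False
    if size_bytes == 0 or size_bytes > MAX_SIZE_BYTES:
        return False
    return mime_type in INCLUDED_MIME_TYPES

print(should_index("http://images.ksc.nasa.gov/photo.html", 2048, "text/html"))  # False: excluded domain
print(should_index("http://www.nasa.gov/about/index.html", 2048, "text/html"))   # True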
Indexing Status
  • Over 2 million documents crawled.
  • A total of 420K documents indexed.
  • 1600 *.nasa.gov domains are indexed.
  • Included MIME types
  • Dynamic content such as *.jsp, *.cfm, and *.php is also indexed.
Relevance
  • Relevance logic is a set of business rules that determines the order/rank of documents in the search results.
  • The NASA Search Engine generates a VQL (Verity Query Language) query from the search query to produce relevant search results.
  • A score is automatically assigned to each retrieved document based on its relevancy to the search query.
  • Relevancy depends on the presence of the search query in a document's various metadata fields or in the content itself.
  • Some of the most common words (stop words), such as is, what, there, etc., are removed from the search query; these words don't contribute much to relevancy. (A simplified scoring sketch follows this list.)
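
The following Python sketch is a simplified, hypothetical illustration of this kind of scoring: stop words are stripped from the query, and each field of a document contributes a weighted score, with a phrase match in the title weighted highest (as noted on the next slide). The field weights and stop-word list are invented for the example and do not reflect the actual Verity/VQL scoring.

# Simplified relevance scoring sketch; weights and stop words are illustrative only.
import re

STOP_WORDS = {"is", "what", "there", "a", "an", "the", "of"}      # common words removed from queries

# Hypothetical per-field weights: a match in the title carries the most weight.
FIELD_WEIGHTS = {"title": 10, "dc.subject": 6, "dc.description": 4, "content": 1}

def clean_query(query):
    """Lower-case the query and drop stop words, which contribute little to relevancy."""
    return [word for word in re.findall(r"\w+", query.lower()) if word not in STOP_WORDS]

def score(document, query):
    """Score a document by weighted occurrences of the query terms in each field."""
    terms = clean_query(query)
    phrase = " ".join(terms)
    total = 0
    for field, weight in FIELD_WEIGHTS.items():
        text = document.get(field, "").lower()
        if phrase and phrase in text:                 # a phrase match earns an extra bonus
            total += 2 * weight
        total += weight * sum(text.count(term) for term in terms)
    return total

doc = {"title": "2003 Strategic Plan", "dc.subject": "strategy",
       "content": "The NASA 2003 strategic plan describes ..."}
print(score(doc, "what is the 2003 strategic plan"))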
Relevance Logic in detail…
  • Based on numerous discussions and reviews of NASA's content, we came up with an efficient and optimal relevance logic.
  • Relevance factors in units:

A search query present in the title as a phrase carries the maximum weight.

DC is an acronym for the Dublin Core interoperable metadata standards.

Metadata
  • Metadata is a critical component of data that describes its content, quality, condition, and other characteristics.

e.g. <META NAME="dc.subject" CONTENT="news, events">

  • A metadata value can be appropriate free text, or it can be selected from a controlled vocabulary.

e.g. <META NAME="dc.description" CONTENT="A Remote-sensing ..">

  • Metadata fields used in Simple and Advanced Search (see the extraction sketch after this slide)

"Having these metadata fields in content is very important for achieving the closest affinity to the NASA Search relevance logic."
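
To make the role of these fields concrete, here is a minimal Python sketch of how an indexer might pull the meta tags out of an HTML page using the standard library; the sample document and the handling shown are illustrative only, not the actual NASA indexer.

# Minimal sketch: collect <META NAME="..." CONTENT="..."> pairs from an HTML page.
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Gather meta name/content pairs so they can be fed to a search index."""
    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)                       # attribute names arrive lower-cased
        name, content = attrs.get("name"), attrs.get("content")
        if name and content:
            self.metadata[name.lower()] = content

sample_page = '''<head>
<title>Sample NASA Page</title>
<META NAME="dc.subject" CONTENT="news, events">
<META NAME="dc.description" CONTENT="A Remote-sensing ..">
</head>'''

extractor = MetaExtractor()
extractor.feed(sample_page)
print(extractor.metadata)   # {'dc.subject': 'news, events', 'dc.description': 'A Remote-sensing ..'}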

NASA Standard Metadata Fields for Search

Aliases are alternative names for metadata fields. For example, on one site the description field is defined as <META NAME="dc.description" CONTENT="…"> and on another site as <META NAME="description" CONTENT="….">.
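
One way an indexer might handle such aliases is to normalize them to a single canonical field name before indexing; the small Python sketch below uses a hypothetical alias map built from the example above.

# Hypothetical alias map: different sites may expose the same field under different names.
METADATA_ALIASES = {
    "description": "dc.description",
    "keywords": "dc.subject",
}

def canonical_field(name):
    """Map a metadata field alias to its canonical name, if one is defined."""
    return METADATA_ALIASES.get(name.lower(), name.lower())

print(canonical_field("description"))   # dc.description
print(canonical_field("dc.title"))      # dc.title (no alias defined)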

Metadata Continued…
  • Metadata influences the relevancy of documents.
  • For HTML documents, proper image alt text and anchor text enhance Advanced Search capabilities.
  • Suitable metadata is equally important for PDFs, Microsoft Word documents, Excel spreadsheets, etc.
Metadata Examples and Guidelines - Recommended

The Earth Observatory site can be considered an example of good-quality metadata.

http://earthobservatory.nasa.gov/Study/islscp/

Metadata Examples and Guidelines for PDF Documents - Recommended
  • A search for 2003 strategic plan on search.nasa.gov returns http://www.nasa.gov/pdf/1968main_strategi.pdf on top with 99% relevance.
  • This document has 2003 strategic plan in its title and subject, and as a phrase in its content.

Properly populated metadata results in the most relevant document appearing on top.

http://www.nasa.gov/pdf/1968main_strategi.pdf

Metadata Examples and Guidelines for MSWord Documents - Recommended

A suitable title, subject, and keywords should be populated for Microsoft Word and Excel documents.

http://science.ksc.nasa.gov/projects/astwg/vfunct07.doc

Metadata Examples - Not Recommended

On km.nasa.gov, many documents have the same value for the metadata field description.

Metadata should be pertinent to content. It improves the efficiency of searching, making it much easier to find something specific and relevant.

http://km.nasa.gov/

Metadata Examples - Not Recommended

On quest.nasa.gov, many documents have the same value for the metadata fields description and keywords.

http://quest.nasa.gov/women/archive/12-07-99aldas.html

Metadata Examples - Not Recommended

Inappropriately populated metadata negatively affects the relevance logic.

http://amesnews.arc.nasa.gov/releases/2003/03_24AR.html

Metadata Population Tool - Machine Aided Indexing

Metadata can be easily generated using the NASA Thesaurus Machine Aided Indexing (MAI) tool: http://mai.larc.nasa.gov

Generate metadata

http://www.nasa.gov/vision/earth/environment/HURRICANE_RECIPE.html

Metadata Generation using Machine Aided Indexing

Workflow in the MAI tool: paste content, generate keywords from the NASA Thesaurus, and select keywords to include as part of the metadata.

http://mai.larc.nasa.gov/

Robots Exclusion Standards
  • The Robots Exclusion Protocol

A web site administrator can indicate which parts of a site should not be visited by a spider/robot by providing a specially formatted file, robots.txt, in the document root of the site.

e.g. User-agent: *

Disallow: /

This file will not allow any spider/robot to crawl the site.

  • The Robots META tag

This allows HTML authors to indicate to visiting robots whether a document may be indexed or used to harvest more links. No server administrator action is required.

e.g. <META NAME="robots" CONTENT="noindex,follow">

A robot should not index this document but should analyze it for links.
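
As a practical illustration, a crawler can check a site's robots.txt rules programmatically; Python's standard library includes a parser for the Robots Exclusion Protocol. The host name below is a placeholder, and nasak2spider is the NASA spider's user-agent name from the examples on the next slide.

# Sketch of a robots.txt check using Python's standard library; the URL is a placeholder.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.nasa.gov/robots.txt")   # hypothetical site
rp.read()                                               # fetch and parse robots.txt

url = "http://www.example.nasa.gov/cgi-bin/search"
if rp.can_fetch("nasak2spider", url):
    print("robots.txt allows crawling", url)
else:
    print("robots.txt disallows crawling", url)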

Usage of Robots Exclusion Standards
  • Provide an appropriate robots.txt to allow the NASA spider to crawl the desired content. (A programmatic check is sketched after this slide.)

e.g. User-agent: nasak2spider

Disallow:

User-agent: *

Disallow: /

This file will allow nasak2spider, which is the name of the NASA spider, to crawl the site and will deny access to all other robots.

User-agent: nasak2spider

Disallow: /cgi-bin/

User-agent: *

Disallow: /

This file will allow nasak2spider to crawl the site, except /cgi-bin/, and will deny access to all other robots.
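
As a quick check of the behavior described above, the second example can be fed directly to Python's standard-library robots.txt parser; the document paths used here are illustrative.

# Verifying the second robots.txt example: nasak2spider may crawl everything except
# /cgi-bin/, while all other robots are denied. The paths below are illustrative.
from urllib.robotparser import RobotFileParser

robots_txt = """User-agent: nasak2spider
Disallow: /cgi-bin/

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("nasak2spider", "http://example.nasa.gov/data/report.html"))  # True
print(rp.can_fetch("nasak2spider", "http://example.nasa.gov/cgi-bin/search"))    # False
print(rp.can_fetch("someotherbot", "http://example.nasa.gov/data/report.html"))  # False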

Usage of Robots Exclusion Standards
  • Put suitable robots meta tags in place to direct a visiting robot whether to index the document and follow its links.

e.g. <META NAME="robots" CONTENT="index,follow">

<META NAME="robots" CONTENT="noindex,follow">

<META NAME="robots" CONTENT="index,nofollow">

<META NAME="robots" CONTENT="noindex,nofollow">

  • Content discovery and cleansing time can be significantly reduced by using the Robots Exclusion Standards efficiently.
Tips for Frames using Robot Tags
  • Generally, frames can be divided into three parts: the parent frame, the navigation frame, and the content frame.
    • e.g. http://www.sti.nasa.gov/

http://www.sti.nasa.gov/

Tips for Frames using Robot Tags
  • Put appropriate metadata in the parent frame HTML.
  • As the navigation frame doesn't add any value to the content, add a robots meta tag directive so that it is not indexed but its links are followed.

e.g. <META NAME="robots" CONTENT="noindex,follow">

  • The content frame contains the desired information; hence, a robots meta tag directive to index the frame and follow the links in it should be added.

e.g. <META NAME="robots" CONTENT="index,follow">

Browsable Categories
  • Defining a browsable taxonomy is an iterative process that evolves by adding new categories and defining new business rules.
  • Taxonomy and Business Rule Workflow
Recommendations
  • Populate Appropriate Metadata
    • Title, Keywords and Description
    • Meta-tagging should be relevant to the document and the expected search terms
    • Today, less than 10% of total documents have the basic metadata associated with them.
    • Use suggested metadata aliases (if any).
  • Follow the NASA Descriptive Taxonomy and Metadata Guidelines, which are available on the NASA Support site.
    • http://portalpub.jpl.nasa.gov:8080/project_doc/IA_and_Taxonomy/NASA_Descriptive_Taxonomy_Spreadsheet_v8.2_04.02.03.xls
  • Use robots.txt or robots meta tags to direct the NASA spider.
  • Improve Browsable Categories and Business Rules by contributing your feedback.