Searching CiteSeer Metadata Using Nutch

Larry Reeve

INFO624 – Information Retrieval

Dr. Lin – Winter 2005

CiteSeer
  • Search Issues
    • Keyword-based full-text search
    • Boolean search syntax
    • How to…
      • search by author name?
      • search author affiliation?
      • search by publication date?
CiteSeer
  • Example:
    • Suggested author search approach:
      • For authors, list all variants that appear in citations, separated by “OR”
      • Examples:
        • m jordan or michael jordan or m i jordan or michael i jordan

        • howard w/2 white or h w/2 white
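The variant-listing approach above can be sketched as a small helper that builds the OR query from the parts of a name. This is only an illustration: `author_variants` is a hypothetical name, and the choice of variants (initial or full first name, with or without a middle initial) is an assumption modeled on the Jordan example.

```python
def author_variants(first: str, last: str, middle: str = "") -> str:
    """Build an OR query listing common citation variants of one author name."""
    first, middle, last = first.lower(), middle.lower(), last.lower()
    firsts = [first[0], first]  # first initial, then full first name
    variants = [f"{f} {last}" for f in firsts]
    if middle:
        # Repeat both forms with the middle initial inserted.
        variants += [f"{f} {middle[0]} {last}" for f in firsts]
    return " or ".join(variants)


print(author_variants("Michael", "Jordan", middle="I"))
# prints: m jordan or michael jordan or m i jordan or michael i jordan
```

Every variant must be guessed in advance, which is the burden the fielded search in this project aims to remove.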
Goal
  • Search selected metadata fields
    • Author name
    • Author affiliation
    • Publication Date (month, day, year)
    • Title
    • Others…
  • Increase precision
Methodology - Nutch
  • An open-source web search engine
    • Includes crawling, indexing, searching
  • Technologies: Java, JSP, Tomcat
  • Extensible
    • new fields
    • new parsing/indexing facilities
    • adapt UI for searching
Methodology

1) Split XML file into HTML documents

  • Each HTML doc contains metadata
  • Allows existing crawler to be used/extended

2) Crawl and index HTML documents on local filesystem

3) Search generated index using JSP page
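Step 1 might look roughly like the sketch below, which turns each XML record into a small HTML document carrying its metadata in `<meta>` tags. The slides do not show the CiteSeer XML schema or the project's split program, so the `<record>` layout, the element names, the sample values, and the function names here are all assumptions made for illustration. Python is used only to keep the sketch self-contained.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical input; the real CiteSeer metadata schema is not shown in the slides.
SAMPLE_XML = """<records>
  <record id="1"><title>Sample Paper One</title><authorfirst>Ada</authorfirst><authorlast>Example</authorlast><pubyear>2001</pubyear></record>
  <record id="2"><title>Sample Paper Two</title><authorfirst>Bob</authorfirst><authorlast>Sample</authorlast><pubyear>2003</pubyear></record>
</records>"""


def record_to_html(record: ET.Element) -> str:
    """Render one record as an HTML page with one <meta> tag per child element."""
    metas = []
    for child in record:
        name = html.escape(child.tag, quote=True)
        content = html.escape(child.text or "", quote=True)
        metas.append(f'<meta name="{name}" content="{content}">')
    title = html.escape(record.findtext("title", default=""))
    head = "\n".join([f"<title>{title}</title>"] + metas)
    return f"<html><head>\n{head}\n</head><body>{title}</body></html>"


def split_records(xml_text: str) -> dict:
    """Return {filename: html_text}, one HTML document per <record>."""
    root = ET.fromstring(xml_text)
    return {f"{rec.get('id')}.html": record_to_html(rec) for rec in root.iter("record")}


docs = split_records(SAMPLE_XML)
print(sorted(docs))  # file names that would be written to the local filesystem
```

Writing each entry of `docs` to disk would give the crawler a directory of ordinary HTML files to index.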

Methodology

  • Processing pipeline (from the slide diagram):
    • XML File (100 records)
    • Split Program
    • 100 HTML Documents
    • Nutch Crawler, with Parse Filter and Index Filter
    • Nutch Search (JSP), with Query Filter
  • Diagram legend: "Implemented as part of project"

Methodology – Crawl/Index
  • Requires 2 filters to process metadata
    • CSParseFilter
      • Parses HTML for metadata values
      • Implements Nutch HtmlParseFilter interface
    • CSIndexingFilter
      • Uses metadata generated by ParseFilter
      • Adds metadata to index
      • Implements Nutch IndexingFilter interface
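The division of labor between the two filters can be illustrated with a standalone sketch: one step extracts metadata values from the HTML, and a second step copies selected values into index fields. This is not the Nutch API (the real filters are Java classes implementing the interfaces named above). The `<meta>` tag names, the class and function names, and the sample page are hypothetical; only the `cs_` field names come from the slides.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs (the parse-filter role)."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if "name" in attr:
                self.metadata[attr["name"]] = attr.get("content") or ""


def parse_metadata(html_text: str) -> dict:
    collector = MetaCollector()
    collector.feed(html_text)
    return collector.metadata


def to_index_fields(metadata: dict, wanted=("authorfirst", "authorlast", "pubyear")) -> dict:
    """Copy selected values into cs_-prefixed fields (the indexing-filter role)."""
    return {f"cs_{name}": metadata[name].lower() for name in wanted if name in metadata}


# Hypothetical page in the style a split program might produce.
page = (
    '<html><head><meta name="authorfirst" content="Ada">'
    '<meta name="authorlast" content="Example"></head><body></body></html>'
)
print(to_index_fields(parse_metadata(page)))
# prints: {'cs_authorfirst': 'ada', 'cs_authorlast': 'example'}
```

Keeping extraction and indexing separate mirrors the slide: the indexing step consumes only what the parse step produced.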
Methodology – Query
  • Modification of Nutch search page
    • Result links point to CiteSeer instead of the local metadata HTML files
    • Show 20 hits per page, to match CiteSeer
  • Query filter
    • Handles custom fields from index filter
      • Prefixed with cs_
    • Implements Nutch QueryFilter interface
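As a rough picture of what handling `cs_` fields means at query time, the sketch below parses fielded clauses such as `cs_authorlast:lee` and matches them against a toy in-memory index. It stands in for the Java query filter rather than reproducing it. The function names, the toy index, and the rule that every clause must match (implicit AND) are assumptions.

```python
def parse_query(query: str) -> dict:
    """Split 'cs_authorlast:lee cs_authorfirst:peter' into {field: term}."""
    clauses = {}
    for token in query.split():
        field, sep, term = token.partition(":")
        if sep and field.startswith("cs_"):
            clauses[field] = term.lower()
    return clauses


def search(index: list, query: str) -> list:
    """Return documents that match every fielded clause (implicit AND)."""
    clauses = parse_query(query)
    return [doc for doc in index if all(doc.get(f) == t for f, t in clauses.items())]


# Toy index with hypothetical documents.
INDEX = [
    {"cs_authorfirst": "peter", "cs_authorlast": "lee"},
    {"cs_authorfirst": "alice", "cs_authorlast": "lee"},
    {"cs_authorfirst": "bob", "cs_authorlast": "lee"},
    {"cs_authorfirst": "peter", "cs_authorlast": "stone"},
]

print(len(search(INDEX, "cs_authorlast:lee")))                       # prints: 3
print(len(search(INDEX, "cs_authorlast:lee cs_authorfirst:peter")))  # prints: 1
```

Tokens without a `cs_` prefix are ignored here; a real search page would pass them on to ordinary full-text matching.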
Evaluation
  • Testing for precision/recall
    • 100 documents
  • Stress test
    • 10,000 documents
      • Approx 10 mins to crawl/index
    • 575,000 documents in CiteSeer metadata download
      • (716,797 documents in CiteSeer)
      • 3.5 hours to split XML into HTML
      • 12 hours to crawl/index
      • ~551,000 indexed during crawling
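The stress-test figures can be turned into rough rates with a few lines of arithmetic. The inputs are the approximate numbers from the slide. The calculation assumes that the 12 hours of crawling and indexing correspond to the ~551,000 documents that were actually indexed.

```python
def docs_per_minute(docs: int, minutes: float) -> float:
    """Average throughput over a crawl/index run."""
    return docs / minutes


small = docs_per_minute(10_000, 10)        # 10,000 docs in about 10 minutes
full = docs_per_minute(551_000, 12 * 60)   # ~551,000 docs in 12 hours
coverage = 551_000 / 575_000               # share of downloaded records indexed

print(f"{small:.0f} docs/min, {full:.0f} docs/min, {coverage:.1%} indexed")
# prints: 1000 docs/min, 765 docs/min, 95.8% indexed
```

On these figures the full run is somewhat slower per document than the 10,000-document run, and roughly 4% of the downloaded records did not reach the index.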
Evaluation
  • Precision & recall
    • Use first 100 docs (easy to measure recall)
    • Issue queries
        • Author last name
        • Author first & last name
        • Author affiliation
  • Precision
    • Use the full document collection in each system
    • Issue author search queries to both systems
    • Measure precision on each page of 20 hits
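Measuring precision per result page could be done along the lines of the sketch below, which splits a ranked list of relevance judgments into pages of 20 and reports the fraction of relevant hits on each page. The function name and the sample judgments are hypothetical, not results from the project.

```python
def precision_per_page(judgments: list, page_size: int = 20) -> list:
    """judgments[i] is True if the i-th ranked hit is relevant."""
    pages = [judgments[i:i + page_size] for i in range(0, len(judgments), page_size)]
    return [sum(page) / len(page) for page in pages]


# Hypothetical judgments: page 1 all relevant, page 2 half relevant.
ranked = [True] * 20 + [True] * 10 + [False] * 10
print(precision_per_page(ranked))  # prints: [1.0, 0.5]
```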
Evaluation – P & R
  • Look for all papers where Peter Lee is an author (1 relevant document in the 100-document test set)
      • cs_authorlast:lee
        • Returns 3 documents, all with last name of Lee
        • P=.33, R=1
      • cs_authorlast:lee cs_authorfirst:peter
        • Returns single document
        • P=1, R=1
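The P and R values above follow from the standard definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. A minimal sketch, using made-up document IDs (`d1` stands for the one Peter Lee paper):

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) for one query."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1"}  # the single relevant document

# cs_authorlast:lee returns three documents, one of them relevant.
p1, r1 = precision_recall({"d1", "d2", "d3"}, relevant)
# cs_authorlast:lee cs_authorfirst:peter returns only the relevant one.
p2, r2 = precision_recall({"d1"}, relevant)

print(f"P={p1:.2f}, R={r1:.0f}")  # prints: P=0.33, R=1
print(f"P={p2:.0f}, R={r2:.0f}")  # prints: P=1, R=1
```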
Evaluation - Precision
  • Author search:
      • Q1: Peter Lee
        • Project: cs_authorfirst:peter cs_authorlast:lee
        • CiteSeer: peter w/2 lee
      • Q2: Jeffrey Ullman
        • Project: cs_authorfirst:jeffrey cs_authorlast:ullman
        • CiteSeer: jeffrey w/2 ullman
      • Q3: John Smith
        • Project: cs_authorfirst:john cs_authorlast:smith
        • CiteSeer: john w/2 smith
Search Demo
  • Available fields:
    • cs_authorfirst
    • cs_authorlast
    • cs_authoraffiliation
    • cs_pubyear
    • cs_pubmonth