
Enhanced Hierarchical File System Indexer

Enhanced Hierarchical File System Indexer. Matthew Madson, Evan Figueroa. COGS 188.


Presentation Transcript


  1. Enhanced Hierarchical File System Indexer • Matthew Madson, Evan Figueroa • COGS 188

  2. Hypothesis • We claim that using the file system’s metadata (e.g. directory names) as additional information to build enhanced queries, without the user having to do anything, improves the precision of the user’s queries and so returns more relevant search results.
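As an illustration of what "using directory names as additional information" could look like, here is a minimal sketch that derives index terms from the directories on a file's path. The function name `path_terms` and the tokenization rule (split on anything that is not a letter or digit) are assumptions made for this example; the slides do not say how the authors parsed paths.

```python
import re
from pathlib import PurePosixPath


def path_terms(file_path: str) -> list[str]:
    """Return lowercase terms taken from the directory names of file_path.

    Hypothetical helper: the file name itself is ignored, only the
    directories that contain it contribute terms.
    """
    directories = PurePosixPath(file_path).parent.parts
    terms = []
    for name in directories:
        # Split on non-alphanumeric characters, e.g. "java-api" -> "java", "api".
        terms.extend(t.lower() for t in re.split(r"[^A-Za-z0-9]+", name) if t)
    return terms


if __name__ == "__main__":
    print(path_terms("docs/java-api/lucene_core/IndexWriter.html"))
```

Terms produced this way could be stored in a separate index field next to the document contents, which is the role the "Parsed-path" field appears to play in the assessment slides.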

  3. Our Test Corpus • That’s 54,977 documents across ~120 files

  4. Our Test Corpus Cont. • Top 5: Use, Class, Method, Object, File • Keywords • File noise words • Stem • Analyze with Snowball Analyzer • Top 5 (Re-indexed): Class, Method, Object, Public, valu

  5. Technology Used • Lucene • Lucene Snowball Analyzer • Google Collections • PDFRenderer • PDFBox • CHMDeco & Istorage

  6. Assessment • Compare relevance across 4 query formats: • Baseline Query: Contents: non-path terms & path terms • Boosted Path Baseline Query: Contents: path terms; Contents: non-path terms^2.0

  7. Assessment Cont. • Boosted Path Query (Non-Baseline): Parsed-path: query terms; Contents: path terms; Contents: non-path terms^2.0 • Path Query (Non-Baseline): Parsed-path: query terms; Contents: query terms
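The four query formats listed above can be sketched as plain query strings in Lucene's classic query syntax (`field:(terms)` with an optional `^boost`). This is only a sketch under stated assumptions: the field names `contents` and `parsed_path`, the function `build_queries`, and the rule that a query term counts as a "path term" when it also occurs among the directory-name terms are all hypothetical, since the slides do not define them.

```python
def build_queries(query_terms, path_terms):
    """Build the four query strings for one user query.

    Assumption: a query term is a "path term" if it also appears in
    path_terms; every other query term is a "non-path term".
    Terms are inserted as-is, with no escaping of query-syntax characters.
    """
    path_set = set(path_terms)
    in_path = [t for t in query_terms if t in path_set]
    not_in_path = [t for t in query_terms if t not in path_set]
    all_terms = list(query_terms)

    def clause(field, terms, boost=None):
        # An empty term list produces no clause at all.
        if not terms:
            return ""
        body = f"{field}:({' '.join(terms)})"
        return f"{body}^{boost}" if boost is not None else body

    def join(*clauses):
        return " ".join(c for c in clauses if c)

    return {
        "baseline": clause("contents", all_terms),
        "boosted_path_baseline": join(
            clause("contents", in_path),
            clause("contents", not_in_path, 2.0),
        ),
        "boosted_path": join(
            clause("parsed_path", all_terms),
            clause("contents", in_path),
            clause("contents", not_in_path, 2.0),
        ),
        "path": join(
            clause("parsed_path", all_terms),
            clause("contents", all_terms),
        ),
    }


if __name__ == "__main__":
    queries = build_queries(["lucene", "index", "writer"], ["docs", "lucene"])
    for name, text in queries.items():
        print(f"{name}: {text}")
```

Keeping the formats as strings makes it easy to run the same test query through all four and compare the result lists side by side.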

  8. How? • Run all 4 query formats with the same test query. • Take the top 10 results from each format and put them in a bag. • Shuffle the bag and remove duplicates. • Assess (Yes / No) whether each result is relevant to the test query. • Reverse the process: trace each judged result back to the query formats that returned it to see how well each format did.
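The procedure above amounts to a blind, pooled relevance assessment, and can be sketched as follows. The names `build_pool` and `precision_at_k` are hypothetical, and scoring each format by the fraction of its top results judged relevant is an assumption: the slides only say the process is reversed "to see how well the queries did". Duplicates are removed before shuffling here, which yields the same pool as shuffling first.

```python
import random


def build_pool(results_by_format, k=10, seed=None):
    """Merge the top-k results of every format into one shuffled, duplicate-free pool."""
    pool = set()
    for results in results_by_format.values():
        pool.update(results[:k])
    pool = sorted(pool)  # fixed starting order so a seed gives a repeatable shuffle
    random.Random(seed).shuffle(pool)
    return pool


def precision_at_k(results_by_format, judgments, k=10):
    """Map each format to the fraction of its top-k results judged relevant.

    judgments maps a document id to True (relevant) or False (not relevant);
    unjudged documents count as not relevant.
    """
    scores = {}
    for name, results in results_by_format.items():
        top = results[:k]
        relevant = sum(1 for doc in top if judgments.get(doc, False))
        scores[name] = relevant / len(top) if top else 0.0
    return scores


if __name__ == "__main__":
    results = {"baseline": ["a", "b", "c"], "path": ["b", "d"]}
    pool = build_pool(results, k=2, seed=0)
    print(pool)  # the assessor sees documents without knowing which format returned them
    judgments = {"a": True, "b": False, "d": True}
    print(precision_at_k(results, judgments, k=2))
```

Shuffling the merged pool hides which format produced which result, so the Yes / No judgments are not biased toward any one query format.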

  9. Results

  10. Results Cont.

  11. Results Cont.

  12. DEMO!

  13. Questions??
