
Lucene



  1. Lucene Brian Nisonger Feb 08, 2006

  2. What is it? • Doug Cutting’s grandmother’s middle name • An open source set of Java classes • Search Engine/Document Classifier/Indexer • http://lucene.sourceforge.net/talks/pisa/ • Developed by Doug Cutting, 1996 • Xerox/Apple/Excite/Nutch • Cutting wrote several papers in IR

  3. What is it? Nuts and Bolts • Modules for IR • Analysis • Tokenization • Where text is broken into the tokens that get indexed • Document • Where the document ID is created • Date of the document is extracted • Title of the document is extracted
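
As a concrete illustration of the Analysis and Document modules, here is a minimal sketch against the modern Lucene API (class names like StandardAnalyzer, TextField, and StringField are from current releases and differ from the 2006-era API; the field names are made up for illustration):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class AnalysisDemo {
    public static void main(String[] args) throws Exception {
        // Analysis: break raw text into tokens (lowercased, punctuation stripped;
        // older versions also removed English stop words by default).
        Analyzer analyzer = new StandardAnalyzer();
        try (TokenStream ts = analyzer.tokenStream("body", "Lucene is an open source indexer.")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }

        // Document: a bag of named fields (ID, date, title).
        Document doc = new Document();
        doc.add(new StringField("id", "doc-001", Field.Store.YES));     // stored verbatim, not tokenized
        doc.add(new StringField("date", "2006-02-08", Field.Store.YES));
        doc.add(new TextField("title", "Lucene", Field.Store.YES));     // analyzed for full-text search
    }
}
```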

  4. Nuts and Bolts-II • Modules, cont'd • Index • Provides access to indexes • Maintains indexes • Query Parser • Where a query string is parsed into a query object • Search • Searches across indexes
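
A sketch of how the Index, Query Parser, and Search modules fit together, again using the modern API; the index path and field name are arbitrary choices for illustration:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexSearchDemo {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index module: build and maintain the index.
        Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-demo"));
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body", "Doug Cutting wrote Lucene", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Query Parser module: turn a query string into a Query object.
        Query query = new QueryParser("body", analyzer).parse("lucene AND cutting");

        // Search module: run the query across the index.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("body") + "  score=" + hit.score);
            }
        }
    }
}
```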

  5. Nuts and Bolts-III • Modules, cont'd • Search Spans • Spans • Matches terms within K words of each other • Example: Find a document that has Rachael Ray and Alton Brown within 100 words of each other and that also contains the term cooking • Store/Util • Stores the indexes and handles other housekeeping
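
The Rachael Ray / Alton Brown example can be expressed with Lucene's span queries. A hedged sketch using SpanNearQuery (the field name "body" is an assumption, and in the newest releases the spans classes live in a different package):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanDemo {
    public static Query rachaelNearAlton() {
        // The phrase "rachael ray": adjacent terms, in order (slop = 0).
        SpanQuery rachael = new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("body", "rachael")),
                new SpanTermQuery(new Term("body", "ray"))
        }, 0, true);
        SpanQuery alton = new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("body", "alton")),
                new SpanTermQuery(new Term("body", "brown"))
        }, 0, true);

        // The two names within 100 positions of each other, in any order.
        SpanQuery near = new SpanNearQuery(new SpanQuery[] { rachael, alton }, 100, false);

        // ...in a document that also contains the term "cooking".
        return new BooleanQuery.Builder()
                .add(near, BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("body", "cooking")), BooleanClause.Occur.MUST)
                .build();
    }
}
```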

  6. Theory • Space Optimizations for Total Ranking • Cutting and Pedersen, RIAO (Computer-Assisted IR) 1997 • http://lucene.sf.net/papers/riao97.ps • Lucene lecture at Pisa • Doug Cutting • Slides from lecture at University of Pisa, 2004 • See previous link

  7. Vector • Terms and documents are represented as vectors • A cosine distance between vectors determines how close terms/documents are • This distance can then be used for WSD/clustering/IR • Example: • Bass, fishing: .6506 • Bass, guitar: .000423 • This tells us the document is about fishing, not about guitars
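
A minimal sketch of cosine similarity over term-frequency vectors in plain Java; the vectors and scores below are toy values, not actual Lucene output:

```java
import java.util.Map;

public class Cosine {
    /** Cosine similarity of two sparse term-frequency vectors: dot(a,b) / (|a| * |b|). */
    public static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;   // only shared terms contribute
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) normB += w * w;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Double> doc     = Map.of("bass", 3.0, "fishing", 2.0, "lake", 1.0);
        Map<String, Double> fishing = Map.of("fishing", 1.0, "bass", 1.0);
        Map<String, Double> guitar  = Map.of("guitar", 1.0, "bass", 1.0);
        System.out.println(similarity(doc, fishing));  // high: document is about fishing
        System.out.println(similarity(doc, guitar));   // lower: little overlap with guitars
    }
}
```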

  8. Vectors-IR • “Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document's position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.” • http://www.perl.com/pub/a/2003/02/19/engine.html • Intro to Comp Ling and its applications to IR • Nisonger 2005 :P

  9. Inverted Index • Term/Doc Id/Weight • Term • “A Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation -- such as stop-word elimination, stemming, filtering, term normalization, or language translation -- has been applied.” • http://www.javaworld.com/javaworld/jw-09-2000/jw-0915-lucene-p2.html
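
To make the Term/Doc Id/Weight triple concrete, here is a toy inverted index in plain Java; this illustrates the idea only and is not Lucene's actual on-disk format:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndex {
    /** One posting: which document a term occurs in, and with what weight. */
    record Posting(int docId, double weight) {}

    private final Map<String, List<Posting>> postings = new HashMap<>();

    /** Index a pre-analyzed document: count term frequencies, then append postings. */
    public void add(int docId, String[] tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) tf.merge(t, 1, Integer::sum);
        tf.forEach((term, freq) ->
            postings.computeIfAbsent(term, k -> new ArrayList<>())
                    .add(new Posting(docId, freq)));   // weight = raw frequency count
    }

    /** Look up all documents containing a term. */
    public List<Posting> lookup(String term) {
        return postings.getOrDefault(term, List.of());
    }
}
```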

  10. Inverted Index, cont'd • Doc ID • A unique “key” that identifies each document • Weight • Binary (term present or absent) • Frequency count • Weighting algorithm (e.g., TF-IDF)
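
One standard choice of weighting algorithm is TF-IDF. The sketch below is the textbook formula; Lucene's actual scoring function is more elaborate and has changed across versions:

```java
public class TfIdf {
    /**
     * Classic TF-IDF: term frequency in the document times the log-scaled
     * inverse document frequency across the collection.
     *
     * @param tf      occurrences of the term in this document
     * @param df      number of documents containing the term
     * @param numDocs total documents in the collection
     */
    public static double weight(int tf, int df, int numDocs) {
        if (tf == 0 || df == 0) return 0;
        double idf = Math.log((double) numDocs / df);
        return tf * idf;
    }

    public static void main(String[] args) {
        // "bass" appears 3 times in a doc, in 10 of 1000 docs overall: rare, so heavily weighted.
        System.out.println(weight(3, 10, 1000));   // 3 * ln(100) ≈ 13.8
        // "the" appears 20 times, but in nearly every doc: weight near zero.
        System.out.println(weight(20, 990, 1000)); // 20 * ln(1.01) ≈ 0.2
    }
}
```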

  11. Index Merge • Basic/Basket/Basketball • The term dictionary stores only the differences between consecutive words (shared prefixes are written once) • Periodically merges indexes • Allows new documents to be added easily
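
The Basic/Basket/Basketball bullet describes prefix sharing (front coding): each term in the sorted dictionary records only the number of leading characters it shares with the previous term, plus the new suffix. A minimal sketch:

```java
public class FrontCoding {
    /** Encode a sorted term list as (sharedPrefixLength, suffix) pairs. */
    public static void encode(String[] sortedTerms) {
        String prev = "";
        for (String term : sortedTerms) {
            int shared = 0;
            int max = Math.min(prev.length(), term.length());
            while (shared < max && prev.charAt(shared) == term.charAt(shared)) shared++;
            System.out.println(shared + "|" + term.substring(shared));
            prev = term;
        }
    }

    public static void main(String[] args) {
        // Prints: 0|basic  3|ket  6|ball -- "basket" reuses "bas", "basketball" reuses "basket".
        encode(new String[] { "basic", "basket", "basketball" });
    }
}
```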

  12. Query • Boolean Search • Only documents containing at least one query term are searched • “Boolean Search Engine” • Parallel Search • Each term in the query is searched in parallel • Partial scores are added to a queue of documents

  13. Query-II • Threshold • If a document's partial score is too low to make the N-best list, the document is ignored even before the search is complete • Example: • Potential new doc [0,0,0,0,0,0,i] • Document ranked 14 [233,202,109,100,i] • The potential new doc is ignored • A small loss of recall greatly increases the speed of the search
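
A simplified sketch of the parallel scoring and threshold pruning described on slides 12 and 13: partial scores per term are accumulated per document, and a document that cannot beat the worst entry in the N-best queue is dropped. This illustrates the idea only, not Lucene's internals:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class ThresholdSearch {
    public static void main(String[] args) {
        int n = 2;  // keep only the 2 best documents

        // Partial scores per query term (docId -> score), as if each term were searched in parallel.
        List<Map<Integer, Double>> termScores = List.of(
            Map.of(1, 2.0, 2, 0.1, 3, 1.5),   // scores for the first query term
            Map.of(1, 1.0, 3, 2.5)            // scores for the second query term
        );

        // Accumulate partial scores into a running total per document.
        Map<Integer, Double> acc = new HashMap<>();
        for (Map<Integer, Double> scores : termScores)
            scores.forEach((doc, s) -> acc.merge(doc, s, Double::sum));

        // N-best kept in a min-queue; the head is the worst of the current best.
        PriorityQueue<Map.Entry<Integer, Double>> best =
            new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<Integer, Double> e : acc.entrySet()) {
            // Threshold: a document that cannot beat the worst of the N-best is dropped.
            if (best.size() == n && e.getValue() <= best.peek().getValue()) continue;
            best.add(e);
            if (best.size() > n) best.poll();   // evict the lowest score
        }
        best.forEach(e -> System.out.println("doc " + e.getKey() + "  score " + e.getValue()));
    }
}
```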

  14. Evaluation of Lucene • Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering • Tellex et al, MIT AI Lab 2003 • Compared Prise to Lucene on question answering tasks • Question & Answer • <Who is the president?> <George W. Bush .76>

  15. Evaluation-II • Prise • An IR system developed by NIST that, according to the paper, uses “modern” search engine techniques • Findings • Found Prise was better than Lucene: its answers to questions were better, and “Boolean” query engines like Lucene's are considered old school

  16. Eval-III • Lucene • Found that although Prise returned more correct answers, Lucene found more documents containing relevant information

  17. Eval-Conclusion • External Knowledge Sources for Question Answering • http://people.csail.mit.edu/gremio/publications/TREC2005.ps. • Katz et al, MIT 2005 • MIT used Lucene, not Prise, in its 2005 TREC submission

  18. Users • Lucene is widely used • TREC • Enterprise document retrieval systems • Part of database/Web engines • Part of Nutch • Used by academics for large projects • MIT AI Lab • KnowItAll project (UW)

  19. Conclusions • Lucene is a good set of classes • Designed to allow customization without having to “reinvent the wheel” • Robust • Fast • Large development community • Widely used in academia and industry

  20. Questions? • Feel free to ask questions, make comments, tell jokes.

  21. That’s ALL Folks!!!!!
