
Introduction to Lucene


Presentation Transcript


  1. Introduction to Lucene Rong Jin

  2. What is Lucene? • Lucene is a high-performance, scalable Information Retrieval (IR) library • Free, open-source project implemented in Java • Originally written by Doug Cutting • Became an Apache Software Foundation project in 2001 • It is the most popular free Java IR library • Lucene has been ported to Perl, Python, Ruby, C/C++, and C# (.NET)

  3. Lucene Users • IBM OmniFind Yahoo! Edition • Technorati • Wikipedia • Internet Archive • LinkedIn • Eclipse • JIRA • Apache Roller • jGuru • More than 200 others

  4. The Lucene Family • Lucene & Apache Lucene & Java Lucene: the IR library • Nutch: Hadoop-loving crawler, indexer, and searcher for web-scale search engines • Solr: search server • Droids: standalone framework for writing crawlers • Lucene.Net: C# port, incubator graduate • Lucy: C implementation of Lucene • PyLucene: Python port • Tika: content analysis toolkit

  5. Indexing Documents • Each document comprises multiple fields • The Analyzer extracts words from the text • The IndexWriter creates the inverted index and writes it to disk • Pipeline: Document (Field, Field, …) → Analyzer (Tokenizer → TokenFilter) → IndexWriter → inverted index (dictionary + postings)

  6.–8. Indexing Documents (code-walkthrough slides shown as screenshots; not captured in this transcript — a sketch of the full pipeline follows)
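A minimal sketch of what those slides walk through, assuming the Lucene 4.x API referenced later in the deck; the index path and field names here are illustrative, not from the slides:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Open (or create) an index directory on disk
Directory dir = FSDirectory.open(new File("/tmp/lucene-index")); // illustrative path
IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_40,
    new StandardAnalyzer(Version.LUCENE_40));
IndexWriter writer = new IndexWriter(dir, cfg);

// A document is a collection of fields
Document doc = new Document();
doc.add(new StringField("id", "doc-1", Field.Store.YES));                // indexed as one token
doc.add(new TextField("contents", "Lucene in action", Field.Store.YES)); // run through the analyzer

// The analyzer tokenizes "contents"; the inverted index is written to dir
writer.addDocument(doc);
writer.close();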

  9. Lucene Classes for Indexing • Directory class • An abstract class representing the location of a Lucene index • FSDirectory stores the index in a directory in the filesystem • RAMDirectory holds all of its data in memory • Useful for smaller indices that can be fully loaded in memory and destroyed when the application terminates

  10. Lucene Classes for Indexing • IndexWriter class • Creates a new index or opens an existing one; adds, removes, or updates documents in the index • Analyzer class • An abstract class for extracting tokens from the text to be indexed • StandardAnalyzer is the most commonly used

  11. Lucene Classes for Indexing • Document class • A document is a collection of fields • Metadata such as author, title, subject, and date modified are indexed and stored as separate fields of a document (see the sketch below)
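A sketch tying slides 9–11 together: swapping RAMDirectory in for the FSDirectory used earlier and adding metadata fields. It assumes the imports from the previous sketch plus org.apache.lucene.store.RAMDirectory; the field names and values are illustrative:

RAMDirectory dir = new RAMDirectory(); // index lives only as long as the JVM
IndexWriter writer = new IndexWriter(dir,
    new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40)));

Document doc = new Document();
doc.add(new StringField("author", "Doug Cutting", Field.Store.YES));        // metadata: single token
doc.add(new StringField("modified", "2009-03-14", Field.Store.YES));
doc.add(new TextField("title", "Introduction to Lucene", Field.Store.YES)); // analyzed text
writer.addDocument(doc);
writer.close();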

  12. Index Segments and Merge • Each index consists of multiple segments • Every segment is a standalone index in itself, holding a subset of all indexed documents • At search time, each segment is visited separately and the results are combined
# ls -lh
total 1.1G
-rw-r--r-- 1 root root 123M 2009-03-14 10:29 _0.fdt
-rw-r--r-- 1 root root  44M 2009-03-14 10:29 _0.fdx
-rw-r--r-- 1 root root   33 2009-03-14 10:31 _9j.fnm
-rw-r--r-- 1 root root 372M 2009-03-14 10:36 _9j.frq
-rw-r--r-- 1 root root  11M 2009-03-14 10:36 _9j.nrm
-rw-r--r-- 1 root root 180M 2009-03-14 10:36 _9j.prx
-rw-r--r-- 1 root root 5.5M 2009-03-14 10:36 _9j.tii
-rw-r--r-- 1 root root 308M 2009-03-14 10:36 _9j.tis
-rw-r--r-- 1 root root   64 2009-03-14 10:36 segments_2
-rw-r--r-- 1 root root   20 2009-03-14 10:36 segments.gen

  13. Index Segments and Merge • Each segment consists of multiple files • _X.<ext>: X is the segment name and <ext> identifies which part of the index that file holds • Separate files hold the different parts of the index (term vectors, stored fields, inverted index, etc.) • The optimize() operation merges all segments into one • It involves a lot of disk I/O and is time-consuming • It significantly improves search efficiency (directory listing repeated from slide 12)
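In the 4.x API this deck references elsewhere, optimize() was renamed: it is forceMerge(int) (optimize() was deprecated in 3.5 and removed in 4.0). A one-line sketch, reusing the writer from the earlier sketches:

writer.forceMerge(1); // merge all segments into one: heavy disk I/O, run offline
writer.close();       // commit the merged index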

  14. Lucene Classes for Reading an Index • IndexReader class • Reads the index from the index files • Terms class • A container for all the terms in a specified field • TermsEnum class • Implements the BytesRefIterator interface, providing access to each term in turn
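A sketch of these three classes together, iterating every term in the dictionary of the (illustrative) "contents" field with the 4.x API; MultiFields merges the per-segment views described on slide 12:

IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("/tmp/lucene-index")));
Terms terms = MultiFields.getTerms(reader, "contents"); // merged view across all segments
TermsEnum termsEnum = terms.iterator(null);
BytesRef term;
while ((term = termsEnum.next()) != null) {             // BytesRefIterator-style iteration
  System.out.println(term.utf8ToString() + "  df=" + termsEnum.docFreq());
}
reader.close();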

  15. Reading Document Vectors
FieldType fieldType = new FieldType();
fieldType.setStoreTermVectors(true);
fieldType.setIndexed(true);
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
fieldType.setStored(true);
doc.add(new Field("contents", contentString, fieldType));
• Enable storing term vectors at indexing time.

  16. Reading Document Vectors
FieldType fieldType = new FieldType();
fieldType.setStoreTermVectors(true);
fieldType.setIndexed(true);
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
fieldType.setStored(true);
doc.add(new Field("contents", contentString, fieldType));
• Enable storing term vectors at indexing time.
• Read the document vector and visit each of its terms:
IndexReader reader = IndexReader.open(FSDirectory.open(new File(indexPath)));
int maxDoc = reader.maxDoc();
for (int i = 0; i < maxDoc; i++) {
  Terms terms = reader.getTermVector(i, "contents");
  TermsEnum termsEnum = terms.iterator(null);
  BytesRef text = null;
  while ((text = termsEnum.next()) != null) {
    String termtext = text.utf8ToString();
    int docfreq = termsEnum.docFreq();
  }
}

  17. Updating Documents in the Index • IndexWriter.addDocument(): adds a document to the existing index • IndexWriter.deleteDocuments(): removes documents matching a given term from the existing index • IndexWriter.updateDocument(): replaces a document in the existing index (an atomic delete-then-add)
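A sketch of the three calls, keying deletes and updates on the illustrative "id" field from the earlier sketches:

writer.addDocument(doc);                                // append a new document
writer.deleteDocuments(new Term("id", "doc-1"));        // delete every document matching the term
writer.updateDocument(new Term("id", "doc-1"), newDoc); // atomic delete-then-add of a replacement
writer.close();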

  18. Other Features of Lucene Indexing • Concurrency • Multiple IndexReaders may be open at once on a single index • But only one IndexWriter can be open on an index at a time • IndexReaders may stay open even while an IndexWriter is making changes to the index; each IndexReader always shows the index as of the point in time it was opened.

  19. Other Features of Lucene Indexing • A file-based lock prevents two writers from working on the same index • If the file write.lock exists in your index directory, a writer currently has the index open; any attempt to create another writer on the same index will hit a LockObtainFailedException.
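What that failure looks like in code; a sketch assuming another writer already holds dir from the earlier sketches:

try {
  IndexWriter second = new IndexWriter(dir,
      new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40)));
} catch (LockObtainFailedException e) {
  // write.lock is held by another IndexWriter; close that writer first
}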

  20.–22. Search Documents (code-walkthrough slides shown as screenshots; not captured in this transcript — see the sketch after slide 23)

  23. Lucene Classes for Searching • IndexSearcher class • Searches the index • TopDocs class • A container of pointers to the top-N ranked results • Records the docID and score for each of the top-N results (the docID can be used to retrieve the stored document)
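A sketch of the search side with the 4.x API, assuming the imports and illustrative index path and field names from the indexing sketches:

IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("/tmp/lucene-index")));
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs top = searcher.search(new TermQuery(new Term("contents", "lucene")), 10); // top 10
for (ScoreDoc sd : top.scoreDocs) {
  Document hit = searcher.doc(sd.doc);   // docID -> stored fields
  System.out.println(sd.score + "  " + hit.get("contents"));
}
reader.close();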

  24. Lucene Classes for Searching • QueryParser class • Parses a text query into a Query object • Needs an analyzer to extract tokens from the query text • Searching for a single term • Term class • Like a Field, it is a pair of name and value • Used with the TermQuery class to create a single-term query
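A QueryParser sketch; it should be built with the same analyzer used at indexing time, and parse() throws ParseException on malformed syntax (the query string here is illustrative):

QueryParser parser = new QueryParser(Version.LUCENE_40, "contents",
    new StandardAnalyzer(Version.LUCENE_40));
Query q = parser.parse("quick +fox -dogs");  // query syntax: +required, -prohibited
TopDocs top = searcher.search(q, 10);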

  25. Similarity Functions in Lucene • Many similarity functions are implemented in Lucene • Okapi BM25 (BM25Similarity) • Language model with Dirichlet smoothing (LMDirichletSimilarity) • Example:
Similarity simfn = new BM25Similarity();
searcher.setSimilarity(simfn); // searcher is an IndexSearcher

  26. Similarity Functions in Lucene • DefaultSimilarity (TF-IDF) is used unless another Similarity is set • Subclassing Similarity allows implementing other similarity functions

  27. Lucene Scoring in DefaultSimilarity • tf - how often a term appears in the document • idf - how rare the term is across the index • coord - number of terms in both the query and the document • lengthNorm - total number of terms in the field • queryNorm - normalization factor that makes scores comparable across queries • boost(index) - boost of the field at index time • boost(query) - boost of the field at query time http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

  28. Lucene Scoring in DefaultSimilarity • tf = sqrt(freq) - how often a term appears in the document • idf = log(numDocs/(docFreq+1)) + 1 - how rare the term is across the index • coord = overlap/maxOverlap - number of terms in both the query and the document • lengthNorm = 1/sqrt(numTerms) - penalizes fields with many terms • queryNorm = 1/sqrt(sumOfSquaredWeights) - makes scores comparable across queries • boost(index) - boost of the field at index time • boost(query) - boost of the field at query time http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
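Per the TFIDFSimilarity documentation linked above, these factors combine into the practical scoring function, where norm(t,d) folds lengthNorm and the index-time boosts into a single value stored at indexing time:

\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)^{2} \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d)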

  29. Customizing Scoring • Subclass DefaultSimilarity and override the method you want to customize • e.g., ignore how common a term is across the index • e.g., increase the weight of terms in the "title" field (see the sketch below)
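A sketch of both customizations with the 4.x API; the subclass name and boost factor are hypothetical:

public class NoIdfSimilarity extends DefaultSimilarity {
  @Override
  public float idf(long docFreq, long numDocs) {
    return 1.0f; // treat every term as equally rare, ignoring corpus-wide frequency
  }
}

searcher.setSimilarity(new NoIdfSimilarity()); // use the same Similarity at index and search time

// Index-time boost for the "title" field (titleField is the Field added to the Document):
titleField.setBoost(2.0f);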

  30. Queries in Lucene • Lucene supports many types of queries • RangeQuery • PrefixQuery • WildcardQuery, BooleanQuery, PhraseQuery, … (a combined sketch follows)
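A few of these combined, sketched with the 4.x mutable query classes (field name and terms illustrative):

BooleanQuery bq = new BooleanQuery();
bq.add(new PrefixQuery(new Term("contents", "luc")), BooleanClause.Occur.MUST);       // matches luc*
bq.add(new WildcardQuery(new Term("contents", "ind*x")), BooleanClause.Occur.SHOULD); // * spans any chars

PhraseQuery phrase = new PhraseQuery();          // exact phrase "information retrieval"
phrase.add(new Term("contents", "information"));
phrase.add(new Term("contents", "retrieval"));
bq.add(phrase, BooleanClause.Occur.SHOULD);

TopDocs hits = searcher.search(bq, 10);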

  31. Analyzers • Basic analyzers • Analyzers for different languages (in analyzers-common) • Chinese, Japanese, Arabic, German, Greek, ….

  32. Analysis in Action
"The quick brown fox jumped over the lazy dogs"
WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
SimpleAnalyzer: [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

  33. Analysis in Action
"XY&Z Corporation - xyz@example.com"
WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
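The outputs above can be reproduced with the standard 4.x TokenStream consumption idiom (reset, incrementToken, end, close); a sketch using the slide 32 sentence:

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
TokenStream ts = analyzer.tokenStream("contents",
    new StringReader("The quick brown fox jumped over the lazy dogs"));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
ts.reset();                                // must be called before incrementToken()
while (ts.incrementToken()) {
  System.out.print("[" + term.toString() + "] ");
}
ts.end();
ts.close();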

  34. Analyzer: Key Structure • Breaks text into a stream of tokens enumerated by the TokenStream class.

  35. Analyzer • Breaks text into a stream of tokens enumerated by the TokenStream class • createComponents() is the only method an Analyzer subclass is required to implement

  36. Analyzer • Breaks text into a stream of tokens enumerated by the TokenStream class • The TokenStream components are reused across calls, saving allocation and garbage collection

  37. TokenStream Class • Two types of TokenStream • Tokenizer: a TokenStream that tokenizes the input from a Reader, i.e., chunks the input into Tokens • TokenFilter: allows you to chain TokenStreams together, i.e., further modify the tokens: removing them, stemming them, and other actions • A chain usually includes 1 Tokenizer and N TokenFilters

  38. TokenStream Class • Example: StopAnalyzer Text → LowerCaseTokenizer → TokenStream → StopFilter → TokenStream

  39. TokenStream Class • Example: StopAnalyzer Text → LowerCaseTokenizer → TokenStream → StopFilter → TokenStream • Equivalent chain: Text → LetterTokenizer → TokenStream → LowerCaseFilter → TokenStream → StopFilter → TokenStream • Order matters!
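A sketch of the second chain as a hand-built Analyzer in the 4.x API (the class name is hypothetical). It shows why the order matters: lowercasing must run before the stop filter, since the stop-word set is lowercase:

public class MyStopAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String field, Reader reader) {
    Tokenizer source = new LetterTokenizer(Version.LUCENE_40, reader);
    TokenStream chain = new LowerCaseFilter(Version.LUCENE_40, source); // lowercase first...
    chain = new StopFilter(Version.LUCENE_40, chain,
        StopAnalyzer.ENGLISH_STOP_WORDS_SET);                           // ...then drop stop words
    return new TokenStreamComponents(source, chain);
  }
}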

  40. Tokenizer / TokenFilter (summary table of built-in Tokenizers and TokenFilters; not captured in this transcript)
