Lucene - PowerPoint PPT Presentation

Presentation Transcript

  1. Lucene Open Source Search Engine

  2. Lucene - Overview • Complete search engine in a Java library • Stand-alone library only, no server • But can be used via Solr • Handles indexing and query • Fully featured – but not 100% complete • Customizable – to an extent • Fully open source • Current version: 3.6.1

  3. Lucene Implementations • LinkedIn • Open-sourced software for integer list compression • Eclipse IDE • For searching documentation • Jira • Twitter • Comcast • XfinityTV.com, some set-top boxes • Care.com • MusicBrainz • Apple, Disney • BobDylan.com

  4. Indexing Lucene

  5. Lucene - Indexing • Directory = A reference to an index • RAMDirectory, SimpleFSDirectory • IndexWriter = Writes to the index, options: • Limited or unlimited field lengths • Auto commit • Analyzer (how to do text processing, more on this later) • Deletion policy (only for deleting old temporary data) • Document – Holds fields to index • Field – A name/value pair + index/store flags

  6. Lucene – Indexer Outline
  SimpleFSDirectory fsDir = new SimpleFSDirectory(File)
  IndexWriter iWriter = new IndexWriter(fsDir, …)
  Loop: fetch text for each document {
    Document doc = new Document();
    doc.add(new Field(…));  // for each field
    iWriter.addDocument(doc);
  }
  iWriter.commit();
  iWriter.close();
  fsDir.close();
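As a concrete version of the outline above, here is a minimal sketch against the Lucene 3.6 API. It is an illustration, not the exercise solution: the RAMDirectory (used instead of SimpleFSDirectory so no files are needed), the field name "title", and the sample text are all choices made for this example.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class IndexerSketch {
    // Builds a small in-memory index and returns the number of documents written
    public static int indexAndCount() throws IOException {
        // RAMDirectory stands in for SimpleFSDirectory so the sketch needs no files on disk
        Directory dir = new RAMDirectory();
        IndexWriterConfig cfg =
            new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter iWriter = new IndexWriter(dir, cfg);

        Document doc = new Document();
        // A Field is a name/value pair plus store and index flags
        doc.add(new Field("title", "Star Wars", Field.Store.YES, Field.Index.ANALYZED));
        iWriter.addDocument(doc);

        iWriter.commit();
        int numDocs = iWriter.numDocs();
        iWriter.close();
        dir.close();
        return numDocs;
    }
}
```

For an on-disk index, the RAMDirectory line would become `new SimpleFSDirectory(new File(path))`.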

  7. Class Materials • SharePoint link • use “search\[flast]” username • sharepoint.searchtechnologies.com • Annual Kickoff • Shared Documents • FY2013 Presentations • Introduction to Lucene • lucene-training-src-FY2013.zip

  8. Lucene – Index – Exercise 1 • Create a new Maven project • mvn archetype:generate -DgroupId=com.searchtechnologies -DartifactId=lucene-training -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false • Right click pom.xml, Maven.. Add Dependency • lucene-core in search box • Choose 3.6.1 • Expand Maven Dependencies.. Right click lucene-core.. Maven download sources • Source code level = 1.6 • Copy source file: LuceneIndexExercise.java • Into the com.searchtechnologies package • Copy the data directory to your project • Follow the instructions in the file

  9. Query Lucene

  10. Lucene - Query • Directory = An index reference • IndexReader = Reads the index, typically associated with reading document fields • readOnly • IndexSearcher = Searches the Index • QueryParser – Parses a string to a Query • QueryParser = Standard Lucene Parser • Constructor: Version, default field, analyzer • Query – Query expression to execute • Returned by qParser.parse(String) • Search Tech’s QPL can generate Query objects

  11. Lucene – Query part 2 • Executing a search • TopDocs td = iSearcher.search(<query-object>, <num-docs>) • TopDocs – Holds statistics on the search plus the top N documents • totalHits, scoreDocs[], maxScore • ScoreDoc – Information on a single document • Doc ID and score • Use IndexReader to fetch any Document from a doc ID • (includes all fields for the document)

  12. Lucene – Search Outline
  SimpleFSDirectory fsDir = new SimpleFSDirectory(File f)
  IndexReader iReader = IndexReader.open(fsDir, …)  // IndexReader is abstract: use open()
  IndexSearcher iSearcher = new IndexSearcher(iReader)
  StandardAnalyzer sa = new StandardAnalyzer(…)
  QueryParser qParser = new QueryParser(…)
  Loop: fetch a query from the user {
    Query q = qParser.parse( <query string> )
    TopDocs tds = iSearcher.search(q, 10);
    Loop: for every document i in tds.scoreDocs {
      Document doc = iReader.document(tds.scoreDocs[i].doc);
      Print: tds.scoreDocs[i].score, doc.get("field")
    }
  }
  // Close the StandardAnalyzer, iSearcher, and iReader
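The outline above can be fleshed out into a runnable sketch. This version indexes one document first so there is something to search; the in-memory directory, the "title" field, and the sample text are assumptions made for the example.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class SearchSketch {
    // Indexes one document, runs the given query string, prints each hit, returns totalHits
    public static int search(String queryString) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter w = new IndexWriter(dir,
            new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));
        Document doc = new Document();
        doc.add(new Field("title", "star wars", Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
        w.close();

        IndexReader iReader = IndexReader.open(dir);       // IndexReader is abstract: open() it
        IndexSearcher iSearcher = new IndexSearcher(iReader);
        QueryParser qParser = new QueryParser(Version.LUCENE_36, "title",
            new StandardAnalyzer(Version.LUCENE_36));
        Query q = qParser.parse(queryString);
        TopDocs tds = iSearcher.search(q, 10);
        for (ScoreDoc sd : tds.scoreDocs) {
            Document d = iReader.document(sd.doc);         // fetch stored fields by doc ID
            System.out.println(sd.score + "  " + d.get("title"));
        }
        int hits = tds.totalHits;
        iSearcher.close();
        iReader.close();
        dir.close();
        return hits;
    }
}
```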

  13. Lucene – Query – Exercise 2 • Open Source File: LuceneQueryExercise.java • Follow instructions in the file

  14. Relevancy Tuning Lucene

  15. Lucene Extras – Fun Things You Can Do • iWriter.updateDocument(Term, Document) • Deletes any document containing the “Term”, then adds the new document • “Term” in this case is a field/value pair • Such as “id” = “3768169” • doc.setBoost( <float boost value> ) • Multiplies term weights in the doc by the boost value • Part of “fieldNorm” when you do an “explain” • field.setBoost( <float boost value> ) • Multiplies term weights in the field by the boost value
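The update-by-term idea above can be sketched as follows. The "id" field acting as a unique key, the NOT_ANALYZED indexing of it, and the sample values are assumptions for the example, not part of the slide.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class UpdateSketch {
    // Adds a document, then replaces it via updateDocument(); returns the final doc count
    public static int addThenUpdate() throws IOException {
        Directory dir = new RAMDirectory();
        IndexWriter iWriter = new IndexWriter(dir,
            new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));

        Document doc = new Document();
        // "id" is a hypothetical unique-key field; NOT_ANALYZED keeps the value as one term
        doc.add(new Field("id", "3768169", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("title", "original title", Field.Store.YES, Field.Index.ANALYZED));
        iWriter.addDocument(doc);

        Document newDoc = new Document();
        newDoc.add(new Field("id", "3768169", Field.Store.YES, Field.Index.NOT_ANALYZED));
        newDoc.add(new Field("title", "updated title", Field.Store.YES, Field.Index.ANALYZED));
        newDoc.setBoost(2.0f);  // multiplies term weights in the replacement doc
        // Deletes every document containing the term, then adds the new document
        iWriter.updateDocument(new Term("id", "3768169"), newDoc);

        iWriter.commit();                 // apply the buffered delete before counting
        int numDocs = iWriter.numDocs();  // still one live document: the old version was deleted
        iWriter.close();
        dir.close();
        return numDocs;
    }
}
```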

  16. Explain - Example
  iSearcher.explain(query, doc-number)
  Query: star OR catch^0.6 for document 903
  1.2778056 = (MATCH) product of:
    2.5556111 = (MATCH) sum of:
      2.5556111 = (MATCH) weight(title:catch^0.6 in 903), product of:
        0.56637216 = queryWeight(title:catch^0.6), product of:
          0.6 = boost
          7.2195954 = idf(docFreq=1, maxDocs=1005)
          0.13074881 = queryNorm
        4.512247 = (MATCH) fieldWeight(title:catch in 903), product of:
          1.0 = tf(termFreq(title:catch)=1)
          7.2195954 = idf(docFreq=1, maxDocs=1005)
          0.625 = fieldNorm(field=title, doc=903)
    0.5 = coord(1/2)
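A minimal sketch of calling explain() on a hit, assuming a tiny in-memory index (the "title" field and sample text are placeholders; the printed breakdown will have the same tree shape as the slide, with different numbers):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class ExplainSketch {
    // Explains the score of the first hit; returns whether the explanation is a match
    public static boolean explainFirstHit() throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter w = new IndexWriter(dir,
            new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));
        Document doc = new Document();
        doc.add(new Field("title", "catch me if you can", Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
        w.close();

        IndexReader reader = IndexReader.open(dir);
        IndexSearcher iSearcher = new IndexSearcher(reader);
        Query q = new QueryParser(Version.LUCENE_36, "title",
            new StandardAnalyzer(Version.LUCENE_36)).parse("star OR catch^0.6");
        TopDocs tds = iSearcher.search(q, 10);
        Explanation exp = iSearcher.explain(q, tds.scoreDocs[0].doc);
        System.out.println(exp.toString());   // prints the score breakdown tree
        boolean match = exp.isMatch();
        iSearcher.close();
        reader.close();
        dir.close();
        return match;
    }
}
```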

  17. Lucene – Query – Exercise 3 • Add explain to your query program: Explanation exp = iSearcher.explain( . . . ) • Call it for all documents produced by your search • Simply use toString() on the result of explain() to display the results

  18. Boosting – Other Issues • Similarity Class Javadoc Documentation • Very useful discussion of boosting formulas • Similarity.encodeNormValue() – 8-bit floating point! 0.00 => 0 0.10 => 6E 0.20 => 72 0.30 => 74 0.40 => 76 0.50 => 78 0.60 => 78 0.70 => 79 0.80 => 7A 0.90 => 7B 1.00 => 7C 1.10 => 7C 1.20 => 7C 1.30 => 7D 1.40 => 7D 1.50 => 7E 1.60 => 7E 1.70 => 7E 1.80 => 7F 1.90 => 7F 2.00 => 80 2.10 => 80 2.20 => 80 2.30 => 80 2.40 => 80 2.50 => 80 2.60 => 81 2.70 => 81 2.80 => 81 2.90 => 81 3.00 => 81 3.10 => 82 3.20 => 82 3.30 => 82 3.40 => 82 3.50 => 82 3.60 => 83 3.70 => 83 3.80 => 83 3.90 => 83 4.00 => 83 4.10 => 84 4.20 => 84 4.30 => 84 4.40 => 84 4.50 => 84 4.60 => 84 4.70 => 84 4.80 => 84 4.90 => 84 5.00 => 84

  19. Lucene Query Objects • Query objects are used to execute the search • Flow: Query String → Query Parser → Query objects → iSearcher.search() → Top Docs • All are derived from the Lucene Query class

  20. Lucene Query Objects - Example
  (george AND washington) OR (thomas AND jefferson)
  BooleanQuery (clauses = SHOULD)
    BooleanQuery (clauses = MUST)
      TermQuery george
      TermQuery washington
    BooleanQuery (clauses = MUST)
      TermQuery thomas
      TermQuery jefferson

  21. Lucene BooleanQuery
  Example queries: george +washington -martha / jefferson -thomas +sally
  WORKS LIKE AND:
  BooleanQuery bq = new BooleanQuery();
  bq.add( X , Occur.MUST);
  bq.add( Y , Occur.MUST);
  WORKS LIKE OR:
  BooleanQuery bq = new BooleanQuery();
  bq.add( X , Occur.SHOULD);
  bq.add( Y , Occur.SHOULD);
  WORKS LIKE: X AND (X OR Y)
  BooleanQuery bq = new BooleanQuery();
  bq.add( X , Occur.MUST);
  bq.add( Y , Occur.SHOULD);
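The MUST/SHOULD patterns above can be verified against a tiny index. This sketch indexes three hypothetical titles and checks that the MUST combination behaves like AND (one hit) and the SHOULD combination like OR (three hits):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class BooleanQuerySketch {
    // Indexes three titles, then runs a MUST (AND-like) and a SHOULD (OR-like) query;
    // returns {AND hit count, OR hit count}
    public static int[] andOrCounts() throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter w = new IndexWriter(dir,
            new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));
        for (String title : new String[] {"star wars", "star trek", "blazing saddles"}) {
            Document d = new Document();
            d.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
            w.addDocument(d);
        }
        w.close();

        IndexReader reader = IndexReader.open(dir);
        IndexSearcher s = new IndexSearcher(reader);

        BooleanQuery and = new BooleanQuery();              // works like: star AND wars
        and.add(new TermQuery(new Term("title", "star")), Occur.MUST);
        and.add(new TermQuery(new Term("title", "wars")), Occur.MUST);

        BooleanQuery or = new BooleanQuery();               // works like: star OR saddles
        or.add(new TermQuery(new Term("title", "star")), Occur.SHOULD);
        or.add(new TermQuery(new Term("title", "saddles")), Occur.SHOULD);

        int[] counts = {s.search(and, 10).totalHits, s.search(or, 10).totalHits};
        s.close();
        reader.close();
        dir.close();
        return counts;
    }
}
```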

  22. Lucene – Query – Exercise 4 • Create BooleanQuery and TermQuery objects as necessary to create a query without the query parser • Goal: (star AND wars) OR (blazing AND saddles) • TermQuery: tq = new TermQuery(new Term("field", "token")) • BooleanQuery: BooleanQuery bq = new BooleanQuery(); bq.add( <nested query object> , Occur.MUST); bq.add( <nested query object> , Occur.MUST); • Occur • Occur.MUST, Occur.SHOULD, Occur.MUST_NOT • TermQuery and BooleanQuery derive from Query • Any “Query” objects can be passed to iSearcher.search()

  23. Lucene Proximity Queries • “Spanning” queries return matching “spans” • Word positions mark word boundaries (0–10):
  DOCUMENT: Four(0) score(1) and(2) seven(3) years(4) ago,(5) our(6) forefathers(7) brought(8) forth(9)…
  Query → Returns:
  four before/5 seven → 0:4
  (four before/5 seven) before forefathers → 0:8
  brought near/3 ago → 5:9
  (four adj score) or (brought adj forth) → 0:2, 8:10

  24. Proximity Queries : Available Operators • (standard) SpanTermQuery • For terms inside spanning queries • (standard) SpanNearQuery • In-order flag handles both near and before • (standard) SpanOrQuery • (standard) SpanMultiTermQueryWrapper • fka SpanRegexQuery • (Search Tech) SpanAndQuery • (Search Tech) SpanBetweenQuery • between(start, end, positive-content, not-content)
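The standard span operators above can be sketched with SpanNearQuery. This example approximates the slide's "four before/5 seven" using the in-order flag; the field name, analyzer choice, and sample document are assumptions for the example:

```java
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class SpanSketch {
    // Approximates "four before/5 seven" with SpanNearQuery(slop=5, inOrder=true)
    public static int beforeCount() throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter w = new IndexWriter(dir,
            new IndexWriterConfig(Version.LUCENE_36, new WhitespaceAnalyzer(Version.LUCENE_36)));
        Document d = new Document();
        d.add(new Field("text", "four score and seven years ago",
                        Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(d);
        w.close();

        IndexReader reader = IndexReader.open(dir);
        IndexSearcher s = new IndexSearcher(reader);
        SpanQuery[] clauses = {
            new SpanTermQuery(new Term("text", "four")),
            new SpanTermQuery(new Term("text", "seven"))
        };
        // inOrder=true makes this a "before"; inOrder=false would behave like "near"
        int hits = s.search(new SpanNearQuery(clauses, 5, true), 10).totalHits;
        s.close();
        reader.close();
        dir.close();
        return hits;
    }
}
```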

  25. Span Queries • demo of LuceneSpanDemo.java

  26. Analysis Lucene

  27. Analyzers • “Analysis” = “Text Processing” in Lucene • Includes: • Tokenization • Since 1955, the B-52… → Since, 1955, the, B, 52 • Token filtering • Splitting, joining, replacing, filtering, etc. • Since, 1955, the, B, 52 → 1955, B, 52 • George, Lincoln → george, lincoln • MasteringBiology → Mastering, Biology • B-52 → B52, B-52, B, 52 • Stemming • tables → table • carried → carry

  28. Analyzer, Tokenizer, TokenFilter • Tokenizer: Text → TokenStream • TokenFilter: TokenStream → TokenStream • Analyzer: A complete text processing function (one tokenizer + multiple token filters) • Manufactures TokenStreams • Pipeline: string → Tokenizer → TokenFilter → … → TokenFilter
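A minimal custom Analyzer in the Lucene 3.6 style: one tokenizer plus one filter, with a helper that collects the produced terms. The class name, field name "f", and the choice of whitespace tokenization plus lowercasing are assumptions made for the illustration:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class MyAnalyzer extends Analyzer {
    // One tokenizer plus one token filter: the minimal custom analyzer
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_36, reader);
        ts = new LowerCaseFilter(Version.LUCENE_36, ts);
        return ts;
    }

    // Helper: run the analyzer over a string and collect the resulting terms
    public static List<String> tokenize(String text) throws IOException {
        TokenStream ts = new MyAnalyzer().tokenStream("f", new StringReader(text));
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        List<String> terms = new ArrayList<String>();
        while (ts.incrementToken()) {
            terms.add(termAtt.toString());
        }
        ts.close();
        return terms;
    }
}
```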

  29. Existing Analyzers, Tokenizers, Filters • Tokenizer • (Standard) CharTokenizer, WhitespaceTokenizer, KeywordTokenizer, ChineseTokenizer, CJKTokenizer, StandardTokenizer, WikipediaTokenizer (more) • (Search Tech) UscodeTokenizer (produces each HTML <tag> as a separate token) • TokenFilter • Stemmers: (Standard) many language-specific stemmers, PorterStemFilter, SnowballFilter • Stemmers: (Search Tech) Lemmatizer

  30. Existing Analyzers, Tokenizers, Filters • TokenFilters (continued) • LengthFilter, LowerCaseFilter, StopFilter, SynonymTokenFilter (don’t use), WordDelimiterFilter (SOLR only) • Analyzers • WhitespaceAnalyzer, StandardAnalyzer, various language analyzers, PatternAnalyzer Analyzers almost always need to be customized.

  31. Creating and Using TokenStream
  TokenStream tokenStream = new SomeTokenizer(…);
  tokenStream = new SomeTokenFilter1(tokenStream);
  tokenStream = new SomeTokenFilter2(tokenStream);
  CharTermAttribute charTermAtt = tokenStream.getAttribute(CharTermAttribute.class);
  OffsetAttribute offsetAtt = tokenStream.getAttribute(OffsetAttribute.class);
  while (tokenStream.incrementToken()) {
    charTermAtt → now contains info on the token’s term
    offsetAtt.startOffset() → now contains the token’s start offset
  }

  32. Token Streams - How They Work • The consumer calls incrementToken() on the last TokenFilter in the chain • Each TokenFilter calls incrementToken() on its input, modifies the shared attribute objects, and returns • At the head of the chain, the Tokenizer gets the next token from the Reader and stores it in the attribute objects

  33. Creating and Using TokenStream DEMO

  34. Replacement Pattern • Token filters using this pattern simply modify attributes that pass through incrementToken(): call input.incrementToken(), modify the attribute objects, and return

  35. Token Filter – Replacement Pattern
  public final class LowerCaseFilter extends TokenFilter {
    private CharTermAttribute termAtt;
    public LowerCaseFilter(TokenStream input) {
      super(input);
      termAtt = (CharTermAttribute) addAttribute(CharTermAttribute.class);
    }
    public final boolean incrementToken() throws IOException {
      if (input.incrementToken()) {
        final char[] buffer = termAtt.buffer();
        final int length = termAtt.length();
        for (int i = 0; i < length; i++)
          buffer[i] = Character.toLowerCase(buffer[i]);
        return true;
      } else
        return false;
    }
  }

  36. Deletion Pattern • Token filters using this pattern check token attributes and may call input.incrementToken() multiple times: keep looping until a good token is found, then return it

  37. Token Filter – Deletion Pattern
  public final class TokenLengthLessThan50CharsFilter extends TokenFilter {
    private CharTermAttribute termAtt;
    private PositionIncrementAttribute posIncrAtt;
    public TokenLengthLessThan50CharsFilter(TokenStream in) {
      super(in);
      termAtt = (CharTermAttribute) addAttribute(CharTermAttribute.class);
      posIncrAtt = (PositionIncrementAttribute) addAttribute(PositionIncrementAttribute.class);
    }
    public final boolean incrementToken() throws IOException {
      int skippedPositions = 0;
      while (input.incrementToken()) {
        final int length = termAtt.length();
        if (length > 50) {
          skippedPositions += posIncrAtt.getPositionIncrement();
          continue;
        }
        posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skippedPositions);
        return true;
      }
      return false;
    }
  }

  38. Splitting Tokens Pattern – First Call • When splitting a token, save the splits aside for later: split the token, return the first half, and save the second half in the TokenFilter

  39. Splitting Tokens Pattern – Second Call • When called the second time, just return the saved token

  40. Token Filter – Splitting Pattern
  public final class SplitDashFilter extends TokenFilter {
    private CharTermAttribute termAtt;
    char[] saveToken = new char[100]; // Buffer to hold tokens from previous incrementToken() call
    int saveLen = 0;
    public SplitDashFilter(TokenStream in) {
      super(in);
      termAtt = (CharTermAttribute) addAttribute(CharTermAttribute.class);
    }
    public final boolean incrementToken() throws IOException {
      if (saveLen > 0) { // Output previously saved token
        termAtt.setEmpty();
        termAtt.append(new String(saveToken, 0, saveLen));
        saveLen = 0;
        return true;
      }
      if (input.incrementToken()) { // Get a new token to split
        final char[] buffer = termAtt.buffer();
        final int length = termAtt.length();
        boolean foundDash = false;
        for (int i = 0; i < length; i++) { // Scan token looking for '-' to split it
          if (buffer[i] == '-') {
            foundDash = true;
            termAtt.setLength(i); // Set length so termAtt = first half now
          } else if (foundDash)
            saveToken[saveLen++] = buffer[i]; // Save second half for later
        }
        return true; // Output first half right away
      } else
        return false;
    }
  }

  41. Token Splitting DEMO

  42. Stemmers and Lemmatizers • Stemmers available in Lucene • Snowball, Porter • They are both terrible [much too aggressive] • For example: mining → min • Kstem • Publicly available stemmer with a Lucene TokenFilter implementation • Better, but still too aggressive: • searchmining → searchmine • Search Technologies Lemmatizer • Based on the GCIDE dictionary • Extremely accurate, only reduces words to dictionary entries • Also does irregular spelling reduction: mice → mouse • STILL A WORK IN PROGRESS: Needs one more refactor

  43. ST Query Processing Lucene

  44. Search Technologies Query Parser • Originally written for GPO • Query → FAST FQL • Converted to .NET for CPA • Refactored for Lucene for Aspermont • Refactored to be more componentized and pipeline-oriented for OLRC • Still a work in progress • Lacks documentation, wiki, etc.

  45. Search Technologies Query Processing • Query Parser • Parses the user’s entered query • Query Processing Pipeline • A sequence of query processing components which can be mixed and matched • Lucene Query Builder • Other Query Builders Possible • FAST, Google, etc. • No others implemented yet • Query Configuration File • Holds query parsing and processing parameters

  46. Our Own Query Processing: Why? • Gives us more control • Can exactly meet the user’s query syntax • Exposes operators not available through Lucene syntax • Example: the before proximity operator • “Behind the scenes” query tweaking • Field weighting • Token merging: rio tinto → url:riotinto • Exact case and exact suffix matching • True lemmatization (not just stemming)

  47. ST Query Parser – Overall Structure • Pipeline: Query String → Parser → Processor → … → Processor → Lucene Builder → Top Docs • The parser and processors work on generic “AQNode” structures; the builder produces Lucene Query structures

  48. The Search Technologies Query Structure • Holds references to all query representations • userQuery (the query string), nodeQuery (generic AQNode structures), finalQuery (Lucene Query structures) • Therefore, query processors can process any query representation • Everything is a QueryProcessor • Parsing, processing, and query building

  49. Query Parser: Features • AND, OR, NOT, parentheses • ((star and wars) or (star and trek)) • star and not born {broken} • + and - prefixes: + = query boost, - = not {broken} • Proximity operators • within/3, near/3, adj • Phrases • field: searches • title:(star and wars) and description:(the original)

  50. Using Query Processors • Load the query configuration QueryConfig qConfig = new QueryConfig("data/query-config.xml"); • Create a query processor IQueryProcessor iqp2 = new TokenizationQueryProcessor(); • Initialize the query processor iqp2.initialize(qConfig); • Use query processors (simply call them in sequence) iqp1.process(query); iqp2.process(query); iqp3.process(query);