Join us for an informative session introducing the exciting features available in Apache Solr development versions 3.1 and 4.0. Discover advancements in relevancy with the Extended Dismax Parser, spatial/geospatial search capabilities, search result grouping and field collapsing, and advanced faceting. Learn about Solr Cloud for improved scalability, as well as other odds and ends. This presentation aims to equip you with practical knowledge and tools for enhancing your Solr implementations. Bring your questions for the Q&A session!
Solr 3.1 and Beyond
Yonik Seeley, Lucid Imagination
yonik@lucidimagination.com
October 8, 2010

Agenda
Goal: Introduce new features you can try and use now in Solr development versions 3.1 or 4.0
• Relevancy (Extended Dismax Parser)
• Spatial/Geo Search
• Search Result Grouping / Field Collapsing
• Faceting (Pivot, Range, Per-segment)
• Scalability (Solr Cloud)
• Odds & Ends
• Q&A

Solr 3.1? What happened to 1.5?
• Lucene/Solr merged (March 2010)
  • Single set of committers
  • Single dev mailing list (dev@lucene.apache.org)
  • Single shared Subversion trunk
  • Keep separate downloads, user mailing lists
  • Other former Lucene subprojects spun off (Nutch, Tika, Mahout, etc.)
• Development
  • trunk is now always the next major release (currently 4.0)
  • branch_3x will be the base for all 3.x releases
  • Branch together, release together, share version numbers

Extended Dismax Parser
• Superset of dismax
    &defType=edismax&q=foo&qf=body
• Fixes edge cases where dismax could still throw exceptions
    OR AND NOT - "
• Full Lucene syntax support
  • Tries Lucene syntax first
  • Smart escaping is done if there are syntax errors
• Optionally supports treating "and"/"or" as AND/OR in Lucene syntax
• Fielded queries (e.g. myfield:foo) even in degraded mode
• uf parameter controls which field names may be directly specified in "q"

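As a rough illustration of the request parameters on this slide, the sketch below assembles an edismax select URL in Python. The endpoint `http://localhost:8983/solr/select` and the helper name `edismax_url` are assumptions made for the example; only `defType`, `q`, `qf` and `pf2` come from the slides.

```python
from urllib.parse import urlencode

# Assumed endpoint of a local Solr instance; adjust for a real deployment.
SOLR_SELECT = "http://localhost:8983/solr/select"


def edismax_url(q, qf, **extra):
    """Build a select URL that routes the query through the edismax parser.

    edismax_url is a hypothetical helper, not part of any Solr client.
    """
    params = {"defType": "edismax", "q": q, "qf": qf}
    params.update(extra)  # optional parameters, e.g. pf2="body"
    return SOLR_SELECT + "?" + urlencode(params)


# A plain query, and one with a pure negative clause plus bigram boosting.
print(edismax_url("foo", "body"))
print(edismax_url("solr OR (-solr)", "body", pf2="body"))
```

`urlencode` takes care of escaping spaces and parentheses, so the raw query text can be passed through unchanged.
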
Extended Dismax Parser (continued)
• boost parameter for multiplicative boost-by-function
• Pure negative query clauses
    Example: solr OR (-solr)
• Enhanced term proximity boosting
  • pf2=myfield: results in term bigrams in sloppy phrase queries
      myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
• Enhanced stopword handling
  • Stopwords omitted in main query, but added in optional proximity boosting part
      Example: q=solr is awesome & qf=myfield & pf2=myfield
        -> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
  • Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer

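The bigram expansion described for pf2 can be pictured with a few lines of Python. This is only a sketch of the idea: it splits on whitespace rather than running a Solr analyzer, and `pf2_phrases` is a hypothetical name.

```python
def pf2_phrases(field, query):
    """Return the two-term phrase clauses that a pf2-style boost would add.

    Assumption: terms are obtained by whitespace splitting, with no analysis.
    """
    terms = query.split()
    # Pair each term with its successor to form the bigrams.
    return ['%s:"%s %s"' % (field, a, b) for a, b in zip(terms, terms[1:])]


print(pf2_phrases("myfield", "aa bb cc"))
print(pf2_phrases("myfield", "solr is awesome"))
```

Note that the stopword "is" stays in the bigrams, which mirrors the slide's point that stopwords are kept in the optional proximity part.
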
Spatial Search
Step 1: Index some locations!
    <field name="name">The Alpine Shop</field>
    <field name="store">44.013617,-73.168264</field>
Step 2: Decide where you are
    &pt=44.0153371,-73.16734
    &d=1
    &sfield=store
Step 3: Profit!
    Spatial Filter:    &fq={!geofilt}
    Bounding Box:      &fq={!bbox}
    Distance Function: &sort=geodist() asc

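To make the three steps concrete, here is a small client-side sketch of what a distance filter conceptually does: keep the documents whose point lies within d of pt. It assumes d is in kilometres, uses the haversine formula with a mean Earth radius of 6371 km, and is not Solr's implementation. The second document is made up for contrast.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def parse_point(text):
    """Parse a "lat,lon" string such as the value of the store field."""
    lat, lon = text.split(",")
    return float(lat), float(lon)


def geofilt(docs, pt, d, sfield):
    """Keep documents whose sfield point lies within d km of pt."""
    center = parse_point(pt)
    return [doc for doc in docs
            if haversine_km(*center, *parse_point(doc[sfield])) <= d]


docs = [
    {"name": "The Alpine Shop", "store": "44.013617,-73.168264"},
    {"name": "Hypothetical far-away shop", "store": "40.7128,-74.0060"},
]
nearby = geofilt(docs, pt="44.0153371,-73.16734", d=1, sfield="store")
print([doc["name"] for doc in nearby])
```
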
Field Collapsing Definition
• Field collapsing
  • Limit the number of results per category
  • "category" normally defined by unique values in a field
• Uses
  • Web search: collapse by web site
  • Email threads: collapse by thread id
  • Ecommerce/retail
    • Show the top 5 items for each store category (music, movies, etc.)

Result Grouping by Category
Field Collapse on Product Type

Group by Field
http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact

"grouped":{
  "manu_exact":{
    "matches":3,
    "groups":[{
        "groupValue":"Belkin",
        "doclist":{"numFound":2,"start":0,"docs":[
            { "id":"IW-02",
              "name":"iPod & iPod Mini USB 2.0 Cable"}]
        }},
      { "groupValue":"Apple Computer Inc.",
        "doclist":{"numFound":1,"start":0,"docs":[
            { "id":"MA147LL/A",
              "name":"Apple 60 GB iPod with Video Playback Black"}]
        }}]}}}

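The shape of the grouped response can be reproduced with a short in-memory sketch: walk the ranked results, bucket them by field value, and keep at most `limit` documents per group. This only illustrates the collapsing idea; it says nothing about how Solr computes groups internally, and the `manu_exact` values for each document are inferred from the response above.

```python
def group_by_field(ranked_docs, field, limit=1):
    """Collapse a ranked result list on a field, keeping `limit` docs per group."""
    groups = {}  # dicts keep insertion order, so groups follow ranking order
    for doc in ranked_docs:
        doclist = groups.setdefault(
            doc.get(field), {"numFound": 0, "start": 0, "docs": []})
        doclist["numFound"] += 1
        if len(doclist["docs"]) < limit:
            doclist["docs"].append(doc)
    return {
        "matches": len(ranked_docs),
        "groups": [{"groupValue": value, "doclist": doclist}
                   for value, doclist in groups.items()],
    }


ranked = [
    {"id": "IW-02", "manu_exact": "Belkin"},
    {"id": "F8V7067-APL-KIT", "manu_exact": "Belkin"},
    {"id": "MA147LL/A", "manu_exact": "Apple Computer Inc."},
]
result = group_by_field(ranked, "manu_exact")
for group in result["groups"]:
    print(group["groupValue"], group["doclist"]["numFound"])
```

With the default limit of 1, the Belkin group reports two matches but lists only its top document, as in the slide.
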
Group by Query
http://...&group=true&group.query=price:[0 TO 99.99]&group.query=price:[100 TO *]&group.limit=5

"grouped":{
  "price:[0 TO 99.99]":{
    "matches":3,
    "doclist":{"numFound":2,"start":0,"docs":[
        { "id":"IW-02",
          "name":"iPod & iPod Mini USB 2.0 Cable"},
        { "id":"F8V7067-APL-KIT",
          "name":"Belkin Mobile Power Cord for iPod"}]
    }},
  "price:[100 TO *]":{
    "matches":3,
    "doclist":{"numFound":1,"start":0,"docs":[
        { "id":"MA147LL/A",
          "name":"Apple 60 GB iPod with Video Playback Black"}]
    }}}}

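Grouping by query can be sketched the same way, with each group defined by a predicate instead of a field value. In the sketch below the predicates are plain Python lambdas standing in for the two price range queries, and the prices are taken from the CSV example later in the deck; the function name is hypothetical.

```python
def group_by_query(ranked_docs, queries, limit=1):
    """Build one doclist per named query; `queries` maps a label to a predicate."""
    grouped = {}
    for label, matches in queries.items():
        hits = [doc for doc in ranked_docs if matches(doc)]
        grouped[label] = {
            "matches": len(ranked_docs),  # size of the overall result set
            "doclist": {"numFound": len(hits), "start": 0, "docs": hits[:limit]},
        }
    return grouped


ranked = [
    {"id": "IW-02", "price": 11.5},
    {"id": "F8V7067-APL-KIT", "price": 19.95},
    {"id": "MA147LL/A", "price": 399.0},
]
queries = {
    "price:[0 TO 99.99]": lambda doc: 0 <= doc["price"] <= 99.99,
    "price:[100 TO *]": lambda doc: doc["price"] >= 100,
}
grouped = group_by_query(ranked, queries, limit=5)
for label, group in grouped.items():
    print(label, group["doclist"]["numFound"])
```
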
Pivot Faceting
• Other names that could have made sense:
  • Grid Faceting, Cross-Product Faceting, Matrix Faceting
• Syntax:
    facet.pivot=field1,field2,field3,…
    facet.pivot=cat,inStock

Pivot Faceting
http://...&facet=true&facet.pivot=cat,popularity

"facet_counts":{
  "facet_pivot":{
    "cat,popularity":[{
        "field":"cat",
        "value":"electronics",
        "count":14,
        "pivot":[{
            "field":"popularity",
            "value":"6",
            "count":5},
          { "field":"popularity",
            "value":"7",
            "count":4},
(continued)
          { "field":"popularity",
            "value":"1",
            "count":2}]},
      { "field":"cat",
        "value":"memory",
        "count":3,
        "pivot":[]},
      […]

• "count":14 at the cat level: 14 docs w/ cat==electronics
• "count":5 at the popularity level: 5 docs w/ cat==electronics && popularity==6

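The nested counts in a pivot response amount to counting by the first field and then recursing into each bucket with the remaining fields. The sketch below does that over an in-memory list; it always emits a "pivot" key, which is a simplification relative to the response shown above, and the sample documents are invented.

```python
def field_values(doc, field):
    """Return a field's values as a list, so multi-valued fields work too."""
    value = doc.get(field)
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def pivot(docs, fields):
    """Count docs per value of fields[0], recursing into the remaining fields."""
    if not fields:
        return []
    field, rest = fields[0], fields[1:]
    buckets = {}
    for doc in docs:
        for value in field_values(doc, field):
            buckets.setdefault(value, []).append(doc)
    # Largest buckets first; ties keep first-seen order because sort is stable.
    ordered = sorted(buckets.items(), key=lambda item: -len(item[1]))
    return [{"field": field, "value": value, "count": len(members),
             "pivot": pivot(members, rest)}
            for value, members in ordered]


docs = [
    {"cat": ["electronics", "memory"], "inStock": True},
    {"cat": ["electronics"], "inStock": False},
    {"cat": ["electronics"], "inStock": True},
]
tree = pivot(docs, ["cat", "inStock"])
for node in tree:
    print(node["value"], node["count"], [(p["value"], p["count"]) for p in node["pivot"]])
```
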
Range Faceting
• Like Date faceting, but more generic
    http://...&facet=true
      &facet.range=price
      &facet.range.start=0
      &facet.range.end=500
      &facet.range.gap=50

"facet_counts":{
  "facet_ranges":{
    "price":{
      "counts":{
        "0.0":5,
        "50.0":2,
        "100.0":0,
        "150.0":2,
        "200.0":0,
        "250.0":1,
        "300.0":2,
        "350.0":2,
        "400.0":0,
        "450.0":1},
      "gap":50.0,
      "start":0.0,
      "end":500.0}}}}

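The bucketing behind a range facet is simple arithmetic: each value falls into the bucket whose lower bound is start plus a whole number of gaps. The sketch below assumes each bucket includes its lower bound and excludes its upper bound, and that values outside [start, end) are ignored; the sample prices are illustrative.

```python
import math


def range_facet(values, start, end, gap):
    """Count values into fixed-width buckets keyed by their lower bound."""
    n = math.ceil((end - start) / gap)
    lowers = [start + i * gap for i in range(n)]
    counts = [0] * n
    for value in values:
        if start <= value < end:
            index = min(int((value - start) // gap), n - 1)
            counts[index] += 1
    return {
        "counts": {str(float(lo)): c for lo, c in zip(lowers, counts)},
        "gap": float(gap),
        "start": float(start),
        "end": float(end),
    }


prices = [11.5, 19.95, 74.99, 399.0, 500.0]
facet = range_facet(prices, start=0, end=500, gap=50)
print(facet["counts"])
```
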
Existing single-valued faceting algorithm
q=Juggernaut&facet=true&facet.field=hero
• Documents matching the base query "Juggernaut"
• Lucene FieldCache Entry (StringIndex) for the "hero" field
  • order: for each doc, an index into the lookup array
  • lookup: the string values ((null), batman, flash, spiderman, superman, wolverine)
• accumulator: incremented at the order index of each matching document
• Priority queue: flash, 5; Batman, 3

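The diagram's moving parts (order array, lookup array, accumulator, priority queue) can be sketched in a few lines. The arrays below are made-up sample data and the function is a simplified illustration of the counting loop, not Solr's code.

```python
import heapq


def facet_counts(order, lookup, base_docs, limit):
    """Count facet values for the docs in base_docs and return the top `limit`.

    order[doc] is an index into lookup; index 0 stands for "no value".
    """
    accumulator = [0] * len(lookup)
    for doc in base_docs:
        accumulator[order[doc]] += 1  # one increment per matching document
    # Skip slot 0 (null) and pick the largest counts, as a priority queue would.
    candidates = [(accumulator[i], lookup[i]) for i in range(1, len(lookup))]
    return [(term, count) for count, term in heapq.nlargest(limit, candidates)]


lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
order = [2, 1, 2, 0, 5, 2, 1, 4]   # one entry per document in the index
base_docs = [0, 1, 2, 5, 6]        # documents matching the base query
top = facet_counts(order, lookup, base_docs, limit=2)
print(top)
```

The loop touches each matching document once and never compares strings, which is why the cost depends on the size of the base DocSet rather than on the number of unique terms.
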
Per-segment single-valued algorithm
• Base DocSet
• Segment1 to Segment4: one FieldCache Entry and one accumulator per segment (accumulator1 to accumulator4)
• thread1 to thread4: each increments its segment's accumulator via that segment's lookup
• FieldCache + accumulator merger (Priority queue)
• Priority queue: flash, 5; Batman, 3

Per-segment faceting
• Enable with facet.method=fcs
• Controllable multi-threading
    facet.field={!threads=4}myfield
• Disadvantages
  • Larger memory use (FieldCaches + accumulators)
  • Slower (extra FieldCache merge step needed)
• Advantages
  • Rebuilds FieldCache entries only for new segments (NRT friendly)
  • Multi-threaded

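A minimal sketch of the per-segment idea: count each segment independently (here on a thread pool), then merge the partial counts by term, since each segment has its own lookup array and ordinals are not comparable across segments. The segment dictionaries and function names are assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_segment(segment):
    """Count facet values inside one segment using its own order/lookup arrays."""
    order, lookup = segment["order"], segment["lookup"]
    accumulator = [0] * len(lookup)
    for doc in segment["base_docs"]:
        accumulator[order[doc]] += 1
    # Translate ordinals back to terms; slot 0 means "no value".
    return {lookup[i]: count for i, count in enumerate(accumulator)
            if i > 0 and count > 0}


def per_segment_facet(segments, limit, threads=4):
    """Count segments in parallel, then merge the per-segment results by term."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(count_segment, segments))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # the extra merge step mentioned on the slide
    return merged.most_common(limit)


segments = [
    {"lookup": [None, "batman", "flash"], "order": [2, 1, 2], "base_docs": [0, 1, 2]},
    {"lookup": [None, "flash", "superman"], "order": [1, 2, 1, 1], "base_docs": [0, 2, 3]},
]
print(per_segment_facet(segments, limit=2))
```

The merge by term string is the extra step listed under disadvantages; the benefit is that adding a segment only requires building that segment's arrays.
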
Per-segment faceting performance comparison
Test index: 10M documents, 18 segments, single-valued field
• A: Base DocSet=100 docs, facet.field on a field with 100,000 unique terms
• B: Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms
* Complete request time, measured externally

Faceting Performance Improvements
• For facet.method=enum, faster initial population of the filterCache (i.e. the first time a field is faceted): improvements ranging from 30% to 32x
• Optimized facet.method=fc for multi-valued fields and large facet.limit: up to 3x faster
• Optimized deep facet paging: up to 10x faster with really large facet.offset values
• Less memory consumed by field cache entries

SolrCloud
• First steps toward simplifying cluster management
• Integrates ZooKeeper
  • Central configuration (schema.xml, solrconfig.xml, etc.)
  • Tracks live nodes + shards of collections
• Removes need for external load balancers
    shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
• Can specify logical shard ids
    shards=NY_shard,NJ_shard
• Clients don't need to know shards at all:
    http://localhost:8983/solr/collection1/select?distrib=true

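The shards value above reads as a comma-separated list of shards, each of which is a "|"-separated list of equivalent servers. A small helper, hypothetical and shown only to make that structure explicit, could assemble it like this:

```python
def shards_param(shards):
    """Join replicas of a shard with "|" and shards with ","."""
    return ",".join("|".join(replicas) for replicas in shards)


cluster = [
    ["localhost:8983/solr", "localhost:8900/solr"],  # shard 1 and its replica
    ["localhost:7574/solr", "localhost:7500/solr"],  # shard 2 and its replica
]
print("shards=" + shards_param(cluster))
print("shards=" + shards_param([["NY_shard"], ["NJ_shard"]]))
```
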
SolrCloud: The Future
• Eliminate all single points of failure
• Remove Master/Searcher distinction
  • Enables near real-time search in a highly scalable environment
• High Availability for Writes
  • Eventual consistency model (like Amazon Dynamo, Cassandra)
• Elastic
  • Simply add/subtract servers, cluster will rebalance automatically
  • By default, Solr will handle document partitioning

Auto-Suggest
• Many people currently use terms component
  • Can be slow for a large corpus
• New auto-suggest builds off SpellCheck component
  • Compact memory-based trie for really fast completions
  • Based on a field in the main index, or on a dictionary file

http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult

"spellcheck":{
  "suggestions":[
    "ult",{
      "numFound":1,
      "startOffset":0,
      "endOffset":3,
      "suggestion":["ultrasharp"]},
    "collation","ultrasharp"]}}

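To show why a trie suits prefix completion, here is a minimal, uncompressed trie in Python. It is a teaching sketch only: the suggester described above uses a compact in-memory structure, whereas this version stores one dictionary per node, and the word list is invented.

```python
class Trie:
    """A plain character trie supporting prefix completion."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def complete(self, prefix):
        """Return all stored words starting with prefix, sorted alphabetically."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results, stack = [], [(node, prefix)]
        while stack:  # depth-first walk below the prefix node
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)


trie = Trie()
for word in ["ultrasharp", "ultra", "solr", "usb"]:
    trie.insert(word)
print(trie.complete("ult"))
```

Lookup cost depends on the prefix length and the number of completions, not on the size of the whole dictionary.
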
Index with JSON
$ URL=http://localhost:8983/solr/update/json
$ curl $URL -H 'Content-type:application/json' -d '
{
  "add": {
    "doc": {
      "id" : "978-0641723445",
      "cat" : ["book","hardcover"],
      "title" : "The Lightning Thief",
      "author" : "Rick Riordan",
      "series_t" : "Percy Jackson and the Olympians",
      "sequence_i" : 1,
      "genre_s" : "fantasy",
      "inStock" : true,
      "price" : 12.50,
      "pages_i" : 384
    }
  }
}'

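The same update can be prepared from Python with only the standard library. The sketch below builds the request object without sending it, so it runs without a Solr server; `build_add_request` is a hypothetical helper, and the commented-out `urlopen` call shows where the request would be sent.

```python
import json
import urllib.request

UPDATE_URL = "http://localhost:8983/solr/update/json"  # same URL as the curl example


def build_add_request(doc, url=UPDATE_URL):
    """Wrap a document in an "add" command and prepare a JSON POST request."""
    body = json.dumps({"add": {"doc": doc}}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-type": "application/json"})


doc = {
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "price": 12.50,
    "inStock": True,
}
request = build_add_request(doc)
print(request.get_method(), request.full_url)

# To send it to a running Solr instance:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```
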
Query Results in CSV
http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv

name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10

• Can handle multi-valued fields (see "cat" field in example)
• Completely compatible with the CSV update handler (can round-trip)
• Results are streamed: good for dumping entire parts of the index

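On the client side, output like this can be read with Python's csv module. The sketch assumes, as in the example above, that a multi-valued field arrives as one quoted cell with its values separated by commas; `parse_solr_csv` is a hypothetical helper.

```python
import csv
import io


def parse_solr_csv(text, multi_valued=()):
    """Parse CSV results into dicts, splitting the named multi-valued fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for field in multi_valued:
            # Assumption: values inside the quoted cell are comma-separated.
            row[field] = row[field].split(",") if row[field] else []
        rows.append(row)
    return rows


sample = (
    'name,price,cat,popularity\n'
    'iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1\n'
    'Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10\n'
)
rows = parse_solr_csv(sample, multi_valued=["cat"])
for row in rows:
    print(row["name"], row["cat"])
```

All other cells stay as strings, so numeric fields such as price need an explicit conversion if they are used in arithmetic.
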