Explore New Features in Apache Solr 3.1 and 4.0: Enhancements and Innovations

Solr 3.1 and Beyond Yonik Seeley Lucid Imagination yonik@lucidimagination.com October 8, 2010

Agenda Goal : Introduce new features you can try & use now in Solr development versions 3.1 or 4.0 • Relevancy (Extended Dismax Parser) • Spatial/Geo Search • Search Result Grouping / Field Collapsing • Faceting (Pivot, Range, Per-segment) • Scalability (Solr Cloud) • Odds & Ends • Q&A

Solr 3.1? What happened to 1.5? • Lucene/Solr merged (March 2010) • Single set of committers • Single dev mailing list (dev@lucene.apache.org) • Single shared subversion trunk • Keep separate downloads, user mailing lists • Other former lucene subprojects spun off (Nutch, Tika, Mahout, etc) • Development • trunk is now always next major release (currently 4.0) • branch_3x will be base for all 3.x releases • Branch together, Release together, Share version numbers

Relevance

Extended Dismax Parser • Superset of dismax &defType=edismax&q=foo&qf=body • Fixes edge cases where dismax could still throw exceptions OR AND NOT - “ • Full lucene syntax support • Tries lucene syntax first • Smart escaping is done if syntax errors • Optionally supports treating “and”/”or” as AND/OR in lucene syntax • Fielded queries (e.g. myfield:foo) even in degraded mode • uf parameter controls what field names may be directly specified in “q”

Extended Dismax Parser (continued) • boost parameter for multiplicative boost-by-function • Pure negative query clauses Example: solr OR (-solr) • Enhanced term proximity boosting • pf2=myfield – results in term bigrams in sloppy phrase queries myfield:“aa bb cc” -> myfield:“aa bb” myfield:“bb cc” • Enhanced stopword handling • stopwords omitted in main query, but added in optional proximity boosting part Example: q=solr is awesome & qf=myfield & pf2=myfield-> +myfield:(solr awesome) (myfield:”solr is” myfield:”is awesome”) • Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer

Spatial Search

Spatial Search Step1: Index some locations! <field name=“name”>The Alpine Shop</field> <field name=“store”>44.013617,-73.168264</field> Step2: Decide where you are &pt=44.0153371,-73.16734 &d=1 &sfield=store Step3: Profit! Spatial Filter: &fq={!geofilt} Bounding Box: &fq={!bbox} Distance Function: &sort=geodist() asc

Result Grouping /Field Collapsing

Field Collapsing Definition • Field collapsing • Limit the number of results per category • “category” normally defined by unique values in a field • Uses • Web Search – collapse by web site • Email threads – collapse by thread id • Ecommerce/retail • Show the top 5 items for each store category (music, movies, etc)

Field Collapsing by Site

Result Grouping by Category Field Collapse on Product Type

Group by Field "grouped":{ "manu_exact":{ "matches":3, "groups":[{ "groupValue":"Belkin", "doclist":{"numFound":2,"start":0,"docs":[ { "id":"IW-02", "name":"iPod & iPod Mini USB 2.0 Cable"}] }}, { "groupValue":"Apple Computer Inc.", "doclist":{"numFound":1,"start":0,"docs":[ { "id":"MA147LL/A", "name":"Apple 60 GB iPod with Video Playback Black"}] }}]}}} http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact

Group by Query http://...&group=true&group.query=price:[0 TO 99.99]&group.query=price:[100 TO *]&group.limit=5 "grouped":{ "price:[0 TO 99.99]":{ "matches":3, "doclist":{"numFound":2,"start":0,"docs":[ { "id":"IW-02", "name":"iPod & iPod Mini USB 2.0 Cable"}, { "id":"F8V7067-APL-KIT", "name":"Belkin Mobile Power Cord for iPod"}] }}, "price:[100 TO *]":{ "matches":3, "doclist":{"numFound":1,"start":0,"docs":[ { "id":"MA147LL/A", "name":"Apple 60 GB iPod with Video Playback Black"}] }}}}

Grouping Params

Faceting

Pivot Faceting • Other names that could have made sense: • Grid Faceting, Cross-Product Faceting, Matrix Faceting • Syntax: facet.pivot=field1,field2,field3,… facet.pivot=cat,inStock

Pivot Faceting http://...&facet=true&facet.pivot=cat,popularity (continued) { "field":"popularity", "value":"1", "count":2}]}, { "field":"cat", "value":"memory", "count":3, "pivot":[]}, […] "facet_counts":{ "facet_pivot":{ "cat,popularity":[{ "field":"cat", "value":"electronics", "count":14, "pivot":[{ "field":"popularity", "value":"6", "count":5}, { "field":"popularity", "value":"7", "count":4}, 14 docs w/ cat==electronics 5 docs w/ cat==electronics && popularity==6

Range Faceting • Like Date faceting, but more generic http://...&facet=true &facet.range=price &facet.range.start=0 &facet.range.end=500 &facet.range.gap=50 "facet_counts":{ "facet_ranges":{ "price":{ "counts":{ "0.0":5, "50.0":2, "100.0":0, "150.0":2, "200.0":0, "250.0":1, "300.0":2, "350.0":2, "400.0":0, "450.0":1}, "gap":50.0, "start":0.0, "end":500.0}}}}

Existing single-valued faceting algorithm Documents matching the base query “Juggernaut” Lucene FieldCache Entry (StringIndex) for the “hero” field q=Juggernaut &facet=true &facet.field=hero 0 order: for each doc, an index into the lookup array lookup 2 lookup: the string values 7 flash, 5 5 (null) Batman, 3 3 batman accumulator 5 flash 0 1 spiderman 1 4 superman Priority queue 0 increment 5 wolverine 0 2 0 1 2

Per-segment single-valued algorithm Segment1 FieldCache Entry Segment2 FieldCache Entry Segment3 FieldCache Entry Segment4 FieldCache Entry accumulator1 accumulator2 accumulator3 accumulator4 inc lookup 0 0 1 0 3 2 3 1 0 flash, 5 5 1 0 0 Base DocSet Batman, 3 2 0 0 4 7 thread4 thread3 1 thread2 2 FieldCache + accumulator merger (Priority queue) Priority queue thread1

Per-segment faceting • Enable with facet.method=fcs • Controllable multi-threading facet.field={!threads=4}myfield • Disadvantages • Larger memory use (FieldCaches + accumulators) • Slower (extra FieldCache merge step needed) • Advantages • Rebuilds FieldCache entries only for new segments (NRT friendly) • Multi-threaded

Per-segment faceting performance comparison Test index: 10M documents, 18 segments, single valued field Base DocSet=100 docs, facet.field on a field with 100,000 unique terms A Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms B *complete request time, measured externally

Faceting Performance Improvements • For facet.method=enum, speed up initial population of the filterCache (i.e. first time facet): from 30% to 32x improvement • Optimized facet.method=fc for multi-valued fields and large facet.limit – up to 3x faster • Optimized deep facet paging – up to 10x faster with really large facet.offsets • Less memory consumed by field cache entries

Scalability

SolrCloud • First steps toward simplifying cluster management • Integrates Zookeeper • Central configuration (schema.xml, solrconfig.xml, etc) • Tracks live nodes + shards of collections • Removes need for external load balancers shards=localhost:8983/solr|localhost:8900/solr, localhost:7574/solr|localhost:7500/solr • Can specify logical shard ids shards=NY_shard,NJ_shard • Clients don’t need to know shards at all: http://localhost:8983/solr/collection1/select?distrib=true

SolrCloud : The Future • Eliminate all single points of failure • Remove Master/Searcher distinction • Enables near real-time search in a highly scalable environment • High Availability for Writes • Eventual consistency model (like Amazon Dynamo, Cassandra) • Elastic • Simply add/subtract servers, cluster will rebalance automatically • By default, Solr will handle document partitioning

Odds & Ends

Auto-Suggest • Many people currently use terms component • Can be slow for a large corpus • New auto-suggest builds off SpellCheck component • Compact memory based trie for really fast completions • Based on a field in the main index, or on a dictionary file http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult "spellcheck":{ "suggestions":[ "ult",{ "numFound":1, "startOffset":0, "endOffset":3, "suggestion":["ultrasharp"]}, "collation","ultrasharp"]}}

Index with JSON $ URL=http://localhost:8983/solr/update/json $ curl $URL -H 'Content-type:application/json' -d ' { "add": { "doc": { "id" : "978-0641723445", "cat" : ["book","hardcover"], "title" : "The Lightning Thief", "author" : "Rick Riordan", "series_t" : "Percy Jackson and the Olympians", "sequence_i" : 1, "genre_s" : "fantasy", "inStock" : true, "price" : 12.50, "pages_i" : 384 } } }'

Query Results in CSV http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv name,price,cat,popularity iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1 Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1 Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10 • Can handle multi-valued fields (see “cat” field in example) • Completely compatible with the CSV update handler (can round-trip) • Results are streamed – good for dumping entire parts of the index

http://localhost:8983/solr/browse

Q&A

Explore New Features in Apache Solr 3.1 and 4.0: Enhancements and Innovations