
Introduction to Open Source Search with Apache Lucene and Solr

Introduction to Open Source Search with Apache Lucene and Solr. Grant Ingersoll. The How Many Game. How many of you: Have taken a class in Information Retrieval (IR)? Are doing work/research in IR? Have heard of or are using Lucene? Have heard of or are using Solr?


Presentation Transcript


  1. Introduction to Open Source Search with Apache Lucene and Solr Grant Ingersoll

  2. The How Many Game • How many of you: • Have taken a class in Information Retrieval (IR)? • Are doing work/research in IR? • Have heard of or are using Lucene? • Have heard of or are using Solr? • Are doing work on core IR algorithms such as compression techniques or scoring? • Are doing UI/Application work/research as it relates to search?

  3. Topics • Brief Bio • Search 101 (skip?) • What is: • Apache Lucene • Apache Solr • What can they do? • Features and functionality • Intangibles • What’s new in Lucene and Solr? • How can they help my research/work/____?

  4. Brief Bio • Apache Lucene/Solr Committer • Apache Mahout co-founder • Scalable Machine Learning • Co-founder of Lucid Imagination • http://www.lucidimagination.com • Previously worked at the Center for Natural Language Processing at Syracuse University with Dr. Liddy • Co-Author of upcoming “Taming Text” (Manning Publications) • http://www.manning.com/ingersoll

  5. Search 101 • Search tools are designed for dealing with fuzzy data/questions • Work well with structured and unstructured data • Perform well when dealing with large volumes of data • Many apps don’t need the limits that databases place on content • Search fits well alongside a DB too • Given a user’s information need (query), find and, optionally, score content relevant to that need • Many different ways to solve this problem, each with tradeoffs • What does “relevant” mean?

  6. Search 101 • Indexing • Finds and maps terms and documents • Conceptually similar to a book index • At the heart of fast search/retrieve • Relevance • Vector Space Model (VSM) for relevance • Common across many search engines • Apache Lucene is a highly optimized implementation of the VSM
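
The two ideas on this slide, a term-to-document index and vector space scoring, can be sketched in a few lines. This is a toy illustration only, not Lucene's implementation: the function names (`tokenize`, `build_index`, `search`), the whitespace tokenizer, and the particular TF-IDF weighting are all assumptions chosen for brevity.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real analysis."""
    return text.lower().split()


def build_index(docs):
    """Map each term to {doc_id: term frequency}, much like a book index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def idf(index, term, n_docs):
    """Weight rare terms more heavily than common ones (one of many variants)."""
    return math.log(1 + n_docs / len(index[term]))


def search(index, docs, query):
    """Rank matching documents by cosine similarity of TF-IDF vectors."""
    n_docs = len(docs)
    q_weights = {
        term: tf * idf(index, term, n_docs)
        for term, tf in Counter(tokenize(query)).items()
        if term in index
    }
    # Dot product: only postings of the query terms are visited.
    dots = defaultdict(float)
    for term, q_w in q_weights.items():
        for doc_id, tf in index[term].items():
            dots[doc_id] += q_w * tf * idf(index, term, n_docs)
    # Squared vector length of every document.
    norms = defaultdict(float)
    for term, postings in index.items():
        weight = idf(index, term, n_docs)
        for doc_id, tf in postings.items():
            norms[doc_id] += (tf * weight) ** 2
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    scored = [(dots[d] / (math.sqrt(norms[d]) * q_norm), d) for d in dots]
    # Highest score first; ties broken by document id.
    return [d for _, d in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]
```

Note that the query only touches the postings lists of its own terms, which is why an index makes retrieval fast. A production engine would also precompute the document norms at index time rather than on every search.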

  7. Apache Lucene in a Nutshell • http://lucene.apache.org/java • Java based Application Programming Interface (API) for adding search and indexing functionality to applications • Fast and efficient scoring and indexing algorithms • Lots of contributions to make common tasks easier: • Highlighting, spatial, Query Parsers, Benchmarking tools, etc. • Most widely deployed search library on the planet

  8. Lucene Basics • Content is modeled via Documents and Fields • Content can be text, integers, floats, dates, custom • Analysis can be employed to alter content before indexing • Searches are supported through a wide range of Query options • Keyword • Terms • Phrases • Wildcards • Many, many more
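
As a rough illustration of these basics, the sketch below models a document as a mapping of field names to values, runs a text field through a small analysis chain, and checks three of the query styles listed. It deliberately avoids the Lucene API; the stop word list, the filter functions, and the matching helpers are hypothetical and far simpler than what Lucene provides.

```python
import re
from fnmatch import fnmatchcase

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is"}  # illustrative list


def tokenize(text):
    """Split text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text, filters=(lowercase, remove_stop_words)):
    """Tokenize, then apply each filter in order before indexing."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


def term_match(tokens, term):
    """Term query: the exact token is present."""
    return term in tokens


def phrase_match(tokens, phrase):
    """Phrase query: the tokens of `phrase` (a list) appear consecutively."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def wildcard_match(tokens, pattern):
    """Wildcard query: any token matches a shell-style pattern."""
    return any(fnmatchcase(t, pattern) for t in tokens)


# A document is a set of named fields; values need not be text.
doc = {"id": 1, "title": "The Art of Open Source Search", "price": 29.99}
title_tokens = analyze(doc["title"])
```

The same `analyze` function should be applied to query text, otherwise a query for "Search" would not match the lowercased token "search" stored at index time.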

  9. Apache Solr in a Nutshell • http://lucene.apache.org/solr • Lucene-based Search Server + other features and functionality • Access Lucene over HTTP: • Java, XML, Ruby, Python, .NET, JSON, PHP, etc. • Most programming tasks in Lucene are configuration tasks in Solr • Faceting (guided navigation, filters, etc.) • Replication and distributed search support • Lucene Best Practices
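
Because Solr is reached over HTTP, a search is just a URL. The sketch below builds a request URL with faceting switched on, without sending it. The helper name `solr_select_url` is made up for illustration, and the `cat` field is an assumption about the schema; `q`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr request parameters, and the base URL matches the local example server used in the demo.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, facet_fields=(), rows=10):
    """Build a Solr /select URL; facet counts are requested per field."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return base_url.rstrip("/") + "/select?" + urlencode(params)


url = solr_select_url("http://localhost:8983/solr", "ipod", facet_fields=["cat"])
```

Passing the resulting URL to any HTTP client (for example `urllib.request.urlopen`) against a running Solr instance would return the matches plus per-value counts for each facet field, which is what drives guided navigation in a UI.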

  10. A small sampling of Lucene/Solr-Powered Sites Buy.com

  11. Features and Functionality

  12. Quick Solr/Lucene Demo • Pre-reqs: • Apache Ant 1.7.x, Subversion (SVN) • Command Line 1: • svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk • cd solr-trunk/solr/ • ant example • cd example • java -Dsolr.clustering.enabled=true -jar start.jar • Command Line 2: • cd exampledocs; java -jar post.jar *.xml • http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true

  13. Other Features • Data Import Handler • Database, Mail, RSS, etc. • Rich document support via Apache Tika • PDF, MS Office, Images, etc. • Replication for high query volume • Distributed search for large indexes • Production systems with 1B+ documents • Configurable Analysis chain and other extension points • Total control over tokenization, stemming, etc.
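
The core idea behind distributed search for large indexes is that each shard searches its own slice and a coordinator merges the partial results. The sketch below shows only that merge step and is a simplification under stated assumptions: the function name and the `(score, doc_id)` hit format are hypothetical, scores are assumed to be comparable across shards, and Solr's actual merge logic is not reproduced here.

```python
import heapq


def merge_shard_results(shard_results, k):
    """Combine per-shard hit lists of (score, doc_id) into a global top k."""
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])


shard_a = [(0.9, "a1"), (0.4, "a2")]
shard_b = [(0.7, "b1"), (0.1, "b2")]
top_hits = merge_shard_results([shard_a, shard_b], k=3)
```

Each shard only needs to return its own top k hits, since no document outside a shard's top k can appear in the global top k.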

  14. Intangibles • Open Source • Flexible, non-restrictive license • Apache License v2 – non-viral • “Do what you want with the software, just don’t claim you wrote it” • Large community willing to help • Great place to learn about real world IR systems • Many books and other documentation • Lucene in Action by Hatcher, McCandless and Gospodnetic

  15. What’s New? • https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/CHANGES.txt • https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/CHANGES.txt • Codecs • Pluggable Index Formats • Provide different index compression techniques • Stats to enable alternate scoring approaches • BM25, Language Modeling, etc. -- More work to be done here • Faster • Java Strings are slow; convert to use byte arrays
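
To make the mention of BM25 concrete, here is a minimal sketch of one common textbook form of the formula. It is not Lucene code: the function name, the input format, and the default values `k1=1.2` and `b=0.75` are assumptions. It does show why extra index statistics are needed, since the score depends on document length and average document length as well as term and document frequencies.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document against the query.

    docs maps doc_id to a list of tokens; returns {doc_id: score}.
    """
    if not docs:
        return {}
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: long documents need more occurrences.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores
```

With equal term frequency, a shorter document scores higher than a longer one, and `b` controls how strongly length is penalized (`b=0` turns length normalization off).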

  16. Other New Items • Many new Analyzers (tokenizers, etc.) • Richer Language support (Hindi, Indonesian, Arabic, …) • Richer Geospatial (Local) Search capabilities • Score, filter, sort by distance • http://wiki.apache.org/solr/SpatialSearch • Results Grouping • Group Related Results • http://wiki.apache.org/solr/FieldCollapsing • More Faceting Capabilities • Pivot • New underlying algorithms
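
Two of the items above, distance-based filtering and sorting and results grouping, can be illustrated without Solr. The sketch below uses the haversine great-circle formula, which is one standard way to compute distance between latitude/longitude points and is not necessarily what Solr uses internally. The helper names, the `lat`/`lon`/`chain` fields, and the sample records are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby(docs, lat, lon, max_km):
    """Filter to documents within max_km, sorted nearest first."""
    hits = [(haversine_km(lat, lon, d["lat"], d["lon"]), d) for d in docs]
    hits = [(dist, d) for dist, d in hits if dist <= max_km]
    return [d for _, d in sorted(hits, key=lambda hit: hit[0])]


def group_by(docs, field, limit=1):
    """Keep at most `limit` documents per field value, preserving rank order."""
    groups = {}
    for d in docs:
        bucket = groups.setdefault(d[field], [])
        if len(bucket) < limit:
            bucket.append(d)
    return groups


# Illustrative data only; coordinates are approximate.
STORES = [
    {"name": "Rochester", "chain": "A", "lat": 43.16, "lon": -77.61},
    {"name": "New York", "chain": "B", "lat": 40.71, "lon": -74.01},
    {"name": "Syracuse", "chain": "A", "lat": 43.05, "lon": -76.15},
]
close_stores = nearby(STORES, lat=43.0, lon=-76.1, max_km=150)
grouped = group_by(close_stores, "chain")
```

Grouping after ranking means each group is represented by its best-ranked members, which is the effect of collapsing related results into a single entry.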

  17. How can Lucene/Solr help me?

  18. Job Trends http://www.indeed.com

  19. Other Things that Can Help • Nutch • Crawling • http://nutch.apache.org • Mahout • Machine learning (clustering, classification, others) • http://mahout.apache.org • OpenNLP • Part of Speech, Parsers, Named Entity Recognition • http://incubator.apache.org/opennlp • Open Relevance Project • Relevance Judgments • http://lucene.apache.org/openrelevance

  20. Resources • http://lucene.apache.org • http://www.lucidimagination.com • {java-user|solr-user}@lucene.apache.org • @gsingers • http://www.slideshare.net/gsingers • grant@lucidimagination.com
