Lucene in Action

Lucene in Action. For ITCS 6265. Professor: Wensheng Wu. Presented by TA: Xu Fei.

Presentation Transcript


  1. Lucene in Action • For ITCS 6265 • Professor: Wensheng Wu • Presented by TA: Xu Fei

  2. What is Lucene • “Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.” • A high-performance, scalable Information Retrieval (IR) library • A project of the Apache Software Foundation • Mature, free, open-source • Implemented in Java

  3. Full-text indexing and searching • “In text retrieval, full text search refers to a technique for searching a computer-stored document or database. In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.” • “Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.”

  4. Lucene is popular • a number of ports or integrations to other programming languages • C/C++, C#, Ruby, Perl, Python, PHP, etc. • 1500+ installations: • HP, FedEx, Iron Mountain, Akamai, DSpace, IBM/Yahoo, Healthline, Webmail, CNET, Lookout (acquired by Microsoft), webshots.com (100M docs, 4M queries/day), Siderean, Monster….

  5. Lucene is just a hammer! • NOT a ready-to-use search application, like Google • a software library, a toolkit • a single compact JAR file (less than 1 MB!) • A number of full-featured search applications have been built on top of Lucene.

  6. What Lucene can do for you • add search capabilities to your application • index and make searchable any data that you can extract text from • Lucene doesn’t care about the source of the data, its format, or even its language, as long as you can derive text from it. • You can even index data stored in your databases, indirectly!

  7. Search Application • Components for indexing: Acquire Content, Build Document, Analyze Document, Index Document • Components for searching: Search User Interface, Build Query, Search Query, Render Results • Others: Administration Interface, Analytics Interface, Scaleout • Figure 1. Typical components of a search application; the shaded components show which parts Lucene handles.
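
The indexing and searching components listed on this slide can be sketched as a chain of small functions. This is a plain-Python illustration of the stages, not the Lucene API: the sample records, the stop-word list, and every function name below are hypothetical and chosen only to mirror the slide's labels.

```python
import re

# Hypothetical stop-word list; a real analyzer would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "in"}


def acquire_content():
    """Acquire Content: two hard-coded records stand in for a crawler or database."""
    return [
        {"id": 1, "title": "Lucene in Action", "body": "A search library written in Java"},
        {"id": 2, "title": "Apache HTTP Server", "body": "A web server"},
    ]


def build_document(raw):
    """Build Document: flatten a raw record into one searchable text field."""
    return {"id": raw["id"], "title": raw["title"], "text": raw["title"] + " " + raw["body"]}


def analyze(text):
    """Analyze Document: lowercase, split into tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def index_documents(docs):
    """Index Document: map each term to the set of document ids containing it."""
    index = {}
    for doc in docs:
        for term in analyze(doc["text"]):
            index.setdefault(term, set()).add(doc["id"])
    return index


def build_query(user_input):
    """Build Query: run the user's text through the same analyzer as the documents."""
    return analyze(user_input)


def search(index, terms):
    """Search Query: return ids of documents containing any query term."""
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return sorted(hits)


def render(hits, docs_by_id):
    """Render Results: turn matching ids into display strings."""
    return [f"{doc_id}: {docs_by_id[doc_id]['title']}" for doc_id in hits]


DOCUMENTS = [build_document(raw) for raw in acquire_content()]
DOCS_BY_ID = {doc["id"]: doc for doc in DOCUMENTS}
INDEX = index_documents(DOCUMENTS)
print(render(search(INDEX, build_query("The Java library")), DOCS_BY_ID))
```

Running the query through the same `analyze` function as the documents is the design point worth noticing: if the two sides tokenize differently, terms that look identical to a user will not match in the index.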

  8. Ranking formula • score(Q,D) = coord(Q,D) · queryNorm(Q) · ∑_{t in Q} ( tf(t in D) · idf(t)² · t.getBoost() · norm(D) ) • tf–idf weight (term frequency–inverse document frequency)
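
To make the ranking formula concrete, here is a minimal Python sketch that evaluates it term by term. The slide does not define the individual factors, so the definitions below are assumptions modelled on what Lucene's default similarity is generally documented to use in the 2.x line (tf as the square root of the term count, idf as 1 + ln(N / (df + 1)), coord as the fraction of query terms matched, queryNorm as the inverse square root of the summed squared weights, and norm as the inverse square root of the document length). Check them against the scoring page listed in the references before relying on them. The two-document corpus is hypothetical.

```python
import math
from collections import Counter

# Hypothetical corpus: each document is already a list of analyzed tokens.
CORPUS = [
    ["penn", "state", "football", "football"],
    ["football", "players", "state"],
]


def idf(term, corpus):
    """Assumed idf: 1 + ln(numDocs / (docFreq + 1))."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return 1.0 + math.log(len(corpus) / (doc_freq + 1))


def score(query_terms, doc_tokens, corpus, boost=1.0):
    """Evaluate coord * queryNorm * sum(tf * idf^2 * boost * norm) for one document."""
    if not query_terms or not doc_tokens:
        return 0.0
    counts = Counter(doc_tokens)
    matched = [t for t in query_terms if counts[t] > 0]

    # coord(Q,D): fraction of the query terms that occur in the document.
    coord = len(matched) / len(query_terms)

    # queryNorm(Q): makes scores from different queries roughly comparable.
    sum_sq = sum((idf(t, corpus) * boost) ** 2 for t in query_terms)
    query_norm = 1.0 / math.sqrt(sum_sq)

    # norm(D): shorter documents get a higher weight per matching term.
    norm = 1.0 / math.sqrt(len(doc_tokens))

    total = sum(
        math.sqrt(counts[t]) * idf(t, corpus) ** 2 * boost * norm
        for t in matched
    )
    return coord * query_norm * total


for doc_id, doc in enumerate(CORPUS, start=1):
    print(doc_id, round(score(["football"], doc, CORPUS), 4))
```

With these assumed definitions, the first document outranks the second for the query `football`: its higher term frequency outweighs the penalty for being one token longer.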

  9. Key index files in Lucene • Segments file • Fields information file • Text information file • Frequency file • Position file

  10. Inverted Index Example • Doc 1: Penn State Football … football • Doc 2: Football players … State • Posting Table (figure)
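
A posting table for the two example documents can be built in a few lines of Python. This sketch uses only the words visible on the slide (the elided text is left out), lowercases them, and records how often each term occurs per document. It is an in-memory illustration and does not reflect Lucene's on-disk index format.

```python
import re
from collections import defaultdict

# Only the words shown on the slide; the elided portions are omitted.
DOCS = {
    1: "Penn State Football football",
    2: "Football players State",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_posting_table(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    table = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            table[term][doc_id] = table[term].get(doc_id, 0) + 1
    return dict(table)


POSTINGS = build_posting_table(DOCS)
for term in sorted(POSTINGS):
    print(term, POSTINGS[term])
```

Looking up a term is then a single dictionary access that returns every document containing it, which is what lets a search avoid scanning the documents themselves.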

  11. Demo • How to install Lucene and run the demo • Boolean retrieval example • apache -lucene • apache +lucene • apache lucene • Luke: http://www.getopt.org/luke/ • An online demo (PHP + Lucene): http://tiny.cc/JCA9K
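
The three Boolean queries can be worked through on a toy index. The sketch below assumes the usual reading of Lucene's query syntax: a leading `+` marks a required term, a leading `-` marks a prohibited term, and unmarked terms are optional and combined with OR by default. It simplifies by ignoring scoring, and the three sample documents are hypothetical.

```python
# Hypothetical documents for illustration.
DOCS = {
    1: "apache lucene search library",
    2: "apache http server",
    3: "lucene index files",
}

# Term -> set of ids of the documents containing it.
INDEX = {}
for doc_id, text in DOCS.items():
    for term in text.split():
        INDEX.setdefault(term, set()).add(doc_id)


def parse(query):
    """Split a query into required (+), prohibited (-), and optional terms."""
    required, prohibited, optional = [], [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            prohibited.append(token[1:])
        else:
            optional.append(token)
    return required, prohibited, optional


def search(index, query):
    """Return sorted ids of documents matching the Boolean query."""
    required, prohibited, optional = parse(query)
    if required:
        # Every required term must be present; optional terms would only
        # influence ranking, which this sketch does not compute.
        hits = set.intersection(*(index.get(t, set()) for t in required))
    else:
        # With no required terms, any optional term is enough to match.
        hits = set()
        for term in optional:
            hits |= index.get(term, set())
    for term in prohibited:
        hits -= index.get(term, set())
    return sorted(hits)


for q in ("apache -lucene", "apache +lucene", "apache lucene"):
    print(q, "->", search(INDEX, q))
```

On this toy index, `apache -lucene` matches only document 2, `apache +lucene` matches documents 1 and 3, and `apache lucene` matches all three.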

  12. Reference: • Lucene: http://lucene.apache.org/ • Apache: http://www.apache.org/ • “Lucene in Action” Chapter 1 and code: Link • Lucene index: http://www.ibm.com/developerworks/library/wa-lucene/ • http://lucene.apache.org/java/2_4_0/scoring.html • http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html • http://en.wikipedia.org/wiki/Full_text_search • http://en.wikipedia.org/wiki/Index_%28search_engine%29 • http://en.wikipedia.org/wiki/Tf-idf
