
Open Source IR Tools and Libraries

CS-463 Information Retrieval Models, Computer Science Department, University of Crete.


Presentation Transcript


  1. Open Source IR Tools and Libraries CS-463 Information Retrieval Models Computer Science Department University of Crete

  2. Outline • Google Search API • Lucene • Terrier • Lemur • Dragon • Groogle

  3. Google Search API

  4. Google Search API: Overview • The API exposes the Google engine to developers. • You can write scripts that access Google search in real time. • Google is no longer issuing new API keys for the SOAP Search API. • Instead, Google provides an AJAX Search API. • You can put Google Search in your web pages with JavaScript.

  5. Google Search API: SOAP • Based on the Web services technology SOAP (the XML-based Simple Object Access Protocol). • Developers write software programs that connect remotely to the Google SOAP Search API service. • Developers can issue search requests to Google's index of billions of web pages and receive results as structured data, access information in the Google cache, and check the spelling of words. • Limitations • Default limit of 1,000 queries per day. • Can only query for 10 results at a time. • Can only access Google Web Search (not Google Images, Google Groups, and so on).
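
The two numeric limits on this slide determine how a client has to page through results. The sketch below only does that arithmetic; it does not call any Google service. The function names and the offset-based paging scheme are assumptions made for illustration.

```python
RESULTS_PER_REQUEST = 10   # from the slide: 10 results at a time
DAILY_QUERY_LIMIT = 1000   # from the slide: default of 1,000 queries per day


def plan_requests(total_results, per_request=RESULTS_PER_REQUEST):
    """Return the (start_offset, count) pairs needed to fetch total_results."""
    plan = []
    for start in range(0, total_results, per_request):
        plan.append((start, min(per_request, total_results - start)))
    return plan


def max_results_per_day(per_request=RESULTS_PER_REQUEST, limit=DAILY_QUERY_LIMIT):
    """Upper bound on the results one key can retrieve in a day."""
    return per_request * limit
```

Fetching 25 results, for example, costs three of the day's queries, and the quota caps a single key at 10,000 results per day.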

  6. Google Search API: AJAX • Lets you put Google Search in your web pages with JavaScript. • Does not have a limit on the number of queries per day. • Supports additional features like Video, News, Maps, and Blog search results.

  7. Google Search API: AJAX • Web Search • Incorporate results from Web Search, News Search, and Blog Search. • Local Search • Provides access to local search results from Google Maps. • Video Search • Incorporate a simple search box. • Incorporate dynamic, search-powered strips of video and book thumbnails.

  8. Google Search API: Demo

  9. Google's Solutions • URL queue/list • Cached source pages (compressed) • Use many features, e.g. font, layout, … • Inverted index • Hypertext structure

  10. Google Search API: References • Google SOAP Search API http://code.google.com/apis/soapsearch/ • Google AJAX Search API http://code.google.com/apis/ajaxsearch/ • Google AJAX Search API Developer Guide http://code.google.com/apis/ajaxsearch/documentation/ • Google AJAX Search API Samples http://code.google.com/apis/ajaxsearch/samples.html

  11. Lucene Apache Software Foundation

  12. Lucene • Cross-platform API • Implemented in Java • Ported to C++, C#, Perl, Python • Offers scalable, high-performance indexing • Incremental indexing as fast as batch indexing • Index size roughly 20-30% the size of indexed text • Supports many powerful query types

  13. Lucene: Modules • Analysis • Tokenization, stop words, stemming, etc. • Document • Unique ID for each document • Title of document, date modified, content, etc. • Index • Provides access to and maintains indexes. • Query Parser • Search / Search Spans

  14. Lucene: Indexing • A Document is a collection of Fields • A Field is free text, keywords, dates, etc. • A Field can have several characteristics • indexed, tokenized, stored, term vectors • Apply an Analyzer to alter Tokens during indexing • Stemming • Stop-word removal • Phrase identification (Diagram: a Document made up of Field 1, Field 2, …, Field N)
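
The document/field/analyzer model on this slide can be illustrated without Lucene itself. The following is a minimal plain-Python sketch, not the Lucene API: the stop-word list, the one-letter "stemmer", and the function names are all assumptions chosen to keep the example short.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}  # illustrative list


def analyze(text):
    """Lowercase, tokenize, drop stop words, and apply a crude plural stemmer."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Crude stand-in for a real stemmer: strip one trailing "s" from longer words.
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in kept]


def build_index(documents):
    """Map (field, term) -> sorted list of ids of documents containing the term."""
    postings = defaultdict(set)
    for doc_id, fields in documents.items():
        for field_name, text in fields.items():
            for term in analyze(text):
                postings[(field_name, term)].add(doc_id)
    return {key: sorted(ids) for key, ids in postings.items()}


# Each document is a collection of named fields, as on the slide.
docs = {
    1: {"title": "Indexing the Web", "content": "Crawlers fetch pages and index terms"},
    2: {"title": "Query Parsers", "content": "A parser turns the query into terms"},
}
index = build_index(docs)
```

Keying the postings by (field, term) is what makes a field-restricted query such as title:parser possible: the lookup touches only the postings for that field.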

  15. Lucene: Query Parser Syntax • Terms • Single terms and phrases • Fields • E.g. title:"Do it right" AND right • Wildcard Searches • '?' for a single character • '*' for multiple characters • Proximity Searches • "jakarta apache"~10 • Fuzzy Searches • Levenshtein distance (edit distance) algorithm • Range Searches • mod_date:[20020101 TO 20030101] • title:{Aida TO Carmen} • Boosting a Term • E.g. jakarta^4 apache • Boolean Operators
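
The fuzzy searches on this slide rest on Levenshtein (edit) distance: the minimum number of single-character insertions, deletions, and substitutions that turn one string into another. Below is a minimal sketch of the standard dynamic-programming computation. The `fuzzy_matches` helper and its `max_edits` threshold are assumptions for illustration, not how Lucene exposes fuzzy matching.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, computed row by row."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(query_term, vocabulary, max_edits=2):
    """Vocabulary terms within max_edits of query_term (hypothetical helper)."""
    return sorted(t for t in vocabulary if levenshtein(query_term, t) <= max_edits)
```

Keeping only two rows of the table uses memory proportional to the shorter dimension rather than the full grid.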

  16. Lucene: More Advanced Options • Relevance Feedback • Manual • User selects which documents are relevant/non-relevant • Get the terms from the term vector for each of the documents and construct a new query. • Automatic • Application assumes the top X documents are relevant and the bottom Y are non-relevant and constructs a new query based on the terms in those documents. • Span Queries • Phrase Matching
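
The automatic variant described on this slide can be sketched as follows. This is an illustration of the idea, not Lucene code: term vectors are modeled as plain term-to-frequency dictionaries, and the scoring rule (frequency in the top X documents minus frequency in the bottom Y) is an assumption picked for simplicity.

```python
from collections import Counter


def expand_query(query_terms, ranked_term_vectors, top_x=2, bottom_y=1, extra_terms=2):
    """Add the best-scoring terms from the top-ranked documents to the query.

    ranked_term_vectors holds one {term: frequency} dict per result, best first.
    """
    relevant = Counter()
    for vector in ranked_term_vectors[:top_x]:
        relevant.update(vector)
    non_relevant = Counter()
    if bottom_y > 0:
        for vector in ranked_term_vectors[-bottom_y:]:
            non_relevant.update(vector)
    # Reward terms frequent in assumed-relevant documents and penalize
    # terms frequent in assumed-non-relevant ones.
    scores = {t: relevant[t] - non_relevant[t]
              for t in relevant if t not in query_terms}
    best = sorted(scores, key=lambda t: (-scores[t], t))[:extra_terms]
    return list(query_terms) + [t for t in best if scores[t] > 0]


ranked = [
    {"lucene": 3, "index": 2, "java": 1},
    {"lucene": 1, "index": 2, "search": 2},
    {"java": 4, "coffee": 3},
]
expanded = expand_query(["lucene"], ranked)
```

The manual variant differs only in where the relevant and non-relevant sets come from: the user's selections replace the top X and bottom Y slices.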

  17. Lucene: Basic Demo • The latest version can be obtained from http://www.apache.org/dyn/closer.cgi/lucene/java/ • To build an index just type • java org.apache.lucene.demo.IndexFiles <dir> • To search from an index type: • java org.apache.lucene.demo.SearchFiles <index>

  18. Terrier University of Glasgow

  19. Terrier: Overview (1/2) • Stands for TERabyte RetrIEveR. • Open source API (Mozilla Public Licence). • Modular platform for the rapid development of large-scale IR applications. • It is written in Java (and Perl). • Highly compressed disk data structures. • Handling large-scale document collections. • Standard evaluation of TREC ad-hoc and known-item search retrieval results. • Based on a new parameter-free probabilistic framework for IR (DFR), allowing adaptable term weighting functionalities.

  20. Terrier: Overview (2/2) • Includes state-of-the-art functionalities such as: • hyperlink structure analysis, • combination of evidence approaches, • automatic query expansion/re-formulation techniques, • query performance predictors, • compression techniques. • Deploys over 50 term weighting/matching functions. • Has a robust and effective crawler, called Labrador. • Allows large-scale experimentation to be conducted in a robust, transparent, reproducible, modular, and platform-independent way, without constraints and parameters.

  21. Terrier: Indexing • Create your own Collection decoder and Document implementation. • Centralized or distributed setting. • Indexer iterates through the collection and creates the following data structures • Direct Index • Document Index • Lexicon

  22. Terrier: Indexing • Each document in the collection is tokenized and parsed. • In this way, we build the direct and document indices. • We also build temporary lexicons in order to reduce the required memory during indexing. • The inverted index is built from the existing direct index, document index, and lexicon.
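
The relationship between the data structures on the two indexing slides can be sketched in a few lines. This is a toy in-memory model, not Terrier's code or its on-disk formats: the dictionary layouts, whitespace tokenization, and function names are assumptions.

```python
from collections import defaultdict


def index_collection(collection):
    """First pass: build the lexicon, direct index, and document index."""
    lexicon = {}          # term -> term id
    direct_index = {}     # doc id -> {term id: term frequency}
    document_index = {}   # doc id -> (external document number, length in tokens)
    for doc_id, (docno, text) in enumerate(collection):
        tokens = text.lower().split()
        term_freqs = defaultdict(int)
        for token in tokens:
            term_id = lexicon.setdefault(token, len(lexicon))
            term_freqs[term_id] += 1
        direct_index[doc_id] = dict(term_freqs)
        document_index[doc_id] = (docno, len(tokens))
    return lexicon, direct_index, document_index


def invert(direct_index):
    """Second pass: turn doc -> terms into term -> [(doc id, tf), ...]."""
    inverted = defaultdict(list)
    for doc_id in sorted(direct_index):
        for term_id, tf in direct_index[doc_id].items():
            inverted[term_id].append((doc_id, tf))
    return dict(inverted)


collection = [
    ("DOC-A", "terrier indexes documents"),
    ("DOC-B", "terrier retrieves documents documents"),
]
lexicon, direct_index, document_index = index_collection(collection)
inverted_index = invert(direct_index)
```

Because documents are visited in increasing id order during inversion, every posting list comes out sorted by document id without a separate sort.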

  23. Terrier: Retrieval • Parsing • Pre-processing • Matching • Post Processing • Post Filtering • Query Language • term1 term2 • term1^2.3 • +term1 -term2 • "term1 term2"~n
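
The query language forms listed on this slide can be recognized with a small tokenizer. The sketch below is not Terrier's parser. It assumes the usual reading of the syntax: `^` attaches a weight, `+` and `-` mark required and excluded terms, and `~n` gives a proximity window for a quoted phrase. The output dictionary layout is made up for illustration.

```python
import re

# Either a quoted phrase with an optional ~n window,
# or a term with an optional +/- prefix and an optional ^weight suffix.
TOKEN = re.compile(
    r'"([^"]+)"(?:~(\d+))?'
    r'|([+-]?)([^\s^"]+)(?:\^(\d+(?:\.\d+)?))?'
)


def parse_query(query):
    """Split a query string into a list of clause dictionaries."""
    clauses = []
    for match in TOKEN.finditer(query):
        phrase, window, sign, term, weight = match.groups()
        if phrase is not None:
            clauses.append({
                "type": "phrase",
                "terms": phrase.split(),
                "window": int(window) if window else None,
            })
        else:
            clauses.append({
                "type": "term",
                "term": term,
                "required": sign == "+",
                "excluded": sign == "-",
                "weight": float(weight) if weight else 1.0,
            })
    return clauses
```

Calling `parse_query('+terrier -lemur retrieval^2.3 "query expansion"~5')` yields four clauses: a required term, an excluded term, a weighted term, and a phrase with a window of 5.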

  24. Terrier: Retrieval • Remove stop words and apply stemming to the query. • Terrier automatically selects the optimal document weighting model. • If query expansion is applied, an appropriate term weighting model is selected and the most informative terms from the top-ranked documents are added to the query.

  25. Terrier: Sample Applications • TREC Terrier • An application that allows Terrier to index and retrieve from standard TREC test collections. • Instructions are available at http://ir.dcs.gla.ac.uk/terrier/doc/trec_terrier.html

  26. Terrier: Sample Applications • Desktop Search • A Swing (graphical) application that can be used to index files from the local machine, and then perform queries on them. • The scripts for running the desktop search application are: • desktop_search.sh (Linux, Mac OS X) • desktop_search.bat (Windows)

  27. Terrier: Sample Applications • Interactive Querying • A console application for performing simple queries on an existing index and seeing which documents are returned. • The scripts for running the console application are: • interactive_terrier.sh (Linux, Mac OS X) • interactive_terrier.bat (Windows)

  28. Terrier: Demo

  29. Lemur University of Massachusetts

  30. Lemur: Overview • Support for XML and structured document retrieval • Interactive interfaces for Windows, Linux, and Web • Cross-platform, fast and modular code written in C++ • Free and open-source software

  31. Lemur: API • Provides interfaces to Lemur classes that are grouped at three different levels: • Utility level • Common utilities, such as memory management, document parsing, etc. • Indexer level • Converts a raw text collection to data structures for efficient retrieval. • Retrieval level • Abstract classes for a general retrieval architecture and concrete classes for several specific information retrieval

  32. Lemur: Indexing • Multiple indexing methods for small, medium and large-scale (terabyte) collections. • Built-in support for English, Chinese and Arabic text. • Porter and Krovetz word stemming. • Incremental indexing.

  33. Lemur: Retrieval • Supports major language modelling approaches such as Indri and KL-divergence, as well as vector space, tf-idf, Okapi and InQuery • Relevance and pseudo-relevance feedback • Wildcard term expansion (using Indri) • Supports arbitrary document priors (e.g., PageRank, URL depth)
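
To show how a document prior combines with a language-modelling score, here is a minimal sketch of query-likelihood ranking with Dirichlet smoothing. The slide names language modelling and document priors but does not specify a formula, so the smoothing method, the `mu` value, and the function signature are all assumptions. None of this is Lemur or Indri API.

```python
import math
from collections import Counter


def score(query_terms, doc_tokens, collection_counts, collection_length,
          prior=1.0, mu=2000.0):
    """Return log P(d) + sum of log P(t | d) over query terms.

    P(t | d) is Dirichlet-smoothed with the collection language model.
    """
    tf = Counter(doc_tokens)
    total = math.log(prior)  # the document prior, e.g. derived from PageRank
    for term in query_terms:
        p_collection = collection_counts.get(term, 0) / collection_length
        p_doc = (tf[term] + mu * p_collection) / (len(doc_tokens) + mu)
        if p_doc == 0.0:
            return float("-inf")  # term occurs nowhere in the collection
        total += math.log(p_doc)
    return total
```

Because everything is summed in log space, the prior is just one more additive term. Any query-independent evidence can be plugged in the same way without touching the term-matching part of the score.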

  34. Lemur: Query Flow • User Query • Query Parser • Scoring Nodes • runQuery() • Sorted Vector of Score Extent Results

  35. The Dragon Toolkit Drexel University

  36. Dragon: Overview (1/2) • Highly scalable to large data sets • Well-designed programming API and XML-based interface • Various document representations including words, multiword phrases, ontology-based concepts, and concept pairs • Various text retrieval models • Text classification, clustering, summarization and topic modeling

  37. Dragon: Overview (2/2) • Provides built-in support for semantic-based IR and TM (unlike Lucene and Lemur). • Integrates a set of NLP tools, which enable the toolkit to index text collections with various representation schemes including words, phrases, ontology-based concepts and relationships. • It is specially designed for large-scale applications. • The toolkit uses sparse matrices to implement text representations and does not have to load all data into memory at run time. • Can handle hundreds of thousands of documents with very limited memory.
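
The memory claim on this slide follows from the sparse representation: a document-term matrix stores only its non-zero cells. The class below is a minimal in-memory sketch of that idea, with hypothetical names. It is not the Dragon Toolkit's API, and it does not model the toolkit's ability to leave data on disk.

```python
class SparseMatrix:
    """Row-oriented sparse matrix that stores non-zero cells only."""

    def __init__(self):
        self.rows = {}  # row index -> {column index: value}

    def add(self, row, col, value):
        """Accumulate value into cell (row, col)."""
        cells = self.rows.setdefault(row, {})
        cells[col] = cells.get(col, 0.0) + value

    def get(self, row, col):
        """Read a cell; absent cells are implicitly zero."""
        return self.rows.get(row, {}).get(col, 0.0)

    def stored_cells(self):
        """Number of cells actually held in memory."""
        return sum(len(cells) for cells in self.rows.values())

    def dot(self, row_a, row_b):
        """Dot product of two rows, iterating over the shorter one."""
        a, b = self.rows.get(row_a, {}), self.rows.get(row_b, {})
        if len(a) > len(b):
            a, b = b, a
        return sum(value * b[col] for col, value in a.items() if col in b)
```

With rows as documents and columns as term ids, memory grows with the number of distinct terms per document rather than with vocabulary size. That holds even when column ids run into the hundreds of thousands.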

  38. Dragon Toolkit Demo

  39. Groogle University of Crete

  40. Groogle • Wiki • http://groogle.csd.uoc.gr/apache2-default/ • Repository • http://groogle.csd.uoc.gr/bzr/groogle-devel • Bugzilla • http://groogle.csd.uoc.gr/bugzilla/ • Credits • http://groogle.csd.uoc.gr:8080/groogle-2007/index.jsp?tID=credits

  41. Questions ?
