
Software Libraries and Systems



  1. Software Libraries and Systems • Information Retrieval • Michal Laclavík • 08.11.2007

  2. Tools • IR libraries & engines • Lucene • Egothor • Xapian • mnoGoSearch • Lucene • Nutch • Ports • SearchBlox

  3. Lucene Indexing • IndexWriter • Directory • FSDirectory, RAMDirectory • Analyzer • Document • Collection of fields • Field • Keyword, UnIndexed, UnStored, Text
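
To make the idea of a document as a collection of fields concrete, here is a toy inverted index in plain Java. It is only a sketch of the concept and does not use Lucene's API; `TinyIndex`, `addDocument`, and `search` are hypothetical names, and every field is assumed to be tokenized into lowercase words.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index for illustration only; NOT Lucene's IndexWriter.
class TinyIndex {
    // "field:term" -> ids of the documents that contain the term in that field
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // A document is a collection of named fields (name -> text).
    int addDocument(Map<String, String> fields) {
        int docId = nextDocId++;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            for (String word : field.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (word.isEmpty()) continue;
                postings.computeIfAbsent(field.getKey() + ":" + word, k -> new TreeSet<>())
                        .add(docId);
            }
        }
        return docId;
    }

    // Look up a single term in a single field.
    SortedSet<Integer> search(String field, String word) {
        return postings.getOrDefault(field + ":" + word.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.addDocument(Map.of("title", "Lucene in Action", "body", "indexing and searching"));
        index.addDocument(Map.of("title", "Nutch", "body", "crawler built on Lucene"));
        System.out.println(index.search("title", "lucene")); // [0]
        System.out.println(index.search("body", "lucene"));  // [1]
    }
}
```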

  4. Lucene Indexing 2 • Indexing Dates • Boosting • Field.setBoost • Indexing Numbers • Adding zeros, Analyzers • Sorting • Not tokenized, Field Keyword • Directory • FSDirectory, RAMDirectory • Term vector • Field.UnStored("subject", subject, true);
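
The "adding zeros" point can be illustrated without Lucene: when numbers are indexed as text, they compare lexicographically, so they need a fixed width to sort and range-match in numeric order. The sketch below assumes non-negative integers and a hypothetical width of 10 digits.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class PaddedNumbers {
    // Left-pad a non-negative number with zeros to a fixed width (10 is an arbitrary choice).
    static String pad(int n) {
        return String.format(Locale.ROOT, "%010d", n);
    }

    public static void main(String[] args) {
        List<String> raw = new ArrayList<>(List.of("7", "42", "100"));
        Collections.sort(raw);
        System.out.println(raw); // [100, 42, 7]  (text order, not numeric order)

        List<String> padded = new ArrayList<>(List.of(pad(7), pad(42), pad(100)));
        Collections.sort(padded);
        System.out.println(padded); // [0000000007, 0000000042, 0000000100]
    }
}
```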

  5. Lucene Searching • IndexSearcher • Term • Query • Boolean, Phrase, Prefix, Range, Fuzzy (Levenshtein) • TermQuery • Hits
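
Fuzzy matching rests on Levenshtein (edit) distance. The following is a standard dynamic-programming sketch of that distance in plain Java; it is not Lucene's own implementation, and the class name is hypothetical.

```java
class LevenshteinSketch {
    // Minimum number of single-character insertions, deletions, and substitutions
    // needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("lucene", "lucine"));  // 1
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```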

  6. Lucene Searching 2 • Query q = QueryParser.parse("search", "field", new SimpleAnalyzer()); • +pubdate:[20040101 TO 20041231] Java AND (Jakarta OR Apache) • Query.toString() • Scoring • Similarity, DefaultSimilarity • Sorting • By field, by multiple fields • MultiFieldQueryParser • Filtering
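
The example query can be read as a plain predicate. The sketch below is not Lucene code: it assumes all three clauses are required, that dates are stored as `yyyyMMdd` strings (so a range check is a string comparison), and that terms were lowercased at indexing time. `QueryPredicate` and `matches` are hypothetical names.

```java
import java.util.Set;

class QueryPredicate {
    // +pubdate:[20040101 TO 20041231] Java AND (Jakarta OR Apache)
    static boolean matches(String pubdate, Set<String> terms) {
        // Inclusive range on fixed-width date strings works lexicographically.
        boolean inRange = pubdate.compareTo("20040101") >= 0
                && pubdate.compareTo("20041231") <= 0;
        return inRange
                && terms.contains("java")
                && (terms.contains("jakarta") || terms.contains("apache"));
    }

    public static void main(String[] args) {
        System.out.println(matches("20040615", Set.of("java", "apache"))); // true
        System.out.println(matches("20050101", Set.of("java", "apache"))); // false
    }
}
```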

  7. Lucene Searching 3 • Custom Sort Method • Distance search
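
A custom sort for distance search boils down to ordering hits by their distance from a query point. The sketch below shows that idea with a plain `Comparator`; it does not use Lucene's sorting API, and it assumes hits carry hypothetical integer (x, y) coordinates compared by Euclidean distance.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DistanceSort {
    static double distance(int[] a, int[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    // Returns the place names ordered from nearest to farthest from origin.
    static List<String> sortByDistance(Map<String, int[]> places, int[] origin) {
        List<String> names = new ArrayList<>(places.keySet());
        names.sort(Comparator.comparingDouble((String n) -> distance(places.get(n), origin)));
        return names;
    }

    public static void main(String[] args) {
        Map<String, int[]> places = new HashMap<>();
        places.put("A", new int[] {3, 4}); // distance 5 from (0, 0)
        places.put("B", new int[] {1, 1}); // distance about 1.41
        places.put("C", new int[] {6, 8}); // distance 10
        System.out.println(sortByDistance(places, new int[] {0, 0})); // [B, A, C]
    }
}
```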

  8. Lucene Analysis • Input: XY&Z Corporation - xyz@example.com • WhitespaceAnalyzer • [XY&Z] [Corporation] [-] [xyz@example.com] • SimpleAnalyzer (drops numbers) • [xy] [z] [corporation] [xyz] [example] [com] • StopAnalyzer • [xy] [z] [corporation] [xyz] [example] [com] • StandardAnalyzer • [xy&z] [corporation] [xyz@example.com]
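
The differences between the first three analyzers can be approximated in a few lines of plain Java. This is a simplified imitation, not the Lucene classes themselves: splitting on whitespace, splitting on non-letters with lowercasing, and then removing stop words. The stop-word list here is a small hypothetical one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalyzerSketch {
    // Hypothetical, deliberately tiny stop-word list.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the");

    // Like WhitespaceAnalyzer: split on whitespace, keep everything else.
    static List<String> whitespace(String text) {
        return nonEmpty(text.trim().split("\\s+"));
    }

    // Like SimpleAnalyzer: split at every non-letter (so digits vanish) and lowercase.
    static List<String> simple(String text) {
        return nonEmpty(text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"));
    }

    // Like StopAnalyzer: simple tokens minus stop words.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    private static List<String> nonEmpty(String[] parts) {
        List<String> out = new ArrayList<>();
        for (String p : parts) {
            if (!p.isEmpty()) out.add(p);
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "XY&Z Corporation - xyz@example.com";
        System.out.println(whitespace(input)); // [XY&Z, Corporation, -, xyz@example.com]
        System.out.println(simple(input));     // [xy, z, corporation, xyz, example, com]
        System.out.println(stop(input));       // [xy, z, corporation, xyz, example, com]
    }
}
```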

  9. Lucene Analysis 2 • Indexing • Querying • QueryParser analyzes its input, TermQuery terms are not analyzed • Results • Tokens: position, type • Terms: position • TokenStream, Tokenizer, TokenFilter

  10. Lucene Analysis 3 • Synonyms, aliases • Same position (phrase query) • UTF-8 • Encodings, HTML characters • Content-type • Nutch analysis • The quick
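
Placing synonyms at the same position as the original token is what lets phrase queries match either word. The sketch below shows the idea with plain strings of the form `position:token`; it is not a Lucene TokenFilter, and the input sentence and synonym map are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SynonymPositions {
    // Emits each token with its position; synonyms reuse the position of the original.
    static List<String> expand(List<String> tokens, Map<String, List<String>> synonyms) {
        List<String> out = new ArrayList<>();
        for (int pos = 0; pos < tokens.size(); pos++) {
            String token = tokens.get(pos);
            out.add(pos + ":" + token);
            for (String synonym : synonyms.getOrDefault(token, List.of())) {
                out.add(pos + ":" + synonym); // same position, i.e. position increment 0
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("quick", "brown", "fox");
        Map<String, List<String>> synonyms = Map.of("quick", List.of("fast", "speedy"));
        System.out.println(expand(tokens, synonyms));
        // [0:quick, 0:fast, 0:speedy, 1:brown, 2:fox]
    }
}
```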

  11. Sandbox • Development tools • Lucli CLI • Luke (toolbox) • Snowball analyzer • T9 indexing example • Highlighter • Berkeley DB

  12. Lucene Doc format • XML • SAX parser Xerces • Digester (Apache Jakarta) • PDF • PDFBox.org • Built-in support • HTML • JTidy.sf.net • NekoHTML • Word • POI (Jakarta project) • TextMining.org • RTF • javax.swing.text.rtf
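
For the XML case, extracting indexable text with SAX can be sketched using only the parser bundled with the JDK. The class and method names are hypothetical, and the sketch simply concatenates all character data, which a real indexer would probably map to separate fields.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTextExtractor {
    // Collects all character data of an XML document into one whitespace-normalized string.
    static String extractText(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String xml = "<doc><title>Lucene</title> <body>in Action</body></doc>";
        System.out.println(extractText(xml)); // Lucene in Action
    }
}
```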

  13. Tools • DocSearcher • Docco • SearchBlox

  14. Lucene Ports • CLucene • dotLucene • Plucene Perl • Lupy Python • PyLucene GCJ + SWIG

  15. Nutch • Built on Lucene • Fetcher, searcher interface • Scalable to several billions • Ranking ??? • Hadoop • MapReduce implementation
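
The MapReduce model that Hadoop implements can be shown in miniature with a single-process word count. This sketch uses no Hadoop classes; `mapLine`, `reduce`, and `run` are hypothetical names, and the grouping step stands in for the shuffle that a real cluster performs between the two phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordCountMapReduce {
    // Map phase: one input line -> a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Reduce phase: all counts emitted for one word -> their sum.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // "Shuffle": group the mapped values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : mapLine(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("Lucene and Nutch", "Nutch and Hadoop")));
        // {and=2, hadoop=1, lucene=1, nutch=2}
    }
}
```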

  16. Other Use cases • JGuru • SearchBlox • Alias-i

  17. Linux tools • catdoc • xls, doc • OpenOffice • pdftotext (Xpdf) • Encoding • enca

  18. Other libraries • QTag • POS tagging • Stemming • Snowball • Porter • Tvaroslovnik, JULS • SimMetrics • Similarities: Levenshtein, cosine measure • GATE
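
As an illustration of the cosine measure, the sketch below compares two texts by the angle between their term-frequency vectors. It is written from the textbook definition and is not the SimMetrics API; the class name and the simple word tokenization are assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class CosineSimilaritySketch {
    // Counts how often each lowercase word occurs in the text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) freq.merge(word, 1, Integer::sum);
        }
        return freq;
    }

    // cos(a, b) = (a . b) / (|a| * |b|); defined here as 0 when either text has no words.
    static double cosine(String a, String b) {
        Map<String, Integer> fa = termFrequencies(a);
        Map<String, Integer> fb = termFrequencies(b);
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : fa.entrySet()) {
            dot += e.getValue() * fb.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : fb.values()) normB += v * v;
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // One shared word out of two in each text gives a similarity of about 0.5.
        System.out.println(cosine("information retrieval", "information systems"));
    }
}
```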
