
Software Libraries and Systems

Software libraries and systems. Information Retrieval, Michal Laclavík. IR tools: Nutch + Hadoop. IR API: Lucene. Information acquisition: crawler: Nutch; text operations: Lucene, GATE; indexing: Lucene; link processing: Nutch. Document base: converters, compression, encoding


Presentation Transcript


  1. Software Libraries and Systems
     Information Retrieval
     Michal Laclavík

  2. Tools
     • IR tools: Nutch + Hadoop
     • IR API: Lucene
     • Information acquisition
       • Crawler: Nutch
       • Text operations: Lucene, GATE
       • Indexing: Lucene
       • Link processing: Nutch
     • Document base: converters, compression, encoding
       • JavaMail
       • Tika: PDFBox, POI, TextMining
       • zip
     • Search
       • Query formulation and query operations: Solr
       • Query processing: Solr
       • Returning results to the user interface: Solr
       • User feedback: ?
     • Extraction
       • GATE
       • Ontea
       • Regular expressions
     Bratislava, 4 November 2013

  3. Tools
     • IR libraries & engines
       • Lucene
       • Egothor
     • Lucene
       • Nutch
       • Solr
       • Ports

  4. Lucene Indexing
       Directory dir = FSDirectory.open(new File(indexPath));
       Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
       IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43, analyzer);
       iwc.setOpenMode(OpenMode.CREATE);  // overwrite any existing index
       IndexWriter writer = new IndexWriter(dir, iwc);
     • IndexWriter
     • Directory
       • FSDirectory, RAMDirectory, MMapDirectory
     • Analyzer
     • Document
       • Collection of fields
     • Field
       • Keyword, UnIndexed, UnStored, Text
       Document doc = new Document();
       doc.add(new StringField("ctg", value, Field.Store.YES));    // indexed as a single token
       doc.add(new TextField(fieldName, value2, Field.Store.NO));  // analyzed, not stored
       // VecTextField is not a Lucene class; presumably a custom field type from the lecture code
       doc.add(new VecTextField("title", data, Field.Store.YES));
       writer.addDocument(doc);
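
To make the Document, Field, and Analyzer vocabulary of this slide more concrete, the following is a minimal sketch in plain Java (no Lucene on the classpath) of what indexing conceptually produces: a map from each term to the ids of the documents that contain it. The class `TinyIndex` and its methods are hypothetical names introduced only for this illustration, the tokenization is a crude lowercase split on non-letters, and Lucene's actual index structures are far more elaborate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; this is not how Lucene stores its index.
final class TinyIndex {
    // term -> sorted set of document ids (the "postings" of that term)
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Tokenizes the text, records every term under a fresh doc id, and returns that id. */
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Returns the ids of documents containing the term, in ascending order. */
    List<Integer> search(String term) {
        Set<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.addDocument("Lucene is a search library");
        index.addDocument("Nutch is built on Lucene");
        System.out.println(index.search("lucene")); // [0, 1]
        System.out.println(index.search("nutch"));  // [1]
    }
}
```

A lookup is a single map access, which is the reason an inverted index is built at indexing time rather than scanning documents at query time.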

  5. Lucene Indexing 2
     • Indexing dates
     • Boosting
       • Field.setBoost
     • Indexing numbers
       • Padding with zeros, analyzers
     • Sorting
       • Not tokenized, Field.Keyword
     • Directory
       • FSDirectory, RAMDirectory
     • Term vector
       • Field.UnStored("subject", subject, true);
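
The "padding with zeros" point is easiest to see in a small example. When numbers are indexed as plain text terms, they compare lexicographically, so "100" sorts before "42". The sketch below, in plain Java without Lucene, pads numbers to a fixed width so that string order matches numeric order. The class name `ZeroPad` and the width of 5 are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: fixed-width encoding of non-negative numbers as sortable strings.
final class ZeroPad {
    /** Left-pads a non-negative number with zeros to the given width. */
    static String pad(long n, int width) {
        if (n < 0) {
            throw new IllegalArgumentException("negative numbers need a different encoding");
        }
        String digits = Long.toString(n);
        if (digits.length() > width) {
            throw new IllegalArgumentException("number does not fit into width " + width);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = digits.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    public static void main(String[] args) {
        List<String> raw = new ArrayList<>(Arrays.asList("7", "42", "100"));
        Collections.sort(raw);
        System.out.println(raw);    // [100, 42, 7]  (lexicographic, not numeric)

        List<String> padded = new ArrayList<>();
        for (long n : new long[] {7, 42, 100}) {
            padded.add(pad(n, 5));
        }
        Collections.sort(padded);
        System.out.println(padded); // [00007, 00042, 00100]
    }
}
```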

  6. Lucene Searching
       Directory directory = FSDirectory.open(new File(index));
       IndexReader reader = DirectoryReader.open(directory);
       IndexSearcher searcher = new IndexSearcher(reader);
     • IndexSearcher
     • Term
     • Query
       • Boolean, Phrase, Prefix, Range, Fuzzy (Levenshtein)
     • TermQuery
     • Hits
       // Three exact-match clauses; the boosts weight name over alias and Wikipedia title
       BooleanQuery bq = new BooleanQuery();
       Query name = new TermQuery(new Term("name_exact", query));
       Query alias = new TermQuery(new Term("alias_exact", query));
       Query wiki = new TermQuery(new Term("wikipedia_exact", query));
       name.setBoost(0.40f);
       alias.setBoost(0.30f);
       wiki.setBoost(0.30f);
       bq.add(name, Occur.SHOULD);
       bq.add(alias, Occur.SHOULD);
       bq.add(wiki, Occur.SHOULD);
       queryL = bq;
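
The fuzzy query type on this slide is based on Levenshtein (edit) distance. As a reminder of what that measure computes, here is a sketch of the textbook dynamic-programming algorithm in plain Java. It is not Lucene's own implementation, and the class name `EditDistance` is hypothetical.

```java
// Textbook Levenshtein distance using two rolling rows of the DP table.
final class EditDistance {
    /** Minimum number of single-character insertions, deletions, and substitutions turning a into b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a: j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b: i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // 3
        System.out.println(levenshtein("lucene", "lucine"));  // 1
    }
}
```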

  7. Lucene Searching 2
     • Query q = QueryParser.parse("search", "field", new SimpleAnalyzer());
       (static call from the Lucene 1.x API; in 4.3: new QueryParser(Version.LUCENE_43, "field", analyzer).parse("search"))
     • +pubdate:[20040101 TO 20041231] Java AND (Jakarta OR Apache)
     • Query.toString()
     • Scoring
       • Similarity, DefaultSimilarity
     • Sorting
       • By field, by multiple fields
     • MultiFieldQueryParser
     • Filtering
       // Search several fields at once, each with its own boost
       String[] fields = new String[] {"name", "alias", "text", "wikipedia"};
       Map<String, Float> boosts = new HashMap<String, Float>();
       boosts.put("name", 0.40f);
       boosts.put("alias", 0.30f);
       boosts.put("text", 0.20f);
       boosts.put("wikipedia", 0.10f);
       MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_43, fields, analyzer, boosts);
       queryL = parser.parse(query);
       TopDocs results = s.search(queryL, topK);
       ScoreDoc[] hits = results.scoreDocs;
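
The range clause `pubdate:[20040101 TO 20041231]` works because dates written as yyyyMMdd strings sort in chronological order, so an inclusive range over terms is just a pair of string comparisons. The following plain-Java sketch shows that idea without Lucene. The class `DateTerms` and its method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: encode dates as sortable yyyyMMdd terms and test range membership.
final class DateTerms {
    /** Formats a date as yyyyMMdd, for example 2004-06-15 becomes "20040615". */
    static String toTerm(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    /** Inclusive range check on the string form, mirroring [lower TO upper]. */
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        String term = toTerm(LocalDate.of(2004, 6, 15));
        System.out.println(term);                                  // 20040615
        System.out.println(inRange(term, "20040101", "20041231")); // true
    }
}
```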

  8. Lucene Searching 3
     • Custom sort method
     • Distance search

  9. Lucene Analysis
     • Input: XY&Z Corporation – xyz@example.com
     • WhitespaceAnalyzer
       • [XY&Z] [Corporation] [–] [xyz@example.com]
     • SimpleAnalyzer (discards digits, lowercases)
       • [xy] [z] [corporation] [xyz] [example] [com]
     • StopAnalyzer
       • [xy] [z] [corporation] [xyz] [example] [com]
     • StandardAnalyzer
       • [xy&z] [corporation] [xyz@example.com]
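
The differences between the first three analyzers can be imitated in a few lines of plain Java. The sketch below is only an approximation written for this example, not Lucene's code: `whitespace` splits on whitespace, `simple` splits on non-letters and lowercases, and `stop` additionally removes words from a small stop list that was chosen here as an assumption. The class name `ToyAnalyzers` is hypothetical, and an ASCII hyphen stands in for the dash of the slide's input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Rough imitations of three analyzers, for illustration only.
final class ToyAnalyzers {
    // Small stop-word list assumed for this example; Lucene ships its own list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    /** Splits on runs of whitespace and keeps tokens unchanged. */
    static List<String> whitespace(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    /** Splits on anything that is not a letter (so digits vanish) and lowercases. */
    static List<String> simple(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                out.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    /** Same as simple(), then drops stop words. */
    static List<String> stop(String text) {
        List<String> out = simple(text);
        out.removeAll(STOP_WORDS);
        return out;
    }

    public static void main(String[] args) {
        String input = "XY&Z Corporation - xyz@example.com";
        System.out.println(whitespace(input)); // [XY&Z, Corporation, -, xyz@example.com]
        System.out.println(simple(input));     // [xy, z, corporation, xyz, example, com]
        System.out.println(stop("The " + input));
    }
}
```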

  10. Lucene Analysis 2
     • Indexing
     • Querying
       • Query parsing; query terms (TermQuery) are not analyzed
     • Results
       • Tokens: position, type
       • Terms: position
     • TokenStream, Tokenizer, TokenFilter

  11. Lucene Analysis 3
     • Synonyms, aliases
       • Same position (phrase query)
     • UTF-8
       • Encodings, HTML characters
       • Content-type
     • Nutch analysis
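
The "same position" remark refers to placing a synonym at the same token position as the original word, so that a phrase query matches whichever variant the user typed. In Lucene this is expressed through a position increment of 0. The plain-Java sketch below only illustrates the bookkeeping. The class `SynonymPositions`, the `term@position` output format, and the synonym map are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustration of synonym injection with shared token positions.
final class SynonymPositions {
    /** Returns "term@position" entries; synonyms reuse the position of the original token. */
    static List<String> expand(List<String> tokens, Map<String, List<String>> synonyms) {
        List<String> out = new ArrayList<>();
        int position = 0;
        for (String token : tokens) {
            position++; // an original token advances the position by one
            out.add(token + "@" + position);
            List<String> alternatives = synonyms.getOrDefault(token, Collections.<String>emptyList());
            for (String synonym : alternatives) {
                out.add(synonym + "@" + position); // position increment 0: same slot
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms =
                Collections.singletonMap("quick", Arrays.asList("fast", "speedy"));
        System.out.println(expand(Arrays.asList("quick", "fox"), synonyms));
        // [quick@1, fast@1, speedy@1, fox@2]
    }
}
```

Because "fast" and "quick" share position 1 and "fox" sits at position 2, both phrases "quick fox" and "fast fox" line up as adjacent positions.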

  12. Sandbox
     • Development tools
     • Lucli CLI
     • Luke (index toolbox)
     • Snowball analyzer
     • T9 indexing example
     • Highlighter
     • Berkeley DB

  13. Lucene Document Formats
     • Apache Tika
     • XML
       • SAX parser (Xerces)
       • Digester (Apache Jakarta)
     • PDF
       • PDFBox.org
       • Built-in support
     • HTML
       • JTidy.sf.net
       • NekoHTML
     • Word
       • POI (Jakarta project)
       • TextMining.org
     • RTF
       • javax.swing.text.rtf
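
For the XML case, a SAX parser is enough to pull out the text that would then be handed to the indexer. The sketch below uses only the JDK's built-in SAX support (`javax.xml.parsers` and `org.xml.sax`) and collects all character data into one whitespace-normalized string. The class `XmlText` and the sample document are hypothetical; a real converter would typically map elements to separate index fields.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Minimal SAX-based text extraction, assuming the whole document fits in memory as a string.
final class XmlText {
    /** Returns the concatenated character data of the document, separated by single spaces. */
    static String extract(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                text.append(' '); // keep text of neighbouring elements apart
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String xml = "<doc><title>Lucene</title><body>in Action</body></doc>";
        System.out.println(extract(xml)); // Lucene in Action
    }
}
```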

  14. Lucene Ports
     • CLucene
     • dotLucene
     • Plucene (Perl)
     • Lupy (Python)
     • PyLucene (GCJ + SWIG)

  15. Nutch
     • Built on Lucene
     • Fetcher, searcher interface
     • Scalable to several billion
     • Ranking ???
     • Hadoop
       • MapReduce implementation
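
Since Hadoop is introduced here as a MapReduce implementation, a tiny in-memory sketch of the programming model may help. It does not use the Hadoop API at all: `map` emits (word, 1) pairs for one line, the grouping step plays the role of the shuffle, and `reduce` sums the counts per word. The class `MiniMapReduce` and its method names are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Single-process word count structured as map, shuffle, and reduce phases.
final class MiniMapReduce {
    /** Map phase: emits one (word, 1) pair per word of the input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Reduce phase: sums all counts collected for one word. */
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    /** Runs map on every line, groups values by key (shuffle), then reduces each group. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffled.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("nutch uses hadoop", "hadoop runs mapreduce")));
        // {hadoop=2, mapreduce=1, nutch=1, runs=1, uses=1}
    }
}
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the structure of the computation is the same.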

  16. Other Use Cases
     • JGuru
     • SearchBlox
     • Alias-i

  17. Linux Tools
     • catdoc
       • xls, doc
     • OpenOffice
     • pdftotext (Xpdf)
     • Encoding
       • enca

  18. Other Libraries
     • QTag
       • POS tagging
     • Stemming
       • Snowball
       • Porter
       • Tvaroslovnik, JULS
     • SimMetrics
       • Similarity measures: Levenshtein, cosine measure
     • GATE
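
Of the similarity measures listed for SimMetrics, the cosine measure is simple to write down. The sketch below is plain Java and does not use SimMetrics. It builds term-frequency vectors from whitespace-separated lowercase tokens and returns the cosine of the angle between them. The class `Cosine` and the tokenization rule are assumptions for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Cosine similarity over term-frequency vectors of two short texts.
final class Cosine {
    /** Counts how often each whitespace-separated, lowercased token occurs. */
    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                freq.merge(token, 1, Integer::sum);
            }
        }
        return freq;
    }

    /** Returns a value in [0, 1]; 0 when the texts share no token or one of them is empty. */
    static double similarity(String a, String b) {
        Map<String, Integer> fa = termFreq(a);
        Map<String, Integer> fb = termFreq(b);
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : fa.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = fb.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : fb.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(similarity("apache lucene", "apache nutch"));
        System.out.println(similarity("apache lucene", "gate ontea"));
    }
}
```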

  19. Demo
     • In the previous lecture
     • GATE
       • http://gate.ac.uk/sale/talks/gate-course-july09/slides-pdf/slides.html
     • Lucene

  20. Other Tools
     • Apache UIMA
       • Text processing (information extraction)
     • OpenNLP
       • Machine learning for text analysis, e.g. information extraction
     • MOSES
       • Machine-learning-based language translation

  21. Available Data Resources in Slovak
     • Corpus
       • http://korpus.juls.savba.sk/
     • Organizations with data resources in multiple languages, usable for machine translation
       • http://www.tasr.sk/
       • http://www.sita.sk
       • http://www.skrivanek.com/
     • Freely available resources:
       • http://sk.wikipedia.org
       • http://sk.wiktionary.org
     • Dictionaries
       • http://slovnik.azet.sk/
       • http://slovniky.lingea.sk/
       • http://www.sk-spell.sk.cx/mass-msas
     • Data
       • http://sk-spell.sk.cx/
       • http://www.sk-spell.sk.cx/thesaurus/
       • http://www.sk-spell.sk.cx/biblia-sk/
       • http://www.sk-spell.sk.cx/OCR
