NL search: hype or reality?

Università di Pisa

Hakia

Hakia’s Aims and Benefits Hakia is building the Web’s new “meaning-based” search engine with the sole purpose of improving search relevancy and interactivity, pushing the current boundaries of Web search. The benefits to the end user are search efficiency, richness of information, and time savings.

Hakia’s Promise The basic promise is to bring search results by meaning match - similar to the human brain's cognitive skills - rather than by the mere occurrence (or popularity) of search terms. Hakia’s new technology is a radical departure from the conventional indexing approach, because indexing has severe limitations in handling full-scale semantic search.

Hakia’s Appeal Hakia’s capabilities will appeal to all Web searchers - especially those engaged in research on knowledge intensive subjects, such as medicine, law, finance, science, and literature.

Hakia “meaning-based” search

Ontological Semantics • A formal and comprehensive linguistic theory of meaning in natural language • A set of resources, including: • a language-independent ontology of 8,000 interrelated concepts • an ontology-based English lexicon of 100,000 word senses • an ontological parser which "translates" every sentence of the text into its text meaning representation • an acquisition toolbox which keeps the ontological concepts and lexical entries homogeneous across different acquirers with limited training

OntoSem Lexicon Example: Bow
(bow-n1
  (cat n)
  (anno (def "instrument for archery"))
  (syn-struc ((root $var0) (cat n)))
  (sem-struc (bow)))
(bow-n2
  (cat n)
  (anno (def "part of string-instruments"))
  (syn-struc ((root $var0) (cat n)))
  (sem-struc (stringed-instrument-bow)))

Lexicon (Bow)
(bow-v1
  (cat v)
  (anno (def "to give in to someone or something"))
  (syn-struc
    ((subject ((root $var2) (cat np)))
     (root $var0)
     (cat v)
     (pp-adjunct ((root to) (cat prep) (obj ((root $var3) (cat np)))))))
  (sem-struc
    (yield-to
      (agent (value ^$var2))
      (caused-by (value ^$var3)))))

QDEX • QDEX extracts all possible queries that can be asked of a Web page, at various lengths and forms • These queries (sequences) become gateways to the originating documents, paragraphs and sentences during retrieval
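The slide does not say how QDEX actually extracts its sequences, so the following is only a minimal sketch of the "sequences as gateways" idea. It assumes contiguous word sequences of up to `max_len` words, and the names `extract_sequences` and `build_gateways` are hypothetical, not part of Hakia's system.

```python
from collections import defaultdict

def extract_sequences(sentence, max_len=3):
    """Yield every contiguous word sequence of 1..max_len words.

    Assumption: real QDEX filters sequences semantically; here we keep them all.
    """
    words = sentence.lower().split()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def build_gateways(sentences, max_len=3):
    """Map each sequence to the ids of the sentences it came from."""
    gateways = defaultdict(list)
    for sentence_id, sentence in enumerate(sentences):
        for sequence in extract_sequences(sentence, max_len):
            gateways[sequence].append(sentence_id)
    return gateways

gateways = build_gateways(["IBM acquired Lotus", "Lotus made software"])
print(gateways["ibm acquired"])  # sentences reachable through this sequence
print(gateways["lotus"])
```

At query time only the entry for the query's own sequence needs to be touched, which is one way to read the "tiny active set" claim.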

QDEX vs Inverted Index • An inverted index has a huge “active” data set prior to a query from the user. • Enriching this data set with semantic equivalences (concept relations) will further increase the operational burden in an exponential manner. • QDEX has a tiny active set for each query and semantic associations can be easily handled on-the-fly.

QDEX combinatorics • The critical point in the QDEX system is to decompose sentences into a handful of meaningful sequences without getting lost in the combinatorial explosion. • For example, a sentence with 8 significant words can generate over a billion sequences (of 1, 2, 3, 4, 5, and 6 words), of which only a few dozen make sense by human inspection. • The challenge is how to reduce a billion possibilities to the few dozen that make sense. Hakia uses OntoSem technology to meet this challenge.

Semantic Rank • A pool of relevant paragraphs comes from the QDEX system for the given query terms • Final relevancy is determined by an advanced sentence analysis and a concept match between the query and the best sentence of each paragraph • Morphological and syntactic analyses are also performed • No keyword matching or Boolean algebra is involved • The credibility and age of the Web page are also taken into account

PowerSet

NL Question on Wikipedia • What companies did IBM acquire? • Which company did IBM acquire in 1989? • Google query on Wikipedia: same queries, poorer results • Powerset Demo

Try yourself • Who acquired IBM? • IBM acquisitions 1996 • IBM acquisitions • What do liberal democrats say about healthcare • 1.4 million matches

Problems • The parser from Xerox is a quite sophisticated constituent parser: • it produces all possible parse trees • it is fairly slow • Workaround: index only the most relevant portion of the Web

Reality

Semantic Document Analysis • Question Answering • Return precise answer to natural language queries • Relation Extraction • Intent Mining • assess the attitude of the document author with respect to a given subject • Opinion mining: attitude is a positive or negative opinion

Semantic Retrieval Approaches • Used in QA, Opinion Retrieval, etc. • Typical 2-stage approach: • Perform IR and rank by topic relevance • Postprocess results with filters and rerank • Generally slow: • Requires several minutes to process each query

Single stage approach • Single-stage approach: • Enrich the index with opinion tags • Perform normal retrieval with custom ranking function • Proved effective at TREC 2006 Blog Opinion Mining Task

Enriched Index for TREC Blog • Overlay words with tags

Enhanced Queries • music NEGATIVE:lame • music NEGATIVE:* • Achieved 3rd best P@5 at TREC Blog Track 2006
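A minimal sketch of how an index enriched with opinion tags could answer queries such as `music NEGATIVE:lame` and `music NEGATIVE:*`. It assumes each tag is stored as an extra term at the same position as the word it annotates; the `TAG/` prefix, the function names and the toy documents are illustrative assumptions, not the TREC system's actual format.

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: [(word, tag_or_None), ...]} -> term -> {(doc_id, position)}"""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for pos, (word, tag) in enumerate(tokens):
            index[word].add((doc_id, pos))
            if tag:
                # The tag is overlaid on the word: same document, same position.
                index["TAG/" + tag].add((doc_id, pos))
    return index

def search(index, word, tag, tagged_word="*"):
    """Documents containing `word` plus a position tagged `tag`
    (optionally restricted to a specific tagged word)."""
    tagged = index.get("TAG/" + tag, set())
    if tagged_word != "*":
        tagged = tagged & index.get(tagged_word, set())
    docs_with_word = {doc_id for doc_id, _ in index.get(word, set())}
    return sorted({doc_id for doc_id, _ in tagged} & docs_with_word)

docs = {
    1: [("music", None), ("is", None), ("lame", "NEGATIVE")],
    2: [("music", None), ("is", None), ("great", "POSITIVE")],
    3: [("music", None), ("is", None), ("boring", "NEGATIVE")],
    4: [("lame", None), ("duck", None), ("plays", None), ("music", None)],
}
index = build_index(docs)
print(search(index, "music", "NEGATIVE", "lame"))  # music NEGATIVE:lame
print(search(index, "music", "NEGATIVE"))          # music NEGATIVE:*
```

Document 4 contains "lame" without the tag, so it does not match the first query; this is the distinction a plain keyword index cannot make.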

Enriched Inverted Index

Inverted Index • Stored compressed • ~1 byte per term occurrence • Efficient intersection operation • O(n) where n is the length of shortest postings list • Using skip lists further reduces cost • Size: ~ 1/8 original text

Small Adaptive Set Intersection
world: 2 4 6 21 30 35 40
wide: 3 9 12 20 40 47
web: 1 8 10 25 40 41
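A simplified sketch in the spirit of adaptive set intersection: a candidate document id is checked against each posting list with a galloping (doubling) search, and any mismatch immediately supplies a larger candidate. This is not IXE's implementation. The three lists are one reading of the numbers on the slide, taken column by column for "world", "wide" and "web", and the sketch assumes sorted lists without duplicates.

```python
from bisect import bisect_left

def gallop(lst, target, lo):
    """Smallest index i >= lo with lst[i] >= target (len(lst) if none)."""
    step, hi = 1, lo
    while hi < len(lst) and lst[hi] < target:
        lo = hi + 1          # everything up to hi is known to be too small
        hi += step
        step *= 2            # double the stride on every miss
    return bisect_left(lst, target, lo, min(hi, len(lst)))

def intersect(postings):
    """Intersect sorted, duplicate-free posting lists."""
    lists = sorted(postings, key=len)      # start from the shortest list
    if not lists or not lists[0]:
        return []
    cursors = [0] * len(lists)
    result = []
    candidate = lists[0][0]
    while True:
        matched = True
        for k, lst in enumerate(lists):
            cursors[k] = gallop(lst, candidate, cursors[k])
            if cursors[k] == len(lst):
                return result              # one list is exhausted
            if lst[cursors[k]] != candidate:
                candidate = lst[cursors[k]]  # mismatch: jump to a larger candidate
                matched = False
                break
        if matched:
            result.append(candidate)
            cursors[0] += 1
            if cursors[0] == len(lists[0]):
                return result
            candidate = lists[0][cursors[0]]

world = [2, 4, 6, 21, 30, 35, 40]
wide = [3, 9, 12, 20, 40, 47]
web = [1, 8, 10, 25, 40, 41]
print(intersect([world, wide, web]))  # [40]
```

The galloping search plays the role that skip lists play in a compressed index: long runs of non-matching postings are jumped over rather than scanned.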

IXE Search Engine Library • C++ OO architecture • Fast indexing • Sort-based inversion • Fast search • Efficient algorithms and data structures • Query Compiler • Small Adaptive Set Intersection • Suffix array with supra index • Memory mapped index files • Programmable API library • Template metaprogramming • Object Store Data Base

IXE Performance • TREC TeraByte 2005: • 2nd fastest • 2nd best P@5

Query Processing • Query compiler • One cursor on posting lists for each node • CursorWord, CursorAnd, CursorOr, CursorPhrase • QueryCursor.next(Result& min) • Returns first result r >= min • Single operator for all kinds of queries, e.g. proximity
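A minimal Python sketch of the cursor idea, assuming each cursor exposes a single `next(minimum)` that returns the first result at or after `minimum`. The class names mirror the slide, but the bodies are illustrative and not IXE's C++ code; results are plain document ids, and callers are assumed to pass non-decreasing minimums.

```python
from bisect import bisect_left

class CursorWord:
    """Cursor over one sorted posting list."""
    def __init__(self, postings):
        self.postings = postings
        self.pos = 0

    def next(self, minimum):
        # Advance to the first posting >= minimum, never moving backwards.
        self.pos = bisect_left(self.postings, minimum, self.pos)
        return self.postings[self.pos] if self.pos < len(self.postings) else None

class CursorAnd:
    """All children must agree on the same result."""
    def __init__(self, children):
        self.children = children

    def next(self, minimum):
        candidate = minimum
        while True:
            agreed = True
            for child in self.children:
                r = child.next(candidate)
                if r is None:
                    return None
                if r != candidate:
                    candidate, agreed = r, False  # raise the bar and retry
            if agreed:
                return candidate

class CursorOr:
    """The smallest result offered by any child."""
    def __init__(self, children):
        self.children = children

    def next(self, minimum):
        results = [r for r in (c.next(minimum) for c in self.children) if r is not None]
        return min(results) if results else None

def run(cursor):
    """Drain a cursor by repeatedly asking for the next result past the last one."""
    out, r = [], cursor.next(0)
    while r is not None:
        out.append(r)
        r = cursor.next(r + 1)
    return out

print(run(CursorAnd([CursorWord([1, 4, 7, 9]), CursorWord([4, 5, 9])])))  # [4, 9]
```

Because every node answers the same `next(minimum)` call, a compiled query is just a tree of cursors, and new operators (phrase, proximity) can be added without changing the driver loop.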

IXE Composability (class diagram) • DocInfo: name, date, size • PassageDoc: text, boundaries • Collection<DocInfo>, Collection<PassageDoc> • Cursor: next() • QueryCursor: next() • PassageQueryCursor: next()

Passage Retrieval • Documents are split into passages • Matches are searched in passages ± n nearby • Results are ranked passages • Efficiency requires special store for passage boundaries
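One possible reading of "passages ± n nearby", sketched under stated assumptions: passage boundaries are stored as the word offset at which each passage starts, and a passage matches when every query term occurs somewhere within n passages of it. The function names and the boundary layout are hypothetical.

```python
from bisect import bisect_right

def passage_of(boundaries, position):
    """Index of the passage containing a word offset.

    boundaries[i] is the offset where passage i starts (boundaries[0] == 0).
    """
    return bisect_right(boundaries, position) - 1

def matching_passages(boundaries, term_positions, n=1):
    """Passages p such that every term occurs in some passage in [p - n, p + n].

    term_positions holds one list of word offsets per query term.
    """
    per_term = [{passage_of(boundaries, pos) for pos in positions}
                for positions in term_positions]
    hits = []
    for p in range(len(boundaries)):
        window = range(p - n, p + n + 1)
        if all(any(q in passages for q in window) for passages in per_term):
            hits.append(p)
    return hits

boundaries = [0, 10, 20, 30]   # four passages of ten words each
term_a = [3]                   # occurs in passage 0
term_b = [25]                  # occurs in passage 2
print(matching_passages(boundaries, [term_a, term_b], n=1))  # [1]
```

Keeping the boundaries in a compact sorted array is what makes the position-to-passage lookup a single binary search, which is why a dedicated boundary store matters for efficiency.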

QA Using Dependency Relations • Build dependency trees for both question and answer • Determine similarity of corresponding paths in dependency trees of question and answer

PiQASso Answer Matching
Question: What metal has the highest melting point? (dependency labels in the figure: obj, mod, sub, mod)
1 Parsing: Tungsten is a very dense material and has the highest melting point of any metal.
2 Answer type check: SUBSTANCE
3 Relation extraction: <tungsten, material, pred> <tungsten, has, subj> <point, has, obj> …
4 Matching Distance
5 Distance Filtering
6 Popularity Ranking
ANSWER: Tungsten

QA Using Dependency Relations • Further developed by Cui et al., NUS • Score computed by a statistical translation model • Second best at TREC 2004

Wikipedia Experiment • Tagged Wikipedia with: • POS • LEMMA • NE (WSJ, IEER) • WN Super Senses • Anaphora • Parsing (head, dependency)

Tools Used • SST tagger [Ciaramita & Altun] • DeSR dependency parser [Attardi & Ciaramita] • Fast: 200 sentences/sec • Accurate: 90% UAS

Dependency Parsing • Produces dependency trees • Word-word dependency relations • Far easier to understand and to annotate • Example: Rolls-Royce Inc. said it expects its sales to remain steady (arc labels in the figure: SUBJ, OBJ, MOD, TO)

Classifier-based Shift-Reduce Parsing • Parser state: top (of the stack) and next (input token) • Actions: Shift, Left, Right • Example: He/PP saw/VVD a/DT girl/NN with/IN a/DT telescope/NNS ./SENT
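A minimal sketch of the shift-reduce mechanics, with a scripted action sequence standing in for the classifier. The slide does not define Left and Right, so the sketch follows the common arc-standard convention (Left makes the stack top a dependent of the next token; Right makes the next token a dependent of the stack top and puts the head back on the input); DeSR's own definitions may differ.

```python
def parse(words, actions):
    """Apply shift/left/right actions; return {dependent_index: head_index}."""
    stack = []
    buffer = list(range(len(words)))   # token indices still to be read
    heads = {}
    for action in actions:
        if action == "shift":
            stack.append(buffer.pop(0))
        elif action == "left":
            # The stack top becomes a dependent of the next input token.
            dependent = stack.pop()
            heads[dependent] = buffer[0]
        elif action == "right":
            # The next input token becomes a dependent of the stack top,
            # and the head returns to the input so it can attach further.
            head = stack.pop()
            dependent = buffer.pop(0)
            heads[dependent] = head
            buffer.insert(0, head)
        else:
            raise ValueError(f"unknown action: {action}")
    return heads

words = ["He", "saw", "a", "girl"]
# In a classifier-based parser, each action would be predicted from features
# of the current state; here the sequence is written by hand.
actions = ["shift", "left", "shift", "shift", "left", "right", "shift"]
for dependent, head in sorted(parse(words, actions).items()):
    print(f"{words[dependent]} <- {words[head]}")
```

The parser makes one pass over the sentence and one classifier call per action, which is what makes this family of parsers fast enough to process a corpus the size of Wikipedia.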

CoNLL 2007 Results

EvalIta 2007 Results Best statistical parser

Experiment

Experimental data sets • Wikipedia • Yahoo! Answers

English Wikipedia Indexing • Original size: 4.4 GB • Number of articles: 1,400,000 • Tagging time: ~3 days (6 days with previous tools) • Parsing time: 40 hours • Indexing time: 9 hours (8 days with UIMA + Lucene) • Index size: 3 GB • Metadata: 12 GB

Scaling Indexing • Highly parallelizable • Using Hadoop in stream mode

Example (partial)

Stacked View

Implementation • Special version of Passage Retrieval • Tags are overlaid on words • Treated as terms in the same position as the corresponding word • Not counted, to avoid skewing TF/IDF • Given an ID in the lexicon • Retrieval is fast: • A few msec per query on a 10 GB index • Provided as both Linux library and Windows DLL

Java Interface • Generated using SWIG • Results accessible through a ResultIterator • List of terms or tags for a sentence generated on demand

Proximity queries • Did France win the World Cup? • proximity 15 [MORPH/win:* DEP/SUB:france 'world cup'] • Born in the French territory of New Caledonia, he was a vital player in the French team that won the 1998 World Cup and was on the squad, but played just one game, as France won Euro 2000. • France repeated the feat of Argentina in 1998, by taking the title as they won their home 1998 World Cup, beating Brazil. • Both England (1966) and France (1998) won their only World Cups whilst playing as host nations.
