Università di Pisa NL search: hype or reality?
Hakia’s Aims and Benefits Hakia is building the Web’s new “meaning-based” search engine with the sole purpose of improving search relevancy and interactivity, pushing the current boundaries of Web search. The benefits to the end user are search efficiency, richness of information, and time savings.
Hakia’s Promise The basic promise is to bring search results by meaning match - similar to the human brain's cognitive skills - rather than by the mere occurrence (or popularity) of search terms. Hakia’s new technology is a radical departure from the conventional indexing approach, because indexing has severe limitations in handling full-scale semantic search.
Hakia’s Appeal Hakia’s capabilities will appeal to all Web searchers - especially those engaged in research on knowledge intensive subjects, such as medicine, law, finance, science, and literature.
Ontological Semantics • A formal and comprehensive linguistic theory of meaning in natural language • A set of resources, including: • a language-independent ontology of 8,000 interrelated concepts • an ontology-based English lexicon of 100,000 word senses • an ontological parser which "translates" every sentence of the text into its text meaning representation • an acquisition toolbox that keeps ontological concepts and lexical entries homogeneous across different acquirers with limited training
OntoSem Lexicon Example
Bow
(bow-n1
  (cat n)
  (anno (def "instrument for archery"))
  (syn-struc ((root $var0) (cat n)))
  (sem-struc (bow)))
(bow-n2
  (cat n)
  (anno (def "part of string-instruments"))
  (syn-struc ((root $var0) (cat n)))
  (sem-struc (stringed-instrument-bow)))
Lexicon (Bow)
(bow-v1
  (cat v)
  (anno (def "to give in to someone or something"))
  (syn-struc
    ((subject ((root $var2) (cat np)))
     (root $var0)
     (cat v)
     (pp-adjunct ((root to) (cat prep) (obj ((root $var3) (cat np)))))))
  (sem-struc
    (yield-to
      (agent (value ^$var2))
      (caused-by (value ^$var3)))))
QDEX • QDEX extracts all possible queries that can be asked of a Web page, at various lengths and forms • Queries (sequences) become gateways to the originating documents, paragraphs and sentences during retrieval
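The slide does not describe how QDEX works internally, so the following is only a naive illustration of the stated idea: word sequences extracted from a page act as keys that lead back to the originating sentences. The function names, the tiny stopword list, and the choice of contiguous sequences are all assumptions made for this sketch; no semantic filtering is attempted.

```python
from collections import defaultdict

# Hypothetical stopword list, used only to keep "significant" words.
STOPWORDS = {"the", "a", "an", "of", "is", "in", "and", "to"}

def extract_sequences(sentence, max_len=3):
    """Yield contiguous sequences of 1..max_len significant words."""
    words = [w.strip(".,?!()").lower() for w in sentence.split()]
    words = [w for w in words if w and w not in STOPWORDS]
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def build_sequence_table(sentences, max_len=3):
    """Map each extracted sequence to the IDs of the sentences it came from."""
    table = defaultdict(set)
    for sentence_id, sentence in enumerate(sentences):
        for sequence in extract_sequences(sentence, max_len):
            table[sequence].add(sentence_id)
    return table

sentences = [
    "Tungsten has the highest melting point of any metal.",
    "France won the 1998 World Cup.",
]
table = build_sequence_table(sentences)
print(sorted(table["highest melting point"]))
```

At query time a lookup touches only the entries for the sequences in the query, which is one way to read the "tiny active set" claim made on the next slide.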
QDEX vs Inverted Index • An inverted index has a huge “active” data set prior to a query from the user. • Enriching this data set with semantic equivalences (concept relations) will further increase the operational burden in an exponential manner. • QDEX has a tiny active set for each query and semantic associations can be easily handled on-the-fly.
QDEX combinatorics • The critical point in the QDEX system is to decompose sentences into a handful of meaningful sequences without getting lost in the combinatorial explosion. • For example, a sentence with 8 significant words can generate over a billion sequences (of 1, 2, 3, 4, 5, and 6 words), of which only a few dozen make sense on human inspection. • The challenge is how to reduce a billion possibilities to the few dozen that make sense. Hakia uses OntoSem technology to meet this challenge.
Semantic Rank • A pool of relevant paragraphs comes from the QDEX system for the given query terms • Final relevancy is determined by an advanced sentence analysis and a concept match between the query and the best sentence of each paragraph • Morphological and syntactic analyses are also performed • No keyword matching or Boolean algebra is involved • The credibility and age of the Web page are also taken into account
Powerset Demo • NL question on Wikipedia: • What companies did IBM acquire? • Which company did IBM acquire in 1989? • Google query on Wikipedia: • Same queries • Poorer results
Try yourself • Who acquired IBM? • IBM acquisitions 1996 • IBM acquisitions • What do liberal democrats say about healthcare • 1.4 million matches
Problems • The parser from Xerox is a quite sophisticated constituent parser: • it produces all possible parse trees • fairly slow • Workaround: index only the most relevant portion of the Web
Semantic Document Analysis • Question Answering • Return precise answer to natural language queries • Relation Extraction • Intent Mining • assess the attitude of the document author with respect to a given subject • Opinion mining: attitude is a positive or negative opinion
Semantic Retrieval Approaches • Used in QA, Opinion Retrieval, etc. • Typical 2-stage approach: • Perform IR and rank by topic relevance • Postprocess results with filters and rerank • Generally slow: • Requires several minutes to process each query
Single stage approach • Single-stage approach: • Enrich the index with opinion tags • Perform normal retrieval with custom ranking function • Proved effective at TREC 2006 Blog Opinion Mining Task
Enriched Index for TREC Blog • Overlay words with tags
Enhanced Queries • music NEGATIVE:lame • music NEGATIVE:* • Achieved 3rd best P@5 at TREC Blog Track 2006
Inverted Index • Stored compressed • ~1 byte per term occurrence • Efficient intersection operation • O(n) where n is the length of shortest postings list • Using skip lists further reduces cost • Size: ~ 1/8 original text
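The slide does not say which compression scheme is used. A common way to get close to one byte per entry is to store the gaps between sorted document IDs with variable-byte encoding, so the following is a minimal sketch under that assumption (function names are hypothetical):

```python
def encode_postings(doc_ids):
    """Encode sorted doc IDs as gaps, 7 bits per byte, low-order bits first.

    The high bit is set on the last byte of each gap.
    """
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        gap = doc_id - previous
        previous = doc_id
        while gap >= 128:
            out.append(gap & 0x7F)
            gap >>= 7
        out.append(gap | 0x80)  # terminator byte for this gap
    return bytes(out)

def decode_postings(data):
    """Invert encode_postings: rebuild the sorted list of doc IDs."""
    doc_ids = []
    value = shift = previous = 0
    for byte in data:
        if byte & 0x80:  # last byte of the current gap
            value |= (byte & 0x7F) << shift
            previous += value
            doc_ids.append(previous)
            value = shift = 0
        else:
            value |= byte << shift
            shift += 7
    return doc_ids

postings = [3, 7, 130, 131, 20000]
encoded = encode_postings(postings)
print(len(encoded), "bytes for", len(postings), "postings")
```

Gaps below 128 take a single byte, which is the typical case for frequent terms.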
Small Adaptive Set Intersection • world: 2, 4, 6, 21, 30, 35, 40 • wide: 3, 9, 12, 20, 40, 47 • web: 1, 8, 10, 25, 40, 41
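A simplified sketch in the spirit of Small Adaptive intersection: at each step the list with the fewest remaining elements supplies a candidate, which is then looked up in the other lists by galloping (doubling) search. This is not taken from the IXE source. The posting lists are the numbers on the slide read column by column, which is an assumption about the layout of the original figure.

```python
from bisect import bisect_left

def gallop(lst, target, lo):
    """Return the first index i >= lo with lst[i] >= target.

    Assumes every element before `lo` is smaller than `target`.
    """
    step, hi = 1, lo
    while hi < len(lst) and lst[hi] < target:
        lo = hi + 1
        hi += step
        step *= 2
    return bisect_left(lst, target, lo, min(hi, len(lst)))

def small_adaptive_intersect(lists):
    """Intersect sorted lists of unique integers."""
    if not lists:
        return []
    pos = [0] * len(lists)
    result = []
    while True:
        # Order the lists by how many elements remain unread.
        order = sorted(range(len(lists)), key=lambda i: len(lists[i]) - pos[i])
        smallest = order[0]
        if pos[smallest] >= len(lists[smallest]):
            return result
        candidate = lists[smallest][pos[smallest]]
        pos[smallest] += 1
        matched = True
        for i in order[1:]:
            pos[i] = gallop(lists[i], candidate, pos[i])
            if pos[i] >= len(lists[i]):
                return result  # one list is exhausted: nothing more can match
            if lists[i][pos[i]] != candidate:
                matched = False
                break
            pos[i] += 1
        if matched:
            result.append(candidate)

world = [2, 4, 6, 21, 30, 35, 40]
wide = [3, 9, 12, 20, 40, 47]
web = [1, 8, 10, 25, 40, 41]
print(small_adaptive_intersect([world, wide, web]))
```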
IXE Search Engine Library • C++ OO architecture • Fast indexing • Sort-based inversion • Fast search • Efficient algorithms and data structures • Query Compiler • Small Adaptive Set Intersection • Suffix array with supra index • Memory mapped index files • Programmable API library • Template metaprogramming • Object Store Data Base
IXE Performance • TREC TeraByte 2005: • 2nd fastest • 2nd best P@5
Query Processing • Query compiler • One cursor on posting lists for each node • CursorWord, CursorAnd, CursorOr, CursorPhrase • QueryCursor.next(Result& min) • Returns first result r >= min • Single operator for all kinds of queries: e.g. proximity
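The cursor idea can be sketched as follows. Only the class names and the `next(min)` contract ("return the first result r >= min") come from the slide; the bodies are a Python illustration, not the IXE C++ implementation, and results are reduced to plain document IDs. The `run` helper is hypothetical.

```python
from bisect import bisect_left

class CursorWord:
    """Cursor over one sorted posting list of doc IDs."""
    def __init__(self, postings):
        self.postings = postings
        self.pos = 0

    def next(self, minimum):
        # Callers are assumed to pass non-decreasing minimums.
        self.pos = bisect_left(self.postings, minimum, self.pos)
        if self.pos == len(self.postings):
            return None
        return self.postings[self.pos]

class CursorAnd:
    """First doc ID >= minimum on which all child cursors agree."""
    def __init__(self, children):
        self.children = children

    def next(self, minimum):
        candidate = minimum
        while True:
            agreed = True
            for child in self.children:
                result = child.next(candidate)
                if result is None:
                    return None
                if result > candidate:
                    candidate, agreed = result, False
            if agreed:
                return candidate

class CursorOr:
    """Smallest doc ID >= minimum offered by any child cursor."""
    def __init__(self, children):
        self.children = children

    def next(self, minimum):
        results = [r for r in (c.next(minimum) for c in self.children) if r is not None]
        return min(results) if results else None

def run(cursor):
    """Drain a cursor by repeatedly asking for the next result."""
    results, minimum = [], 0
    while (r := cursor.next(minimum)) is not None:
        results.append(r)
        minimum = r + 1
    return results

a, b, c = [1, 3, 5, 7, 9], [3, 4, 5, 9, 10], [1, 7]
print(run(CursorAnd([CursorWord(a), CursorOr([CursorWord(b), CursorWord(c)])])))
```

Because every node exposes the same `next(min)` operation, a compiled query is just a tree of cursors, and each node can skip ahead in its children instead of scanning them.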
IXE Composability • DocInfo: name, date, size • PassageDoc: text, boundaries • Collection<DocInfo>, Collection<PassageDoc> • Cursor: next() • QueryCursor: next() • PassageQueryCursor: next()
Passage Retrieval • Documents are split into passages • Matches are searched in passages ± n nearby • Results are ranked passages • Efficiency requires special store for passage boundaries
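A minimal sketch of one reading of this slide: passage boundaries are kept as a sorted list of starting token positions, and a passage qualifies when every query term occurs somewhere in the passages within ± n of it. Both the data layout and that interpretation of "± n nearby" are assumptions, as are the function names.

```python
from bisect import bisect_right

def passage_of(boundaries, position):
    """Index of the passage containing a token position.

    `boundaries` holds the starting token position of each passage, sorted.
    """
    return bisect_right(boundaries, position) - 1

def matching_passages(boundaries, term_positions, n=1):
    """Passages whose +/- n neighbourhood contains every query term.

    `term_positions` maps each query term to its token positions in the document.
    """
    per_term = [
        {passage_of(boundaries, p) for p in positions}
        for positions in term_positions.values()
    ]
    hits = []
    for passage in range(len(boundaries)):
        window = range(passage - n, passage + n + 1)
        if all(any(w in passages for w in window) for passages in per_term):
            hits.append(passage)
    return hits

boundaries = [0, 10, 20, 30]            # four passages of ten tokens each
positions = {"france": [3], "cup": [14]}
print(matching_passages(boundaries, positions, n=1))
```

Keeping the boundaries in a compact sorted array makes the position-to-passage lookup a binary search, which hints at why a dedicated store for them matters for efficiency.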
QA Using Dependency Relations • Build dependency trees for both question and answer • Determine similarity of corresponding paths in dependency trees of question and answer
PiQASso Answer Matching • Question: What metal has the highest melting point? (dependency labels in the figure: obj, mod, sub, mod) • 1 Parsing: Tungsten is a very dense material and has the highest melting point of any metal. • 2 Answer type check: SUBSTANCE • 3 Relation extraction: <tungsten, material, pred> <tungsten, has, subj> <point, has, obj> … • 4 Matching Distance • 5 Distance Filtering • 6 Popularity Ranking • ANSWER: Tungsten
QA Using Dependency Relations • Further developed by Cui et al., NUS • Score computed by statistical translation model • Second best at TREC 2004
Wikipedia Experiment • Tagged Wikipedia with: • POS • LEMMA • NE (WSJ, IEER) • WN Super Senses • Anaphora • Parsing (head, dependency)
Tools Used • SST tagger [Ciaramita & Altun] • DeSR dependency parser [Attardi & Ciaramita] • Fast: 200 sentences/sec • Accurate: 90% UAS
Dependency Parsing • Produces dependency trees • Word-word dependency relations • Far easier to understand and to annotate • Example: Rolls-Royce Inc. said it expects its sales to remain steady (arc labels in the figure: SUBJ, OBJ, MOD, TO)
Classifier-based Shift-Reduce Parsing • Parser state: top (of the stack), next (input token) • Actions: Shift, Left, Right • Example: He/PP saw/VVD a/DT girl/NN with/IN a/DT telescope/NNS ./SENT
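A minimal sketch of how the three actions can build a dependency tree. The slide only names the actions; the exact effect assigned to Left and Right here is an assumed convention, and the action sequence is supplied by hand where a real parser would ask a trained classifier to choose each action from features of `top` and `next`.

```python
def parse(words, actions):
    """Apply Shift/Left/Right actions and return a dependent -> head mapping.

    Assumed convention for this sketch:
      Shift: move `next` from the input onto the stack.
      Right: pop `top`; `top` becomes a dependent of `next`.
      Left:  pop `top`; `next` becomes a dependent of `top`,
             and `top` takes its place at the front of the input.
    """
    stack = []
    buffer = list(range(len(words)))  # token indices still to be read
    heads = {}
    for action in actions:
        if action == "Shift":
            stack.append(buffer.pop(0))
        elif action == "Right":
            top = stack.pop()
            heads[top] = buffer[0]
        elif action == "Left":
            top = stack.pop()
            heads[buffer[0]] = top
            buffer[0] = top
        else:
            raise ValueError(f"unknown action: {action}")
    return {words[dep]: words[head] for dep, head in heads.items()}

words = ["He", "saw", "a", "girl"]
actions = ["Shift", "Right", "Shift", "Shift", "Right", "Left", "Shift"]
tree = parse(words, actions)
print(tree)
```

Each token is attached exactly once, so the number of actions grows linearly with sentence length, which is consistent with the parsing speed quoted for DeSR.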
EVALITA 2007 Results • Best statistical parser
Experimental data sets • Wikipedia • Yahoo! Answers
English Wikipedia Indexing • Original size: 4.4 GB • Number of articles: 1,400,000 • Tagging time: ~3 days (6 days with previous tools) • Parsing time: 40 hours • Indexing time: 9 hours (8 days with UIMA + Lucene) • Index size: 3 GB • Metadata: 12 GB
Scaling Indexing • Highly parallelizable • Using Hadoop in stream mode
Implementation • Special version of Passage Retrieval • Tags are overlaid on words • Treated as terms at the same position as the corresponding word • Not counted, to avoid skewing TF/IDF • Given an ID in the lexicon • Retrieval is fast: • A few msec per query on a 10 GB index • Provided as both Linux library and Windows DLL
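The overlay idea can be sketched with a toy positional index: each tag is posted at the same position as the word it annotates, while only words contribute to the document length used for TF/IDF. The class and method names are hypothetical and the structure is far simpler than a real engine. Words are lowercased and tags kept uppercase so the two cannot collide in the lexicon.

```python
from collections import defaultdict

class OverlayIndex:
    def __init__(self):
        # term or tag -> doc ID -> set of token positions
        self.postings = defaultdict(lambda: defaultdict(set))
        self.doc_len = defaultdict(int)

    def add(self, doc_id, tokens):
        """Index a document given as a list of (word, tags) pairs."""
        for position, (word, tags) in enumerate(tokens):
            self.postings[word.lower()][doc_id].add(position)
            self.doc_len[doc_id] += 1  # only words count toward length
            for tag in tags:
                # Tag shares the word's position and is not counted.
                self.postings[tag][doc_id].add(position)

    def tagged(self, word, tag):
        """Doc IDs where `word` occurs at a position that also carries `tag`."""
        word_docs = self.postings.get(word.lower(), {})
        tag_docs = self.postings.get(tag, {})
        return sorted(
            doc for doc in word_docs.keys() & tag_docs.keys()
            if word_docs[doc] & tag_docs[doc]
        )

index = OverlayIndex()
index.add(1, [("This", []), ("music", []), ("is", []), ("lame", ["NEGATIVE"])])
index.add(2, [("A", []), ("lame", []), ("horse", [])])
print(index.tagged("lame", "NEGATIVE"))
```

A query term such as `NEGATIVE:lame` then reduces to a positional intersection of two posting lists, which needs no post-processing stage.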
Java Interface • Generated using SWIG • Results accessible through a ResultIterator • List of terms or tags for a sentence generated on demand
Proximity queries • Did France win the World Cup? • proximity 15 [MORPH/win:*DEP/SUB:france 'world cup'] • Born in the French territory of New Caledonia, he was a vital player in the French team that won the 1998 World Cup and was on the squad, but played just one game, as France won Euro 2000. • France repeated the feat of Argentina in 1998, by taking the title as they won their home 1998 World Cup, beating Brazil. • Both England (1966) and France (1998) won their only World Cups whilst playing as host nations.
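A minimal sketch of a proximity test, assuming that "proximity 15" means all query elements must occur within a span of at most 15 token positions. That reading of the operator is an assumption, as is the function name. Each input list holds the sorted positions of one query element within a sentence or passage.

```python
def within_proximity(position_lists, window):
    """True if one position can be picked from every list within `window` tokens."""
    events = sorted(
        (position, i)
        for i, positions in enumerate(position_lists)
        for position in positions
    )
    counts = [0] * len(position_lists)
    covered = 0   # number of lists with a position inside the sliding window
    left = 0
    for right_position, i in events:
        if counts[i] == 0:
            covered += 1
        counts[i] += 1
        # Shrink from the left while every list is still represented.
        while covered == len(position_lists):
            left_position, j = events[left]
            if right_position - left_position <= window:
                return True
            counts[j] -= 1
            if counts[j] == 0:
                covered -= 1
            left += 1
    return False

# Hypothetical positions for: win (any form), france as subject, 'world cup'
print(within_proximity([[12], [2, 30], [17]], window=15))
```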