Contextual Insight in Search Enabling Technologies and Applications

Contextual Insight in SearchEnabling Technologies and Applications Aleksander Øhrn, PhD August 31, 2005

Outline • Background and scope • Enabling technologies • Scalable search and data aggregation • XML search • Information extraction • Example applications

Background • Fast Search & Transfer ASA • Founded 1997, grew out of research at NTNU • Enterprise search market leader • Previously a major web search player with alltheweb.com • Me • PhD in computer science from NTNU (2000) • Chief scientist at Fast Search & Transfer ASA • Main focus on the “intelligence layer” of search

Scope “Paragraphs that contain quotations by George W. Bush, where he mentions a monetary amount.” “Paragraphs that discuss a company merger or acquisition.” “Quotations where somebody says something about the Gaza Strip.”

Search engines and databases • Serve similar purposes • Organize your data so that it can be queried • Hybrid deployments • Database offloading

> Anatomy of a search platform SECURITY ACCESS User Monitor ACL Monitor Web Portlets File Applications Database CONTENT API QUERY API SFE Application Custom Custom CONTENT CONNECTORS QUERY CONNECTORS MANAGEMENT & APPLICATION SERVICES Deployment Business Application Administration Custom TOOLS & TOOL BUILDING FRAMEWORK

Document processing • Processes the content before it gets indexed • Documents flow through a pipeline of processing stages • Highly customizable • Example processing stages • Format, language and encoding detection • Format and encoding normalization • HTML parsing • Entity extraction • Vectorization • Categorization • Lemmatization and synonym handling • Anchor text harvesting • ... pdf; english; iso-8859-1 pdf → html; iso-8859-1 → utf-8 venus williams; arthur ashe; ... {(venus williams, 1.0), (wimbledon, 0.81), (center court, 0.65), ...} sports/tennis mouse ~ mice; car ~ automobile

Document Processing “UIMA stands for Unstructured Information Management Architecture. It is a component software architecture for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies.”

Query processing • Processes the query before it gets sent to the search engine • Queries flow through a pipeline of processing stages • Highly customizable • Example processing stages • Stopword handling • Phrasing and antiphrasing • Spellchecking • Natural language handling • Lemmatization and synonym handling • Ontologies • ... red cross → “red cross”; where can i find information about cars → cars brittany speers → britney spears television under 200 dollars → television AND price:<200 mouse ~ mice; car ~ automobile xsara < citröen < car < vehicle

Engine architecture • Scaling in data volume • Add columns • Each column holds a partition of the data • Query the partitions in parallel • Scaling in query traffic • Add rows • Replicate the partitions • Distribute the queries • Scaling in other dimensions • Query complexity • Fault tolerance

qrserver nodes dispatch nodes search nodes (single tier) indexing nodes Engine architecture • Query and result processing • Turn bad queries into good ones • Help navigate among matching documents • Divide and conquer • Parallelize the searching over disjoint sets of documents • Merge the results

For very big deployment scenarios Web scale, i.e., billions of documents All search nodes are equal, but some are more equal than others Organize the search nodes into multiple tiers Top tier nodes may have fewer documents and run on better hardware Keep the good stuff in the top tiers Only fall through to the lower tiers if not enough good matches are not found in the top tiers Use query logs to decide which documents that belong in which tiers Tier 1 Fallthrough? Tier 2 Fallthrough? Tier 3 Engine architecture

Data aggregation • Computes histograms and scalar statistics across the result set • Counting is done distributed across the search nodes • For numerical or temporal fields, this may involve applying a real-time discretization algorithm • The histograms can be scored according to their ”interestingness” • Histograms may serve as query refinement sources • The counts preview how many results to expect • Adds a discovery aspect

Co-Authors Research Topics Documents authoredby J Seward stress echocardiography A Tajik 56 image orientation J Oh 25 Journals J Am Soc Echocardiography regurgitant orifice P Pellikka 16 J Am Coll Cardiology abnormal relaxation B Khanderia Echocardiography 16 Mayo Clinic Proc Am J Cardiology Circulation Am Heart J two-dimensional echocardiography D Hagler 13 Chemical substances V Roger ventricular response in patients 13 Dobutamine 1 2 5 4 1 K Bailey initial repair 13 1 Cardiotonic agents 1 3 2 1 2 F Miller mitral lesions 11 Contrast media 1 1 Atropine 1 echocardiographic contrast Imagent US 1 myocardial infarction Data aggregation • Trends and correlations • Query specific • Ad hoc reports and OLAP • Analytical applications powered by search

Result processing • Unsupervised clustering • Analyzes on-the-fly similarities across matching documents • Group together documents having similar content • Provides a bird’s-eye view of topical spread • Query refinement suggestions • Examines the distribution of meta data • Builds a histogram of values • Provides a means for slicing and dicing the result set • Filtering • E.g., dynamic duplicate detection

CompletenessAuthority Statistics Quality Freshness Relevancy primitives C A S Q F C Rank score A S Q F Relevancy

<authors> <author>John P. Brown</author> <author>George Smith</author></authors> XML search • Moving beyond a flat document model • From a simple (field, value) layout to complex,nested structure having scopes/tags • From a predefined index layout to schemaflexibility • Some queries cannot be adequately handledwithout structure • Flattening out the content won’t quite work • False positives slip through “Show me documents authored by John Smith”

<matches> <match> <sentence>The publicist of <person>Robert De Niro</person> announces that the actor has prostate cancer.</sentence> </match> <match> <sentence><person>De Niro</person> was diagnosed with cancer last week.</sentence> </match> <match> <sentence>”He’ll fight the cancer,” says <person>John Barnes</person>, founder of his Welsh fanclub. </match> </matches> xml:sentence:(“cancer” and scope(person))//sentence[fast:matches(., “cancer”) and .//person] XML search • Queries that combine structure and content • Impose contextual constraints on the content • FAST Query Language (FQL) • Partial overlap with XQuery • Linguistic extensions • Return matching scopes • See the matching documentfragments, including markupand annotations

Data aggregation and XML • Constrain the aggregation to the level of matching scopes • Document-level aggregation may yield too imprecise results • Provides a more factual and relevant relation between the query and the histogram entries • Example from Wikipedia • Or to some other constrained scope • E.g., match documents that within the same sentence contains a company the word ’scandal’, but aggregate over all person names that occur in sentences that contain the job title ’cfo’. Persons that appear in documents that contain the word ‘soccer’ Persons that appear in paragraphs that contain the word ‘soccer’

Data aggregation and XML • Aggregate not only over content, but also over structure! • Discovery, schema exploration and disambiguation xml:sentence:(“clinton” and scope(date)) <sentence><person>Hillary Rodham Clinton</person> was elected <location type=“country”>United States</location> Senator from <location type=“city”>New York</location> on <date base=“2000-11-07”>November 7, 2000</date>.</sentence> <sentence><person>Chelsea Clinton</person> has been dating <person>Jeremy Kane</person>, a <university>Stanford University</university> classmate since <date base=“2000-09-08”>Sept. 8, 2000</date>.</sentence> <sentence>So, whether you are visiting for the <date base=“XXXX-07-04”>4th of July</date> weekend or an extended stay, we are confident that you will enjoy your time in <location type=“city”>Clinton</location>!</sentence> person (3)date (3)location (2)university (1) “Which scopes exist in my matching document fragments? “ person (2)location (1) “Inside which scopes are my query terms found?”

...in fair and free elections. Kuchma was reelected in November 1999 to another five-year term, with 56% of the vote. International observers criticized... ...in fair and free elections.</sentence> <sentence><person base=“Leonid Kuchma” title=“president” >Kuchma</person> was reelected in <date base=“1999-11-XX”>November 1999</date> to another five-year term, with 56% of the vote.</sentence> <sentence>International observers criticized... Information extraction • Apply text mining techniques to identify entities of interest • Structural and semantic regions • Makes unstructured data more structured • Mark them up in context • Grammars as scope producers • Scopes can be annotated with meta data • Make it possible to act on the information! • E.g., make it searchable in a way that preserves context

Searching with context • High search precision and expressiveness • Semantically logical boundaries • Disambiguation • Wildcards at the semantic level • Work with normalized forms • Where there’s smoke, there’s fire • ... • Navigation • Suggestions to broaden/refine • Discover, summarize and disambiguate E.g., use sentence or paragraph as the main enclosing scope E.g., ‘bush’ as in ‘George Bush’ and not ‘burning bush’ Predicates that test for scope existence 2001-09-11, Sept. 11, 11th of September, ... E.g., use existence of a zip code to find an address sentence → paragraph; paragraph → sentence ’clinton’ as in ’person’ or in ’location’?

<works-for person=“John Smith” company=“Microsoft” title=“Vice President”> <sentence><person>Johnny</person> works at <company>Microsoft</company>.</sentence> <sentence>He is a <jobtitle>VP</jobtitle> there.</sentence></works-for> Complex relations • Relations are often of interest • E.g., (gene x disease) or (person x company x jobtitle) or (company x product) • Possible to model explicitly, if we are willing to invest the effort • Can we avoid upfront modelling? • We want to enable ad hoc fact finding and relations • We cannot foresee everything • Even if we did, we may not want to spend the time modelling all the targeted combinations • We want to maximize document processing throughput • Use a scoped search to relate entities • Require the entity types involved to co-occur within the same semantically meaningful scope • Filter and count over the matching scopes • Statistics to the rescue? • Co-occurrence is no guarantee for relatedness • Implies a certain amount of redundancy to work well

POS tagging • Annotate each token with part-of-speech data • Noun, verb, adjective, ... • Markov based taggers • Treat sequences of tags as a Markov model • Use the Viterbi algorithm to find the most probable path • Extensions to standard Markov models, e.g., using trigrams • Transformation based taggers • Assign the most probable tags as a starting point • Use rewrite rules to modify the tags • Learning algorithm to find the best rules and their application order

POS tagging

Layered pattern matching • Annotate the text with zero or more layers of meta data • The original surface form of the text can be viewed as trivial meta data • Apply a pattern matcher over selected layers • Handcrafted rules or trained models • Extract the surface forms that correspond to the matching patterns • Regular expressions over the surface forms go a long way • Modern regular expression engines go beyond regular languages • Callouts are our friends! Use them as “oracles”

Sentiment analysis “Is there a positive spin on current news related to my brand?” “What is the tone of this document?” “Thumbs up or thumbs down?”

Global Local Distant Imminent Sentiment analysis • Based on work by Turney and Littman • Depends on initial word seeds defined as “polar opposites” • Search for interesting words and phrases in a training corpus that occur together with the seeds, and score them • Use the scored words and phrases as the classification vocabulary • Interesting generalizations possible • Multiple dimensions • Dimensions other than thumbs up/down

<sentence><company>Adidas</company> <sentiment degree="-0.6">failed to deliver</sentiment> a profit in Q1.</sentence> <sentence>The new shoe from <company>Adidas</company> is <sentiment degree="0.9">really awesome</sentiment>!</sentence> Sentiment analysis and XML • Bring the analysis down to a subdocument level • Provides a more accurate sentiment assessment • Enables precise searching • Enables retrieval of the “evidence” • Scalar statistics aggregated over matching scopes • E.g, average sentiment degree

Task-oriented interfaces q, e → xml:sentence:(q and scope(e))

Question answering • Some classes of questions transform naturally to scope queries • Statistics to the rescue! when (is|was|did|...) X? → xml:sentence:(X’ and (scope(date) or scope(time))) what does X stand for? → xml:acronym:(@base:X and scope(@definition)) who works (at|for) X? → xml:sentence:(company:X and scope(person) and scope(jobtitle)) why (is|are|do|does|have|has|...) X? → xml:sentence:(X’ and (“because” or “due to” or ...)) when did the berlin wall fall? when is christmas? when was d-day?

Contextual navigation Contextual access Audio and video • Speech-to-text software as an XML producer • Speaker identification • Scene changes • Timing information • Contextual access to multimedia content

Scope “Paragraphs that contain quotations by George W. Bush, where he mentions a monetary amount.” xml:paragraph:quotation:(@speaker:”bush” and scope(price))) “Paragraphs that discuss a company merger or acquisition.” “Quotations where somebody says something about the Gaza Strip.” xml:paragraph:(string(“merger”, linguistics=“on”) and scope(company) and scope(price)) xml:sentence:acronym:(@base:“mit” and scope(@definition)) xml:sentence:(“d-day” and (scope(date) or scope(location))

Concluding thoughts • Opportunities for semantically motivated relevancy models • Like business rules, almost • What kind of applications would emerge... • ...if the entire web was enriched in this way?

Contextual Insight in Search Enabling Technologies and Applications

Contextual Insight in Search Enabling Technologies and Applications

Presentation Transcript

Chapter 3 Enabling Technologies

Search Technologies

Privacy and Contextual Integrity: Framework and Applications

Enabling Technologies (Chapter 1)

Privacy and Contextual Integrity: Framework and Applications

Enabling Collaborative Applications in MANETs

Privacy and Contextual Integrity: Framework and Applications

KEY ENABLING TECHNOLOGIES

Key Enabling Technologies

Enabling Technologies

Wireless Broadband Access for the Automobile: Applications and Enabling Technologies

Contextual Image Search

“Grid-enabling” applications

Contextual Image Search

VCell Project: Enabling Technologies

Enabling Data Intensive Applications with Advanced Optical Technologies

Contextual Search and Name Disambiguation in Email using Graphs

Contextual Search and Name Disambiguation in Email Using Graphs

Contextual Insight in Search Enabling Technologies and Applications