contextual insight in search enabling technologies and applications l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Contextual Insight in Search Enabling Technologies and Applications PowerPoint Presentation
Download Presentation
Contextual Insight in Search Enabling Technologies and Applications

Loading in 2 Seconds...

play fullscreen
1 / 35

Contextual Insight in Search Enabling Technologies and Applications - PowerPoint PPT Presentation


  • 201 Views
  • Uploaded on

Contextual Insight in Search Enabling Technologies and Applications. Aleksander Øhrn, PhD August 31, 2005. Outline. Background and scope Enabling technologies Scalable search and data aggregation XML search Information extraction Example applications. Background.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Contextual Insight in Search Enabling Technologies and Applications' - jaden


Download Now An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
contextual insight in search enabling technologies and applications

Contextual Insight in SearchEnabling Technologies and Applications

Aleksander Øhrn, PhD

August 31, 2005

outline
Outline
  • Background and scope
  • Enabling technologies
    • Scalable search and data aggregation
    • XML search
    • Information extraction
  • Example applications
background
Background
  • Fast Search & Transfer ASA
    • Founded 1997, grew out of research at NTNU
    • Enterprise search market leader
    • Previously a major web search player with alltheweb.com
  • Me
    • PhD in computer science from NTNU (2000)
    • Chief scientist at Fast Search & Transfer ASA
    • Main focus on the “intelligence layer” of search
scope
Scope

“Paragraphs that contain quotations by George W. Bush, where he mentions a monetary amount.”

“Paragraphs that discuss a company merger or acquisition.”

“Quotations where somebody says something about the Gaza Strip.”

search engines and databases
Search engines and databases
  • Serve similar purposes
    • Organize your data so that it can be queried
  • Hybrid deployments
    • Database offloading
anatomy of a search platform

>

Anatomy of a search platform

SECURITY ACCESS

User Monitor

ACL Monitor

Web

Portlets

File

Applications

Database

CONTENT API

QUERY API

SFE

Application

Custom

Custom

CONTENT

CONNECTORS

QUERY

CONNECTORS

MANAGEMENT & APPLICATION SERVICES

Deployment

Business

Application

Administration

Custom

TOOLS & TOOL BUILDING FRAMEWORK

document processing
Document processing
  • Processes the content before it gets indexed
    • Documents flow through a pipeline of processing stages
    • Highly customizable
  • Example processing stages
    • Format, language and encoding detection
    • Format and encoding normalization
    • HTML parsing
    • Entity extraction
    • Vectorization
    • Categorization
    • Lemmatization and synonym handling
    • Anchor text harvesting
    • ...

pdf; english; iso-8859-1

pdf → html; iso-8859-1 → utf-8

venus williams; arthur ashe; ...

{(venus williams, 1.0), (wimbledon, 0.81), (center court, 0.65), ...}

sports/tennis

mouse ~ mice; car ~ automobile

document processing8
Document Processing

“UIMA stands for Unstructured Information Management Architecture. It is a component software architecture for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies.”

query processing
Query processing
  • Processes the query before it gets sent to the search engine
    • Queries flow through a pipeline of processing stages
    • Highly customizable
  • Example processing stages
    • Stopword handling
    • Phrasing and antiphrasing
    • Spellchecking
    • Natural language handling
    • Lemmatization and synonym handling
    • Ontologies
    • ...

red cross → “red cross”; where can i find information about cars → cars

brittany speers → britney spears

television under 200 dollars → television AND price:<200

mouse ~ mice; car ~ automobile

xsara < citröen < car < vehicle

engine architecture
Engine architecture
  • Scaling in data volume
    • Add columns
    • Each column holds a partition of the data
    • Query the partitions in parallel
  • Scaling in query traffic
    • Add rows
    • Replicate the partitions
    • Distribute the queries
  • Scaling in other dimensions
    • Query complexity
    • Fault tolerance
engine architecture11

qrserver nodes

dispatch nodes

search nodes

(single tier)

indexing nodes

Engine architecture
  • Query and result processing
    • Turn bad queries into good ones
    • Help navigate among matching documents
  • Divide and conquer
    • Parallelize the searching over disjoint sets of documents
    • Merge the results
engine architecture12
For very big deployment scenarios

Web scale, i.e., billions of documents

All search nodes are equal, but some are more equal than others

Organize the search nodes into multiple tiers

Top tier nodes may have fewer documents and run on better hardware

Keep the good stuff in the top tiers

Only fall through to the lower tiers if not enough good matches are not found in the top tiers

Use query logs to decide which documents that belong in which tiers

Tier 1

Fallthrough?

Tier 2

Fallthrough?

Tier 3

Engine architecture
data aggregation
Data aggregation
  • Computes histograms and scalar statistics across the result set
    • Counting is done distributed across the search nodes
    • For numerical or temporal fields, this may involve applying a real-time discretization algorithm
    • The histograms can be scored according to their ”interestingness”
  • Histograms may serve as query refinement sources
    • The counts preview how many results to expect
    • Adds a discovery aspect
data aggregation14

Co-Authors

Research Topics

Documents authoredby J Seward

stress echocardiography

A Tajik

56

image orientation

J Oh

25

Journals

J Am Soc Echocardiography

regurgitant orifice

P Pellikka

16

J Am Coll Cardiology

abnormal relaxation

B Khanderia

Echocardiography

16

Mayo Clinic Proc

Am J Cardiology

Circulation

Am Heart J

two-dimensional echocardiography

D Hagler

13

Chemical substances

V Roger

ventricular response in patients

13

Dobutamine

1

2

5

4

1

K Bailey

initial repair

13

1

Cardiotonic agents

1

3

2

1

2

F Miller

mitral lesions

11

Contrast media

1

1

Atropine

1

echocardiographic contrast

Imagent US

1

myocardial infarction

Data aggregation
  • Trends and correlations
    • Query specific
  • Ad hoc reports and OLAP
    • Analytical applications powered by search
result processing
Result processing
  • Unsupervised clustering
    • Analyzes on-the-fly similarities across matching documents
    • Group together documents having similar content
    • Provides a bird’s-eye view of topical spread
  • Query refinement suggestions
    • Examines the distribution of meta data
    • Builds a histogram of values
    • Provides a means for slicing and dicing the result set
  • Filtering
    • E.g., dynamic duplicate detection
relevancy

CompletenessAuthority

Statistics

Quality

Freshness

Relevancy

primitives

C

A

S

Q

F

C

Rank

score

A

S

Q

F

Relevancy
xml search

<authors> <author>John P. Brown</author> <author>George Smith</author></authors>

XML search
  • Moving beyond a flat document model
    • From a simple (field, value) layout to complex,nested structure having scopes/tags
    • From a predefined index layout to schemaflexibility
  • Some queries cannot be adequately handledwithout structure
    • Flattening out the content won’t quite work
    • False positives slip through

“Show me documents authored by John Smith”

xml search18

<matches>

<match>

<sentence>The publicist of <person>Robert De Niro</person> announces that the actor has prostate cancer.</sentence>

</match>

<match>

<sentence><person>De Niro</person> was diagnosed with cancer last week.</sentence>

</match>

<match>

<sentence>”He’ll fight the cancer,” says <person>John Barnes</person>, founder of his Welsh fanclub.

</match>

</matches>

xml:sentence:(“cancer” and scope(person))//sentence[fast:matches(., “cancer”) and .//person]

XML search
  • Queries that combine structure and content
    • Impose contextual constraints on the content
  • FAST Query Language (FQL)
    • Partial overlap with XQuery
    • Linguistic extensions
  • Return matching scopes
    • See the matching documentfragments, including markupand annotations
data aggregation and xml
Data aggregation and XML
  • Constrain the aggregation to the level of matching scopes
    • Document-level aggregation may yield too imprecise results
    • Provides a more factual and relevant relation between the query and the histogram entries
    • Example from Wikipedia
  • Or to some other constrained scope
    • E.g., match documents that within the same sentence contains a company the word ’scandal’, but aggregate over all person names that occur in sentences that contain the job title ’cfo’.

Persons that appear in documents that contain the word ‘soccer’

Persons that appear in paragraphs that contain the word ‘soccer’

data aggregation and xml20
Data aggregation and XML
  • Aggregate not only over content, but also over structure!
  • Discovery, schema exploration and disambiguation

xml:sentence:(“clinton” and scope(date))

<sentence><person>Hillary Rodham Clinton</person> was elected <location type=“country”>United States</location> Senator from <location type=“city”>New York</location> on <date base=“2000-11-07”>November 7, 2000</date>.</sentence>

<sentence><person>Chelsea Clinton</person> has been dating <person>Jeremy Kane</person>, a <university>Stanford University</university> classmate since <date base=“2000-09-08”>Sept. 8, 2000</date>.</sentence>

<sentence>So, whether you are visiting for the <date base=“XXXX-07-04”>4th of July</date> weekend or an extended stay, we are confident that you will enjoy your time in <location type=“city”>Clinton</location>!</sentence>

person (3)date (3)location (2)university (1)

“Which scopes exist in my matching document fragments? “

person (2)location (1)

“Inside which scopes are my query terms found?”

information extraction

...in fair and free elections. Kuchma was reelected in November 1999 to another five-year term, with 56% of the vote. International observers criticized...

...in fair and free elections.</sentence> <sentence><person base=“Leonid Kuchma” title=“president” >Kuchma</person> was reelected in <date base=“1999-11-XX”>November 1999</date> to another five-year term, with 56% of the vote.</sentence> <sentence>International observers criticized...

Information extraction
  • Apply text mining techniques to identify entities of interest
    • Structural and semantic regions
    • Makes unstructured data more structured
  • Mark them up in context
    • Grammars as scope producers
    • Scopes can be annotated with meta data
  • Make it possible to act on the information!
    • E.g., make it searchable in a way that preserves context
searching with context
Searching with context
  • High search precision and expressiveness
    • Semantically logical boundaries
    • Disambiguation
    • Wildcards at the semantic level
    • Work with normalized forms
    • Where there’s smoke, there’s fire
    • ...
  • Navigation
    • Suggestions to broaden/refine
    • Discover, summarize and disambiguate

E.g., use sentence or paragraph as the main enclosing scope

E.g., ‘bush’ as in ‘George Bush’ and not ‘burning bush’

Predicates that test for scope existence

2001-09-11, Sept. 11, 11th of September, ...

E.g., use existence of a zip code to find an address

sentence → paragraph; paragraph → sentence

’clinton’ as in ’person’ or in ’location’?

complex relations

<works-for person=“John Smith” company=“Microsoft” title=“Vice President”> <sentence><person>Johnny</person> works at <company>Microsoft</company>.</sentence> <sentence>He is a <jobtitle>VP</jobtitle> there.</sentence></works-for>

Complex relations
  • Relations are often of interest
    • E.g., (gene x disease) or (person x company x jobtitle) or (company x product)
    • Possible to model explicitly, if we are willing to invest the effort
  • Can we avoid upfront modelling?
    • We want to enable ad hoc fact finding and relations
    • We cannot foresee everything
    • Even if we did, we may not want to spend the time modelling all the targeted combinations
    • We want to maximize document processing throughput
  • Use a scoped search to relate entities
    • Require the entity types involved to co-occur within the same semantically meaningful scope
    • Filter and count over the matching scopes
  • Statistics to the rescue?
    • Co-occurrence is no guarantee for relatedness
    • Implies a certain amount of redundancy to work well
pos tagging
POS tagging
  • Annotate each token with part-of-speech data
    • Noun, verb, adjective, ...
  • Markov based taggers
    • Treat sequences of tags as a Markov model
    • Use the Viterbi algorithm to find the most probable path
    • Extensions to standard Markov models, e.g., using trigrams
  • Transformation based taggers
    • Assign the most probable tags as a starting point
    • Use rewrite rules to modify the tags
    • Learning algorithm to find the best rules and their application order
layered pattern matching
Layered pattern matching
  • Annotate the text with zero or more layers of meta data
    • The original surface form of the text can be viewed as trivial meta data
  • Apply a pattern matcher over selected layers
    • Handcrafted rules or trained models
    • Extract the surface forms that correspond to the matching patterns
  • Regular expressions over the surface forms go a long way
    • Modern regular expression engines go beyond regular languages
    • Callouts are our friends! Use them as “oracles”
sentiment analysis
Sentiment analysis

“Is there a positive spin on current news related to my brand?”

“What is the tone of this document?”

“Thumbs up or thumbs down?”

sentiment analysis28

Global

Local

Distant

Imminent

Sentiment analysis
  • Based on work by Turney and Littman
    • Depends on initial word seeds defined as “polar opposites”
    • Search for interesting words and phrases in a training corpus that occur together with the seeds, and score them
    • Use the scored words and phrases as the classification vocabulary
  • Interesting generalizations possible
    • Multiple dimensions
    • Dimensions other than thumbs up/down
sentiment analysis and xml

<sentence><company>Adidas</company> <sentiment degree="-0.6">failed to deliver</sentiment> a profit in Q1.</sentence>

<sentence>The new shoe from <company>Adidas</company> is <sentiment degree="0.9">really awesome</sentiment>!</sentence>

Sentiment analysis and XML
  • Bring the analysis down to a subdocument level
    • Provides a more accurate sentiment assessment
    • Enables precise searching
    • Enables retrieval of the “evidence”
  • Scalar statistics aggregated over matching scopes
    • E.g, average sentiment degree
task oriented interfaces
Task-oriented interfaces

q, e → xml:sentence:(q and scope(e))

question answering
Question answering
  • Some classes of questions transform naturally to scope queries
  • Statistics to the rescue!

when (is|was|did|...) X? → xml:sentence:(X’ and (scope(date) or scope(time)))

what does X stand for? → xml:acronym:(@base:X and scope(@definition))

who works (at|for) X? → xml:sentence:(company:X and scope(person) and scope(jobtitle))

why (is|are|do|does|have|has|...) X? → xml:sentence:(X’ and (“because” or “due to” or ...))

when did the berlin wall fall?

when is christmas?

when was d-day?

audio and video

Contextual navigation

Contextual access

Audio and video
  • Speech-to-text software as an XML producer
    • Speaker identification
    • Scene changes
    • Timing information
  • Contextual access to multimedia content
scope33
Scope

“Paragraphs that contain quotations by George W. Bush, where he mentions a monetary amount.”

xml:paragraph:quotation:(@speaker:”bush” and scope(price)))

“Paragraphs that discuss a company merger or acquisition.”

“Quotations where somebody says something about the Gaza Strip.”

xml:paragraph:(string(“merger”, linguistics=“on”) and scope(company) and scope(price))

xml:sentence:acronym:(@base:“mit” and scope(@definition))

xml:sentence:(“d-day” and (scope(date) or scope(location))

concluding thoughts
Concluding thoughts
  • Opportunities for semantically motivated relevancy models
    • Like business rules, almost
  • What kind of applications would emerge...
    • ...if the entire web was enriched in this way?