
Querying for relations from the semi-structured Web

Sunita Sarawagi

IIT Bombay

http://www.cse.iitb.ac.in/~sunita

Contributors

Rahul Gupta, Girija Limaye, Prashant Borole,
Rakesh Pimplikar, Aditya Somani

Web Search
  • Mainstream web search
      • User → keyword queries
      • Search engine → ranked list of documents
    • 15 glorious years of funneling all of the user's search needs into this least common denominator
  • Structured web search
      • User → natural language queries / structured queries
      • Search engine → point answers, record sets
    • Many challenges in understanding both the query and the content
    • 15 years of slow but steady progress
The Quest for Structure
  • Vertical structured search engines

Structure → Schema → Domain-specific

    • Shopping: Shopbot (Etzioni et al., 1997)
      • Product name, manufacturer, price
    • Publications: CiteSeer (Lawrence, Giles, et al., 1998)
      • Paper title, author name, email, conference, year
    • Jobs: FlipDog, WhizBang! Labs (Mitchell et al., 2000)
      • Company name, job title, location, requirements
    • People: DBLife (Doan, 2007)
      • Name, affiliations, committees served, talks delivered

Triggered much research on extraction and IR-style search of structured data (BANKS '02).

Horizontal Structured Search
  • Domain-independent structure
  • Small, generic set of structured primitives over entities, types, relationships, and properties
    • <Entity> IsA <Type>
      • Mysore is a city
    • <Entity> Has <Property>
      • <City> Average rainfall <Value>
    • <Entity1> <related-to> <Entity2>
      • <Person> born-in <City>
      • <Person> CEO-of <Company>
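
For concreteness, these primitives can be pictured as plain typed triples. Below is a minimal sketch (an illustrative data model only, not WWT's internal representation), using placeholder values where the slides give none.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """One domain-independent structured primitive: subject, predicate, object."""
    subject: str
    predicate: str   # "IsA", a property name, or a relation name
    obj: str

facts = [
    Triple("Mysore", "IsA", "city"),                  # <Entity> IsA <Type>
    Triple("Mysore", "average_rainfall", "<value>"),  # <Entity> Has <Property>
    Triple("<person>", "born-in", "<city>"),          # <Entity1> <related-to> <Entity2>
    Triple("<person>", "CEO-of", "<company>"),
]
```
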
Types of Structured Search
  • Web+People  Structured databases ( Ontologies)
    • Created manually (Psyche), or semi-automatically (Yago)
    • True Knowledge (2009), Wolfram Alpha (2009)
  • Web annotated with structured elements
    • Queries: Keywords + structured annotations
      • Example: <Physicist> +cosmos
    • Open-domain structure extraction and annotations of web docs (2005—)
Users, Ontologies, and the Web
  • Users are from Venus
    • Bi-syllabic, impatient, believe in mind-reading
  • Ontologies are from Mars
    • One structure to fit all
  • Web content creators are from some other galaxy
    • Ontologies = axvhjizb
    • Let the search engines bring the users
What is missed in Ontologies
  • The trivial, the transient, and the textual
  • Procedural knowledge
    • What do I do on an error?
  • Huge body of invaluable text of various types
    • reviews, literature, commentaries, videos
  • Context
    • By stripping knowledge to its skeletal form, context that is so valuable for search is lost.
    • As long as queries are unstructured, the redundancy and variety in unstructured sources is invaluable.
Structured annotations in HTML
  • IsA annotations
    • KnowItAll (2004)
  • Open-domain relationships
    • TextRunner (Banko et al., 2007)
  • Ontological annotations
    • SemTag and Seeker (2003)
    • Wikipedia annotations (Wikify! 2007, CSAW 2009)

All view documents as a sequence of tokens

Challenging to ensure high accuracy

Queries in WWT
  • Query by content
  • Query by description

Keyword search to find structured records

Query columns: computer science concept, inventor, year

  • The correct answer is not one click away.
  • Top results are verbose articles, not structured tables.
  • The desired records are spread across many documents.
  • Only one document contains an unstructured list of some of the desired records.
  • A highly relevant Wikipedia table is not retrieved in the top-k.
  • The ideal answer should be integrated from these incomplete sources.

Attempt 2: Include samples in query

Query: alan turing machine codd relational database (known example rows included as keywords)

  • Retrieves documents relevant only to the keywords.
  • The ideal answer is still spread across many documents.

WWT Architecture

[Architecture diagram] Online: the user's query table goes through Type Inference against the type-system hierarchy; an Index Query Builder probes a content+context index of web sources to retrieve record sources L1,…,Lk; a Resolver Builder sets up cell and row resolvers from source statistics; a Record Labeler/Extractor (CRF models) extracts tables T1,…,Tk with row and cell scores; a Consolidator merges them; and a Ranker returns the final consolidated table. Offline: web sources are annotated against the ontology and indexed.

Offline: Annotating to an Ontology

Annotate table cells with entity nodes and table columns with type nodes

[Illustration: a Yago-like ontology fragment (All → movies → {Indian_films, English_films, 2008_films, Terrorism_films}; All → People → {Entertainers, Indian_directors}) with table cells such as Wednesday, Black&White, and Coffee house linked to the entity nodes A_Wednesday, Black&White, and Coffee_house (film); Coffee_house also exists as a location entity.]

Challenges
  • Ambiguity of entity names
    • “Coffee house” both a movie name and a place name
  • Noisy mentions of entity names
    • Black&White versus Black and White
  • Multiple labels
    • The Yago ontology has 2.2 types per entity on average
  • Missing type links in the ontology: cannot use the least common ancestor
    • Missing link: Black&White to 2008_films
    • Not a missing link: 1920 to Terrorism_films
  • Scale:
    • Yago has 1.9 million entities, 200,000 types
A unified approach

Graphical model to jointly label cells and columns, maximizing the sum of scores over

y_cj = entity label of cell c in column j

y_j = type label of column j

  • Score(y_cj): string similarity between cell c and y_cj
  • Score(y_j): string similarity between the header of column j and y_j
  • Score(y_j, y_cj)
    • Subsumed entity: inversely proportional to the distance between them
    • Outside entity: fraction of overlapping entities between y_j and the immediate parent of y_cj
      • Handles missing links: the overlap of 2008_films with 2007_films is zero, but with Indian_films it is non-zero

[Illustration: a candidate column type y_j (e.g., movies, with children English_films, Indian_films, Terrorism_films); cell labels y_1j and y_2j are subsumed entities under y_j, while y_3j is an outside entity.]
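
A minimal sketch of how one such joint score could be computed, assuming hypothetical helpers string_sim, ontology.subsumes, ontology.distance, ontology.parent, and ontology.sibling_overlap; this illustrates the scoring scheme above, not the actual WWT implementation.

```python
def joint_score(column_type, cell_entities, header, cells, ontology, string_sim):
    """Score one column labeled column_type with its cells labeled cell_entities."""
    score = string_sim(header, column_type)                  # Score(y_j)
    for cell_text, entity in zip(cells, cell_entities):
        score += string_sim(cell_text, entity)               # Score(y_cj)
        if ontology.subsumes(column_type, entity):
            # Subsumed entity: closer in the hierarchy, higher score.
            score += 1.0 / (1.0 + ontology.distance(column_type, entity))
        else:
            # Outside entity: overlap between the column type and the entity's
            # immediate parent; this is what handles missing type links.
            score += ontology.sibling_overlap(column_type, ontology.parent(entity))
    return score
```

The graphical model maximizes the sum of such scores jointly over all columns and cells, rather than scoring each column in isolation.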


WWT Architecture

[Architecture diagram repeated as a roadmap; see the description above.]

Extraction: Content queries

Extracting query columns from list records

Query: Q

A source: Li

  • New York University (NYU), New York City, founded in 1831.
  • Columbia University, founded in 1754 as King’s College.
  • Binghamton University, Binghamton, established in 1946.
  • State University of New York, Stony Brook, New York, founded in 1957
  • Syracuse University, Syracuse, New York, established in 1870
  • State University of New York, Buffalo, established in 1846
  • Rensselaer Polytechnic Institute (RPI) at Troy.

Lists are often human generated.

Extraction

Query: Q

Extracted table columns

  • New York University (NYU), New York City, founded in 1831.
  • Columbia University, founded in 1754 as King’s College.
  • Binghamton University, Binghamton, established in 1946.
  • State University of New York, Stony Brook, New York, founded in 1957
  • Syracuse University, Syracuse, New York, established in 1870
  • State University of New York, Buffalo, established in 1846
  • Rensselaer Polytechnic Institute (RPI) at Troy.

Rule-based extractor insufficient. Statistical extractor needs training data.

Generating that is also not easy!

Extraction: Labeled data generation

Lists are unlabeled; labeled records are needed to train a CRF.

A fast but naïve approach for generating labeled records

  • New York Univ. in NYC
  • Columbia University in NYC
  • Monroe Community College in Brighton
  • State University of New York in Stony Brook, New York.

[The query is about colleges in NY; the list above is a fragment of a relevant list source.]


[Figure: in the list fragment above, "New York" and "New York University" each match multiple query cells.]

  • In the list, look for matches of every query cell.
  • Greedily map each query row to the best match in the list.

[Figure: with this hard, greedy matching some query rows remain unmapped (hurting recall), unmatched tokens are assumed to be 'Other', and some rows are wrongly mapped.]

  • Hard matching criteria have significantly lower recall
    • Missed segments
    • No use of natural clues like Univ = University
  • Greedy matching can lead to really bad mappings
Generating labeled data: Soft approach
  • New York Univ. in NYC
  • Columbia University in NYC
  • Monroe Community College in Brighton
  • State University of New York in Stony Brook, New York.

[Figure: soft match scores between query rows and source rows, e.g. 0.9, 0.3, and 1.8 for one set of candidate pairs and 0.7, 0.3, and 2.0 for another; the greedy matching is highlighted in red.]

  • Match score for each query row and source row
    • Score of the best segmentation of the source row into the query columns
    • Score of a segment s for column c:
      • Probability that cell c of the query row is the same as segment s
      • Computed by the Resolver module based on the type of the column
  • Compute the maximum-weight matching between query rows and source rows
    • Better than greedily choosing the best match for each row
  • Soft string matching significantly increases the number of labeled candidates
    • Vastly improves recall and leads to better extraction models
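
A minimal sketch of the matching step, assuming a hypothetical soft_match_score(query_row, source_row) that returns the resolver-based score of the best segmentation of the source row into the query columns; the Hungarian algorithm then replaces greedy matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def label_source_rows(query_rows, source_rows, soft_match_score, threshold=0.5):
    """Map each query row to at most one source row via maximum-weight matching."""
    # Soft (resolver-based) match score of every (query row, source row) pair.
    score = np.array([[soft_match_score(q, s) for s in source_rows]
                      for q in query_rows])
    # Maximum-weight bipartite matching: better than greedy row-by-row choice.
    qi, si = linear_sum_assignment(score, maximize=True)
    # Keep only confident pairs; these become labeled records for CRF training.
    return [(int(q), int(s)) for q, s in zip(qi, si) if score[q, s] >= threshold]
```
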
Extractor
  • Use CRF on the generated labeled data
  • Feature Set
    • Delimiters, HTML tokens in a window around labeled segments.
    • Alignment features
  • Collective training of multiple sources
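
Purely as an illustration (the slides do not specify the implementation), a CRF over the generated labels could be trained with a library such as sklearn-crfsuite; the feature function below is a simplified stand-in for the delimiter / HTML-window and alignment features mentioned above.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(tokens, i):
    """Simplified features: the token itself plus delimiter/HTML context."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_delim": tok in {",", ";", "|", "<li>", "<br>"},
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "<EOS>",
    }

def train_extractor(labeled_rows):
    """labeled_rows: list of (tokens, labels) pairs from the soft matching step."""
    X = [[token_features(toks, i) for i in range(len(toks))] for toks, _ in labeled_rows]
    y = [labels for _, labels in labeled_rows]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    crf.fit(X, y)
    return crf
```
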
Experiments
  • Aim: Reconstruct Wikipedia tables from only a few sample rows.
  • Sample queries
    • TV Series: Character name, Actor name, Season
    • Oil spills: Tanker, Region, Time
    • Golden Globe Awards: Actor, Movie, Year
    • Dadasaheb Phalke Awards: Person, Year
    • Parrots: common name, scientific name, family
Experiments: Dataset
  • Corpus:
    • 16M lists from 500M pages from a web crawl.
    • 45% of lists retrieved by index probe are irrelevant.
  • Query workload
    • 65 queries. Ground truth hand-labeled by 10 users over 1300 lists.
    • 27% of queries are not answerable from a single list (the "difficult" queries).
    • True consolidated table = 75% rows from the Wikipedia table, 25% new rows not present in Wikipedia.
Extraction performance
  • Benefits of soft training data generation, alignment features, staged-extraction on F1 score.
  • More than 80% F1 accuracy with just three query records
Queries in WWT
  • Query by content
  • Query by description
Extraction: Description queries

  • Tables with non-informative headers
  • Tables with no headers

Context to get at relevant tables

  • Ontological annotations
  • Context is the union of
    • Text around tables
    • Headers
    • Ontology labels when present

[Example ontology fragment: All → {Chemical_elements, People}; Chemical_elements → {Metals, Non_Metals, Alkali, Non-alkali, Gas, Non-gas}, with entities Hydrogen, Lithium, Aluminium, Sodium, Carbon annotating a table of chemical elements.]

Joint labeling of table columns
  • Given
    • Candidate tables: T1, T2, …, Tn
    • Query columns: q1, q2, …, qm
  • Task: label the columns of each Ti with labels from {q1, q2, …, qm} to maximize the sum of these scores
    • Score(T, j, qk) = ontology type match + header string match with qk
    • Score(T, *, qk) = match of the description of T with qk
    • Score(T, j, T', j', qk) = content overlap of column j of table T with column j' of table T' when both are labeled qk
  • Inference in a graphical model → solved via belief propagation
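
In symbols (my notation, not the slides'): writing a(T, j) for the query column assigned to column j of table T, the labeling task is to choose the assignment a that maximizes

```latex
\max_{a}\ \sum_{T}\sum_{j}\Big[\mathrm{Score}\big(T,j,a(T,j)\big)+\mathrm{Score}\big(T,*,a(T,j)\big)\Big]
\;+\;\sum_{(T,j)\neq(T',j')}\mathbb{1}\big[a(T,j)=a(T',j')\big]\,\mathrm{Score}\big(T,j,T',j',a(T,j)\big)
```

The pairwise term rewards giving the same query label to table columns with overlapping content, which is why belief propagation is used instead of independent per-column decisions.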

WWT Architecture

[Architecture diagram repeated; see the description above.]

Step 3: Consolidation

Merging the extracted tables into one, merging duplicate rows in the process.

[Figure: two extracted tables combined into a single consolidated table.]

Consolidation
  • Challenge: resolving when two rows are the same in the face of
    • Extraction errors
    • Missing columns
    • Open-domain
    • No training.
  • Our approach: a specially designed Bayesian Network with interpretable and generalizable parameters
Resolver

Bayesian network: P(RowMatch | rows q, r) depends on per-cell match variables P(i-th cell match | q_i, r_i) for i = 1, …, n.

  • Cell-level probabilities
  • Parameters are set automatically using list statistics
  • Derived from user-supplied type-specific similarity functions
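
A hedged sketch of how cell-level probabilities might be combined into a row-match probability; the actual WWT Bayesian network and its statistics-derived parameters are not reproduced here, and cell_sims below stands in for the user-supplied type-specific similarity functions.

```python
import numpy as np

def row_match_probability(q_row, r_row, cell_sims, prior=0.5):
    """Naive-Bayes-style combination of per-cell match evidence into P(RowMatch)."""
    log_odds = np.log(prior) - np.log(1.0 - prior)
    for q_cell, r_cell, sim in zip(q_row, r_row, cell_sims):
        if q_cell is None or r_cell is None:
            continue                      # missing column: no evidence either way
        p = np.clip(sim(q_cell, r_cell), 1e-6, 1.0 - 1e-6)
        log_odds += np.log(p) - np.log(1.0 - p)
    return 1.0 / (1.0 + np.exp(-log_odds))
```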


Ranking

Factors for ranking:

  • Relevance: membership in overlapping sources; support from multiple sources
  • Completeness: importance of the columns present; penalize records with only common "spam" columns like City and State
  • Correctness: extraction confidence

Relevance ranking on set membership
  • Weighted-sum approach
    • Score of a set (table) t:
      • s(t) = fraction of query rows in t
    • Relevance of consolidated row r:
      • Σ_{t : r ∈ t} s(t)
  • Graph-walk-based approach
    • Random walk from row nodes to table nodes, starting from the query rows, with random restarts to the query rows

[Figure: bipartite graph linking query rows and consolidated rows to the tables that contain them.]
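
A minimal sketch of the weighted-sum relevance score, assuming each retrieved table is represented simply as the set of row ids (query rows and consolidated rows) it contains.

```python
def table_score(table_rows, query_rows):
    """s(t): fraction of query rows that appear in table t."""
    return len(table_rows & query_rows) / max(len(query_rows), 1)

def row_relevance(row_id, tables, query_rows):
    """Relevance of a consolidated row: sum of s(t) over the tables containing it."""
    return sum(table_score(t, query_rows) for t in tables if row_id in t)

# Example: two tables, two query rows.
tables = [{"q1", "r1", "r2"}, {"q1", "q2", "r2"}]
print(row_relevance("r2", tables, {"q1", "q2"}))   # 0.5 + 1.0 = 1.5
```

The graph-walk variant replaces this one-step sum with a random walk with restarts over the same bipartite graph.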

Ranking Criteria
  • Score(row r):
    • Graph relevance of r
    • Importance of the columns C present in r (high if C functionally determines the others)
    • Cell extraction confidence: noisy-OR of the cell extraction confidences from the individual CRFs
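
For reference, noisy-OR is the standard way to combine several independent confidence estimates into one; a short sketch follows (the scores WWT actually combines are its CRFs' cell marginals).

```python
def noisy_or(confidences):
    """P(at least one extraction is correct), treating sources as independent."""
    p_all_wrong = 1.0
    for p in confidences:
        p_all_wrong *= (1.0 - p)
    return 1.0 - p_all_wrong

# noisy_or([0.6, 0.7]) == 1 - 0.4 * 0.3 == 0.88
```
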
Overall performance

[Recall compared over all queries and, separately, over the difficult queries.]

  • To justify sophisticated consolidation and resolution, compare against:
    • Processing only the magically known single best list (no consolidation/resolution required)
    • Simple consolidation, with no merging of approximate duplicates
  • WWT achieves more than 55% recall and beats both baselines; the gain is bigger for the difficult queries
Running time
  • < 30 seconds with 3 query records.
    • Can be improved by processing sources in parallel.
  • Variance is high because running time depends on the number of columns, record length, etc.
Related Work
  • Google-Squared
    • Developed independently. Launched in May 2009
    • User provides keyword query, e.g. “list of Italian joints in Manhattan”. Schema inferred.
    • Technical details not public.
  • Prior methods for extraction and resolution.
    • Assume labeled data/pre-trained parameters
    • We generate labeled data, and automatically train resolver parameters from the list source.
Summary
  • Structured web search & the role of non-text, partially structured web sources
  • WWT system
    • Domain-independent
    • Online: structure interpretation at query time
    • Relies heavily on unsupervised statistical learning
      • Graphical model for table annotation
      • Soft-approach for generating labeled data
      • Collective column labeling for descriptive queries
      • Bayesian network for resolution and consolidation
      • PageRank-style relevance + confidence from a probabilistic extractor for ranking
What next?
  • Designing plans for non-trivial ways of combining sources
  • Better ranking and user-interaction models.
  • Expanding query set
    • Aggregate queries: tables are rich in quantities
    • Point queries: attribute value and relationship queries
  • Interplay between semi-structured web & Ontologies
    • Augmenting one with the other.
  • Quantify information in structured sources vis-à-vis text sources on typical query workloads