
Querying for relations from the semi-structured Web

Sunita Sarawagi

IIT Bombay

http://www.cse.iitb.ac.in/~sunita

Contributors

Rahul Gupta, Girija Limaye, Prashant Borole,

Rakesh Pimplikar, Aditya Somani


Web Search

  • Mainstream web search

    • User → Keyword queries

    • Search engine → Ranked list of documents

  • 15 glorious years of squeezing all of the user’s search needs into this least common denominator

  • Structured web search

    • User → Natural language queries / Structured queries

    • Search engine → Point answers, record sets

  • Many challenges in understanding both query and content

  • 15 years of slow but steady progress


    The Quest for Structure

    • Vertical structured search engines

      Structure → Schema → Domain-specific

      • Shopping: ShopBot (Etzioni et al., 1997)

        • Product name, manufacturer, price

      • Publications: CiteSeer (Lawrence, Giles, et al., 1998)

        • Paper title, author name, email, conference, year

      • Jobs: FlipDog, WhizBang! Labs (Mitchell et al., 2000)

        • Company name, job title, location, requirement

      • People: DBLife (Doan 07)

        • Name, affiliations, committees served, talks delivered.

          Triggered much research on extraction and IR-style search of structured data (BANKS ‘02).


    Horizontal Structured Search

    • Domain-independent structure

    • Small, generic set of structured primitives over entities, types, relationships, and properties (see the sketch after this list)

      • <Entity> IsA <Type>

        • Mysore is a city

      • <Entity> Has <Property>

        • <City> Average rainfall <Value>

      • <Entity1> <related-to> <Entity2>

        • <Person> born-in <City>

        • <Person> CEO-of <Company>
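A minimal sketch, not from the talk, of how these generic primitives can be held as plain subject-predicate-object triples; the entity, property, and value names are illustrative placeholders.

```python
# The three generic primitives written as (subject, predicate, object) triples.
# Entity, property, and value names here are illustrative placeholders.
triples = [
    ("Mysore", "IsA", "City"),                      # <Entity> IsA <Type>
    ("Mysore", "Has:average_rainfall", "<value>"),  # <Entity> Has <Property>
    ("<Person>", "born-in", "<City>"),              # <Entity1> <related-to> <Entity2>
    ("<Person>", "CEO-of", "<Company>"),
]

def lookup(predicate):
    """Return all triples with the given predicate."""
    return [t for t in triples if t[1] == predicate]

print(lookup("IsA"))  # [('Mysore', 'IsA', 'City')]
```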


    Types of Structured Search

    • Web + People → Structured databases (ontologies)

      • Created manually (Psyche), or semi-automatically (Yago)

      • True Knowledge (2009), Wolfram Alpha (2009)

    • Web annotated with structured elements

      • Queries: Keywords + structured annotations

        • Example: <Physicist> +cosmos

      • Open-domain structure extraction and annotations of web docs (2005—)


    Users, Ontologies, and the Web

    • Users are from Venus

      • Bi-syllabic, impatient, believe in mind-reading

    • Ontologies are from Mars

      • One structure to fit all


    • Web content creators are from some other galaxy

      • Ontologies=axvhjizb

      • Let search engines bring the users


    What is missed in Ontologies

    • The trivial, the transient, and the textual

    • Procedural knowledge

      • What do I do on an error?

    • Huge body of invaluable text of various types

      • reviews, literature, commentaries, videos

    • Context

      • By stripping knowledge to its skeletal form, context that is so valuable for search is lost.

      • As long as queries are unstructured, the redundancy and variety in unstructured sources is invaluable.


    Structured annotations in HTML

    • IsA annotations

      • KnowItAll (2004)

    • Open-domain Relationships

      • TextRunner (Banko et al., 2007)

    • Ontological annotations

      • SemTag and Seeker (2003)

      • Wikipedia annotations (Wikify! 2007, CSAW 2009)

    All view documents as a sequence of tokens

    Challenging to ensure high accuracy



    Queries in WWT

    • Query by content

    • Query by description



    Keyword search to find structured records

    Query: Computer science concept inventor year

    • The correct answer is not one click away: verbose articles, not structured tables.

    • The desired records are spread across many documents.

    • The only document with an unstructured list contains only some of the desired records.



    Highly relevant Wikipedia table not retrieved in the top-k

    Ideal answer should be integrated from these incomplete sources


    Attempt 2: Include samples in query

    Query (known example rows): alan turing machine codd relational database

    • Retrieved documents are relevant only to the keywords.

    • The ideal answer is still spread across many documents.


    WWT Architecture

    [Architecture diagram. Offline: web list/table sources are annotated against the ontology (type-system hierarchy, statistics) and indexed in a content + context index. Online: the query table or keyword query passes through type inference, the index query builder, and the resolver builder; retrieved sources L1,…,Lk are processed by the record labeler and CRF-based extractor into tables T1,…,Tk; cell and row resolvers score matches, the consolidator merges the tables, and the ranker returns the final consolidated table to the user.]


    Offline: Annotating to an Ontology

    Annotate table cells with entity nodes and table columns with type nodes

    [Figure: a fragment of the catalog (All → People, Entertainers, Indian_directors, movies; movies → Indian_films, English_films, 2008_films, Terrorism_films) alongside a web table whose cells, such as Wednesday, Black&White, and Coffee_house, are linked to entity nodes (A_Wednesday; Black&White; Coffee_house (film) vs. Coffee_house (Loc)) and whose columns are linked to type nodes.]


    Challenges

    • Ambiguity of entity names

      • “Coffee house” both a movie name and a place name

    • Noisy mentions of entity names

      • Black&White versus Black and White

    • Multiple labels

      • The YAGO ontology has 2.2 types per entity on average

    • Missing type links in the ontology ⇒ cannot use the least common ancestor

      • Missing link: Black&White to 2008_films

      • Not a missing link: 1920 to Terrorism_films

    • Scale:

      • Yago has 1.9 million entities, 200,000 types


    A unified approach

    A graphical model jointly labels cells and columns to maximize the sum of scores over

    ycj = entity label of cell c in column j

    yj = type label of column j

    • Score(ycj): string similarity between the text of cell c and ycj.

    • Score(yj): string similarity between the header of column j and yj.

    • Score(yj, ycj) (see the sketch after the figure below):

      • Subsumed entity: inversely proportional to the distance between ycj and yj.

      • Outside entity: fraction of overlapping entities between yj and the immediate parent of ycj.

        • Handles missing links: the overlap of 2008_movies with 2007_movies is zero, but with Indian_movies it is non-zero.

    [Figure: column type label yj = movies, with child types English_films, Indian_films, and Terrorism_films; cell labels y1j and y2j are subsumed entities under yj, while y3j is an outside entity.]
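A minimal sketch of the column-cell compatibility score Score(yj, ycj) over a toy catalog fragment; the data structures and helper names are invented for illustration and are not the talk's implementation.

```python
# Toy catalog fragment (illustrative): entity -> immediate types, type -> parent type.
entity_types = {
    "A_Wednesday": {"Indian_films", "2008_films", "Terrorism_films"},
    "Black&White": {"Indian_films"},   # missing link: not listed under 2008_films
}
type_parent = {"Indian_films": "movies", "English_films": "movies",
               "2008_films": "movies", "Terrorism_films": "movies"}
type_members = {t: {e for e, ts in entity_types.items() if t in ts} for t in type_parent}

def dist_to_type(entity, target_type):
    """Number of links from the entity up to target_type, or None if not subsumed."""
    frontier, d = set(entity_types.get(entity, ())), 1
    while frontier:
        if target_type in frontier:
            return d
        frontier = {type_parent[t] for t in frontier if t in type_parent}
        d += 1
    return None

def score_col_cell(y_col, y_cell):
    """Score(yj, ycj): inverse distance when the entity is subsumed by the column type,
    else the entity overlap between the column type and the cell label's parent types."""
    d = dist_to_type(y_cell, y_col)
    if d is not None:
        return 1.0 / d                       # subsumed entity
    parent_members = set()
    for t in entity_types.get(y_cell, ()):   # immediate parent types of the entity
        parent_members |= type_members.get(t, set())
    col_members = type_members.get(y_col, set())
    union = col_members | parent_members
    return len(col_members & parent_members) / max(1, len(union))  # outside entity

print(score_col_cell("Indian_films", "A_Wednesday"))  # subsumed: 1.0
print(score_col_cell("2008_films", "Black&White"))    # missing link handled: 0.5, not 0
```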


    WWT Architecture

    [Architecture diagram repeated: the WWT pipeline as above, with offline annotation and indexing, then online retrieval, extraction, resolution, consolidation, and ranking.]


    Extraction: Content queries

    Extracting query columns from list records

    Query: Q

    A source: Li

    • New York University (NYU), New York City, founded in 1831.

    • Columbia University, founded in 1754 as King’s College.

    • Binghamton University, Binghamton, established in 1946.

    • State University of New York, Stony Brook, New York, founded in 1957

    • Syracuse University, Syracuse, New York, established in 1870

    • State University of New York, Buffalo, established in 1846

    • Rensselaer Polytechnic Institute (RPI) at Troy.

    Lists are often human generated.


    Extraction

    Query: Q

    Extracted table columns

    • New York University (NYU), New York City, founded in 1831.

    • Columbia University, founded in 1754 as King’s College.

    • Binghamton University, Binghamton, established in 1946.

    • State University of New York, Stony Brook, New York, founded in 1957

    • Syracuse University, Syracuse, New York, established in 1870

    • State University of New York, Buffalo, established in 1846

    • Rensselaer Polytechnic Institute (RPI) at Troy.

    Rule-based extractor insufficient. Statistical extractor needs training data.

    Generating that is also not easy!


    Extraction: Labeled data generation

    Lists are unlabeled. Labeled records needed to train a CRF

    A fast but naïve approach for generating labeled records

    • New York Univ. in NYC

    • Columbia University in NYC

    • Monroe Community College in Brighton

    • State University of New York in

    • Stony Brook, New York.

    Query about colleges in NY

    Fragment of a relevant list source


    Extraction: Labeled data generation

    A fast but naïve approach

    • New York Univ. in NYC

    • Columbia University in NYC

    • Monroe Community College in Brighton

    • State University of New York in

    • Stony Brook, New York.

    Another match for New York

    Another match for New York University

    • In the list, look for matches of every query cell.


    Extraction: Labeled data generation

    A fast but naïve approach

    • New York Univ. in NYC

    • Columbia University in NYC

    • Monroe Community College in Brighton

    • State University of New York in

    • Stony Brook, New York.

    [Figure: query rows 1 and 2 greedily mapped to their best-matching list lines.]

    • In the list, look for matches of every query cell.

    • Greedily map each query row to the best match in the list


    Extraction: Labeled data generation

    A fast but naïve approach

    • New York Univ. in NYC

    • Columbia University in NYC

    • Monroe Community College in Brighton

    • State University of New York in

    • Stony Brook, New York.

    [Figure: the greedy mapping of query rows 1 and 2; some segments are left unmapped (hurting recall), assumed to be ‘Other’, or wrongly mapped.]

    • Hard matching criteria have significantly lower recall

      • Missed segments.

      • Does not use natural clues like Univ = University

    • Greedy matching can lead to really bad mappings (see the sketch below)
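A minimal sketch, on illustrative data, of the naïve scheme above: exact substring matching of query cells plus a greedy choice of the best line per query row, which makes the recall problem visible.

```python
# Naïve labeled-data generation (illustrative data): exact substring matches of query
# cells in list lines, then greedily map each query row to its best-matching line.
query_rows = [("New York University", "New York City"),
              ("Columbia University", "New York City")]
list_lines = ["New York Univ. in NYC",
              "Columbia University in NYC",
              "Monroe Community College in Brighton"]

def hard_matches(row, line):
    """Count query cells appearing verbatim in the line (no clue that Univ = University)."""
    return sum(cell in line for cell in row)

for row in query_rows:
    best = max(list_lines, key=lambda line: hard_matches(row, line))
    print(row, "->", best if hard_matches(row, best) else "unmapped (hurts recall)")
```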


    Generating labeled data: Soft approach

    • New York Univ. in NYC

    • Columbia University in NYC

    • Monroe Community College in Brighton

    • State University of New York in

    • Stony Brook, New York.


    Generating labeled data: Soft approach


    • New York Univ. in NYC

    • Columbia University in NYC

    • Monroe Community College in Brighton

    • State University of New York in

    • Stony Brook, New York.

    [Figure: soft row-match scores, e.g. 0.9, 0.3, and 1.8, between query rows and list lines.]

    • Match score for each (query row, source row) pair (segmentation sketch below)

      • Score of the best segmentation of the source row into query columns

      • Score of a segment s for column c:

        • Probability that cell c of the query row is the same as segment s

        • Computed by the Resolver module based on the type of the column
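A minimal sketch of scoring the best segmentation of a source line into query columns with dynamic programming; `toy_sim` is an illustrative stand-in for the Resolver's type-specific match probability, and the sketch assumes the query columns appear in order in the line.

```python
import re

def best_segmentation_score(tokens, query_row, cell_sim):
    """Best score of segmenting `tokens` into segments assigned to the query columns
    in order (a simplification), with leftover tokens treated as 'Other'."""
    n, m = len(tokens), len(query_row)
    NEG = float("-inf")
    # dp[i][j] = best score after consuming tokens[:i] and assigning query columns[:j]
    dp = [[NEG] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if dp[i][j] == NEG:
                continue
            if i < n:                                # leave token i unmatched ('Other')
                dp[i + 1][j] = max(dp[i + 1][j], dp[i][j])
            if j < m:                                # map tokens[i:k] to query column j
                for k in range(i + 1, n + 1):
                    seg = " ".join(tokens[i:k])
                    dp[k][j + 1] = max(dp[k][j + 1], dp[i][j] + cell_sim(j, seg))
    return max(dp[n])                                # trailing columns may stay unmatched

QUERY_ROW = ("New York University", "New York City")
def toy_sim(j, seg):
    """Illustrative similarity: 1.0 for an exact match, 0.6 for a prefix match."""
    a, b = seg.lower(), QUERY_ROW[j].lower()
    return 1.0 if a == b else (0.6 if b.startswith(a) or a.startswith(b) else 0.0)

tokens = re.findall(r"[\w.]+", "New York Univ. in NYC")
print(best_segmentation_score(tokens, QUERY_ROW, toy_sim))
```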


    Generating labeled data: Soft approach

    • New York Univ. in NYC

    • Columbia University in NYC

    • Monroe Community College in Brighton

    • State University of New York in

    • Stony Brook, New York.

    [Figure: further row-match scores of 0.7, 0.3, and 2.0.]

    • Match score for each (query row, source row) pair

      • Score of the best segmentation of the source row into query columns

      • Score of a segment s for column c:

        • Probability that cell c of the query row is the same as segment s

        • Computed by the Resolver module based on the type of the column


    Generating labeled data: Soft approach

    • New York Univ. in NYC

    • Columbia University in NYC

    • Monroe Community College in Brighton

    • State University of New York in

    • Stony Brook, New York.

    [Figure: bipartite graph of query rows and list lines with match scores (0.9, 0.3, 0.7, 1.8, 2, …); the greedy matching is shown in red.]

    • Compute the maximum-weight matching between query rows and list lines (sketch below)

      • Better than greedily choosing the best match for each row

    • Soft string matching significantly increases the number of labeled candidates

      • Vastly improves recall and leads to better extraction models.
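A minimal sketch of the maximum-weight matching step using SciPy's assignment solver; the score matrix is illustrative, with rows as query rows and columns as candidate list lines.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative soft match scores (rows = query rows, columns = source list lines),
# in the spirit of the 0.9 / 0.3 / 1.8 scores on the slides.
scores = np.array([
    [1.8, 0.3, 0.0],   # query row 1
    [0.9, 2.0, 0.1],   # query row 2
])

# A maximum-weight assignment resolves conflicts globally; greedy per-row choices
# can map two query rows to the same line and end up with bad overall mappings.
rows, cols = linear_sum_assignment(scores, maximize=True)
for r, c in zip(rows, cols):
    print(f"query row {r} -> list line {c} (score {scores[r, c]:.1f})")
```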


    Extractor

    • Train a CRF on the generated labeled data

    • Feature set (sketch below)

      • Delimiters and HTML tokens in a window around labeled segments.

      • Alignment features

    • Collective training across multiple sources
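A minimal sketch of token features of the kind listed above (delimiters and HTML tokens in a window around each token); the feature names and example tokens are illustrative, and the resulting dicts plus labels from the soft-matching step would be fed to a standard CRF trainer such as sklearn-crfsuite.

```python
DELIMITERS = {",", ";", "(", ")", "<li>", "<br>", "<b>", "</b>"}  # illustrative set

def token_features(tokens, i, window=2):
    """Features for token i: the token itself plus delimiter/HTML tokens nearby."""
    feats = {"word": tokens[i].lower(), "is_digit": tokens[i].isdigit()}
    for d in range(-window, window + 1):
        if d != 0 and 0 <= i + d < len(tokens) and tokens[i + d] in DELIMITERS:
            feats[f"ctx{d}:{tokens[i + d]}"] = True
    return feats

tokens = ["<li>", "Columbia", "University", ",", "founded", "in", "1754"]
X = [[token_features(tokens, i) for i in range(len(tokens))]]
# y = [["Other", "College", "College", "Other", "Other", "Other", "Year"]]  # from soft matching
print(X[0][1])  # features for 'Columbia'
```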


    Experiments

    • Aim: Reconstruct Wikipedia tables from only a few sample rows.

    • Sample queries

      • TV Series: Character name, Actor name, Season

      • Oil spills: Tanker, Region, Time

      • Golden Globe Awards: Actor, Movie, Year

      • Dadasaheb Phalke Awards: Person, Year

      • Parrots: common name, scientific name, family


    Experiments: Dataset

    • Corpus:

      • 16M lists from 500M pages from a web crawl.

      • 45% of lists retrieved by index probe are irrelevant.

    • Query workload

      • 65 queries. Ground truth hand-labeled by 10 users over 1300 lists.

      • 27% queries not answerable with one list (difficult).

      • True consolidated table = 75% of Wikipedia table, 25% new rows not present in Wikipedia.


    Extraction performance

    • Benefits of soft training-data generation, alignment features, and staged extraction, measured by F1 score.

    • More than 80% F1 accuracy with just three query records


    Queries in WWT

    • Query by content

    • Query by description


    Extraction: Description queries

    [Figure: example web tables with non-informative headers or no headers at all.]


    Context to get at relevant tables


    • Ontological annotations

    • Context is union of

      • Text around tables

      • Headers

      • Ontology labels when present

    [Figure: an ontology fragment (All → Chemical_elements, People; Chemical_elements → Metals, Non_Metals, Alkali, Non-alkali, Gas, Non-gas) and example tables of elements such as Hydrogen, Lithium, Aluminium, Sodium, and Carbon whose context links them to these types.]


    Joint labeling of table columns

    • Given

      • Candidate tables: T1, T2, …, Tn

      • Query columns: q1, q2, …, qm

    • Task: label the columns of each Ti with {q1, q2, …, qm, none} to maximize the sum of these scores (sketch below)

      • Score(T, j, qk) = ontology type match + header string match with qk

      • Score(T, *, qk) = match of the description (context) of T with qk

      • Score(T, j, T', j', qk) = content overlap of column j of table T with column j' of table T' when both are labeled qk

    • Inference in a graphical model → solved via belief propagation.
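A minimal sketch of the per-column score Score(T, j, qk) and a greedy assignment over it; the similarity helper, threshold, and example table are illustrative, and the greedy step stands in for the joint inference over all tables (including content-overlap scores) that the system performs via belief propagation.

```python
from difflib import SequenceMatcher

def str_sim(a, b):
    """Illustrative header-string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score(table, j, qk):
    """Score(T, j, qk): ontology type match + header string match with query column qk."""
    type_match = 1.0 if qk in table["column_types"][j] else 0.0
    return type_match + str_sim(table["headers"][j] or "", qk)

# Illustrative candidate table with short, uninformative headers.
table = {"headers": ["name", "yr"], "column_types": [{"chemical element"}, {"year"}]}
query_cols = ["chemical element", "year"]

assignment = {}
for qk in query_cols:                       # greedy stand-in for belief propagation
    j = max(range(len(table["headers"])), key=lambda j: score(table, j, qk))
    if score(table, j, qk) > 0.3 and j not in assignment.values():
        assignment[qk] = j
print(assignment)  # {'chemical element': 0, 'year': 1}
```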


    WWT Architecture

    [Architecture diagram repeated: the WWT pipeline as above, with offline annotation and indexing, then online retrieval, extraction, resolution, consolidation, and ranking.]


    Step 3: Consolidation

    Merging the extracted tables into one

    [Figure: two extracted tables are combined and duplicate rows are merged into a single consolidated table.]


    Consolidation

    • Challenge: resolving when two rows are the same in the face of

      • Extraction errors

      • Missing columns

      • Open-domain

      • No training.

    • Our approach: a specially designed Bayesian Network with interpretable and generalizable parameters


    Resolver

    Bayesian network: P(RowMatch | rows q, r) is derived from the per-cell match probabilities P(cell i match | qi, ri), i = 1, …, n (sketch below).

    • Cell-level probabilities

    • Parameters automatically set using list statistics

    • Derived from user-supplied, type-specific similarity functions
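A minimal sketch of turning per-cell match probabilities into a row-match probability; the string similarity and the independence-style combination are illustrative stand-ins for the talk's Bayesian network, whose parameters come from list statistics and type-specific similarity functions.

```python
from difflib import SequenceMatcher

def cell_match_prob(q_cell, r_cell):
    """Illustrative, type-agnostic stand-in for a type-specific similarity function."""
    if q_cell is None or r_cell is None:          # missing column: uninformative
        return 0.5
    return SequenceMatcher(None, q_cell.lower(), r_cell.lower()).ratio()

def row_match_prob(q_row, r_row, prior=0.5):
    """P(RowMatch | q, r): combine cell evidence by multiplying odds (independence assumption)."""
    odds = prior / (1 - prior)
    for q, r in zip(q_row, r_row):
        p = min(max(cell_match_prob(q, r), 1e-3), 1 - 1e-3)
        odds *= p / (1 - p)
    return odds / (1 + odds)

print(row_match_prob(("Columbia University", "1754"), ("Columbia Univ.", "1754")))
```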



    Ranking

    Factors for ranking

    • Relevance: membership in overlapping sources; support from multiple sources

    • Completeness: importance of the columns present; penalize records with only common ‘spam’ columns like City and State

    • Correctness: extraction confidence



    Relevance ranking on set membership

    • Weighted-sum approach

      • Score of a set (list source) t: s(t) = fraction of query rows in t

      • Relevance of a consolidated row r: Σ_{t : r ∈ t} s(t)

    • Graph-walk approach (sketch below)

      • Random walk from row nodes to table nodes, starting from the query rows, with random restarts to the query rows

    [Figure: bipartite graph linking query rows and consolidated rows to the tables they appear in.]
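A minimal sketch of the two relevance schemes on a toy bipartite graph of rows and tables; the membership data, restart probability, and walk length are illustrative.

```python
import random

# Toy membership: which source tables each consolidated row (and query row) appears in.
tables = {"T1": {"r1", "r2", "q1"}, "T2": {"r2", "r3", "q1", "q2"}, "T3": {"r3"}}
query_rows = {"q1", "q2"}
rows = {r for members in tables.values() for r in members}

# 1) Weighted sum: s(t) = fraction of query rows in t; relevance(r) = sum of s(t) over tables containing r.
s = {t: len(members & query_rows) / len(query_rows) for t, members in tables.items()}
weighted = {r: sum(s[t] for t, members in tables.items() if r in members) for r in rows - query_rows}
print("weighted sum:", weighted)

# 2) Random walk with restart on the row-table bipartite graph, restarting at query rows.
def walk_relevance(steps=20000, restart=0.2, seed=0):
    random.seed(seed)
    visits = {r: 0 for r in rows}
    node = random.choice(sorted(query_rows))
    for _ in range(steps):
        if random.random() < restart:
            node = random.choice(sorted(query_rows))                       # restart
        elif node in rows:
            node = random.choice(sorted(t for t, m in tables.items() if node in m))
        else:
            node = random.choice(sorted(tables[node]))
        if node in rows:
            visits[node] += 1
    total = sum(visits.values())
    return {r: round(visits[r] / total, 3) for r in rows - query_rows}
print("graph walk:", walk_relevance())
```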


    Ranking Criteria

    • Score(Row r) combines (sketch below):

      • Graph-walk relevance of r.

      • Importance of the columns C present in r (high if C functionally determines the others)

      • Cell extraction confidence: a noisy-OR of the cell extraction confidences from the individual CRFs
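A minimal sketch of combining the three factors into one row score; the weights, the column-importance values, and the noisy-OR over per-cell CRF confidences are illustrative.

```python
# Illustrative column importance; common 'spam' columns like City and State score low.
COLUMN_IMPORTANCE = {"concept": 1.0, "inventor": 1.0, "year": 0.6, "city": 0.1, "state": 0.1}

def noisy_or(confidences):
    """Noisy-OR of per-cell extraction confidences from the individual CRFs."""
    p_none = 1.0
    for c in confidences:
        p_none *= (1.0 - c)
    return 1.0 - p_none

def row_score(relevance, columns_present, cell_confidences, weights=(1.0, 1.0, 1.0)):
    """Score(row) = weighted combination of relevance, completeness, and correctness."""
    completeness = sum(COLUMN_IMPORTANCE.get(c, 0.3) for c in columns_present)
    correctness = noisy_or(cell_confidences)
    return weights[0] * relevance + weights[1] * completeness + weights[2] * correctness

print(row_score(1.5, ["concept", "inventor", "year"], [0.9, 0.8, 0.95]))
```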


    Overall performance

    [Charts: results on All Queries and on Difficult Queries.]

    • To justify sophisticated consolidation and resolution, compare WWT with:

      • Processing only the (magically known) single best list ⇒ no consolidation/resolution required

      • Simple consolidation, with no merging of approximate duplicates

    • WWT achieves > 55% recall and beats both baselines; the gain is bigger for difficult queries.


    Running time

    • < 30 seconds with 3 query records.

      • Can be improved by processing sources in parallel.

    • Variance is high because running time depends on the number of columns, record length, etc.


    Related Work

    • Google Squared

      • Developed independently; launched in May 2009.

      • The user provides a keyword query, e.g. “list of Italian joints in Manhattan”; the schema is inferred.

      • Technical details are not public.

    • Prior methods for extraction and resolution

      • Assume labeled data / pre-trained parameters

      • We generate labeled data and automatically train resolver parameters from the list sources.


    Summary

    • Structured web search & the role of non-text, partially structured web sources

    • WWT system

      • Domain-independent

      • Online: structure interpretation at query time

      • Relies heavily on unsupervised statistical learning

        • Graphical model for table annotation

        • Soft-approach for generating labeled data

        • Collective column labeling for descriptive queries

        • Bayesian network for resolution and consolidation

        • PageRank-style graph walk + confidence from a probabilistic extractor for ranking


    What next?

    • Designing plans for non-trivial ways of combining sources

    • Better ranking and user-interaction models.

    • Expanding query set

      • Aggregate queries: tables are rich in quantities

      • Point queries: attribute value and relationship queries

    • Interplay between semi-structured web & Ontologies

      • Augmenting one with the other.

    • Quantify information in structured sources vis-à-vis text sources on typical query workloads

