
WebIQ: Learning from the Web to Match Deep-Web Query Interfaces


Presentation Transcript


  1. WebIQ: Learning from the Web to Match Deep-Web Query Interfaces Wensheng Wu Database & Information Systems Group University of Illinois, Urbana Joint work with AnHai Doan & Clement Yu ICDE, April 2006

  2. Search Problems on the Deep Web
  • Example query: find round-trip flights from Chicago to New York under $500
  • Relevant sources: united.com, airtravel.com, delta.com

  3. Solution: Build Data Integration Systems
  • A global query interface mediates the query (find round-trip flights from Chicago to New York under $500) over united.com, delta.com & airtravel.com
  • In effect: comparison shopping systems “on steroids”

  4. Current State of Affairs
  • Very active in both research communities & industry
  • Research
    • multidisciplinary efforts: Database, Web, KDD & AI
    • 10+ research groups in US, Asia & Europe
    • focuses: source discovery, schema matching & integration, query processing, data extraction
  • Industry
    • Transformic, Glenbrook Networks, WebScalers, PriceGrabber, Shopping.com, MySimon, Google, …

  5. Key Task: Schema Matching
  • (Figure: example query interfaces showing a 1-1 match and a complex match)

  6. Schema Matching is Ubiquitous!
  • Fundamental problem in numerous applications
    • data integration
    • data warehousing
    • peer data management
    • ontology merging
    • view integration
    • personal information management
  • Schema matching across Web sources
    • 30+ papers generated in past few years
    • Washington [AAAI-03, ICDE-05], Illinois [SIGMOD-03, SIGMOD-04, ICDE-06], MSR [VLDB-04], Binghamton [VLDB-03], HKST [VLDB-04], Utah [WebDB-05], …

  7. Schema Matching is Still Very Difficult
  • Must rely on properties of attributes, e.g., label & instances
  • Often there is little in common between matching attributes
  • Many attributes do not even have instances!
  • (Figure: examples of a 1-1 match and a complex match)

  8. Matching Performance Greatly Hampered by Pervasive Lack of Attribute Instances
  • 28.1%–74.6% of attributes have no instances
  • Extremely challenging to match these attributes
    • e.g., does departure city match from city or departure date?
  • Also difficult to match attributes with dissimilar instances
    • e.g., airline (with American airlines as instances) vs. carrier (with European ones)

  9. Our Solution: Exploit the Web
  • Discover instances from the Web
    • e.g., Chicago, New York, etc. for departure city & from city
  • Borrow instances from other attributes & validate them via the Web
    • e.g., check with the Web whether Air Canada is an instance of carrier

  10. Key Idea: Question-Answering from AI
  • Search the Web via search engines, e.g., Google
    • … but search engines do not understand natural-language questions
  • Idea: form extraction queries as sentences to be completed
    • “trick” the search engine into completing the sentences with instances
  • Example: attribute label departure city → extraction query “departure cities such as”

  11. Key Idea: Question-Answering from AI
  • Search Google with the extraction query & obtain snippets
    • e.g., “… other departure cities such as Boston, Chicago and LAX available …”
  • Extract instance candidates from the completion: Boston, Chicago, LAX
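The extraction step on slides 10-11 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `search_snippets` is a hypothetical stand-in for a search-engine API, the cue-phrase set is borrowed from the QA literature rather than the slides, and the naive list-splitting deliberately lets noisy candidates through (they are dealt with by the verification steps of slides 12-13).

```python
import re

# Hypothetical stand-in for a search-engine API (the talk uses Google);
# it should return result snippets for an exact-phrase query.
def search_snippets(query: str) -> list[str]:
    raise NotImplementedError("plug in a real search-engine API here")

def pluralize(label: str) -> str:
    # Naive pluralization of the label's head noun ("departure city" ->
    # "departure cities"); a real system would use proper morphology.
    *mods, head = label.split()
    head = head[:-1] + "ies" if head.endswith("y") else head + "s"
    return " ".join(mods + [head])

# Cue phrases for list extraction; the exact patterns WebIQ uses are not
# on the slides, so these are common QA-style examples.
EXTRACTION_CUES = ["{plural} such as", "such {plural} as", "{plural} including"]

def extract_candidates(label: str) -> set[str]:
    """Form extraction queries from an attribute label, search, and mine
    the returned snippets for the items that complete the sentence."""
    candidates: set[str] = set()
    for template in EXTRACTION_CUES:
        cue = template.format(plural=pluralize(label))
        for snippet in search_snippets(f'"{cue}"'):
            # Grab the text following the cue, e.g. "... departure cities
            # such as Boston, Chicago and LAX available ..." and split the
            # enumeration; trailing words like "available" slip through
            # here and are left for instance verification to remove.
            m = re.search(re.escape(cue) + r"\s+([^.;:]*)", snippet, re.I)
            if m:
                items = re.split(r",\s*|\s+and\s+", m.group(1))
                candidates.update(it.strip() for it in items if it.strip())
    return candidates
```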

  12. But Not Every Candidate is a True Instance
  • Reason 1: extraction queries may not be perfect
  • Reason 2: Web content is inherently noisy
  • Example: attribute city, extraction query “and other cities”, extracted candidate: 150
  • ⇒ need to perform instance verification

  13. Instance Verification: Outlier Detection
  • Goal: remove statistical outliers among the candidates
  • Step 1: pre-processing
    • recognize the type of the instances via pattern matching & the 80% rule
    • types: numeric & string
    • discard all candidates not of the determined type
    • e.g., most instance candidates for city are strings, so remove 150
  • Step 2: type-specific detection (see the sketch below)
    • perform discordance tests on test statistics, e.g.:
    • # of words: abnormal if a person name has more than 5 words
    • % of numeric characters: a US zip code contains only digits
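A sketch of both steps, under stated assumptions: the numeric pattern, the z = 2 cut-off, and the word-count statistic are illustrative choices; the slide only fixes the 80% rule and names example test statistics.

```python
import re
from statistics import mean, stdev

def apply_80_percent_rule(candidates: list[str]):
    """Step 1: recognize the instance type via pattern matching and the
    80% rule, then discard candidates not of the determined type."""
    def is_numeric(c: str) -> bool:
        return re.fullmatch(r"[\d.,]+", c) is not None
    numeric = [c for c in candidates if is_numeric(c)]
    strings = [c for c in candidates if not is_numeric(c)]
    if len(numeric) >= 0.8 * len(candidates):
        return "numeric", numeric
    if len(strings) >= 0.8 * len(candidates):
        return "string", strings   # e.g., drops the candidate "150" for city
    return "undetermined", candidates

def discordance_filter(candidates, statistic=lambda c: len(c.split()), z=2.0):
    """Step 2: type-specific discordance test; drop candidates whose test
    statistic (here: word count) lies more than z std-devs from the mean."""
    values = [statistic(c) for c in candidates]
    if len(values) < 2 or stdev(values) == 0:
        return candidates
    mu, sigma = mean(values), stdev(values)
    return [c for c, v in zip(candidates, values) if abs(v - mu) <= z * sigma]
```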

  14. Instance Verification: Web Validation
  • Goal: further semantic-level validation
  • Idea: exploit co-occurrence statistics of label & instances
    • “Make: Honda; Model: Accord”
    • “a variety of makes such as Honda, Mitsubishi”
  • Form validation queries by instantiating validation patterns with the label & the candidate
    • e.g., validation phrases V: “make Honda”, “makes such as Honda”
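The two phrasings on the slide can be generated from simple templates; a minimal sketch, with naive pluralization assumed (it happens to work for make → makes):

```python
# Validation patterns pair an attribute label with a candidate instance;
# the two forms shown on the slide are "<label> <instance>" and
# "<label>s such as <instance>".
VALIDATION_PATTERNS = ["{label} {instance}", "{label}s such as {instance}"]

def validation_queries(label: str, instance: str) -> list[str]:
    return [p.format(label=label, instance=instance) for p in VALIDATION_PATTERNS]

# validation_queries("make", "Honda") -> ["make Honda", "makes such as Honda"]
```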

  15. Instance Verification: Web Validation
  • Possible measure: NumHits(V + x)
    • e.g., NumHits(“cities such as Los Angeles”) = 26M
    • potential problem: biased towards popular instances
  • Instead use point-wise mutual information:
    • PMI(V, x) = NumHits(V + x) / (NumHits(V) × NumHits(x))
  • Example: V = “cities such as”; candidates: California, Los Angeles
    • NumHits(V + California) = 29
    • PMI(V, Los Angeles) ≈ 3000 × PMI(V, California)
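In code, the PMI score is one division over hit counts; `num_hits` is again a hypothetical search-engine call:

```python
def num_hits(query: str) -> int:
    """Hypothetical hit-count lookup for an exact-phrase Web query."""
    raise NotImplementedError("plug in a real search-engine API here")

def pmi(validation_phrase: str, candidate: str) -> float:
    """PMI(V, x) = NumHits(V + x) / (NumHits(V) * NumHits(x)).
    Dividing by NumHits(x) removes the bias toward popular instances
    that the raw co-occurrence count NumHits(V + x) suffers from."""
    joint = num_hits(f'"{validation_phrase} {candidate}"')
    marginal = num_hits(f'"{validation_phrase}"') * num_hits(f'"{candidate}"')
    return joint / marginal if marginal else 0.0
```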

  16. Validate Instances from Other Attributes
  • Method 1: discover k more instances from the Web
    • then check whether the borrowed one (Aer Lingus for Airline) is among them
    • problem: very likely Aer Lingus is not among the discovered instances
  • Method 2: compare its validation score with those of known instances
    • problem: the score for Aer Lingus may be much lower; how to decide?
  • Key observation: also compare to the scores of non-instances
    • e.g., Economy (with respect to Airline)

  17. Train Validation-Based Instance Classifier
  • Naïve Bayes classifier with validation-based features
    • validation phrases: V1 = “Airlines such as”, V2 = “Airline”
    • thresholds on the validation scores: t1 = .45, t2 = .075
  • Classification: P(C|X) ∝ P(C) · P(X|C), with P(+) = P(−) = ½
    • estimated likelihoods, e.g., P(f1=1|+) = 3/4, P(f1=1|−) = 1/4, …
  • A minimal sketch follows below
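A minimal sketch of the classifier, assuming the features are binary indicators of whether each PMI score clears its threshold; the priors and the f1 likelihoods come from the slide, while the f2 likelihoods are placeholder numbers:

```python
class ValidationNB:
    """Naive Bayes over binary validation features: f_i = 1 iff the PMI
    score under validation phrase V_i exceeds threshold t_i."""

    def __init__(self):
        self.thresholds = (0.45, 0.075)        # t1, t2 from the slide
        self.prior = {"+": 0.5, "-": 0.5}      # P(+) = P(-) = 1/2
        # P(f_i = 1 | class); f1 values are from the slide (3/4 and 1/4),
        # f2 values are illustrative placeholders.
        self.likelihood = {"+": (0.75, 0.60), "-": (0.25, 0.20)}

    def classify(self, pmi_scores: tuple[float, float]) -> str:
        feats = [int(s > t) for s, t in zip(pmi_scores, self.thresholds)]
        posterior = {}
        for c in ("+", "-"):
            p = self.prior[c]          # P(C|X) ~ P(C) * prod_i P(f_i|C)
            for f, p1 in zip(feats, self.likelihood[c]):
                p *= p1 if f else (1 - p1)
            posterior[c] = p
        return max(posterior, key=posterior.get)

# e.g., ValidationNB().classify((0.5, 0.01)) -> "+"  (0.15 vs. 0.10)
```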

  18. Validate Instances via the Deep Web
  • Handles attributes that are difficult to validate via the Web, e.g., from
    • idea: submit the candidate instance through the source’s own query interface & check for results
  • Disadvantage: ambiguity when no results are found
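A hedged sketch of such a probe, assuming a GET-based form, a known parameter name, and a recognizable no-results marker (all of these are source-specific assumptions):

```python
import requests

def validate_via_source(form_url: str, field: str, value: str):
    """Submit the candidate through the source's own query interface and
    treat a non-empty result page as (weak) evidence of a valid instance.
    The URL, field name, and no-results marker are assumptions; real
    forms need per-source handling."""
    resp = requests.get(form_url, params={field: value}, timeout=10)
    page = resp.text.lower()
    if not resp.ok or "no results" in page or "0 results" in page:
        return None          # ambiguous: invalid value, or just no data
    return True
```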

  19. Architecture of the Assisted Matching System
  • Pipeline: source interfaces → instance acquisition → source interfaces with augmented instances → interface matcher → attribute matches
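Tying the pieces together, a sketch of the end-to-end flow, assuming the helper functions above and a hypothetical `Interface`/`matcher` data model:

```python
def augment_and_match(source_interfaces, matcher):
    """Instance acquisition first augments each attribute with discovered
    (and verified) instances; the interface matcher then produces the
    attribute matches."""
    for interface in source_interfaces:
        for attr in interface.attributes:
            found = extract_candidates(attr.label)                  # slides 10-11
            _, typed = apply_80_percent_rule(list(found) + attr.instances)
            attr.instances = discordance_filter(typed)              # slide 13
    return matcher.match(source_interfaces)
```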

  20. Empirical Evaluation
  • Five domains (domain list shown as a table on the slide)
  • Experiments:
    • baseline: IceQ [Wu et al., SIGMOD-04]
    • IceQ + Web assistance
  • Performance metrics: precision (P), recall (R) & F1 (= 2PR/(P+R))
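For reference, the F1 metric as one line of Python (the example numbers in the comment are made up, not the paper's):

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall: F1 = 2PR / (P + R)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# e.g., f1(0.97, 0.98) ≈ 0.975
```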

  21. Matching Accuracy
  • Web assistance boosts accuracy (F1) from 89.5% to 97.5%

  22. Overhead Analysis • Reasonable overhead: 6~11 minutes across domains

  23. Conclusion
  • Search problems on the Deep Web are increasingly crucial!
  • Novel QA-based approach to learning attribute instances
  • Incorporation into a state-of-the-art matching system
  • Extensive evaluation over varied real-world domains
  • More details: search for Wensheng Wu on Google
