
WebIQ: Learning from the Web to Match Deep-Web Query Interfaces


Presentation Transcript


  1. WebIQ: Learning from the Web to Match Deep-Web Query Interfaces Wensheng Wu Database & Information Systems Group University of Illinois, Urbana Joint work with AnHai Doan & Clement Yu ICDE, April 2006

  2. Search Problems on the Deep Web
  • Example query: find round-trip flights from Chicago to New York under $500
  • Relevant sources: united.com, airtravel.com, delta.com

  3. Solution: Build Data Integration Systems
  • A global query interface mediates the query (find round-trip flights from Chicago to New York under $500) over united.com, delta.com & airtravel.com
  • In effect: comparison shopping systems “on steroids”

  4. Current State of Affairs
  • Very active in both research communities & industry
  • Research
    • multidisciplinary efforts: Database, Web, KDD & AI
    • 10+ research groups in US, Asia & Europe
    • focuses: source discovery, schema matching & integration, query processing, data extraction
  • Industry
    • Transformic, Glenbrook Networks, WebScalers, PriceGrabber, Shopping.com, MySimon, Google, …

  5. Key Task: Schema Matching
  • (Figure: example query interfaces showing a 1-1 match and a complex match)

  6. Schema Matching is Ubiquitous!
  • Fundamental problem in numerous applications
    • data integration
    • data warehousing
    • peer data management
    • ontology merging
    • view integration
    • personal information management
  • Schema matching across Web sources
    • 30+ papers generated in past few years
    • Washington [AAAI-03, ICDE-05], Illinois [SIGMOD-03, SIGMOD-04, ICDE-06], MSR [VLDB-04], Binghamton [VLDB-03], HKST [VLDB-04], Utah [WebDB-05], …

  7. Schema Matching is Still Very Difficult
  • Must rely on properties of attributes, e.g., label & instances
  • Often there is little in common between matching attributes
  • Many attributes do not even have instances!
  • (Figure: examples of a 1-1 match and a complex match)

  8. Matching Performance Greatly Hampered by Pervasive Lack of Attribute Instances
  • 28.1%–74.6% of attributes have no instances
  • Extremely challenging to match these attributes
    • e.g., does departure city match from city or departure date?
  • Also difficult to match attributes with dissimilar instances
    • e.g., airline (with American airlines as instances) vs. carrier (with European ones)

  9. Our Solution: Exploit the Web
  • Discover instances from the Web
    • e.g., Chicago, New York, etc. for departure city & from city
  • Borrow instances from other attributes & validate them via the Web
    • e.g., check with the Web whether Air Canada is an instance of carrier

  10. Key Idea: Question-Answering from AI
  • Search the Web via search engines, e.g., Google
    • … but search engines do not understand natural-language questions
  • Idea: form extraction queries as sentences to be completed
    • “trick” the search engine into completing the sentences with instances
  • Example: attribute label departure city → extraction query “departure cities such as”

  11. Key Idea: Question-Answering from AI
  • Search Google with the extraction query & obtain snippets
    • e.g., “… other departure cities such as Boston, Chicago and LAX available …”
  • Extract instance candidates from the completion: Boston, Chicago, LAX
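The extraction step on slides 10-11 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `search_snippets` is a hypothetical stand-in for a search-engine API, the cue-phrase set is borrowed from the QA literature rather than the slides, and the naive list-splitting deliberately lets noisy candidates through (they are dealt with by the verification steps of slides 12-13).

```python
import re

# Hypothetical stand-in for a search-engine API (the talk uses Google);
# it should return result snippets for an exact-phrase query.
def search_snippets(query: str) -> list[str]:
    raise NotImplementedError("plug in a real search-engine API here")

def pluralize(label: str) -> str:
    # Naive pluralization of the label's head noun ("departure city" ->
    # "departure cities"); a real system would use proper morphology.
    *mods, head = label.split()
    head = head[:-1] + "ies" if head.endswith("y") else head + "s"
    return " ".join(mods + [head])

# Cue phrases for list extraction; the exact patterns WebIQ uses are not
# on the slides, so these are common QA-style examples.
EXTRACTION_CUES = ["{plural} such as", "such {plural} as", "{plural} including"]

def extract_candidates(label: str) -> set[str]:
    """Form extraction queries from an attribute label, search, and mine
    the returned snippets for the items that complete the sentence."""
    candidates: set[str] = set()
    for template in EXTRACTION_CUES:
        cue = template.format(plural=pluralize(label))
        for snippet in search_snippets(f'"{cue}"'):
            # Grab the text following the cue, e.g. "... departure cities
            # such as Boston, Chicago and LAX available ..." and split the
            # enumeration; trailing words like "available" slip through
            # here and are left for instance verification to remove.
            m = re.search(re.escape(cue) + r"\s+([^.;:]*)", snippet, re.I)
            if m:
                items = re.split(r",\s*|\s+and\s+", m.group(1))
                candidates.update(it.strip() for it in items if it.strip())
    return candidates
```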

  12. But Not Every Candidate is a True Instance
  • Reason 1: extraction queries may not be perfect
  • Reason 2: Web content is inherently noisy
  • Example: attribute city, extraction query “and other cities”, extracted candidate: 150
  • ⇒ need to perform instance verification

  13. Instance Verification: Outlier Detection
  • Goal: remove statistical outliers among the candidates
  • Step 1: pre-processing
    • recognize the type of the instances via pattern matching & the 80% rule
    • types: numeric & string
    • discard all candidates not of the determined type
    • e.g., most instance candidates for city are strings, so remove 150
  • Step 2: type-specific detection (see the sketch below)
    • perform discordance tests on test statistics, e.g.:
    • # of words: abnormal if a person name has more than 5 words
    • % of numeric characters: a US zip code contains only digits
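A sketch of both steps, under stated assumptions: the numeric pattern, the z = 2 cut-off, and the word-count statistic are illustrative choices; the slide only fixes the 80% rule and names example test statistics.

```python
import re
from statistics import mean, stdev

def apply_80_percent_rule(candidates: list[str]):
    """Step 1: recognize the instance type via pattern matching and the
    80% rule, then discard candidates not of the determined type."""
    def is_numeric(c: str) -> bool:
        return re.fullmatch(r"[\d.,]+", c) is not None
    numeric = [c for c in candidates if is_numeric(c)]
    strings = [c for c in candidates if not is_numeric(c)]
    if len(numeric) >= 0.8 * len(candidates):
        return "numeric", numeric
    if len(strings) >= 0.8 * len(candidates):
        return "string", strings   # e.g., drops the candidate "150" for city
    return "undetermined", candidates

def discordance_filter(candidates, statistic=lambda c: len(c.split()), z=2.0):
    """Step 2: type-specific discordance test; drop candidates whose test
    statistic (here: word count) lies more than z std-devs from the mean."""
    values = [statistic(c) for c in candidates]
    if len(values) < 2 or stdev(values) == 0:
        return candidates
    mu, sigma = mean(values), stdev(values)
    return [c for c, v in zip(candidates, values) if abs(v - mu) <= z * sigma]
```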

  14. Instance Verification: Web Validation
  • Goal: further semantic-level validation
  • Idea: exploit co-occurrence statistics of label & instances
    • “Make: Honda; Model: Accord”
    • “a variety of makes such as Honda, Mitsubishi”
  • Form validation queries by instantiating validation patterns with the label & the candidate
    • e.g., validation phrases V: “make Honda”, “makes such as Honda”
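The two phrasings on the slide can be generated from simple templates; a minimal sketch, with naive pluralization assumed (it happens to work for make → makes):

```python
# Validation patterns pair an attribute label with a candidate instance;
# the two forms shown on the slide are "<label> <instance>" and
# "<label>s such as <instance>".
VALIDATION_PATTERNS = ["{label} {instance}", "{label}s such as {instance}"]

def validation_queries(label: str, instance: str) -> list[str]:
    return [p.format(label=label, instance=instance) for p in VALIDATION_PATTERNS]

# validation_queries("make", "Honda") -> ["make Honda", "makes such as Honda"]
```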

  15. Instance Verification: Web Validation
  • Possible measure: NumHits(V + x)
    • e.g., NumHits(“cities such as Los Angeles”) = 26M
    • potential problem: biased towards popular instances
  • Instead use point-wise mutual information:
    • PMI(V, x) = NumHits(V + x) / (NumHits(V) × NumHits(x))
  • Example: V = “cities such as”; candidates: California, Los Angeles
    • NumHits(V + California) = 29
    • PMI(V, Los Angeles) ≈ 3000 × PMI(V, California)
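In code, the PMI score is one division over hit counts; `num_hits` is again a hypothetical search-engine call:

```python
def num_hits(query: str) -> int:
    """Hypothetical hit-count lookup for an exact-phrase Web query."""
    raise NotImplementedError("plug in a real search-engine API here")

def pmi(validation_phrase: str, candidate: str) -> float:
    """PMI(V, x) = NumHits(V + x) / (NumHits(V) * NumHits(x)).
    Dividing by NumHits(x) removes the bias toward popular instances
    that the raw co-occurrence count NumHits(V + x) suffers from."""
    joint = num_hits(f'"{validation_phrase} {candidate}"')
    marginal = num_hits(f'"{validation_phrase}"') * num_hits(f'"{candidate}"')
    return joint / marginal if marginal else 0.0
```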

  16. Validate Instances from Other Attributes
  • Method 1: discover k more instances from the Web
    • then check whether the borrowed one (Aer Lingus for Airline) is among them
    • problem: very likely Aer Lingus is not among the discovered instances
  • Method 2: compare its validation score with those of known instances
    • problem: the score for Aer Lingus may be much lower; how to decide?
  • Key observation: also compare to the scores of non-instances
    • e.g., Economy (with respect to Airline)

  17. Train Validation-Based Instance Classifier
  • Naïve Bayes classifier with validation-based features
    • validation phrases: V1 = “Airlines such as”, V2 = “Airline”
    • thresholds on the validation scores: t1 = .45, t2 = .075
  • Classification: P(C|X) ∝ P(C) · P(X|C), with P(+) = P(−) = ½
    • estimated likelihoods, e.g., P(f1=1|+) = 3/4, P(f1=1|−) = 1/4, …
  • A minimal sketch follows below
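A minimal sketch of the classifier, assuming the features are binary indicators of whether each PMI score clears its threshold; the priors and the f1 likelihoods come from the slide, while the f2 likelihoods are placeholder numbers:

```python
class ValidationNB:
    """Naive Bayes over binary validation features: f_i = 1 iff the PMI
    score under validation phrase V_i exceeds threshold t_i."""

    def __init__(self):
        self.thresholds = (0.45, 0.075)        # t1, t2 from the slide
        self.prior = {"+": 0.5, "-": 0.5}      # P(+) = P(-) = 1/2
        # P(f_i = 1 | class); f1 values are from the slide (3/4 and 1/4),
        # f2 values are illustrative placeholders.
        self.likelihood = {"+": (0.75, 0.60), "-": (0.25, 0.20)}

    def classify(self, pmi_scores: tuple[float, float]) -> str:
        feats = [int(s > t) for s, t in zip(pmi_scores, self.thresholds)]
        posterior = {}
        for c in ("+", "-"):
            p = self.prior[c]          # P(C|X) ~ P(C) * prod_i P(f_i|C)
            for f, p1 in zip(feats, self.likelihood[c]):
                p *= p1 if f else (1 - p1)
            posterior[c] = p
        return max(posterior, key=posterior.get)

# e.g., ValidationNB().classify((0.5, 0.01)) -> "+"  (0.15 vs. 0.10)
```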

  18. Validate Instances via the Deep Web
  • Handles attributes that are difficult to validate via the Web, e.g., from
    • idea: submit the candidate instance through the source’s own query interface & check for results
  • Disadvantage: ambiguity when no results are found
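A hedged sketch of such a probe, assuming a GET-based form, a known parameter name, and a recognizable no-results marker (all of these are source-specific assumptions):

```python
import requests

def validate_via_source(form_url: str, field: str, value: str):
    """Submit the candidate through the source's own query interface and
    treat a non-empty result page as (weak) evidence of a valid instance.
    The URL, field name, and no-results marker are assumptions; real
    forms need per-source handling."""
    resp = requests.get(form_url, params={field: value}, timeout=10)
    page = resp.text.lower()
    if not resp.ok or "no results" in page or "0 results" in page:
        return None          # ambiguous: invalid value, or just no data
    return True
```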

  19. Architecture of the Assisted Matching System
  • Pipeline: source interfaces → instance acquisition → source interfaces with augmented instances → interface matcher → attribute matches
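Tying the pieces together, a sketch of the end-to-end flow, assuming the helper functions above and a hypothetical `Interface`/`matcher` data model:

```python
def augment_and_match(source_interfaces, matcher):
    """Instance acquisition first augments each attribute with discovered
    (and verified) instances; the interface matcher then produces the
    attribute matches."""
    for interface in source_interfaces:
        for attr in interface.attributes:
            found = extract_candidates(attr.label)                  # slides 10-11
            _, typed = apply_80_percent_rule(list(found) + attr.instances)
            attr.instances = discordance_filter(typed)              # slide 13
    return matcher.match(source_interfaces)
```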

  20. Empirical Evaluation
  • Five domains (domain list shown as a table on the slide)
  • Experiments:
    • baseline: IceQ [Wu et al., SIGMOD-04]
    • IceQ + Web assistance
  • Performance metrics: precision (P), recall (R) & F1 (= 2PR/(P+R))
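For reference, the F1 metric as one line of Python (the example numbers in the comment are made up, not the paper's):

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall: F1 = 2PR / (P + R)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# e.g., f1(0.97, 0.98) ≈ 0.975
```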

  21. Matching Accuracy
  • Web assistance boosts accuracy (F1) from 89.5% to 97.5%

  22. Overhead Analysis • Reasonable overhead: 6~11 minutes across domains

  23. Conclusion
  • Search problems on the Deep Web are increasingly crucial!
  • Novel QA-based approach to learning attribute instances
  • Incorporation into a state-of-the-art matching system
  • Extensive evaluation over varied real-world domains
  • More details: search for Wensheng Wu on Google
