Presentation Transcript


  1. Light-weight Domain-based Form Assistant: Querying Web Databases On The Fly Authors: Z. Zhang, B. He, K. C.-C. Chang (Univ. of Illinois at Urbana-Champaign) Published in: Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005 Presented by: Bruce Vincent CSE-718 Seminar April 25, 2008

  2. Outline • Overview • Problem Description, Motivating Example • System Architecture • Design Approaches • Query Modeling and Translation • Dynamic Predicate Mapping • Implementation - Form Assistant Toolkit • Experiments • Related Work

  3. Problem Description • “Deep Web” • Estimated to contain 450,000 online databases (2004) • Sometimes referred to as “Invisible Web” or “Hidden Web” • Much of this is accessible only by query forms instead of static URL links • Common domains such as: books, cars, airfares

  4. Problem Description • Often it can be useful to query multiple alternative sources in the same domain • Automation of this entails several components • One key component is dynamic query translation • Software toolkit “Form Assistant” designed to provide potential translations of user queries for alternative sources • e.g., User-entered Amazon form query automatically translated to potential Barnes & Noble form query

  5. Problem Description • Goals of query translator: • Source-generality • Built-in translation must generally cope with new or “unseen” sources • Domain-portability • Translator must be easily customizable with domain-specific knowledge, and thus deployable for new domains

  6. Motivating Example • Source query Qs on source form S (e.g., Amazon) • Target query form T (e.g., Barnes & Noble)

  7. Motivating Example • Query Translation: source query Qs on source form S becomes a union query Qt* on target form T (two target form queries, both with author "Tom Clancy", joined by ∪) • Constraints the target form cannot express are applied afterward as a filter: σ title contain "red storm" and price < 35 and age > 12
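
To make the filter step concrete, here is a minimal Python sketch (not from the paper's toolkit) that applies the source query's extra constraints as a post-filter over records returned by the broader union query Qt*; the record fields and sample data are hypothetical.

```python
# Minimal sketch: apply the source query's remaining constraints as a
# post-filter over results returned by the broader union query Qt*.
# Field names and sample records are hypothetical.

def post_filter(results):
    """Keep only records that satisfy the original source query Qs."""
    return [
        r for r in results
        if "red storm" in r["title"].lower()
        and r["price"] < 35
        and r["age"] > 12
    ]

union_results = [
    {"title": "Red Storm Rising", "price": 27.99, "age": 16},
    {"title": "The Hunt for Red October", "price": 19.99, "age": 16},
]
print(post_filter(union_results))  # only the first record survives
```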

  8. System Architecture • Inputs: source query Qs and target query form (query interface) • Form Extractor: parses each form into predicate templates • Attribute Matcher: syntax-based schema matching, guided by a domain-specific thesaurus • Predicate Mapper: type-based, search-driven mapping, using domain-specific type handlers • Query Rewriter: constraint-based query rewriting • Output: target query Qt*
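
The four components can be read as a simple pipeline. The sketch below is an illustrative Python skeleton of that data flow only; every function body is a stub, and none of these names come from the toolkit's real API.

```python
# Illustrative skeleton of the Form Assistant data flow; stubs only.

def extract_form(html):
    """Form Extractor: parse a query form into predicate templates [attr; op; val]."""
    return {"templates": []}                      # placeholder

def match_attributes(source_form, target_form, thesaurus):
    """Attribute Matcher: syntax-based 1:1 schema matching using a domain thesaurus."""
    return {}                                     # placeholder: source attr -> target attr

def map_predicates(source_query, attribute_matches, type_handlers):
    """Predicate Mapper: type-based, search-driven choice of target operators/values."""
    return []                                     # placeholder: mapped target predicates

def rewrite_query(mapped_predicates, target_form):
    """Query Rewriter: constraint-based rewriting into a valid union query Qt*."""
    return {"union": []}                          # placeholder

def form_assistant(source_query, source_html, target_html, thesaurus, type_handlers):
    source_form = extract_form(source_html)
    target_form = extract_form(target_html)
    matches = match_attributes(source_form, target_form, thesaurus)
    mapped = map_predicates(source_query, matches, type_handlers)
    return rewrite_query(mapped, target_form)     # target query Qt*
```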

  9. Design Approaches • Query Modeling • Vocabulary and Syntax • Query Translation • Dynamic Predicate Mapping

  10. Query Modeling • Vocabulary • Predicate templates: { P1, P2, P3, P4, P5 } • Example (figure): each input field on a book-search form corresponds to one of the predicate templates P1–P5

  11. Query Modeling • Example Vocabulary (predicate templates) • P1 = [author; contain; $au] • P2 = [title; contain; $ti] • P3 = [subject; contain; $su] • P4 = [isbn; contain; $isbn] • P5 = [price; between; $s, $e] • Example Syntax (valid conjunctive forms) • F1 = P1 ∧ P5 • F2 = P2 ∧ P5 • F3 = P3 ∧ P5 • F4 = P4 ∧ P5 • F5 = P1 • F6 = P2 • F7 = P3 • F8 = P4

  12. Query Modeling • Example Vocabulary Instantiations • p1 = [author; contain; Tom Clancy] • p2 = [title; contain; red storm] • p51 = [price; between; 0-25] • p52 = [price; between; 25-45] • Corresponding Form Queries: • f1 = p1 ∧ p51 • f2 = p1 ∧ p52 • Resultant Union Query: • Qt = f1 ∪ f2
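
A minimal Python sketch of this query model, using the slide's [attribute; operator; value] notation; the class and field names are illustrative, not the paper's.

```python
# Minimal sketch of the query model: predicate templates instantiated
# into predicates, grouped into conjunctive form queries, combined into
# a union query.

from dataclasses import dataclass

@dataclass(frozen=True)
class Predicate:
    attribute: str
    operator: str
    value: tuple          # e.g. ("Tom Clancy",) or (0, 25) for a range

# Instantiated predicates from the slide
p1  = Predicate("author", "contain", ("Tom Clancy",))
p51 = Predicate("price",  "between", (0, 25))
p52 = Predicate("price",  "between", (25, 45))

# A form query is a conjunction (tuple) of predicates;
# a union query Qt is a list of form queries.
f1 = (p1, p51)
f2 = (p1, p52)
Qt = [f1, f2]     # Qt = f1 ∪ f2
```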

  13. Query Modeling • Syntax • Valid combinations of predicate templates: { F1, F2, F3, F4, F5, F6, F7, F8 } • Example (figure): filled-in form instances for F1 and F2, each marked 'v' for 'valid'

  14. Query Translation • Based on semantic closeness of query predicates: • Finds minimal subsuming Cmin • Benefits of this approach: • No false positives • Minimizes false negatives • Has clear semantics, independent of DB content • Modular translation

  15. Query Translation • Example (price ranges): • Source predicate s covers [0, 35] • Target form offers fixed ranges t1 = [0, 25], t2 = [25, 45], t3 = [45, 65] • t1 ∨ t2 covers [0, 45] and is the minimal subsuming choice Cmin • t1 ∨ t2 ∨ t3 covers [0, 65], which also subsumes s but is not minimal
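
A small Python sketch of this price-range example, assuming the target form offers fixed, ordered, non-overlapping ranges: it keeps exactly the target options that overlap the source range, which reproduces t1 ∨ t2 as Cmin.

```python
# Minimal sketch of the price example. Under the stated assumption,
# keeping every target option that overlaps the source range gives a
# subsuming union, and dropping any kept option would leave part of the
# source range uncovered, so the result is minimal.

def minimal_subsuming_ranges(source, target_options):
    lo, hi = source
    return [(a, b) for (a, b) in target_options if b > lo and a < hi]

s  = (0, 35)                          # source predicate s: price in [0, 35]
ts = [(0, 25), (25, 45), (45, 65)]    # target options t1, t2, t3
print(minimal_subsuming_ranges(s, ts))   # [(0, 25), (25, 45)] -> t1 ∨ t2 = Cmin
```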

  16. Query Translation • Definition: • Given source query Qs and target query form T, a query Qt* is a "minimal subsuming translation" w.r.t. T if: • 1. Qt* is a valid query w.r.t. T • 2. Qt* subsumes Qs • i.e., for any database instance Di, Qs(Di) ⊆ Qt*(Di) • 3. Qt* is minimal • i.e., there is no query Qt satisfying (1) and (2) above such that Qt* strictly subsumes Qt
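
Condition 2 can be checked directly for one-dimensional range predicates. The sketch below is an illustrative simplification, not the paper's algorithm: it tests whether a union of target price ranges covers the source range, which is exactly what separates Qt1 from Qt2 in the next slide.

```python
# Minimal sketch: 1-D subsumption test for range predicates. A union of
# target ranges subsumes the source range iff it covers it completely.

def covers(source_range, target_ranges):
    lo, hi = source_range
    reached = lo
    for a, b in sorted(target_ranges):
        if a <= reached < b:       # this range extends the covered prefix
            reached = b
        if reached >= hi:
            return True
    return reached >= hi

print(covers((0, 35), [(0, 25), (25, 45)]))  # True  -> Qt1 subsumes Qs
print(covers((0, 35), [(25, 45)]))           # False -> Qt2 does not
```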

  17. Query Translation • Example: consider the source query Qs from the earlier example and three candidate target queries • Qt1 = (f1: p1 ∧ p51) ∪ (f2: p1 ∧ p52) • Qt2 = f2 • Qt3 = f3: p1 • where p1 = [author; contain; Tom Clancy], p51 = [price; between; 0-25], p52 = [price; between; 25-45] • Qt1 and Qt3 subsume Qs, while Qt2 does not (it misses the price range 0-25 and thus cannot be the best translation Cmin) • Qt3 is pruned because it subsumes Qt1 • That leaves Qt1 as Cmin

  18. Dynamic Predicate Mapping • Tasks: • Choose the target operator • Fill in the target values • Objective: • The mapped target predicate should minimally subsume the source predicate
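
A minimal Python sketch of these two tasks for a numeric attribute, under the assumption that the target form exposes either a "less than" select list or fixed "between" ranges; the operator names and the preference rule are illustrative, not the toolkit's actual type handlers.

```python
# Minimal sketch: map a source price range onto one of several target
# predicate templates by choosing an operator and filling in values so
# that the result subsumes the source. Operator names are illustrative.

def map_price(src_lo, src_hi, target_templates):
    """target_templates: dict mapping operator -> offered values/ranges."""
    candidates = []
    if "less_than" in target_templates:
        thresholds = [v for v in target_templates["less_than"] if v >= src_hi]
        if thresholds:
            # smallest offered threshold that still covers the source range
            candidates.append(("less_than", min(thresholds)))
    if "between" in target_templates:
        ranges = [r for r in target_templates["between"]
                  if r[1] > src_lo and r[0] < src_hi]
        if ranges:
            # the fixed ranges whose union covers the source range
            candidates.append(("between", ranges))
    # return the first viable candidate; the real mapper searches for the
    # minimal subsuming choice among all target templates
    return candidates[0] if candidates else None

# Source predicate "price between 0 and 35"; target offers a < select list:
print(map_price(0, 35, {"less_than": [25, 45, 65]}))  # ('less_than', 45)
```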

  19. Dynamic Predicate Mapping • Example (figure): an input source predicate and the output target predicates produced by Predicate Mapping, combined with ∪

  20. System Architecture (reminder) • Same pipeline as slide 8: Form Extractor → Attribute Matcher (domain-specific thesaurus) → Predicate Mapper (domain-specific type handlers) → Query Rewriter, producing target query Qt*

  21. Implementation – Form Assistant Toolkit • Form Extractor • Parses HTML into query predicate templates [attr; op; val] • Details discussed in a different paper [3.] by the same research group • Attribute Matcher (1:1) • Identifies semantically corresponding attributes between forms • Customized with a domain thesaurus (indexes synonyms for commonly used concepts) • Stems words (e.g., "children" -> "child") and removes stop words (e.g., "the") • Attributes matched by value type and synonyms • Predicate Mapper (discussed in previous slides) • Query Rewriter • Well-studied problem of finding the minimal subsuming query for a given predicate-mapped query (uses the approach of [5.] by Papakonstantinou et al.)
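
An illustrative Python sketch of the Attribute Matcher's normalization and thesaurus lookup; the stop-word list, the suffix-stripping rule, and the thesaurus entries are made-up placeholders for the domain-specific resources the toolkit would supply.

```python
# Illustrative sketch of 1:1 attribute matching: normalize field labels
# (lowercase, drop stop words, crude suffix stemming), then look both the
# source label and candidate target labels up in a small domain thesaurus.

STOP_WORDS = {"the", "of", "a", "an"}
THESAURUS = {
    "author": {"author", "writer"},
    "title": {"title", "book title"},
}

def normalize(label):
    words = [w.rstrip("s")                      # crude stand-in for real stemming
             for w in label.lower().split()
             if w not in STOP_WORDS]
    return " ".join(sorted(words))

def match_attribute(source_label, target_labels):
    """Return the target label naming the same concept, or None."""
    src = normalize(source_label)
    for synonyms in THESAURUS.values():
        normalized = {normalize(s) for s in synonyms}
        if src in normalized:
            for tgt in target_labels:
                if normalize(tgt) in normalized:
                    return tgt
    return None

print(match_attribute("Writer", ["Author", "Title of the Book"]))  # Author
```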

  22. Experiments • Datasets • 447 Deep Web sources (query forms) in 8 domains • 3 "Basic" domains – each with a custom thesaurus in FA • Books, Airfares, Automobiles • 5 "New" domains (no domain thesaurus provided for these tests) • Car Rentals, Jobs, Hotels, Movies, Music/Records • Test Approach • Run FA to translate 120 form queries • Each translation test corresponds to a random pairing of sources within a domain • Count the correct mappings in the translation suggested by FA • Indicates the amount of user effort the Form Assistant has saved

  23. Experiments • Results: Accuracy Distributions (charts for the Basic and New datasets) • X axis: % correct predicate translations; Y axis: % of tested query forms • Forms with all 1:1 mappings had 87% perfect accuracy for the Basic dataset and 85% for the New dataset (good domain flexibility) • Forms with complex mappings: 76% and 70% were "near perfect" (> 80%) • FA did not attempt complex (n:m) mappings, such as a full name in the source mapping to separate first and last names in the target

  24. Experiments • Accuracy ratio: correct results per 1:1 query • Raw: includes some forms whose input form-extraction step had errors • Perfect: manually forces all form extractions to be correct • Avg. accuracy improves when the extraction step is perfectly correct: • for the Basic dataset (3 domains), 90.4% improves to 96.1% • for the New dataset (5 domains), 81.1% improves to 86.7%

  25. Experiments • Example Error in Form Extraction • delta.com form has link to alternative reservation page • “One-way & multi-city reservations” • Wrongly interpreted by Form Extractor as input field label (attribute)

  26. Experiments • Error Distribution (% of errors caused by each component): Form Extraction 40%, Predicate Mapping 42%, Attribute Matching 18% • Fewest errors are due to Attribute Matching; most errors are due to Predicate Mapping • Cited reason for Predicate Mapping errors is insufficient domain knowledge • Example failure: the source subject value "computer science" did not properly map to the target subject value "programming languages" • Improvement could entail better domain-specific ontologies and type handlers

  27. Related Work • From the same research group: • Complex Matchings (n:m) • Defines “Type Recognizer” used in Form Assistant’s Attribute Matcher, and discusses complex n:m matchings not attempted by Form Assistant: • [1.] Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach. B. He, K. C.-C. Chang, and J. Han. In Proceedings of the 2004 ACM SIGKDD Conference (KDD 2004) (Full Paper), Seattle, Washington, August 2004 • MetaQuerier System • Fuller system for both exploring (to find) and integrating (to query) Deep Web databases: • [2.] Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web. K. C.-C. Chang, B. He, and Z. Zhang. In Proceedings of the Second Conference on Innovative Data Systems Research (CIDR 2005), Asilomar, California, January 2005

  28. Related Work • From the same research group: • Form Extraction • As used by the implementation of Form Assistant: • [3.] Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax. Z. Zhang, B. He, and K. C.-C. Chang. In Proceedings of the 2004 ACM SIGMOD Conference (SIGMOD 2004), Paris, France, June 2004 • A thorough 2007 analysis of the Deep Web • Interesting survey of web databases and query interfaces: • [4.] Accessing the Deep Web: A Survey. B. He, M. Patel, Z. Zhang, and K. C.-C. Chang. Communications of the ACM (CACM), 50(5):94-101, May 2007 • Public Datasets • Cached real-world query form web pages (used in experiments): • http://metaquerier.cs.uiuc.edu/repository/datasets/tel-8 • Additional Deep Web integration resources: • http://metaquerier.cs.uiuc.edu/repository

  29. Related Work • Query Rewriting • As used by the implementation of Form Assistant: • [5.] Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, and J. Ullman. A Query Translation Scheme for Rapid Implementation of Wrappers. In Proceedings of the Fourth International Conference on Deductive and Object-Oriented Databases, Singapore, December 1995.

  30. Thank you!
