
Inferring XML Schema Definitions from XML Data

Inferring XML Schema Definitions from XML Data. Geert Jan Bex, Frank Neven, Stijn Vansummeren. Hasselt Univ. and Transnational Univ. of Limburg, Belgium. VLDB 2007. 2008. 02. 15. Summarized by Chulki Lee, IDS Lab., Seoul National University


Presentation Transcript


  1. Inferring XML Schema Definitions from XML Data • Geert Jan Bex, Frank Neven, Stijn Vansummeren • Hasselt Univ. and Transnational Univ. of Limburg, Belgium • VLDB 2007 • 2008. 02. 15. • Summarized and presented by Chulki Lee, IDS Lab., Seoul National University

  2. Inferring XML Schema • Why schemas? • automation & optimization of search • integration of XML data sources • … • Why infer schemas? • 50% of XML documents on the web have no schema • 33% of schemas are not valid • Why infer XSD (XML Schema Definition)? • DTD (Document Type Definition) has limitations • an element's type depends only on the element's name, not on its path

  3. Example: DTD vs. XSD [figure not reproduced; labels: name, type]

  4. Theorem • In general, XSDs cannot be learned from an XML corpus using positive data only • In an XSD, the content model of an element is uniquely determined by the path from the root to that element

  5. Observation: local context • An XSD is k-local if its content models depend only on labels up to the k-th ancestor • for 98% of XSDs, k = 2
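
The k-local observation can be illustrated with a minimal Python sketch. It groups the child-name sequences of every element by the last k labels on its root path, so that each group would feed one content model. The function name `child_strings_by_context`, the sample document, and the convention that the element's own label counts as the first of the k labels are assumptions made for illustration, not taken from the slides.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: "name" has different content under "book" and "shop".
DOC = "<lib><book><name><first/><last/></name></book><shop><name/></shop></lib>"


def child_strings_by_context(xml_text, k):
    """Group each element's child-name sequence by its k-label context.

    Assumption: the context is the last k labels of the root path,
    counting the element's own label (k >= 1).
    """
    root = ET.fromstring(xml_text)
    samples = defaultdict(list)

    def walk(elem, path):
        path = path + (elem.tag,)
        context = path[-k:]  # the element's label plus up to k-1 ancestors
        samples[context].append(tuple(child.tag for child in elem))
        for child in elem:
            walk(child, path)

    walk(root, ())
    return dict(samples)


by_name_only = child_strings_by_context(DOC, 1)
by_parent_too = child_strings_by_context(DOC, 2)
```

With k = 1 both `name` elements fall into one context, which is the DTD-like situation; with k = 2 the parent label separates them into `("book", "name")` and `("shop", "name")`.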

  6. Observation: SORE • Single Occurrence Regular Expression (SORE): no element name is duplicated • a SORE: title, (author, affiliation?)+, abstract • not a SORE: title, ((author, affiliation)+ + (editor, affiliation)+), abstract (affiliation occurs twice) • 99% of regular expressions are single occurrence
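
A rough way to test the single occurrence property is to count how often each element name appears in the expression. The sketch below does only that. The helper `is_single_occurrence` and its name-matching pattern are illustrative assumptions, and the expression syntax itself is not validated.

```python
import re
from collections import Counter


def is_single_occurrence(expr):
    """Return True if every element name occurs at most once in expr."""
    # Assumption: element names start with a letter or underscore;
    # operators such as , ? + * ( ) are simply skipped.
    names = re.findall(r"[A-Za-z_][\w.-]*", expr)
    return all(count == 1 for count in Counter(names).values())


sore = "title, (author, affiliation?)+, abstract"
not_sore = "title, ((author, affiliation)+ + (editor, affiliation)+), abstract"
```

Here `is_single_occurrence(sore)` holds, while `not_sore` fails the check because `affiliation` appears twice.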

  7. Proposed Algorithms • SOA: Single Occurrence Automaton • Theorem • XSDs with local context and SORE content models are learnable from positive examples only (given a sufficiently large corpus) • iLocal = iSOA + ToSORE + MINIMIZE • infers a k-local, single occurrence target XSD • iXSD = iLocal & REDUCE • REDUCE = unify sufficiently similar types

  8. Algorithm: iLocal (1/4)

  9. Algorithm: iLocal (2/4)

  10. Algorithm: iLocal (3/4) • iSOA: build an SOA from strings • ToSORE: translate SOA → SORE
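
As a rough illustration of the iSOA step, the following sketch builds an automaton with one state per element name plus a source and a sink, and adds an edge for every pair of adjacent names seen in the sample strings. The names `infer_soa`, `accepts`, `SRC`, and `SNK` are hypothetical, and the slides do not spell out the construction, so this is a sketch under those assumptions rather than the authors' code. The ToSORE step is not sketched.

```python
SRC, SNK = "<src>", "<snk>"  # hypothetical labels for the source and sink states


def infer_soa(strings):
    """Collect the edge set of a single occurrence automaton from sample strings."""
    edges = set()
    for string in strings:
        prev = SRC
        for symbol in string:
            edges.add((prev, symbol))
            prev = symbol
        edges.add((prev, SNK))
    return edges


def accepts(edges, string):
    """Check whether the automaton described by edges accepts string."""
    prev = SRC
    for symbol in string:
        if (prev, symbol) not in edges:
            return False
        prev = symbol
    return (prev, SNK) in edges


samples = [
    ("title", "author", "abstract"),
    ("title", "author", "author", "abstract"),
]
soa = infer_soa(samples)
```

Because each name has exactly one state, the automaton generalizes beyond the samples: `soa` also accepts a string with three `author` elements, although only one and two were observed.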

  11. Algorithm: iLocal (4/4)

  12. Algorithm: iXSD • with incomplete data, iLocal derives too many types • REDUCE: practical heuristics • define a distance between types • for types s and t: if distance(s, t) < ε, unify s and t
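
The thresholded unification in REDUCE can be sketched as follows. The slide does not say how the distance is defined, so the sketch substitutes Jaccard distance over automaton edge sets purely as a stand-in. The greedy single pass and the names `jaccard_distance` and `reduce_types` are likewise assumptions and should not be read as the paper's heuristic.

```python
def jaccard_distance(a, b):
    """Stand-in distance between two types, each given as a set of edges."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def reduce_types(types, eps):
    """Greedily unify types whose distance to an existing group is below eps.

    types maps a type name to its edge set; the result is a list of
    (names, merged_edges) groups. The input sets are not modified.
    """
    groups = []
    for name, edges in types.items():
        for names, merged in groups:
            if jaccard_distance(merged, edges) < eps:
                names.append(name)
                merged.update(edges)  # unify: the group keeps the union of edges
                break
        else:
            groups.append(([name], set(edges)))
    return groups


types = {
    "t1": {("a", "b"), ("b", "c")},
    "t2": {("a", "b"), ("b", "c"), ("c", "d")},
    "t3": {("x", "y")},
}
loose = reduce_types(types, 0.5)   # t1 and t2 are close enough to unify
strict = reduce_types(types, 0.2)  # nothing is unified
```

Raising `eps` merges more types and lowering it keeps more of them apart, which is the trade-off the threshold ε controls.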

  13. Experiments • 8 schemas & 200 generated documents for each schema • schemas: 12~23 types with unbounded depth and width • local with k = 2, 3 • types of iXSD imprecisions: • the content models of a target type and its inferred type can differ • inherent in learning from positive examples, can't be avoided • a type in the target XSD can correspond to multiple types in the inferred XSD: false positives • a type in the inferred XSD can correspond to multiple types in the target XSD: false negatives • a type in the target XSD is not derived • due to an incomplete corpus, can't be avoided

  14. Experiments • k = 3, parsing 697 XSDs (40 MB), Pentium M 1.73 → 17 seconds • k = 2, without REDUCE → 29 false positives • power of REDUCE • Sensitivity to parameters • context size k ↑ ⇒ false positives ↑, false negatives ↓ • ε ↑ ⇒ false positives ↓, false negatives ↑

  15. Experiments • iXSD derives good XSDs from small training sets (50 or more documents)

  16. Conclusions • Two algorithms proposed • iLocal – sound & k-complete • iXSD – deals with poor data • good performance on real-world data • good runtime performance • Future work • determine the best locality k
