
Part One XML and Databases

Explore the role of XML and databases in the web, focusing on content representation and storage of semi-structured data. Learn about query processing, optimization, and search modalities.


Presentation Transcript


  1. Part One: XML and Databases • Soumen Chakrabarti, CSE, IIT Bombay

  2. Form and content • The Web today • HTML generated by hand, WYSIWYG editors, ‘webified’ databases • HTML specifies rendering for human reading • Screen scraping required to consolidate data • The Web in the future • Common interchange format (XML) • Concentrate on content, not form • Represent a class of data broader than relations

  3. Role of databases • Contribute • Data storage and indexing • Query processing and optimization • Views, transformations, integration • Adopt • Search modalities • Content-based approximate search • Linguistic analysis

  4. Features of semi-structured data • No explicit schema, or volatile schema • Schema size comparable to data size • Structure changes without notice • Heterogeneous, deeply nested, irregular • Has nature of documents rather than tables

  5. Semi-structured data model example: Object Exchange Model (OEM) • [Figure: OEM graph rooted at the complex object Bib (&o1), with paper and book children (&o12, &o24, &o29) connected by references edges; edge labels include author, title, year, page, publisher, http, firstname, lastname, first, last; atomic objects include “Serge”, “Abiteboul”, “Victor”, “Vianu”, 1997, 122, 133]

  6. Syntax { paper: { author: “Abiteboul”, author: { firstname: “Victor”, lastname: “Vianu”}, title: “Regular path queries …”, page: { first: 122, last: 133 } } }
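
The syntax above repeats the label author inside one object, which an ordinary key/value dictionary cannot hold. A minimal sketch of one way to represent it, assuming each complex object is a list of (label, value) pairs; the helper names `children` and `follow` are hypothetical and not part of OEM or Lore:

```python
# An OEM complex object as a list of (label, value) pairs; labels may repeat.
# Atomic objects are plain Python strings or ints.
paper = [
    ("author", "Abiteboul"),
    ("author", [("firstname", "Victor"), ("lastname", "Vianu")]),
    ("title", "Regular path queries …"),
    ("page", [("first", 122), ("last", 133)]),
]
bib = [("paper", paper)]


def children(obj, label):
    """Return all values reached from a complex object via edges named `label`."""
    if not isinstance(obj, list):  # atomic objects have no outgoing edges
        return []
    return [value for (lbl, value) in obj if lbl == label]


def follow(obj, path):
    """Follow a dotted label path such as 'paper.page.first' starting at `obj`."""
    frontier = [obj]
    for label in path.split("."):
        frontier = [v for o in frontier for v in children(o, label)]
    return frontier


print(follow(bib, "paper.page.first"))       # [122]
print(follow(bib, "paper.author.lastname"))  # ['Vianu']
```

The second path skips the atomic author “Abiteboul” without an error, which is the kind of irregularity the data model is meant to tolerate.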

  7. Some observations • Missing or additional attributes • Multiple attributes • Different types in different objects • Heterogeneous collections

  8. Object IDs and references • <person id="o555"><name>Jane</name></person> • <person id="o456"><name>Mary</name><children idref="o123 o555"/></person> • <person id="o123" mother="o456"><name>John</name></person> • [Figure: o456 has children edges to o555 and o123; o123 has a mother edge to o456]

  9. Names and acronyms • OEM (Object Exchange Model): a semi-structured data model from Stanford, 1995 • Lore: a system for storing data adhering to the OEM • Lorel: a query language for Lore • XML (eXtensible Markup Language): a simplification of SGML and a generalization of HTML • XML-QL: Query language for XML

  10. Lorel query examples • select Bib.paper.title from Bib.paper where Bib.paper.year > 1995 • select X.title from Bib.paper X, Bib.(paper|book) Y where Y.author.lastname? = “Ullman” and Y.reference+ X • Callouts on the second query: alternative (paper|book), navigating partially known structures (lastname?), transitive closure (reference+)

  11. XML-QL query examples • where <book language="french"> <publisher><name>Morgan Kaufmann</name></publisher> <author> $a </author> </book> in "www.a.b.c/bib.xml" construct $a • where <book language=$l> <author> $a </> </> in "www.a.b.c/bib.xml" construct <result><author>$a</><lang>$l</></>

  12. XML storage in ternary relation • Too many joins • Label name storage redundant • [Figure: tables Ref and Val; &o1 has a paper edge to &o2, which has year, title, author, author edges to &o3, &o4, &o5, &o6; atomic values “The Calculus”, “…”, “…”, “1986”]
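
The "too many joins" point can be illustrated with an edge table in SQLite. This is a sketch under stated assumptions: the column names (`src`, `label`, `dst`, `oid`, `value`) and the assignment of the title and year values to particular object IDs are guesses at the figure, and the two author values stay elided as on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Ref (src TEXT, label TEXT, dst TEXT);  -- one row per labeled edge
CREATE TABLE Val (oid TEXT, value TEXT);            -- one row per atomic object
""")
conn.executemany("INSERT INTO Ref VALUES (?, ?, ?)", [
    ("o1", "paper", "o2"),
    ("o2", "title", "o3"),
    ("o2", "author", "o4"),
    ("o2", "author", "o5"),
    ("o2", "year", "o6"),
])
conn.executemany("INSERT INTO Val VALUES (?, ?)", [
    ("o3", "The Calculus"), ("o4", "…"), ("o5", "…"), ("o6", "1986"),
])

# "Titles of papers published in 1986": every path step costs one join on Ref,
# plus one join on Val for each atomic value that is read or tested.
rows = conn.execute("""
SELECT tv.value
FROM Ref p
JOIN Ref y  ON y.src = p.dst AND y.label = 'year'
JOIN Val yv ON yv.oid = y.dst
JOIN Ref t  ON t.src = p.dst AND t.label = 'title'
JOIN Val tv ON tv.oid = t.dst
WHERE p.src = 'o1' AND p.label = 'paper' AND yv.value = '1986'
""").fetchall()
print(rows)  # [('The Calculus',)]
```

A two-attribute lookup already needs four joins, and the label strings 'paper', 'year', 'title' and 'author' are stored once per edge rather than once per schema element.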

  13. Storage optimization through mining • Inline common cases • Tolerate a few nulls • [Figure: paper objects with varying numbers of title, year and author (fn, ln) children, mapped to tables Paper1 and Paper2]

  14. Schema extraction • Schema: a template for type/semantics specification • Conformance • Does the data conform to a given schema? • Classification • If so, which objects belong to which classes/types? • Applications • Storage and query optimization

  15. Graph simulation • Given two edge-labeled graphs G1 and G2, a simulation is a relation R between their nodes such that if (x1, x2) is in R and (x1, a, y1) is in G1, then there exists (x2, a, y2) in G2 (same label) such that (y1, y2) is in R • [Figure: edge x1 to y1 labeled a in G1, edge x2 to y2 labeled a in G2, with R relating x1 to x2 and y1 to y2]
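
The definition of simulation can be turned into a small fixed-point computation: start from all node pairs and delete pairs that violate the condition until nothing changes. The sketch below is a naive version written for clarity, not the efficient algorithm a real system would use; the function name `maximal_simulation` and the toy graphs are assumptions for illustration.

```python
def maximal_simulation(g1, g2):
    """Largest simulation from g1 to g2; each graph is a set of (source, label, target) edges."""
    nodes1 = {n for (s, _, t) in g1 for n in (s, t)}
    nodes2 = {n for (s, _, t) in g2 for n in (s, t)}
    rel = {(x1, x2) for x1 in nodes1 for x2 in nodes2}  # start with every pair
    changed = True
    while changed:
        changed = False
        for (x1, x2) in list(rel):
            for (s1, a, y1) in g1:
                if s1 != x1:
                    continue
                # x2 must have an edge with the same label to some y2 with (y1, y2) in rel
                if not any(s2 == x2 and b == a and (y1, y2) in rel
                           for (s2, b, y2) in g2):
                    rel.discard((x1, x2))
                    changed = True
                    break
    return rel


# Toy data graph D and schema graph S (hypothetical, much smaller than the slides' example).
D = {("r", "employee", "p1"), ("r", "employee", "p2"), ("p1", "manages", "p2")}
S = {("Root", "employee", "Emp"), ("Emp", "manages", "Emp")}

R = maximal_simulation(D, S)
print(sorted(R))
```

Here ("p1", "Root") is dropped because Root has no manages edge, while ("r", "Root") survives, so D conforms to S in the data-to-schema direction.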

  16. Upper and lower bound schema • Lower bound schema • Conformance: find simulation R from S to D • Classification: check if (c,x) in R • Used in storage optimization • Upper bound schema (data guides) • Conformance: find simulation R from D to S • Classification: check if (x,c) in R • Used in path index generation and query optimization

  17. Sample data • [Figure: root &r with employee edges to &p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8 and a company edge to &c; manages and managedby edges among the employees; worksfor edges from the employees to &c]

  18. Lower bound schema • [Figure: Root (&r) has employee edges to Bosses (&p1, &p4, &p6) and Regulars (&p2, &p3, &p5, &p7, &p8) and a company edge to Company (&c); manages and managedby edges link Bosses and Regulars; worksfor edges lead to Company]

  19. Storage using lower bound schema • Store rest in overflow graph • [Figure: lower-bound schema with Root, person and company edges, Employee and Company classes, works-for, managed-by and c.e.o. edges, and name and address attributes of type string]

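  20. Upper bound schema (DataGuides) • [Figure: Root (&r) has an employee edge to Employees (&p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8) and a company edge to Company (&c); Employees has manages, managedby and worksfor edges leading to Regulars (&p2, &p3, &p5, &p7, &p8), Bosses (&p1, &p4, &p6) and Company; Bosses and Regulars carry their own manages, managedby and worksfor edges]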
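
One common way to build such an upper bound schema is a subset construction, as when determinizing an automaton: each schema node is the set of data objects reachable by some label path from the root. The slides do not say which construction they assume, so the sketch below is only one possibility; `build_dataguide` and the three-employee data set are hypothetical.

```python
from collections import defaultdict


def build_dataguide(edges, root):
    """Map each reachable target set (a frozenset of objects) to its {label: target set} edges."""
    out = defaultdict(lambda: defaultdict(set))
    for src, label, dst in edges:
        out[src][label].add(dst)

    guide = {}
    todo = [frozenset([root])]
    while todo:
        node = todo.pop()
        if node in guide:  # this target set was already expanded
            continue
        targets = defaultdict(set)
        for obj in node:
            for label, dsts in out.get(obj, {}).items():
                targets[label] |= dsts
        guide[node] = {label: frozenset(d) for label, d in targets.items()}
        todo.extend(guide[node].values())
    return guide


# Hypothetical miniature of the sample data: one boss (p1) and two regulars (p2, p3).
edges = [
    ("r", "employee", "p1"), ("r", "employee", "p2"), ("r", "employee", "p3"),
    ("r", "company", "c"),
    ("p1", "manages", "p2"), ("p1", "manages", "p3"),
    ("p2", "managedby", "p1"), ("p3", "managedby", "p1"),
    ("p1", "worksfor", "c"), ("p2", "worksfor", "c"), ("p3", "worksfor", "c"),
]
guide = build_dataguide(edges, "r")
for node, labels in guide.items():
    print(sorted(node), {lbl: sorted(t) for lbl, t in labels.items()})
```

On this input the guide has five nodes, corresponding to the root, all employees, the regulars, the boss and the company, and every label path in the data appears exactly once, which is what makes the structure usable as a path index.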

  21. Query optimization issues • select x from A.B x where exists y in x.C: y = 5 • [Figure: data tree with A, B, C and D nodes; the C leaves hold the values 4 or 5]

  22. What makes the problem difficult • Selectivity estimation • Index selection • Access cost models • Clustering choices

  23. Part Two: Information Retrieval and Databases • Soumen Chakrabarti, CSE, IIT Bombay

  24. Information retrieval (IR) • Search • ‘Inverted’ index • Boolean match • Relevance ranking • Classification • Learn topics from examples • Clustering • Discover topics from a document collection • Never done inside a relational database • [Figure: inverted index for the terms cat and dog, with postings listing document IDs and word positions: D5: 3, 37, 50; D7: 9, 20; D7: 7, 90, 400; D20: 22, 533]
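
An inverted index with word positions and a Boolean AND match can be sketched in a few lines. The documents below are invented for illustration and do not reproduce the postings in the figure; `build_index` and `boolean_and` are hypothetical names.

```python
from collections import defaultdict


def build_index(docs):
    """Positional inverted index: term -> {doc_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def boolean_and(index, *terms):
    """IDs of documents that contain every one of the given terms."""
    postings = [set(index[t]) if t in index else set() for t in terms]
    return set.intersection(*postings) if postings else set()


docs = {
    "D5": "the cat sat",
    "D7": "dog chases cat",
    "D20": "dog barks",
}
index = build_index(docs)
print(dict(index["cat"]))                # {'D5': [1], 'D7': [2]}
print(boolean_and(index, "cat", "dog"))  # {'D7'}
```

Keeping positions in the postings is what later allows phrase and proximity matching on top of plain Boolean match.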

  25. Current style of loose integration • RDBMS provides hooks • Declare some columns as textual with keyword index • Inserts, updates, and deletes trigger external program, e.g., Verity search engine • Search engine maintains separate indices • Simple query rewriting to combine relational and text-match where-clauses

  26. Reasons • Space • BLOB vs. pure relational representation • Average English word is only 5 bytes • Time • Most text engines are resigned to flexible (i.e., no) model for data consistency • Much faster read-only access than relational database lookups

  27. New features desired • Operations that are more complex than keyword search can benefit from tighter coupling with RDBMS • Approximate search is essential (Anand Rajaraman, Amazon.com, SIGMOD 99) • Misspelling book title, author name common • Variant of OEM edge label (author/writer/poet) • Similarity extends to structure as well (‘Travolta’ NEAR ‘Cage’ = ‘Face/Off’)

  28. Case study: generalized ‘like’ • SQL has limited string matching constructs • like ‘%x’, ‘x%’, ‘%x%’ • x must be exact match • Need more lenient match • Applications: LDAP, IR • String edit distance is not suitable • “Given query, order strings in database in increasing order of edit distance and pick top 5”
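
The quoted query can be answered by brute force, as in the sketch below. It assumes Levenshtein distance with unit costs, since the slide does not say which edit distance is meant, and `edit_distance` and `top_k` are hypothetical names. Every stored string is compared with the query, which is one plausible reason for calling edit distance unsuitable; the slide itself does not state the reason.

```python
import heapq


def edit_distance(a, b):
    """Levenshtein distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def top_k(query, strings, k=5):
    """The k stored strings closest to the query; scans the whole collection."""
    return heapq.nsmallest(k, strings, key=lambda s: edit_distance(query, s))


names = ["nascent", "pascal", "rascal"]
print(top_k("rascal", names, k=2))  # ['rascal', 'pascal']
```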

  29. Sliding-window matching • Given a query, scan to get a set of 3-grams • Similarity of string in database to query = number of shared 3-grams • [Figure: the strings rascal, nascent and pascal linked to their 3-grams: nas, asc, sce, cen, ent, pas, sca, cal, ras]
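
The 3-gram similarity can be sketched with sets, plus an inverted index from 3-grams to stored strings so that a query only touches strings sharing at least one 3-gram with it. The names `trigrams`, `similarity`, `build_gram_index` and `rank` are hypothetical, and no padding is applied at the string boundaries, which is an assumption.

```python
from collections import Counter, defaultdict


def trigrams(s):
    """Set of all length-3 substrings obtained by sliding a window over s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def similarity(query, candidate):
    """Number of distinct 3-grams shared by the two strings."""
    return len(trigrams(query) & trigrams(candidate))


def build_gram_index(strings):
    """Inverted index: 3-gram -> set of stored strings containing it."""
    index = defaultdict(set)
    for s in strings:
        for g in trigrams(s):
            index[g].add(s)
    return index


def rank(index, query):
    """Stored strings ordered by decreasing number of 3-grams shared with the query."""
    counts = Counter()
    for g in trigrams(query):
        for s in index.get(g, ()):
            counts[s] += 1
    return counts.most_common()


gram_index = build_gram_index(["rascal", "nascent", "pascal"])
print(similarity("rascal", "pascal"))  # 3  (asc, sca, cal)
print(rank(gram_index, "rascal"))      # [('rascal', 4), ('pascal', 3), ('nascent', 1)]
```

Because the 3-gram index is an ordinary inverted index, it could in principle be kept in a relational table keyed on the 3-gram, which fits the goal of a minimally disruptive architecture.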

  30. Issues • Minimally disruptive architecture • Low storage overheads • Fast query processing • Good selectivity estimates • Combining with other predicates for ranking • Efficiently handling updates
