
Some of my XML/Internet Research Projects

Some of my XML/Internet Research Projects. CSCI 6530 October 5, 2005 Kwok-Bun Yue University of Houston-Clear Lake. Content. Areas of My Research Interest Some Current Projects Storage of XML in Relational Database Example Internet Computing Projects Conclusions.


Presentation Transcript


  1. Some of my XML/Internet Research Projects CSCI 6530 October 5, 2005 Kwok-Bun Yue University of Houston-Clear Lake

  2. Content • Areas of My Research Interest • Some Current Projects • Storage of XML in Relational Database • Example Internet Computing Projects • Conclusions Bun Yue: yue@cl.uh.edu, http://dcm.uhcl.edu/yue

  3. Areas of My Research Interest • Internet Computing • XML • Databases • Concurrent Programming

  4. Content • Areas of My Research Interest • Some Current Projects • Storage of XML in Relational Database • Example Internet Computing Projects • Conclusions

  5. Some Current Projects • Storage of XML in relational database • Measuring Web bias using authorities and hubs • Measuring information quality of Web pages • Distributed computer security laboratory • Collaborative Open Community for developing educational resources • Generalized exchanges within organizations

  6. Some Recent Student Work • McDowell, A., Schmidt, C. & Yue, K., Analysis and Metrics of XML Schema, Proceedings of the 2004 International Conference on Software Engineering Research and Practice, pp 538-544, Las Vegas, June 2004. • Yang A., Yue K., Liaw K., Collins G., Venkatraman J., Achar S., Sadasivam K., and Chen P., Distributed Computer Security Lab and Projects, Journal of Computing Sciences in Colleges. Volume 20, Issue 1. October 2004. • Yue, K., Alakappan, S. and Cheung, W., A Framework of Inlining Algorithms for Mapping DTDs to Relational Schemas, Technical Report COMP-05-005, Computer Science Department, the Hong Kong Baptist University, 2005, http://www.comp.hkbu.edu.hk/en/research/?content=tech-reports.

  7. Content • Areas of My Research Interest • Some Current Projects • Storage of XML in Relational Database • Example Internet Computing Projects • Conclusions

  8. Storing XML in RDB • Advantages: • Mature database technologies. • May be queried with XML technology (e.g. XPath, XQuery) or RDB technology (e.g. SQL). • Disadvantages: • Impedance mismatch: XML and relations are different data models.

  9. Related Issues • Effective mapping of XML DTDs (~ ordered tree model) to relational schemas. • Mapping of XML queries (e.g. XQuery) to RDB queries (e.g. SQL). • Mapping of RDB query results back to XML format.

  10. Related Work and Context • Mapping • With or without schemas for XML. • With or without user input. • Schemas for XML: • Document Type Definition (DTD) • XML Schema • We consider mapping with DTD and without user input.

  11. Naïve Mapping • An XML element is mapped to a relation. Example 1a: XML: <a><b><c><d>hello</d></c></b></a> -> Relations: a, b, c and d.

  12. Problems of Naïve Mapping • Many relations. • Inefficient queries: a single query may need multiple joins. Example 1b: XPath Query: //a SQL Query: need to join the relations a, b, c and d.
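
A minimal sketch of why the naïve mapping of Example 1a is join-heavy, using Python's built-in sqlite3 module. The `id`, `parent_id` and `value` columns are assumptions made for illustration; the slides do not say how parent-child links or text content are stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One relation per element (naive mapping). The key columns are assumed.
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY);
CREATE TABLE b (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES a(id));
CREATE TABLE c (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES b(id));
CREATE TABLE d (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES c(id),
                value TEXT);
INSERT INTO a VALUES (1);
INSERT INTO b VALUES (1, 1);
INSERT INTO c VALUES (1, 1);
INSERT INTO d VALUES (1, 1, 'hello');
""")

# Rebuilding the content of <a> requires joining all four relations.
naive_rows = conn.execute("""
SELECT a.id, d.value
FROM a
JOIN b ON b.parent_id = a.id
JOIN c ON c.parent_id = b.id
JOIN d ON d.parent_id = c.id
""").fetchall()

print(naive_rows)
```

Every extra level of nesting in the document adds another relation and another join to this query.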

  13. Inlining Algorithms • First proposed by Shanmugasundaram et al. • Expanded by Lu, Lee, Chu and others. • Extended in various directions by various researchers, e.g., • Preserving XML element orders. • Preserving XML constraints. • Extensions are not considered here.

  14. Basic Idea of Inlining Algorithms • Inline a child element into the relation for its parent element when appropriate. • Inlining algorithms differ in their inlining criteria. Example 1c: XML: <a><b><c><d>hello</d></c></b></a> Inlined Relation: a.

  15. Inlining Algorithms • Child elements & attributes may be inlined. • Child elements may not have their own relations. • Results in fewer relations. • In general, more inlining -> fewer joins.
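
A minimal sketch of the inlined mapping of Example 1c, again using Python's built-in sqlite3 module. The column name `b_c_d` (the path from `a` down to the text) is a hypothetical naming convention chosen for illustration, not one taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# b, c and d are inlined into the relation for a, so only one table exists.
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY, b_c_d TEXT);
INSERT INTO a VALUES (1, 'hello');
""")

# The content of <a> is now available without any join.
inlined_rows = conn.execute("SELECT id, b_c_d FROM a").fetchall()

print(inlined_rows)
```

The four relations of the naïve mapping collapse into one, which is the "more inlining, fewer joins" trade-off stated above.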

  16. Inlining Algorithm Structure • Simplification of DTD. • Generation of DTD graphs. • Generation of relational schemas.

  17. Our Preliminary Results • A more complete and optimal DTD simplification algorithm. • A generic DTD graph that can be used by inlining algorithms. • Inlining considerations: a framework for analyzing inlining algorithms. • A new and aggressive inlining algorithm.

  18. Examples of Our Work • Use DTD Simplification as an example of the flavor of our work. • Show the new Inlining Algorithm.

  19. Brief Introduction to DTD • DTD: a simple language to describe XML vocabulary: • Element declarations: contents of elements. • Attribute declarations: types and properties of attributes. • DTD is still very popular.

  20. DTD Element Declarations • Define element contents: • #PCDATA: string • ANY: anything goes • EMPTY: no content (attributes only) • Content models: child elements. • Mixed contents: child elements and strings.

  21. DTD Example Example 2: A complete DTD
      <!ELEMENT addressBook (person+)>
      <!ELEMENT person (name,email*)>
      <!ELEMENT name (last,first)>
      <!ELEMENT first (#PCDATA)>
      <!ELEMENT last (#PCDATA)>
      <!ELEMENT email (#PCDATA)>
      <!ATTLIST person id ID #REQUIRED>
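
To make the DTD of Example 2 concrete, here is a small sketch that parses a document shaped like it. The sample document, the names in it and the `summarize` helper are all hypothetical. Note that Python's `xml.etree.ElementTree` does not validate against a DTD, so the code only reads the structure the DTD describes.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance: person+ with (name, email*), name is (last, first).
SAMPLE = """
<addressBook>
  <person id="p1">
    <name><last>Doe</last><first>Jane</first></name>
    <email>jane@example.org</email>
    <email>jdoe@example.org</email>
  </person>
  <person id="p2">
    <name><last>Roe</last><first>Sam</first></name>
  </person>
</addressBook>
"""

def summarize(xml_text):
    """Return (id, 'first last', [emails]) for each person element."""
    root = ET.fromstring(xml_text.strip())
    people = []
    for person in root.findall("person"):
        name = person.find("name")
        full = f"{name.findtext('first')} {name.findtext('last')}"
        emails = [e.text for e in person.findall("email")]  # email* may be empty
        people.append((person.get("id"), full, emails))
    return people

people = summarize(SAMPLE)
for entry in people:
    print(entry)
```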

  22. Operators for Element Declaration • ,: sequence • +: 1 or more • *: 0 or more • ?: optional; 0 or 1 • |: choice • (): grouping

  23. Simplification of DTD • Mapping of DTD to Relational Schemas: • Input: DTDs • Output: Relational Schemas • DTD can be complicated => simplification. Example 3: <!ELEMENT a (b,((b+,c)|(d,b*,c?)),(e*,f)?)>

  24. Simplification Principles • The relational schema needs to store all possible scenarios. • Some relations/columns may not be populated in some instances. Example 3: <!ELEMENT a (b|c)> and <!ELEMENT a (b,c)>: May be the same from the RDB’s point of view.

  25. Simplification Details • Comma-separated clauses: the only remaining operators are (), the comma (,) and *. • + -> *, e.g. a+ -> a*. • Removal of | and ?, e.g. (a|b?) -> (a,b) • Removal of nested (), e.g. (a, (b)) -> (a,b) • Removal of repetition, e.g. (a, b, a) -> (a*, b) • Note that element orders are not preserved.

  26. Previous Simplification Results • Not complete: e.g. • Shanmugasundaram: does not specify how to handle |. • Lu: does not specify how to remove (). • Not optimal (may generate * when it is not needed). Example 4a: For Lu and Lee, 2 steps: (b|(b,c)) -> (b,b,c) -> (b*,c)

  27. Our Simplification Algorithm • A set of definitions. • A set of 7 simplification rules. • An algorithm on how and when to use them. Example 4b: For us, 1 step: (b|(b,c)) -> (b,c)

  28. Simplification Rules

  29. Simplification Algorithm

  30. Complexity Time complexity = $O(N_{op})$, where $N_{op}$ is the total number of operators (including parentheses) in the element declarations of the DTD.

  31. Advantages • Complete: handles all DTDs. • Optimal: in the sense that * will not be generated if not needed. Example 5: <!ELEMENT a (b,((b+,c)|(d,b*,c?)),(e*,f)?)> => <!ELEMENT a (b*,c,d,e*,f)>
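
The result of Example 5 can be reproduced with a short sketch. This is not the authors' seven-rule algorithm, whose rules are not reproduced in this transcript. It is a stand-in that assumes a well-formed content model and tracks, for each child element, whether it can occur at most once or more than once: occurrences add up across a sequence, a choice takes the larger of its branches, `*` and `+` mean "more than once", and `?` changes nothing. The function names and the `MANY` cap are hypothetical.

```python
import re

MANY = 2  # cap meaning "may occur more than once", rendered as *

def tokenize(model):
    """Split a DTD content model into names and operator tokens."""
    return re.findall(r"[A-Za-z_#][\w.\-]*|[(),|?*+]", model)

def parse(tokens, pos=0):
    """Parse one content particle; return ({name: count}, next position)."""
    if tokens[pos] == "(":
        counts, pos = parse(tokens, pos + 1)
        while tokens[pos] in (",", "|"):
            op = tokens[pos]
            other, pos = parse(tokens, pos + 1)
            names = list(counts) + [n for n in other if n not in counts]
            if op == ",":   # sequence: occurrences add up
                counts = {n: min(MANY, counts.get(n, 0) + other.get(n, 0))
                          for n in names}
            else:           # choice: only one branch occurs, so take the max
                counts = {n: max(counts.get(n, 0), other.get(n, 0))
                          for n in names}
        pos += 1            # skip the closing ")"
    else:
        counts, pos = {tokens[pos]: 1}, pos + 1
    if pos < len(tokens) and tokens[pos] in ("?", "*", "+"):
        if tokens[pos] in ("*", "+"):   # repetition: may occur many times
            counts = {n: MANY for n in counts}
        pos += 1                        # "?" leaves the counts unchanged
    return counts, pos

def simplify(model):
    """Return a flat model using only the comma and * operators."""
    counts, _ = parse(tokenize(model))
    parts = [n + ("*" if c >= MANY else "")
             for n, c in counts.items() if n != "#PCDATA"]
    return "(" + ",".join(parts) + ")"

print(simplify("(b,((b+,c)|(d,b*,c?)),(e*,f)?)"))
print(simplify("(b|(b,c))"))
```

Under these assumptions, `(b|(b,c))` simplifies to `(b,c)` without introducing `*`, which matches Example 4b.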

  32. A New Inlining Algorithm (1) • Aggressive in inlining. • More complete. • Elaborated algorithms. • Handles more details: e.g. element types of ANY, EMPTY and mixed contents.

  33. A New Inlining Algorithm (2)

  34. A New Inlining Algorithm (3)

  35. A New Inlining Algorithm (4)

  36. Main Results • Yue, K., Alakappan, S. and Cheung, W., A Framework of Inlining Algorithms for Mapping DTDs to Relational Schemas, Technical Report COMP-05-005, Computer Science Department, the Hong Kong Baptist University, 2005, http://www.comp.hkbu.edu.hk/en/research/?content=tech-reports.

  37. Future Work • Implemented the algorithms and tested with many DTDs. • Need to implement the XQuery/SQL bridge for performance study.

  38. Content • Areas of My Research Interest • Some Current Projects • Storage of XML in Relational Database • Example Internet Computing Projects • Conclusions

  39. Measuring Web Bias • Search engines dominate how information is accessed. • Search results have major social, political and commercial consequences. • Are search engines biased? • How biased are they?

  40. Previous Work • To measure bias, results should be compared to a norm. • The norm may be from human experts. • Mowshowitz and Kawaguchi: the average search result of a collection of popular search engines as the norm.

  41. Mowshowitz and Kawaguchi

  42. Limitations • Based on URL Vector -> cannot measure bias quality.

  43. Our Approach • Use Kleinberg’s HITS algorithm to create clusters, authorities and hubs of the result norm URLs. • Use them as norm clusters, authorities and hubs. • Measure distances between norms and individual results as bias.

  44. HITS • Obtain a directed graph G where • Node: page • Edge: hyperlink between pages. • Two indices for page p at iteration i: $x_{p,i}$ (authority) and $y_{p,i}$ (hub) • Iterate until steady state: • $x_{p,i+1} \leftarrow \sum_{q:\, q \to p} y_{q,i}$ • $y_{p,i+1} \leftarrow \sum_{q:\, p \to q} x_{q,i}$
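
The HITS iteration can be sketched in a few lines. This is an illustrative version, not the project's implementation. It makes two assumptions beyond the slide: scores are rescaled to unit Euclidean length after each step (the slide does not mention normalization, but without it the sums grow without bound), and a fixed iteration count stands in for a steady-state test. The page names in the usage example are made up.

```python
import math

def normalize(scores):
    """Scale scores to unit Euclidean length (assumed; not on the slide)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {n: v / norm for n, v in scores.items()}

def hits(edges, iterations=50):
    """Return (authority, hub) dicts for a list of (source, target) links."""
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Both updates read the scores of the previous iteration.
        # Authority of p: sum of hub scores of pages q that link to p.
        new_auth = {p: sum(hub[q] for q, t in edges if t == p) for p in nodes}
        # Hub of p: sum of authority scores of pages q that p links to.
        new_hub = {p: sum(auth[q] for s, q in edges if s == p) for p in nodes}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub

# Hypothetical graph: h1 links to a1 and a2, h2 links only to a1.
auth, hub = hits([("h1", "a1"), ("h1", "a2"), ("h2", "a1")])
print(auth)
print(hub)
```

In this toy graph `a1` ends up with a higher authority score than `a2`, because both hubs point to it, and `h1` ends up as the stronger hub.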

  45. Our Approach

  46. Current Progress • Implemented previous results. • Implemented vector analysis. • Implemented HITS algorithm, but it is not accurate enough: • ‘Conglomerate’ effect.

  47. Measuring Page’s Information Quality • People find information from Web pages. • How good is the content of a given page?

  48. Previous Work • Measuring different kinds of quality: • Web site design quality • Navigational quality • Many frameworks for measuring information quality: • Most result in surveys that let users rank information quality. • Very few automated or semi-automated tools.

  49. Our Objectives • Build automated and/or semi-automated tools that measure, or help users measure, the information quality of a Web page.

  50. Approach • Hypothesis, measure, usage guidelines. • Example: • Hypothesis: a Web page with many spelling mistakes is likely to have low information quality. • Measures: • Show frequencies of word occurrences. • Show percentage of spelling ‘mistakes’. • Usage guideline: • Spelling ‘mistakes’ may not be actual mistakes (e.g. UHCL).
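
The two measures in this example can be sketched as follows. The function name, the tiny word list and the `allowed` set (for terms such as UHCL that a dictionary would flag but that are not real mistakes) are hypothetical; an actual tool would use a full dictionary and text extracted from the Web page.

```python
import re
from collections import Counter

def spelling_report(text, dictionary, allowed=frozenset()):
    """Return (word frequencies, percentage of words not found in the dictionary)."""
    words = re.findall(r"[A-Za-z']+", text)
    freq = Counter(w.lower() for w in words)
    known = {w.lower() for w in dictionary} | {w.lower() for w in allowed}
    unknown = [w for w in words if w.lower() not in known]
    pct = 100.0 * len(unknown) / len(words) if words else 0.0
    return freq, pct

# Hypothetical page text and word list, for illustration only.
TEXT = "UHCL offers grat courses and grat labs"
WORDS = {"offers", "courses", "and", "labs"}

freq, pct_strict = spelling_report(TEXT, WORDS)
_, pct_with_allowed = spelling_report(TEXT, WORDS, allowed={"UHCL"})

print(freq.most_common(3))
print(round(pct_strict, 1), round(pct_with_allowed, 1))
```

Comparing the two percentages shows the effect of the usage guideline: whitelisting known acronyms keeps them from inflating the apparent mistake rate.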
