
WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science

WHOWEDA is a web warehouse that allows direct querying and analysis of web data, enabling organizations and individuals to extract value from their web information assets. It provides a subject-oriented, integrated, and time-variant repository of web data for decision making.


Presentation Transcript


  1. WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science Purdue University, West Lafayette, IN 47907 skm@cs.purdue.edu copy-right@sanjay madria

  2. www.is.a.mess copy-right@sanjay madria

  3. WWW • Huge, widely distributed, heterogeneous collection of semi-structured multimedia documents in the form of web pages connected via hyperlinks.

  4. Characteristics of WWW • WWW is a set of directed graphs • Data in the WWW is heterogeneous • Unstructured versus structured information • No central authority to manage information • Dynamic versus static information • Web information discovery via search engines

  5. As the WWW grows, it becomes more chaotic • The Web is a fast-growing, distributed, non-administered global information resource • WWW allows access to text, image, video, sound and graphic data • More business organizations are creating web servers (e-commerce) • A more chaotic environment in which to locate information of interest • "Lost in hyperspace" syndrome

  6. WWW data - Does it affect the corporate world? • Lack of credibility of data • Different sites with different data • Same site different data • Historical information is not available • Previous versions of web data • How does web data change with time • Summarization over time • Data to information • Reduction in productivity • Analysis is manual copy-right@sanjay madria

  7. How users find web sites • Indexes and search engines 75 • UseNet newsgroups 44 • Cool lists 27 • New lists 24 • Listservers 23 • Print ads 21 • Word-of-mouth and e-mail 17 • Linked web advertisement 4 copy-right@sanjay madria

  8. Limitations of Search Engines • Do not exploit hyperlinks (Google is a recent exception) • Search is limited to string matching • Keyword-oriented search queries are evaluated on archived data rather than up-to-date data; no indexing on current data • Low accuracy • Replicated results • No further manipulation possible

  9. Limitations of Search Engines (continued) • ERROR 404! • No efficient document management • Query results cannot be further manipulated • No efficient means for knowledge discovery

  10. Current Research Projects • Web query systems: W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog, XML-QL • Semistructured data: LOREL, UnQL, WebOQL • Website management system: STRUDEL • Web warehouse: WHOWEDA

  11. WHOWEDA -Key Objectives • Design a suitable data model to represent web information • development of web algebra and query language • Maintenance of Web data • Development of knowledge discovery and web mining tools • Web warehouse copy-right@sanjay madria

  12. WHOWEDA - What? • WareHouse Of Web Data • Subject - oriented • Integrated • Temporal • Granularity - Lower, higher • Some summary • Not updatable • Alternative information sources copy-right@sanjay madria

  13. Web Warehouse? • Subject-oriented, integrated, time-variant, non-volatile repository of web data for direct querying and analysis for some sort of decision making • A process whereby organizations or individuals extract value from their Web informational assets through the use of special stores called web warehouses copy-right@sanjay madria

  14. WHOWEDA! www.cais.ntu.edu.sg:8000/~whoweda • A WareHouse Of WEb DAta • Web Information Coupling Model (WICM) • Web Objects • Web Schema • Web Information Coupling Algebra • Web Information Maintenance • Web Mining and Knowledge discovery

  15. [Figure: WHOWEDA architecture. Components shown: User, WWW, Warehouse Concept Mart, Web Querying & Analysis Component, Web Information Mining System, Web Information Coupling System, Web Information Maintenance System, Web Warehouse, and several Web Marts.]

  16. [Figure: web querying and manipulation components. Components shown: User, WWW, Web Query & Display, Warehouse Concept Mart, Pre-processing, Web Warehouse, Data Visualization. Global Web Manipulation: Global Web Coupling, Global Ranking, Schema Tightness. Local Web Manipulation: Local Web Coupling, Local Ranking, Schema Tightness, Schema Search, Schema Match, Web Select, Web Project, Web Join, Web Union, Web Intersection.]

  17. Web Objects • Node - url, title, format, size, date, text • Link - source-url, target-url, label, link-type • Web tuple • Web table • Web schema • Web database copy-right@sanjay madria
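The web objects listed on this slide can be sketched as plain data structures. The attribute names follow the slide; the class names, the field types, the example URL for the "Issues" page, and the representation of a web tuple as lists of nodes and links are assumptions made for illustration, not WHOWEDA's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Node:
    """Metadata and content of one web document (attributes as on the slide)."""
    url: str
    title: str = ""
    format: str = ""
    size: int = 0
    date: str = ""
    text: str = ""


@dataclass(frozen=True)
class Link:
    """A hyperlink between two documents."""
    source_url: str
    target_url: str
    label: str = ""
    link_type: str = ""  # e.g. "local"; the set of allowed values is assumed


@dataclass
class WebTuple:
    """A small directed graph of nodes and links (assumed representation)."""
    nodes: List[Node] = field(default_factory=list)
    links: List[Link] = field(default_factory=list)


@dataclass
class WebTable:
    """A named collection of web tuples that share one web schema."""
    name: str
    tuples: List[WebTuple] = field(default_factory=list)


home = Node(url="http://www.panacea.org/", title="Panacea")
issues = Node(url="http://www.panacea.org/issues", title="Issues")  # hypothetical URL
link = Link(home.url, issues.url, label="List of Diseases", link_type="local")
diseases = WebTable("Diseases", [WebTuple([home, issues], [link])])
```

In this sketch a web database would simply be a collection of such `WebTable` objects together with their schemas.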

  18. Web Schema • Metadata in the warehouse • Structural ‘summary’ of web table • Information Coupling using a Query graph • Query graph ->Web schema • directed graph represented by Ordered 4-tuple: • Set of node variables • Set of link variables • Connectivities • Predicates copy-right@sanjay madria


  20. [Figure: link structure of the Information Square web site. Labels: Information Square's homepage; Headline article 1 ... n; News@TCS (list of video files); List of links to local news; Local news 1 ... k; News specials; Airport info; List of links to world news; World news 1 ... t.]

  21. [Figure: query graph over the variables x, y, z, w, e, f, g, h, annotated with the predicates: url CONTAINS "headlines"; target_url CONTAINS "article"; target_url CONTAINS "newshub/specials"; label CONTAINS "Local News"; url CONTAINS "local"; label CONTAINS "World News"; url CONTAINS "world".]

  22. [Figure: one instance matching the query graph. Labels: Information Square's homepage; Headline article 1; List of links to local news; Local news 1; News specials; List of links to world news; World news 1.]

  23. Schema - example • Node variables: Xn = { x, y, z, w } • Link variables: Xl = { e, f, g, h } • Connectivities: C = { x<e>y and x<fg->z and x<fh->w } • The symbol # represents an unbound node variable or link variable, that is, a variable not restricted by any predicate. • "-" represents one unbound link • "-+" represents more than one unbound link

  24. Predicates • P = { x.url = "http://www.mediacity.com.sg/i-square", • y.url CONTAINS "headlines", • e.target_url CONTAINS "article", • f.target_url CONTAINS "newshub/specials", • g.label CONTAINS "Local News", • z.url CONTAINS "local", • h.label CONTAINS "World News", • w.url CONTAINS "world" }
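A minimal sketch of how the schema from the two slides above could be held in memory and checked against one candidate binding of variables to pages and links. The `Schema` class, the tuple encoding of predicates, and every URL in the sample binding other than the i-square root are assumptions for illustration; only the variable names and predicate strings come from the slides. Connectivities are stored as opaque strings and are not evaluated here.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

# A predicate is (variable, attribute, operator, value); an assumed encoding.
Predicate = Tuple[str, str, str, str]
Binding = Dict[str, Dict[str, str]]  # variable -> attributes of the bound object


@dataclass
class Schema:
    node_vars: Set[str]
    link_vars: Set[str]
    connectivities: List[str]
    predicates: List[Predicate]


def holds(pred: Predicate, binding: Binding) -> bool:
    """Evaluate one predicate against the object bound to its variable."""
    var, attr, op, value = pred
    actual = binding[var].get(attr, "")
    if op == "EQUALS":
        return actual == value
    if op == "CONTAINS":
        return value in actual
    raise ValueError(f"unsupported operator: {op}")


def satisfies(schema: Schema, binding: Binding) -> bool:
    """True if every predicate of the schema holds for the binding."""
    return all(holds(p, binding) for p in schema.predicates)


base = "http://www.mediacity.com.sg/i-square"
news = Schema(
    node_vars={"x", "y", "z", "w"},
    link_vars={"e", "f", "g", "h"},  # h is used in the connectivities and predicates
    connectivities=["x<e>y", "x<fg->z", "x<fh->w"],
    predicates=[
        ("x", "url", "EQUALS", base),
        ("y", "url", "CONTAINS", "headlines"),
        ("e", "target_url", "CONTAINS", "article"),
        ("f", "target_url", "CONTAINS", "newshub/specials"),
        ("g", "label", "CONTAINS", "Local News"),
        ("z", "url", "CONTAINS", "local"),
        ("h", "label", "CONTAINS", "World News"),
        ("w", "url", "CONTAINS", "world"),
    ],
)

# Hypothetical binding: all paths below the root URL are made up.
binding: Binding = {
    "x": {"url": base},
    "y": {"url": base + "/headlines/article1.html"},
    "z": {"url": base + "/local/1.html"},
    "w": {"url": base + "/world/1.html"},
    "e": {"target_url": base + "/headlines/article1.html"},
    "f": {"target_url": base + "/newshub/specials"},
    "g": {"label": "Local News"},
    "h": {"label": "World News"},
}
matches = satisfies(news, binding)
```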

  25. Query Graph - Example 1 • A query graph is the same as a schema except that it has one more parameter to control the results returned. • Informally, it is a directed connected graph consisting of nodes, links, and keywords imposed on them. • Produce a list of diseases with their symptoms, evaluation procedures and treatment starting from the web site at http://www.panacea.org/ • Web table Diseases

  26. [Figure: query graph for the web table Diseases, starting at http://www.panacea.org/. Variables: x, y, z, w, p, q, e, f, g. Labels: List of Diseases, Issues, Symptoms list, Symptoms, Evaluation, Treatment list, Treatment.]

  27. [Figure: one web tuple of Diseases for AIDS, starting at http://www.panacea.org/. Instances: x0, y1, z1, w1, p2, q1, e1, f1, g1. Labels: List of Diseases, Issues, AIDS, Symptoms list, Symptoms, Evaluation, Elisa Test, Treatment list, Treatment.]

  28. Example 2 • Produce a list of drugs, and their uses and side effects starting from the web site at http://www.panacea.org/ • Web table Drugs

  29. [Figure: query graph for the web table Drugs, starting at http://www.panacea.org/. Variables: a, b, c, d, k, r, s. Labels: List of Diseases, Issues, Drug list, Side effects, Use, Uses.]

  30. [Figure: one web tuple of Drugs for Indavir, starting at http://www.panacea.org/. Instances: a0, b1, c1, d1, k1, r1, s1. Labels: List of Diseases, Issues, AIDS, Drug list, Indavir, Side effects, Side effects of Indavir, Use, Uses of Indavir.]

  31. Query Language • Starting from the CS dept. home page at NTU, find all documents that are linked through paths of length less than two containing only local links, and have in their text “database”. copy-right@sanjay madria

  32. COUPLE WEBTABLE W FROM WWW SUCH THAT NODE I, J IN WWW AND LINK e, f, g IN WWW AND I<e|f,g>J WHERE I.url EQUALS "http://www.ntu.edu.sg" AND J.text CONTAINS "database" AND f.link-type EQUALS local AND g.link-type EQUALS local;
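The intent of this query can be sketched as a bounded breadth-first traversal. This is only an illustration under stated assumptions: it runs on a toy in-memory web rather than the live WWW, it follows the natural-language statement (local links only, path length up to a given bound) rather than parsing the COUPLE syntax, and the function name `couple`, the `max_len` parameter, and all page URLs other than the NTU root are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Toy web: url -> (page text, list of (target url, link type)). All data is made up.
Web = Dict[str, Tuple[str, List[Tuple[str, str]]]]


def couple(web: Web, start: str, keyword: str, max_len: int) -> Set[str]:
    """Return URLs reachable from `start` over 1..max_len local links
    whose page text contains `keyword`."""
    hits: Set[str] = set()
    frontier = {start}
    seen = {start}
    for _ in range(max_len):
        next_frontier: Set[str] = set()
        for url in frontier:
            for target, link_type in web[url][1]:
                # Skip non-local links, unknown pages, and pages already visited.
                if link_type != "local" or target not in web or target in seen:
                    continue
                seen.add(target)
                next_frontier.add(target)
                if keyword in web[target][0]:
                    hits.add(target)
        frontier = next_frontier
    return hits


root = "http://www.ntu.edu.sg"
web: Web = {
    root: ("CS home", [(root + "/research", "local"),
                       ("http://example.org/db", "global")]),
    root + "/research": ("database group", [(root + "/research/whoweda", "local")]),
    root + "/research/whoweda": ("web database warehouse", [(root + "/deep", "local")]),
    root + "/deep": ("database archive", []),
    "http://example.org/db": ("database vendor", []),
}
result = couple(web, root, "database", max_len=2)
```

With `max_len=2` the page three links away and the page behind the non-local link are both excluded, even though their text contains the keyword.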

  33. Web Algebra • Formal foundation of data representation and manipulation in a web warehouse • Web operators: • Information access operator • Information manipulation operators • Web schema operators • Data visualization operators copy-right@sanjay madria

  34. Information access operator • Global Web Coupling copy-right@sanjay madria

  35. Information Manipulation Operators • Web select • Web project • Local web coupling • Web join • Web cartesian product • Web union • Web intersect

  36. Web Select • Extracts web tuples from web tables satisfying certain conditions on node and link variables and on connectivities • Input is select Schema • Output is a web table satisfying the select schema copy-right@sanjay madria

  37. Select the tuples of W1 that contain world news about Indonesia since May 1, 1998. • σ_Ms(W1), where Ms = < Xsn, Xsl, Cs, Ps >, Xsn = { x, w }, Xsl = { }, Cs = { }, Ps = { x.date > "1May1998", w.text CONTAINS "Indonesia" }

  38. Xn' = { x, y, z, w }, Xl' = { e, f, g, h } • C' = { x<e>y and x<fg->z and x<fh->w } • P' = { x.url = "http://www.mediacity.com.sg/i-square", x.date > "1May1998", • e.target_url CONTAINS "article", f.target_url CONTAINS "newshub/specials", • g.label CONTAINS "Local News", • z.url CONTAINS "local", • h.label CONTAINS "World News", • w.url CONTAINS "world", • w.text CONTAINS "Indonesia" }
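The Indonesia example can be sketched as a filter over the web tuples of W1. This is an assumed, simplified encoding: each web tuple is a dictionary from variable name to node attributes, each select condition is a Python callable, dates are `datetime.date` values rather than strings like "1May1998", and the sample tuples are invented. The sketch covers only the tuple filtering; merging the select predicates into the output schema, as slide 38 shows, is not modelled.

```python
from datetime import date
from typing import Any, Callable, Dict, List

WebTuple = Dict[str, Dict[str, Any]]  # variable -> attributes of the bound node
Condition = Callable[[WebTuple], bool]


def web_select(table: List[WebTuple], conditions: List[Condition]) -> List[WebTuple]:
    """Keep the web tuples for which every select condition holds."""
    return [t for t in table if all(c(t) for c in conditions)]


# Ps = { x.date > "1May1998", w.text CONTAINS "Indonesia" }
conditions: List[Condition] = [
    lambda t: t["x"]["date"] > date(1998, 5, 1),
    lambda t: "Indonesia" in t["w"]["text"],
]

# Hypothetical contents of web table W1.
w1: List[WebTuple] = [
    {"x": {"date": date(1998, 5, 20)}, "w": {"text": "Unrest in Indonesia"}},
    {"x": {"date": date(1998, 4, 2)}, "w": {"text": "Indonesia election news"}},
    {"x": {"date": date(1998, 6, 1)}, "w": {"text": "Weather in Europe"}},
]
selected = web_select(w1, conditions)
```

Only the first sample tuple passes both conditions: the second is too old and the third does not mention Indonesia.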

  39. Web Information Coupling System • A database system to couple related web information • Global web Coupling and Local Web Coupling copy-right@sanjay madria

  40. Global Coupling - Information Access • To integrate data from the Web • To create historical data • To couple related information from the WWW satisfying a query graph • Operator to create web tables • From web with no schema to web table with web schema copy-right@sanjay madria

  41. Why local web coupling? • Directly querying the WWW to gather this information is expensive and repetitive • Web documents containing similar information can reside in different web tables in a web warehouse • Local web coupling is a mechanism to gather this similar information by additional manipulation of the materialized web tables

  42. Local Web Couple operator • Two web tuples can be coupled if there exists at least one pair of nodes, one from each tuple, that contain similar information.

  43. Local Web Couple operator • The web couple operator is basically a web cartesian product followed by web select: • We denote web couple by the symbol: copy-right@sanjay madria
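The description of web couple as a cartesian product followed by a select can be sketched as follows. Several things here are assumptions: the slides do not define "similar information", so equality of node titles stands in for it; web tuples are reduced to dictionaries from variable name to node attributes; the two tables are assumed to use disjoint variable names; and the shared start page is left out of the sample tuples so that it does not trivially couple every pair.

```python
from itertools import product
from typing import Any, Callable, Dict, List

Node = Dict[str, Any]
WebTuple = Dict[str, Node]  # variable -> node attributes


def same_title(n1: Node, n2: Node) -> bool:
    """Stand-in similarity test (assumed): two nodes are similar if titles match."""
    return n1.get("title") == n2.get("title")


def web_couple(left: List[WebTuple], right: List[WebTuple],
               similar: Callable[[Node, Node], bool]) -> List[WebTuple]:
    """Web cartesian product followed by a select on node similarity."""
    coupled = []
    for a, b in product(left, right):  # web cartesian product
        # Web select: keep the pair if some node of a is similar to some node of b.
        if any(similar(n1, n2) for n1 in a.values() for n2 in b.values()):
            coupled.append({**a, **b})  # relies on disjoint variable names
    return coupled


diseases: List[WebTuple] = [
    {"y": {"title": "AIDS"}, "z": {"title": "Symptoms of AIDS"}},
    {"y": {"title": "Cancer"}, "z": {"title": "Symptoms of Cancer"}},
    {"y": {"title": "Diabetes"}, "z": {"title": "Symptoms of Diabetes"}},
]
drugs: List[WebTuple] = [
    {"b": {"title": "AIDS"}, "d": {"title": "Side effects of Indavir"}},
    {"b": {"title": "AIDS"}, "d": {"title": "Side effects of Ritonavir"}},
    {"b": {"title": "Cancer"}, "d": {"title": "Side effects of Letrozole"}},
]
coupled = web_couple(diseases, drugs, same_title)
```

Each coupled tuple then carries both the symptoms page and the side-effects page for the same disease, without querying the WWW again.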

  44. Web Coupling copy-right@sanjay madria

  45. Example 1 • Produce a list of diseases and their symptoms starting from the web site at http://www.panacea.org/ • Web table Diseases

  46. [Figure: web schema (query graph) of "Diseases", starting at http://www.panacea.org/. Variables: x, y, z, e. Labels: List of Diseases, Issues, symptoms.]

  47. [Figure: web table "Diseases" with four web tuples, all starting at http://www.panacea.org/ (x0), one each for AIDS, Diabetes, Cancer, and Lung Disease. Each tuple carries the labels List of Diseases, Issues, and symptoms, and ends at the page "Symptoms of AIDS", "Symptoms of Diabetes", "Symptoms of Cancer", or "Symptoms of Lung Diseases". Instances: y0 ... y3, z0 ... z3, e0 ... e3.]

  48. Example 2 • Produce a list of drugs, and their side effects starting from the web site at http://www.panacea.org/ • Web table Drugs

  49. [Figure: web schema (query graph) of "Drugs", starting at http://www.panacea.org/. Variables: a, b, c, d, r. Labels: List of Diseases, Issues, Drug list, Side effects.]

  50. [Figure: web table "Drugs" with four web tuples, all starting at http://www.panacea.org/ (a0). Tuple 1: AIDS (b1), Indavir, Side effects of Indavir (c1, d1, r1). Tuple 2: AIDS (b1), Ritonavir, Side effects of Ritonavir (c2, d2, r2). Tuple 3: Cancer (b2), Letrozole, Side effects of letrozole (c3, d3, r3). Tuple 4: Heart Disorder (b4), Beta Carotene, Side effects of Beta Carotene (c4, d4, r4). Each tuple carries the labels List of Diseases, Issues, Drug list, and Side effects.]
