
Detecting and Representing Relevant Page-Level Web Deltas

Sanjay Kumar Madria, Department of Computer Science, Purdue University, West Lafayette, IN 47907, skm@cs.purdue.edu


Presentation Transcript


  1. Detecting and Representing Relevant Page-Level Web Deltas Sanjay Kumar Madria Department of Computer Science Purdue University West Lafayette, IN 47907 skm@cs.purdue.edu

  2. Current Situation of W3 • The Web allows information to change at any time and in any way • Two forms of change • Existence • Structure and content modification • A change replaces its antecedent and leaves no trace of the previous document

  3. Problems of Change Management • Problem: • Detecting, representing and querying these changes • The problem is challenging • Typical database approaches that detect changes through triggering mechanisms are not usable • Information sources typically do not keep historical information in a format that is accessible to outside users

  4. Motivating Example • Assume that there is a web site at www.panacea.gov • It provides information related to drugs used for various diseases

  5. Motivating Example • Suppose that, on 15th January, a user wishes to find out periodically (every 30 days) • information related to side effects and uses of drugs used for various diseases and • changes to this information at the page level compared to its previous version

  6. Structure of www.panacea.gov • The web page at www.panacea.gov contains a list of diseases • The link for each disease points to a web page containing a list of drugs used for prevention and cure of that disease • The hyperlinks associated with each drug point to documents containing a list of various issues related to that drug (description, manufacturers, clinical pharmacology, uses, side effects, etc.) • From the hyperlinks associated with each issue, one can retrieve the details of that issue for a particular drug

  7. A Snapshot as on 15th Jan [Figure: link structure of www.panacea.gov. Disease labels shown: AIDS, Cancer, Heart disease, Alzheimer’s Disease, Diabetes, Impotence. Drug labels shown: Indavir, Ritonavir, Hirudin, Niacin, Ibuprofen, Vasomax, Caverject. Page labels shown: “Side effects”, “Uses”.]

  8. Some Changes • 25th January • Links related to Diabetes are removed • A new link containing information related to Parkinson’s Disease is added • Information related to issues, side effects and uses of various drugs for Cancer is also modified

  9. A Partial Snapshot as on 25th Jan [Figure: partial link structure of www.panacea.gov. Labels shown: Parkinson’s Disease, Tolcapone, Cancer, Diabetes, “Side effects”, “Uses”.]

  10. Some Changes • 30th January • Links related to Impotence are modified • Previously provided by www.pfizer.com • Now by www.panacea.gov • Inter-linked structure of the Web pages related to Caverject is also modified • Information about Viagra, a new drug for Impotence, is added

  11. A Partial Snapshot as on 30th Jan [Figure: partial link structure of www.panacea.gov. Labels shown: Impotence, Caverject, Vasomax, Viagra, “Side effects”, “Uses”.]

  12. Some Changes • 8th February • Link structure of Heart Disease is modified • Label Heart Disease is modified to Heart Disorder • Content of the pages dealing with side effects and uses of Hirudin is updated • Inter-linked document structure of Niacin is modified • Web pages related to the side effects and uses of Ibuprofen (Alzheimer’s Disease) are removed

  13. On 8th February [Figure: partial link structure of www.panacea.gov. Labels shown: Heart disorder, Alzheimer’s Disease, Hirudin, Niacin, “Side effects”, “Uses”.]

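  14. A Snapshot as on 15th Feb [Figure: link structure of www.panacea.gov. Disease labels shown: AIDS, Alzheimer’s Disease, Cancer, Heart disease, Parkinson’s Disease, Impotence. Drug labels shown: Indavir, Ritonavir, Hirudin, Niacin, Viagra, Vasomax, Caverject. Page labels shown: “Side effects”, “Uses”.]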

  15. Objectives • Web deltas: changes to web information • Detecting and representing relevant page-level web deltas • Changes that are relevant to the user’s query, not arbitrary changes or web deltas • Restricted to the page level • Detect those documents • which are added to the site • which are deleted from the site • which have undergone content or structural modification • Determine how these delta documents are related to one another and to other documents relevant to the user’s query
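The three kinds of page-level delta named above (added, deleted, modified) can be illustrated with a direct comparison of two snapshots. This is only a minimal sketch and not the web-join method the presentation goes on to describe: the `Page` class, its `last_modified` field and the example URLs are hypothetical, and a change of modification date is assumed to stand in for "content or structural modification".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """A page in a snapshot (hypothetical record, not part of WHOM)."""
    url: str
    last_modified: str  # assumed to change whenever the page changes


def page_level_deltas(old, new):
    """Classify URLs as added, deleted or modified between two snapshots."""
    old_by_url = {p.url: p for p in old}
    new_by_url = {p.url: p for p in new}
    added = sorted(new_by_url.keys() - old_by_url.keys())
    deleted = sorted(old_by_url.keys() - new_by_url.keys())
    modified = sorted(
        url
        for url in old_by_url.keys() & new_by_url.keys()
        if old_by_url[url].last_modified != new_by_url[url].last_modified
    )
    return {"added": added, "deleted": deleted, "modified": modified}
```

A plain diff like this reports which pages changed but not how the changed pages are linked to one another or to the rest of the query result, which is the part the web tables are meant to capture.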

  16. The WHOWEDA Project • WHOWEDA: A WareHouse of WEb DAta • To design and implement a web warehousing system capable of effective extraction, management, and processing of information on the World Wide Web • Data model: WHOM (WareHouse Object Model)

  17. Overview of WHOM • Our web warehouse can be conceived of as a collection of web tables • A set of web tuples and a set of web schemas represents a web table • A web tuple is a directed graph containing nodes and links and satisfies a web schema • Nodes and links contain content, metadata and structural information associated with Web documents and hyperlinks • Tree representation • Web algebra containing web operators to manipulate web tables • Global Coupling, Web Select, Web Join etc.
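To make the description of a web tuple concrete, the sketch below models one as a small directed graph of nodes and links. It is an illustration under stated assumptions rather than the WHOM implementation: the class names, the fields (`url`, `last_modified`, `title`) and the helper methods are all hypothetical, and web schemas are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A Web document with a little metadata (hypothetical fields)."""
    node_id: str        # identifier such as "a0" or "b0"
    url: str
    last_modified: str
    title: str = ""


@dataclass(frozen=True)
class Link:
    """A hyperlink from one node of the tuple to another."""
    source: str
    target: str
    label: str = ""


@dataclass
class WebTuple:
    """A directed graph of nodes and links."""
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    links: list = field(default_factory=list)   # list of Link

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_link(self, source, target, label=""):
        # Both endpoints must already belong to this tuple.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must already be nodes of the tuple")
        self.links.append(Link(source, target, label))

    def successors(self, node_id):
        """Return the ids of the nodes that node_id links to."""
        return [link.target for link in self.links if link.source == node_id]
```

Under this sketch a web table would simply be a list of `WebTuple` objects that share a schema.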

  18. Overview of our approach • Step 1: Snapshots of the old and new relevant data are retrieved from the Web using the global web coupling operation and materialized in two web tables • Step 2: Web join, left outer web join and right outer web join operations are performed on these two web tables • The result is a joined web table, a left outer joined web table and a right outer joined web table • Step 3: Delta web tables containing different types of web deltas are generated from these resultant web tables • The following slides elaborate on these steps

  19. Step 1: Retrieving snapshots of Web data using Global Web Coupling

  20. Web Query Specification • Features: • Draw a web query as a directed connected acyclic graph (also called a coupling query) • Query can also be specified in text form • Specify search conditions on the nodes and edges of the graph • Performed by the global web coupling operator

  21. Coupling Query • Set of node variables Xn • Each variable represents a set of Web documents • Set of link variables Xl • Each variable represents a set of hyperlinks • Set of connectivities C in DNF defined over node and link variables • Specifies the hyperlink structure of the documents • Set of predicates P defined over some of the node and link variables • Specify metadata, content or structural conditions • Set of coupling query predicates Q • Conditions on execution of the query

  22. Example • Suppose that, on 15th January, a user wishes to find out periodically (every 30 days) from the web site at www.panacea.gov • information related to side effects and uses of drugs used for various diseases • The result of the query is stored in the form of a web table

  23. Coupling Query • Xn = {a, b, d, k} • Xl = { - } • P = {p1, p2, p3, p4} • p1(a) = METADATA:: a[url] EQUALS “www.panacea.gov” • p2(b) = CONTENT:: b[html.body.title] NON-ATTR-CONT “drug list” • p3(k) = CONTENT:: k[html.body.title] NON-ATTR-CONT “uses” • p4(d) = CONTENT:: d[html.body.title] NON-ATTR-CONT “side effects”

  24. Coupling Query • C = k1 AND k2 AND k3 • k1 = a < - > b • k2 = b < -{1, 6} > d • k3 = b < -{1, 3} > k • Q = {q1} • q1(b) = COUPLING_QUERY:: polling_frequency EQUALS “30 days”
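The coupling query on the two slides above can be written down as plain data, which makes its parts easier to see. The sketch below is hypothetical throughout: the class names and the `satisfies` helper are invented for illustration, `NON-ATTR-CONT` is assumed to mean a case-insensitive "contains" test on non-attribute content, and the quantifiers `{1, 6}` and `{1, 3}` are assumed to bound the number of links on a path. None of this is confirmed by the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    variable: str    # node variable the predicate constrains
    kind: str        # "METADATA" or "CONTENT"
    attribute: str   # e.g. "url" or "html.body.title"
    operator: str    # "EQUALS" or "NON-ATTR-CONT"
    value: str


@dataclass(frozen=True)
class Connectivity:
    source: str
    target: str
    min_links: int = 1   # assumed meaning of the {min, max} quantifier
    max_links: int = 1


def satisfies(pred, doc):
    """Check one predicate against a document given as a flat dict."""
    actual = doc.get(pred.attribute, "")
    if pred.operator == "EQUALS":
        return actual == pred.value
    if pred.operator == "NON-ATTR-CONT":
        # Assumption: substring match, ignoring case.
        return pred.value.lower() in actual.lower()
    raise ValueError(f"unsupported operator: {pred.operator}")


QUERY = {
    "node_variables": {"a", "b", "d", "k"},
    "predicates": [
        Predicate("a", "METADATA", "url", "EQUALS", "www.panacea.gov"),
        Predicate("b", "CONTENT", "html.body.title", "NON-ATTR-CONT", "drug list"),
        Predicate("k", "CONTENT", "html.body.title", "NON-ATTR-CONT", "uses"),
        Predicate("d", "CONTENT", "html.body.title", "NON-ATTR-CONT", "side effects"),
    ],
    # C = k1 AND k2 AND k3
    "connectivities": [
        Connectivity("a", "b"),
        Connectivity("b", "d", 1, 6),
        Connectivity("b", "k", 1, 3),
    ],
    "polling_frequency_days": 30,
}
```

For example, `satisfies(QUERY["predicates"][3], {"html.body.title": "Side Effects of Hirudin"})` evaluates to `True` under the substring assumption.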

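  25. Pictorial Representation [Figure: the coupling query drawn as a graph. Node a (www.panacea.gov) links to node b (“drug list”); b is connected to d (“side effects”) by an edge labeled {1, 6} and to k (“uses”) by an edge labeled {1, 3}.]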

  26. Web Table Drugs (15th Jan) [Figure: web tuples, listed as node identifiers followed by the labels shown] • a0, b0, u0, d0, k0: AIDS, Indavir • a0, b0, u1, d1, k1: AIDS, Ritonavir • a0, b1, d2, k2: Cancer, Beta Carotene • a0, b5, d12, k12: Alzheimer’s Disease, Ibuprofen

  27. Web Table Drugs (15th Jan) [Figure: web tuples, continued] • a0, b3, d4, k5: Diabetes, Albuterol • a0, b4, u4, u5, u6, d5, k6: Impotence, Vasomax • a0, b4, u7, u8, d6, k7: Impotence, Caverject • a0, b2, u2, d3, k3: Heart Disease, Hirudin

  28. Web Table New Drugs (15th Feb) [Figure: web tuples, listed as node identifiers followed by the labels shown] • a0, b0, u0, d0, k0: AIDS, Indavir • a0, b0, u1, d1, k1: AIDS, Ritonavir • a0, b2, u2, d3, k3: Heart Disorder, Hirudin • a0, b1, d2, k2: Cancer, Beta Carotene

  29. Web Table New Drugs (15th Feb) [Figure: web tuples, continued] • a0, b2, u3, d7, k7: Heart Disorder, Niacin • a0, b4, u9, d8, k8: Impotence, Vasomax • a0, b6, u10, d10, k10: Parkinson’s Disease, Tolcapone • a0, b4, u7, d6, k7: Impotence, Caverject

  30. Web Table New Drugs (15th Feb) [Figure: web tuples, continued] • a0, b4, u12, d9, k9: Impotence, Viagra • a0, b6, u10, d10, k10: Parkinson’s Disease, Tolcapone

  31. Step 2: Performing Web Join, Left and Right Outer Web Join

  32. Web Join • Information composition operator • Combines two web tables into a single web table under certain conditions • Combines two web tables by concatenating a web tuple of one web table with a web tuple of the other web table whenever there exist joinable nodes • Two nodes are joinable if they are identical • Two nodes are identical if their URLs and last modification dates are the same • The joined web tuple is stored in a different web table

  33. Web Join • Join web tables Drugs and New Drugs • Nodes that have not undergone any changes are the joinable nodes in these two web tables • Content-modified nodes, new nodes and deleted nodes cannot be joinable nodes
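The joinability rule stated above (same URL and same last modification date) can be sketched as follows. This is a deliberately simplified illustration, not the WHOWEDA web join operator: web tuples are reduced to plain sequences of nodes with no links, every pair of tuples that shares at least one identical node is paired, and the names `Node`, `joinable_nodes` and `web_join` are hypothetical. In particular, a node shared by every tuple (such as a root page) would pair every tuple with every other under this sketch, whereas the joined table in the slides is more selective.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Node:
    node_id: str
    url: str
    last_modified: str


def joinable_nodes(tuple1, tuple2):
    """Return the nodes of tuple1 that have an identical node in tuple2.

    Two nodes are treated as identical when both the URL and the
    last modification date match.
    """
    keys2 = {(n.url, n.last_modified) for n in tuple2}
    return [n for n in tuple1 if (n.url, n.last_modified) in keys2]


def web_join(table1, table2):
    """Pair up web tuples from two tables that share a joinable node.

    Each result is (tuple_from_table1, tuple_from_table2, shared_nodes).
    """
    joined = []
    for t1, t2 in product(table1, table2):
        shared = joinable_nodes(t1, t2)
        if shared:
            joined.append((t1, t2, shared))
    return joined
```

Because a modified page carries a new modification date, it never appears among the shared nodes, which matches the statement that content-modified, new and deleted nodes cannot be joinable.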

  34. Joined web table [Figure: joined web tuples (1), (2) and (3), formed from the AIDS tuples for Indavir (nodes a0, b0, u0, d0, k0) and Ritonavir (nodes a0, b0, u1, d1, k1).]

  35. Joined Web Table [Figure: joined web tuples (4) and (5). Tuple (4) shows Niacin under Heart Disorder (nodes a0, b2, u3, d7, k4) together with Hirudin under Heart Disease (nodes a0, u2, d3, k3); tuple (5) shows Caverject under Impotence (nodes a0, b4, u7, u8, d6, k7).]

  36. Joined Table [Figure: joined web tuple (6), showing Hirudin under Heart Disease and under Heart Disorder (nodes a0, b2, u2, d3, k3).]

  37. Types of web tuples • Web tuples in which all the nodes are joinable • These result from joining two versions of a web tuple that have remained unchanged during the transition • Web tuples in which • some of the nodes are joinable nodes • the remaining nodes are the result of insertion, deletion or modification operations [Figure: joined web tuple (5), Caverject under Impotence.]

  38. Types of web tuples • Tuples in which • some of the nodes are joinable nodes • some of the remaining nodes are the result of insertion, deletion or modification and • the rest remained unchanged during the transition [Figure: joined web tuple (3), Indavir and Ritonavir under AIDS.]

  39. Outer Web Join • Web tuples that do not participate in the web join process (dangling web tuples) are absent from the joined web table • Outer web join enables us to identify them • Left outer web join • Right outer web join
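The idea of dangling web tuples can be sketched in a few lines. As a simplifying assumption, each web tuple is represented here as a tuple of `(url, last_modified)` pairs, and a tuple "participates" in the join if it shares at least one such pair with some tuple of the other table; the function names are hypothetical and the real operators work on full web tuples with links.

```python
def participates(web_tuple, other_table):
    """True if web_tuple shares an identical node with any tuple in other_table."""
    keys = set(web_tuple)
    return any(keys & set(other) for other in other_table)


def left_outer_web_join(old_table, new_table):
    """Dangling tuples of the old table: those with no join partner."""
    return [t for t in old_table if not participates(t, new_table)]


def right_outer_web_join(old_table, new_table):
    """Dangling tuples of the new table: those with no join partner."""
    return [t for t in new_table if not participates(t, old_table)]
```

Under this reading, the left outer result collects old tuples whose nodes were all deleted or modified, and the right outer result collects new tuples whose nodes are all new or modified.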

  40. Web Table New Drugs (15th Feb) [Figure: web tuples, listed as node identifiers followed by the labels shown] • a0, b0, u0, d0, k0: AIDS, Indavir • a0, b0, u1, d1, k1: AIDS, Ritonavir • a0, b2, u2, d3, k3: Heart Disorder, Hirudin • a0, b1, d2, k2: Cancer, Beta Carotene

  41. Web Table New Drugs (15th Feb) [Figure: web tuples, continued] • a0, b2, u3, d7, k7: Heart Disorder, Niacin • a0, b4, u9, d8, k8: Impotence, Vasomax • a0, b4, u7, d6, k7: Impotence, Caverject

  42. Web Table New Drugs (15th Feb) [Figure: web tuples, continued] • a0, b4, u12, d9, k9: Impotence, Viagra • a0, b6, u10, d10, k10: Parkinson’s Disease, Tolcapone

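  43. Right Outer Web Join [Figure: dangling web tuples from New Drugs, listed as node identifiers followed by the labels shown] • a0, b4, u9, d8, k8: Impotence, Vasomax • a0, b1, d2, k2: Cancer, Beta Carotene • a0, b4, u12, d9, k9: Impotence, Viagra • a0, b6, u10, d10, k10: Parkinson’s Disease, Tolcapone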

  44. Types of web tuples • New web tuples that are added during the transition • These tuples contain some new nodes, and the content of the remaining nodes has changed • Tuples in which all the nodes have undergone content modification • Tuples that existed before, in which some of the nodes are new and the content of the remaining nodes has changed

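  45. Web Table Drugs (15th Jan) [Figure: web tuples, listed as node identifiers followed by the labels shown] • a0, b1, d2, k2: Cancer, Beta Carotene • a0, b5, d12, k12: Alzheimer’s Disease, Ibuprofen • a0, b0, u0, d0, k0: AIDS, Indavir • a0, b0, u1, d1, k1: AIDS, Ritonavir

  46. Web Table Drugs (15th Jan) [Figure: web tuples, continued] • a0, b3, d4, k5: Diabetes, Albuterol • a0, b4, u4, u5, u6, d5, k6: Impotence, Vasomax • a0, b4, u7, u8, d6, k7: Impotence, Caverject • a0, b2, u2, d3, k3: Heart Disease, Hirudin

  47. Left Outer Web Join [Figure: dangling web tuples from Drugs, listed as node identifiers followed by the labels shown] • a0, b3, d4, k5: Diabetes, Albuterol • a0, b4, u4, u5, u6, d5, k6: Impotence, Vasomax • a0, b1, d2, k2: Cancer, Beta Carotene • a0, b5, d12, k12: Alzheimer’s Disease, Ibuprofen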

  48. Types of web tuples • Web tuples that are deleted during the transition • These tuples do not occur in the new web table • Tuples in which all the nodes have undergone content modification • Tuples in which some of the nodes are deleted and the content of the remaining nodes has changed

  49. Step 3: Generating Delta Web Tables

  50. Overview • Input • Joined, left outer joined and right outer joined web tables • Output • Set of delta web tables
