
Data Processing with Missing Information




  1. Data Processing with Missing Information Haixuan Yang. Supervisors: Prof. Irwin King & Prof. Michael R. Lyu

  2. Outline • Introduction • Link Analysis • Preliminaries • Related Work • Predictive Ranking Model • Block Predictive Ranking Model • Experiment Setup • Conclusions • Information Systems • Preliminaries • Related Work • Generalized Dependency Degree Γ′ • Extend Γ′ to Incomplete Information Systems • Experiments • Conclusions • Future Work

  3. Introduction • We handle missing information in two areas: • Link analysis on the web. • Information systems with missing values. • Both areas share a common simple technique: using the sample to estimate the density. If the sample size is n, and a phenomenon occurs m times in the sample, then the probability of this phenomenon is estimated as m/n. • In both areas, the difficulty is how to apply this simple technique in the right places.

  4. Link Analysis

  5. Preliminaries • PageRank (1998) • It uses link information to rank web pages; • The importance of a page depends on the number of pages that point to it; • The importance of a page also depends on the importance of the pages that point to it. • If x is the rank vector, then the rank xi can be expressed as xi = (1 − α) fi + α Σ(j→i) xj / dj, where dj is the out-degree of node j; fi is the probability that the user randomly jumps to node i; α is the probability that the user follows the actual link structure.
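The formula above can be sketched as a power iteration. This is a minimal illustration, not the thesis code: the 3-node graph, α = 0.85, and the uniform jump vector f are all invented for the example.

```python
import numpy as np

# Minimal power-iteration sketch of the PageRank formula:
# x_i = (1 - alpha) * f_i + alpha * sum over j->i of x_j / d_j
links = {0: [1, 2], 1: [2], 2: [0]}   # node j -> list of nodes j links to
n = 3
alpha = 0.85                          # prob. of following the actual links
f = np.full(n, 1.0 / n)               # f_i: uniform random-jump distribution

x = np.full(n, 1.0 / n)               # rank vector, start uniform
for _ in range(100):
    x_new = (1 - alpha) * f
    for j, outs in links.items():
        for i in outs:                # each out-link of j passes x_j / d_j
            x_new[i] += alpha * x[j] / len(outs)
    x = x_new

print(x.round(4))                     # ranks sum to 1; node 2 ranks highest
```

Node 2 receives links from both other nodes, so it ends up with the highest rank, matching the intuition on this slide.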

  6. Preliminaries • Problems • Manipulation • The “richer-get-richer” phenomenon • Computation Efficiency • Dangling nodes problem

  7. Preliminaries • Nodes that either have no out-link or for which no out-link is known are called dangling nodes. • Dangling nodes problem • It is hard to sample the entire web. • Page et al. (1998) reported 51 million URLs not yet downloaded after 24 million pages had been downloaded. • Eiron et al. (2004) reported that the number of uncrawled pages (3.75 billion) still far exceeds the number of crawled pages (1.1 billion). • Including dangling nodes in the overall ranking may not only change the rank values of non-dangling nodes but also change their order.

  8. An example If we ignore the dangling node 3, the ranks for nodes 1 and 2 take one set of values; if we consider the dangling node 3, the revised PageRank algorithm (Kamvar 2003) gives different ranks.

  9. Related work • Page (1998): simply remove them; after doing so, they can be added back in. The details are missing. • Amati (2003): handle dangling nodes robustly based on a modified graph. • Kamvar (2003): add a uniform random jump from dangling nodes to all nodes. • Eiron (2004): speed up the model in Kamvar (2003), but sacrifice accuracy. Furthermore, suggest an algorithm that penalizes nodes that link to dangling nodes of class 2, which we will define later.

  10. Related work - Amati (2003)

  11. Related work - Kamvar (2003)
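The uniform-jump fix attributed above to Kamvar (2003) can be sketched as follows; the 4-node adjacency matrix is made up for illustration.

```python
import numpy as np

# Sketch of the uniform-jump fix for dangling nodes: rows of the link
# matrix with no out-links are replaced by a uniform distribution over
# all nodes, so the transition matrix becomes row-stochastic.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 0],      # node 2 is dangling: no out-links
              [1, 0, 0, 0]], dtype=float)
n = A.shape[0]

P = A.copy()
for i in range(n):
    s = P[i].sum()
    P[i] = P[i] / s if s > 0 else np.full(n, 1.0 / n)

print(P[2])                      # dangling row becomes [0.25 0.25 0.25 0.25]
```

After this fix every row of P sums to 1, so the power iteration operates on a proper stochastic matrix.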

  12. Related work - Eiron (2004)

  13. Predictive Ranking Model • Classes of dangling nodes • Nodes that are found but not visited at the current time are called dangling nodes of class 1. • Nodes that have been tried but not visited successfully are called dangling nodes of class 2. • Nodes that have been visited successfully but from which no out-link is found are called dangling nodes of class 3. • Handle different kinds of dangling nodes in different ways. Our work focuses on dangling nodes of class 1, which cause missing information.

  14. Illustration of dangling nodes • At time 1: visited node: 1; dangling nodes of class 1: 2, 4, 5, 7. • At time 2: visited nodes: 1, 7, 2; dangling nodes of class 1: 3, 4, 5, 6; dangling nodes of class 3: 7. • Known information at time 2: red links. Missing information at time 2: white links.

  15. Predictive Ranking Model • For dangling nodes of class 3, we use the same technique as Kamvar (2003). • For dangling nodes of class 2, we ignore them in the current model, although it is possible to combine the push-back algorithm (Eiron 2004) with our model. (Penalizing nodes is a subjective matter.) • For dangling nodes of class 1, we try to predict the missing information about the link structure.

  16. Predictive Ranking Model (Cont.) • Suppose that all the nodes V can be partitioned into three subsets: C, D1, and D2. • C denotes the set of all non-dangling nodes (that have been crawled successfully and have at least one out-link); • D1 denotes the set of all dangling nodes of class 3; • D2 denotes the set of all dangling nodes of class 1. • For each node v in V, the real in-degree of v is not known. • For each node v in C, the real out-degree of v is known. • For each node v in D1, the real out-degree of v is known to be zero. • For each node v in D2, the real out-degree of v is unknown.

  17. Predictive Ranking Model (Cont.) • We predict the real in-degree of v by the number of found links from C to v. • Assumption: the number of found links from C to v is proportional to the real number of links from V to v. • For example, if C and D1 together have 100 nodes, V has 1000 nodes, and the number of links from C to v is 5, then we estimate that the number of links from V to v is 50. • The difference between these two numbers is distributed uniformly over the nodes in D2.
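The proportionality assumption can be sketched in a few lines. The sizes and link count mirror the example on this slide and are purely illustrative.

```python
# Sketch of the PreRank in-degree prediction under the proportionality
# assumption: found links from crawled nodes scale up with the ratio of
# all found nodes to crawled nodes.
crawled = 100          # |C| + |D1|: nodes whose out-links are known (or known empty)
total = 1000           # |V|: all found nodes
found_links_to_v = 5   # observed links from C to a node v

predicted_in_degree = found_links_to_v * total / crawled   # 5 * 1000/100 = 50.0

# The (predicted - found) links are attributed uniformly to the
# unvisited nodes D2, whose out-links are unknown.
d2_size = total - crawled                                  # |D2| = 900
extra_per_d2_node = (predicted_in_degree - found_links_to_v) / d2_size
print(predicted_in_degree, extra_per_d2_node)              # 50.0 0.05
```

Each node in D2 is thus assumed to contribute a small fractional link weight toward v, which is what populates the D2-to-V part of the transition matrix on the next slide.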

  18. Predictive Ranking Model • Model the missing information from unvisited nodes to nodes in V: from D2 to V. • Model the known link information as in Page (1998): from C to V. • Model the user's behavior as in Kamvar (2003) when facing dangling nodes of class 3: from D1 to V. • n: the number of nodes in V; m: the number of nodes in C; m1: the number of nodes in D1.

  19. Predictive Ranking Model • Model users' behavior (called "teleportation") as in Page (1998) and Kamvar (2003): when users get bored of following actual links, they may jump to some node at random. • x is the rank vector.

  20. Block Predictive Ranking Model • Predict the in-degree of v more accurately. • Divide all nodes into p blocks (V[1], V[2], …, V[p]) according to their top-level domains (for example, edu), or domains (for example, stanford.edu), or countries (for example, cn). • Assumption: the number of found links from C[i] (the intersection of C and V[i]) to v is proportional to the real number of links from V[i] to v. Consequently, the matrix A is changed. • The other parts are the same as in the Predictive Ranking Model.
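The block-wise variant applies the same proportionality assumption per block instead of globally. The block sizes and link counts below are made up for illustration.

```python
# Block-wise in-degree prediction: scale the found links to v within
# each block by that block's own crawled fraction, then sum.
blocks = {
    "edu": {"crawled": 80, "total": 200, "found_links_to_v": 4},
    "com": {"crawled": 20, "total": 800, "found_links_to_v": 1},
}
predicted = sum(b["found_links_to_v"] * b["total"] / b["crawled"]
                for b in blocks.values())
print(predicted)   # 4 * 200/80 + 1 * 800/20 = 10 + 40 = 50.0
```

A global estimate would weight both blocks identically; the per-block version lets a sparsely crawled block (here "com") receive a much larger scale factor, which is the accuracy gain this slide claims.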

  21. Block Predictive Ranking Model • Model the missing information from unvisited nodes in the 1st block to nodes in V.

  22. Experiment Setup • Two datasets (crawled by Patrick Lau): one within the domain cuhk.edu.hk, the other outside this domain. For the first dataset, we take 11 snapshots of the link matrix during crawling; for the second, 9 snapshots. • Apply both the Predictive Ranking Model and the revised PageRank Model (Kamvar 2003) to these snapshots. • Compare the results of both models at time t with the future results of both models. • The future results rank more nodes than the current results, so a direct comparison is difficult.

  23. Illustration for comparison • The future result is cut down to the nodes present in the current result, both vectors are normalized, and the difference is computed by the 1-norm.
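This cut-normalize-compare step can be sketched as follows. The sketch assumes the current nodes occupy the leading entries of the future rank vector; the vectors themselves are invented.

```python
import numpy as np

# Compare a current rank vector against a longer future one:
# cut the future vector to the current node set, renormalize both
# to sum to 1, and take the 1-norm of the difference.
def rank_difference(current, future):
    cut = future[:len(current)]        # assumes current nodes come first
    cur = current / current.sum()
    cut = cut / cut.sum()
    return np.abs(cur - cut).sum()     # 1-norm difference

current = np.array([0.5, 0.3, 0.2])
future = np.array([0.4, 0.3, 0.2, 0.1])
print(round(rank_difference(current, future), 4))   # 0.1111
```

A smaller difference means the current snapshot's ranking is already close to the future one, which is how the slides score each model.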

  24. Within domain cuhk.edu.hk Data Description Number of iterations

  25. Within domain cuhk.edu.hk • Comparison based on the future PageRank result at time 11: the PageRank result at time t and the PreRank result at time t are each compared (by their difference) against the PageRank result at time 11.

  26. Within domain cuhk.edu.hk • Comparison based on the future PageRank result at time 11. • Visited nodes at time 11: 502,610; found nodes at time 11: 607,170.

  27. Within domain cuhk.edu.hk • Comparison based on the future PreRank result at time 11: the PageRank result at time t and the PreRank result at time t are each compared (by their difference) against the PreRank result at time 11.

  28. Within domain cuhk.edu.hk • Comparison based on the future PreRank result at time 11. • Visited nodes at time 11: 502,610; found nodes at time 11: 607,170.

  29. Outside cuhk.edu.hk Data Description Number of iterations

  30. Outside cuhk.edu.hk • Comparison based on the future PageRank result at time 9. • Visited nodes at time 9: 39,824; found nodes at time 9: 882,254; actual nodes: more than 4 billion.

  31. Outside cuhk.edu.hk • Comparison based on the future PreRank result at time 9. • Visited nodes at time 9: 39,824; found nodes at time 9: 882,254; actual nodes: more than 4 billion.

  32. Conclusions on PreRank • PreRank performs better than PageRank in accuracy on a closed web. • PreRank needs fewer iterations than PageRank (contrary to our expectation!).

  33. Information Systems (IS)

  34. Preliminaries • In Rough Set Theory, the dependency degree γ is a traditional measure that expresses the percentage of objects that can be correctly classified into D-classes by employing the attributes in C. However, γ does not accurately express the dependency between C and D: • γ = 0 whenever there is no deterministic rule. • The table on the next slide is an information system.

  35. An example γ = 0 when C = {a} and D = {d}, but in fact D depends on C to some degree.

  36. Original definition of γ: γ(C, D) = |∪(X∈U/D) C∗(X)| / |U|, where U is the set of all objects; each block of U/C is a C-class; X is one D-class; C∗(X) is the lower approximation of X.
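As a concrete sketch, γ can be computed on a toy complete information system. The attribute values below are invented for illustration.

```python
from collections import defaultdict

# Classical dependency degree gamma: the fraction of objects whose
# whole C-class falls inside a single D-class (the positive region).
objects = [                     # (C-attribute value, D-attribute value)
    ("x", 0), ("x", 0), ("y", 0), ("y", 1), ("z", 1),
]

c_classes = defaultdict(list)   # partition the universe into C-classes
for obj in objects:
    c_classes[obj[0]].append(obj)

# A C-class is in the positive region iff it determines D, i.e. all its
# objects share one D value.
positive = sum(len(cls) for cls in c_classes.values()
               if len({d for _, d in cls}) == 1)
gamma = positive / len(objects)
print(gamma)   # C-classes "x" and "z" determine D: (2 + 1) / 5 = 0.6
```

The C-class "y" contains objects with both D values, so its two objects drop out of the positive region entirely, illustrating the all-or-nothing behavior criticized on the earlier slide.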

  37. Preliminaries • Incomplete IS: an information system with missing values.

  38. Related work • Lingras (1998): represents missing values by the set of all possible values, by which a rule extraction method is proposed. • Kryszkiewicz (1999): establishes a similarity relation based on the missing values, upon which a method for computing optimal certain rules is proposed. • Leung (2003): introduces maximal consistent block technique for rule acquisition in incomplete information systems.

  39. Related work • The conditional entropy H(D|C) = −Σc P(c) Σd P(d|c) log P(d|c) is used in C4.5 (Quinlan, 1993).

  40. Definition of γ We find the following variation of γ: γ(C, D) = |{x ∈ U : C(x) ⊆ D(x)}| / |U|. U: universe of objects; C, D: sets of attributes; C(x): C-class containing x; D(x): D-class containing x.

  41. Generalized Dependency Degree Γ′ The first form of Γ′ is defined as Γ′(C, D) = (1/|U|) Σ(x∈U) |C(x) ∩ D(x)| / |C(x)|. U: universe of objects; C, D: sets of attributes; C(x): C-class containing x; D(x): D-class containing x.
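The first form of Γ′ averages, over all objects x, the fraction of x's C-class that shares x's D-class. A sketch on the same kind of toy data as before (values invented):

```python
from collections import Counter

# First form of the generalized dependency degree:
# Gamma'(C, D) = (1/|U|) * sum over x of |C(x) ∩ D(x)| / |C(x)|
objects = [("x", 0), ("x", 0), ("y", 0), ("y", 1), ("z", 1)]

c_class = Counter(c for c, _ in objects)     # |C(x)| per C value
cd_class = Counter(objects)                  # |C(x) ∩ D(x)| per (C, D) pair
gamma_prime = sum(cd_class[obj] / c_class[obj[0]]
                  for obj in objects) / len(objects)
print(gamma_prime)   # (1 + 1 + 0.5 + 0.5 + 1) / 5 = 0.8
```

On this data the classical γ is 0.6 while Γ′ is 0.8: the mixed C-class "y" still contributes partial credit (0.5 per object) instead of being discarded, which is exactly the refinement Γ′ is meant to provide.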

  42. Variations of Γ′ and γ The second form of Γ′: Γ′(C, D) = Σ(r ∈ MinR(C,D)) strength(r) · confidence(r), where MinR(C, D) denotes the set of all minimal rules. A minimal rule maps one C-class to one D-class.

  43. Properties of Γ′ Γ′ can be extended to equivalence relations R1 and R2. Property 1. Property 2. Property 3. Property 4.

  44. Extend Γ′ to incomplete IS • Replace each missing value by a probabilistic distribution over the possible values of that attribute.

  45. Extend Γ′ to incomplete IS (Cont.) • Then use the second form of the generalized dependency degree as our definition of Γ′ in an incomplete IS. • We need to define the strength and confidence of a rule.

  46. Extend Γ′ to incomplete IS (Cont.) • The confidence of a rule is defined as • The strength of a rule is defined as • The set is defined inductively as

  47. Extend Γ′ to incomplete IS (Cont.) • The definitions use the algebraic sum, the algebraic product, and the complement of fuzzy sets.

  48. Comparison with conditional entropy The conditional entropy is H(D|C) = −Σc P(c) Σd P(d|c) log P(d|c). The third form of Γ′: Γ′(C, D) can be proved to be Σc P(c) Σd P(d|c)².
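Both measures can be computed from the same joint counts; the sketch below takes the third form of Γ′ to be Σc P(c) Σd P(d|c)² (an assumption, derived here from the first form rather than read off the slide) and reuses the toy data from earlier, with log base 2.

```python
import math
from collections import Counter

# Compute H(D|C) and the third form of Gamma' from joint (C, D) counts.
objects = [("x", 0), ("x", 0), ("y", 0), ("y", 1), ("z", 1)]
n = len(objects)
c_counts = Counter(c for c, _ in objects)
cd_counts = Counter(objects)

h, gp = 0.0, 0.0
for (c, d), ncd in cd_counts.items():
    p_c = c_counts[c] / n
    p_d_given_c = ncd / c_counts[c]
    h -= p_c * p_d_given_c * math.log2(p_d_given_c)   # conditional entropy
    gp += p_c * p_d_given_c ** 2                      # third form of Gamma'
print(h, gp)   # 0.4 0.8
```

The two measures move in opposite directions: a lower H(D|C) and a higher Γ′ both indicate that C determines D more strongly. Note that gp = 0.8 agrees with the first-form computation on the same data, as the third form requires.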

  49. Comparison by experiments • The conditional entropy H(D|C) is used in C4.5. • We replace H(D|C) with Γ′(C, D) in the C4.5 algorithm, forming a new C4.5 algorithm.

  50. Datasets we use • We use the same eleven datasets with missing values from the UCI Repository as Quinlan uses in: Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 77–90, 1996. • Both the old C4.5 and the new C4.5 use ten-fold cross-validation on each task.
