
A Framework for Hierarchical Cost-sensitive Web Resource Acquisition



  1. A Framework for Hierarchical Cost-sensitive Web Resource Acquisition Tan Yee Fan WING Group Meeting 2009 November 10

  2. Contents • Background and Motivation • Cost-sensitive Record Acquisition Framework • Solution for Record Matching Problems • Evaluation • Conclusion

  3. Contents • Background and Motivation • Record linkage • Using web-based resources • Issues with existing work • Cost-sensitive Record Acquisition Framework • Solution for Record Matching Problems • Evaluation • Conclusion

  4. Record Linkage • Input: two lists of items, A and B • Output: for each item pair (a, b) in A × B, are a and b a match? • Example: KDD = Knowledge Discovery and Data Mining; PAKDD = Pacific-Asia Conference on Knowledge Discovery and Data Mining; DKMD = Research Issues on Data Mining and Knowledge Discovery; KDID = Knowledge Discovery in Inductive Databases • Input information is sometimes incomplete or insufficient • May be useful to acquire additional information from the web
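The pairwise formulation above can be sketched as follows. The token-level Jaccard matcher here is a hypothetical stand-in for a real similarity function; as the slide notes, the input strings alone are often insufficient, which is what motivates acquiring additional web evidence.

```python
def token_jaccard(a, b):
    """Jaccard similarity over lowercased word tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def link_records(A, B, threshold=0.5):
    """Score every pair (a, b) in A x B; return pairs at or above threshold."""
    return [(a, b) for a in A for b in B if token_jaccard(a, b) >= threshold]
```

Note that a pair like ("KDD", "Knowledge Discovery and Data Mining") shares no tokens at all, so a purely string-based matcher misses it entirely.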

  5. Information from Search Engine • Hit count • Title • Snippet • URL → Web page • Possible queries: “jcdl” • “jcdl” “joint conference on digital libraries” • “jcdl” OR “joint conference on digital libraries” • jcdl digital libraries • …
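Hit counts are the cheapest signal a search engine returns, and the later slides derive dice(a, b) from them. A minimal sketch of the standard Dice coefficient over hit counts, assuming counts are available for a, b, and the conjunctive query a b:

```python
def dice(hitcount_a, hitcount_b, hitcount_ab):
    """Dice coefficient from search engine hit counts:
    2 * hits(a AND b) / (hits(a) + hits(b))."""
    denom = hitcount_a + hitcount_b
    return 2.0 * hitcount_ab / denom if denom else 0.0
```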

  6. Using Search Engine Results • Inverse host frequency • Counts from snippets • Number of snippets containing a particular term • Web page similarity • TF-IDF cosine similarity between two pages • Test instance for classification • Frequency counts • Similarity metrics • …
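The TF-IDF cosine similarity between two pages mentioned above can be sketched as follows; the whitespace tokenization and the small corpus used to estimate IDF are assumptions for illustration only.

```python
import math
from collections import Counter

def tfidf_cosine(doc_a, doc_b, corpus):
    """TF-IDF cosine similarity between two token lists, with IDF
    estimated from a corpus of token lists."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # document frequency of each term

    def vec(doc):
        tf = Counter(doc)
        # Terms unseen in the corpus get no weight (df would be zero).
        return {t: tf[t] * math.log(n / df[t]) for t in tf if df[t]}

    va, vb = vec(doc_a), vec(doc_b)
    dot = sum(va[t] * vb.get(t, 0.0) for t in va)
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```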

  7. Web-based Record Matching • Matching entities to concepts (Cimiano et al., 2005) • Semantic similarity between words (Bollegala et al., 2007) • Word sense disambiguation (Mihalcea and Moldovan, 1999) • Machine transliteration (Oh and Isahara, 2008) • Record linkage (Elmacioglu et al., 2007) • Author name disambiguation (Tan et al., 2006) • Web people search (Kalashnikov et al., 2008)

  8. Web-based Record Matching • Most existing work uses this scheme • Query the search engine with queries of one or more of these types: a, b, and a b • Optionally, append predefined terms or tokens t, e.g., a t • Compute similarity metrics • Optionally, construct test instances for classification

  9. Cost of Acquiring Web Resources • Acquiring web resources is slow and may incur other access costs • For a dataset of 1000 records, querying all records individually takes a day or more; querying them pairwise takes even longer • For large datasets, it is not feasible to query and download everything • Existing work sidesteps or ignores this issue • Some discussion in Tan et al. (2008)

  10. Hierarchical Dependencies • Basically ignored in existing work! • Same resource may be used to extract different types of attribute values • Different instances can share attribute values • Example hierarchy: search engine results → web pages → term frequencies

  11. Contents • Background and Motivation • Cost-sensitive Record Acquisition Framework • Resource dependency graph • Resource acquisition problem • Observations for record matching problems • Solution for Record Matching Problems • Evaluation • Conclusion

  12. search(ab) search(b) search(a) hitcount(ab) hitcount(b) hitcount(a) a b dice(a, b) instance xa, b search(a1b1) search(b1) search(a1) hitcount(a1) hitcount(a1b1) hitcount(b1) dice(a1, b1) search(a1b2) search(a2b1) hitcount(a2b1) hitcount(a1b2) dice(a1, b2) dice(a2, b1) search(a2) search(a2b2) search(b2) hitcount(a2b2) hitcount(b2) hitcount(a2) dice(a2, b2) Resource Dependency Graph • Directed acyclic graph G = (V, E) • Vertex types T with |T| = 7

  13. Resource Acquisition Graph • Feasible vertex set V' ⊆ V: • V' can be acquired as-is without violating acquisition dependencies • If v ∈ V', then for all (u → v) ∈ E, we require u ∈ V' • Set of feasible subsets of vertices F(G) ⊆ 2^V: • All feasible sets of vertices
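Feasibility is simply closure under the dependency edges. A minimal check, assuming `parents` is a (hypothetical) map from each vertex v to the set of vertices u with an edge u → v:

```python
def is_feasible(acquired, parents):
    """A vertex set is feasible iff every acquired vertex also has all of
    its parents acquired, i.e. no acquisition dependency is violated."""
    return all(parents.get(v, set()) <= acquired for v in acquired)
```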

  14. search(ab) search(b) search(a) hitcount(ab) hitcount(b) hitcount(a) a b dice(a, b) instance xa, b Cost and Utility • Cost • search(.) cost 10 each • All other cost 1 each • Utility • Positive utility for vertices that are attribute values in test instances • Zero utility otherwise

  15. Resource Acquisition Problem • Given • Resource dependency graph G = (V, E) • Acquisition cost function cost: V → R+ ∪ {0} • Utility function utility: F(G) → R • Vertex type function type: V → T • Budget budget • Select V' ∈ F(G) with cost(V') ≤ budget to maximize obj(V') = utility(V') – cost(V')
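On a toy graph the problem statement can be made concrete by brute force: enumerate every feasible (downward-closed) subset within budget and keep the one maximizing obj(V') = utility(V') − cost(V'). Enumeration is exactly what later slides argue is infeasible at scale, so this is an illustration of the objective only, with a hypothetical per-vertex cost table and utility function.

```python
from itertools import combinations

def best_feasible_set(vertices, parents, cost, utility, budget):
    """Exhaustively solve the acquisition problem on a tiny graph.
    Exponential in |V| -- for illustrating the problem definition only."""
    best, best_obj = frozenset(), utility(frozenset())
    for r in range(1, len(vertices) + 1):
        for subset in combinations(vertices, r):
            s = frozenset(subset)
            if any(not parents.get(v, set()) <= s for v in s):
                continue  # violates an acquisition dependency
            c = sum(cost[v] for v in s)
            if c > budget:
                continue  # exceeds the acquisition budget
            obj = utility(s) - c
            if obj > best_obj:
                best, best_obj = s, obj
    return best, best_obj
```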

  16. Applications • Easy to construct resource dependency graphs • When web pages can be downloaded • When TF or TF-IDF cosine similarity is involved • When blocking is applied • …

  17. search(ab) search(b) search(a) hitcount(ab) hitcount(b) hitcount(a) a b dice(a, b) instance xa, b Observations for Record Matching • Structure of resource acquisition graph • Depth of 2 to 4 is typical • Vertex out-degree varies widelye.g., from 1 to max(|A|, |B|)

  18. Observations for Record Matching • Usually possible to flatten graph to two levels • Root level: expensive search engine results and web page downloads • Leaf level: cheap attribute values • [Figure: the dependency graph for a pair (a, b) before and after flattening]

  19. Contents • Background and Motivation • Cost-sensitive Record Acquisition Framework • Solution for Record Matching Problems • Application of Tabu search • More intelligent legal moves • Propagation of utility values • Utility function when classifier is SVM • Evaluation • Conclusion

  20. Search Algorithm • State space • F(G), set of feasible subsets of vertices • Search process • Current state denoted by V' ∈ F(G) • Starting state V' = ∅ • Legal moves add or remove vertices from V' to reach next state • Objective • Maximize obj(V') = utility(V') – cost(V')

  21. Issues • Heuristic search required • Not feasible to enumerate all possible V' to acquire • Many local maxima • V' = ∅ is a local maximum • Simple greedy search does not work • Easy for states to be revisited • No guidance as to which root vertices to acquire • Each search(·) has the same objective value • [Figure: the resource dependency graph for a pair (a, b)]

  22. Solutions • Application of Tabu search • More intelligent legal moves • Propagation of utility values

  23. Tabu Search • Simple hill climbing, but with a Tabu list • Moves in the Tabu list are prohibited until their prohibition expires; this avoids repeated visiting of states • Exception: when the new objective value exceeds the best objective value known so far • When a move is made, moves that reverse its effect are added to the Tabu list • Expiration is set to the current iteration plus the Tabu tenure • Stop when the best objective value does not improve for some number of iterations
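A generic sketch of the Tabu search loop described above, including the aspiration exception (a tabu move is still allowed if it beats the best known objective). The move representation and the `tenure`/`patience` parameters are assumptions, not values from the talk.

```python
def tabu_search(start, moves, apply_move, reverse, obj, tenure=5, patience=20):
    """Hill climbing with a Tabu list that forbids reversing recent moves,
    unless a move beats the best known objective (aspiration criterion)."""
    state, best_state = start, start
    best = obj(start)
    tabu = {}  # move -> iteration at which its prohibition expires
    it, stale = 0, 0
    while stale < patience:
        it += 1
        candidates = []
        for m in moves(state):
            nxt = apply_move(state, m)
            v = obj(nxt)
            if tabu.get(m, 0) <= it or v > best:  # not tabu, or aspiration
                candidates.append((v, m, nxt))
        if not candidates:
            break
        v, m, nxt = max(candidates, key=lambda c: c[0])
        tabu[reverse(m)] = it + tenure  # forbid undoing this move for a while
        state = nxt
        if v > best:
            best, best_state, stale = v, nxt, 0
        else:
            stale += 1
    return best_state, best
```

As a toy usage, maximizing f(x) = −(x − 3)² over the integers with moves ±1 converges to x = 3.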

  24. Legal Moves • Let D(V') be the set of children of any vertex in V' that are not in V' • If V' is empty, then set D(V') = V • Legal moves • Add(v): Add v from D(V') and its ancestors into V' • Remove(v): Remove v and its descendants from V' • AddType(t): Add a maximal set of vertices (within acquisition budget) of type t from D(V') into V'
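Add(v) and Remove(v) must preserve feasibility by closing over ancestors and descendants, respectively. A sketch, assuming hypothetical `parents` and `children` adjacency maps over the dependency graph:

```python
def add_with_ancestors(v, acquired, parents):
    """Add(v): add v plus all of its not-yet-acquired ancestors,
    so the acquired set stays feasible."""
    new = set(acquired)
    stack = [v]
    while stack:
        u = stack.pop()
        if u not in new:
            new.add(u)
            stack.extend(parents.get(u, ()))
    return new

def remove_with_descendants(v, acquired, children):
    """Remove(v): remove v plus all of its acquired descendants,
    since they depend on v."""
    new = set(acquired)
    stack = [v]
    while stack:
        u = stack.pop()
        if u in new:
            new.discard(u)
            stack.extend(children.get(u, ()))
    return new
```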

  25. Propagated Utility • [Figure: utility of a leaf vertex v propagated up to its ancestor u]

  26. Surrogate Utility and Objective • Given current state V', consider adding V'' • Surrogate utility function • Surrogate objective function • surr-obj(V', V'') = surr-utility(V', V'') – cost(V'')

  27. Utility Function for SVM • Decompose utility(V’) into sum ofutility(x, A(V’, x)) over test instances x • Compute utility(x, A’) as expected decrease in misclassification cost for acquiring attribute values A’ • Apply earlier work on cost-sensitive attribute value acquisition to compute utility(x, A’)

  28. Contents • Background and Motivation • Cost-sensitive Record Acquisition Framework • Solution for Record Matching Problems • Evaluation • Experimental setup and datasets • Results • Conclusion

  29. Evaluation Task • Linkage of short forms (SF) to long forms (LF) • Evaluation metric: total cost of acquisitions and misclassifications • [Figure: resource dependency graph for a pair (sf, lf), including snippetcount(sf → lf) and snippetcount(sf ← lf); search(·) vertices cost 10, all others cost 1]

  30. Datasets • Human genomes • 307 short forms × 307 long forms • Training: 154 × 154 pairs (23716 training instances) • Testing: 153 × 153 pairs (23409 test instances) • 117657 vertices, 140760 edges • Misclassification costs: 50 false positive, 500 false negative • DBLP venues • 920 short forms × 906 long forms • Training: 460 × 453 pairs (100000 randomly selected training instances) • Testing: 460 × 453 pairs (213444 test instances) • 1069068 vertices, 1281588 edges • Misclassification costs: 50 false positive, 1500 false negative

  31. Algorithms • Baselines • Random • Least cost • Highest utility • Best cost-utility ratio • Best type • Our algorithm • Manual (with access to solution set)

  32. Evaluation Methodology • Start with graph with no vertices acquired • For each iteration • Run acquisition algorithm with budget 100 • Record data point • Total cost of acquisitions • Total cost of misclassifications • Total cost of acquisitions and misclassifications • Run for 100 iterations
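The iteration protocol above can be sketched as a simple loop; `acquire` and `misclassification_cost` are hypothetical callbacks standing in for the acquisition algorithm and the classifier evaluation.

```python
def run_evaluation(acquire, misclassification_cost, iterations=100, budget=100):
    """Run the acquisition algorithm repeatedly with a fixed per-iteration
    budget, recording (acquisition cost, misclassification cost, total)."""
    total_acq = 0
    curve = []
    for _ in range(iterations):
        total_acq += acquire(budget)    # spend at most `budget` this round
        mis = misclassification_cost()  # error cost given current evidence
        curve.append((total_acq, mis, total_acq + mis))
    return curve
```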

  33. Results • GENOMES

  34. Results • DBLP

  35. Results • Almost 50% improvement in average difference in misclassification cost!

  36. Contents • Background and Motivation • Cost-sensitive Record Acquisition Framework • Solution for Record Matching Problems • Evaluation • Conclusion • Contributions • Future work

  37. Contributions • Framework for performing cost-sensitive acquisitions of resources with hierarchical dependencies • Model using resource dependency graph • Acquisition algorithm for record matching problems, given cost and utility functions • Utility function when classifier is SVM • Evaluation demonstrates effectiveness on record matching problems when using SVM

  38. Future Work • Apply resource acquisition framework to clustering problems

  39. Thank You
