Cluster based fact finders

Manish Gupta, Yizhou Sun, Jiawei Han. Feb 10, 2011

Why perform cluster based fact finding? • Books: Goldstone Books is a highly trustworthy provider, but it is not the best for history books. • Google/Yahoo/Bing are good search engines, but I would prefer Monster for jobs or 101apartments for apartments. • CNN, CBS or Google News are best for news, but I prefer Slashdot or TechCrunch for technical news, ESPN or Cricinfo for sports news, and Al Jazeera for Middle East news. • Providers excel in their fields of focus.

Our Contributions • Formally define problem of cluster based fact finding • Algorithm that performs trust analysis and clustering of objects iteratively • Comparison of our algorithm using different fact finders on multiple datasets showing better accuracy and interesting clusters • Analysis of clustering based fact finders using synthetic dataset

Related work • Yin et al [TKDE 2008]: Truth finder • Dong et al [PVLDB 2009]: Time varying truth, copycat detection • Pasternack et al [COLING 2010]: Multiple fact finders and effect of priors • Sun et al [EDBT 2009]: Alternate ranking-clustering framework (RankClus) • Gupta et al [WWW 2011]: Trust Analysis with Clustering • Work in Agent-based systems (trust of agents on each other based on past mutual interactions etc)

The iterative fact finder model • Three components of model • Trustworthiness of providers (sources) • Confidence (belief) of facts (claims) • Implications between facts
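The interplay between the first two components can be illustrated with a minimal Sums-style sketch (one of the fact finders from Pasternack et al. compared later in these slides). It is an illustration under stated assumptions, not the authors' code: implications between facts are left out, the function name `sums_fact_finder` and the toy claims are hypothetical, and the division by the maximum is added only to keep scores bounded.

```python
from collections import defaultdict


def sums_fact_finder(claims, iterations=20):
    """claims: iterable of (provider, fact) pairs.

    Returns (trust, belief) dictionaries. Implications between
    facts are NOT modeled in this sketch.
    """
    facts_of = defaultdict(set)      # provider -> facts it asserts
    providers_of = defaultdict(set)  # fact -> providers asserting it
    for provider, fact in claims:
        facts_of[provider].add(fact)
        providers_of[fact].add(provider)

    trust = {p: 1.0 for p in facts_of}
    belief = {f: 1.0 for f in providers_of}
    for _ in range(iterations):
        # Confidence of a fact: summed trust of the providers asserting it.
        belief = {f: sum(trust[p] for p in ps) for f, ps in providers_of.items()}
        top = max(belief.values())
        belief = {f: b / top for f, b in belief.items()}  # rescale (assumption)
        # Trust of a provider: summed confidence of the facts it asserts.
        trust = {p: sum(belief[f] for f in fs) for p, fs in facts_of.items()}
        top = max(trust.values())
        trust = {p: t / top for p, t in trust.items()}    # rescale (assumption)
    return trust, belief


# Toy data: x1 and x2 are competing facts about one object, y1 is about another.
toy_claims = [("A", "x1"), ("B", "x1"), ("C", "x2"), ("A", "y1"), ("C", "y1")]
trust, belief = sums_fact_finder(toy_claims)
```

In this toy run the fact backed by two providers ends up with higher confidence than its competitor, and the provider asserting the best-supported facts gets the top trust score.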

Basic Fact Finder Algorithm

Intuitive example

Drawbacks of basic fact finders • No object specific trust ranking is generated. Only global trustworthiness ranking of providers is computed. • Confidence ranking of facts for an object is influenced by trustworthiness of providers who are not so “good” for this object or objects related to this object.

Our hypothesis • Objects can be clustered based on provider trustworthiness profiles, t_o(p), personalized to the particular object. • Restricting the flow of trust information across objects, using clusters, can improve the ranking accuracy of facts and providers. • Iteratively alternating clustering and trust analysis can provide high quality trust-based clusters and can improve the accuracy of the trust ranking of providers and the confidence ranking of facts.
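The first bullet of the hypothesis can be sketched as follows: each object is represented by a vector holding one trust score per provider, and objects with similar vectors are grouped. The code is a hedged illustration only. The k-means-style loop with cosine similarity, the function name `cluster_trust_profiles`, the choice of the first k objects as initial centroids, and the toy scores are all assumptions, not details taken from the slides.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster_trust_profiles(profiles, k, iterations=10):
    """profiles: dict object -> list of per-provider trust scores.

    Returns dict object -> cluster index.
    """
    objects = sorted(profiles)
    # Assumption: seed the centroids with the first k objects (deterministic).
    centroids = [list(profiles[o]) for o in objects[:k]]
    assignment = {}
    for _ in range(iterations):
        # Assign every object to its most similar centroid.
        assignment = {
            o: max(range(k), key=lambda c: cosine(profiles[o], centroids[c]))
            for o in objects
        }
        # Recompute each centroid as the mean profile of its members.
        for c in range(k):
            members = [profiles[o] for o in objects if assignment[o] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Toy profiles over three providers: provider 0 is good for books a and c,
# providers 1 and 2 are good for books b and d.
toy_profiles = {
    "book_a": [0.9, 0.1, 0.2],
    "book_b": [0.1, 0.9, 0.8],
    "book_c": [0.8, 0.2, 0.1],
    "book_d": [0.2, 0.8, 0.9],
}
clusters = cluster_trust_profiles(toy_profiles, k=2)
```

Here books a and c land in one cluster and books b and d in the other, which matches the intuition that providers excel in their fields of focus.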

Clustering before Trust Analysis • Drawbacks • Does not use the information about the providers related to objects in other clusters. • This method needs some input clustering. Clusters are fixed and depend on a particular dimension. In many cases, such a clustering is not available or the desired trustworthiness based clustering may not follow any natural clustering of the objects along just a single dimension.

Clustering in provider trustworthiness space

Basic Cluster Based Fact Finder Drawbacks: There is no trustworthiness information sharing between objects in BCFF2. Every iteration in Algorithm 3 simply re-computes trustworthiness of providers based on implications between various facts about the same object.

Clustering with Trust Analysis

Smoothing • Three kinds of providers: • those that provide “correct” information about each object • those that provide “wrong” information for each of the objects • those that are “correct” for some objects and “wrong” for others • Our cluster based algorithms would intuitively work better for the third case. • If the vectors are quite close to each other, clustering is not really effective, so the scores are smoothed using the global scores. • s_C is the cluster based score and s_G is the global score. α is set to the average inter-cluster similarity.
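The smoothing formula itself did not survive in this transcript. One plausible reading, given that α grows as clusters become more similar, is a convex combination s = α·s_G + (1 − α)·s_C. The sketch below implements that reading only as an assumption; the function names and the use of cosine similarity between cluster trust vectors are likewise hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def average_inter_cluster_similarity(cluster_vectors):
    """Mean pairwise cosine similarity between cluster trust vectors."""
    pairs = list(combinations(cluster_vectors, 2))
    if not pairs:
        return 1.0  # a single cluster: rely entirely on the global score
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def smooth(cluster_scores, global_scores, alpha):
    """ASSUMED form: s = alpha * s_G + (1 - alpha) * s_C, element-wise."""
    return [alpha * g + (1.0 - alpha) * c
            for c, g in zip(cluster_scores, global_scores)]


# Orthogonal cluster vectors give alpha = 0, so cluster scores are kept as-is.
alpha = average_inter_cluster_similarity([[1.0, 0.0], [0.0, 1.0]])
smoothed = smooth([0.2, 0.8], [0.6, 0.4], alpha)
```

With non-negative trust scores the cosine similarity stays in [0, 1], so under this reading α is a valid mixing weight: nearly identical clusters push the result toward the global score, well-separated clusters keep the cluster based score.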

Datasets • Books (Yin et al.) 24819 author listings for 1265 books provided by 894 online book stores. Ground truth: manually from scanned book covers. Accuracy and implication values computed as match between best author list and golden list

Datasets • Wikipedia Biography Infobox dataset (Pasternack et al) • Accuracy for date measured as • Accuracy of strings: using Edit distance (if >75% else 0) • Population dataset (Pasternack et al) • 34422 Population claims by 1361 contributors about 30K cities. Golden truth using US Census data. • Accuracy measured as
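The string accuracy rule for the Wikipedia dataset ("using Edit distance (if >75% else 0)") is terse. One way to read it is sketched below: compute a normalized edit-distance similarity and give credit only when it exceeds 75%. The normalization by the longer string's length and the function names are assumptions; the slides do not spell them out.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]


def string_accuracy(claimed, truth, threshold=0.75):
    """ASSUMED rule: similarity = 1 - distance / longer length; 0 unless above threshold."""
    longest = max(len(claimed), len(truth))
    if longest == 0:
        return 1.0  # two empty strings match trivially
    similarity = 1.0 - edit_distance(claimed, truth) / longest
    return similarity if similarity > threshold else 0.0


near_miss = string_accuracy("Springfeld", "Springfield")  # one missing letter
mismatch = string_accuracy("Chicago", "Boston")           # unrelated strings
```

Under this reading a small typo still earns most of the credit, while an unrelated string earns none.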

Analysis of clustering profiles

Accuracy results

Synthetic dataset • 60 objects, 21 providers, 3 clusters • Each object has 4-5 different facts • Providers and objects are assigned to clusters • A provider can provide a fact for an object within the cluster with a probability of 0.8 • For a set of dicey objects (for which the most frequent fact is the true fact), prolific providers from other clusters provide a false fact with total frequency = 1 + max frequency
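A rough generator for the base part of this setup might look like the sketch below. Only the sizes and the 0.8 in-cluster probability come from the slide. The round-robin cluster assignment, the uniform choice of which fact a provider asserts, the fixed seed, and the name `make_synthetic` are placeholders chosen for illustration, and the injection of false facts for dicey objects is deliberately left out because the slide does not give enough detail to reproduce it.

```python
import random


def make_synthetic(num_objects=60, num_providers=21, num_clusters=3,
                   p_claim=0.8, seed=0):
    """Return (object_cluster, provider_cluster, num_facts, claims).

    claims is a list of (provider, object, fact_index) triples.
    """
    rng = random.Random(seed)
    # Assumption: round-robin assignment of objects and providers to clusters.
    object_cluster = {o: o % num_clusters for o in range(num_objects)}
    provider_cluster = {p: p % num_clusters for p in range(num_providers)}
    # Each object has 4 or 5 candidate facts.
    num_facts = {o: rng.choice([4, 5]) for o in range(num_objects)}

    claims = []
    for p in range(num_providers):
        for o in range(num_objects):
            same_cluster = provider_cluster[p] == object_cluster[o]
            if same_cluster and rng.random() < p_claim:
                # Assumption: the asserted fact is picked uniformly at random.
                claims.append((p, o, rng.randrange(num_facts[o])))
    return object_cluster, provider_cluster, num_facts, claims


object_cluster, provider_cluster, num_facts, claims = make_synthetic()
```

With these defaults each cluster holds 20 objects and 7 providers, so roughly 0.8 × 7 × 20 claims are expected per cluster.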

Improvement in accuracy • Parameters: • max support for the true fact of dicey objects • number of dicey objects • original strength of the providers • Gains are larger when there are more dicey objects and the best fact for them is not supported by many providers within their cluster.

Comparison of various cluster based fact finders • Sums performs better. • Sums has no normalization of any kind and hence has the best chance of improving.

Conclusion • We identified the problem of cluster based fact finding. • We proposed algorithms for trust analysis using cluster based methods. • We showed using four datasets that our algorithms perform better than traditional fact finders and generate interesting clusters. • In the future, we plan to use the network information within objects to influence the clustering of objects.

Acknowledgements • Xiaoxin Yin for the basic code base and the books and movies datasets • Jeff Pasternack for the Wikipedia datasets • Dr. Dan Roth for interesting discussions • Vinod Vydiswaran for reviewing a preliminary version of the work • NSF (IIS-09-05215) and ARL-NSCTA (W911NF-09-2-0053) for funding.

References

Thanks!

Variants of clustering with trust analysis • This version of ACFF gives more importance to trust analysis and tries to organize the clusters around the results of trust analysis. • Drawback: cluster conditional trust computations are used both to re-compute object conditional trust vectors and as centroids for clustering those vectors. This may bias the algorithm heavily towards changes in trust analysis.

Variants of clustering with trust analysis • Use the object conditional trustworthiness vectors computed initially using BCFF2 and avoid re-computing them after the cluster conditional trust analysis iterations. • Iterative trust analysis is done with the sole purpose of improving the cluster centroids. The representation of each of the objects is kept fixed. • Intuition: cluster centroids would organize themselves as far away from each other as possible in the trust space and hence lead to distinct clusters.

Variants of clustering with trust analysis • Perform clustering in a secondary richer space. • The i-th element of vector V is computed as the cosine similarity between the object conditional trust vector t_o and the i-th cluster conditional trust vector t_{c_i}.
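The construction of V described above is small enough to write out directly. The sketch below is illustrative only; the function name `secondary_space_vector` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def secondary_space_vector(object_trust, cluster_trusts):
    """V[i] = cosine(object trust vector, i-th cluster trust vector)."""
    return [cosine(object_trust, t_c) for t_c in cluster_trusts]


# One object described by trust in two providers, compared with three clusters.
V = secondary_space_vector([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

The resulting vector has one entry per cluster rather than one per provider, so objects are clustered by how they relate to the current clusters instead of by raw provider scores.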
