Truth Validation and Veracity Analysis with Information Networks. Jiawei Han, Data Mining Group, Computer Science, University of Illinois at Urbana-Champaign, www.cs.uiuc.edu/~hanj. April 2, 2014. Outline. TruthFinder: Truth Validation by Information Network Analysis

Truth Validation and Veracity Analysis with Information Networks

Jiawei Han

Data Mining Group, Computer Science

University of Illinois at Urbana-Champaign

www.cs.uiuc.edu/~hanj

April 2, 2014

Outline
  • TruthFinder: Truth Validation by Information Network Analysis
  • Beyond TruthFinder: Multiple Versions of Truth and Evolution of Truth
  • Enhancing Truth Validation by InfoNet Analysis: The RankClus & NetClus Methodology
  • Summary
Motivation
  • Why truth validation and veracity analysis?
  • Information sharing
    • Sharing trustable, quality information
    • Identifying false information among many conflicting ones
  • Information security
    • Protecting trustable information and its sources
    • Identifying suspicious information providers, i.e., those that frequently provide false information
    • Tracing back suspicious information providers via information networks
Truth Validation and Veracity Analysis by Information Network Analysis
  • The trustworthiness problem of the web (according to a survey):
    • 54% of Internet users trust news web sites most of the time
    • 26% for web sites that sell products
    • 12% for blogs
  • TruthFinder: Truth discovery on the Web by link analysis
    • Among multiple conflicting results, can we automatically identify which one is likely the true fact?
  • Veracity (conformity to truth):
    • Given a large amount of conflicting information about many objects, provided by multiple web sites (or other information providers), how can we discover the true fact about each object?
  • Xiaoxin Yin, Jiawei Han, Philip S. Yu, “Truth Discovery with Multiple Conflicting Information Providers on the Web”, TKDE’08
Conflicting Information on the Web
  • Different web sites often provide conflicting information on a subject, e.g., the authors of “Rapid Contextual Design”
Mapping It to Information Networks
  • Each object may have a set of conflicting facts
    • E.g., different author names for a book
  • And each web site provides some facts
  • How to find the true fact for each object?

[Figure: network linking web sites (w1 to w4) to facts (f1 to f5) and facts to objects (o1, o2)]
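The mapping from web sites to facts to objects can be sketched as two lookup tables. This is only a minimal illustration, not the authors' implementation: the claim triples, `build_network`, and `conflicting_objects` are hypothetical names and data introduced here, and the edges are not taken from the figure.

```python
from collections import defaultdict

# Hypothetical (web site, object, claimed value) triples; not data from the talk.
CLAIMS = [
    ("w1", "o1", "author A"),
    ("w2", "o1", "author A"),
    ("w3", "o1", "author B"),
    ("w3", "o2", "author C"),
    ("w4", "o2", "author C"),
]

def build_network(claims):
    """Return (site -> facts it provides, object -> competing facts)."""
    site_facts = defaultdict(set)
    object_facts = defaultdict(set)
    for site, obj, value in claims:
        fact = (obj, value)  # a fact is one claimed value for one object
        site_facts[site].add(fact)
        object_facts[obj].add(fact)
    return dict(site_facts), dict(object_facts)

def conflicting_objects(object_facts):
    """Objects for which the web sites disagree (more than one distinct fact)."""
    return sorted(o for o, facts in object_facts.items() if len(facts) > 1)

site_facts, object_facts = build_network(CLAIMS)
print(conflicting_objects(object_facts))
```

Keeping a fact as an (object, value) pair means the same value claimed for two different objects is never confused.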
Basic Heuristics for Problem Solving
  • There is usually only one true fact for a property of an object
  • This true fact appears to be the same or similar on different web sites
    • E.g., “Jennifer Widom” vs. “J. Widom”
  • The false facts on different web sites are less likely to be the same or similar
    • False facts are often introduced by random factors
  • A web site that provides mostly true facts for many objects will likely provide true facts for other objects
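The "same or similar" heuristic needs some measure of similarity between facts. The sketch below uses character-level matching from Python's standard `difflib` as a stand-in; the talk does not say which measure TruthFinder uses, and `similarity`, `group_similar`, and the 0.5 threshold are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a domain-specific measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def group_similar(values, threshold=0.5):
    """Greedily group values that are similar enough to a group's first member."""
    groups = []
    for value in values:
        for group in groups:
            if similarity(value, group[0]) >= threshold:
                group.append(value)
                break
        else:
            groups.append([value])  # no existing group is close enough
    return groups

names = ["Jennifer Widom", "J. Widom", "Hector Garcia-Molina"]
print(group_similar(names))
```

A real system would likely use a name-aware comparison (initials, name order), since plain string matching can both miss abbreviations and merge unrelated short names.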
Mutual Consolidation between Confidence of Facts and Trustworthiness of Providers
  • Confidence of facts↔Trustworthiness of web sites
    • A fact has high confidence if it is provided by (many) trustworthy web sites
    • A web site is trustworthy if it provides many facts with high confidence
  • The TruthFinder mechanism, an overview:
    • Initially, each web site is equally trustworthy
    • Based on the above four heuristics, infer fact confidence from web site trustworthiness, and then web site trustworthiness from fact confidence
    • Repeat until a stable state is reached
Analogy to Authority-Hub Analysis
  • Facts ↔ Authorities, Web sites ↔ Hubs
  • Difference from authority-hub analysis
    • Linear summation cannot be used
      • A web site is trustable if it provides accurate facts, instead of many facts
      • Confidence is the probability of being true
    • Different facts of the same object influence each other

[Figure: web sites (hubs, high trustworthiness) linked to facts (authorities, high confidence), e.g., w1 to f1]

Inference on Trustworthiness
  • Inference of web site trustworthiness & fact confidence

[Figure: network of web sites (w1 to w4), facts (f1 to f4), and objects (o1, o2)]

True facts and trustable web sites will become apparent after some iterations

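Computation Model: t(w) and s(f)
  • The trustworthiness of a web site w: t(w)
    • Average confidence of facts it provides
    • $t(w) = \frac{\sum_{f \in F(w)} s(f)}{|F(w)|}$, where the numerator is the sum of fact confidence and $F(w)$ is the set of facts provided by w
  • The confidence of a fact f: s(f)
    • One minus the probability that all web sites providing f are wrong
    • $s(f) = 1 - \prod_{w \in W(f)} \bigl(1 - t(w)\bigr)$, where $1 - t(w)$ is the probability that w is wrong and $W(f)$ is the set of web sites providing f

[Figure: web sites w1 and w2, with trustworthiness t(w1) and t(w2), both providing fact f1 with confidence s(f1)]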
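The two definitions above can be iterated directly. The following is a simplified sketch under stated assumptions, not the published TruthFinder algorithm: it applies only the averaging rule for t(w) and the "one minus the probability that all providers are wrong" rule for s(f), and it omits the influence between different facts of the same object. The function name, the initial trustworthiness of 0.9, the iteration cap, the tolerance, and the sample data are all hypothetical.

```python
from math import prod

def truthfinder(provides, init_trust=0.9, max_iter=20, tol=1e-6):
    """Iterate t(w) and s(f) until web site trustworthiness stabilizes.

    provides: dict mapping each web site to the non-empty set of facts it provides.
    """
    trust = {w: init_trust for w in provides}  # every site starts equally trustworthy
    providers = {}                             # fact -> set of web sites providing it
    for w, facts in provides.items():
        for f in facts:
            providers.setdefault(f, set()).add(w)

    confidence = {}
    for _ in range(max_iter):
        # s(f): one minus the probability that every site providing f is wrong
        confidence = {f: 1.0 - prod(1.0 - trust[w] for w in ws)
                      for f, ws in providers.items()}
        # t(w): average confidence of the facts that w provides
        new_trust = {w: sum(confidence[f] for f in facts) / len(facts)
                     for w, facts in provides.items()}
        change = max(abs(new_trust[w] - trust[w]) for w in trust)
        trust = new_trust
        if change < tol:
            break
    return trust, confidence

# Hypothetical data: three sites agree on value "A" for object o1, one claims "B".
PROVIDES = {
    "w1": {("o1", "A")},
    "w2": {("o1", "A")},
    "w3": {("o1", "A")},
    "w4": {("o1", "B")},
}
trust, confidence = truthfinder(PROVIDES)
print(max(confidence, key=confidence.get))
```

With this plain product the confidence of widely repeated facts saturates toward 1 within a few iterations, so the sketch is useful for seeing the mutual reinforcement rather than for producing calibrated probabilities.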

Experiments: Finding Truth of Facts
  • Determining authors of books
    • Dataset contains 1265 books listed on abebooks.com
    • We analyze 100 random books (using book images)
Experiments: Trustable Info Providers
  • Finding trustworthy information sources
    • Most trustworthy bookstores found by TruthFinder vs. Top ranked bookstores by Google (query “bookstore”)

[Table: bookstore rankings in two columns, TruthFinder and Google]

Outline
  • TruthFinder: Truth Validation by Information Network Analysis
  • Beyond TruthFinder: Multiple Versions of Truth and Evolution of Truth
  • Enhancing Truth Validation by InfoNet Analysis: The RankClus & NetClus Methodology
  • Summary
Beyond TruthFinder: Extensions
  • Limitations of TruthFinder:
    • Only one version of truth
      • But people may have different, contrasting opinions
    • Does not consider the time factor
      • But truth may change with time, e.g., Obama’s status in 2008 and 2009
  • Needed Extensions
    • Multiple versions of truth or opinions
    • Evolution of truth
  • Philosophy
    • Truth is a relative, evolving, and dynamically changing judgment
Multiple Versions of Truth
  • Watch out for copy-cats!
    • Copy-cat: Some information providers or even news agencies simply copy each other
    • Falsity could be amplified by copy-cats
    • How to judge copy-cats: They always copy in a certain dimensional space
    • Treat copy-cats as one provider instead of multiple
  • Statements can be clustered into multiple centers
    • False statements: still diverse, spread out, and lacking convergence
    • Statements could be clustered based on different dimensional space (context), e.g., Java

[Figure: network linking web sites (w1 to w4) to facts (f1 to f5) and facts to objects (o1, o2)]
Transition/Evolution of Truth
  • Truth is not static: It changes dynamically with time
    • Associating different versions of truth with different time periods
  • Clustering statements based on time durations
  • Statements
    • Identifying clusters (density-based clustering)
    • Distinguishing time-based clusters from outliers
  • Information providers
    • Leaders, followers, and old-timers
  • Information-network based ranking and clustering
    • Powerful analysis by information network analysis
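Clustering statements by time can be illustrated with a one-dimensional, gap-based grouping. This is a deliberately simple stand-in for density-based clustering, not the method used in the talk: statements whose timestamps lie within `eps` of their neighbor form a run, and runs with fewer than `min_pts` statements are reported as outliers. The function name, both parameters, and the sample timestamps are hypothetical.

```python
def cluster_by_time(times, eps=2, min_pts=2):
    """Split sorted timestamps into dense runs (clusters) and sparse leftovers (outliers)."""
    clusters, outliers, run = [], [], []

    def close(current):
        # A run is a cluster only if it holds enough statements.
        if len(current) >= min_pts:
            clusters.append(current)
        else:
            outliers.extend(current)

    for t in sorted(times):
        if run and t - run[-1] > eps:
            close(run)  # gap too large: the current run ends here
            run = []
        run.append(t)
    if run:
        close(run)
    return clusters, outliers

# Hypothetical statement timestamps (e.g., day numbers).
print(cluster_by_time([1, 2, 3, 10, 11, 30]))
```

Each cluster then corresponds to a candidate time period for one version of the truth, while isolated statements are set aside as outliers.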
Outline
  • TruthFinder: Truth Validation by Information Network Analysis
  • Beyond TruthFinder: Multiple Versions of Truth and Evolution of Truth
  • Enhancing Truth Validation by InfoNet Analysis: The RankClus & NetClus Methodology
  • Summary
Why RankClus?
  • More meaningful clusters
    • Within each cluster, ranking score for every object is available as well
  • More meaningful ranking
    • Ranking within a cluster is more meaningful than in the whole network
  • Address the problem of clustering in heterogeneous networks
    • No need to compute pair-wise similarity of objects
    • Mapping each object into a low-dimensional measure space
  • Which type of objects is clustered: Target objects (specified by user)
    • Clustering of target objects can induce a sub-network of the original network
Algorithm Framework - Illustration

[Figure: diagram with stages labeled Sub-Network, Ranking, and Clustering]

Algorithm Framework - Summary
  • Step 0. Initialization
    • Randomly partition target objects into K clusters
  • Step 1. Ranking
    • Ranking for each sub-network induced from each cluster, which serves as feature for each cluster
  • Step 2. Generating new measure space
    • Estimate mixture model coefficients for each target object
  • Step 3. Adjusting clusters
  • Step 4. Repeat Steps 1-3 until stable
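The loop formed by these steps can be sketched for a conference-author network as follows. This is a heavily simplified illustration, not the RankClus algorithm as published: it uses degree-based simple ranking, replaces the mixture-model estimation with a one-shot estimate weighted by each cluster's share of links, takes the initial partition as an argument instead of drawing it at random, and simply stops if a cluster becomes empty. All names and the toy data are hypothetical, and link weights are assumed positive.

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rankclus_sketch(W, labels, K, max_iter=20):
    """W: {conference: {author: weight}}; labels: {conference: initial cluster id}."""
    confs = sorted(W)
    total = sum(sum(W[x].values()) for x in confs)
    for _ in range(max_iter):
        members = [[x for x in confs if labels[x] == k] for k in range(K)]
        if any(not m for m in members):
            break  # simplification: stop rather than re-initialize
        rank_x, prior = [], []
        for m in members:
            # Step 1: simple ranking of authors within the cluster's sub-network
            mass = sum(sum(W[x].values()) for x in m)
            r_y = {}
            for x in m:
                for y, w in W[x].items():
                    r_y[y] = r_y.get(y, 0.0) + w / mass
            # Score every conference against this cluster's author ranking
            score = {x: sum(w * r_y.get(y, 0.0) for y, w in W[x].items()) for x in confs}
            norm = sum(score.values()) or 1.0
            rank_x.append({x: s / norm for x, s in score.items()})
            prior.append(mass / total)
        # Step 2: K-dimensional coefficients for each conference (one-shot estimate)
        pi = {}
        for x in confs:
            raw = [rank_x[k][x] * prior[k] for k in range(K)]
            z = sum(raw) or 1.0
            pi[x] = [v / z for v in raw]
        # Step 3: move each conference to the cluster with the most similar center
        centers = [[sum(pi[x][d] for x in m) / len(m) for d in range(K)] for m in members]
        new_labels = {x: max(range(K), key=lambda k: cosine(pi[x], centers[k])) for x in confs}
        if new_labels == labels:
            break  # Step 4: stable
        labels = new_labels
    return labels

# Hypothetical two-area network: c1, c2 share authors a1, a2; c3, c4 share b1, b2.
W = {
    "c1": {"a1": 4, "a2": 3, "b1": 1},
    "c2": {"a1": 3, "a2": 4},
    "c3": {"b1": 4, "b2": 3, "a1": 1},
    "c4": {"b1": 3, "b2": 4},
}
print(rankclus_sketch(W, {"c1": 0, "c2": 0, "c3": 0, "c4": 1}, K=2))
```

Note that no pairwise similarity between conferences is computed; each conference is compared only with the K cluster centers in the new measure space.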
Focus on A Bi-type Network Case
  • Conference-author network, links can exist between
    • Conference (X) and author (Y)
    • Author (Y) and author (Y)
  • Use W to denote the links and their weights
    • W =
Step 1: Feature Extraction — Ranking
  • Simple Ranking
    • Proportional to degree counting for objects
    • E.g., number of publications of authors
    • Considers only immediate neighborhood in the network
  • Authority Ranking
    • Extension to HITS in weighted bi-type network
    • Rules:
      • Rule 1: Highly ranked authors publish many papers in highly ranked conferences
      • Rule 2: Highly ranked conferences attract many papers from many highly ranked authors
      • Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors
Rules in Authority Ranking
  • Rule 1: Highly ranked authors publish many papers in highly ranked conferences
  • Rule 2: Highly ranked conferences attract many papers from many highly ranked authors
  • Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors
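The three rules can be read as a mutual-reinforcement iteration in the style of HITS. The sketch below is one plausible reading under stated assumptions rather than the exact formulation from the talk: conference rank is the weighted sum of its authors' ranks, and author rank mixes the conference contribution (Rule 1) with the co-author contribution (Rule 3) through a weight `alpha`. The function names, `alpha=0.95`, the iteration count, and the toy data are hypothetical.

```python
def normalize(scores):
    total = sum(scores.values()) or 1.0
    return {k: v / total for k, v in scores.items()}

def authority_ranking(w_xy, w_yy, alpha=0.95, iters=50):
    """w_xy: {conference: {author: weight}}; w_yy: {author: {co-author: weight}}."""
    confs = sorted(w_xy)
    authors = sorted({y for row in w_xy.values() for y in row}
                     | {y for a, row in w_yy.items() for y in (a, *row)})
    r_y = {y: 1.0 / len(authors) for y in authors}
    r_x = {}
    for _ in range(iters):
        # Rule 2: a conference ranks high if it attracts papers from high-ranked authors
        r_x = normalize({x: sum(w * r_y[y] for y, w in w_xy[x].items()) for x in confs})
        new = {y: 0.0 for y in authors}
        # Rule 1: an author ranks high by publishing in high-ranked conferences
        for x in confs:
            for y, w in w_xy[x].items():
                new[y] += alpha * w * r_x[x]
        # Rule 3: an author's rank is enhanced by (high-ranked) co-authors
        for y, coauthors in w_yy.items():
            for y2, w in coauthors.items():
                new[y] += (1 - alpha) * w * r_y[y2]
        r_y = normalize(new)
    return r_x, r_y

# Hypothetical network: conference X1 is large, X2 is small; a and b co-author.
w_xy = {"X1": {"a": 5, "b": 1}, "X2": {"c": 1}}
w_yy = {"a": {"b": 2}, "b": {"a": 2}}
r_x, r_y = authority_ranking(w_xy, w_yy)
print(sorted(r_y, key=r_y.get, reverse=True))
```

Unlike simple ranking, the scores here depend on the whole network through the iteration, not only on each object's immediate neighbors.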
Example: Authority Ranking in the 2-Area Conference-Author Network
  • Given the correct clusters, the rankings of authors are quite distinct from each other
Example: 2-D Coefficients in the 2-Area Conference-Author Network
  • The conferences are well separated in the new measure space

Scatter plots of two conferences and component coefficients

A Running Case Illustration for 2-Area Conf-Author Network
  • Initially, ranking distributions are mixed together
  • Two clusters of objects mixed together, but preserve similarity somehow
  • Improved a little
  • Two clusters are almost well separated
  • Improved significantly
  • Well separated
  • Stable

Time Complexity Analysis
  • At each iteration, where |E| is the number of edges in the network, m the number of target objects, and K the number of clusters:
    • Ranking for sparse network
      • ~O(|E|)
    • Mixture model estimation
      • ~O(K|E|+mK)
    • Cluster adjustment
      • ~O(mK^2)
  • Overall, linear in |E|
    • ~O(K|E|)
Case Study: Dataset: DBLP
  • All 2,676 conferences and the 20,000 authors with the most publications, from 1998 to 2007.
  • Both conference-author relationships and co-author relationships are used.
  • K=15
Beyond RankClus: A NetClus Model
  • RankClus combines ranking and clustering successfully to analyze information networks
  • A study on how ranking and clustering can mutually reinforce each other in information network analysis
  • RankClus works well on bi-typed information networks
  • Extension of bi-type network model to star-network model
    • DBLP: Author - paper - conference - title (subject)

[Figure: star network schema linking Author, Conference, Paper, and Subject]

NetClus: Database System Cluster

Authors:
  • Surajit Chaudhuri 0.00678065
  • Michael Stonebraker 0.00616469
  • Michael J. Carey 0.00545769
  • C. Mohan 0.00528346
  • David J. DeWitt 0.00491615
  • Hector Garcia-Molina 0.00453497
  • H. V. Jagadish 0.00434289
  • David B. Lomet 0.00397865
  • Raghu Ramakrishnan 0.0039278
  • Philip A. Bernstein 0.00376314
  • Joseph M. Hellerstein 0.00372064
  • Jeffrey F. Naughton 0.00363698
  • Yannis E. Ioannidis 0.00359853
  • Jennifer Widom 0.00351929
  • Per-Åke Larson 0.00334911
  • Rakesh Agrawal 0.00328274
  • Dan Suciu 0.00309047
  • Michael J. Franklin 0.00304099
  • Umeshwar Dayal 0.00290143
  • Abraham Silberschatz 0.00278185

Terms:
  • database 0.0995511
  • databases 0.0708818
  • system 0.0678563
  • data 0.0214893
  • query 0.0133316
  • systems 0.0110413
  • queries 0.0090603
  • management 0.00850744
  • object 0.00837766
  • relational 0.0081175
  • processing 0.00745875
  • based 0.00736599
  • distributed 0.0068367
  • xml 0.00664958
  • oriented 0.00589557
  • design 0.00527672
  • web 0.00509167
  • information 0.0050518
  • model 0.00499396
  • efficient 0.00465707

Conferences:
  • VLDB 0.318495
  • SIGMOD Conf. 0.313903
  • ICDE 0.188746
  • PODS 0.107943
  • EDBT 0.0436849

Ranking authors in XML

Outline
  • TruthFinder: Truth Validation by Information Network Analysis
  • Beyond TruthFinder: Multiple Versions of Truth and Evolution of Truth
  • Enhancing Truth Validation by InfoNet Analysis: The RankClus & NetClus Methodology
  • Summary
Summary
  • Progress Highlights
    • 3 Ph.D.s graduated in 2009
    • Currently over 20 Ph.D.s are working on closely related projects
    • Attracted more funded projects: 3 NSFs, NASA, DHS, …
    • Industry collaborations: Microsoft Research, IBM Research, Boeing, HP Labs, Yahoo!, Google, …
    • Research papers published in 2008 & 2009: 8 journal papers and 53 conference papers, including KDD, NIPS, SIGMOD, VLDB, ICDM, SDM, ICDE, ECML/PKDD, SenSys, ICDCS, IJCAI, AAAI, Discovery Science, PAKDD, SSDBM, ACM Multimedia, EDBT, CIKM, …
  • Truth validation by information network analysis is a promising direction: TruthFinder, iNextCube, and beyond
  • Knowledge is power, but knowledge is hidden in massive links
  • Integration of data mining with the project: Much more to be explored!
Recent Publications Related to the Talk
  • X. Yin, J. Han, and P. S. Yu, “Truth Discovery with Multiple Conflicting Information Providers on the Web”, TKDE'08
  • Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu, “RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis”, EDBT'09
  • Y. Sun, Y. Yu, and J. Han, “Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema”, KDD'09
  • Y. Sun, J. Han, J. Gao, and Y. Yu, “iTopicModel: Information Network-Integrated Topic Modeling”, ICDM'09
  • J. Han, “Mining Heterogeneous Information Networks by Exploring the Power of Links”, Discovery Science'09 (Invited Keynote Speech)
  • M.-S. Kim and J. Han, “A Particle-and-Density Based Evolutionary Clustering Method for Dynamic Networks”, VLDB'09
  • Y. Yu, C. Lin, Y. Sun, C. Chen, J. Han, B. Liao, T. Wu, C. Zhai, D. Zhang, and B. Zhao, “iNextCube: Information Network-Enhanced Text Cube”, VLDB'09 (system demo)