truth discovery with multiple conflicting information providers on the web
Download
Skip this Video
Download Presentation
Truth Discovery with Multiple Conflicting Information Providers on the Web

Loading in 2 Seconds...

play fullscreen
1 / 24

truth discovery with multiple conflicting information - PowerPoint PPT Presentation


  • 442 Views
  • Uploaded on

Truth Discovery with Multiple Conflicting Information Providers on the Web. KDD 07. Motivation. Example: Authors of books We tried to find out who wrote the book “Rapid Contextual Design”. Many different sets of authors from different online bookstores. Accurate information. Incomplete

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'truth discovery with multiple conflicting information' - Michelle


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
truth discovery with multiple conflicting information providers on the web

Truth Discovery with Multiple Conflicting InformationProviders on the Web

KDD 07

motivation
Motivation
  • Example: Authors of books
    • We tried to find out who wrote the book “Rapid Contextual Design”.
      • Many different sets of authors from different online bookstores

Accurate

information

Incomplete

information

motivation3
Motivation
  • Is the world-wide web always trustable?
  • Unfortunately, the answer is “NO”.
  • There is no guarantee for the correctness of information on the web.
motivation4
Motivation
  • Different web sites often provide conflicting information on a subject.
  • 54% of Internet users trust news web sites
  • 26% for web sites that sell products
  • 12% for blogs
veracity
Veracity
  • Veracity i.e., conformity to truth
  • How to find true facts from a large amount of conflicting information on many subjects that is provided by various web sites.
  • This paper invent an algorithm called TRUTHFINDER,
    • A web site is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy web sites.
problem definitions

Authors

Emotions

Authors

Emotions

Bookstore

Readers

Books

News

Problem Definitions

Facts: properties of the objects

problem definitions7
Problem Definitions
  • Definition 1: (Confidence of facts.)
    • The confidence of a fact f (denoted by s(f)) is the probability of f being correct, according to the best of our knowledge.
  • Definition 2: (Trustworthiness of web sites.)
    • The trustworthiness of a web site w (denoted by t(w)) is the expected confidence of the facts provided by w.
problem definitions8
Problem Definitions
  • Implication between facts
    • Imp( f1 f2 ) :
      • how much f2’s confidence should be increased or decreased according to f1’s confidence.
    • Imp( f1 f2 ) is a value between -1 and 1.
      • A positive value indicates if f1 is correct, f2 is likely to be correct.
      • A negative value means if f1 is correct, f2 is likely to be wrong.
    • Imp( f1 f2 ) = sim(f1, f2) - base_sim,
      • where sim(f1, f2) is the similarity between f1 and f2, and base_sim is a threshold for similarity.

T

F

computational model
Computational Model
  • Heuristic 1: Usually there is only one true fact for a property of an object.
  • Heuristic 2: This true fact appears to be the same or similar on different web sites.
  • Heuristic 3: The false facts on different web sites are less likely to be the same or similar.
  • Heuristic 4: In a certain domain, a web site that provides mostly true facts for many objectswill likely provide true facts for other objects.
computational model10
Computational Model
  • If a fact is provided by many trustworthy web sites, it is likely to be true
  • If a fact is conflicting with the facts provided by many trustworthy web sites, it is unlikely to be true.
  • A web site is trustworthy if it provides facts with high confidence
  • Web site trustworthiness and fact confidence can be determined by each other
  • True facts are more consistent than false facts (Heuristic 3)
computational model basic inference
Computational Model- Basic Inference
  • Basic Inference

t(w1) = 0.9 and t(w2) = 0.99

t(w2) = 1.1 × t(w1)

(w2) = 2× (w1)

computational model basic inference12
Computational Model- Basic Inference
  • f1 is provided by w1 and w2, if f1 is wrong then both w1 and w2 are wrong
  • the probability that both of them are wrong is
    • (1 − t(w1)) · (1 − t(w2))
  • the probability that f1 is not wrong is
    • 1 − (1 − t(w1)) · (1 − t(w2))
  • s(f) can be computed as :
computational model inferences between facts
Computational Model- Inferences between Facts
  • Adjusted confidence score

New score of the fact

computational model handling additional subtlety
Computational Model- Handling Additional Subtlety
  • Different web sites are not independent with each other
    • dampening factor 
  • the confidence of a fact f can easily be negative
experiments
Experiments
  • Baseline
    • Voting :
      • This method chooses the fact that is provided by most web sites.
  •  = 0.5
  •  = 0.3
  • Book Authors Dataset:
    • 1,265 computer science books
    • Using ISBN search on www.abebooks.com
      • 894 bookstores and 34,031 listings
      • 5.4 different authors / book
experiments18
Experiments
  • randomly select 100 books
  • manually find out their authors
  • accuracy :
  • Partially correct facts:
    • last name , first name and middle name : 3:2:1
    • for example : “Graeme Simsion” is 5/6 (omit middle name)
  • If f1 has x authors and f2 has y authors, and there are z shared ones, then imp(f1 f2) = z/x − base-sim
    • base-sim = 0.5
experiments book authors dataset20
Experiments- Book Authors Dataset
  • One book may make multiple errors
  • Miss authors:
    • only provide subset of all authors
experiments query with google
Experiments- Query with Google
  • Google is good at finding authoritative web sites.
  • But do these web sites provide accurate information ?
  • Compare the online bookstores
    • highest ranks by Google vs.
    • highest trustworthiness found by TruthFinder
  • Querying Google with “bookstore”
    • Find all bookstores that exist in their dataset from the top 300 Google results.
conclusion
Conclusion
  • Veracity problem
  • TRUTHFINDERAlgorithm
ad