Inconsistent Data on the Semantic Web: A Theoretical Approach
Brian Goodrich


Presentation Transcript


  1. Inconsistent Data on the Semantic Web: A Theoretical Approach • Brian Goodrich

  2. The Problem • A computer application has a set of inputs and a set of outputs determined by those inputs and its internal logic. • If an application is given input data that puts it in a conflicted state when deciding its output, it will crash unless it has some kind of logic for resolving that conflict. • The Semantic Web is based on being able to parse human intent from structured, semi-structured, and unstructured data on the Web. • Human intent is frequently conflicting.

  3. Conflicting Data Sources • Malicious – deceptive or rerouting attempts, or simply ignorantly incorrect information • Incomplete Information – insufficient context, or simply unfinished data • Humor – especially sarcasm, satire, and exaggeration (e.g., political cartoons) • Time – what once was one thing is now another (e.g., quality of service, price) • Ontological Deficiency – when the extraction ontology lacks sufficient vividness to separate data appropriately

  4. Solution

  5. Thesis • To propose a method that simplifies dealing with conflicting data on the Semantic Web in a fast, accurate, and dynamic way by supplying each web source with a derived indicator of its communal usage, called a Consensual Reliability Score (CRS).

  6. Methods (a*z) + (b*y) + (c*x) + (d*w) = CRS(f) • Formula for deriving the CRS of a web source f from inputs a, b, c, and d • with weighted constants z, y, x, and w (a sketch of the computation follows below).
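A minimal sketch of the CRS formula in Python, assuming the four inputs have already been normalized to [0, 1]; the weight values below are placeholders, since the deck notes that at least x and w still need experimental tuning.

```python
# Sketch of the slide's formula: CRS(f) = (a*z) + (b*y) + (c*x) + (d*w)
# The weight values here are assumed, not taken from the deck.
WEIGHTS = {"z": 0.3, "y": 0.3, "x": 0.2, "w": 0.2}

def crs(a: float, b: float, c: float, d: float) -> float:
    """Combine the four mined inputs into a Consensual Reliability Score.

    a: site-type score, b: incoming-index score,
    c: usage-mining score, d: direct-survey score.
    All inputs are assumed to be normalized to [0, 1].
    """
    return (a * WEIGHTS["z"] + b * WEIGHTS["y"]
            + c * WEIGHTS["x"] + d * WEIGHTS["w"])

# Example: a source with strong incoming links but weak survey results.
print(crs(a=0.8, b=0.9, c=0.6, d=0.3))  # -> 0.69
```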

  7. Site Type Mining (a * z)… Five types of web pages (a hypothetical scoring sketch follows below): • Head Pages • Navigation Pages • Content Pages • Look-up Pages • Personal Pages
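A hypothetical sketch of how the five page types might feed the input a; the deck only names the types, so the type-to-score mapping below is entirely an assumption.

```python
from enum import Enum

class PageType(Enum):
    """The five page types named on the slide."""
    HEAD = "head"
    NAVIGATION = "navigation"
    CONTENT = "content"
    LOOKUP = "lookup"
    PERSONAL = "personal"

# Assumed mapping from page type to the CRS input a;
# the deck does not specify these scores.
PAGE_TYPE_SCORE = {
    PageType.HEAD: 0.7,
    PageType.NAVIGATION: 0.4,
    PageType.CONTENT: 0.9,
    PageType.LOOKUP: 0.8,
    PageType.PERSONAL: 0.2,
}

def site_type_input(page_type: PageType) -> float:
    return PAGE_TYPE_SCORE[page_type]
```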

  8. Incoming Index …(b * y)… • A distributed web crawler counts hyperlinks, then traverses the unique hyperlink paths looking for additional links. • Link counts are stored in a hash indexed by the destination of each hyperlink. • Provides a dynamic count of how often the internet as a whole points to a given web source, and therefore an indication of how often people use that source. • Excludes orphan sites (mostly personal sites and spam pop-ups). • Motivated by the success of the Google search engine. (A simplified sketch follows below.)
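A simplified, single-process sketch of the incoming-index idea (the deck describes a distributed crawler); the fetching and link-extraction details are assumptions, but the core structure matches the slide: a hash of link counts indexed by destination, built by traversing unique hyperlink paths.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def incoming_index(seed_urls, max_pages=100):
    """Count incoming hyperlinks, keyed (hashed) by destination URL.

    Traverses unique hyperlink paths breadth-first. Pages never linked
    to from the crawled set (orphan sites) simply never accumulate counts.
    """
    counts = {}             # hash indexed by hyperlink destination
    visited = set()
    frontier = deque(seed_urls)
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue  # unreachable page or malformed URL; skip it
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            dest = urljoin(url, href)
            if not dest.startswith(("http://", "https://")):
                continue
            counts[dest] = counts.get(dest, 0) + 1
            frontier.append(dest)
    return counts

# Example: counts = incoming_index(["http://example.com"])
```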

  9. Usage Mining …(c * x)… • The most straightforward way of testing how often people use a web source: query the site's hit count, i.e., how many people have viewed the site. • Problem: unlike the Incoming Index method, this does not exclude orphan sites. • Further experimentation is needed to determine x's weight. (A normalization sketch follows below.)
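A sketch of turning a raw hit count into the input c; there is no standard API for querying a site's hit count, so where raw_hits comes from is left open. The log scaling is an assumption, chosen so that a few mega-sites do not flatten every other score toward zero.

```python
import math

def usage_input(raw_hits: int, max_hits: int) -> float:
    """Normalize a raw hit count into the CRS input c, in [0, 1].

    Note: unlike the incoming index, this metric does not
    exclude orphan sites.
    """
    if raw_hits <= 0 or max_hits <= 0:
        return 0.0
    return min(1.0, math.log1p(raw_hits) / math.log1p(max_hits))
```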

  10. Direct Survey …(d * w)… • The most reliable method of determining reliability: manually query users directly. • Too slow and costly to be considered a whole solution, but it can assist in CRS derivation, ideally offsetting frequently visited sites with no true information (onion.com, humor sites, etc.). • More experimentation is needed to determine w's weight.

  11. Review (a*z) + (b*y) + (c*x) + (d*w) = CRS(f)

  12. “Classical content data mining is not applicable in this case (CRS derivation) because it is the content of the web sources that is in question.” -Brian Goodrich

  13. Storage
  • Global Index – fast access; centralized storage for CRSBot; but a centralized vulnerability: a vital non-distributed resource in a distributed system. (A minimal sketch follows below.)
  • Local Storage – no single centralized vulnerability, but a non-unified derivation formula (disrupts the trust algorithm).
  • Local Derivation – too slow to be useful (the problem size is too large).
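A minimal sketch of the Global Index option, assuming a simple URL-to-score mapping; the deck names only the tradeoffs, so this interface is illustrative rather than CRSBot's actual design.

```python
class GlobalCRSIndex:
    """Centralized CRS storage for CRSBot: fast access and a single
    derivation formula, but also a single point of failure (the
    'vital non-distributed resource' the slide warns about).
    """
    def __init__(self):
        self._scores = {}  # url -> CRS

    def put(self, url: str, crs: float) -> None:
        self._scores[url] = crs

    def get(self, url: str, default: float = 0.0) -> float:
        return self._scores.get(url, default)

# Example usage:
index = GlobalCRSIndex()
index.put("http://example.com", 0.69)
print(index.get("http://example.com"))  # -> 0.69
```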

  14. Related Work • Tim Berners-Lee: "There is a choice here, and I am not sure right now which appeals to me most. One is to say precisely, 'whatever any document says of the form xxxx is a member of W3C so long as it is signed with key 32457934759432'. The other is to say, 'whatever is of form xxxx and can be inferred from information signed with key 32457934759432'." • There are problems with both choices, but both use static references in a dynamic environment (the web).

  15. Contributions • The CRS provides a fast and accurate measure of community consensus on the web. • It enables reliable decisions between conflicting data sources, fine-tuning the results from the Semantic Web.

  16. Limitations • Totally reliant on the usage patterns of the internet, which may not always reflect which data is more correct. • Reflects only consensus about a data source, not the actual data it contains. • Cannot express complex or compound relationships or extract partial truths.

  17. Questions?
