Inconsistent Data on the Semantic Web: A Theoretical Approach
Brian Goodrich


Presentation Transcript


  1. Inconsistent Data on the Semantic Web: A Theoretical Approach • Brian Goodrich

  2. The Problem • A computer application has a set of inputs and a set of outputs determined by those inputs and its internal logic. • If an application is given input data that puts it in a conflicted state when deciding its output, it will crash unless it has some kind of logic for resolving that conflict. • The Semantic Web is based on being able to parse human intent from structured, semi-structured, and unstructured data on the Web. • Human intent is frequently conflicting.

  3. Conflicting Data Sources • Malicious – deceptive or rerouting attempts, or simply ignorantly incorrect information • Incomplete Information – insufficient context, or simply unfinished data • Humor – especially sarcasm, satire, and exaggeration (e.g., political cartoons) • Time – what once was one thing is now another (e.g., quality of service, price) • Ontological Deficiency – when the extraction ontology lacks sufficient vividness to separate data appropriately

  4. Solution

  5. Thesis • To propose a method that simplifies dealing with conflicting data on the Semantic Web in a fast, accurate, and dynamic way by supplying each web source with a derived indicator of its communal usage, called a Consensual Reliability Score (CRS).

  6. Methods (a*z) + (b*y) + (c*x) + (d*w) = CRS(f) • Formula for deriving the CRS of a web source f from inputs a, b, c, and d • with weighted constants z, y, x, and w (a sketch of the computation follows below).
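A minimal sketch of the CRS formula in Python, assuming the four inputs have already been normalized to [0, 1]; the weight values below are placeholders, since the deck notes that at least x and w still need experimental tuning.

```python
# Sketch of the slide's formula: CRS(f) = (a*z) + (b*y) + (c*x) + (d*w)
# The weight values here are assumed, not taken from the deck.
WEIGHTS = {"z": 0.3, "y": 0.3, "x": 0.2, "w": 0.2}

def crs(a: float, b: float, c: float, d: float) -> float:
    """Combine the four mined inputs into a Consensual Reliability Score.

    a: site-type score, b: incoming-index score,
    c: usage-mining score, d: direct-survey score.
    All inputs are assumed to be normalized to [0, 1].
    """
    return (a * WEIGHTS["z"] + b * WEIGHTS["y"]
            + c * WEIGHTS["x"] + d * WEIGHTS["w"])

# Example: a source with strong incoming links but weak survey results.
print(crs(a=0.8, b=0.9, c=0.6, d=0.3))  # -> 0.69
```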

  7. Site Type Mining (a * z)… Five types of web pages (a hypothetical scoring sketch follows below): • Head Pages • Navigation Pages • Content Pages • Look-up Pages • Personal Pages
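A hypothetical sketch of how the five page types might feed the input a; the deck only names the types, so the type-to-score mapping below is entirely an assumption.

```python
from enum import Enum

class PageType(Enum):
    """The five page types named on the slide."""
    HEAD = "head"
    NAVIGATION = "navigation"
    CONTENT = "content"
    LOOKUP = "lookup"
    PERSONAL = "personal"

# Assumed mapping from page type to the CRS input a;
# the deck does not specify these scores.
PAGE_TYPE_SCORE = {
    PageType.HEAD: 0.7,
    PageType.NAVIGATION: 0.4,
    PageType.CONTENT: 0.9,
    PageType.LOOKUP: 0.8,
    PageType.PERSONAL: 0.2,
}

def site_type_input(page_type: PageType) -> float:
    return PAGE_TYPE_SCORE[page_type]
```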

  8. Incoming Index …(b * y)… • A distributed web crawler counts hyperlinks, then traverses the unique hyperlink paths looking for additional links. • Link counts are stored in a hash indexed by the destination of each hyperlink. • Provides a dynamic count of how often the internet as a whole points to a given web source, and therefore an indication of how often people use that source. • Excludes orphan sites (mostly personal sites and spam pop-ups). • Motivated by the success of the Google search engine. (A simplified sketch follows below.)
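A simplified, single-process sketch of the incoming-index idea (the deck describes a distributed crawler); the fetching and link-extraction details are assumptions, but the core structure matches the slide: a hash of link counts indexed by destination, built by traversing unique hyperlink paths.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def incoming_index(seed_urls, max_pages=100):
    """Count incoming hyperlinks, keyed (hashed) by destination URL.

    Traverses unique hyperlink paths breadth-first. Pages never linked
    to from the crawled set (orphan sites) simply never accumulate counts.
    """
    counts = {}             # hash indexed by hyperlink destination
    visited = set()
    frontier = deque(seed_urls)
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue  # unreachable page or malformed URL; skip it
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            dest = urljoin(url, href)
            if not dest.startswith(("http://", "https://")):
                continue
            counts[dest] = counts.get(dest, 0) + 1
            frontier.append(dest)
    return counts

# Example: counts = incoming_index(["http://example.com"])
```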

  9. Usage Mining …(c * x)… • The most straightforward way of testing how often people use a web source: query the site's hit count, i.e., how many people have viewed the site. • Problem: unlike the Incoming Index method, this does not exclude orphan sites. • Further experimentation is needed to determine x's weight. (A normalization sketch follows below.)
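A sketch of turning a raw hit count into the input c; there is no standard API for querying a site's hit count, so where raw_hits comes from is left open. The log scaling is an assumption, chosen so that a few mega-sites do not flatten every other score toward zero.

```python
import math

def usage_input(raw_hits: int, max_hits: int) -> float:
    """Normalize a raw hit count into the CRS input c, in [0, 1].

    Note: unlike the incoming index, this metric does not
    exclude orphan sites.
    """
    if raw_hits <= 0 or max_hits <= 0:
        return 0.0
    return min(1.0, math.log1p(raw_hits) / math.log1p(max_hits))
```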

  10. Direct Survey …(d * w)… • The most reliable method of determining reliability: manually query users directly. • Too slow and costly to be considered a whole solution, but it can assist in CRS derivation, ideally offsetting frequently visited sites with no true information (onion.com, humor sites, etc.). • More experimentation is needed to determine w's weight.

  11. Review (a*z) + (b*y) + (c*x) + (d*w) = CRS(f)

  12. “Classical content data mining is not applicable in this case (CRS derivation) because it is the content of the web sources that is in question.” -Brian Goodrich

  13. Storage
  • Global Index – fast access; centralized storage for CRSBot; but a centralized vulnerability: a vital non-distributed resource in a distributed system. (A minimal sketch follows below.)
  • Local Storage – no single centralized vulnerability, but a non-unified derivation formula (disrupts the trust algorithm).
  • Local Derivation – too slow to be useful (the problem size is too large).
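A minimal sketch of the Global Index option, assuming a simple URL-to-score mapping; the deck names only the tradeoffs, so this interface is illustrative rather than CRSBot's actual design.

```python
class GlobalCRSIndex:
    """Centralized CRS storage for CRSBot: fast access and a single
    derivation formula, but also a single point of failure (the
    'vital non-distributed resource' the slide warns about).
    """
    def __init__(self):
        self._scores = {}  # url -> CRS

    def put(self, url: str, crs: float) -> None:
        self._scores[url] = crs

    def get(self, url: str, default: float = 0.0) -> float:
        return self._scores.get(url, default)

# Example usage:
index = GlobalCRSIndex()
index.put("http://example.com", 0.69)
print(index.get("http://example.com"))  # -> 0.69
```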

  14. Related Work • Tim Berners-Lee: "There is a choice here, and I am not sure right now which appeals to me most. One is to say precisely, 'whatever any document says of the form xxxx is a member of W3C so long as it is signed with key 32457934759432'. The other is to say, 'whatever is of form xxxx and can be inferred from information signed with key 32457934759432'." • There are problems with both choices, but both use static references in a dynamic environment (the web).

  15. Contributions • The CRS provides a fast and accurate measure of community consensus on the web. • It enables reliable decisions between conflicting data sources, fine-tuning the results from the Semantic Web.

  16. Limitations • Totally reliant on the usage patterns of the internet, which may not always reflect which data is more correct. • Reflects only consensus about a data source, not the actual data it contains. • Cannot express complex or compound relationships or extract partial truths.

  17. Questions?
