Presentation Transcript


  1. Semantic Web Application Architecture: Introduction and Issues
  Steven Seida

  2. Semantic Application Components
  [Figure – information extraction in context: textual documents, web pages, etc. (unstructured, semi-structured, and structured) are gathered by a spider and fed to information extraction (segment, classify, associate, normalize, deduplicate), which populates a knowledge store / database; data mining (discover patterns, select models, fit parameters, inference, report results) then produces actionable knowledge (prediction, outlier detection, decision support) through a filter.]
  From "Information Extraction" by Andrew McCallum, ACM Queue, November 2005, pp. 49-57.

  3. Information/Knowledge Extraction
  • Convert information (and knowledge) into computer-understandable knowledge
  • Unstructured extraction
  • Semi-structured extraction
  • Structured extraction

  4. Information/Knowledge Extraction
  • Unstructured extraction
  • Extract entities and relationships
  • Populate an ontology
  - Also referred to as populating the slots of a frame (old A.I. speak)
  - Populating can be a challenge
  - Semantic technology separates the description of the structure from the data

  5. Knowledge Extraction Technologies
  • Natural language processing
  • Multi-lingual handling
  • Multi-modal (text, audio, video)
  • Foundational technologies
  - Rule-based systems
  - Machine learning
  • On extraction accuracy
  - Be wary of junk in your knowledge store – it is very hard to clean out!
  - It is better to extract less with higher accuracy, especially since many sources tend to yield redundant information.
  An example extraction service: http://www.opencalais.com

  6. Example
  • New York Times says: "Saddam found in bunker"
  - <Entity> <Relationship> <Entity>
  • Associated Press says: "Saddam Hussein captured at secret underground room"
  - <Entity> <Relationship> <Entity>
  • Lots of redundancy, and accuracy likely improves significantly over any single source (depending on the strengths of the extractors and the ability to map to relationships your system is ready to handle). A sketch of the extracted triples follows.
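
A minimal sketch of what the two extractions might look like as RDF triples, using rdflib; the namespace and predicate names are illustrative assumptions, not part of the original slides:

    # Turning the two extracted headlines into RDF triples (hypothetical URIs).
    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/news#")  # assumed namespace
    g = Graph()

    # New York Times: Saddam found-in bunker
    g.add((EX.Saddam, EX.foundIn, EX.bunker))
    # Associated Press: Saddam Hussein captured-at secret underground room
    g.add((EX.SaddamHussein, EX.capturedAt, EX.secretUndergroundRoom))

    for s, p, o in g:
        print(s, p, o)

Deciding that EX.Saddam and EX.SaddamHussein denote the same entity is exactly the disambiguation problem taken up in slides 11-15.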

  7. Structured Knowledge Extraction
  • Given a database
  - Typically rows of well-defined relationships
  • Extract useful entities
  - Often, a row (or joined row) is a useful entity
  • Given a schema, the knowledge is pretty explicit
  • If the structure is good, leave it there (conversion to RDF or similar probably doesn't make sense)
  D2RQ (http://www4.wiwiss.fu-berlin.de/bizer/d2rq/) converts database data into RDF form. It can convert entire databases or provide a SPARQL endpoint that converts on the fly.
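
To make the row-to-entity idea concrete, here is a minimal sketch that maps each row of a relational table to an RDF entity; the table layout, column names, and URI scheme are assumptions (D2RQ does the same thing declaratively, via a mapping file):

    # Row-to-RDF sketch over an assumed "person" table.
    import sqlite3
    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/org#")  # assumed namespace
    g = Graph()

    conn = sqlite3.connect("people.db")  # assumed database file
    for person_id, name, employer in conn.execute(
            "SELECT id, name, employer FROM person"):
        subject = EX[f"person/{person_id}"]        # one entity per row
        g.add((subject, EX.name, Literal(name)))
        g.add((subject, EX.worksFor, EX[f"org/{employer}"]))

    print(g.serialize(format="turtle"))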

  8. Structured Knowledge – Semantic Wiki
  • A Normal Wiki's Semantics Are a Collection of Knowledge!
  - Pages reference pages
  - Pages might have attributes
  • A Semantic Wiki Enables the Knowledge
  - Pages have specific types: class (rdf:type)
  - References between pages have a relationship property: object property
  - Pages have semantic attributes (metadata): datatype property
  A full-featured semantic wiki is Semantic MediaWiki – an extension to MediaWiki, the wiki infrastructure behind Wikipedia. Extensions for input forms are required to make the system usable by normal humans. One example is http://foodfinds.referata.com/wiki/Home. A list of examples in use is at http://semantic-mediawiki.org/wiki/Sites_using_Semantic_MediaWiki
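
A sketch of how the slide's three wiki constructs map onto RDF; the page and property names are invented for illustration:

    # Page type -> class, page link -> object property, attribute -> datatype
    # property. Page and property names are hypothetical.
    from rdflib import Graph, Literal, Namespace, RDF

    WIKI = Namespace("http://example.org/wiki/")  # assumed wiki namespace
    g = Graph()

    page = WIKI["Franklin_Barbecue"]
    g.add((page, RDF.type, WIKI.Restaurant))          # rdf:type (class)
    g.add((page, WIKI.locatedIn, WIKI["Austin"]))     # object property
    g.add((page, WIKI.seatingCapacity, Literal(80)))  # datatype property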

  9. Semi-Structured Knowledge Extraction
  • A mix of structured and unstructured
  - Like a form with check boxes for some fields and open text areas for others
  - Common to human intelligence reports, like police or border patrol reports
  • Leverage structured tools, and provide the known or simple knowledge to the unstructured extractor
  - Many extractors won't use this additional contextual information (still a research area)
  • GRDDL (http://www.w3.org/2001/sw/grddl-wg/) is a W3C mechanism for creating/extracting RDF from normal XML documents.
  • Semi-structured data can typically be converted into XML
  • Then use unstructured tools on the unstructured fields (see the sketch below)
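
A minimal sketch of the split: pull the structured fields straight out of the XML, and hand only the free-text field to an unstructured extractor. The report format and the extract_entities helper are assumptions for illustration:

    # Split a semi-structured XML report: structured fields are read directly,
    # the free-text narrative goes to an unstructured extractor.
    import xml.etree.ElementTree as ET

    report = ET.fromstring("""
    <report>
      <agency>Border Patrol</agency>  <!-- structured: read directly -->
      <date>2009-03-14</date>
      <narrative>Subject was seen near the river crossing...</narrative>
    </report>
    """)

    facts = {
        "agency": report.findtext("agency"),
        "date": report.findtext("date"),
    }
    narrative = report.findtext("narrative")
    # extract_entities() is a hypothetical unstructured-extraction call that
    # could take the structured fields as context:
    # facts["entities"] = extract_entities(narrative, context=facts)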

  10. More Data vs. More Entities
  • Be wary of more data – knowledge extraction must result in entities, not just data!
  • Data grows exponentially (forever)
  • Entities grow nearly asymptotically (and are therefore manageable)
  - For example: Facebook people, cars, cell phones…
  [Charts: typical growth of data through time vs. expected growth of entities through time]

  11. Disambiguation (Deduplication)
  • Ambiguous: adj., capable of being understood in two or more possible senses or ways
  • Disambiguate: v., to establish a single semantic interpretation for
  • Also known as entity resolution and deduplication
  - Given sets of extracted knowledge, determine when pieces refer to the same object or entity
  • Ex: "Hussein" is the same entity as "Saddam Hussein" in the previous news example.

  12. Jaccard Measures for Disambiguation
  • Jaccard similarity measure
  - Similarity between sample sets: size of the intersection divided by size of the union
  - J(A,B) = |A ∩ B| / |A ∪ B|
  - Jaccard distance (the dissimilarity): 1 − J(A,B)
  • Is John(#1) the same as John(#2)?
  - We have a set of assertions about John(#1) and John(#2), derived from source sets A and B.
  - If we presume #1 and #2 are the same, how many features do we get, versus how many if we assume they are different?
  - If the sizes are similar enough, the references are to the same John (see the sketch below).
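
A minimal sketch of the set form, with made-up feature sets for the two Johns:

    # Jaccard similarity over two assumed feature sets for John(#1), John(#2).
    def jaccard(a: set, b: set) -> float:
        """|A intersect B| / |A union B| (0.0 when both sets are empty)."""
        return len(a & b) / len(a | b) if a | b else 0.0

    john1 = {"lives:Austin", "works:Raytheon", "spouse:Mary"}   # hypothetical
    john2 = {"lives:Austin", "works:Raytheon", "drives:truck"}  # hypothetical

    similarity = jaccard(john1, john2)   # 2 shared / 4 total = 0.5
    distance = 1 - similarity            # 0.5
    print(similarity, distance)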

  13. Jaccard Measures for Disambiguation
  • Binary data example (slight formulation difference – see Wikipedia)
  • Compare features of two entities (where each feature is true/false)
  - Both true: M11 = 1
  - First true, second false: M10 = 3
  - First false, second true: M01 = 0
  - Both false: M00 = 0
  • Jaccard similarity = M11 / (M10 + M01 + M11) = 1/4
  • Jaccard distance = (M01 + M10) / (M10 + M01 + M11) = 3/4
  A brief, handy Jaccard tutorial is at http://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html
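
The same numbers, computed from two assumed boolean feature vectors:

    # Binary Jaccard over boolean feature vectors matching the slide's counts:
    # M11 = 1, M10 = 3, M01 = 0, M00 = 0.
    first  = [True, True, True, True]
    second = [True, False, False, False]

    m11 = sum(a and b for a, b in zip(first, second))        # 1
    m10 = sum(a and not b for a, b in zip(first, second))    # 3
    m01 = sum(b and not a for a, b in zip(first, second))    # 0

    similarity = m11 / (m10 + m01 + m11)        # 1/4
    distance = (m01 + m10) / (m10 + m01 + m11)  # 3/4
    print(similarity, distance)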

  14. Random Forest for Name Disambiguation
  • Automatically create a large number of decision trees, each built from a different bootstrap sample of the training data
  - Select the number of trees
  - Select the number of features to consider at each node split
  • Decision tree conclusion
  - Run data through all trees for a yes/no match – the majority count wins
  • Tested on names from the Medline bibliography
  - 16 million biomedical research references
  • Raw features
  - Primary author name – first, last, middle
  - Co-author last names; affiliation of first author
  - Title, journal name, publication year, language
  - MeSH/keyword terms
  • Decision tree features (21 total)
  - Comparisons of the name fields, etc., between two articles
  - Uniqueness of names, titles, etc. (a classifier sketch follows this list)
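
A minimal sketch of this setup with scikit-learn's RandomForestClassifier; the feature matrix and labels below are random placeholders standing in for the 21 pairwise comparison features, not the Medline data:

    # Random-forest match/no-match classifier over article-pair features.
    # Each X row holds pairwise comparison features (e.g., name similarity,
    # shared co-authors); y is 1 for "same author", 0 otherwise. Toy data.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.random((200, 21))          # 21 comparison features, as on the slide
    y = (X[:, 0] > 0.5).astype(int)    # stand-in labels for illustration

    forest = RandomForestClassifier(
        n_estimators=500,    # number of trees (one bootstrap sample each)
        max_features="sqrt"  # features considered at each node split
    )
    forest.fit(X, y)
    print(forest.predict(X[:5]))       # majority vote across the trees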

  15. Disambiguation Results
  • Random forests (RF) work as well as any – at impressive accuracy
  • RF trains orders of magnitude faster than SVM
  • Bayesian issue – high correlation between features violates the independence assumption on the random variables
  [Results chart: accuracy versus the number of article pairings and the number of features in the test set, comparing a "majority class in training data" baseline, Bayesian, simple regression, decision tree, support vector machine (radial basis), and random forest classifiers]

  16. Knowledge Store
  • Need to be able to discover and manipulate knowledge across stores while leaving most of the knowledge where it is
  • Problems: database schemas and knowledge-store ontologies are different, inconsistent, and conflicting
  - Entities are not aligned across the stores
  • Some approach to federating the information is required
  - Federated data must conform to some schema or ontology applicable to the problem being solved
  Swoogle can help with ontologies: http://swoogle.umbc.edu/ provides search over more than a million RDF documents on the web.

  17. Federating Knowledge Stores: Approaches
  • Unifying relational database schema
  • Distributed RDF query

  18. Unified Relational Database
  • Unifying relational database schema
  - Define a domain schema
  - Convert all data into the domain schema
  • Possible to have the schema federate across multiple stores
  - Performance is likely to drive all data onto a central store
  • Disambiguate across data sources
  - Also likely to drive all data onto a central store

  19. Distributed RDF
  • Distributed RDF query
  - Define a domain ontology
  • Enable all data as RDF
  - SPARQL endpoints (D2RQ, RDF files from GRDDL or other tools, Triplify from web pages, RDF stores)
  - Enable within the domain schema
  • Disambiguate across data sources
  - Can leave data where it is – use owl:sameAs, or convert via SWRL
  - Queries will have to include the reasoning effects of the disambiguation (see the sketch below)
  Twine at http://www.twine.com – people create shared threads around a topic (of tags). Not a federated store, but it uses RDF for storage. To get started, search on Twine for "twine".
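
A minimal sketch of the owl:sameAs point with rdflib; in a real federation the two graphs would sit behind separate SPARQL endpoints, and the namespaces and data here are made up for illustration:

    # Two sources describe one entity under different URIs; an owl:sameAs
    # assertion records the disambiguation, and the query follows it.
    from rdflib import Graph, Namespace
    from rdflib.namespace import OWL

    A = Namespace("http://source-a.example/")   # assumed source namespaces
    B = Namespace("http://source-b.example/")

    g = Graph()
    g.add((A.Saddam, A.foundIn, A.bunker))                 # from source A
    g.add((B.SaddamHussein, B.capturedAt, B.hidingPlace))  # from source B
    g.add((A.Saddam, OWL.sameAs, B.SaddamHussein))         # disambiguation

    # Collect assertions made against either URI of the merged entity.
    q = """
    SELECT ?p ?o WHERE {
      ?alias owl:sameAs ?canon .
      { ?alias ?p ?o } UNION { ?canon ?p ?o }
    }"""
    for row in g.query(q, initNs={"owl": OWL}):
        print(row.p, row.o)

Without a reasoner, the query itself has to traverse owl:sameAs, which is the "reasoning effects" caveat on the slide.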

  20. Semantic Application Components
  [Repeats the information-extraction-in-context figure from slide 2 as a roadmap. From "Information Extraction" by Andrew McCallum, ACM Queue, November 2005, pp. 49-57.]

  21. Data Mining / Discovery Patterns
  • Vector space model
  - Multi-dimensional representation of documents as vectors
  - Every dimension of the vector is a term of interest
  - All vectors represent the same ordering of terms (at least when manipulated together)
  - The value at each vector position is the weight for that term
  - Given vectors, you can perform vector operations, like angle-between, as a similarity measure
  • The weighting approach is some of the magic:
  • Term frequency–inverse document frequency (TF-IDF) weighting
  - Used to evaluate how important a word is to a document in a collection
  - Combines how often a term occurs in a document with a weight based on how rare the term is across all the documents of interest
  - See Wikipedia for a good explanation of the details (a small sketch follows this slide)
  • Entropy weighting
  - Used to estimate the relative importance of each word within a document
  - Take the negative log of the term's occurrence count, normalized by the total word count of the document: -ln(cnt / sum_of_all_words)
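
A self-contained sketch of TF-IDF weighting and an angle-based (cosine) similarity over a toy three-document collection; the documents are made up, and libraries such as scikit-learn's TfidfVectorizer package the same idea:

    # TF-IDF vectors and cosine ("angle-between") similarity, from scratch.
    import math
    from collections import Counter

    docs = ["saddam found in bunker",
            "saddam hussein captured in underground room",
            "highway construction decision"]          # toy corpus

    tokenized = [d.split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})  # fixed term order

    def tfidf(doc):
        counts = Counter(doc)
        vec = []
        for term in vocab:
            tf = counts[term] / len(doc)               # term frequency
            df = sum(term in d for d in tokenized)     # document frequency
            vec.append(tf * math.log(len(docs) / df))  # weight by rarity
        return vec

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.dist(u, [0] * len(u)) * math.dist(v, [0] * len(v))
        return dot / norm if norm else 0.0

    vecs = [tfidf(d) for d in tokenized]
    print(cosine(vecs[0], vecs[1]))   # the two news stories are more alike...
    print(cosine(vecs[0], vecs[2]))   # ...than a story vs. the highway doc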

  22. Data Mining / Discovery Patterns
  • F-measure as a general measure of mining performance
  - A measure of a test's accuracy, considering both the precision p and the recall r of the test
  - Precision: typically the number of relevant documents returned, scaled by the total number returned
  - Recall: the number of relevant documents retrieved by a search, scaled by the total number of existing relevant documents (those that should have been returned)
  - Fβ measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as to precision
  - F1 = 2 * Precision * Recall / (Precision + Recall)
  - See Wikipedia for a good explanation of the details (a worked example follows)
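
A worked example with assumed counts: a search returns 8 documents, 6 of them relevant, out of 10 relevant documents that exist. The general Fβ formula used below, (1 + β²)·P·R / (β²·P + R), is the standard one; the counts are made up.

    # Precision, recall, F1, and the general F-beta on assumed counts.
    returned, relevant_returned, relevant_total = 8, 6, 10

    precision = relevant_returned / returned     # 6/8  = 0.75
    recall = relevant_returned / relevant_total  # 6/10 = 0.60

    def f_beta(p, r, beta):
        """Recall weighted beta times as much as precision."""
        return (1 + beta**2) * p * r / (beta**2 * p + r)

    f1 = f_beta(precision, recall, 1)   # = 2PR/(P+R) = 0.666...
    f2 = f_beta(precision, recall, 2)   # leans toward recall: 0.625
    print(f1, f2)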

  23. Data Mining / Discovery Patterns
  • Seeing some popularity in using Kullback-Leibler (K-L) divergence as a measure
  - Also referred to as information gain
  - See Wikipedia to become quite confused on the details (a small sketch follows)
  Wolfram|Alpha at http://www.wolframalpha.com supports data mining by model fitting, inference, and reporting, given human-language questions. For example, you can ask "What is the capital of Texas?" and receive both short and long answers.
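
For reference, the discrete form is D_KL(P‖Q) = Σ P(x) · log(P(x)/Q(x)); a minimal sketch over two made-up distributions:

    # Discrete Kullback-Leibler divergence, D(P||Q) = sum p * log(p / q).
    import math

    def kl_divergence(p, q):
        """Assumes aligned probability distributions with q[i] > 0 wherever
        p[i] > 0. Asymmetric: D(P||Q) != D(Q||P) in general, which is why
        it is a divergence rather than a true distance."""
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.7, 0.2, 0.1]   # made-up distributions
    q = [0.5, 0.3, 0.2]
    print(kl_divergence(p, q), kl_divergence(q, p))   # note the asymmetry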

  24. Data Mining / Discovery Patterns
  • Consistency issues
  - Mining and discovery include inference
  - What about when a new fact or inference contradicts a prior one?
  • IBM Entity Analytics removes old assertions that are contradicted by new information
  • The alternative is to have some sort of confidence metric
  - But you still need to age (at least some) relationships
  Most important is to have a way to detect that you are contradicting yourself, and some way to address it.

  25. Actionable Knowledge / Decision Making
  • Decision making is quite domain-dependent
  - Deciding someone is a terrorist versus where to construct a highway
  • Typically involves inference and reasoning
  - Rule languages and reasoners (Pellet, SWRL, SPIN, Jena reasoners, etc.)
  - Bayesian networks
  - Dempster-Shafer belief handling
  • Provenance is required
  - RDF reification or a custom approach (albeit a bit messy; a sketch follows this list)
  - Tends to make databases messy, too
  • Must be able to remove assertions (i.e., step back) when an incorrect path is detected
  - Multi-hypothesis reasoning approaches can help
  - Maintain multiple paths of possibility simultaneously
  - Obviously an information-management headache
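
The slide names RDF reification; here is a minimal rdflib sketch of carrying provenance that way, with invented URIs and source, so an assertion can later be audited and retracted:

    # Provenance via RDF reification: assert a triple, then describe that
    # triple with its source, so it can be found and removed later.
    from rdflib import BNode, Graph, Literal, Namespace
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/")   # hypothetical namespace
    g = Graph()

    stmt = (EX.Saddam, EX.foundIn, EX.bunker)
    g.add(stmt)

    node = BNode()                          # the reified statement
    g.add((node, RDF.type, RDF.Statement))
    g.add((node, RDF.subject, stmt[0]))
    g.add((node, RDF.predicate, stmt[1]))
    g.add((node, RDF.object, stmt[2]))
    g.add((node, EX.source, Literal("New York Times")))  # provenance

    # Stepping back: if the path proves incorrect, remove the base assertion
    # (the reified description remains as a record of what was once believed).
    g.remove(stmt)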

  26. The Payoff – Information Superiority
  [Chart: information yield potential (# inferences / time) versus knowledge density (# related facts / domain), showing a "low yield to high yield" phase transition from information inferiority to information superiority]
  Interesting problem domains are typically large, so information superiority also requires a large set of related facts.

  27. Summary
  • Leveraging semantic technologies to arrive at actionable knowledge requires:
  - Converting information into computer-usable knowledge
  - Resolving ambiguity across many sources of knowledge
  - Accumulating sufficient knowledge density to yield high-yield intelligence

  28. Backup

  29. RMS In-Class Teams Exercise
  • Relationship Management System (RMS) – You need an information system that lets you maintain information about university partnerships with your company. This is initially for your own use, but it will eventually be a system shared with others that includes summary reporting about recent activity (including contact reports and contracts).
  • You will have to deal with at least these types of entities:
  - Organization
  - Person (in a position at an organization)
  - Contract
  - Contact report (by people at your organization, with a person)
  • Management has decided that semantic wiki technology should be tried out on this system. (You are allowed to try to convince them otherwise, but they may override your objections.)

  30. Federation In-Class Teams Exercise
  • Intelligent Skills Inventory System – When a Request For Proposal (RFP) arrives, we need to construct an inventory of the skills among our personnel that are specific to those needed for the RFP. The skills of your personnel are scattered across a number of systems, which you are required to work with. You will not be providing any new system for gathering skills; rather, you are to:
  1. Build a system to provide federated query and reporting of skills
  2. Provide a system to extract the appropriate skills from the RFP and produce a list of potential personnel with matching (or near-matching) skills

  31. Emerging Alternative Federation Approach
  • Object accumulation overlay
  - Develop an abstraction layer of persisted objects
  - Adapters convert sources into objects – generally driven by a high-level ontology
  - Objects have methods – to set/retrieve attributes or perform operations
  - Data (i.e., attributes) is left at the original source
  • Disambiguate within the object space
  - Adapters and custom operations perform the disambiguation (see the sketch below)
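
A minimal sketch of the overlay idea under stated assumptions: objects hold only identity, and attribute reads delegate to source adapters, so the data stays at the original source. All class and method names are invented for illustration:

    # Object-accumulation-overlay sketch: objects carry identity; adapters
    # fetch attributes from the original sources on demand.
    from typing import Protocol

    class SourceAdapter(Protocol):
        def get_attribute(self, entity_id: str, name: str) -> object: ...

    class DictAdapter:
        """Toy adapter over an in-memory 'source'; real adapters would wrap
        a database, SPARQL endpoint, or web service."""
        def __init__(self, records: dict):
            self.records = records
        def get_attribute(self, entity_id, name):
            return self.records[entity_id].get(name)

    class OverlayObject:
        def __init__(self, entity_id: str, adapters: list):
            self.entity_id = entity_id
            self.adapters = adapters      # one object may span several sources
        def get(self, name):
            for adapter in self.adapters: # first source with the attribute wins
                value = adapter.get_attribute(self.entity_id, name)
                if value is not None:
                    return value
            return None

    hr = DictAdapter({"p1": {"name": "John Smith"}})
    skills = DictAdapter({"p1": {"skill": "SPARQL"}})
    person = OverlayObject("p1", [hr, skills])
    print(person.get("name"), person.get("skill"))

The print call yields "John Smith SPARQL": the overlay accumulates one logical object from two sources without copying their data.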
