This presentation delves into the process of extracting Linked Data from unstructured sources, focusing on problem definitions, named entity recognition, algorithms, ensemble learning, relation extraction, and entity disambiguation. Various approaches, including dictionary-based, rule-based, and ensemble learning, are explored to uncover entities, relations, and URIs from text fragments. The goal is to automate the extraction of valuable information and map it onto existing ontologies.
From Unstructured Information to Linked Data • Axel Ngonga, Head of SIMBA@AKSW, University of Leipzig • IASLOD, August 15/16th 2012
Motivation • Where does the LOD Cloud come from? • Structured data • Triplify, D2R • Semi-structured data • DBpedia • Unstructured data • ??? • Unstructured data makes up 80% of the Web • How do we extract Linked Data from unstructured data sources?
Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • OpenIE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion NB: We will mainly be concerned with the newest developments.
Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • OpenIE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion
Problem Definition • Simple(?) problem: given a text fragment, retrieve • all entities and • relations between these entities automatically, plus • "ground them" in an ontology • Also coined Knowledge Extraction • John Petrucci was born in New York. → :John_Petrucci dbo:birthPlace :New_York . • [Diagram: nodes :John_Petrucci and :New_York connected by an edge labelled dbo:birthPlace]
Problems • 1. Finding entities → Named Entity Recognition • 2. Finding relation instances → Relation Extraction • 3. Finding URIs → URI Disambiguation
Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • OpenIE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion
Named Entity Recognition • Problem definition: given a set of classes, find all strings that are labels of instances of these classes within a text fragment • John Petrucci was born in New York. → [John Petrucci, PER] was born in [New York, LOC].
Named Entity Recognition • Problem definition: given a set of classes, find all strings that are labels of instances of these classes within a text fragment • Common sets of classes • CoNLL03: Person, Location, Organization, Miscellaneous • ACE05: Facility, Geo-Political Entity, Location, Organization, Person, Vehicle, Weapon • BioNLP2004: Protein, DNA, RNA, cell line, cell type • Several approaches • Direct solutions (single algorithms) • Ensemble Learning
NER: Overview of approaches • Dictionary-based • Hand-crafted Rules • Machine Learning • Hidden Markov Models (HMMs) • Conditional Random Fields (CRFs) • Neural Networks • k Nearest Neighbors (kNN) • Graph Clustering • Ensemble Learning • Veto-Based (Bagging, Boosting) • Neural Networks
NER: Dictionary-based • Simple idea • Define mappings between words and classes, e.g., Paris → Location • Try to match each token of each sentence • Return the matched entities • Time-efficient at runtime • Manual creation of gazetteers • Low precision (Paris = Person? Location?) • Low recall (esp. on Persons and Organizations as the number of instances grows)
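A minimal sketch of the dictionary-based approach in Python. The gazetteer below is a made-up toy; a real one would be derived from a knowledge base, and the longest-match strategy would run over a proper tokenizer.

```python
# Toy gazetteer: in practice this would be extracted from a knowledge base.
GAZETTEER = {
    "Paris": "LOC",            # ambiguous in reality: also a person name
    "New York": "LOC",
    "John Petrucci": "PER",
}

def dictionary_ner(sentence):
    """Return (surface form, class) pairs via greedy longest-match lookup."""
    tokens = sentence.rstrip(".").split()
    entities, i = [], 0
    while i < len(tokens):
        # Try the longest span first so "New York" beats a shorter match.
        for j in range(len(tokens), i, -1):
            span = " ".join(tokens[i:j])
            if span in GAZETTEER:
                entities.append((span, GAZETTEER[span]))
                i = j
                break
        else:
            i += 1
    return entities

print(dictionary_ner("John Petrucci was born in New York."))
# [('John Petrucci', 'PER'), ('New York', 'LOC')]
```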
NER: Rule-based • Simple idea • Define a set of rules to find entities, e.g., [PERSON] was born in [LOCATION]. • Try to match each sentence against one or several rules • Return the matched entities • High precision • Manual creation of rules is very tedious • Low recall (finite number of patterns)
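The slide's example rule, rendered as a single regular expression. This is a hedged sketch: real rule-based systems use many rules and richer features (POS tags, trigger words) rather than capitalization alone.

```python
import re

# "[PERSON] was born in [LOCATION]" as a crude regex over capitalized spans.
RULE = re.compile(
    r"(?P<PER>[A-Z]\w+(?: [A-Z]\w+)*) was born in (?P<LOC>[A-Z]\w+(?: [A-Z]\w+)*)"
)

def rule_ner(sentence):
    m = RULE.search(sentence)
    if m is None:
        return []
    return [(m.group("PER"), "PER"), (m.group("LOC"), "LOC")]

print(rule_ner("John Petrucci was born in New York."))
# [('John Petrucci', 'PER'), ('New York', 'LOC')]
```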
NER: Markov Models • Stochastic process with the Markov property: P(Xt+1 = Sj | Xt = Si, Xt-1, …, X0) = P(Xt+1 = Sj | Xt = Si) • Equivalent to a finite-state machine • Formally consists of • a set S of states S1, …, Sn • a matrix M such that mij = P(Xt+1 = Sj | Xt = Si)
NER: Hidden Markov Models • Extension of Markov models • States are hidden and assigned an output function • Only the output is seen • Transitions are learned from training data • How do they work? • Input: discrete sequence of features (e.g., POS tags, word stems, etc.) • Goal: find the best sequence of states that explains the input • Output: ideally the correct classification of each token • [Diagram: hidden state sequence S0 … Sn emitting the tags PER, _, …, LOC]
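Finding the best state sequence is done with the Viterbi algorithm. Below is a minimal sketch: all probabilities are hand-set assumptions for illustration (a real tagger estimates them from labelled data), and the only emission feature is token capitalization.

```python
import math

# Toy HMM: states are NER tags, emissions are over one feature
# (capitalization). All probabilities are made up for illustration.
STATES = ["PER", "LOC", "O"]
START = {"PER": 0.3, "LOC": 0.2, "O": 0.5}
TRANS = {"PER": {"PER": 0.4, "LOC": 0.1, "O": 0.5},
         "LOC": {"PER": 0.1, "LOC": 0.4, "O": 0.5},
         "O":   {"PER": 0.15, "LOC": 0.25, "O": 0.6}}
EMIT = {"PER": {"cap": 0.9, "low": 0.1},
        "LOC": {"cap": 0.9, "low": 0.1},
        "O":   {"cap": 0.2, "low": 0.8}}

def viterbi(tokens):
    feats = ["cap" if t[0].isupper() else "low" for t in tokens]
    # delta[s]: log-probability of the best path ending in state s
    delta = {s: math.log(START[s] * EMIT[s][feats[0]]) for s in STATES}
    back = []
    for f in feats[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            delta[s] = prev[best] + math.log(TRANS[best][s] * EMIT[s][f])
            ptr[s] = best
        back.append(ptr)
    # Backtrack from the best final state.
    state = max(STATES, key=delta.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

print(viterbi("John was born in New York".split()))
# ['PER', 'O', 'O', 'O', 'LOC', 'LOC']
```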
NER: k Nearest Neighbors • Idea • Describe each token q from a labelled training dataset with a set of features (e.g., its left and right neighbors) • Each new token t is described with the same features • Assign t the majority class of its k nearest neighbors
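A minimal kNN token classifier using the left/right-neighbor features from the slide. The training examples and the overlap-based similarity are illustrative.

```python
from collections import Counter

# Illustrative training data: features are (left neighbour, right neighbour).
TRAIN = [
    (("<s>", "was"), "PER"),    # "John" in "John was born ..."
    (("in", "</s>"), "LOC"),    # "Paris" in "... born in Paris"
    (("was", "in"), "O"),       # "born"
    (("born", "Paris"), "O"),   # "in"
]

def similarity(f1, f2):
    """Number of feature positions on which two tokens agree."""
    return sum(a == b for a, b in zip(f1, f2))

def knn_classify(features, k=1):
    nearest = sorted(TRAIN, key=lambda ex: similarity(features, ex[0]),
                     reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Classify "Berlin" in "She lives in Berlin": left = "in", right = "</s>".
print(knn_classify(("in", "</s>")))  # LOC
```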
NER: So far … • "Simple approaches" • Apply one algorithm to the NER problem • Bound to be limited by the assumptions of the model • Implemented by a large number of tools • Alchemy • Stanford NER • Illinois Tagger • Ontos NER Tagger • LingPipe • …
NER: Ensemble Learning • Intuition: each algorithm has its strengths and weaknesses • Idea: use ensemble learning to merge the results of different algorithms so as to create a meta-classifier of higher accuracy • [Diagram: pattern-based approaches, dictionary-based approaches, Support Vector Machines, and Conditional Random Fields feeding one meta-classifier]
NER: Ensemble Learning • Idea: merge the results of several approaches to improve results • Simplest approaches: • Voting • Weighted voting • [Diagram: Input → System 1, System 2, …, System n → Merger → Output]
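A sketch of the weighted-voting merger. The per-system weights are an assumption (e.g., each system's F-score on held-out data); with equal weights this reduces to plain voting.

```python
from collections import defaultdict

def weighted_vote(system_tags, weights):
    """Merge per-token tag sequences from n systems by weighted voting."""
    merged = []
    for token_tags in zip(*system_tags):
        scores = defaultdict(float)
        for tag, weight in zip(token_tags, weights):
            scores[tag] += weight
        merged.append(max(scores, key=scores.get))
    return merged

# Hypothetical outputs of three NER systems on the same six tokens.
tags_1 = ["PER", "O", "O", "O", "LOC", "LOC"]
tags_2 = ["PER", "O", "O", "O", "PER", "LOC"]
tags_3 = ["O",   "O", "O", "O", "LOC", "LOC"]
print(weighted_vote([tags_1, tags_2, tags_3], weights=[0.9, 0.8, 0.6]))
# ['PER', 'O', 'O', 'O', 'LOC', 'LOC']
```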
NER: Ensemble Learning • When does it work? • Accuracy • Existing solutions need to be "good" • Merging random results leads to random results • Given: current approaches reach 80% F-score • Diversity • Need for the smallest possible amount of correlation between approaches • E.g., merging two HMM-based taggers won't help • Given: large number of approaches for NER
NER: FOX • Federated Knowledge Extraction Framework • Idea: apply ensemble learning to NER • Classical approach: voting • Does not make use of systematic errors • Partly difficult to train • Use neural networks instead • Can make use of systematic errors • Easy to train • Converge fast • http://fox.aksw.org
NER: FOX on Companies and Countries • No runtime issues (parallel implementation) • NN overhead is small • Overfitting
NER: Summary • Large number of approaches • Dictionaries • Hand-crafted rules • Machine Learning • Hybrid • … • Combining approaches leads to better results than single algorithms
Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • OpenIE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion
RE: Problem Definition • Find the relations between NEs if such relations exist • NEs are not always given a priori (open vs. closed RE) • John Petrucci was born in New York. → [John Petrucci, PER] was born in [New York, LOC]. → bornIn([John Petrucci, PER], [New York, LOC])
RE: Approaches • Hand-crafted rules • Pattern Learning • Coupled Learning
RE: Pattern-based • Hearst patterns [Hearst: COLING'92] • POS-enhanced regular expression matching in natural-language text • NP0 {,} such as {NP1, NP2, … (and|or)} {,} NPn • NP0 {,} {NP1, NP2, …, NPn-1} {,} or other NPn • "The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string." → isA("Bambara ndang", "bow lute") • Time-efficient at runtime • Very low recall • Not adaptable to other relations
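The first pattern as a plain-text regex sketch. A real implementation matches over POS-tagged noun phrases rather than raw strings; the regex and helper below are illustrative stand-ins.

```python
import re

# "NP0, such as NP1, NP2, ... and NPn" over raw text; a crude stand-in for
# matching over POS-tagged noun phrases.
SUCH_AS = re.compile(
    r"(?P<hyper>\w[\w ]*?), such as (?P<hypos>[\w ,]+?(?: (?:and|or) [\w ]+)?)[.,]"
)

def hearst_isa(text):
    facts = []
    for m in SUCH_AS.finditer(text):
        hyper = m.group("hyper").strip()
        for hypo in re.split(r", | and | or ", m.group("hypos")):
            if hypo.strip():
                facts.append(("isA", hypo.strip(), hyper))
    return facts

sent = "The bow lute, such as the Bambara ndang, is plucked."
print(hearst_isa(sent))
# [('isA', 'the Bambara ndang', 'The bow lute')]
```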
RE: DIPRE • DIPRE = Dual Iterative Pattern Relation Extraction • Semi-supervised, iterative gathering of facts and patterns • Positive & negative examples as seeds for a given target relation • e.g., +(Hillary, Bill); +(Carla, Nicolas); –(Larry, Google) • Various tuning parameters for pruning low-confidence patterns and facts • Extended to Snowball / QXtract • [Diagram: seed pairs (Hillary, Bill), (Carla, Nicolas) yield patterns such as "X and her husband Y", "X and Y on their honeymoon", "X and Y and their children", "X has been dating with Y", "X loves Y"; applying them harvests new pairs such as (Angelina, Brad), (Victoria, David), while counter-examples such as (Larry, Google) prune bad patterns]
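A compact sketch of the DIPRE loop (illustrative, not Brin's original code): induce string patterns from sentences containing seed pairs, prune them by support, then match the surviving patterns to harvest new pairs. Sentences are assumed to be short and free of regex metacharacters.

```python
import re

def dipre(corpus, seeds, iterations=2, min_support=2):
    pairs, patterns = set(seeds), set()
    for _ in range(iterations):
        # 1. Induce patterns from sentences mentioning a known pair.
        support = {}
        for sent in corpus:
            for x, y in pairs:
                if x in sent and y in sent:
                    pat = sent.replace(x, r"(?P<x>\w+)").replace(y, r"(?P<y>\w+)")
                    support[pat] = support.get(pat, 0) + 1
        # 2. Prune low-confidence patterns (a DIPRE tuning parameter).
        patterns |= {p for p, n in support.items() if n >= min_support}
        # 3. Match patterns against the corpus to harvest new pairs.
        for pat in patterns:
            for sent in corpus:
                m = re.fullmatch(pat, sent)
                if m:
                    pairs.add((m.group("x"), m.group("y")))
    return patterns, pairs

corpus = ["Hillary and her husband Bill", "Carla and her husband Nicolas",
          "Angelina and her husband Brad"]
_, pairs = dipre(corpus, {("Hillary", "Bill"), ("Carla", "Nicolas")})
print(pairs)  # now also contains ('Angelina', 'Brad')
```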
RE: NELL • Never-Ending Language Learner (http://rtw.ml.cmu.edu/) • Open IE with an ontological backbone • Closed set of categories & typed relations, e.g., athletePlaysForTeam(Athlete, SportsTeam) • Seeds/counter-seeds (5-10) • Open set of predicate arguments (instances), e.g., athletePlaysForTeam(Alex Rodriguez, Yankees), athletePlaysForTeam(Alexander_Ovechkin, Penguins) • Coupled iterative learners • Constantly running over a large Web corpus since January 2010 (200 million pages) • Periodic human supervision
RE: NELL • Conservative strategy • Avoid semantic drift
RE: BOA • Bootstrapping Linked Data (http://boa.aksw.org) • Core idea: use instance data in the Data Web to discover NL patterns and new instances
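A sketch of BOA's pattern-discovery step under simplifying assumptions: given labels of known dbo:birthPlace pairs (in practice retrieved via SPARQL from the Data Web), the string between the two labels becomes a candidate pattern with ?D?/?R? placeholders, counted for later thresholding. All data below is made up.

```python
def boa_patterns(corpus, known_pairs):
    """Count candidate NL patterns for a relation from known instance pairs."""
    patterns = {}
    for sent in corpus:
        for subj_label, obj_label in known_pairs:
            i, j = sent.find(subj_label), sent.find(obj_label)
            if i != -1 and j != -1 and i < j:
                between = sent[i + len(subj_label):j].strip()
                pat = "?D? " + between + " ?R?"  # domain/range placeholders
                patterns[pat] = patterns.get(pat, 0) + 1
    return patterns

# Known dbo:birthPlace pairs (would come from a SPARQL query in practice).
pairs = [("John Petrucci", "New York"), ("Angela Merkel", "Hamburg")]
corpus = ["John Petrucci was born in New York.",
          "Angela Merkel was born in Hamburg."]
print(boa_patterns(corpus, pairs))
# {'?D? was born in ?R?': 2}
```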
RE: BOA • Follows a conservative strategy • Only top pattern • Frequency threshold • Score threshold • [Figure: evaluation results]
RE: Summary • Several approaches • Hand-crafted rules • Machine Learning • Hybrid • Large number of instances available for many relations • Runtime problem → parallel implementations • Many new facts can be found • Semantic drift • Long tail → Entity Disambiguation
Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • OpenIE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion
ED: Problem Definition • Given (a) reference knowledge base(s), a text fragment, a list of NEs (incl. positions), and a list of relations, find URIs for each of the NEs and relations • Very difficult problem • Ambiguity, e.g., Paris = Paris Hilton? Paris (France)? • Difficult even for humans, e.g., • "Paris' mayor died yesterday" • Several solutions • Indexing • Surface forms • Graph-based
ED: Problem Definition • John Petrucci was born in New York. → :John_Petrucci dbo:birthPlace :New_York . • [John Petrucci, PER] was born in [New York, LOC]. • bornIn([John Petrucci, PER], [New York, LOC])
ED: Indexing • More retrieval than disambiguation • Similar to dictionary-based approaches • Idea • Index all labels in the reference knowledge base • Given an input label, retrieve all entities with a similar label • Poor recall (unknown surface forms, e.g., "Mme Curie" for "Marie Curie") • Low precision (Paris = Paris Hilton, Paris (France), …)
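A minimal lookup sketch over a toy label index; a real system would back this with an inverted index (e.g., Lucene) over all labels in the knowledge base, plus fuzzy matching. Both failure modes from the slide show up directly:

```python
# Toy label index; the URIs and labels are illustrative.
LABEL_INDEX = {
    "Paris": ["dbr:Paris", "dbr:Paris_Hilton", "dbr:Paris_(mythology)"],
    "Marie Curie": ["dbr:Marie_Curie"],
}

def lookup(surface_form):
    """Return candidate URIs whose label matches the surface form exactly."""
    return LABEL_INDEX.get(surface_form, [])

print(lookup("Paris"))      # many candidates -> low precision
print(lookup("Mme Curie"))  # [] -> unknown surface form, poor recall
```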
ED: Type Disambiguation • Extension of indexing • Index all labels • Infer type information • Retrieve labels from entities of the given type • Same recall as the previous approach • Higher precision • Paris[LOC] != Paris[PER] • Still: Paris (France) vs. Paris (Ontario) • Need for context
ED: Spotlight • Known surface forms (http://dbpedia.org/spotlight) • Based on DBpedia + Wikipedia • Uses supplementary knowledge including disambiguation pages, redirects, and wikilinks • Three main steps • Spotting: finding possible mentions of DBpedia resources, e.g., John Petrucci was born in New York. • Candidate selection: finding possible URIs, e.g., John Petrucci → :John_Petrucci; New York → :New_York, :New_York_County, … • Disambiguation: mapping the context to a vector for each resource, e.g., New York → :New_York
ED: YAGO2 • Joint disambiguation • ♬ "Mississippi", one of Bob's later songs, was first recorded by Sheryl on her album.
ED: YAGO2 • [Diagram: mention-entity graph; mentions of entities (Mississippi, Bob, Sheryl) are linked to entity candidates (Mississippi (Song), Mississippi (State), Bob Dylan, Sheryl Cruz, Sheryl Lee, Sheryl Crow) by edges weighted with prior(ml, ei) and sim(cxt(ml), cxt(ei)); candidate-candidate edges are weighted with coh(ei, ej)] • Objective: maximize an objective function (e.g., the total weight) • Constraint: keep at least one entity per mention
ED: FOX • Generic approach • A-priori score (a): popularity of URIs • Similarity score (s): similarity of resource labels and text • Coherence score (z): correlation between URIs • [Diagram: mentions linked to candidate URIs by a|s edges; URIs interlinked by z edges]
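A brute-force sketch of this generic scheme: choose the joint candidate assignment maximizing the sum of a-priori, similarity, and coherence scores. All scoring functions and URIs below are illustrative stand-ins; real systems replace the exhaustive search with graph algorithms (see the next slide).

```python
from itertools import product, combinations

def disambiguate(mentions, candidates, a, s, z):
    """candidates[m]: candidate URIs for mention m; a, s, z: score functions."""
    best, best_score = None, float("-inf")
    # Exhaustive search over joint assignments; fine for a few mentions only.
    for assignment in product(*(candidates[m] for m in mentions)):
        score = sum(a(u) + s(m, u) for m, u in zip(mentions, assignment))
        score += sum(z(u, v) for u, v in combinations(assignment, 2))
        if score > best_score:
            best, best_score = assignment, score
    return dict(zip(mentions, best))

mentions = ["Paris", "France"]
candidates = {"Paris": ["dbr:Paris", "dbr:Paris_Hilton"],
              "France": ["dbr:France"]}
a = lambda u: {"dbr:Paris": 0.9, "dbr:Paris_Hilton": 0.6, "dbr:France": 0.9}[u]
s = lambda m, u: 1.0  # all labels match equally well in this toy example
z = lambda u, v: 1.0 if {u, v} == {"dbr:Paris", "dbr:France"} else 0.0
print(disambiguate(mentions, candidates, a, s, z))
# {'Paris': 'dbr:Paris', 'France': 'dbr:France'}
```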
ED: FOX • Allows the use of several algorithms • HITS • PageRank • A-priori • Propagation algorithms • …
ED: Summary • Difficult problem even for humans • Several approaches • Simple search • Search with restrictions • Known surface forms • Graph-based • Improved F-score for DBpedia (70-80%) • Low F-score for generic knowledge bases • Intrinsically difficult • Still a lot to do
Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • OpenIE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion