This presentation delves into the process of extracting Linked Data from unstructured sources, focusing on problem definitions, named entity recognition, algorithms, ensemble learning, relation extraction, and entity disambiguation. Various approaches, including dictionary-based, rule-based, and ensemble learning, are explored to uncover entities, relations, and URIs from text fragments. The goal is to automate the extraction of valuable information and map it onto existing ontologies.
From Unstructured Information to Linked Data • Axel Ngonga, Head of SIMBA@AKSW, University of Leipzig • IASLOD, August 15/16th 2012
Motivation • Where does the LOD Cloud come from? • Structured data • Triplify, D2R • Semi-structured data • DBpedia • Unstructured data • ??? • Unstructured data makes up 80% of the Web • How do we extract Linked Data from unstructured data sources?
Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • OpenIE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion NB: We will mainly be concerned with the newest developments.
Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • OpenIE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion
Problem Definition • Simple(?) problem: given a text fragment, retrieve • all entities and • relations between these entities automatically, plus • "ground them" in an ontology • Also coined Knowledge Extraction • John Petrucci was born in New York. → :John_Petrucci dbo:birthPlace :New_York . • [Diagram: nodes :John_Petrucci and :New_York connected by an edge labelled dbo:birthPlace]
Problems • 1. Finding entities → Named Entity Recognition • 2. Finding relation instances → Relation Extraction • 3. Finding URIs → URI Disambiguation
Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • OpenIE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion
Named Entity Recognition • Problem definition: given a set of classes, find all strings that are labels of instances of these classes within a text fragment • John Petrucci was born in New York. → [John Petrucci, PER] was born in [New York, LOC].
Named Entity Recognition • Problem definition: given a set of classes, find all strings that are labels of instances of these classes within a text fragment • Common sets of classes • CoNLL03: Person, Location, Organization, Miscellaneous • ACE05: Facility, Geo-Political Entity, Location, Organization, Person, Vehicle, Weapon • BioNLP2004: Protein, DNA, RNA, cell line, cell type • Several approaches • Direct solutions (single algorithms) • Ensemble Learning
NER: Overview of approaches • Dictionary-based • Hand-crafted Rules • Machine Learning • Hidden Markov Models (HMMs) • Conditional Random Fields (CRFs) • Neural Networks • k Nearest Neighbors (kNN) • Graph Clustering • Ensemble Learning • Veto-Based (Bagging, Boosting) • Neural Networks
NER: Dictionary-based • Simple idea • Define mappings between words and classes, e.g., Paris → Location • Try to match each token of each sentence • Return the matched entities • Time-efficient at runtime • Manual creation of gazetteers • Low precision (Paris = Person? Location?) • Low recall (esp. on Persons and Organizations as the number of instances grows)
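A minimal sketch of the dictionary-based approach in Python. The gazetteer below is a made-up toy; a real one would be derived from a knowledge base, and the longest-match strategy would run over a proper tokenizer.

```python
# Toy gazetteer: in practice this would be extracted from a knowledge base.
GAZETTEER = {
    "Paris": "LOC",            # ambiguous in reality: also a person name
    "New York": "LOC",
    "John Petrucci": "PER",
}

def dictionary_ner(sentence):
    """Return (surface form, class) pairs via greedy longest-match lookup."""
    tokens = sentence.rstrip(".").split()
    entities, i = [], 0
    while i < len(tokens):
        # Try the longest span first so "New York" beats a shorter match.
        for j in range(len(tokens), i, -1):
            span = " ".join(tokens[i:j])
            if span in GAZETTEER:
                entities.append((span, GAZETTEER[span]))
                i = j
                break
        else:
            i += 1
    return entities

print(dictionary_ner("John Petrucci was born in New York."))
# [('John Petrucci', 'PER'), ('New York', 'LOC')]
```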
NER: Rule-based • Simple idea • Define a set of rules to find entities, e.g., [PERSON] was born in [LOCATION]. • Try to match each sentence against one or several rules • Return the matched entities • High precision • Manual creation of rules is very tedious • Low recall (finite number of patterns)
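The slide's example rule, rendered as a single regular expression. This is a hedged sketch: real rule-based systems use many rules and richer features (POS tags, trigger words) rather than capitalization alone.

```python
import re

# "[PERSON] was born in [LOCATION]" as a crude regex over capitalized spans.
RULE = re.compile(
    r"(?P<PER>[A-Z]\w+(?: [A-Z]\w+)*) was born in (?P<LOC>[A-Z]\w+(?: [A-Z]\w+)*)"
)

def rule_ner(sentence):
    m = RULE.search(sentence)
    if m is None:
        return []
    return [(m.group("PER"), "PER"), (m.group("LOC"), "LOC")]

print(rule_ner("John Petrucci was born in New York."))
# [('John Petrucci', 'PER'), ('New York', 'LOC')]
```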
NER: Markov Models • Stochastic process with the Markov property: P(Xt+1 = Sj | Xt = Si, Xt-1, …, X0) = P(Xt+1 = Sj | Xt = Si) • Equivalent to a finite-state machine • Formally consists of • a set S of states S1, …, Sn • a matrix M such that mij = P(Xt+1 = Sj | Xt = Si)
NER: Hidden Markov Models • Extension of Markov models • States are hidden and assigned an output function • Only the output is seen • Transitions are learned from training data • How do they work? • Input: discrete sequence of features (e.g., POS tags, word stems, etc.) • Goal: find the best sequence of states that explains the input • Output: ideally the correct classification of each token • [Diagram: hidden state sequence S0 … Sn emitting the tags PER, _, …, LOC]
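Finding the best state sequence is done with the Viterbi algorithm. Below is a minimal sketch: all probabilities are hand-set assumptions for illustration (a real tagger estimates them from labelled data), and the only emission feature is token capitalization.

```python
import math

# Toy HMM: states are NER tags, emissions are over one feature
# (capitalization). All probabilities are made up for illustration.
STATES = ["PER", "LOC", "O"]
START = {"PER": 0.3, "LOC": 0.2, "O": 0.5}
TRANS = {"PER": {"PER": 0.4, "LOC": 0.1, "O": 0.5},
         "LOC": {"PER": 0.1, "LOC": 0.4, "O": 0.5},
         "O":   {"PER": 0.15, "LOC": 0.25, "O": 0.6}}
EMIT = {"PER": {"cap": 0.9, "low": 0.1},
        "LOC": {"cap": 0.9, "low": 0.1},
        "O":   {"cap": 0.2, "low": 0.8}}

def viterbi(tokens):
    feats = ["cap" if t[0].isupper() else "low" for t in tokens]
    # delta[s]: log-probability of the best path ending in state s
    delta = {s: math.log(START[s] * EMIT[s][feats[0]]) for s in STATES}
    back = []
    for f in feats[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            delta[s] = prev[best] + math.log(TRANS[best][s] * EMIT[s][f])
            ptr[s] = best
        back.append(ptr)
    # Backtrack from the best final state.
    state = max(STATES, key=delta.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

print(viterbi("John was born in New York".split()))
# ['PER', 'O', 'O', 'O', 'LOC', 'LOC']
```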
NER: k Nearest Neighbors • Idea • Describe each token q from a labelled training dataset with a set of features (e.g., its left and right neighbors) • Each new token t is described with the same features • Assign t the majority class of its k nearest neighbors
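A minimal kNN token classifier using the left/right-neighbor features from the slide. The training examples and the overlap-based similarity are illustrative.

```python
from collections import Counter

# Illustrative training data: features are (left neighbour, right neighbour).
TRAIN = [
    (("<s>", "was"), "PER"),    # "John" in "John was born ..."
    (("in", "</s>"), "LOC"),    # "Paris" in "... born in Paris"
    (("was", "in"), "O"),       # "born"
    (("born", "Paris"), "O"),   # "in"
]

def similarity(f1, f2):
    """Number of feature positions on which two tokens agree."""
    return sum(a == b for a, b in zip(f1, f2))

def knn_classify(features, k=1):
    nearest = sorted(TRAIN, key=lambda ex: similarity(features, ex[0]),
                     reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Classify "Berlin" in "She lives in Berlin": left = "in", right = "</s>".
print(knn_classify(("in", "</s>")))  # LOC
```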
NER: So far … • "Simple approaches" • Apply one algorithm to the NER problem • Bound to be limited by the assumptions of the model • Implemented by a large number of tools • Alchemy • Stanford NER • Illinois Tagger • Ontos NER Tagger • LingPipe • …
NER: Ensemble Learning • Intuition: each algorithm has its strengths and weaknesses • Idea: use ensemble learning to merge the results of different algorithms so as to create a meta-classifier of higher accuracy • [Diagram: pattern-based approaches, dictionary-based approaches, Support Vector Machines, and Conditional Random Fields feeding one meta-classifier]
NER: Ensemble Learning • Idea: merge the results of several approaches to improve results • Simplest approaches: • Voting • Weighted voting • [Diagram: Input → System 1, System 2, …, System n → Merger → Output]
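A sketch of the weighted-voting merger. The per-system weights are an assumption (e.g., each system's F-score on held-out data); with equal weights this reduces to plain voting.

```python
from collections import defaultdict

def weighted_vote(system_tags, weights):
    """Merge per-token tag sequences from n systems by weighted voting."""
    merged = []
    for token_tags in zip(*system_tags):
        scores = defaultdict(float)
        for tag, weight in zip(token_tags, weights):
            scores[tag] += weight
        merged.append(max(scores, key=scores.get))
    return merged

# Hypothetical outputs of three NER systems on the same six tokens.
tags_1 = ["PER", "O", "O", "O", "LOC", "LOC"]
tags_2 = ["PER", "O", "O", "O", "PER", "LOC"]
tags_3 = ["O",   "O", "O", "O", "LOC", "LOC"]
print(weighted_vote([tags_1, tags_2, tags_3], weights=[0.9, 0.8, 0.6]))
# ['PER', 'O', 'O', 'O', 'LOC', 'LOC']
```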
NER: Ensemble Learning • When does it work? • Accuracy • Existing solutions need to be "good" • Merging random results leads to random results • Given: current approaches reach 80% F-score • Diversity • Need for the smallest possible amount of correlation between approaches • E.g., merging two HMM-based taggers won't help • Given: large number of approaches for NER
NER: FOX • Federated Knowledge Extraction Framework • Idea: apply ensemble learning to NER • Classical approach: voting • Does not make use of systematic errors • Partly difficult to train • Use neural networks instead • Can make use of systematic errors • Easy to train • Converge fast • http://fox.aksw.org
NER: FOX on Companies and Countries • No runtime issues (parallel implementation) • NN overhead is small • Overfitting
NER: Summary • Large number of approaches • Dictionaries • Hand-crafted rules • Machine Learning • Hybrid • … • Combining approaches leads to better results than single algorithms
Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • OpenIE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion
RE: Problem Definition • Find the relations between NEs if such relations exist • NEs are not always given a priori (open vs. closed RE) • John Petrucci was born in New York. → [John Petrucci, PER] was born in [New York, LOC]. → bornIn([John Petrucci, PER], [New York, LOC])
RE: Approaches • Hand-crafted rules • Pattern Learning • Coupled Learning
RE: Pattern-based • Hearst patterns [Hearst: COLING'92] • POS-enhanced regular expression matching in natural-language text • NP0 {,} such as {NP1, NP2, … (and|or)} {,} NPn • NP0 {,} {NP1, NP2, …, NPn-1} {,} or other NPn • "The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string." → isA("Bambara ndang", "bow lute") • Time-efficient at runtime • Very low recall • Not adaptable to other relations
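The first pattern as a plain-text regex sketch. A real implementation matches over POS-tagged noun phrases rather than raw strings; the regex and helper below are illustrative stand-ins.

```python
import re

# "NP0, such as NP1, NP2, ... and NPn" over raw text; a crude stand-in for
# matching over POS-tagged noun phrases.
SUCH_AS = re.compile(
    r"(?P<hyper>\w[\w ]*?), such as (?P<hypos>[\w ,]+?(?: (?:and|or) [\w ]+)?)[.,]"
)

def hearst_isa(text):
    facts = []
    for m in SUCH_AS.finditer(text):
        hyper = m.group("hyper").strip()
        for hypo in re.split(r", | and | or ", m.group("hypos")):
            if hypo.strip():
                facts.append(("isA", hypo.strip(), hyper))
    return facts

sent = "The bow lute, such as the Bambara ndang, is plucked."
print(hearst_isa(sent))
# [('isA', 'the Bambara ndang', 'The bow lute')]
```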
RE: DIPRE • DIPRE = Dual Iterative Pattern Relation Extraction • Semi-supervised, iterative gathering of facts and patterns • Positive & negative examples as seeds for a given target relation • e.g., +(Hillary, Bill); +(Carla, Nicolas); –(Larry, Google) • Various tuning parameters for pruning low-confidence patterns and facts • Extended to Snowball / QXtract • [Diagram: seed pairs (Hillary, Bill), (Carla, Nicolas) yield patterns such as "X and her husband Y", "X and Y on their honeymoon", "X and Y and their children", "X has been dating with Y", "X loves Y"; applying them harvests new pairs such as (Angelina, Brad), (Victoria, David), while counter-examples such as (Larry, Google) prune bad patterns]
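A compact sketch of the DIPRE loop (illustrative, not Brin's original code): induce string patterns from sentences containing seed pairs, prune them by support, then match the surviving patterns to harvest new pairs. Sentences are assumed to be short and free of regex metacharacters.

```python
import re

def dipre(corpus, seeds, iterations=2, min_support=2):
    pairs, patterns = set(seeds), set()
    for _ in range(iterations):
        # 1. Induce patterns from sentences mentioning a known pair.
        support = {}
        for sent in corpus:
            for x, y in pairs:
                if x in sent and y in sent:
                    pat = sent.replace(x, r"(?P<x>\w+)").replace(y, r"(?P<y>\w+)")
                    support[pat] = support.get(pat, 0) + 1
        # 2. Prune low-confidence patterns (a DIPRE tuning parameter).
        patterns |= {p for p, n in support.items() if n >= min_support}
        # 3. Match patterns against the corpus to harvest new pairs.
        for pat in patterns:
            for sent in corpus:
                m = re.fullmatch(pat, sent)
                if m:
                    pairs.add((m.group("x"), m.group("y")))
    return patterns, pairs

corpus = ["Hillary and her husband Bill", "Carla and her husband Nicolas",
          "Angelina and her husband Brad"]
_, pairs = dipre(corpus, {("Hillary", "Bill"), ("Carla", "Nicolas")})
print(pairs)  # now also contains ('Angelina', 'Brad')
```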
RE: NELL • Never-Ending Language Learner (http://rtw.ml.cmu.edu/) • Open IE with an ontological backbone • Closed set of categories & typed relations, e.g., athletePlaysForTeam(Athlete, SportsTeam) • Seeds/counter-seeds (5-10) • Open set of predicate arguments (instances), e.g., athletePlaysForTeam(Alex Rodriguez, Yankees), athletePlaysForTeam(Alexander_Ovechkin, Penguins) • Coupled iterative learners • Constantly running over a large Web corpus since January 2010 (200 million pages) • Periodic human supervision
RE: NELL • Conservative strategy • Avoid semantic drift
RE: BOA • Bootstrapping Linked Data (http://boa.aksw.org) • Core idea: use instance data in the Data Web to discover NL patterns and new instances
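A sketch of BOA's pattern-discovery step under simplifying assumptions: given labels of known dbo:birthPlace pairs (in practice retrieved via SPARQL from the Data Web), the string between the two labels becomes a candidate pattern with ?D?/?R? placeholders, counted for later thresholding. All data below is made up.

```python
def boa_patterns(corpus, known_pairs):
    """Count candidate NL patterns for a relation from known instance pairs."""
    patterns = {}
    for sent in corpus:
        for subj_label, obj_label in known_pairs:
            i, j = sent.find(subj_label), sent.find(obj_label)
            if i != -1 and j != -1 and i < j:
                between = sent[i + len(subj_label):j].strip()
                pat = "?D? " + between + " ?R?"  # domain/range placeholders
                patterns[pat] = patterns.get(pat, 0) + 1
    return patterns

# Known dbo:birthPlace pairs (would come from a SPARQL query in practice).
pairs = [("John Petrucci", "New York"), ("Angela Merkel", "Hamburg")]
corpus = ["John Petrucci was born in New York.",
          "Angela Merkel was born in Hamburg."]
print(boa_patterns(corpus, pairs))
# {'?D? was born in ?R?': 2}
```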
RE: BOA • Follows a conservative strategy • Only top pattern • Frequency threshold • Score threshold • [Figure: evaluation results]
RE: Summary • Several approaches • Hand-crafted rules • Machine Learning • Hybrid • Large number of instances available for many relations • Runtime problem → parallel implementations • Many new facts can be found • Semantic drift • Long tail → Entity Disambiguation
Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • OpenIE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion
ED: Problem Definition • Given (a) reference knowledge base(s), a text fragment, a list of NEs (incl. positions), and a list of relations, find URIs for each of the NEs and relations • Very difficult problem • Ambiguity, e.g., Paris = Paris Hilton? Paris (France)? • Difficult even for humans, e.g., • "Paris' mayor died yesterday" • Several solutions • Indexing • Surface forms • Graph-based
ED: Problem Definition • John Petrucci was born in New York. → :John_Petrucci dbo:birthPlace :New_York . • [John Petrucci, PER] was born in [New York, LOC]. • bornIn([John Petrucci, PER], [New York, LOC])
ED: Indexing • More retrieval than disambiguation • Similar to dictionary-based approaches • Idea • Index all labels in the reference knowledge base • Given an input label, retrieve all entities with a similar label • Poor recall (unknown surface forms, e.g., "Mme Curie" for "Marie Curie") • Low precision (Paris = Paris Hilton, Paris (France), …)
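A minimal lookup sketch over a toy label index; a real system would back this with an inverted index (e.g., Lucene) over all labels in the knowledge base, plus fuzzy matching. Both failure modes from the slide show up directly:

```python
# Toy label index; the URIs and labels are illustrative.
LABEL_INDEX = {
    "Paris": ["dbr:Paris", "dbr:Paris_Hilton", "dbr:Paris_(mythology)"],
    "Marie Curie": ["dbr:Marie_Curie"],
}

def lookup(surface_form):
    """Return candidate URIs whose label matches the surface form exactly."""
    return LABEL_INDEX.get(surface_form, [])

print(lookup("Paris"))      # many candidates -> low precision
print(lookup("Mme Curie"))  # [] -> unknown surface form, poor recall
```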
ED: Type Disambiguation • Extension of indexing • Index all labels • Infer type information • Retrieve labels from entities of the given type • Same recall as the previous approach • Higher precision • Paris[LOC] != Paris[PER] • Still: Paris (France) vs. Paris (Ontario) • Need for context
ED: Spotlight • Known surface forms (http://dbpedia.org/spotlight) • Based on DBpedia + Wikipedia • Uses supplementary knowledge including disambiguation pages, redirects, and wikilinks • Three main steps • Spotting: finding possible mentions of DBpedia resources, e.g., John Petrucci was born in New York. • Candidate selection: finding possible URIs, e.g., John Petrucci → :John_Petrucci; New York → :New_York, :New_York_County, … • Disambiguation: mapping the context to a vector for each resource, e.g., New York → :New_York
ED: YAGO2 • Joint disambiguation • ♬ "Mississippi", one of Bob's later songs, was first recorded by Sheryl on her album.
ED: YAGO2 • [Diagram: mention-entity graph; mentions of entities (Mississippi, Bob, Sheryl) are linked to entity candidates (Mississippi (Song), Mississippi (State), Bob Dylan, Sheryl Cruz, Sheryl Lee, Sheryl Crow) by edges weighted with prior(ml, ei) and sim(cxt(ml), cxt(ei)); candidate-candidate edges are weighted with coh(ei, ej)] • Objective: maximize an objective function (e.g., the total weight) • Constraint: keep at least one entity per mention
ED: FOX • Generic approach • A-priori score (a): popularity of URIs • Similarity score (s): similarity of resource labels and text • Coherence score (z): correlation between URIs • [Diagram: mentions linked to candidate URIs by a|s edges; URIs interlinked by z edges]
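A brute-force sketch of this generic scheme: choose the joint candidate assignment maximizing the sum of a-priori, similarity, and coherence scores. All scoring functions and URIs below are illustrative stand-ins; real systems replace the exhaustive search with graph algorithms (see the next slide).

```python
from itertools import product, combinations

def disambiguate(mentions, candidates, a, s, z):
    """candidates[m]: candidate URIs for mention m; a, s, z: score functions."""
    best, best_score = None, float("-inf")
    # Exhaustive search over joint assignments; fine for a few mentions only.
    for assignment in product(*(candidates[m] for m in mentions)):
        score = sum(a(u) + s(m, u) for m, u in zip(mentions, assignment))
        score += sum(z(u, v) for u, v in combinations(assignment, 2))
        if score > best_score:
            best, best_score = assignment, score
    return dict(zip(mentions, best))

mentions = ["Paris", "France"]
candidates = {"Paris": ["dbr:Paris", "dbr:Paris_Hilton"],
              "France": ["dbr:France"]}
a = lambda u: {"dbr:Paris": 0.9, "dbr:Paris_Hilton": 0.6, "dbr:France": 0.9}[u]
s = lambda m, u: 1.0  # all labels match equally well in this toy example
z = lambda u, v: 1.0 if {u, v} == {"dbr:Paris", "dbr:France"} else 0.0
print(disambiguate(mentions, candidates, a, s, z))
# {'Paris': 'dbr:Paris', 'France': 'dbr:France'}
```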
ED: FOX • Allows the use of several algorithms • HITS • PageRank • A-priori • Propagation algorithms • …
ED: Summary • Difficult problem even for humans • Several approaches • Simple search • Search with restrictions • Known surface forms • Graph-based • Improved F-score for DBpedia (70-80%) • Low F-score for generic knowledge bases • Intrinsically difficult • Still a lot to do
Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • OpenIE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion