From Unstructured Information to Linked Data

  1. From Unstructured Information to Linked Data Axel Ngonga, Head of SIMBA@AKSW, University of Leipzig IASLOD, August 15/16th 2012

  2. Motivation

  3. Motivation • Where does the LOD Cloud come from? • Structured data • Triplify, D2R • Semi-structured data • DBpedia • Unstructured data • ??? • Unstructured data make up 80% of the Web • How do we extract Linked Data from unstructured data sources?

  4. Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • Open IE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion NB: Will be mainly concerned with the newest developments.

  5. Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • Open IE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion

  6. Problem Definition • Simple(?) problem: given a text fragment, retrieve • all entities and • relations between these entities automatically, plus • „ground them“ in an ontology • Also coined Knowledge Extraction • John Petrucci was born in New York. :John_Petrucci dbo:birthPlace :New_York . (Figure: the resulting graph with nodes :John_Petrucci and :New_York linked by the edge dbo:birthPlace)

  7. Problems 1. Finding entities → Named Entity Recognition 2. Finding relation instances → Relation Extraction 3. Finding URIs → URI Disambiguation

  8. Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • Open IE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion

  9. Named Entity Recognition • Problem definition: Given a set of classes, find all strings that are labels of instances of these classes within a text fragment • John Petrucci was born in New York. • [John Petrucci, PER] was born in [New York, LOC].

  10. Named Entity Recognition • Problem definition: Given a set of classes, find all strings that are labels of instances of these classes within a text fragment • Common sets of classes • CoNLL03: Person, Location, Organization, Miscellaneous • ACE05: Facility, Geo-Political Entity, Location, Organisation, Person, Vehicle, Weapon • BioNLP2004: Protein, DNA, RNA, cell line, cell type • Several approaches • Direct solutions (single algorithms) • Ensemble Learning

  11. NER: Overview of approaches • Dictionary-based • Hand-crafted Rules • Machine Learning • Hidden Markov Models (HMMs) • Conditional Random Fields (CRFs) • Neural Networks • k Nearest Neighbors (kNN) • Graph Clustering • Ensemble Learning • Veto-Based (Bagging, Boosting) • Neural Networks

  12. NER: Dictionary-based • Simple Idea • Define mappings between words and classes, e.g., Paris → Location • Try to match each token from each sentence • Return the mapped entities • Time-efficient at runtime • Manual creation of gazetteers • Low precision (Paris = Person, Location) • Low recall (esp. on Persons and Organizations as the number of instances grows)
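
A minimal sketch of the dictionary lookup described on this slide (the gazetteer entries and function names below are invented for illustration, not part of the presentation):

    # Dictionary-based NER: look up every gazetteer entry in the sentence.
    GAZETTEER = {
        "Paris": "LOC",            # ambiguity is already visible: Paris can also be a PER
        "New York": "LOC",
        "John Petrucci": "PER",
    }

    def dictionary_ner(sentence):
        """Return (surface form, class) pairs for every gazetteer entry found in the sentence."""
        return [(entry, cls) for entry, cls in GAZETTEER.items() if entry in sentence]

    print(dictionary_ner("John Petrucci was born in New York."))
    # [('New York', 'LOC'), ('John Petrucci', 'PER')]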

  13. NER: Rule-based • Simple Idea • Define a set of rules to find entities, e.g., [PERSON] was born in [LOCATION]. • Try to match each sentence to one or several rules • Return the matched entities • High precision • Manual creation of rules is very tedious • Low recall (finite number of patterns)
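
A minimal sketch of the rule above as a regular expression (the capitalised-token heuristic for [PERSON] and [LOCATION] is an assumption made only for this example):

    import re

    # The rule "[PERSON] was born in [LOCATION]." encoded as a regex over capitalised tokens.
    RULE = re.compile(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in ([A-Z][a-z]+(?: [A-Z][a-z]+)*)")

    def rule_ner(sentence):
        return [(m.group(1), "PER", m.group(2), "LOC") for m in RULE.finditer(sentence)]

    print(rule_ner("John Petrucci was born in New York."))
    # [('John Petrucci', 'PER', 'New York', 'LOC')]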

  14. NER: Markov Models • Stochastic process with the Markov property: P(Xt+1 = Sj | Xt = Si, Xt-1, …, X0) = P(Xt+1 = Sj | Xt = Si) • Equivalent to a finite-state machine • Formally consists of • Set S of states S1, …, Sn • Matrix M such that mij = P(Xt+1 = Sj | Xt = Si)
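
A small numeric illustration of the transition matrix M (states and probabilities are invented for illustration): row i holds P(Xt+1 = Sj | Xt = Si), and multiplying a state distribution by M advances the chain one step.

    import numpy as np

    # Two-state Markov chain: each row of M holds the outgoing transition probabilities of a state.
    states = ["PER", "LOC"]
    M = np.array([[0.7, 0.3],      # m_ij = P(X_{t+1} = S_j | X_t = S_i)
                  [0.4, 0.6]])

    start = np.array([1.0, 0.0])   # start in state PER with probability 1
    print(start @ M @ M)           # distribution after two steps: [0.61 0.39]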

  15. NER: Hidden Markov Models • Extension of Markov Models • States are hidden and assigned an output function • Only the output is seen • Transitions are learned from training data • How do they work? • Input: Discrete sequence of features (e.g., POS tags, word stems, etc.) • Goal: Find the best sequence of states that represents the input • Output: hopefully the right classification of each token (Figure: hidden states S0, S1, …, Sn emitting labels such as PER, _, LOC)
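
A toy sketch of the decoding step ("find the best sequence of states for the input") using the Viterbi algorithm; all states, observations and probabilities below are invented for illustration and are not a trained NER model:

    import numpy as np

    states = ["PER", "O"]                      # hidden states: entity vs. other
    start_p = np.array([0.5, 0.5])             # P(first state)
    trans_p = np.array([[0.3, 0.7],            # P(next state | current state)
                        [0.2, 0.8]])
    emit_p = {"john": np.array([0.8, 0.1]),    # P(observed token | state)
              "was":  np.array([0.1, 0.5]),
              "born": np.array([0.1, 0.4])}

    def viterbi(tokens):
        v = start_p * emit_p[tokens[0]]                  # best score ending in each state
        backpointers = []
        for tok in tokens[1:]:
            scores = v[:, None] * trans_p * emit_p[tok]  # scores[i, j]: come from i, move to j
            backpointers.append(scores.argmax(axis=0))
            v = scores.max(axis=0)
        path = [int(v.argmax())]
        for bp in reversed(backpointers):                # walk back along the best path
            path.insert(0, int(bp[path[0]]))
        return [states[i] for i in path]

    print(viterbi(["john", "was", "born"]))              # ['PER', 'O', 'O']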

  16. NER: k Nearest Neighbors • Idea • Describe each token q from a labelled training dataset with a set of features (e.g., left and right neighbors) • Each new token t is described with the same features • Assign t the class of its k nearest neighbors
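
A toy sketch of the kNN idea with left/right-neighbour features (training examples, the crude overlap "distance" and k are invented for illustration):

    from collections import Counter

    # Each training token is described by its left and right neighbour, as on the slide.
    train = [
        ({"left": "in", "right": "."},   "LOC"),
        ({"left": "in", "right": ","},   "LOC"),
        ({"left": "by", "right": "on"},  "PER"),
        ({"left": ",",  "right": "was"}, "PER"),
    ]

    def knn_tag(features, k=3):
        # similarity = number of shared feature values (deliberately crude)
        ranked = sorted(train, key=lambda ex: -sum(ex[0][f] == v for f, v in features.items()))
        return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

    print(knn_tag({"left": "in", "right": "."}))   # 'LOC'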

  17. NER: So far … • „Simple approaches“ • Apply one algorithm to the NER problem • Bound to be limited by the assumptions of the model • Implemented by a large number of tools • Alchemy • Stanford NER • Illinois Tagger • Ontos NER Tagger • LingPipe • …

  18. NER: Ensemble Learning • Intuition: Each algorithm has its strengths and weaknesses • Idea: Use ensemble learning to merge the results of different algorithms so as to create a meta-classifier of higher accuracy (Figure: component classifiers such as pattern-based approaches, dictionary-based approaches, Support Vector Machines and Conditional Random Fields feeding into the ensemble)

  19. NER: Ensemble Learning • Idea: Merge the results of several approaches to improve the overall result • Simplest approaches: • Voting • Weighted voting (Figure: Input → System 1, System 2, …, System n → Merger → Output)
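
A minimal sketch of the merger box as (weighted) voting; the system names, labels and weights are invented for illustration:

    from collections import defaultdict

    def weighted_vote(system_outputs, weights):
        """system_outputs: {system: label for one token}; weights: {system: weight}."""
        scores = defaultdict(float)
        for system, label in system_outputs.items():
            scores[label] += weights.get(system, 1.0)   # plain voting = all weights equal
        return max(scores, key=scores.get)

    outputs = {"stanford": "LOC", "illinois": "LOC", "lingpipe": "PER"}
    weights = {"stanford": 0.9, "illinois": 0.8, "lingpipe": 0.6}
    print(weighted_vote(outputs, weights))   # 'LOC'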

  20. NER: Ensemble Learning • When does it work? • Accuracy • Need for the existing solutions to be „good“ • Merging random results leads to random results • Given: current approaches reach 80% F-Score • Diversity • Need for the smallest possible amount of correlation between approaches • E.g., merging two HMM-based taggers won‘t help • Given: large number of approaches for NER

  21. NER: FOX • Federated Knowledge Extraction Framework • Idea: Apply ensemble learning to NER • Classical approach: Voting • Does not make use of systematic error • Partly difficult to train • Use neural networks instead • Can make use of systematic error • Easy to train • Converges fast • http://fox.aksw.org

  22. NER: FOX

  23. NER: FOX on MUC7

  24. NER: FOX on MUC7

  25. NER: FOX on Website Data

  26. NER: FOX on Website Data

  27. NER: FOX on Companies and Countries • No runtime issues (parallel implementation) • NN overhead is small • Overfitting

  28. NER: Summary • Large number of approaches • Dictionaries • Hand-crafted rules • Machine Learning • Hybrid • … • Combining approaches leads to better results than single algorithms

  29. Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • Open IE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion

  30. RE: Problem Definition • Find the relations between NEs if such relations exist. • NEs not always given a-priori (open vs. closed RE) • John Petrucci was born in New York. • [John Petrucci, PER] was born in [New York, LOC]. • bornIn ([John Petrucci, PER], [New York, LOC]).

  31. RE: Approaches • Hand-crafted rules • Pattern Learning • Coupled Learning

  32. RE: Pattern-based • Hearst patterns [Hearst: COLING‘92] • POS-enhanced regular expression matching in natural-language text • NP0 {,} such as {NP1, NP2, … (and|or)} {,} NPn • NP0 {,} {NP1, NP2, … NPn-1} {,} or other NPn • “The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string.” → isA(“Bambara ndang”, “bow lute”) • Time-efficient at runtime • Very low recall • Not adaptable to other relations
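
A crude approximation of the first pattern (NP0 such as NP1 …) as a plain regular expression; a real implementation matches over NP chunks produced by a POS tagger, so the word-level regex below is only an illustration:

    import re

    HEARST_SUCH_AS = re.compile(r"(?:[Tt]he )?([\w ]+?), such as (?:[Tt]he )?([\w ]+?),")

    sentence = ("The bow lute, such as the Bambara ndang, is plucked and has an "
                "individual curved neck for each string.")
    for m in HEARST_SUCH_AS.finditer(sentence):
        hypernym, hyponym = m.groups()
        print(f'isA("{hyponym}", "{hypernym}")')   # isA("Bambara ndang", "bow lute")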

  33. RE: DIPRE • DIPRE = Dual Iterative Pattern Relation Extraction • Semi-supervised, iterative gathering of facts and patterns • Positive & negative examples as seeds for a given target relation • e.g. +(Hillary, Bill); +(Carla, Nicolas); –(Larry, Google) • Various tuning parameters for pruning low-confidence patterns and facts • Extended to SnowBall / QXtract (Figure: bootstrapping loop with seed pairs (Hillary, Bill), (Carla, Nicolas); learned patterns „X and her husband Y“, „X and Y on their honeymoon“, „X and Y and their children“, „X has been dating with Y“, „X loves Y“; newly harvested pairs (Angelina, Brad), (Victoria, David); counter-example (Larry, Google), …)
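
A highly simplified sketch of the DIPRE bootstrapping loop (the corpus, seed pairs and the naive pattern generalisation are invented for illustration; the real system runs over Web-scale text and prunes low-confidence patterns and facts):

    import re

    corpus = [
        "Hillary and her husband Bill appeared together.",
        "Carla and her husband Nicolas visited Berlin.",
        "Victoria and her husband David appeared together.",
    ]
    facts = {("Hillary", "Bill"), ("Carla", "Nicolas")}      # positive seed pairs

    for _ in range(2):                                       # a couple of bootstrap iterations
        # 1) learn patterns from sentences that contain a known pair
        patterns = {s.replace(x, "X").replace(y, "Y")
                    for x, y in facts for s in corpus if x in s and y in s}
        # 2) apply the patterns to the corpus to harvest new pairs
        for p in patterns:
            regex = re.escape(p).replace("X", r"(\w+)").replace("Y", r"(\w+)")
            for s in corpus:
                match = re.fullmatch(regex, s)
                if match:
                    facts.add(match.groups())

    print(facts)   # now also contains ('Victoria', 'David')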

  34. RE: NELL • Never-Ending Language Learner (http://rtw.ml.cmu.edu/) • Open IE with an ontological backbone • Closed set of categories & typed relations • Seeds/counter-seeds (5-10) • Open set of predicate arguments (instances) • Coupled iterative learners • Constantly running over a large Web corpus since January 2010 (200 million pages) • Periodic human supervision • athletePlaysForTeam(Athlete, SportsTeam) • athletePlaysForTeam(Alex Rodriguez, Yankees) • athletePlaysForTeam(Alexander_Ovechkin, Penguins)

  35. RE: NELL • Conservative strategy → Avoid semantic drift

  36. RE: BOA • Bootstrapping Linked Data (http://boa.aksw.org) • Core idea: Use instance data in the Data Web to discover NL patterns and new instances

  37. RE: BOA • Follows a conservative strategy • Only top pattern • Frequency threshold • Score threshold • Evaluation results

  38. RE: Summary • Several approaches • Hand-crafted rules • Machine Learning • Hybrid • Large number of instances available for many relations • Runtime problem → Parallel implementations • Many new facts can be found • Semantic drift • Long tail • Entity Disambiguation

  39. Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • Open IE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion

  40. ED: Problem Definition • Given (a) reference knowledge base(s), a text fragment, a list of NEs (incl. position), and a list of relations, find URIs for each of the NEs and relations • Very difficult problem • Ambiguity, e.g., Paris = Paris Hilton? Paris (France)? • Difficult even for humans, e.g., • Paris‘ mayor died yesterday • Several solutions • Indexing • Surface Forms • Graph-based

  41. ED: Problem Definition • John Petrucci was born in New York. :John_Petrucci dbo:birthPlace :New_York . • [John Petrucci, PER] was born in [New York, LOC]. • bornIn([John Petrucci, PER], [New York, LOC]).

  42. ED: Indexing • More retrieval than disambiguation • Similar to dictionary-based approaches • Idea • Index all labels in the reference knowledge base • Given an input label, retrieve all entities with a similar label • Poor recall (unknown surface forms, e.g., „Mme Curie“ for „Marie Curie“) • Low precision (Paris = Paris Hilton, Paris (France), …)
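
A minimal sketch of the label-index lookup (the tiny "knowledge base" below is invented for illustration); it shows both weaknesses named on the slide:

    # Map each indexed label to the URIs that carry it.
    LABEL_INDEX = {
        "Paris":       [":Paris", ":Paris_Hilton"],   # both resources carry the label "Paris"
        "New York":    [":New_York"],
        "Marie Curie": [":Marie_Curie"],
    }

    def lookup(surface_form):
        return LABEL_INDEX.get(surface_form, [])

    print(lookup("Paris"))       # [':Paris', ':Paris_Hilton'] -> low precision
    print(lookup("Mme Curie"))   # []                          -> poor recall (unknown surface form)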

  43. ED: Type Disambiguation • Extension of indexing • Index all labels • Infer type information • Retrieve labels from entities of the given type • Same recall as the previous approach • Higher precision • Paris[LOC] != Paris[PER] • Still, Paris (France) vs. Paris (Ontario) • Need for context

  44. ED: Spotlight • Known surface forms (http://dbpedia.org/spotlight) • Based on DBpedia + Wikipedia • Uses supplementary knowledge including disambiguation pages, redirects, wikilinks • Three main steps • Spotting: Finding possible mentions of DBpedia resources, e.g., John Petrucci was born in New York. • Candidate Selection: Find possible URIs, e.g., John Petrucci → :John_Petrucci; New York → :New_York, :New_York_County, … • Disambiguation: Map the context to a vector for each resource, e.g., New York → :New_York
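
A small sketch of the disambiguation step: pick the candidate URI whose context vector is most similar to the mention's context. The context vectors, URIs and the plain cosine similarity below are invented for illustration and are not taken from the actual Spotlight service:

    import math
    from collections import Counter

    CONTEXT = {
        ":New_York":        Counter({"born": 3, "city": 5, "manhattan": 4}),
        ":New_York_County": Counter({"county": 6, "census": 3, "borough": 2}),
    }

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    mention_context = Counter("john petrucci was born in new york".split())
    best = max(CONTEXT, key=lambda uri: cosine(mention_context, CONTEXT[uri]))
    print(best)   # :New_York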

  45. ED: YAGO2 • Joint Disambiguation ♬ Mississippi, one of Bob’s later songs, was first recorded by Sheryl on her album.

  46. ED: YAGO2 • (Figure: mention-entity graph for the example sentence; each mention (Mississippi, Bob, Sheryl) is linked to its entity candidates, e.g., Mississippi (Song), Mississippi (State), Bob Dylan, Sheryl Cruz, Sheryl Lee, Sheryl Crow; edges are weighted by the prior prior(ml, ei), the context similarity sim(cxt(ml), cxt(ei)), and the entity-entity coherence coh(ei, ej)) • Objective: Maximize the objective function (e.g., total weight) • Constraint: Keep at least one entity per mention

  47. ED: FOX • Generic Approach • A-priori score (a): Popularity of URIs • Similarity score (s): Similarity of resource labels and text • Coherence score (z): Correlation between URIs (Figure: candidate graph with a|s scores on mention-candidate edges and z scores between candidates)
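
A minimal sketch of combining the three scores per candidate URI; the candidate values and weights below are invented for illustration and are not FOX's actual model:

    candidates = {
        ":Paris":        {"a": 0.9, "s": 1.0, "z": 0.2},   # a-priori, similarity, coherence
        ":Paris_Hilton": {"a": 0.6, "s": 0.7, "z": 0.1},
    }

    def combined(scores, w_a=0.3, w_s=0.4, w_z=0.3):
        return w_a * scores["a"] + w_s * scores["s"] + w_z * scores["z"]

    print(max(candidates, key=lambda uri: combined(candidates[uri])))   # :Paris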

  48. ED: FOX • Allows the use of several algorithms • HITS • PageRank • Apriori • Propagation Algorithms • …

  49. ED: Summary • Difficult problem even for humans • Several approaches • Simple search • Search with restrictions • Known surface forms • Graph-based • Improved F-Score for DBpedia (70-80%) • Low F-Score for generic knowledge bases • Intrinsically difficult • Still a lot to do

  50. Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • Open IE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion
