Knowledge Harvesting from Text and Web Sources. Part 3: Knowledge Linking. Gerhard Weikum, Max Planck Institute for Informatics, http://www.mpi-inf.mpg.de/~weikum/. Quiz Time. How many days do you need to visit all Shangri-La places on this planet? Source: geonames.org.


Presentation Transcript


Knowledge Harvesting from Text and Web Sources

Part 3: Knowledge Linking

Gerhard Weikum

Max Planck Institute for Informatics

http://www.mpi-inf.mpg.de/~weikum/


Quiz Time

How many days do you need to visit all Shangri-La places on this planet?

Source: geonames.org

Answer: 365


Linked Data: RDF Triples on the Web

30 billion triples

500 million links

http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png



Linked RDF Triples on the Web

yago/wordnet: Artist109812338

rdf:subclassOf

rdf:subclassOf

yago/wordnet:Actor109765278

rdf:type

yago/wikicategory:ItalianComposer

rdf:type

imdb.com/name/nm0910607/

dbpedia.org/resource/Ennio_Morricone

prop:actedIn

prop: composedMusicFor

imdb.com/title/tt0361748/

dbpprop:citizenOf

dbpedia.org/resource/Rome

owl:sameAs

owl:sameAs

rdf.freebase.com/ns/en.rome

data.nytimes.com/51688803696189142301

owl:sameAs

geonames.org/5134301/city_of_rome

Coord

N 43° 12' 46'' W 75° 27' 20''



Linked RDF Triples on the Web

yago/wordnet: Artist109812338

rdf:subclassOf

rdf:subclassOf

yago/wordnet:Actor109765278

rdf:type

yago/wikicategory:ItalianComposer

rdf:type

imdb.com/name/nm0910607/

dbpedia.org/resource/Ennio_Morricone

prop:actedIn

prop: composedMusicFor

imdb.com/title/tt0361748/

dbpprop:citizenOf

dbpedia.org/resource/Rome

?

?

owl:sameAs

owl:sameAs

rdf.freebase.com/ns/en.rome_ny

data.nytimes.com/51688803696189142301

Referential data quality?

Hand-crafted sameAs links?

Generated sameAs links?

?

owl:sameAs

geonames.org/5134301/city_of_rome

Coord

N 43° 12' 46'' W 75° 27' 20''


RDF Entities on the Web

http://sig.ma



Entity-Name Ambiguity

http://sameas.org



Entities in HTML

http://sindice.com


Entity Markup in HTML: Towards Standardized Microformats

http://schema.org/


Web Page in Standard HTML

http://schema.org/

Jane Doe

<img src="janedoe.jpg" />

Professor

20341 Whitworth Institute

405 Whitworth

Seattle WA 98052

(425) 123-4567

<a href="mailto:[email protected]">[email protected]</a>

Jane's home page:

<a href="http://www.janedoe.com">janedoe.com</a>

Graduate students:

<a href="http://www.xyz.edu/students/alicejones.html">Alice Jones</a>

<a href="http://www.xyz.edu/students/bobsmith.html">Bob Smith</a>


Web Page in HTML with Microdata

http://schema.org/

<div itemscope itemtype="http://schema.org/Person">

  <span itemprop="name">Jane Doe</span>

  <img src="janedoe.jpg" itemprop="image" />

  <span itemprop="jobTitle">Professor</span>

  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">

    <span itemprop="streetAddress">

      20341 Whitworth Institute

      405 N. Whitworth

    </span>

    <span itemprop="addressLocality">Seattle</span>,

    <span itemprop="addressRegion">WA</span>

    <span itemprop="postalCode">98052</span>

  </div>

  <span itemprop="telephone">(425) 123-4567</span>

  <a href="mailto:[email protected]" itemprop="email">

    [email protected]</a>

  Jane's home page:

  <a href="http://www.janedoe.com" itemprop="url">janedoe.com</a>

  Graduate students:

  <a href="http://www.xyz.edu/students/alicejones.html" itemprop="colleague">

    Alice Jones</a>

  <a href="http://www.xyz.edu/students/bobsmith.html" itemprop="colleague">

    Bob Smith</a>

</div>
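The microdata annotations above can be read back by a program. The following is a minimal sketch using only Python's standard `html.parser`; the class `ItempropCollector` and the function `extract_itemprops` are hypothetical names, and the logic is a simplification that assumes well-nested markup with `img` as the only void element. It is not a full implementation of the microdata specification.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Hypothetical helper: collects (itemprop, value) pairs from well-nested HTML."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open = []  # stack of (itemprop or None, text chunks) per open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if tag == "img":  # void element: the value is the src attribute
            if prop:
                self.pairs.append((prop, attrs.get("src")))
            return
        if prop and tag == "a":  # links: the value is the href attribute
            self.pairs.append((prop, attrs.get("href")))
            prop = None
        elif prop and "itemscope" in attrs:  # nested item: record its type
            self.pairs.append((prop, attrs.get("itemtype")))
            prop = None
        self._open.append((prop, []))

    def handle_endtag(self, tag):
        if tag == "img" or not self._open:
            return
        prop, chunks = self._open.pop()
        if prop:  # plain elements: the value is the normalized text content
            self.pairs.append((prop, " ".join("".join(chunks).split())))

    def handle_data(self, data):
        if self._open and self._open[-1][0]:
            self._open[-1][1].append(data)


def extract_itemprops(html):
    parser = ItempropCollector()
    parser.feed(html)
    parser.close()
    return parser.pairs


SAMPLE = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <img src="janedoe.jpg" itemprop="image" />
  <span itemprop="jobTitle">Professor</span>
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="addressLocality">Seattle</span>,
    <span itemprop="addressRegion">WA</span>
  </div>
  <a href="http://www.janedoe.com" itemprop="url">janedoe.com</a>
</div>
"""

for prop, value in extract_itemprops(SAMPLE):
    print(prop, "=", value)
```

Each pair corresponds roughly to one property of the `Person` item, which is the kind of embedded structure that knowledge linking can build on.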


Web-of-Data vs. Web-of-Contents

  • Critical for knowledge linkage:

  • entity name ambiguity

    • → more structured data combined with text

    • → boosted by knowledge harvesting methods



Embedding RDFa in Web Contents

<html … May 2, 2011

<div typeof=event:music>

<span id="Maestro_Morricone">

Maestro Morricone

<a rel="sameAs"

resource="dbpedia…/Ennio_Morricone "/>

</span>

<span property = "event:location" >

Smetana Hall </span>

<span property="rdf:type"

resource="yago:performance">

The concert</span> will feature

<span property="event:date"

content="14-07-2011"></span>

July 1

</div>

May 2, 2011

Maestro Morricone will perform

on the stage of the Smetana Hall

to conduct the Czech National

Symphony Orchestra and Choir.

The concert will feature both

classical compositions and

soundtracks such as

the Ecstasy of Gold.

In programme two concerts for

July 14th and 15th.

RDF data and Web contents need to be interconnected

RDFa & microformats provide the mechanism

Need ways of creating more embedded RDF triples!


Outline

Motivation

Entity-Name Disambiguation

Mapping Questions into Queries

Entity Linkage

Wrap-up

...


Named-Entity Disambiguation

Harry fought with you know who. He defeats the dark lord.

Candidate entities: Dirty Harry, Harry Potter, Prince Harry of England, The Who (band), Lord Voldemort

Three NLP tasks:

1) named-entity detection: segment & label by HMM or CRF

(e.g. Stanford NER tagger)

2) co-reference resolution: link to preceding NP

(trained classifier over linguistic features)

3) named-entity disambiguation:

map each mention (name) to canonical entity (entry in KB)


Named-Entity Disambiguation

Mentions (surface names):

Sergio talked to Ennio about Eli’s role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio’s trilogy of western films.

Entities (meanings) in the KB:

Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Benny Goodman, Benny Andersson, Star Wars Trilogy, Lord of the Rings, Dollars Trilogy, …

Candidate mappings:

Sergio means Sergio_Leone

Sergio means Serge_Gainsbourg

Ennio means Ennio_Antonelli

Ennio means Ennio_Morricone

Eli means Eli_(bible)

Eli means ExtremeLightInfrastructure

Eli means Eli_Wallach

Ecstasy means Ecstasy_(drug)

Ecstasy means Ecstasy_of_Gold

trilogy means Star_Wars_Trilogy

trilogy means Lord_of_the_Rings

trilogy means Dollars_Trilogy

… … …


Mention-Entity Graph

weighted undirected graph with two types of nodes

Mentions (in the input text):

Sergio talked to Ennio about Eli’s role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio’s trilogy of western films.

Candidate entities (from KB+Stats):

Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy

Goal: joint mapping of all mentions to entities

  • Popularity (m,e): freq(e|m), length(e), #links(e)

  • Similarity (m,e): cos/Dice/KL (context(m), context(e)), with contexts as bag-of-words or language model: words, bigrams, phrases

  • Coherence (e,e’): dist(types), overlap(links), overlap(anchor words)


Mention-Entity Graph

Coherence (e,e’) via dist(types), semantic types of candidate entities:

Eli Wallach: American Jews, film actors, artists, Academy Award winners

Ecstasy of Gold: Metallica songs, Ennio Morricone songs, artifacts, soundtrack music

Dollars Trilogy: spaghetti westerns, film trilogies, movies, artifacts


Mention-Entity Graph

Coherence (e,e’) via overlap(links), links of candidate entities:

Eli Wallach: http://.../wiki/Dollars_Trilogy, http://.../wiki/The_Good,_the_Bad,_the_Ugly, http://.../wiki/Clint_Eastwood, http://.../wiki/Honorary_Academy_Award

Ecstasy of Gold: http://.../wiki/The_Good,_the_Bad,_the_Ugly, http://.../wiki/Metallica, http://.../wiki/Bellagio_(casino), http://.../wiki/Ennio_Morricone

Dollars Trilogy: http://.../wiki/Sergio_Leone, http://.../wiki/The_Good,_the_Bad,_the_Ugly, http://.../wiki/For_a_Few_Dollars_More, http://.../wiki/Ennio_Morricone


Mention-Entity Graph

Coherence (e,e’) via overlap(anchor words), keyphrases of candidate entities:

Eli Wallach: The Magnificent Seven; The Good, the Bad, and the Ugly; Clint Eastwood; University of Texas at Austin

Ecstasy of Gold: Metallica on Morricone tribute; Bellagio water fountain show; Yo-Yo Ma; Ennio Morricone composition

Dollars Trilogy: For a Few Dollars More; The Good, the Bad, and the Ugly; Man with No Name trilogy; soundtrack by Ennio Morricone


Different Approaches

  • Combine Popularity, Similarity, and Coherence Features

  • (Cucerzan: EMNLP’07, Milne/Witten: CIKM’08):

    • for sim (context(m), context(e)):

    • consider surrounding mentions

    • and their candidate entities

    • use their types, links, anchors

    • as features of context(m)

    • set m-e edge weights accordingly

    • use greedy methods for solution

  • Collective Learning with Prob. Factor Graphs

  • (Chakrabarti et al.: KDD‘09):

    • model P[m|e] by similarity and P[e1|e2] by coherence

    • consider likelihood of P[m1 … mk | e1 … ek]

    • factorize by all m-e pairs and e1-e2 pairs

    • use hill-climbing, LP, etc. for solution


Joint Mapping

[Figure: mention-entity graph with numeric edge weights]

  • Build mention-entity graph or joint-inference factor graph

  • from knowledge and statistics in KB

  • Compute high-likelihood mapping (ML or MAP) or

  • dense subgraph such that:

  • each m is connected to exactly one e (or at most one e)


Mention-Entity Popularity Weights

[Milne/Witten 2008, Spitkovsky/Chang 2012]

  • Need dictionary with entities’ names:

  • full names: Arnold Alois Schwarzenegger, Los Angeles, Microsoft Corporation

  • short names: Arnold, Arnie, Mr. Schwarzenegger, New York, Microsoft, …

  • nicknames & aliases: Terminator, City of Angels, Evil Empire, …

  • acronyms: LA, UCLA, MS, MSFT

  • role names: the Austrian action hero, Californian governor, the CEO of MS, …

  • plus gender info (useful for resolving pronouns in context):

  • Bill and Melinda met at MS. They fell in love and he kissed her.

  • Collect hyperlink anchor-text / link-target pairs from

  • Wikipedia redirects

  • Wikipedia links between articles

  • Interwiki links between Wikipedia editions

  • Web links pointing to Wikipedia articles

  • Build statistics to estimate P[entity | name]
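As an illustration of the last bullet, the prior P[entity | name] can be estimated as a relative frequency over collected (anchor text, link target) pairs. This is a minimal sketch: the function name `estimate_prior` and the toy pairs are assumptions for illustration, not data from an actual Wikipedia dump.

```python
from collections import Counter, defaultdict


def estimate_prior(anchor_pairs):
    """Estimate P[entity | name] from (anchor_text, link_target) pairs."""
    counts = defaultdict(Counter)
    for name, entity in anchor_pairs:
        counts[name.lower()][entity] += 1  # case-insensitive name dictionary
    return {
        name: {e: c / sum(ents.values()) for e, c in ents.items()}
        for name, ents in counts.items()
    }


# Toy anchor statistics (made up for illustration).
PAIRS = [
    ("Eli", "Eli_(bible)"),
    ("Eli", "Eli_(bible)"),
    ("Eli", "Eli_(bible)"),
    ("Eli", "Eli_Wallach"),
    ("Ecstasy", "Ecstasy_(drug)"),
]

prior = estimate_prior(PAIRS)
print(prior["eli"])
```

A popularity prior of this kind favors the most frequent meaning, which is why it is combined with similarity and coherence evidence in the mention-entity graph.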


Mention-Entity Similarity Edges

Precompute characteristic keyphrases q for each entity e:

anchor texts or noun phrases in e’s page with high PMI:

“Metallica tribute to Ennio Morricone”

Match keyphrase q of candidate e in context of mention m

Extent of partial matches

Weight of matched words

The Ecstasy piece was covered by Metallica on the Morricone tribute album.

Compute overall similarity of context(m) and candidate e


Entity-Entity Coherence Edges

Precompute overlap of incoming links for entities e1 and e2

Alternatively compute overlap of anchor texts for e1 and e2

or overlap of keyphrases, or similarity of bag-of-words, or …

Optionally combine with type distance of e1 and e2

(e.g., Jaccard index for type instances)

For special types of e1 and e2 (locations, people, etc.)

use spatial or temporal distance
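Two ways of turning "overlap of incoming links" into a number are sketched below. The Jaccard index is the plain set overlap; the second function follows the link-based relatedness measure commonly attributed to Milne and Witten, which is an assumption about which overlap measure is meant here. The in-link sets and the collection size are toy values.

```python
import math


def jaccard(a, b):
    """Jaccard index of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_relatedness(a, b, n_entities):
    """Milne/Witten-style relatedness from in-link sets a and b.

    Assumes n_entities (total number of entities) exceeds min(|a|, |b|).
    """
    common = len(a & b)
    if common == 0:
        return 0.0
    big, small = max(len(a), len(b)), min(len(a), len(b))
    score = 1 - (math.log(big) - math.log(common)) / (
        math.log(n_entities) - math.log(small)
    )
    return max(score, 0.0)


# Toy in-link sets: IDs of pages linking to each entity (made up).
inlinks_e1 = {1, 2, 3, 4}
inlinks_e2 = {3, 4, 5, 6}

print(jaccard(inlinks_e1, inlinks_e2))
print(link_relatedness(inlinks_e1, inlinks_e2, n_entities=1000))
```

The same functions apply unchanged if the sets hold anchor words or keyphrases instead of link sources.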


Coherence Graph Algorithm

[J. Hoffart et al.: EMNLP’11]

[Figure: mention-entity graph with numeric edge weights and weighted degrees of entity nodes]

  • Compute dense subgraph to

  • maximize min weighted degree among entity nodes

  • such that:

  • each m is connected to exactly one e (or at most one e)

  • Greedy approximation:

  • iteratively remove weakest entity and its edges

  • Keep alternative solutions, then use local/randomized search
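The greedy approximation described in the bullets above can be sketched as follows. This is a simplified illustration, not the published AIDA implementation: the function name, the toy edge weights, and the final step (choosing the highest-degree surviving candidate per mention instead of a local or randomized search) are assumptions.

```python
def greedy_dense_subgraph(me_edges, ee_edges):
    """me_edges: {(mention, entity): weight}; ee_edges: {(e1, e2): weight}, undirected."""
    candidates = {}
    for m, e in me_edges:
        candidates.setdefault(m, set()).add(e)
    alive = set().union(*candidates.values())

    def degree(e, nodes):
        # weighted degree of entity e within the subgraph induced by `nodes`
        d = sum(w for (_, x), w in me_edges.items() if x == e)
        d += sum(w for (a, b), w in ee_edges.items()
                 if (a == e and b in nodes) or (b == e and a in nodes))
        return d

    def removable(e):
        # an entity must stay if it is the last candidate of some mention
        return all(len(es & alive) > 1 for es in candidates.values() if e in es)

    best, best_score = set(alive), min(degree(e, alive) for e in alive)
    while True:
        victims = sorted(e for e in alive if removable(e))
        if not victims:
            break
        # remove the weakest entity (and implicitly its edges)
        alive.remove(min(victims, key=lambda e: degree(e, alive)))
        score = min(degree(e, alive) for e in alive)
        if score >= best_score:  # keep the subgraph with the best min degree
            best, best_score = set(alive), score
    # simplification: pick the strongest surviving candidate per mention
    return {m: max(sorted(es & best), key=lambda e: degree(e, best))
            for m, es in candidates.items()}


# Toy weights (made up for illustration).
ME = {("Eli", "Eli_(bible)"): 50, ("Eli", "Eli_Wallach"): 30,
      ("Ecstasy", "Ecstasy_(drug)"): 50, ("Ecstasy", "Ecstasy_of_Gold"): 20,
      ("trilogy", "Star_Wars"): 40, ("trilogy", "Dollars_Trilogy"): 20}
EE = {("Eli_Wallach", "Ecstasy_of_Gold"): 90,
      ("Eli_Wallach", "Dollars_Trilogy"): 100,
      ("Ecstasy_of_Gold", "Dollars_Trilogy"): 80}

print(greedy_dense_subgraph(ME, EE))
```

In this toy graph the popular but mutually unrelated candidates are removed first, so the coherent group around the Dollars Trilogy survives even though each of its members has a lower popularity weight.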




Alternative: Random Walks

[Figure: mention-entity graph with edge weights and normalized transition probabilities]

  • for each mention run random walks with restart

  • (like personalized PR with jumps to start mention(s))

  • rank candidate entities by stationary visiting probability

  • very efficient, decent accuracy
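A random walk with restart can be approximated by power iteration, as in the sketch below. The function name, the restart probability of 0.15, the fixed iteration count, and the toy graph are assumptions; edge weights are normalized per node into transition probabilities.

```python
def random_walk_with_restart(weights, start, restart=0.15, iters=100):
    """weights: {(node_a, node_b): weight} for an undirected graph."""
    nodes = sorted({n for edge in weights for n in edge})
    nbrs = {n: {} for n in nodes}
    for (a, b), w in weights.items():
        nbrs[a][b] = w
        nbrs[b][a] = w
    p = {n: 1.0 if n == start else 0.0 for n in nodes}
    for _ in range(iters):
        # with probability `restart`, jump back to the start mention
        nxt = {n: (restart if n == start else 0.0) for n in nodes}
        for n, mass in p.items():
            total = sum(nbrs[n].values())
            for nb, w in nbrs[n].items():
                nxt[nb] += (1 - restart) * mass * w / total
        p = nxt
    return p


# Toy graph: one mention with two candidate entities (weights made up).
GRAPH = {("m:Ecstasy", "e:Ecstasy_of_Gold"): 3.0,
         ("m:Ecstasy", "e:Ecstasy_(drug)"): 1.0}

scores = random_walk_with_restart(GRAPH, start="m:Ecstasy")
ranking = sorted((n for n in scores if n.startswith("e:")),
                 key=scores.get, reverse=True)
print(ranking)
```

Candidate entities are then ranked by their visiting probability; in a full graph the entity-entity edges let coherent candidates reinforce each other.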


AIDA: Accurate Online Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/aida/

AIDA: Very Difficult Example

http://www.mpi-inf.mpg.de/yago-naga/aida/


Some NED Online Tools for

  • J. Hoffart et al.: EMNLP 2011, VLDB 2011

  • https://d5gate.ag5.mpi-sb.mpg.de/webaida/

  • P. Ferragina, U. Scaiella: CIKM 2010

  • http://tagme.di.unipi.it/

  • R. Isele, C. Bizer: VLDB 2012

  • http://spotlight.dbpedia.org/demo/index.html

  • Reuters Open Calais

  • http://viewer.opencalais.com/

  • S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: KDD 2009

  • http://www.cse.iitb.ac.in/soumen/doc/CSAW/

  • D. Milne, I. Witten: CIKM 2008

  • http://wikipedia-miner.cms.waikato.ac.nz/demos/annotate/

  • perhaps more

  • some use Stanford NER tagger for detecting mentions

  • http://nlp.stanford.edu/software/CRF-NER.shtml


NED: Experimental Evaluation

  • Benchmark:

  • Extended CoNLL 2003 dataset: 1400 newswire articles

  • originally annotated with mention markup (NER),

  • now with NED mappings to Yago and Freebase

  • difficult texts:

  • … Australia beats India … → Australian_Cricket_Team

  • … White House talks to Kremlin … → President_of_the_USA

  • … EDS made a contract with … → HP_Enterprise_Services

Results:

Best: AIDA method with prior+sim+coh + robustness test

82% precision @100% recall, 87% mean average precision

Comparison to other methods, see paper

J. Hoffart et al.: Robust Disambiguation of Named Entities in Text, EMNLP 2011

http://www.mpi-inf.mpg.de/yago-naga/aida/


Ongoing Research & Remaining Challenges

  • More efficient graph algorithms (multicore, etc.)

  • Allow mentions of unknown entities, mapped to null

  • Leverage deep-parsing structures,

  • leverage semantic types

  • Example: Page played Kashmir on his Gibson (dependency arcs: subj, obj, mod)

  • Short and difficult texts:

    • tweets, headlines, etc.

    • fictional texts: novels, song lyrics, etc.

    • incoherent texts

  • Structured Web data: tables and lists

  • Disambiguation beyond entity names:

    • coreferences: pronouns, paraphrases, etc.

    • common nouns, verbal phrases (general WSD)


General Word Sense Disambiguation

Which song writers covered ballads written by the Stones?

song writers: {songwriter, composer}

covered: {cover, perform}, {cover, report, treat}, {cover, help out}


Handling Out-of-Wikipedia Entities

Cave composed haunting songs like Hallelujah, O Children, and the Weeping Song.

Candidate entities with their keyphrases:

Cave:

wikipedia.org/Good_Luck_Cave: Gunung Mulu National Park, Sarawak Chamber, largest underground chamber

wikipedia.org/Nick_Cave: Bad Seeds, No More Shall We Part, Murder Songs

Hallelujah:

wikipedia/Hallelujah_Chorus: Messiah oratorio, George Frideric Handel

wikipedia/Hallelujah_(L_Cohen): Leonard Cohen, Rufus Wainwright, Shrek and Fiona

last.fm/Nick_Cave/Hallelujah: eerie violin, Bad Seeds, No More Shall We Part

O Children:

wikipedia/Children_(2011 film): South Korean film

last.fm/Nick_Cave/O_Children: Nick Cave & Bad Seeds, Harry Potter 7 movie, haunting choir

Weeping Song:

wikipedia.org/Weeping_(song): Dan Heymann, apartheid system

last.fm/Nick_Cave/Weeping_Song: Nick Cave, Murder Songs, P.J. Harvey, Nick and Blixa duet


Handling Out-of-Wikipedia Entities

[J. Hoffart et al.: CIKM’12]

  • Characterize all entities (and mentions) by sets of keyphrases

  • Entity coherence then becomes:

  • keyphrases overlap, no need for href link data

  • For each mention add a “self” candidate:

  • out-of-KB entity with keyphrases computed by Web search

Phrase overlap for phrases p, q with word weights:

PO(p,q) = \frac{\sum_{w \in p \cap q} \min(p(w), q(w))}{\sum_{w \in p \cup q} \max(p(w), q(w))}

Keyphrase overlap relatedness for entities e, f with phrase weights:

KORE(e,f) = \frac{\sum_{p \in e,\, q \in f} PO(p,q)^2 \cdot \min(e(p), f(q))}{\sum_{p \in e} e(p) + \sum_{q \in f} f(q)}

Efficient comparison of two keyphrase-sets

→ two-stage hashing, using min-hash sketches and LSH
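The two measures on this slide can be computed directly from their definitions. The sketch below does this by brute force, without the min-hash and LSH speed-ups; the data layout (a phrase as a word-to-weight dictionary, an entity as a list of weighted phrases) and the toy weights are assumptions.

```python
def phrase_overlap(p, q):
    """PO(p, q) for phrases given as {word: weight} dictionaries."""
    num = sum(min(p[w], q[w]) for w in set(p) & set(q))
    den = sum(max(p.get(w, 0.0), q.get(w, 0.0)) for w in set(p) | set(q))
    return num / den if den else 0.0


def kore(e, f):
    """KORE(e, f) for entities given as lists of (phrase, phrase_weight)."""
    num = sum(phrase_overlap(p, q) ** 2 * min(we, wf)
              for p, we in e for q, wf in f)
    den = sum(we for _, we in e) + sum(wf for _, wf in f)
    return num / den if den else 0.0


# Toy keyphrase sets with uniform weights (made up for illustration).
nick_cave = [({"bad": 1.0, "seeds": 1.0}, 1.0),
             ({"murder": 1.0, "songs": 1.0}, 1.0)]
weeping_song = [({"bad": 1.0, "seeds": 1.0}, 1.0)]
good_luck_cave = [({"sarawak": 1.0, "chamber": 1.0}, 1.0)]

print(kore(nick_cave, weeping_song))
print(kore(nick_cave, good_luck_cave))
```

Because the score depends only on keyphrases, it also works for the out-of-KB “self” candidate, whose keyphrases come from Web search rather than from link data.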


Variants of NED at Web Scale

Tools can map short text onto entities in a few seconds

  • How to run this on big batch of 1 million input texts?

    • → partition inputs across distributed machines,

    • organize dictionary appropriately, …

    • → exploit cross-document contexts

  • How to deal with inputs from different time epochs?

    • → consider time-dependent contexts,

    • map to entities of proper epoch

    • (e.g. harvested from Wikipedia history)

  • How to handle Web-scale inputs (100 million pages)

  • restricted to a set of interesting entities?

  • (e.g. tracking politicians and companies)


Outline

Motivation

Entity-Name Disambiguation

Mapping Questions into Queries

Entity Linkage

Wrap-up

...


Word Sense Disambiguation for Question-to-Query Translation

QA system DEANNA [M. Yahya et al.: EMNLP’12]

www.mpi-inf.mpg.de/yago-naga/deanna/

Question:

“Who played in Casablanca and was married to a writer born in Rome?”

Translation with WSD into SPARQL:

Select ?p Where {

?p type person .

?p actedIn Casablanca_(film) .

?p isMarriedTo ?w .

?w type writer .

?w bornIn Rome . }

Pipeline: Question → Translation with WSD → SPARQL → KB → Answer


DEANNA in a Nutshell

Question → Phrase detection → Phrase mapping → Dependency detection → Joint Disambig. (using KB) → Query Generation → SPARQL → Answers


DEANNA Components

Question

1 Phrase detection

2 Phrase mapping

3 Dependency detection

4 Joint Disambig. (KB)

Query Generation

SPARQL

Answers


Phrase Detection

  • Concepts: entities & classes:

  • dictionary-based

  • Relations:

  • mainly use Reverb [Fader et al: EMNLP’11]: V | VP | VW*P

  • … was/VBD married/VBN to/TO a/DT…

Detected phrases: Who, played, played in, Casablanca, married, married to, was married to, a writer




Phrase Mapping

  • Concepts: entities & classes: dictionary-based

  • Relations: dictionary-based

Phrases: Casablanca, played, played in

Candidate semantic items: e:White_House, e:Casablanca, e:Casablanca_(film), e:Played_(film), r:actedIn, r:hasMusicalRole




Dependency Detection

Look for specific patterns in dependency graph

[de Marneffe et al. LREC’06]

Dependency path: writer –partmod→ born –prep→ in –pobj→ Rome

Triple candidate q1 over the phrases: a writer, was born, Rome

Phrases: Rome, born, was born, a writer

Candidate semantic items: e:Rome, e:Sydne_Rome, e:Born_(film), e:Max_Born, r:bornOnDate, r:bornInPlace, c:writer


Disambiguation Graph

Phrase nodes: Rome, born, was born, a writer, Casablanca, played, played in, Who, married, married to, was married to

q-nodes: q1, q2, q3

Semantic nodes: e:Rome, e:Sydne_Rome, e:Born_(film), e:Max_Born, r:bornOnDate, r:bornInPlace, c:writer, e:White_House, e:Casablanca, e:Casablanca_(film), e:Played_(film), r:actedIn, r:hasMusicalRole, c:person, e:Married_(series), c:married_person, r:isMarriedTo




Joint Disambiguation - ILP

  • ILP: Integer Linear Programming

  • maximize \alpha \sum_{i,j} w_{i,j} Y_{i,j} + \beta \sum_{k,l} v_{k,l} Z_{k,l} + \dots

  • Subject to:

    • No token in multiple phrases,

    • Triples observe type constraints, …


Joint Disambiguation – Objective

\alpha \sum_{i,j} w_{i,j} Y_{i,j} + \beta \sum_{k,l} v_{k,l} Z_{k,l}

First term (prior): similarity edges between phrase nodes and semantic nodes

Second term (coherence): coherence edges between semantic nodes

Phrase nodes: Rome, born, was born, a writer

q-nodes: q1

Semantic nodes: e:Rome, e:Sydne_Rome, e:Born_(film), e:Max_Born, r:bornOnDate, r:bornInPlace, c:writer


Joint Disambiguation – Constraints

A phrase node can be assigned to only one semantic node:

Phrase node a: Casablanca

Semantic nodes: 1 = e:White_House, 2 = e:Casablanca, 3 = e:Casablanca_(film)

Decision variables: Y_{a,1}, Y_{a,2}, Y_{a,3}, with Y_{a,1} + Y_{a,2} + Y_{a,3} \le 1

  • \alpha \sum_{i,j} w_{i,j} Y_{i,j} + \beta \sum_{k,l} v_{k,l} Z_{k,l}


Joint Disambiguation – Constraints

Classes translate to type-constrained variables

→ Every semantic triple should have a class to join & project!

person actedIn Casablanca_(film)

→ ?x type person . ?x actedIn Casablanca_(film)

Phrase nodes: Rome, was born, a writer

q-nodes: q1

Semantic nodes: e:Rome, e:Sydne_Rome, r:bornOnDate, r:bornInPlace, e:The_Writer (magazine), c:writer




Structured Query Generation

q-nodes: q1, q2, q3

Selected mappings:

Rome → e:Rome

was born → r:bornIn

a writer → c:writer

Casablanca → e:Casablanca_(film)

played in → r:actedIn

Who → c:person

was married to → r:isMarriedTo

SELECT ?p WHERE {

?w type writer .

?w bornIn Rome .

?p type person .

?p actedIn Casablanca_(film) .

?p isMarriedTo ?w }
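Once phrases are disambiguated, emitting the query text is mechanical. The sketch below assembles the query above from a list of triples; `build_sparql` is a hypothetical helper, and the terms are kept in the slide's shorthand (no prefixes or full IRIs), so the output is illustrative rather than directly executable against a real endpoint.

```python
def build_sparql(triples, select):
    """Render (subject, predicate, object) triples as a SELECT query string."""
    body = "\n".join(f"  {s} {p} {o} ." for s, p, o in triples)
    return "SELECT " + " ".join(select) + " WHERE {\n" + body + "\n}"


# Triples resulting from the disambiguated phrases; class phrases
# (a writer, Who) become type-constrained variables ?w and ?p.
TRIPLES = [
    ("?w", "type", "writer"),
    ("?w", "bornIn", "Rome"),
    ("?p", "type", "person"),
    ("?p", "actedIn", "Casablanca_(film)"),
    ("?p", "isMarriedTo", "?w"),
]

print(build_sparql(TRIPLES, select=["?p"]))
```

The shared variable `?w` is what joins the "writer born in Rome" triple pattern with the "married to" pattern.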


Outline

Motivation

Entity-Name Disambiguation

Mapping Questions into Queries

Entity Linkage

Wrap-up

...


Entity Linkage for the Web of Data

30 billion triples

500 million links

yago/wordnet: Artist109812338

rdf:subclassOf

rdf:subclassOf

yago/wordnet:Actor109765278

rdf:type

yago/wikicategory:ItalianComposer

rdf:type

imdb.com/name/nm0910607/

dbpedia.org/resource/Ennio_Morricone

prop:actedIn

prop: composedMusicFor

imdb.com/title/tt0361748/

dbpprop:citizenOf

dbpedia.org/resource/Rome

?

?

owl:sameAs

owl:sameAs

rdf.freebase.com/ns/en.rome_ny

data.nytimes.com/51688803696189142301

?

sameAs links ?

Where? How?

owl:sameAs

geonames.org/5134301/city_of_rome

Coord

N 43° 12' 46'' W 75° 27' 20''


Record Linkage (Entity Resolution)

record 1: Susan B. Davidson, Peter Buneman, Yi Chen; University of Pennsylvania; Issues in …; Int. Conf. on Very Large Data Bases

record 2: O.P. Buneman, S. Davison, Y. Chen; U Penn; Issues in …; VLDB Conf.

record 3: P. Baumann, S. Davidson, Cheng Y.; Penn State; Issues in …; PVLDB

record N: Y. Davidson, Sean Penn, S. Chen; Penn Station; Issues in …; XLDB Conference

  • Find equivalence classes of entities, and records, based on:

    • similarity of values (edit distance, n-gram overlap, etc.)

    • joint agreement of linkage

  • → similarity joins, grouping/clustering, collective learning, etc.

  • often domain-specific customization (similarity measures etc.)

Halbert L. Dunn: Record Linkage. American Journal of Public Health. 1946

H.B. Newcombe et al.: Automatic Linkage of Vital Records. Science, 1959.

I.P. Fellegi, A.B. Sunter: A Theory of Record Linkage. J. of American Statistical Soc., 1969.


Entity Linkage via Markov Logic

  • prob. / uncertain rules:

  • sameTitle(x,y) ∧ sameAuths(x,y) ∧ sameVenue(x,y) ⇒ sameAs(x,y)

  • sameTitle(x,y) ∧ sameAuths(x,y) ∧ sameAffil(x,y) ⇒ sameAs(x,y)

  • overlapAuths(x,y) ∧ sameAffil(x,y) ⇒ sameAuths(x,y)

  • sameAs(rec1.auth1, rec2.auth1) [0.2]

  • sameAs(rec1.auth1, rec2.auth2) [0.9]

  • specify in Markov Logic or as factor graph

  • generate MRF (or …) and solve by MCMC (or …)

  • (Singla/Domingos: ICDM’06,

  • Hall/Sutton/McCallum: KDD’08)


Sameas link test across sources

sameAs-Link Test across Sources

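[Figure: entity ei in LOD source 1 and entity ej in LOD source 2, each with its neighborhood; deciding whether a sameAs link holds between them is a record linkage problem]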

  • sameAs(ei, ej) ⇔ sim(ei, ej) ≥ … ∧ Σ_{x∈N(ei), y∈N(ej)} coh(x,y) ≥ …

similarity: sim(ei, ej)

neighborhoods: N(ei), N(ej)

coherence: coh(x∈N(ei), y∈N(ej))
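
One reading of this test, sketched in Python: accept sameAs(ei, ej) when the direct similarity reaches a threshold and the coherence summed over all neighbor pairs reaches a second threshold. The slide leaves both thresholds open, so `sim_min` and `coh_min` are hypothetical parameters, and the toy similarity and coherence tables are invented for illustration.

```python
def same_as(ei, ej, sim, coh, neighbors, sim_min, coh_min):
    """Accept a sameAs link if the pair is similar enough and the two
    neighborhoods are coherent enough (both thresholds are hypothetical)."""
    total_coh = sum(coh(x, y) for x in neighbors[ei] for y in neighbors[ej])
    return sim(ei, ej) >= sim_min and total_coh >= coh_min


# Toy data, invented for illustration only.
SIM = {("e1", "f1"): 0.8}
COH = {("a", "b"): 0.5, ("a", "c"): 0.25}
NEIGHBORS = {"e1": ["a"], "f1": ["b", "c"]}


def sim(x, y):
    return SIM.get((x, y), 0.0)


def coh(x, y):
    return COH.get((x, y), 0.0)


accepted = same_as("e1", "f1", sim, coh, NEIGHBORS, sim_min=0.7, coh_min=0.7)
```

Testing each pair in isolation like this ignores that links interact (for example through transitivity), which is what the joint formulations address.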


Sameas link generation across sources

sameAs-Link Generation across Sources

[Figure: entities ei, ej, ek spread over LOD sources 1, 2 and 3; each pair is a candidate for a sameAs link]


at Web scale ???

  • Joint Mapping

  • ILP model

  • or prob. factor graph or …

  • Use your favorite solver

  • How?

sim(ei, ej): likelihood of being equivalent, mapped to [-1,1]

coh(x, y): likelihood of being mentioned together, mapped to [-1,1]

0-1 decision variables: Xij … Xjk … Xik …

constraints:

Σ_j Xij ≤ 1 for all i

(1 − Xij) + (1 − Xjk) ≥ (1 − Xik) for all i, j, k

objective function:

Σ_ij (Xij · sim(ei,ej) + Xij · Σ_{x∈Ni, y∈Nj} coh(x,y)) + Σ_jk (…) + Σ_ik (…) = max!
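
To make the 0-1 model concrete, the following Python sketch enumerates all assignments for a tiny instance instead of calling an ILP solver. It assumes that each pair's score already combines sim and the summed neighborhood coherence, and that the at-most-one-partner constraint applies to every entity with respect to every other source. The function name, entity ids and scores are hypothetical; exhaustive search is only feasible for a handful of variables.

```python
from itertools import combinations, permutations, product


def solve_joint_mapping(sources, score):
    """Exhaustively maximize the sum of score[pair] * X[pair].

    sources: list of lists of entity ids, one list per source (ids unique).
    score:   dict frozenset({a, b}) -> sim plus summed coherence for the pair.
    Returns (set of linked pairs, objective value).
    """
    owner = {e: s for s, ents in enumerate(sources) for e in ents}
    entities = list(owner)
    pairs = [frozenset(p) for p in combinations(entities, 2)
             if owner[p[0]] != owner[p[1]]]
    triples = [t for t in combinations(entities, 3)
               if len({owner[e] for e in t}) == 3]

    def feasible(x):
        # Each entity gets at most one partner in every other source.
        for e in entities:
            for s, ents in enumerate(sources):
                if s != owner[e] and sum(x[frozenset((e, f))] for f in ents) > 1:
                    return False
        # Transitivity: (1 - Xij) + (1 - Xjk) >= (1 - Xik).
        for t in triples:
            for i, j, k in permutations(t):
                x_ij = x[frozenset((i, j))]
                x_jk = x[frozenset((j, k))]
                x_ik = x[frozenset((i, k))]
                if (1 - x_ij) + (1 - x_jk) < (1 - x_ik):
                    return False
        return True

    best_links, best_value = set(), float("-inf")
    for bits in product((0, 1), repeat=len(pairs)):
        x = dict(zip(pairs, bits))
        if not feasible(x):
            continue
        value = sum(score.get(p, 0.0) for p in pairs if x[p])
        if value > best_value:
            best_links, best_value = {p for p in pairs if x[p]}, value
    return best_links, best_value


def pair(a, b):
    return frozenset((a, b))


# Toy instance with invented scores in [-1, 1].
SOURCES = [["a1", "a2"], ["b1", "b2"], ["c1"]]
SCORE = {
    pair("a1", "b1"): 0.9, pair("a1", "b2"): -0.5,
    pair("a2", "b1"): -0.6, pair("a2", "b2"): 0.4,
    pair("a1", "c1"): -0.2, pair("a2", "c1"): -0.3,
    pair("b1", "c1"): 0.8, pair("b2", "c1"): -0.7,
}
links, value = solve_joint_mapping(SOURCES, SCORE)
```

In this instance the link a1–c1 is selected although its own score is negative: transitivity forces it once a1–b1 and b1–c1 are both chosen, and the combined objective is still higher than dropping either of them.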


Similarity flooding

Similarity Flooding

  • Graph with record / entity pairs as nodes (sameAs candidates)

  • and edges connecting related pairs:

    • R(x,y) and S(u,w) and sameAs candidates (x,u), (y,w)

    • ⇒ edge between (x,u) and (y,w)

  • Node weights: belief strength in sameAs(x,u)

  • Edge weights: degree of relatedness

  • Iterate until convergence:

    • propagate node weights to neighbors

    • new node weight is linear combination of inputs

Related to belief propagation algorithms, label propagation, etc.
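
The iteration can be sketched in Python as below. This is a simplified illustration rather than the published algorithm: edges are treated as undirected, the new weight of a node is its old weight plus the edge-weighted sum of its neighbors' weights, and weights are rescaled by the current maximum after every round. The rescaling step and all names are assumptions of this sketch.

```python
def similarity_flooding(init, edges, max_iter=100, eps=1e-6):
    """Propagate belief among candidate pairs until the weights stabilize.

    init:  dict node -> initial belief in sameAs for that candidate pair.
    edges: dict (node_a, node_b) -> relatedness weight (used in both directions).
    """
    neighbors = {n: [] for n in init}
    for (a, b), w in edges.items():
        neighbors[a].append((b, w))
        neighbors[b].append((a, w))

    sigma = dict(init)
    for _ in range(max_iter):
        # Linear combination of the node's own weight and its neighbors' weights.
        raw = {n: sigma[n] + sum(w * sigma[m] for m, w in neighbors[n])
               for n in sigma}
        top = max(raw.values()) or 1.0  # avoid dividing by zero
        new = {n: v / top for n, v in raw.items()}
        delta = max(abs(new[n] - sigma[n]) for n in sigma)
        sigma = new
        if delta < eps:
            break
    return sigma


# Candidate pairs as nodes: (x,u) and (y,w) are related, (x,w) is isolated.
P, Q, R = ("x", "u"), ("y", "w"), ("x", "w")
result = similarity_flooding({P: 1.0, Q: 0.5, R: 0.5}, {(P, Q): 1.0})
```

In the toy run, the two connected candidates reinforce each other while the isolated candidate's relative weight decays towards zero.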


Blocking of match candidates

Blocking of Match Candidates

Avoid computing O(n²) similarities between records / entities

  • Group potentially matching records

  • Run more accurate & more expensive method per group

  • at risk of missing some matches

     Name      Zip    Email
  1  John Doe  49305  [email protected]
  2  John Doe  94305  [email protected]
  3  Jon Foe   94305  [email protected]
  4  Jane Foe  12345  [email protected]
  5  Jane Fog  12345  [email protected]

Group by zip code: {1,4,5} and {2,3}

⇒ sameAs(4,5), sameAs(2,3)

Group by 1st char of last name: {1,2} and {3,4,5}

⇒ sameAs(1,2), sameAs(4,5)

  • Iterative Blocking:

    • distribute found matches to other blocks,

    • then repeat per-block runs

  • Multi-type Joint Resolution

    • blocks of different record types (author, venue, etc.)

    • propagate matches to other types, then repeat runs
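
A minimal Python sketch of blocking on the five example records: records are grouped by a blocking key, and only pairs inside a block become match candidates. The Email column is left out because its values are masked in this transcript. Keying on the exact zip string is an assumption; it puts record 1 in a block of its own, so the zip blocks differ from the grouping printed above, although the pairs (2,3) and (4,5) reported there are still among the candidates.

```python
from collections import defaultdict
from itertools import combinations

# (name, zip) per record id, taken from the example table.
RECORDS = {
    1: ("John Doe", "49305"),
    2: ("John Doe", "94305"),
    3: ("Jon Foe", "94305"),
    4: ("Jane Foe", "12345"),
    5: ("Jane Fog", "12345"),
}


def build_blocks(records, key):
    """Group record ids by the value of a blocking key function."""
    blocks = defaultdict(set)
    for rid, rec in records.items():
        blocks[key(rec)].add(rid)
    return dict(blocks)


def candidate_pairs(blocks):
    """All pairs of record ids that share a block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


by_zip = build_blocks(RECORDS, key=lambda rec: rec[1])
by_initial = build_blocks(RECORDS, key=lambda rec: rec[0].split()[-1][0])

all_pairs = len(list(combinations(RECORDS, 2)))  # 10 comparisons without blocking
zip_pairs = candidate_pairs(by_zip)              # 2 comparisons
initial_pairs = candidate_pairs(by_initial)      # 4 comparisons
```

Neither key alone proposes every pair: (1,2) only appears under the last-name key and (2,3) only under the zip key, which is the motivation for running several blockings and passing found matches between them.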


Iterative blocking for joint resolution with multiple entity types

Iterative Blocking for Joint Resolution with Multiple Entity Types

[Whang et al. 2012]

[Figure: blocks of Publications, Authors, and Venues records, shown after round 1 and after round 2]

heuristics for constructing efficient execution plans, exploiting „influence graph“


Rimom method

RiMOM Method

[Juanzi Li et al.: TKDE‘09]

Risk Minimization Based Ontology Matching Method

for joint matching of concepts (entities, classes) & properties (relations)

  • Strategies using variety of matching criteria:

  • Linguistic-based:

    • edit distance

    • context vector distance

  • Structure-based:

    • similarity flooding

keg.cs.tsinghua.edu.cn/project/RiMOM/


Coma framework

COMA++ Framework

[E. Rahm et al.]

  • Joint schema alignment and entity matching

  • Comprehensive architecture with many plug-ins for customizing to specific application

  • Blocked matchers parallelizable on Map-Reduce platform

dbs.uni-leipzig.de/Research/coma.html/


Paris method

PARIS Method

[F. Suchanek et al. 2012]

Probabilistic Alignment of Relations, Instances, and Schema:

joint reasoning on sameEntity, sameRelation, sameClass

with direct probabilistic assessment

P[literal1 ≡ literal2] = …   (same constant value)

P[r1 ⊆ r2] = …   (sub-relation)

P[e1 ≡ e2] = …   (same entity)

P[c1 ⊆ c2] = …   (sub-class)

Iterate through probabilistic equations

Empirically converges to fixpoint

Matching entities of DBpedia with YAGO:

90% precision, 73% recall, after 4 iterations, 5 h run-time

webdam.inria.fr/paris/


Paris method1

PARIS Method

[F. Suchanek et al. 2012]

P[literal1 ≡ literal2] = …   (based on similarity and co-occurrence)

P[Shangri-La ≡ Zhongdian] = … fun(bornIn⁻¹) · P[Jet Li ≡ Li Lianjie]

Same entity, if relations were already aligned:

P[x ≡ y] = (1 − Π_{r(x,u), r(y,w)} (1 − fun(r⁻¹) · P[u ≡ w]))
           × Π_{r(x,u)} (1 − fun(r) · Π_{r(y,w)} (1 − P[u ≡ w]))

The second factor considers negative evidence.

where fun(r) = (#x: ∃y: r(x,y)) / (#x,y: r(x,y)) is the degree to which r is a function

webdam.inria.fr/paris/
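
A small Python sketch of the functionality measure fun(r) = (#x: ∃y: r(x,y)) / (#x,y: r(x,y)) and of the positive-evidence part of the same-entity probability, P[x ≡ y] = 1 − Π (1 − fun(r⁻¹) · P[u ≡ w]) over facts r(x,u), r(y,w). It assumes relations are already aligned by name, omits the negative-evidence factor, and computes functionality on the first fact set only. The toy facts and all function names are invented for illustration.

```python
from collections import defaultdict


def functionality(facts):
    """fun(r): number of distinct subjects of r divided by number of r-facts."""
    subjects, pairs = defaultdict(set), defaultdict(set)
    for s, r, o in facts:
        subjects[r].add(s)
        pairs[r].add((s, o))
    return {r: len(subjects[r]) / len(pairs[r]) for r in pairs}


def inverse_functionality(facts):
    """fun(r^-1): functionality after swapping subject and object."""
    return functionality((o, r, s) for s, r, o in facts)


def prob_same(x, y, facts1, facts2, inv_fun, p_equal):
    """Positive evidence only: 1 - prod(1 - fun(r^-1) * P[u = w])."""
    miss = 1.0
    for s1, r1, u in facts1:
        if s1 != x:
            continue
        for s2, r2, w in facts2:
            if s2 == y and r2 == r1:
                miss *= 1.0 - inv_fun[r1] * p_equal(u, w)
    return 1.0 - miss


# Toy knowledge bases (hypothetical facts).
KB1 = [("a1", "livesIn", "Paris"), ("a2", "livesIn", "Paris"),
       ("a1", "worksAt", "Inria"), ("a2", "worksAt", "Inria")]
KB2 = [("b1", "livesIn", "Paris"), ("b1", "worksAt", "Inria")]


def literal_equal(u, w):
    return 1.0 if u == w else 0.0


fun = functionality(KB1)              # livesIn is a function here: 1.0
inv_fun = inverse_functionality(KB1)  # two people share each object: 0.5
p = prob_same("a1", "b1", KB1, KB2, inv_fun, literal_equal)
```

Each shared value alone gives weak evidence because two entities in KB1 have it, but two such agreements combine to 1 − 0.5 · 0.5 = 0.75.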


Paris method2

PARIS Method

[F. Suchanek et al. 2012]

P[s ⊆ r]: sub-relation

If entities were already resolved:

P[s ⊆ r] = (#x,u: s(x,u) ∧ r(x,u)) / (#x,u: s(x,u))

With same-entity probabilities:

P[s ⊆ r] = Σ_{s(x,u)} (1 − Π_{r(y,w)} (1 − P[x ≡ y] · P[u ≡ w])) / Σ_{s(x,u)} (1 − Π_{y,w} (1 − P[x ≡ y] · P[u ≡ w]))

webdam.inria.fr/paris/


Paris method3

PARIS Method

[F. Suchanek et al. 2012]

Same entity revisited, with sub-relation probability:

P[x ≡ y] = (1 − Π_{s(x,u), r(y,w)} (1 − P[s ⊇ r] · fun(s⁻¹) · P[u ≡ w]) × (1 − P[s ⊆ r] · fun(r⁻¹) · P[u ≡ w]))
           × Π_{s(x,u), r(y,w)} (1 − P[s ⊇ r] · fun(s) · Π_{r(y,w)} (1 − P[u ≡ w])) × (1 − P[s ⊆ r] · fun(r) · Π_{r(y,w)} (1 − P[u ≡ w]))

The second line considers negative evidence.

webdam.inria.fr/paris/


Paris method4

PARIS Method

[F. Suchanek et al. 2012]

P[c ⊆ d]: sub-class

If entities were already resolved:

P[c ⊆ d] = (#x: type(x,c) ∧ type(x,d)) / (#x: type(x,c))

With same-entity probabilities:

P[c ⊆ d] = Σ_{x: type(x,c)} (1 − Π_{y: type(y,d)} (1 − P[x ≡ y])) / (#x: type(x,c))

webdam.inria.fr/paris/
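
The sub-class estimate with same-entity probabilities, P[c ⊆ d] = Σ_{x: type(x,c)} (1 − Π_{y: type(y,d)} (1 − P[x ≡ y])) / (#x: type(x,c)), can be sketched in Python as follows. The function name, instance ids and probabilities are hypothetical.

```python
def prob_subclass(instances_c, instances_d, p_same):
    """Expected fraction of c-instances that have an equivalent d-instance."""
    if not instances_c:
        return 0.0
    total = 0.0
    for x in instances_c:
        miss = 1.0  # probability that x equals none of the d-instances
        for y in instances_d:
            miss *= 1.0 - p_same(x, y)
        total += 1.0 - miss
    return total / len(instances_c)


# Toy same-entity probabilities (invented).
P_SAME = {("x1", "y1"): 1.0, ("x2", "y2"): 0.5}


def p_same(x, y):
    return P_SAME.get((x, y), 0.0)


p_sub = prob_subclass(["x1", "x2"], ["y1", "y2"], p_same)
```

With probabilities restricted to 0 and 1 this reduces to the counting formula for already resolved entities; here x1 certainly has a counterpart in d and x2 has one with probability 0.5, giving (1 + 0.5) / 2 = 0.75.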


Partitioned mln method

Partitioned MLN Method

[V. Rastogi et al. 2011]

  • Use Markov Logic Network for entity resolution

  • Partition MLN with replication of nodes so that:

  • Each node has its neighborhood in the same partition

  • Repeat

    • local computation:

    • run MLN inference via MCMC on each partition (in parallel)

    • message passing:

    • exchange beliefs (on sameAs) among partitions with overlapping node sets

  • Until convergence

R1: sim(x,y) ⇒ sameAuthor(x,y)

R2: sim(x,y) ∧ coAuthor(x,a) ∧ coAuthor(y,b) ∧ sameAuthor(a,b) ⇒ sameAuthor(x,y)


Linda linked data alignment at scale

LINDA: Linked Data Alignment at Scale

[C. Böhm et al. 2012]

  • uses context sim and joint inference to process sameAs matrix with transitivity and other constraints

    • alternates between setting sameAs and recomputing sim

    • puts promising candidate pairs in priority queue

  • queue is partitioned and processing parallelized

[Figure: the input queue Q and the input entity graph G are distributed across nodes 1 … n as Q-part 1 … Q-part n and G-part 1 … G-part n; candidate pairs pass through the steps (1) accept, (2) notify, (3) update, (4) register, with queue updates exchanged between nodes and results written to the matrix X over entities e1 … em]

  • Experiment with BTC+ dataset:

    • 3 billion quads

    • 345 million triples

    • 95 million URIs

  • Result after 30 h run-time:

    • 12.3 million sameAs

    • 66% precision

    • > 80% for DBpedia-YAGO


Cross lingual linking

Cross-Lingual Linking

+ simpler than monolingual: natural equivalences, interwiki links

− harder than monolingual: different terminologies & structures

baike.baidu.com: 4 million articles

en.wikipedia.org: 3.5 million articles

Source: Z. Wang et al.: WWW‘12

Z. Wang et al. WWW‘12: factor-graph learning → 200,000 sameAs

T. Nguyen et al. VLDB‘12: sim features & LSI → infobox mappings


Challenges remaining

Challenges Remaining

Entity linkage is at the heart of semantic data integration!

More than 50 years of research, still some way to go!

  • Highly related entities with ambiguous names

  • George W. Bush (jun.) vs. George H.W. Bush (sen.)

  • Out-of-Wikipedia entities with sparse context

  • Enterprise data (perhaps combined with Web2.0 data)

  • Records with complex DB / XML / OWL schemas

  • Entities with very noisy context (in social media)

  • Benchmarks:

  • OAEI Ontology Alignment & Instance Matching: oaei.ontologymatching.org

  • TAC KBP Entity Linking: www.nist.gov/tac/2012/KBP/

  • TREC Knowledge Base Acceleration: trec-kba.org


Trec task knowledge base acceleration

TREC Task: Knowledge Base Acceleration

  • Goal: assist Wikipedia / KB editors

  • recommend key citations as evidence of truth

  • recommend infobox structure and categories

  • recommend entity links and external links

http://trec-kba.org



Outline3

Outline

Motivation

Entity-Name Disambiguation

Mapping Questions into Queries

Entity Linkage

Wrap-up

...


Take home lessons

Take-Home Lessons

Web of Linked Data is great

100‘s of KB‘s with 30 billion triples and 500 million links

mostly reference data, dynamic maintenance is bottleneck

connection with Web of Contents needs improvement

Entity detection and disambiguation is key

for creating sameAs links in text (RDFa, microformats)

for machine reading, semantic authoring, knowledge base acceleration, …

NED methods come close to human quality

combine popularity, similarity, and coherence

extend towards general WSD (e.g. for QA)

Linking entities across KB‘s is advancing

Integrated methods for aligning entities, classes and relations


Open problems and grand challenges

Open Problems and Grand Challenges

Entity name disambiguation in difficult situations

Short and noisy texts about long-tail entities in social media

Robust disambiguation of entities, relations and classes

Relevant for question answering & question-to-query translation

Key building block for KB building and maintenance

Combine algorithms and crowdsourcing for NED & ER

with active learning, minimizing human effort or cost/accuracy

Automatic and continuously maintained sameAs links

for Web of Linked Data with high accuracy & coverage


End of part 3 questions

End of Part 3: Questions?


Recommended readings disambiguation

Recommended Readings: Disambiguation

  • J. Hoffart, M. A. Yosef, I. Bordino, et al.: Robust Disambiguation of Named Entities in Text. EMNLP 2011

  • J. Hoffart et al.: KORE: Keyphrase Overlap Relatedness for Entity Disambiguation. CIKM 2012

  • R.C. Bunescu, M. Pasca: Using Encyclopedic Knowledge for Named Entity Disambiguation. EACL 2006

  • S. Cucerzan: Large-Scale Named Entity Disambiguation Based on Wikipedia Data. EMNLP 2007

  • D.N. Milne, I.H. Witten: Learning to Link with Wikipedia. CIKM 2008

  • S. Kulkarni et al.: Collective annotation of Wikipedia entities in web text. KDD 2009

  • G. Limaye et al.: Annotating and Searching Web Tables Using Entities, Types and Relationships. PVLDB 2010

  • A. Rahman, V. Ng: Coreference Resolution with World Knowledge. ACL 2011

  • L. Ratinov et al.: Local and Global Algorithms for Disambiguation to Wikipedia. ACL 2011

  • M. Dredze et al.: Entity Disambiguation for Knowledge Base Population. COLING 2010

  • P. Ferragina, U. Scaiella: TAGME: on-the-fly annotation of short text fragments. CIKM 2010

  • X. Han, L. Sun, J. Zhao: Collective entity linking in web text: a graph-based method. SIGIR 2011

  • M. Tsagkias, M. de Rijke, W. Weerkamp: Linking Online News and Social Media. WSDM 2011

  • J. Du et al.: Towards High-Quality Semantic Entity Detection over Online Forums. SocInfo 2011

  • V.I. Spitkovsky, A.X. Chang: A Cross-Lingual Dictionary for English Wikipedia Concepts. LREC 2012

  • J.R. Finkel, T. Grenager, C. Manning: Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. ACL 2005

  • V. Ng: Supervised Noun Phrase Coreference Research: The First Fifteen Years. ACL 2010

  • S. Singh, A. Subramanya, F.C.N. Pereira, A. McCallum: Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models. ACL 2011

  • T. Lin et al.: No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities. EMNLP 2012

  • A. Rahman, V. Ng: Inducing Fine-Grained Semantic Classes via Hierarchical Classification. COLING 2010

  • X. Ling, D.S. Weld: Fine-Grained Entity Recognition. AAAI 2012

  • R. Navigli: Word sense disambiguation: A survey. ACM Comput. Surv. 41(2), 2009

  • M. Yahya et al.: Natural Language Questions for the Web of Data. EMNLP 2012

  • S. Shekarpour: Automatically Transforming Keyword Queries to SPARQL on Large-Scale KBs. ISWC 2011


Recommended readings linked data and entity linkage

Recommended Readings: Linked Data and Entity Linkage

  • T. Heath, C. Bizer: Linked Data: Evolving the Web into a Global Data Space. Morgan&Claypool, 2011

  • A. Hogan, et al.: An empirical survey of Linked Data conformance. J. Web Sem. 14, 2012

  • H. Glaser, A. Jaffri, I.C. Millard: Managing Co-Reference on the Semantic Web. LDOW 2009

  • J. Volz, C. Bizer, M. Gaedke, G. Kobilarov: Discovering and Maintaining Links on the Web of Data. ISWC 2009

  • F. Naumann, M. Herschel: An Introduction to Duplicate Detection. Morgan&Claypool, 2010

  • H. Köpcke et al.: Learning-Based Approaches for Matching Web Data Entities. IEEE Internet Computing 2010

  • H. Köpcke et al.: Evaluation of entity resolution approaches on real-world match problems. PVLDB 2010

  • S. Melnik, H. Garcia-Molina, E. Rahm: Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching. ICDE 2002

  • S. Chaudhuri, V. Ganti, R. Motwani: Robust Identification of Fuzzy Duplicates. ICDE 2005

  • S.E. Whang et al.: Entity Resolution with Iterative Blocking. SIGMOD 2009

  • S.E. Whang, H. Garcia-Molina: Joint Entity Resolution. ICDE 2012

  • L. Kolb, A. Thor, E. Rahm: Load Balancing for MapReduce-based Entity Resolution. ICDE 2012

  • J. Li, J. Tang, Y. Li, Q. Luo: RiMOM: A dynamic multistrategy ontology alignment framework. TKDE 21(8), 2009

  • P. Singla, P. Domingos: Entity Resolution with Markov Logic. ICDM 2006

  • I. Bhattacharya, L. Getoor: Collective Entity Resolution in Relational Data. TKDD 1(1), 2007

  • R. Hall, C.A. Sutton, A. McCallum: Unsupervised deduplication using cross-field dependencies. KDD 2008

  • V. Rastogi, N. Dalvi, M. Garofalakis: Large-Scale Collective Entity Matching. PVLDB 2011

  • F. Suchanek et al.: PARIS: Probabilistic Alignment of Relations, Instances, and Schema. PVLDB 2012

  • Z. Wang, J. Li, Z. Wang, J. Tang: Cross-lingual knowledge linking across wiki knowledge bases. WWW 2012

  • T. Nguyen et al.: Multilingual Schema Matching for Wikipedia Infoboxes. PVLDB 2012

  • A. Hogan et al.: Scalable and distributed methods for entity matching. J. Web Sem. 10, 2012

  • C. Böhm et al.: LINDA: Distributed Web-of-Data-Scale Entity Matching. CIKM 2012

  • J. Wang, T. Kraska, M. Franklin, J. Feng: CrowdER: Crowdsourcing Entity Resolution. PVLDB 2012


Knowledge harvesting overall take home lessons

Knowledge Harvesting: Overall Take-Home Lessons

KB‘s are great opportunity in the big-data era:

revive old AI vision, make it real & large-scale!

challenging, but high pay-off

Strong success story on entities and classes

Good progress on relational facts

Methods for open-domain relation discovery

Many opportunities remaining:

temporal knowledge, spatial, visual, commonsense

vertical domains: health, music, travel, …

Search and ranking:

Combine facts (SPO triples) with witness text

Extend SPARQL, LM‘s for ranking, UI unclear

Entity linking:

From names in text to entities in KB

sameAs between entities in different KB‘s / DB‘s


Knowledge harvesting research opportunities challenges

Knowledge Harvesting: Research Opportunities & Challenges

Explore & exploit synergies between semantic, statistical, & social Web methods:

statistical evidence + logical consistency + wisdom of the crowd!

  • For DB / AI / IR / NLP / Web researchers:

    • efficiency & scalability

    • consistency constraints & reasoning

    • search and ranking

    • deep linguistic patterns & statistics

    • text (& speech) disambiguation

    • killer app for uncertain data management

    • knowledge-base life-cycle

    • and more


Gerhard weikum max planck institute for informatics mpi inf mpg de weikum

cmn: 非常谢谢你

yue: 唔該

wu: 谢谢侬

dai: ขอบคุณ

tib: ཐུགས་རྗེ་ཆེ་།

en: thank you

expression of

gratitude

de: vielen Dank

fr: Merci beaucoup

es: muchas gracias

ru: Большое спасибо

