Slide1 l.jpg
This presentation is the property of its rightful owner.
Sponsored Links
1 / 53

In collaboration with Giorgiana Ifrim, Gjergji Kasneci, Josiane Parreira, Maya Ramanath, Ralf Schenkel, Fabian Suchanek, Martin Theobald PowerPoint PPT Presentation


  • 120 Views
  • Uploaded on
  • Presentation posted in: General

In collaboration with Giorgiana Ifrim, Gjergji Kasneci, Josiane Parreira, Maya Ramanath, Ralf Schenkel, Fabian Suchanek, Martin Theobald. Vision. Opportunity: Turn the Web (and Web 2.0 and Web 3.0 ...) into the world‘s most comprehensive knowledge base (semantic DB).

Download Presentation

In collaboration with Giorgiana Ifrim, Gjergji Kasneci, Josiane Parreira, Maya Ramanath, Ralf Schenkel, Fabian Suchanek, Martin Theobald

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Slide1 l.jpg

In collaboration with Giorgiana Ifrim, Gjergji Kasneci,

Josiane Parreira, Maya Ramanath, Ralf Schenkel,

Fabian Suchanek, Martin Theobald


Vision l.jpg

Vision

Opportunity:

Turn the Web (and Web 2.0 and Web 3.0 ...) into the

world‘s most comprehensive knowledge base (semantic DB)

Challenge: seize opportunity and make it happen!

  • Approach: combine and exploit synergies of

    • hand-crafted, high-quality knowledge sources

    •  Semantic Web

    • automatic knowledge extraction

    •  Statistical Web

    • social networks and human computing

    •  Social Web

Gerhard Weikum ADBIS Oct 1, 2007


Proof of relevance l.jpg

Proof of Relevance

Vannevar Bush:

As We May Think,

1945.

There is a growing mountain of research.

A memex is a device in which an individual stores all his books,

records, and communications, and which is mechanized so that

it may be consulted with exceeding speed and flexibility.

It is an enlarged intimate supplement to his memory.

Gerhard Weikum ADBIS Oct 1, 2007


Proof of relevance4 l.jpg

Proof of Relevance

Tim Berners-Lee: In the Semantic Web information is

given well-defined meaning.

Jim Gray: … system can answer questions

about the text as precisely and quickly

as a human expert.

Brewster Kahle: The goal of universal access to

our cultural heritage is within our grasp.

Jimmy Wales: Our big-picture vision is to share

knowledge with all of humanity.

Al Gore: The future will be better tomorrow.

Gerhard Weikum ADBIS Oct 1, 2007


Proof of relevance5 l.jpg

Proof of Relevance ?

To know that we know what we know,

and that we do not know what we do not know,

that is true knowledge. 

You cannot open a book without learning something.

A journey of a thousand miles begins with a single step.

Confucius, 551-479 BC

Ignorance is less remote from the truth

than prejudice.

When science, art, literature, and philosophy are simply the

manifestation of personality,

they can make a man's name live for thousands of years.

Sentences are like sharp nails, which force truth upon our memories.

Denis Diderot, 1713-1784

Gerhard Weikum ADBIS Oct 1, 2007


Why google and wikipedia are not enough l.jpg

Why Google and Wikipedia Are Not Enough

Turn the Web, Web2.0, and Web3.0 into the world‘s

most comprehensive knowledge base („semantic DB/graph“) !

Answer „knowledge queries“ such as:

proteins that inhibit both protease and some other enzyme

neutron stars with Xray bursts > 1040 erg s-1 & black holes in 10‘‘

differences in Rembetiko music from Greece and from Turkey

connection between Thomas Mann and Goethe

market impact of Web2.0 technology in December 2006

sympathy or antipathy for Germany from May to August 2006

Nobel laureate who survived both world wars and his children

drama with three women making a prophecy

to a British nobleman that he will become king

Gerhard Weikum ADBIS Oct 1, 2007


Outline l.jpg

Outline

Introduction: Search for Knowledge

  • Harvesting Knowledge

    • Leibniz Approach

    • Planck Approach

    • Darwin Approach

Conclusion

Gerhard Weikum ADBIS Oct 1, 2007


Three roads to knowledge l.jpg

Three Roads to Knowledge

Leibniz Approach:

Handcrafted High-Quality Knowledge Sources

(Semantic Web)

Planck Approach:

Large-scale Information Extraction & Harvesting

(Statistical Web)

Darwin Approach:

Social Wisdom from Web 2.0 Communities

(Social Web)

Gerhard Weikum ADBIS Oct 1, 2007


Slide9 l.jpg

Leibniz Approach (Semantic Web)

  • Handcrafted High-Quality Knowledge:

  • Ontologies and other Lexical Sources

  • Build on Rigorous Knowledge Atoms

  • („Characteristica Universalis“)

Gottfried Wilhelm Leibniz

(1646 - 1716)

Gerhard Weikum ADBIS Oct 1, 2007


High quality knowledge sources l.jpg

High-Quality Knowledge Sources

General-purpose ontologies for Semantic Web: SUMO, Cyc, etc.

Gerhard Weikum ADBIS Oct 1, 2007


High quality knowledge sources11 l.jpg

High-Quality Knowledge Sources

General-purpose thesauri and concept networks: WordNet family

woman, adult female – (an adult female person)

=> amazon, virago – (a large strong and aggressive woman)

=> donna -- (an Italian woman of rank)

=> geisha, geisha girl -- (...)

=> lady (a polite name for any woman)

...

=> wife – (a married woman, a man‘s partner in marriage)

=> witch – (a being, usually female, imagined to

have special powers derived from the devil)

Gerhard Weikum ADBIS Oct 1, 2007


High quality knowledge sources12 l.jpg

High-Quality Knowledge Sources

General-purpose thesauri and concept networks: WordNet family

  • 200 000 concepts and relations;

  • can be cast into

  • description logics or

  • graph, with weights for relation strengths

  • (derived from co-occurrence statistics)

enzyme -- (any of several complex proteins that are produced by cells and

act as catalysts in specific biochemical reactions)

=> protein -- (any of a large group of nitrogenous organic compounds

that are essential constituents of living cells; ...)

=> macromolecule, supermolecule

...

=> organic compound -- (any compound of carbon

and another element or a radical)

...

=> catalyst, accelerator -- ((chemistry) a substance that initiates or

accelerates a chemical reaction

without itself being affected)

=> activator -- ((biology) any agency bringing about activation; ...)

Gerhard Weikum ADBIS Oct 1, 2007


High quality knowledge sources13 l.jpg

High-Quality Knowledge Sources

Domain ontologies (UMLS, GeneOntology, etc.)

  • 1 Mio. biomedical concepts, 135 categories,

  • 54 relationships (e.g. virus causes (disease | symptom) )

Gerhard Weikum ADBIS Oct 1, 2007


High quality knowledge sources14 l.jpg

High-Quality Knowledge Sources

Wikipedia and other lexical sources

  • 2 Mio. articles

  • 40 Mio. hyperlinks

  • many 1000‘s of categories and lists

  • more than 100 languages

  • growing very fast

Gerhard Weikum ADBIS Oct 1, 2007


Exploit hand crafted knowledge l.jpg

Exploit Hand-Crafted Knowledge

Wikipedia, WordNet, and other lexical sources

{{Infobox_Scientist

| name = Max Planck

| birth_date = [[April 23]], [[1858]]

| birth_place = [[Kiel]], [[Germany]]

| death_date = [[October 4]], [[1947]]

| death_place = [[Göttingen]], [[Germany]]

| residence = [[Germany]]

| nationality = [[Germany|German]]

| field = [[Physicist]]

| work_institution = [[University of Kiel]]</br>

[[Humboldt-Universität zu Berlin]]</br>

[[Georg-August-Universität Göttingen]]

| alma_mater = [[Ludwig-Maximilians-Universität München]]

| doctoral_advisor = [[Philipp von Jolly]]

| doctoral_students =

[[Gustav Ludwig Hertz]]</br>

| known_for = [[Planck's constant]],

[[Quantum mechanics|quantum theory]]

| prizes = [[Nobel Prize in Physics]] (1918)

Gerhard Weikum ADBIS Oct 1, 2007


Yago yet another great ontology f suchanek g kasneci g weikum www 07 l.jpg

YAGO: Yet Another Great Ontology[F. Suchanek, G. Kasneci, G. Weikum: WWW‘07]

  • Turn Wikipedia into explicit knowledge base (semantic DB)

  • Exploit hand-crafted categoriesand templates

  • Represent facts as explicit knowledge triples:

  • relation (entity1, entity2)

  • (in FOL, compatible with RDF, OWL-lite, XML, etc.)

  • Map (and disambiguate) relations into WordNet concept DAG

relation

entity1

entity2

Examples:

bornIn

isInstanceOf

City

Max_Planck

Kiel

Kiel

Gerhard Weikum ADBIS Oct 1, 2007


Yago knowledge representation l.jpg

YAGO Knowledge Representation

Knowledge Base # Facts

KnowItAll 30 000

SUMO 60 000

WordNet 200 000

OpenCyc 300 000

Cyc 5 000 000

YAGO 6 000 000

Accuracy  97%

Entity

subclass

subclass

Person

concepts

Location

subclass

Scientist

subclass

subclass

subclass

subclass

City

Country

Biologist

Physicist

instanceOf

instanceOf

Erwin_Planck

Nobel Prize

bornIn

Kiel

hasWon

FatherOf

individuals

diedOn

bornOn

October 4, 1947

Max_Planck

April 23, 1858

means

means

means

“Max Karl Ernst Ludwig Planck”

“Dr. Planck”

“Max Planck”

words

Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/

Gerhard Weikum ADBIS Oct 1, 2007


Yago disambiguation uncertainty l.jpg

YAGO Disambiguation & Uncertainty

capture

confidence

value

for each fact

Entity

subclass

1.0

subclass

1.0

Person

Location

subclass

1.0

subclass

1.0

1.0

subclass

subclass

Mythological Figure

City

Country

Celebrity

0.7

instanceOf

0.8

instanceOf

0.9

instanceOf

1.0

0.4

instanceOf

locatedIn

Paris(Myth.)

Paris(France)

France

Paris Hilton

0.95

means

0.7

means

0.1

means

means

0.9

0.2

means

0.05

“Paris”

“La Grande Nation”

“France”

additional harvesting of relations from natural-language texts

by info-extraction tools

Gerhard Weikum ADBIS Oct 1, 2007


Naga graph ir on yago g kasneci et al www 07 l.jpg

NAGA: Graph IR on YAGO [G. Kasneci et al.: WWW‘07]

Graph-based search on YAGO-style knowledge bases

with built-in ranking based on statistical language model

discovery queries

hasWon

diedOn

Nobel prize

$a

$x

bornIn

isa

Kiel

$x

scientist

>

hasSon

diedOn

$y

$b

connectedness queries

isa

*

German novelist

Thomas Mann

Goethe

queries with regular expressions

hasFirstName | hasLastName

isa

Ling

$x

scientist

(coAuthor

| advisor)*

worksFor

locatedIn*

$y

Zhejiang

Beng Chin Ooi

Gerhard Weikum ADBIS Oct 1, 2007


Naga searching knowledge l.jpg

NAGA: Searching Knowledge

q: Fisher isa scientist

Fisher isa $x

[email protected] = Ronald_Fisher

[email protected] = scientist_109871938

$X = alumnus_109165182

[email protected] = Irving_Fisher

[email protected] = scientist_109871938

$X = social_scientist_109927304

[email protected] = James_Fisher

[email protected] = scientist_10981938

$X = ornithologist_109711173

[email protected] = Ronald_Fisher

[email protected] = scientist_109871938

$X = theorist_110008610

[email protected] = Ronald_Fisher

[email protected] = scientist_109871938

$X = colleague_109301221

[email protected] = Ronald_Fisher

[email protected] = scientist_109871938

$X = organism_100003226

mathematician_109635652   —subClassOf—>   scientist_109871938 Alumni_of_Gonville_and_Caius_College,_Cambridge   —subClassOf—>   alumnus_109165182 "Fisher"   —familyNameOf—>   Ronald_Fisher Ronald_Fisher   —type—>   Alumni_of_Gonville_and_Caius_College,_Cambridge Ronald_Fisher   —type—>   20th_century_mathematicians "scientist"   —means—>   scientist_109871938

Gerhard Weikum ADBIS Oct 1, 2007


Naga searching ranking knowledge l.jpg

NAGA: Searching & Ranking Knowledge

q: Fisher isa scientist

Fisher isa $x

[email protected] = Ronald_Fisher

[email protected] = scientist_109871938

$X = mathematician_109635652

[email protected] = Ronald_Fisher

[email protected] = scientist_109871938

$X = statistician_109958989

[email protected] = Ronald_Fisher

[email protected] = scientist_109871938

$X = president_109787431

[email protected] = Ronald_Fisher

[email protected] = scientist_109871938

$X = geneticist_109475749

[email protected] = Ronald_Fisher

[email protected] = scientist_109871938

$X = scientist_109871938

Score: 7.184462521168058E-13 mathematician_109635652   —subClassOf—>   scientist_109871938 "Fisher"   —familyNameOf—>   Ronald_FisherRonald_Fisher   —type—>   20th_century_mathematicians "scientist"   —means—>   scientist_109871938 20th_century_mathematicians   —subClassOf—>   mathematician_109635652

Online access at http://www.mpi-inf.mpg.de/~kasneci/naga/

Gerhard Weikum ADBIS Oct 1, 2007


Ranking factors l.jpg

Ranking Factors

  • Confidence:

  • Prefer results that are likely to be correct

    • Certainty of IE

    • Authenticity and Authority

      of Sources

bornIn (Max Planck, Kiel) from

„Max Planck was born in Kiel“

(Wikipedia)

livesIn (Elvis Presley, Mars) from

„They believe Elvis hides on Mars“

(Martian Bloggeria)

  • Informativeness:

  • Prefer results that are likely important

  • May prefer results that are likely new to user

    • Frequency in answer

    • Frequency in corpus (e.g. Web)

    • Frequency in query log

q: isa (Einstein, $y)

isa (Einstein, scientist)

isa (Einstein, vegetarian)

q: isa ($x, vegetarian)

isa (Einstein, vegetarian)

isa (Al Nobody, vegetarian)

  • Compactness:

  • Prefer results that are

  • tightly connected

    • Size of answer graph

vegetarian

Tom Cruise

isa

isa

bornIn

Einstein

won

1962

won

Nobel Prize

Bohr

diedIn

Gerhard Weikum ADBIS Oct 1, 2007


Summary of leibniz approach l.jpg

Summary of Leibniz Approach

Hand-crafted knowledge sources are great assets,

but expensive, partial, and isolated

Great mileage even from informal & semiformal sources

Connecting & reconciling different sources gives added value

(and sometimes is not even that hard)

  • Challenge:

  • Develop methods for comprehensive, highly accurate

  • mappings across many knowledge sources

    • Cross-lingual

    • Cross-temporal

    • Scalable

Gerhard Weikum ADBIS Oct 1, 2007


Slide24 l.jpg

Planck Approach (Statistical Web)

  • Information Extraction & Harvesting:

  • Gather Entities, Relations, Facts

  • Live with Uncertainty

Max Planck

(1858 - 1947)

Gerhard Weikum ADBIS Oct 1, 2007


Slide25 l.jpg

Information Extraction (IE): Text to Records

Person BirthDate BirthPlace ...

Max Planck 4/23, 1858 Kiel

Albert Einstein 3/14, 1879 Ulm

Mahatma Gandhi 10/2, 1869 Porbandar

Person ScientificResult

Max Planck Quantum Theory

Person Collaborator

Max Planck Albert Einstein

Max Planck Niels Bohr

Constant Value Dimension

Planck‘s constant 6.2261023 Js

combine NLP, pattern matching, lexicons, statistical learning

Gerhard Weikum ADBIS Oct 1, 2007


Ie technology rules patterns learning l.jpg

IE Technology: Rules, Patterns, Learning

<lecture>

NP

NP

NN

IN

DT

NP

VB

IN

DT

ADJ

NN

PP

NP

IN

CD

  • For natural-language text and for heterogeneous sources:

  • NLP techniques (parser, PoS tagging) for tokenization

  • identify patterns (e.g. regular expressions) as features

  • train statistical learners for segmentation and labeling

  • use learned model to automatically tag newly seen input

Training data:

The WWW conference takes place in Banff in Canada.

Today‘s keynote speaker is Dr. Berners-Lee from W3C.

The panel in Edinburgh, chaired by Ron Brachman from Yahoo!, …

<location>

<organization>

<person>

<event>

Ian Foster, father of the Grid, talks at the GES conference in Germany on 05/02/07.

<person>

<event>

<location>

<date>

Gerhard Weikum ADBIS Oct 1, 2007


Knowledge acquisition from the web l.jpg

Knowledge Acquisition from the Web

  • Learn Semantic Relations from Entire Corpora at Large Scale

  • (as exhaustively as possible but with high accuracy)

  • Examples:

  • all cities, all basketball players, all composers

  • headquarters of companies, CEOs of companies, synonyms of proteins

  • birthdates of people, capitals of countries, rivers in cities

  • which musician plays which instruments

  • who discovered or invented what

  • which enzyme catalyzes which biochemical reaction

Existing approaches and tools use

almost-unsupervised pattern matching and learning:

seeds (known facts)  patterns (in text)  (extraction) rule  (new) facts

Gerhard Weikum ADBIS Oct 1, 2007


Methods for web scale fact extration l.jpg

Methods for Web-Scale Fact Extration

Assessment of facts & generation of rules based on statistics

Rules can be more sophisticated:

playing NN: (ADJ|ADV)* NP & class(head(NP))=person

 plays (head(NP), NN)

seeds  text  rules  new facts

Example:

city (Seattle) in downtown Seattle

city (Seattle) Seattle and other towns

city (Las Vegas) Las Vegas and other towns

plays (Zappa, guitar) playing guitar: … Zappa

plays (Davis, trumpet) Davis … blows trumpet

Example:

city (Seattle) in downtown Seattle in downtown X

city (Seattle) Seattle and other towns X and other towns

city (Las Vegas) Las Vegas and other townsX and other towns

plays (Zappa, guitar) playing guitar: … Zappaplaying Y: … X

plays (Davis, trumpet) Davis … blows trumpetX … blows Y

Example:

city (Seattle) in downtown Seattle in downtown X

city (Seattle) Seattle and other towns X and other towns

city (Las Vegas) Las Vegas and other towns X and other towns

plays (Zappa, guitar) playing guitar: … Zappaplaying Y: … X

plays (Davis, trumpet) Davis … blows trumpetX … blows Y

Example:

city (Seattle) in downtown Seattlein downtown X

city (Seattle) Seattle and other townsX and other towns

city (Las Vegas) Las Vegas and other townsX and other towns

plays (Zappa, guitar) playing guitar: … Zappaplaying Y: … X

plays (Davis, trumpet) Davis … blows trumpetX … blows Y

in downtown Delhicity(Delhi)

Coltrane blows saxplays(C., sax)

city(Delhi)

plays(Coltrane, sax)

city(Delhi) old center of Delhi

plays(Coltrane, sax) sax player Coltrane

city(Delhi) old center of Delhiold center of X

plays(Coltrane, sax) sax player ColtraneY player X

Gerhard Weikum ADBIS Oct 1, 2007


Performance of web ie l.jpg

Performance of Web-IE

State-of-the-art precision/recall results:

relationprecision recall corpus systems

countries80% 90% Web KnowItAll

cities80% ??? Web KnowItAll

scientists60% ??? WebKnowItAll

CEOs80% 50% News Snowball, LEILA

birthdates80% 70% Wikipedia LEILA

instanceOf40% 20% WebText2Onto, LEILA

precision value-chain: entities 80%, attributes 70%, facts 60%, events 50%

Anecdotic evidence:

invented (A.G. Bell, telephone)

married (Hillary Clinton, Bill Clinton)

isa (yoga, relaxation technique)

isa (zearalenone, mycotoxin)

contains (chocolate, theobromine)

contains (Singapore sling, gin)

invented (Johannes Kepler, logarithm tables)

married (Segolene Royal, Francois Hollande)

isa (yoga, excellent way)

isa (your day, good one)

contains (chocolate, raisins)

plays (the liver, central role)

makes (everybody, mistakes)

Gerhard Weikum ADBIS Oct 1, 2007


Beyond surface learning with leila l.jpg

Beyond Surface Learning with LEILA

(Cairo, Rhine), (Rome, 0911), (, [0..9]*), …

(Cologne, Rhine), (Cairo, Nile), …

NP

VP

PP

NP

NP

PP

NP

NP

NP

PP

NP

VP

NP

PP

NP

NP

NP

Cologne lies on the banks of the Rhine

People in Cairo like wine from the Rhine valley

Mp

Js

Os

AN

Ss

MVp

DMc

Mp

Dg

Jp

Js

Sp

Mvp

Ds

Js

NP

VP

VP

PP

NP

NP

PP

NP

NP

Paris was founded on an island in the Seine

(Paris, Seine)

Ss

Pv

MVp

Ds

DG

Js

Js

MVp

We visited Paris last summer. It has many museums along the banks of the Seine.

Learning to Extract Information by Linguistic Analysis [F. Suchanek et al.: KDD’06]

Limitation of surface patterns:

who discovered or invented what

“Tesla’s work formed the basis of AC electric power”

“Al Gore funded more work for a better basis of the Internet”

Almost-unsupervised Statistical Learning with Dependency Parsing

LEILA outperforms other Web-IE methods in precision and recall,

but dependency parser is slow

Gerhard Weikum ADBIS Oct 1, 2007


Ie efficiency and accuracy tradeoffs l.jpg

IE Efficiency and Accuracy Tradeoffs

[see also tutorials by Cohen, Doan/Ramakrishnan/Vaithyanathan, Agichtein/Sarawagi]

IE is cool, but what‘s in it for DB folks?

  • precision vs. recall: two-stage processing (filter pipeline)

    • recall-oriented harvesting

    • precision-oriented scrutinizing

  • preprocessing

    • indexing: NLP trees & graphs, N-grams, PoS-tag patterns ?

    • exploit ontologies? exploit usage logs ?

  • turn crawl&extract into set-oriented query processing

    • candidate finding

    • efficient phrase, pattern, and proximity queries

    • optimizing entire text-mining workflows [Ipeirotis et al.: SIGMOD‘06]

Gerhard Weikum ADBIS Oct 1, 2007


Summary of planck approach l.jpg

Summary of Planck Approach

Human text (and speech) is diverse and produced at

higher rate than manual high-quality annotations

IE offers reasonably robust and scalable methods for

harvesting named entities and binary relations

? Deep NLP and advanced ML are computational bottleneck

? Disambiguation (entity matching, record linkage) needed

„Joe Hellerstein (UC Berkeley)“ = „Prof. Joseph M. Hellerstein, California)“

„Max Planck Institute“ = „MPI“ ≠ „MPI“ = „Message Passing Institute“

  • Challenge:

  • Achieve Web-scale IE throughput that can

  • sustain rate of new content production (e.g. blogs)

  • (may need large-scale P2P/Grid)

  • with > 90% accuracy and Wikipedia-like coverage

Gerhard Weikum ADBIS Oct 1, 2007


Slide33 l.jpg

Darwin Approach (Social Web)

  • Social Wisdom & Natural Selection:

  • Evolution of (Web 2.0) species

  • Survival of the fittest

Charles Darwin

(1809 - 1882)

Gerhard Weikum ADBIS Oct 1, 2007


Slide34 l.jpg

„Wisdom of Crowds“ at Work on Web 2.0

  • Information enrichment & knowledge extraction by humans:

  • Collaborative Recommendations & QA

    • Amazon (product ratings & reviews, recommended products)

    • Netflix: movie DVD rentals  $ 1 Mio. Challenge

    • answers.yahoo, iknow.baidu, etc.

  • Social Tagging and Folksonomies

    • del.icio.us: Web bookmarks and tags

    • flickr: photo annotation, categorization, rating

    • YouTube: same for video

  • Human Computing in Game Form

    • ESP and Google Image Labeler: image tagging

    • Peekaboom: image segmenting and tagging

    • Verbosity: facts from natural-language sentences

  • Online Communities

    • dblife.cs.wisc.edu for database research

    • www.lt-world.org for language technology

    • Yahoo! Groups, Myspace, Facebook, etc. etc.

Gerhard Weikum ADBIS Oct 1, 2007


Social tagging example flickr 1 l.jpg

Social Tagging: Example Flickr (1)

Gerhard Weikum ADBIS Oct 1, 2007


Social tagging example flickr 2 l.jpg

Social Tagging: Example Flickr (2)

Gerhard Weikum ADBIS Oct 1, 2007


Social tagging example flickr 3 l.jpg

Social Tagging: Example Flickr (3)

Gerhard Weikum ADBIS Oct 1, 2007


Slide38 l.jpg

Social-Tagging Community

> 1 Mio. users

> 100 Mio. photos

> 1 Bio. tags

30% monthly growth

Gerhard Weikum ADBIS Oct 1, 2007

Source: www.flickr.com


Slide39 l.jpg

ESP Game [Luis von Ahn et al. 2004 ]

played against random, anonymous partner on Internet

taboo:

pyramid

Louvre

museum

Paris

art

  • Game with a purpose

  • Collects annotations (wisdom)

  • Can exploit tag statistics (crowds)

  • Attracts people, fun to play, some play hours

  • ESP game collected > 10 Mio. tags from > 20000 users

  • 5000 people could tag all photos on the Web in 4 weeks

  • (human computing)

your partner has suggested:

3 labels

your partner has suggested:

7 labels

your partner has suggested:

11 labels

your partner has suggested:

17 labels

my labels:

my labels:

reflection

my labels:

reflection

water

my labels:

reflection

water

Mitterand

Mona Lisa

my labels:

reflection

water

Mitterand

Mona Lisa

metro lignes 7, 14

my labels:

reflection

water

Mitterand

Mona Lisa

metro lignes 7, 1

Da Vinci code

Congratulations!

You scored 1 point!

Gerhard Weikum ADBIS Oct 1, 2007


More human computing l.jpg

More Human Computing

  • Verbosity [von Ahn 2006]:

  • Collect common-knowledge facts (relation instances)

  • 2 players: Narrator (N) and Guessor (G)

  • N gives stylized clues:

  • is a kind of …, is used for …, is typically near/in/on …, is the opposite of …

  • random pairing for independence,

  • can build statistics over many games for same concept

Peekaboom, Phetch, etc.:

locating & tagging objects in images, finding images, etc.

  • incentives to play ?

  • game design for moving up the value-chain ?

Gerhard Weikum ADBIS Oct 1, 2007


Slide41 l.jpg

Dark Side of Social Wisdom

  • Spam (Web spam – not just for email anymore):

    • lucky online casino, easy MBA diploma, cheap V!-4-gra, etc.;

    • law suits about „appropriate Google rank“

  • Truthiness:

    • degree to which something is truthy (not necessarily facty);

    • truthy := property of something you know from your guts

  • Disputes:

    • editorial fights over critical Wikipedia articles;

    • Citizendium: new endeavor with "gentle expert oversight"

  • Dishonesty, Bias, …

Gerhard Weikum ADBIS Oct 1, 2007


Slide42 l.jpg

Gerhard Weikum ADBIS Oct 1, 2007


Slide43 l.jpg

Gerhard Weikum ADBIS Oct 1, 2007


Slide44 l.jpg

The Wisdom of Crowds: PageRank

Authority (page q) =

stationary prob. of visiting q

PageRank (PR): links are endorsements & increase page authority,

authority is higher if links come from high-authority pages

Social Ranking

with

and

equivalent to

principal eigenvector:

random walk: uniformly random choice of links + random jumps;

add bias to transitions and jumps for personal PR, TrustRank, etc.

Gerhard Weikum ADBIS Oct 1, 2007


Slide45 l.jpg

The Wisdom of Crowds: Beyond PR

users

tags

docs

Typed graphs: data items, users, friends, groups,

postings, ratings, queries, clicks, …

with weighted edges spectral analysis of various graphs

Evolving over time  tensor analysis

Gerhard Weikum ADBIS Oct 1, 2007


Decentralized graph analysis l.jpg

Decentralized Graph Analysis

global graph

local

subgraph 1

local subgraph 3

local sub-

graph 2

  • Graph spectral analysis applied to:

  • pages, sites, tags, users, groups, queries, clicks, opinions, etc. as nodes

  • assessment and interaction relations as weighted edges

  • can compute various notions of authority, reputation, trust, quality

Decentralized computation in peer-to-peer network

with arbitrary, a-priori unknown overlaps of graph fragments

Gerhard Weikum ADBIS Oct 1, 2007


Jxp algorithm j x parreira g weikum webdb 05 vldb 06 l.jpg

JXP Algorithm [J.X. Parreira, G. Weikum: WebDB’05, VLDB’06]

Decentralized, asynchronous, peer-to-peer algorithm

based on theory of Markov-chain aggregation (state lumping)

[P.J. Courtois 1977, C.D. Meyer 1988]

  • each peer aggregates non-local part of global graph into „world node“

  • peers meet randomly,

  • exchange data about their local computations, and

  • iterate their local computations

Theorem:

authority scores

from local computations

converge to global scores

supported by Minerva system

http://www.mpi-inf.mpg.de/departments/d5/software/minerva/index.html

Gerhard Weikum ADBIS Oct 1, 2007


Summary of darwin approach l.jpg

Summary of Darwin Approach

Social tagging and social networks (Web 2.0) are

potentially valuable knowledge sources

Games (human computing) are an interesting way

of enticing „knowledge input“ and collecting statistics

Spectral analysis is a highly versatile tool for rating & ranking

that can be extended and scaled by decentralized algorithms

  • Challenges:

  • Design a game that intrigues serious scientists

  • to „semantically“ annotate their scholarly work

  • Develop an analysis method that identifies the „best“ facts,

  • resilient to egoistic and malicious behaviors (incl. coalitions)

Gerhard Weikum ADBIS Oct 1, 2007


Outline49 l.jpg

Outline

Introduction: Search for Knowledge

  • Harvesting Knowledge

    • Leibniz Approach

    • Planck Approach

    • Darwin Approach

Conclusion

Gerhard Weikum ADBIS Oct 1, 2007


Summary l.jpg

Summary

  • Harvesting knowledge & organizing in semantic DB/graph for

    • scholarly Web,

    • digital libraries,

    • enterprise know-how,

    • online communities, etc.

  • Three roads to knowledge:

  • Leibniz / Semantic Web: ontologies, encyclopedia, etc.

  • Planck / Statistical Web: large-scale IE from text, speech, etc.

  • Darwin / Social Web: wisdom of crowds, tagging, folksonomies

  • Not covered here: search and ranking

    • graph IR (for ER graphs, RDF, cross-linked XML, etc.)

    • new ranking models (e.g. statistical LM for graphs)

    • efficient and scalable query processing

Gerhard Weikum ADBIS Oct 1, 2007


Major challenges l.jpg

Major Challenges

  • Generalize YAGO approach (Wikipedia + WordNet)

  • Methods for comprehensive, highly accurate

  • mappings across many knowledge sources

    • cross-lingual, cross-temporal

    • scalable in size, diversity, number of sources

  • Pursue DB support towards efficient IE (andNLP)

  • Achieve Web-scale IEthroughput that can

    • sustain rate of new content production (e.g. blogs)

    • with > 90% accuracy and Wikipedia-like coverage

  • Integrate handcrafted knowledge with NLP/ML-based IE

  • Incorporate social tagging and human computing

Gerhard Weikum ADBIS Oct 1, 2007


Potential synergies among leibniz planck and darwin l.jpg

Potential Synergies amongLeibniz, Planck, and Darwin

knowledge core

bootstrap

Leibniz

Semantic Web

Planck

Statistical Web

emerge

validate

statistics &

feedback

Darwin

Social Web

communities

Gerhard Weikum ADBIS Oct 1, 2007


Slide53 l.jpg

Thank you !

Gerhard Weikum ADBIS Oct 1, 2007


  • Login