slide1
Download
Skip this Video
Download Presentation
The Intelligence in Wikipedia Project

Loading in 2 Seconds...

play fullscreen
1 / 39

TheIntelligence in Wikipedia Project - PowerPoint PPT Presentation


  • 360 Views
  • Uploaded on

The Intelligence in Wikipedia Project Daniel S. We ld Department of Computer Science & Engineering University of Washington Seattle, WA, USA Joint Work with Fei Wu, Eytan Adar, Saleema Amershi, Oren Etzioni, James Fogarty, Raphael Hoffmann, Kayur Patel,

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'TheIntelligence in Wikipedia Project' - emily


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
slide1

The

Intelligence in Wikipedia

Project

Daniel S. Weld

Department of Computer Science & Engineering

University of Washington

Seattle, WA, USA

Joint Work with

Fei Wu, Eytan Adar, Saleema Amershi, Oren Etzioni, James Fogarty, Raphael Hoffmann, Kayur Patel,

Stef Schoenmackers & Michael Skinner

wikipedia for ai
Wikipedia for AI

Benefit to AI

  • Fantastic Corpus
  • Dynamic Environment
    • E.g., Bot Framework
  • Powers Reasoning
    • Semantic Distance Measure [Ponzetto&Strube07]
    • Word-Sense Disambiguation [Bunescu&Pasca06, Mihalcea07]
    • Coreference Resolution [Ponzetto&Strube06, Yang&Su07]
    • Ontology / Taxonomy [Suchanek07, Muchnik07]
    • Multi-Lingual Alignment [Adafre&Rijke06]
    • Question Answering [Ahn et al.05, Kaisser08]
  • Basis of Huge KB[Auer et al.07]
ai for wikipedia
AI for Wikipedia

Benefit to Wikipedia: Tools

Internal link maintenance

Infobox Creation

Schema Management

Reference suggestion & fact checking

Disambiguation page maintenance

Translation across languages

Vandalism Alerts

ComscoreMediaMetrix – August 2007

motivating vision

Albert Einstein was a German-born theoretical physicist …

Einstein was a guest lecturer at the Institute for Advanced Study in New Jersey

New Jersey is a state in the Northeastern region of the United States

Motivating Vision

Next-Generation Search = Information Extraction

+ Ontology

+ Inference

Which German Scientists Taught at US Universities?

next generation search
Next-Generation Search

Information Extraction

<Einstein, Born-In, Germany>

<Einstein, ISA, Physicist>

<Einstein, Lectured-At, IAS>

<IAS, In, New-Jersey>

<New-Jersey, In, United-States> …

Ontology

Physicist (x)  Scientist(x) …

Inference

Einstein = Einstein …

Albert Einstein was a German-born theoretical physicist …

New Jersey is a state in the Northeastern region of the United States

Scalable

Means

Self-Supervised

why mine wikipedia
Why Mine Wikipedia?

Pros

Cons

Natural-language

Missing data

Inconsistent

Low redundancy

  • High-quality, comprehensive
  • UID for key concepts
  • First sentence as definition
  • Infoboxes
  • Categories & lists
  • Redirection pages
  • Disambiguation pages
  • Revision history
  • Multilingual corpus
slide7

The

Intelligence in Wikipedia

Project

Outline

1. Self-supervised extraction from Wikipedia text

(and the greater Web)

2. Automatic ontology generation

3. Scalable probabilistic inference for Q/A

slide8

Outline

1. Self-supervised extraction from Wikipedia text

2. Automatic ontology generation

3. Scalable probabilistic inference for Q/A

slide9

Building on

    • SNOWBALL [Agichtein&Gravano 00]
    • MULDER [Kwok et al. TOIS01]
    • AskMSR [Brill et al. EMNLP02]
    • KnowItAll [Etzioni et al. AAAI04]

Outline

1. Self-supervised extraction from Wikipedia text

2. Automatic ontology generation

3. Scalable probabilistic inference for Q/A

slide10

Kylin: Self-Supervised

Information Extraction from Wikipedia

[Wu & Weld CIKM 2007]

From infoboxes to a training set

Clearfield County was created in 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until 1812.

Its county seat is Clearfield.

2,972 km² (1,147 mi²) of it is land and 17 km²  (7 mi²) of it (0.56%) is water.

As of 2005, the population density was 28.2/km².

slide12

Preliminary Evaluation

  • Kylin Performed Well on Popular Classes:
  • Precision: mid 70% ~ high 90%
  • Recall: low 50% ~ mid 90%
  • ... Floundered on Sparse Classes – Little Training Data

82% < 100 instances; 40% <10 instances

slide13

Shrinkage?

person

(1201)

.birth_place

performer

(44)

.location

.birthplace

.birth_place

.cityofbirth

.origin

actor

(8738)

comedian

(106)

slide14

Outline

1. Self-Supervised Extraction from Wikipedia Text

  • Training on Infoboxes
  • Improving Recall – Shrinkage, Retraining, Web Extraction
  • Community Content Creation

2. Automatic Ontology Generation

3. Scalable Probabilistic Inference for Q/A

subsumption detection
Subsumption Detection
  • Binary Classification Problem
  • Nine Complex Features

E.g., String Features

… IR Measures

… Mapping to Wordnet

… Hearst Pattern Matches

… Class Transitions in Revision History

  • Learning Algorithm

SVM & MLN Joint Inference

Person

6/07: Einstein

Scientist

Physicist

schema mapping
Schema Mapping

Performer

Person

birth_date

birth_place

name

other_names

birthdate

location

name

othername

  • Heuristics
    • Edit History
    • String Similarity
  • Experiments
    • Precision: 94% Recall: 87%
  • Future
    • Integrated Joint Inference
slide19

Outline

1. Self-Supervised Extraction from Wikipedia Text

  • Training on Infoboxes
  • Improving Recall – Shrinkage, Retraining, Web Extraction
  • Community Content Creation

2. Automatic Ontology Generation

3. Scalable Probabilistic Inference for Q/A

improving recall on sparse classes wu et al kdd 08
Improving Recall on Sparse Classes [Wu et al. KDD-08]

Shrinkage

Extra Training Examples from Related Classes

How Weight New Examples?

Retraining

Compare Kylin Extractions with Ones from Textrunner [Banko et al. IJCAI-07]

Additional Positive Examples

Eliminate False Negatives

Extraction from Broader Web

person

(1201)

performer

(44)

actor

(8738)

comedian

(106)

slide22

Effect of Shrinkage & Retraining

1755% improvement

for a sparse class

13.7% improvement

for a popular class

related work on ontology driven information extraction
Related Work on Ontology Driven Information Extraction
  • SemTag and Seeker

[Dill WWW03]

  • PANKOW

[Cimiano WWW05]

  • OntoSyphon

[McDowell & Cafarella ISWC06]

improving recall on sparse classes wu et al kdd 0824
Improving Recall on Sparse Classes [Wu et al. KDD-08]

Shrinkage

Retraining

Extract from Broader Web

44% of Wikipedia Pages = “stub”

Extractor quality irrelevant

Query Google & Extract

How maintain high precision?

Many Web pages noisy, describe multiple objects

How integrate with Wikipedia extractions?

slide26

Outline

1. Self-Supervised Extraction from Wikipedia Text

  • Training on Infoboxes
  • Improving Recall – Shrinkage, Retraining, Web Extraction
  • Community Content Creation

2. Automatic Ontology Generation

3. Scalable Probabilistic Inference for Q/A

problem
Problem
  • Information Extraction is Imprecise
    • Wikipedians Don’t Want 90% Precision
  • How Improve Precision?
    • People!
contributing as a non primary task
Contributing as a Non-Primary Task
  • Encourage contributions
  • Without annoying or abusing readers
    • Compared 5 different interfaces
adwords deployment study
Adwords Deployment Study

[Hoffman et al. 2008]

  • 2000 articles containing writer infobox
  • Query for “ray bradbury” would show
  • Redirect to mirror with injected JavaScript
  • Round-robin interface selection:
    • baseline, popup, highlight, icon
  • Track clicks, load, unload, and show survey
results
Results
  • Contribution Rate
    • 1.6%  13%
  • 90% of positive labels were correct
slide33

Outline

1. Self-Supervised Extraction from Wikipedia Text

2. Automatic Ontology Generation

3. Scalable Probabilistic Inference for Q/A

scalable probabilistic inference
Scalable Probabilistic Inference

[Schoenmackeret al. 2008]

  • Eight MLN Inference Rules
    • Transitivity of predicates, etc.
  • Knowledge-Based Model Construction
  • Tested on 100 Million Tulples
    • Extracted by Textrunner from Web
cost of inference
Cost of Inference

Approximately Pseudo-Functional Relations

slide37

Conclusion

  • Wikipedia is a Fantastic Platform & Corpus
  • Self-Supervised Extraction from Wikipedia

Training on Infoboxes

Works well on popular classes

Improving Recall – Shrinkage, Retraining, Web Extraction

High precision & recall - even on sparse classes, stub articles

Community Content Creation

  • Automatic Ontology Generation

Probabilistic Joint Inference

  • Scalable Probabilistic Inference for Q/A

Simple Inference - Scales to Large Corpora

Tested on 100 M Tuples

slide38

Future Work

  • Improved Ontology Generation
    • Joint Schema Mapping
    • Incorporate Freebase, etc.
  • Multi-Lingual Extraction
  • Automatically Learn Inference Rules
  • Make Available as Web Service
    • Integrate Back Into Wikipedia
ad