SOFIE: A Self-Organizing Framework for Information Extraction

Fabian M. Suchanek, Mauro Sozio, Gerhard Weikum

(Max-Planck-Institute for Informatics, Saarbrücken, Germany)

Ontologies

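[Figure: ontology fragment in which Singer and Country are subclasses of Entity (subclassOf) and instances are attached by type edges. Ontologies such as DBpedia, YAGO, KYLIN, ... are built from Wikipedia, e.g. "birth-place: USA" becomes a bornInPlace edge to USA. An edge marked "?" stands for facts that appear only in Internet text, e.g. "Elvis died in England".]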

Information Extraction

Goal: Extract ontological information from natural language documents

"Elvis died in England"  ~~~~>  diedInPlace(Elvis, England)

Previous approaches:

Espresso, DIPRE, LEILA, Snowball, TextRunner, Alice, and many more

• May deliver non-canonical relations: died in, perished in, was killed in, ...
• May deliver non-canonical entities: England, UK, Great Britain, ...
• May deliver inconsistent facts: diedInPlace(Elvis,England) and diedInPlace(Elvis,Germany)

Pitfalls of Information Extraction

Ontology: diedInPlace(Louis XIV, France)

Web page: "Elvis died in England." "Louis XIV died in France."

If a pattern occurs with two entities that stand in a relation, then the pattern maps to the relation.

"died in" = diedInPlace

If a meaningful pattern occurs with two entities, then the entities stand in the relation.

diedInPlace("Elvis", "England")

Pitfalls of Information Extraction

Web page: "Elvis died in England." "Louis XIV died in France."

• Pattern Matching Problem: "died in" = diedInPlace ?
• Reasoning Problem: the ontology lists Elvis as a Taxidophobist
• Disambiguation Problem: which entity does "Elvis" denote?

Information Extraction as Formulas

Reasoning Problem

Taxidophobist

type(Elvis,Taxidophobist).

type(X,Taxidophobist) & bornInPlace(X,Y) & Y≠Z => ¬diedInPlace(X,Z) [0.8]

Information Extraction as Formulas

Disambiguation Problem

Assumptions:

• Within one document, the same word always has the same meaning
• The ontology already knows all important meanings of proper names

possibleMeaning(Elvis@D15, ElvisPresley). [0.7]

• Elvis@D15 is a word in context (wic). Here: the word "Elvis" in document D15.
• ElvisPresley is one possible meaning of "Elvis" as given by the ontology.
• The weight [0.7] is a prior estimate of the likelihood of this meaning:

$$ \frac{|\,\mathrm{words}(D15) \cap \mathrm{rel}(\mathrm{ElvisPresley})\,|}{|\,\mathrm{words}(D15)\,|} $$
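The prior above is a simple overlap ratio, which can be sketched in a few lines. This is only an illustration: the slide does not say how words(D15) or rel(ElvisPresley) are computed, so the tokenization, the use of distinct words, and the sample word lists below are all assumptions.

```python
import re


def meaning_prior(document_text, related_words):
    """Return |words(D) ∩ rel(e)| / |words(D)| over the distinct words of D.

    Assumption: words(D) is the set of lowercase alphabetic tokens of the
    document, and rel(e) is a set of words associated with entity e.
    """
    words = set(re.findall(r"[a-z]+", document_text.lower()))
    if not words:
        return 0.0
    return len(words & related_words) / len(words)


# Hypothetical document D15 and hypothetical related-word set.
d15_text = "Elvis the singer died in England"
rel_elvis_presley = {"singer", "guitar", "memphis", "rock"}

prior = meaning_prior(d15_text, rel_elvis_presley)
```

Here one of the six distinct words of the document ("singer") is related to the candidate entity, so the prior is 1/6. A larger overlap would raise the weight of the corresponding possibleMeaning fact.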

possibleMeaning(X,Y) => means(X,Y)

means(X,Y) & Y≠Z => ¬means(X,Z)

Information Extraction as Formulas

Pattern Matching Problem

Elvis died in England.
Louis XIV died in France.

occurs("died in", Elvis@D15, England@D15). [14]

"died in" = diedInPlace ?

occurs(P,Wic1,Wic2) & means(Wic1,X) & means(Wic2,Y) & R(X,Y) => mapsTo(P,R)

occurs(P,Wic1,Wic2) & means(Wic1,X) & means(Wic2,Y) & mapsTo(P,R) => R(X,Y)
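To see how the two pattern rules interact, they can be applied as hard rules on a toy input. This is a simplified sketch, not the way SOFIE evaluates them: in the framework the rules are weighted and resolved jointly with the other formulas, whereas here they are simply chained forward. The data and the function names learn_patterns and apply_patterns are hypothetical.

```python
# Pattern occurrences: (pattern, word in context 1, word in context 2).
occurs = [
    ("died in", "Elvis@D15", "England@D15"),
    ("died in", "LouisXIV@D15", "France@D15"),
]

# Assumed disambiguation of each word in context.
means = {
    "Elvis@D15": "ElvisPresley",
    "England@D15": "England",
    "LouisXIV@D15": "LouisXIV",
    "France@D15": "France",
}

# Facts already in the ontology: (relation, subject, object).
facts = {("diedInPlace", "LouisXIV", "France")}


def learn_patterns(occurs, means, facts):
    """Rule 1: a pattern seen with a known fact maps to that fact's relation."""
    maps_to = set()
    for pattern, wic1, wic2 in occurs:
        x, y = means.get(wic1), means.get(wic2)
        if x is None or y is None:
            continue
        for relation, subject, obj in facts:
            if (subject, obj) == (x, y):
                maps_to.add((pattern, relation))
    return maps_to


def apply_patterns(occurs, means, maps_to):
    """Rule 2: a mapped pattern seen with two entities yields a fact."""
    derived = set()
    for pattern, wic1, wic2 in occurs:
        x, y = means.get(wic1), means.get(wic2)
        if x is None or y is None:
            continue
        for known_pattern, relation in maps_to:
            if known_pattern == pattern:
                derived.add((relation, x, y))
    return derived


maps_to = learn_patterns(occurs, means, facts)
new_facts = apply_patterns(occurs, means, maps_to) - facts
```

Applied this way, the rules produce diedInPlace(ElvisPresley, England), which is exactly the kind of dubious fact that motivates weighing the pattern rules against disambiguation and reasoning.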

Information Extraction as Formulas

Pattern Matching Problem: occurs("died in", Elvis@D15, England@D15). [14]

Reasoning Problem: type(Elvis,Taxidophobist).
type(X,Taxidophobist) & bornInPlace(X,Y) & Y≠Z => ¬diedInPlace(X,Z)

Disambiguation Problem: possibleMeaning(Elvis@D15, ElvisPresley). [0.7]

Find truth assignments to hypotheses so that the weight of satisfied formulas is maximized

means(Elvis@D15, ElvisPresley) ?
mapsTo("died in", diedInPlace) ?
diedInPlace(ElvisPresley, England) ?

Weighted MAX SAT Problem
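On an instance this small, the optimization can be solved by plain enumeration, which makes the objective concrete. The following is a minimal sketch with three hypotheses. The clause set and all weights are hypothetical, chosen for illustration, and exhaustive search is used only because the example is tiny.

```python
from itertools import product

M = "means(Elvis@D15,ElvisPresley)"
P = "mapsTo(died in,diedInPlace)"
D = "diedInPlace(ElvisPresley,England)"

# Each clause is (literals, weight); a literal is (hypothesis, polarity).
clauses = [
    ([(M, True)], 0.7),                          # disambiguation prior
    ([(M, False), (P, False), (D, True)], 14),   # pattern rule, grounded
    ([(D, False)], 0.8),                         # reasoning constraint
    ([(P, True)], 5),                            # assumed support for the pattern
]


def clause_satisfied(literals, assignment):
    return any(assignment[var] == polarity for var, polarity in literals)


def solve_weighted_max_sat(variables, clauses):
    """Exhaustively search all truth assignments (exponential in len(variables))."""
    best_assignment, best_weight = None, float("-inf")
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        weight = sum(w for lits, w in clauses if clause_satisfied(lits, assignment))
        if weight > best_weight:
            best_assignment, best_weight = assignment, weight
    return best_assignment, best_weight


best_assignment, best_weight = solve_weighted_max_sat([M, P, D], clauses)
```

With these weights the best assignment keeps the pattern mapping, rejects the reading of "Elvis" as ElvisPresley, and therefore does not derive the death fact.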

Weighted MAX SAT Problem

Problems:

• The Weighted MAX SAT Problem is NP-hard
• Our instance of the problem is huge
• The most popular linear-time approximation algorithm (Johnson's) does not work well with our type of formulas:

bornInPlace(X,Y) & Y≠Z => ¬bornInPlace(X,Z)

¬A v ¬B
¬A v ¬C
¬B v ¬C

Johnson's cannot approximate better than 2/3
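For reference, Johnson's algorithm is commonly described as a greedy procedure that scales each clause weight by 2 to the power of minus its number of remaining literals. The sketch below follows that common description; it is not taken from the slides, and the small instance (two competing unit clauses plus one exclusion clause) is a hypothetical example.

```python
def johnson(variables, clauses):
    """Greedy weighted MAX SAT heuristic in the style of Johnson's algorithm.

    Each clause is (literals, weight); a literal is (variable, polarity).
    """
    live = [(set(lits), w) for lits, w in clauses]
    assignment = {}
    for var in variables:
        # Scaled weight of clauses that each polarity of var would satisfy.
        pos = sum(w / 2 ** len(lits) for lits, w in live if (var, True) in lits)
        neg = sum(w / 2 ** len(lits) for lits, w in live if (var, False) in lits)
        value = pos >= neg
        assignment[var] = value
        next_live = []
        for lits, w in live:
            if (var, value) in lits:
                continue  # clause satisfied, no longer relevant
            remaining = lits - {(var, not value)}
            if remaining:
                next_live.append((remaining, w))
            # a clause with no literals left is falsified and dropped
        live = next_live
    return assignment


# Hypothetical instance: A and B are both supported, but exclude each other.
example_clauses = [
    ([("A", True)], 1),
    ([("B", True)], 1),
    ([("A", False), ("B", False)], 2),
]
johnson_assignment = johnson(["A", "B"], example_clauses)
```

On this instance the heuristic sets A to true first (the scaled weights tie) and is then forced to set B to false.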

FMS Algorithm

The Functional MAX SAT Algorithm considers only unit clauses.

[Figure: weighted formulas A v B [w1], A v B [w2], B v C [w3], C [w4] over the hypotheses A, B, C, shown with a partial truth assignment (= false, = false, = true).]

The Functional MAX SAT Algorithm propagates Dominating Unit Clauses

¬A v B [10]
¬A [10]
A [30]

30 > 10+10, so the unit clause A dominates: A = true
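The dominating unit clause step can be sketched as follows. This is a simplified reading, not the FMS implementation: it assumes the slide's example compares the unit clause A [30] against the total weight of the clauses containing ¬A (10 + 10), and that unit clauses for the same literal have already been merged. The function names are hypothetical.

```python
def find_dominating_unit_clause(clauses):
    """Return (variable, polarity) of a unit clause whose weight exceeds the
    total weight of all clauses containing the opposite literal, or None."""
    for literals, weight in clauses:
        if len(literals) != 1:
            continue
        var, polarity = literals[0]
        opposing = sum(w for lits, w in clauses if (var, not polarity) in lits)
        if weight > opposing:
            return var, polarity
    return None


def propagate(clauses, var, value):
    """Drop satisfied clauses and remove the falsified literal from the rest."""
    simplified = []
    for literals, weight in clauses:
        if (var, value) in literals:
            continue
        remaining = [lit for lit in literals if lit[0] != var]
        if remaining:
            simplified.append((remaining, weight))
    return simplified


def duc_propagation(clauses):
    """Repeatedly fix dominating unit clauses until none is left."""
    assignment = {}
    while True:
        found = find_dominating_unit_clause(clauses)
        if found is None:
            return assignment, clauses
        var, value = found
        assignment[var] = value
        clauses = propagate(clauses, var, value)


slide_clauses = [
    ([("A", False), ("B", True)], 10),
    ([("A", False)], 10),
    ([("A", True)], 30),
]
duc_assignment, leftover = duc_propagation(slide_clauses)
```

Setting A to true turns the first clause into the unit clause B [10], which then dominates in turn, so a single propagation can trigger further ones.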

FMS Algorithm

• Polynomial time
• Approximation guarantee
• Experiments show better performance in practice than Johnson's algorithm in our setting.

FMS Algorithm

[Figure: the input text ("Elvis died in England") and rules such as r(X,Y) & s(Y) => t(X,Y) are fed into the FMS Algorithm, which assigns truth values to the hypotheses:]

type(Elvis,Taxidophobist)=1
diedIn(Elvis,England)=0
means(Elvis@D15,Elvis)=0
means(Elvis@D15,...)=1

[Figure: resulting ontology edge St. Elvis diedIn England.]

Conclusion

SOFIE unifies the tasks of

• entity disambiguation
• pattern extraction
• semantic constraint reasoning

in a single framework, delivering

• canonicalized facts
• of high precision (experiments show 90% precision)

died in England...

but is alive!

SOFIE rules!

R(X,Y) /\ R(X,Z) /\ type(R,function) => Y = Z
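The functional constraint above can be checked directly on a set of extracted facts. The sketch below flags subjects that have more than one object under a functional relation; the sample facts and the function name functional_violations are hypothetical, and treating diedInPlace as functional is an assumption.

```python
from collections import defaultdict


def functional_violations(facts, functional_relations):
    """Group objects by (relation, subject) and keep groups with several objects."""
    objects = defaultdict(set)
    for relation, subject, obj in facts:
        if relation in functional_relations:
            objects[(relation, subject)].add(obj)
    return {key: objs for key, objs in objects.items() if len(objs) > 1}


extracted = [
    ("diedInPlace", "Elvis", "England"),
    ("diedInPlace", "Elvis", "Germany"),
    ("bornInPlace", "Elvis", "USA"),
]
violations = functional_violations(extracted, {"diedInPlace", "bornInPlace"})
```

A hard check like this only detects the conflict. Deciding which of the competing facts to keep is what the weighted formulation is for.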

occurs(P,WX,WY) /\ refersTo(WX,X) /\ refersTo(WY,Y) /\ R(X,Y) => expresses(P,R)

occurs(P,WX,WY) /\ expresses(P,R) /\ refersTo(WX,X) /\ refersTo(WY,Y) /\ domain(R,D1) /\ range(R,D2) /\ type(X,D1) /\ type(Y,D2) => R(X,Y)

disambiguationPrior(W,X) => refersTo(W,X)

R(X,Y)

bornInYear(X,B) /\ diedInYear(X,D) => B<D

Corpus:

3700 biography documents downloaded from the Web

Goal:

Extract bornIn, bornOnDate, diedIn, diedOnDate, politicianOf

Results: (precision in %)

bornIn   bornOnD   diedIn   diedOnD   polOf
87       87        13       98        95

Overall: 90

Runtime: (summed over 5 batches)

Parsing                  7:05h
Hypothesis Generation    6:15h
Solving                  2:30h
Total                   15:50h

SOFIE: Relation to Markov Logic

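r(x,y) /\ s(x,z) => t(x,z) [w]
...

A possible world X assigns true or false to every ground atom, e.g. bornIn(Nicholas, Patras).

$$ P(X) \sim \prod_i e^{\mathrm{sat}(i,X)\, w_i} $$

where sat(i,X) is the number of satisfied instances of the ith formula and w_i is the weight of the ith formula.

$$ \max_X \prod_i e^{\mathrm{sat}(i,X)\, w_i} $$

$$ \max_X \log\Big( \prod_i e^{\mathrm{sat}(i,X)\, w_i} \Big) $$

$$ \max_X \sum_i \mathrm{sat}(i,X)\, w_i $$

~~~~> Weighted MAX SAT problem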
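The step from the most probable world to the maximum-weight world relies only on the logarithm being monotone, which can be checked numerically on a toy model. The atoms, formulas, and weights below are hypothetical, and the normalization constant is omitted because it does not affect the maximizer.

```python
import math
from itertools import product

atoms = ["p", "q"]

# Each formula is (weight, ground instances); an instance is a clause,
# given as a list of (atom, polarity) literals.
formulas = [
    (1.5, [[("p", True)], [("p", False), ("q", True)]]),
    (0.5, [[("q", False)]]),
]


def sat(instances, world):
    """Number of ground instances of a formula that the world satisfies."""
    return sum(any(world[a] == pol for a, pol in inst) for inst in instances)


def weight_sum(world):
    return sum(w * sat(instances, world) for w, instances in formulas)


def unnormalized_probability(world):
    return math.exp(weight_sum(world))


worlds = [dict(zip(atoms, values)) for values in product([False, True], repeat=len(atoms))]
best_by_probability = max(worlds, key=unnormalized_probability)
best_by_weight = max(worlds, key=weight_sum)
```

Both criteria select the same world, here the one in which p and q are true with a weight sum of 3.0.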

Grounding

r(X,Y) & s(Y) => t(X,Y)

In clause form: { ¬r(X,Y), ¬s(Y), t(X,Y) }

With Entities={a,b}, naive grounding yields one clause per binding of X and Y:

{ ¬r(a,a), ¬s(a), t(a,a) }
{ ¬r(a,b), ¬s(b), t(a,b) }
{ ¬r(b,a), ¬s(a), t(b,a) }
{ ¬r(b,b), ¬s(b), t(b,b) }

Immutable, complete facts (e.g. pattern occurrences) restrict the grounding. For example, if r(a,a) [w] is the only r fact that holds, then r(a,b), r(b,a), and r(b,b) are false, their clauses are trivially satisfied, and only one clause remains:

{ ¬s(a), t(a,a) } [w]

Grounding all rules this way yields a set of weighted clauses over hypotheses:

{ ¬s(a), t(a,a) } [w1]
{p(c,d), q(e), } [w2]

Find truth assignments to hypotheses so that the weight of satisfied formulas is maximized

means(Elvis@D15, ElvisPresley) = true ?
mapsTo("died in", diedInPlace) = true ?
diedInPlace(ElvisPresley, England) = true ?
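The grounding step for the rule r(X,Y) & s(Y) => t(X,Y) can be sketched as below. This is a minimal illustration under the assumption that r is an immutable, complete relation, so bindings without a matching r fact are skipped; the function name ground_rule and the clause encoding are hypothetical.

```python
from itertools import product


def ground_rule(entities, r_facts):
    """Ground r(X,Y) & s(Y) => t(X,Y) into clauses over the hypotheses s and t.

    A clause is a list of (atom, polarity) literals. Bindings whose r(X,Y)
    is not a known fact are skipped, because the implication then holds
    regardless of s and t.
    """
    clauses = []
    for x, y in product(entities, repeat=2):
        if (x, y) not in r_facts:
            continue
        clauses.append([(("s", y), False), (("t", x, y), True)])
    return clauses


grounded = ground_rule(["a", "b"], {("a", "a")})
```

Of the four possible bindings over two entities, only the one backed by the fact r(a,a) produces a clause, which is what keeps the MAX SAT instance manageable.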

