Unified Probability and Logic for Information Extraction

PLUIE: Probability and Logic Unified for Information Extraction
Stuart Russell, Patrick Gallinari, Patrice Perny

Project Goals
• “Open” information extraction
• Construct knowledge bases from the web
• Learn new classes, relations, linguistic patterns
• Learn new predictive regularities
• Integrate facts, entities across multiple documents
• Support question answering
• Accuracy, consistency, integration, and utility; not scale for its own sake

Approach
• Probabilistic inference with the Web as evidence
• Generative models when available
[Figure: diagram with the labels “World” and “Web”]
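Read as Bayesian inference, "the Web as evidence" can be written schematically as follows. The symbols are assumptions introduced here for illustration and do not come from the slides: ω is a possible world, e is the observed web text, P(ω) is the prior over worlds, and P(e | ω) is the generative model of text given a world.

```latex
P(\omega \mid e) \;=\; \frac{P(e \mid \omega)\, P(\omega)}{\sum_{\omega'} P(e \mid \omega')\, P(\omega')}
```

The sum in the denominator is only notional: when the set of possible worlds is infinite or heterogeneous, the posterior would typically be approximated, for example by sampling, rather than computed exactly.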

Approach, contd.
• Open-universe probability models (e.g., BLOG)
• First-order expressive power (objects, relations, functions, quantifiers, equality, etc.)
• Allow for uncertainty about existence, identity of objects
• Generative model consists of
  • What might be true in the world
  • Who might choose to say what
  • How they might choose to say it
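The three-part generative model above can be sketched as a toy forward sampler. This is a minimal illustration in Python, not BLOG and not the project's actual model; the name pool, phrasing templates, error rate, and all function names are hypothetical. The random number of people stands in for existence uncertainty, and names drawn with replacement stand in for identity uncertainty.

```python
import random

NAMES = ["Smith", "Jones", "Lee"]   # hypothetical surface names
TEMPLATES = [                       # hypothetical phrasings
    "{person} is the CEO of {company}",
    "{company} chief executive {person}",
    "{company} is led by {person}",
]

def sample_world(rng):
    """What might be true: an unknown number of people and companies."""
    n_people = rng.randint(1, 4)      # existence uncertainty: the count is random
    n_companies = rng.randint(1, 2)
    # Names are drawn with replacement, so two distinct people may share a
    # name (identity uncertainty for a reader who only sees the text).
    people = {f"p{i}": rng.choice(NAMES) for i in range(n_people)}
    companies = {f"c{j}": f"Company{j}" for j in range(n_companies)}
    ceo_of = {c: rng.choice(sorted(people)) for c in companies}
    return {"people": people, "companies": companies, "ceo_of": ceo_of}

def sample_statements(world, rng, n_sources=3, error_rate=0.1):
    """Who might say what: each source reports one CEO fact, sometimes wrongly."""
    statements = []
    for source in range(n_sources):
        company = rng.choice(sorted(world["companies"]))
        person = world["ceo_of"][company]
        if rng.random() < error_rate:
            # An unreliable source names a person chosen at random instead.
            person = rng.choice(sorted(world["people"]))
        statements.append((source, person, company))
    return statements

def render(statement, world, rng):
    """How they might say it: pick a phrasing and fill in surface names."""
    _, person, company = statement
    return rng.choice(TEMPLATES).format(
        person=world["people"][person],
        company=world["companies"][company],
    )

def generate_web(seed):
    """Sample a world, then statements about it, then the text of each statement."""
    rng = random.Random(seed)
    world = sample_world(rng)
    statements = sample_statements(world, rng)
    text = [render(s, world, rng) for s in statements]
    return world, statements, text

if __name__ == "__main__":
    world, statements, text = generate_web(seed=0)
    for line in text:
        print(line)
```

Extraction would run this process in reverse: given only the printed sentences, infer which people and companies exist and who leads what. The sampler shows only the forward direction.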

Approach, contd.
• Rigorous ontological framework
• Standard taxonomic hierarchy that supports distinctions needed for language
• E.g., mass nouns (water) vs. count nouns (lake)
• Proper treatment of events and time; avoid deficient “facts” such as
  • Man Utd beat Chelsea; Chelsea beat Man Utd (PowerSet)
  • Hank Paulson is the CEO of Goldman Sachs (NELL)
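One way to see why the "facts" above are deficient is to attach time to them. The sketch below is a hypothetical Python representation, not the project's ontology: the `Event` and `Fluent` classes are invented for illustration, the match dates are placeholders rather than real fixtures, and the tenure years are approximate.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Event:
    """Something that happens at a particular time, e.g. one match result."""
    relation: str
    subject: str
    obj: str
    on: date

@dataclass(frozen=True)
class Fluent:
    """A relation that holds over an interval, e.g. holding a job."""
    relation: str
    subject: str
    obj: str
    start: date
    end: Optional[date] = None   # None means "still holds"

    def holds_at(self, day: date) -> bool:
        # The interval is half-open: it includes start and excludes end.
        return self.start <= day and (self.end is None or day < self.end)

# Placeholder dates (NOT real match dates): once each result is tied to a
# date, the two "beat" statements are distinct events, not a contradiction.
win_a = Event("beat", "Man Utd", "Chelsea", date(2000, 1, 1))
win_b = Event("beat", "Chelsea", "Man Utd", date(2000, 6, 1))

# Approximate years, for illustration only: the CEO relation is true of
# some dates and false of others, so an untimed "is the CEO" is incomplete.
ceo = Fluent("ceo_of", "Hank Paulson", "Goldman Sachs",
             start=date(1999, 1, 1), end=date(2006, 7, 1))

if __name__ == "__main__":
    print(win_a != win_b)
    print(ceo.holds_at(date(2003, 1, 1)))
    print(ceo.holds_at(date(2012, 1, 1)))
```

Separating instantaneous events from interval-valued relations is one common design choice; the slides do not say which representation the project uses.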

Open questions
• Efficient inference
• What is extracted? Posterior over possible worlds?
• How to identify new categories and relations
• HCI: Presenting infinite heterogeneous posterior distributions: who wrote what when, when “who,” “what,” and “when” vary across worlds?
• Making use of partially extracted or unextracted information: “data spaces” (Franklin, Halevy)
• Adversarial data: game-theoretic analysis?

Plan
• Reading group
  • Weekly meeting (day and time?)
  • Participants take turns presenting
  • Reading list at www.cs.berkeley.edu/~russell/pluie/readings.html
• Formal project (ANR) runs 1/1/13 to 8/31/14
• Will continue indefinitely
• Hiring two postdocs
• Possible collaborations
  • Tom Mitchell’s NELL project (CMU)
  • Andrew McCallum (UMass)
  • Kevin Murphy (Google’s Knowledge Graph project)
