
Scalable Probabilistic Databases with Factor Graphs and MCMC



  1. Scalable Probabilistic Databases with Factor Graphs and MCMC

  2. Outline • Background of research • Key contributions • FACTORIE language • Models for information extraction • MCMC with database “assist” • Experimental results • Implications for information extraction more generally

  3. Background of research • McCallum is an ML researcher crossing the bridge to DB • Mostly tools and apps (incl. IE) for undirected models • “Probabilistic databases” undergoing significant evolution (see survey by Dalvi et al., CACM, 2009): • Early PDB systems attached probabilities to tuples: • 0.7: Employs(IBM,John) • 0.95: Employs(IBM,Mary), etc. • Aggregation queries etc. evaluated under global independence • Around 2005, model-based approaches took over, but faced the same issues (expressive power, complexity) as in AI
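To make the early tuple-probability style concrete, here is a minimal sketch of two aggregate queries over the slide's two Employs tuples. The function names are hypothetical, and treating the tuples as mutually independent is the "global independence" assumption named on the slide, not something checked here.

```python
from math import prod

# Tuple probabilities taken from the slide; independence across tuples is assumed.
employs = {("IBM", "John"): 0.7, ("IBM", "Mary"): 0.95}

def expected_count(tuples):
    """Expected number of true tuples (follows from linearity of expectation)."""
    return sum(tuples.values())

def prob_any(tuples):
    """Probability that at least one tuple is true, assuming independence."""
    return 1.0 - prod(1.0 - p for p in tuples.values())

ibm_expected = expected_count(employs)   # 0.7 + 0.95
ibm_nonempty = prob_any(employs)         # 1 - 0.3 * 0.05
```

The expected count needs no independence assumption, while the "at least one employee" query does; that is where the early systems became restrictive.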

  4. Key contributions • Increasingly sophisticated CRF-like models for extraction, entity resolution, schema mapping, etc. • FACTORIE for model construction and inference • Efficient MCMC inference on relational worlds • Handles very large models without blowing up • Efficient local computation for each MC step • Integration with database technology: • Possible world = database, MC step = database update • Query evaluation directly on database • Incremental re-evaluation after each MC step
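The "possible world = database, MC step = database update" idea can be sketched with an in-memory SQLite table. Everything specific here is an assumption made for illustration: the `label` schema, the per-row `weight` that stands in for a model score, and the independent single-row flip proposal. It is not the system described in the talk, and it re-runs the query in full rather than maintaining it incrementally.

```python
import math
import random
import sqlite3

# Hypothetical schema: one row per token; is_person is the uncertain 0/1 value.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE label (token_id INTEGER PRIMARY KEY, is_person INTEGER, weight REAL)")
rng = random.Random(0)
n_tokens = 200
db.executemany(
    "INSERT INTO label VALUES (?, 0, ?)",
    [(i, rng.uniform(-2.0, 2.0)) for i in range(n_tokens)],
)

def mc_step():
    """Propose flipping one row; accept with the MH rule for a log-linear score."""
    tid = rng.randrange(n_tokens)
    value, weight = db.execute(
        "SELECT is_person, weight FROM label WHERE token_id = ?", (tid,)
    ).fetchone()
    delta = weight if value == 0 else -weight   # change in log score if flipped
    if rng.random() < math.exp(min(0.0, delta)):
        # The accepted move is an ordinary database update.
        db.execute("UPDATE label SET is_person = ? WHERE token_id = ?", (1 - value, tid))

def query():
    """The query runs on the current world as on any ordinary database."""
    return db.execute("SELECT COUNT(*) FROM label WHERE is_person = 1").fetchone()[0]

answers = []
for step in range(1, 5001):
    mc_step()
    if step % 100 == 0:          # evaluate the query on a thinned set of sampled worlds
        answers.append(query())
estimate = sum(answers) / len(answers)   # Monte Carlo estimate of the expected count
```

No burn-in samples are discarded in this sketch, so the estimate is only indicative.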


  6. Factor graphs • Nodes are variables and factors (potentials on sets of variables) • Links connect variables to factors that include them • $P(x_1,\ldots,x_n) = \frac{1}{Z}\prod_j F_j(s_j)$ and (in this paper) $F_j(s_j) = \exp(\phi_j(s_j)\,\theta_j)$ with features $\phi_j$ • FACTORIE uses loops in a way analogous to BUGS (plates)
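As a small illustration of the log-linear factor form, the sketch below builds a toy factor graph and normalizes it by brute-force enumeration. The three binary variables, the `agree` feature, and the weights are all hypothetical, and each feature is a single scalar for simplicity. This is plain Python, not FACTORIE code.

```python
import itertools
import math

def agree(a, b):
    """Hypothetical feature: 1 if the two variables take the same value."""
    return 1.0 if a == b else 0.0

# Each factor is (scope, feature function phi_j, weight theta_j).
factors = [
    ((0, 1), agree, 1.5),
    ((1, 2), agree, -0.5),
]

def log_score(x):
    """Sum of phi_j(s_j) * theta_j over factors: the log of the unnormalized product."""
    return sum(theta * phi(*(x[i] for i in scope)) for scope, phi, theta in factors)

# Z sums the unnormalized score over all 2^3 worlds; feasible only for toy models.
worlds = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(log_score(x)) for x in worlds)

def prob(x):
    return math.exp(log_score(x)) / Z

total = sum(prob(x) for x in worlds)
```

Enumerating every world to obtain Z is exactly what becomes infeasible for large models, which is why sampling-based inference is attractive there.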

  7. MCMC (Metropolis-Hastings) • Worlds $x$, evidence $e$, posterior $\pi(x) = P(x \mid e) = P(x,e)/P(e)$ • Proposal distribution $q(x' \mid x)$ determines neighborhood of $x$ • MH samples $x'$ from $q(x' \mid x)$, accepts with probability $\alpha(x' \mid x) = \min\left(1, \frac{\pi(x')\,q(x \mid x')}{\pi(x)\,q(x' \mid x)}\right) = \min\left(1, \frac{P(x',e)\,q(x \mid x')}{P(x,e)\,q(x' \mid x)}\right)$ • For graphical models (and BLOG), $P(x,e)$ is a product of local conditional probabilities (or potentials) • If the change from $x$ to $x'$ is local (e.g., a single tuple becomes true or false), almost all terms in $P(x,e)$ and $P(x',e)$ cancel out • Hence the per-step computation cost is independent of model size
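The cancellation argument can be sketched on a toy model: only the factors touching the flipped variable are rescored, so the cost of one step does not grow with the number of variables. The chain structure, the weight `theta`, and the uniform single-variable flip proposal are assumptions chosen for illustration.

```python
import math
import random

# Hypothetical chain model: n binary variables, one "agree" factor per neighbor pair.
n = 50
theta = 0.8
factors = [(i, i + 1) for i in range(n - 1)]
# Index from variable to the factors that touch it, so a flip rescans only those.
touching = {v: [f for f in factors if v in f] for v in range(n)}

def factor_log(x, f):
    i, j = f
    return theta if x[i] == x[j] else 0.0

def full_log_score(x):
    """Log of the unnormalized joint; cost grows with the model."""
    return sum(factor_log(x, f) for f in factors)

def local_delta(x, v):
    """log P(x',e) - log P(x,e) when x' flips variable v; untouched factors cancel."""
    before = sum(factor_log(x, f) for f in touching[v])
    x[v] ^= 1
    after = sum(factor_log(x, f) for f in touching[v])
    x[v] ^= 1  # restore the current world
    return after - before

def mh_step(x, rng):
    v = rng.randrange(n)                 # symmetric proposal: q(x'|x) = q(x|x') = 1/n
    delta = local_delta(x, v)
    if rng.random() < math.exp(min(0.0, delta)):   # alpha = min(1, exp(delta))
        x[v] ^= 1
        return True
    return False

rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(n)]
```

Because the proposal is symmetric, the q terms drop out of the acceptance ratio and only the local score difference remains.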

  8-12. MCMC on values [Figure sequence over five slides: a graphical model with nodes Earthquake(Ra), Earthquake(Rb), A(H1) to A(H5), and B(H1) to B(H5)]

  13. Integration with DB technology • Databases are designed for • storing lots of data • efficient processing of queries on lots of data • How much can we borrow from DB technology to help with probabilistic IE?

  14. Optimizing query evaluation • In databases, running a query can be expensive, especially if it involves scanning all the data: • Aggregation, e.g., #{x,y: R(x,y) ∧ R(y,x)} • Quantifier alternation, chains of literals, etc. • A materialized view is a cached database table representation of any query result • Incremental view maintenance updates the materialized view whenever any tuple changes • E.g., if R(A,B) is set to true, check R(B,A) and add 1 • So the query can be re-evaluated much faster after each MC step
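The R(A,B)/R(B,A) example can be sketched as a tiny maintained view. The class and method names are hypothetical. The sketch counts each symmetric pair once (as an unordered pair), which is the reading under which the slide's "add 1" rule is exact; counting ordered pairs would need a different increment.

```python
class SymmetricPairCount:
    """Maintains the number of unordered pairs {x, y} with R(x,y) and R(y,x) both true."""

    def __init__(self):
        self.R = set()
        self.count = 0

    def set_true(self, a, b):
        if (a, b) in self.R:
            return
        self.R.add((a, b))
        if (b, a) in self.R:      # the only tuple that can complete a new pair
            self.count += 1

    def set_false(self, a, b):
        if (a, b) not in self.R:
            return
        if (b, a) in self.R:      # the pair was counted and no longer holds
            self.count -= 1
        self.R.remove((a, b))

    def recompute(self):
        """Full scan, for comparison with the incrementally maintained value."""
        return len({frozenset((x, y)) for (x, y) in self.R if (y, x) in self.R})

view = SymmetricPairCount()
view.set_true("A", "B")
view.set_true("B", "A")
view.set_true("C", "C")
view.set_false("A", "B")
view.set_true("B", "D")
```

Each update does a constant number of set lookups, whereas `recompute` scans all of R; that gap is the saving available after every MC step.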

  15. Drawbacks of black-box DB technology • Modifying tuples in a disk-resident DB is expensive • DB technology designed mostly for atomic transactions; 500/second on $10K system • Difficult to add new types of optimization, e.g., maintaining efficient summaries (min, etc.) • Not suitable for some data types, e.g., images • A “database” sounds like a “possible world”, but only under Herbrand semantics

  16. Experiments - NER • Skip-chain CRF includes links between labels for identical tokens (but not across docs!!)

  17. Experiments - NER • Proposal distribution: • Choose up to five documents at random • Choose one label variable at random among these • Choose a label at random • Data: 1788 NYT articles • Query: number of B-PER labels (evaluated every 10k MC steps) • Result: 17650 ± 50 • Essentially each B-PER decision is independent; too many parameters, too little context, no parameter uncertainty!
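The three-step proposal can be sketched as follows. The label set, the document lengths, and the function name are hypothetical, and the sketch always draws exactly five documents when that many exist, which is one reading of "up to five".

```python
import random

# Hypothetical label set; the slide only names B-PER.
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

def propose(doc_lengths, rng, max_docs=5):
    """Return (document index, token index, new label) following the three steps."""
    k = min(max_docs, len(doc_lengths))
    docs = rng.sample(range(len(doc_lengths)), k)            # up to five documents
    candidates = [(d, t) for d in docs for t in range(doc_lengths[d])]
    d, t = rng.choice(candidates)                            # one label variable among them
    return d, t, rng.choice(LABELS)                          # one label, uniformly at random

rng = random.Random(0)
doc_lengths = [12, 30, 7, 45, 19, 22, 8]   # hypothetical token counts per document
d, t, label = propose(doc_lengths, rng)
```

The proposed label may equal the current one, in which case the step leaves the world unchanged.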

  18. Summary • A serious attempt to create scalable, nontrivial probability models and inference technology for IE • Experiments not totally convincing: • Efficiency: documents are independent! • Reasonableness of answers: counts far too precise • Not clear if FACTORIE is “elegantly” usable to create very complex models • Some continuing work….
