
Graph-Based Methods for “Open Domain” Information Extraction



  1. Graph-Based Methods for “Open Domain” Information Extraction William W. Cohen Machine Learning Dept. and Language Technologies Institute School of Computer Science Carnegie Mellon University Joint work with Richard Wang

  2. Traditional IE vs. Open Domain IE • Traditional IE goal: recognize people, places, companies, times, dates, … in NL text • Supervised learning from a corpus completely annotated with the target entity class (e.g. “people”) • Linear-chain CRFs • Language- and genre-specific extractors • Open Domain IE goal: recognize arbitrary entity sets in text, given minimal info about the entity class • Example 1: “ICML, NIPS” • Example 2: “machine learning conferences” • Semi-supervised learning from very large corpora (WWW) • Graph-based learning methods • Techniques are largely language-independent (!): the graph abstraction fits many languages

  3. Outline • History • Open-domain IE by pattern-matching • The bootstrapping-with-noise problem • Bootstrapping as a graph walk • Open-domain IE as finding nodes “near” seeds on a graph • Set expansion - from a few clean seeds • Iterative set expansion – from many noisy seeds • Relational set expansion • Multilingual set expansion • Iterative set expansion – from a concept name alone

  4. History: Open-domain IE by pattern-matching (Hearst, 92) • Start with seeds: “NIPS”, “ICML” • Look through a corpus for certain patterns: • … “at NIPS, AISTATS, KDD and other learning conferences…” • “on PC of KDD, SIGIR, … and…” • Expand from seeds to new instances • Repeat… until ___
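The loop above is easy to state in code. Below is a minimal, hypothetical sketch: the corpus, the single coordination pattern, and the iteration cap are all stand-ins, and real pattern-matching systems use a much richer pattern inventory.

```python
import re

def bootstrap(corpus, seeds, max_iters=5):
    """Hearst-style bootstrapping: match coordination patterns around
    known entities and harvest the new instances they capture."""
    entities = set(seeds)
    for _ in range(max_iters):
        new = set()
        for e in entities:
            # e.g. "at NIPS, AISTATS, KDD and other learning conferences"
            pat = re.compile(re.escape(e) + r",\s*([A-Z][\w-]+)")
            for doc in corpus:
                new.update(pat.findall(doc))
        if new <= entities:   # one answer to "Repeat... until ___":
            break             # stop when nothing new is found
        entities |= new
    return entities
```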

  5. Bootstrapping as graph proximity • Nodes NIPS, SNOWBIRD linked by contexts “…at NIPS, AISTATS, KDD and other learning conferences…” and “For skiers, NIPS, SNOWBIRD,… and…” • Nodes AISTATS, SIGIR, KDD, … linked by contexts “on PC of KDD, SIGIR, … and…” and “… AISTATS, KDD,…” • Shorter paths ~ earlier iterations; many paths ~ additional evidence

  6. Set Expansion for Any Language (SEAL) – (Wang & Cohen, ICDM 07) • Basic ideas • Dynamically build the graph using queries to the web • Constrain the graph to be as useful as possible • Be smart about queries • Be smart about “patterns”: use clever methods for finding meaningful structure on web pages

  7. System Architecture • Fetcher: download web pages from the Web that contain all the seeds • Extractor: learn wrappers from web pages • Ranker: rank entities extracted by wrappers • Example: seeds Canon, Nikon, Olympus expand to Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung, …

  8. The Extractor • Learn wrappers from web documents and seeds on the fly • Utilize semi-structured documents • Wrappers defined at the character level • Very fast • No tokenization required; thus language-independent • Wrappers derived from doc d are applied to d only • See the ICDM 2007 paper for details

  9. .. Generally <a ref=“finance/ford”>Ford</a> sales … compared to <a ref=“finance/honda”>Honda</a> while <a href=“finance/gm”>General Motors</a> and <a href=“finance/bentley”>Bentley</a> …. • Find the prefix of each seed occurrence and put it in reverse order: • ford1: /ecnanif”=fer a> yllareneG … • Ford2: >”drof/ /ecnanif”=fer a> yllareneG … • honda1: /ecnanif”=fer a> ot derapmoc … • Honda2: >”adnoh/ /ecnanif”=fer a> ot … • Organize these into a trie, tagging each node with the set of seed occurrences beneath it: the root is tagged {f1,f2,h1,h2}; one branch, /ecnanif”=fer a>, is tagged {f1,h1} and splits into yllareneG … {f1} and ot derapmoc … {h1}; the other branch, >”, is tagged {f2,h2} and splits into drof/ /ecnanif”=fer a> yllareneG.. {f2} and adnoh/ /ecnanif”=fer a> ot .. {h2}

  10. .. Generally <a ref=“finance/ford”>Ford</a> sales … compared to <a ref=“finance/honda”>Honda</a> while <a href=“finance/gm”>General Motors</a> and <a href=“finance/bentley”>Bentley</a> …. Find the prefix of each seed occurrence, put it in reverse order, and organize these into a trie, tagging each node with a set of seed occurrences. A left context for a valid wrapper is a node tagged with one instance of each seed: here, the node /ecnanif”=fer a>, tagged {f1, h1}.

  11. .. Generally <a ref=“finance/ford”>Ford</a> sales … compared to <a ref=“finance/honda”>Honda</a> while <a href=“finance/gm”>General Motors</a> and <a href=“finance/bentley”>Bentley</a> …. A left context for a valid wrapper is a node tagged with one instance of each seed. The corresponding right context is the longest common prefix of the text following the corresponding seed instances: for the {f1, h1} wrapper the strings ”>Ford</a> sales … and ”>Honda</a> while … give the right context ”>, and for the {f2, h2} wrapper (the anchor-text occurrences) the right context is </a>.

  12. Nice properties: • There are relatively few nodes in the trie: O((#seeds)*(document length)) • You can tag every node with the complete set of seeds that it covers • You can rank or filter nodes by any predicate over this set of seeds you want: e.g., • covers all seed instances that appear on the page? • covers at least one instance of each seed? • covers at least k instances, instances with weight > w, …
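The wrapper-learning idea on the last few slides can be sketched compactly. The version below is a simplified stand-in for the ICDM 2007 algorithm: rather than materializing the trie, it computes the baseline-style wrapper (the longest left and right contexts shared by all seed occurrences) directly, using the same reversed-prefix trick. All function names are hypothetical.

```python
import re

def occurrences(doc, seed):
    """Start offsets of every occurrence of `seed` in `doc`."""
    return [i for i in range(len(doc)) if doc.startswith(seed, i)]

def longest_common_prefix(strings):
    out = []
    for chars in zip(*strings):
        if len(set(chars)) > 1:
            break
        out.append(chars[0])
    return "".join(out)

def learn_wrapper(doc, seeds):
    spans = [(i, i + len(s)) for s in seeds for i in occurrences(doc, s)]
    # Left context: reverse the text before each occurrence and take the
    # common prefix -- exactly what a trie over reversed prefixes computes.
    left = longest_common_prefix([doc[:i][::-1] for i, _ in spans])[::-1]
    # Right context: common prefix of the text after each occurrence.
    right = longest_common_prefix([doc[j:] for _, j in spans])
    return left, right

def apply_wrapper(doc, left, right):
    """Wrappers derived from doc d are applied to d only."""
    return re.findall(re.escape(left) + r"(.+?)" + re.escape(right), doc)
```

The smarter extractor evaluated later (E2 on slide 26) corresponds to picking trie nodes that cover one occurrence of each seed rather than all occurrences.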

  13. [Cartoon: extracted items labeled “I am noise” / “Me too!”; wrappers also pick up noisy mentions]

  14. Differences from prior work • Fast character-level wrapper learning • Language-independent • Trie structure allows flexibility in goals • Cover one copy of each seed, cover all instances of seeds, … • Works well for semi-structured pages • Lists and tables, pull-down menus, javascript data structures, word documents, … • High-precision, low-recall data integration vs. high-precision, low-recall information extraction

  15. The Ranker • Rank candidate entity mentions based on “similarity” to seeds • Noisy mentions should be ranked lower • Random Walk with Restart (GW) • …?

  16. Google’s PageRank • Inlinks are “good” (recommendations) • Inlinks from a “good” site are better than inlinks from a “bad” site • But inlinks from sites with many outlinks are not as “good”… • “Good” and “bad” are relative. [Diagram: web sites linking to one another]

  17. Google’s PageRank • Imagine a “pagehopper” that always either • follows a random link, or • jumps to a random page

  18. Google’s PageRank (Brin & Page, http://www-db.stanford.edu/~backrub/google.html) • Imagine a “pagehopper” that always either • follows a random link, or • jumps to a random page • PageRank ranks pages by the amount of time the pagehopper spends on a page • or, if there were many pagehoppers, PageRank is the expected “crowd size”

  19. Personalized PageRank (aka Random Walk with Restart) • Imagine a “pagehopper” that always either • follows a random link, or • jumps to a particular page

  20. Personalized PageRank / Random Walk with Restart • Imagine a “pagehopper” that always either • follows a random link, or • jumps to a particular page P0 • This ranks pages by the total number of paths connecting them to P0 • … with each path downweighted exponentially with length
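A power-iteration sketch makes the “pagehopper” picture concrete. This is the generic textbook formulation, not SEAL's implementation; `adj` is a hypothetical adjacency list (every node present as a key) and `restart` is the set of pages the walker jumps back to.

```python
def rwr(adj, restart, alpha=0.15, iters=50):
    """Random Walk with Restart: with prob. alpha jump to a restart node,
    otherwise follow a random outgoing link. Returns node -> weight."""
    nodes = list(adj)
    p = {n: (1.0 / len(restart) if n in restart else 0.0) for n in nodes}
    for _ in range(iters):
        nxt = {n: (alpha / len(restart) if n in restart else 0.0)
               for n in nodes}
        for n in nodes:
            if not adj[n]:
                continue                        # dangling node: mass dropped
            share = (1 - alpha) * p[n] / len(adj[n])
            for m in adj[n]:
                nxt[m] += share                 # follow a random link
        p = nxt
    return p   # weight ~ time the pagehopper spends at each node
```

With restart = {P0}, the contribution of each path from P0 decays by a factor of (1 - alpha) per hop, matching the “downweighted exponentially with length” description above.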

  21. The Ranker • Rank candidate entity mentions based on “similarity” to seeds • Noisy mentions should be ranked lower • Random Walk with Restart (GW) • On what graph?

  22. Building a Graph • A graph consists of a fixed set of… • Node Types: {seeds, document, wrapper, mention} • Labeled Directed Edges: {find, derive, extract} • Each edge asserts that a binary relation r holds • Each edge has an inverse relation r-1 (graph is cyclic) • Intuition: good extractions are extracted by many good wrappers, and good wrappers extract many good extractions. [Diagram: seeds “ford”, “nissan”, “toyota” find northpointcars.com and curryauto.com; the documents derive Wrappers #1–#4; the wrappers extract mentions such as “honda” 26.1%, “acura” 34.6%, “chevrolet” 22.5%, “volvo chicago” 8.4%, “bmw pittsburgh” 8.4%]
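A small sketch of how such a graph might be assembled and ranked, reusing the rwr() sketch above; the relation dictionaries are hypothetical stand-ins for SEAL's internal bookkeeping.

```python
from collections import defaultdict

def build_graph(find, derive, extract):
    """Each argument maps a node to the nodes it relates to, e.g.
    find[('seed', 'ford')] = [('doc', 'northpointcars.com'), ...]."""
    adj = defaultdict(list)
    for relation in (find, derive, extract):
        for src, dsts in relation.items():
            for dst in dsts:
                adj[src].append(dst)   # relation r
                adj[dst].append(src)   # inverse r^-1, so the graph is cyclic
    return dict(adj)

# Rank extracted mentions by proximity to the seed nodes:
# graph = build_graph(find, derive, extract)
# scores = rwr(graph, restart={('seed', s) for s in seeds})
# ranked = sorted((n for n in scores if n[0] == 'mention'),
#                 key=scores.get, reverse=True)
```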

  23. Differences from prior work • Graph-based distances vs. bootstrapping • Graph constructed on-the-fly • So it’s not different? • But there is a clear principle about how to combine results from earlier/later rounds of bootstrapping • i.e., graph proximity • Fewer parameters to consider • Robust to “bad wrappers”

  24. Evaluation Datasets: closed sets

  25. Evaluation Method • Mean Average Precision (MAP) • Commonly used for evaluating ranked lists in IR • Contains recall- and precision-oriented aspects • Sensitive to the entire ranking • Mean of the average precisions for each ranked list: AvgPrec(L) = (1 / #TrueEntities) × Σ over ranks r of Prec(r) × NewEntity(r), where L = ranked list of extracted mentions, Prec(r) = precision at rank r, #TrueEntities = total number of true entities in this dataset, and NewEntity(r) = 1 iff (a) the extracted mention at r matches some true mention and (b) no extracted mention at a rank less than r is of the same entity as the one at r • Evaluation Procedure (per dataset): • Randomly select three true entities and use their first listed mentions as seeds • Expand the three seeds obtained in step 1 • Repeat steps 1 and 2 five times • Compute MAP over the five ranked lists
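The average-precision definition above translates directly into code. A minimal sketch; the `entity_of` map, which resolves an extracted mention to its true entity (or None for a wrong mention), is a hypothetical stand-in for the gold data.

```python
def average_precision(ranked_mentions, entity_of, num_true_entities):
    """AvgPrec(L): sum Prec(r) over ranks r where a new entity first
    appears, normalized by the number of true entities."""
    seen, n_correct, total = set(), 0, 0.0
    for r, mention in enumerate(ranked_mentions, start=1):
        entity = entity_of.get(mention)
        if entity is None:
            continue
        n_correct += 1                 # (a) matches some true mention
        if entity not in seen:         # (b) first mention of this entity
            seen.add(entity)
            total += n_correct / r     # Prec(r)
    return total / num_true_entities

def mean_average_precision(ap_values):
    """MAP: mean of the average precisions of the five ranked lists."""
    return sum(ap_values) / len(ap_values)
```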

  26. Experimental Results: 3 seeds • Vary: [Extractor] + [Ranker] + [Top N URLs] • Extractor: • E1: Baseline Extractor (longest common context for all seed occurrences) • E2: Smarter Extractor (longest common context for 1 occurrence of each seed) • Ranker: { EF: Baseline (Most Frequent), GW: Graph Walk } • N URLs: { 100, 200, 300 }

  27. Side-by-side comparisons (Talukdar, Brants, Liberman, Pereira, CoNLL 06)

  28. Side-by-side comparisons • EachMovie vs. WWW • NIPS vs. WWW (Ghahramani & Heller, NIPS 2005)

  29. Why does SEAL do so well? Free-text wrappers are only 10-15% of all wrappers learned: “Used [...] Van Pricing”, “Used [...] Engines”, “Bell Road [...]”, “Alaska [...] dealership”, “www.sunnyking[...].com”, “engine [...] used engines”, “accessories, [...] parts”, “is better [...] or” • Hypotheses: • More information appears in semi-structured documents than in free text • More semi-structured documents can be (partially) understood with character-level wrappers than with HTML-level wrappers

  30. Comparing character tries to HTML-based structures

  31. Outline • History • Open-domain IE by pattern-matching • The bootstrapping-with-noise problem • Bootstrapping as a graph walk • Open-domain IE as finding nodes “near” seeds on a graph • Set expansion - from a few clean seeds • Iterative set expansion – from many noisy seeds • Iterative set expansion – from a concept name alone • Multilingual set expansion • Relational set expansion

  32. A limitation of the original SEAL

  33. Proposed Solution: Iterative SEAL (iSEAL) (Wang & Cohen, ICDM 2008) • Makes several calls to SEAL; each call… • Expands a couple of seeds • Aggregates statistics • Evaluate iSEAL using… • Two iterative processes • Supervised vs. Unsupervised (Bootstrapping) • Two seeding strategies • Fixed Seed Size vs. Increasing Seed Size • Five ranking methods

  34. iSEAL (Fixed Seed Size, Supervised) • Start from the initial seeds; each SEAL call expands a few of them • … finally rank nodes by proximity to seeds in the full graph • Refinement (ISS): increase the size of the seed set for each expansion over time: 2, 3, 4, 4, … • Variant (Bootstrap): use high-confidence extractions when seeds run out
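A hedged sketch of the iSEAL control loop under those options; `seal_expand` (one SEAL call returning graph edges) and `rank` (proximity ranking over the aggregated graph, e.g. the rwr() sketch earlier) are hypothetical stand-ins.

```python
import random

def iseal(seeds, seal_expand, rank, rounds=5, seed_sizes=None,
          bootstrap=False):
    """Make several SEAL calls, each expanding a small seed subset, and
    aggregate all edges into one graph before the final ranking."""
    seed_sizes = seed_sizes or [2] * rounds   # ISS variant: [2, 3, 4, 4, ...]
    pool, edges = list(seeds), []
    for k in seed_sizes:
        edges += seal_expand(random.sample(pool, min(k, len(pool))))
        if bootstrap:
            # when trusted seeds run out, recycle high-confidence extractions
            pool = list(seeds) + [m for m, _ in rank(edges, seeds)[:10]]
    return rank(edges, seeds)   # proximity to seeds in the full graph
```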

  35. Ranking Methods Random Graph Walk with Restart • H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its application. In ICDM, 2006. PageRank • L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1998. Bayesian Sets (over flattened graph) • Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005. Wrapper Length • Weights each item based on the length of common contextual string of that item and the seeds Wrapper Frequency • Weights each item based on the number of wrappers that extract the item

  36. Little difference between ranking methods for the supervised case (all seeds correct); large differences when bootstrapping. Increasing seed size {2,3,4,4,…} makes all ranking methods improve steadily in the bootstrapping case.

  37. Outline • History • Open-domain IE by pattern-matching • The bootstrapping-with-noise problem • Bootstrapping as a graph walk • Open-domain IE as finding nodes “near” seeds on a graph • Set expansion - from a few clean seeds • Iterative set expansion – from many noisy seeds • Relational set expansion • Multilingual set expansion • Iterative set expansion – from a concept name alone

  38. Relational Set Expansion [Wang & Cohen, EMNLP 2009] • Seed examples are pairs: • E.g., audi::germany, acura::japan, … • Extension: find wrappers in which pairs of seeds occur • With specific left & right contexts • In a specific order (audi before germany, …) • With a specific string between them • Variant of the trie-based algorithm
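Applying such a relational wrapper is the natural two-slot extension of the single-slot wrapper sketched earlier; this tiny illustration (all names hypothetical) shows the idea.

```python
import re

def apply_relational_wrapper(doc, left, middle, right):
    """Extract ordered (x, y) pairs, e.g. ('audi', 'germany'), bounded by
    a left context, a specific middle string, and a right context."""
    pat = (re.escape(left) + r"(.+?)" + re.escape(middle)
           + r"(.+?)" + re.escape(right))
    return re.findall(pat, doc)
```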

  39. Results [Charts: ranking quality after the first iteration vs. the tenth iteration]

  40. Outline • History • Open-domain IE by pattern-matching • The bootstrapping-with-noise problem • Bootstrapping as a graph walk • Open-domain IE as finding nodes “near” seeds on a graph • Set expansion - from a few clean seeds • Iterative set expansion – from many noisy seeds • Relational set expansion • Multilingual set expansion • Iterative set expansion – from a concept name alone

  41. Multilingual Set Expansion

  42. Multilingual Set Expansion • Basic idea: • Expand in language 1 (English) with seeds s1,s2 to S1 • Expand in language 2 (Spanish) with seeds t1,t2 to T1. • Find first seed s3 in S1 that has a translation t3 in T1. • Expand in language 1 (English) with seeds s1,s2,s3 to S2 • Find first seed t4 in T1 that has a translation s4 in S2. • Expand in language 2 (Sp.) with seeds t1,t2,t3 to T2. • Continue….
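A hedged sketch of this alternation, assuming hypothetical per-language expanders and a translations(s) helper implementing the heuristic on slide 44.

```python
def bilingual_expand(en_seeds, es_seeds, expand_en, expand_es,
                     translations, rounds=4):
    """Alternate SEAL expansions in two languages; each round promotes the
    first result on one side whose translation appears on the other."""
    for _ in range(rounds):
        S = expand_en(en_seeds)        # ranked English expansion
        T = expand_es(es_seeds)        # ranked Spanish expansion
        for s in S:                    # first s with a translation in T
            t = next((t for t in translations(s) if t in T), None)
            if t is not None:
                en_seeds = en_seeds + [s]
                es_seeds = es_seeds + [t]
                break
    return en_seeds, es_seeds
```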

  43. Multilingual Set Expansion • What’s needed: • Set expansion in two languages • A way to decide if s is a translation of t

  44. Multilingual Set Expansion • Submit s as a query and ask for results in language T • Find chunks in language T in the snippets that frequently co-occur with s • Chunks are bounded by changes in character set (e.g., English to Chinese) or by punctuation • Rank chunks by a combination of proximity and frequency • Consider the top 3 chunks t1, t2, t3 as likely translations of s
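The chunk-ranking step might look like the following sketch; `snippets` (search results for s restricted to language T) and `chunks_of` (a splitter on script/punctuation boundaries, yielding chunk text and position) are assumed helpers, and the proximity-frequency combination shown is one plausible choice, not the paper's exact formula.

```python
from collections import Counter

def likely_translations(s, snippets, chunks_of, k=3):
    """Score target-language chunks by frequency, boosted by how close
    each occurrence is to the query term s; return the top k."""
    scores = Counter()
    for snip in snippets:
        pos = snip.find(s)
        if pos < 0:
            continue
        for chunk, chunk_pos in chunks_of(snip):
            scores[chunk] += 1.0 / (1 + abs(chunk_pos - pos))
    return [c for c, _ in scores.most_common(k)]
```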

  45. Multilingual Set Expansion
