
Mining Reference Tables for Automatic Text Segmentation



Presentation Transcript


  1. Mining Reference Tables for Automatic Text Segmentation Eugene Agichtein Columbia University Venkatesh Ganti Microsoft Research

  2. Scenarios • Importing unformatted strings into a target structured database • Data warehousing • Data integration • Requires each string to be segmented into the target relation schema • Input strings are prone to errors (e.g., data warehousing, data exchange)

  3. Current Approaches • Rule-based • Hard to develop, maintain, and deploy comprehensive sets of rules for every domain • Supervised • E.g., [BSD01] • Hard to obtain comprehensive datasets needed to train robust models

  4. Our Approach • Exploit large reference tables • Learn domain-specific dictionaries • Learn structure within attribute values • Challenges • Order of attribute concatenation in future test input is unknown • Robustness to errors in test input after training on clean and standardized reference tables

  5. Problem Statement • Target schema: R[A1,…,An] • For a given string s (a sequence of tokens) • segment s into substrings s1,…,sn at token boundaries • map s1,…,sn to attributes Ai1,…,Ain • maximize P(Ai1|s1)*…*P(Ain|sn) among all possible segmentations of s • The product combination function handles arbitrary concatenation orders of attribute values • The probability P(Ai|x) that a string x belongs to Ai is estimated by an Attribute Recognition Model, ARMi • ARMs are learned from a reference relation r[A1,…,An]
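To make the objective on this slide concrete, the sketch below enumerates every way to split a token sequence into n contiguous substrings and every attribute order, scoring each candidate by the product P(Ai1|s1)*…*P(Ain|sn). This is a minimal illustration, not the paper's algorithm (the actual system uses dynamic programming at runtime); `arm_prob` is a hypothetical stand-in for the learned ARM score.

```python
from itertools import permutations

def all_splits(tokens, k):
    """Yield every way to split `tokens` into k non-empty contiguous parts."""
    if k == 1:
        yield [tokens]
        return
    for i in range(1, len(tokens) - k + 2):
        for rest in all_splits(tokens[i:], k - 1):
            yield [tokens[:i]] + rest

def best_segmentation(tokens, attrs, arm_prob):
    """Brute-force search over all segmentations and attribute orders,
    maximizing P(Ai1|s1) * ... * P(Ain|sn).  Exponential; shown only to
    make the objective concrete.  arm_prob(attr, part) is a hypothetical
    stand-in for the learned ARM score P(attr | substring)."""
    best_score, best_assignment = 0.0, None
    for parts in all_splits(tokens, len(attrs)):
        for order in permutations(attrs):
            score = 1.0
            for attr, part in zip(order, parts):
                score *= arm_prob(attr, part)
            if score > best_score:
                best_score, best_assignment = score, list(zip(order, parts))
    return best_score, best_assignment
```

Because each attribute's probability is estimated independently, the product scores a segmentation the same way under any concatenation order, which is what lets the search consider all orders.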

  6. Segmentation Architecture

  7. ARMs • Design goals • Accurately distinguish an attribute value from other attributes • Generalize to unobserved/new attribute values • Robust to input errors • Able to learn over large reference tables

  8. ARM: Instantiation of HMMs • Purpose: Estimate probabilities of token sequences belonging to attributes • ARM: instantiation of HMMs (sequential models) • Acceptance probability: product of emission and transition probabilities
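The acceptance probability mentioned on this slide can be sketched for a single, fixed state path: walk the path and multiply each transition probability by the matching emission probability. The dictionary-based `emission` and `transition` parameters here are toy stand-ins for the learned ARM parameters, not the paper's data structures.

```python
def acceptance_probability(tokens, states, emission, transition, start="BEGIN"):
    """Probability that an HMM accepts `tokens` along a given state path:
    the product of transition and emission probabilities.
    emission[(state, token)] and transition[(prev, state)] are toy
    dictionaries standing in for learned ARM parameters; unseen pairs
    score zero."""
    prob, prev = 1.0, start
    for token, state in zip(tokens, states):
        prob *= transition.get((prev, state), 0.0)
        prob *= emission.get((state, token), 0.0)
        prev = state
    return prob
```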

  9. Instantiating HMMs • Instantiation has to define • Topology: states & transitions • Emission & transition probabilities • Current automatic approaches search for a topology among a pre-defined class of topologies using cross-validation [FC00, BSD01] • Expensive • The number of states is kept small to keep the search space tractable

  10. Intuition behind ARM Design • Street address examples • [nw 57th St], [Redmond Woodinville Rd] • Album names • [The best of eagles], [The fury of aquabats], [Colors Soundtrack] • Large dictionaries (e.g., aquabats, soundtrack, st, …) to exploit • Begin and end tokens are very important for distinguishing values of an attribute (nw, st, the, …) • Can learn patterns on tokens (e.g., 57th generalizes to *th) • Need robustness to input errors • [Best of eagles] for [The best of eagles], [nw 57th] for [nw 57th st]
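The "57th generalizes to *th" idea can be sketched as one back-off step in a token feature hierarchy. The specific rules below are hypothetical, modeled only on the slide's example; the point is that a specific token relaxes to a coarser pattern so unseen values can still match a learned state.

```python
import re

def generalize(token):
    """One back-off step in a token feature hierarchy (hypothetical rules
    modeled on the slide's example): a specific token is relaxed to a
    coarser pattern, so an unseen '42nd' can match the same state as '57th'."""
    if re.fullmatch(r"\d+(st|nd|rd|th)", token):
        return "*" + token[-2:]      # 57th -> *th, 42nd -> *nd
    if token.isdigit():
        return r"\d+"                # 1200 -> \d+
    if token.isalpha():
        return "word"                # redmond -> generic word class
    return "*"                       # anything else -> top of the hierarchy
```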

  11. Large Number of States • Associate a state per token: Each state only emits a single base token • More accurate transition probabilities • Model sizes for many large reference tables are still within a few megabytes • Not a problem with current main memory sizes! • Prune the number of states (say, remove low frequency tokens) to limit the ARM size

  12. BMT Topology: Relax Positional Specificity • A single state per distinct symbol within a category (Begin, Middle, Trailing); the emission probability of a symbol is the same anywhere within a category

  13. Feature Hierarchy: Relax Token Specificity [BSD01]

  14. Example ARM for Address

  15. Robustness Operations: Relax Sequential Specificity • Make ARMs robust to common errors in the input, i.e., maintain high probability of acceptance despite these errors • Common types of errors [HS98] • Token deletions • Token insertions • Missing values • Intuition: Simulate the effects of such erroneous values over each ARM
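One of these robustness operations (the token-insertion case from slide 16) can be sketched over a toy BMT emission table: tokens emitted by the BEGIN state are also made emittable by the MIDDLE state, at a discount, so an input where an extra token pushes the true begin token into a middle position still scores well. The `copy_weight` discount is a hypothetical knob, not a value from the paper.

```python
def add_insertion_robustness(emission, copy_weight=0.1):
    """Simulate the effect of token insertions on a BMT-topology ARM:
    tokens emitted by the BEGIN state ('B') are also made emittable by
    the MIDDLE state ('M') at a discount.  copy_weight is a hypothetical
    discount factor; existing MIDDLE emissions are left untouched."""
    robust = dict(emission)
    for (state, token), p in emission.items():
        if state == "B" and ("M", token) not in robust:
            robust[("M", token)] = copy_weight * p
    return robust
```

Token deletions and missing values would be simulated analogously, each operation perturbing the model rather than the (unavailable) test data.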

  16. Robustness Operations • Simulating the effect of token insertions: the token and its corresponding transition probabilities are copied from the BEGIN state to the MIDDLE state

  17. Transition Probabilities • Transitions B→M, B→T, M→M, and M→T are allowed • Learned from examples in the reference table • Transition probabilities are also weighted by their ability to distinguish an attribute • A transition "*" → "*" that is common across many attributes gets low weight
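The down-weighting of transitions shared across attributes can be sketched with an IDF-style weight. The exact weighting formula below is an assumption (the slide only says that transitions common to many attributes get low weight): a transition observed in many attributes' ARMs, such as "*" → "*", is penalized, while attribute-specific transitions are boosted.

```python
import math

def distinctiveness_weights(transitions_per_attr):
    """IDF-style weights for transitions (the formula is an assumption;
    the slide only states the qualitative behavior).  transitions_per_attr
    maps each attribute to the transitions observed in its ARM; a
    transition seen in many attributes gets a low weight."""
    n_attrs = len(transitions_per_attr)
    doc_freq = {}
    for transitions in transitions_per_attr.values():
        for t in set(transitions):
            doc_freq[t] = doc_freq.get(t, 0) + 1
    return {t: math.log(1.0 + n_attrs / df) for t, df in doc_freq.items()}
```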

  18. Summary of ARM Instantiation • BMT topology • Token hierarchy to generalize observed patterns • Robustness operations on HMMs to address input errors • One state per token in reference table to exploit large dictionaries

  19. Attribute Order Determination • If attribute order is known • Can use dynamic programming algorithm to segment [Rabiner89] • If attribute order is unknown • Can ask the user to provide attribute order • Can discover attribute order • Naïve expensive strategy: evaluate all concatenation orders and segmentations for each input string • Consistent Attribute Order Assumption: the attribute order is the same across a batch of input tuples • Several datasets on the web satisfy this assumption • Allows us to efficiently • Determine the attribute order over a batch of tuples • Segment input strings (using dynamic programming)
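Once the attribute order is fixed (known, user-provided, or discovered over a batch under the consistent-order assumption), segmentation reduces to the dynamic program sketched below: best[i][j] is the best score for mapping the first j tokens onto the first i attributes. As before, `arm_prob` is a hypothetical stand-in for the ARM score; this is an illustration of the DP idea, not the paper's implementation.

```python
def segment_known_order(tokens, order, arm_prob):
    """Dynamic-programming segmentation for a KNOWN attribute order.
    best[i][j] = best score for assigning the first j tokens to the first
    i attributes; each attribute receives a non-empty contiguous span.
    arm_prob(attr, part) is a hypothetical stand-in for P(attr | substring)."""
    n, m = len(order), len(tokens)
    NEG = float("-inf")
    best = [[NEG] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 1.0
    for i in range(1, n + 1):
        for j in range(i, m + 1):
            for k in range(i - 1, j):        # k = previous segment boundary
                if best[i - 1][k] == NEG:
                    continue
                score = best[i - 1][k] * arm_prob(order[i - 1], tokens[k:j])
                if score > best[i][j]:
                    best[i][j], back[i][j] = score, k
    parts, j = [], m                          # recover the segmentation
    for i in range(n, 0, -1):
        k = back[i][j]
        parts.append((order[i - 1], tokens[k:j]))
        j = k
    return best[n][m], parts[::-1]
```

This runs in O(n·m²) score evaluations per string, versus the exponential cost of trying every order and split.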

  20. Segmentation Algorithm (runtime)

  21. Experimental Evaluation • Reference relations from several domains • Addresses: 1,000,000 tuples • [Name, #1, #2, Street Address, City, State, Zip] • Media: 280,000 tuples • [ArtistName, AlbumName, TrackName] • Bibliography: 100,000 tuples • [Title, Author, Journal, Volume, Month, Year] • Compare CRAM (our system) with DataMold [BSD01]

  22. Test Datasets • Naturally erroneous datasets: unformatted input strings seen in operational databases • Media • Customer addresses • Controlled error injection: • Clean reference table tuples → [Inject errors] → Concatenate to generate input strings • Evaluate whether a segmentation algorithm recovered the original tuple • Accuracy measure: % of attribute values correctly recognized

  23. Overall Accuracy (DBLP, Addresses)

  24. Topology & Robustness Operations (Addresses)

  25. Training on Hypothetical Error Models

  26. Exploiting Dictionaries: Accuracy vs. Reference Table Size

  27. Conclusions • Reference tables leveraged for segmentation • Combining ARMs based on independence allows segmenting input strings with unknown attribute order • ARM models learned over clean reference relations can accurately segment erroneous input strings • BMT topology • Robustness operations • Exploiting large dictionaries

  28. Model Sizes & Pruning (accuracy; #states & transitions; model size in MB)

  29. Order Determination Accuracy

  30. Topology (Media)

  31. Specificities of HMM Models • Model “specificity” restricts accepted token sequences • Positional specificity • Number ending in ‘th|st’ can only be the 2nd token in an address value • Token specificity • Last state only accepts “st, rd, wy, blvd” • Sequential specificity • “st, rd, wy, blvd” have to follow a number in ‘st|th’

  32. Robustness Operations • Token insertion • Token deletion • Missing values
