
Joint Unsupervised Structure Discovery and Information Extraction

This paper presents a new approach for information extraction by combining unsupervised structure discovery with information extraction in an automatic setting. The proposed method detects the structure of each record and extracts attribute values without any user intervention. It integrates a structure discovery algorithm into the information extraction process, resulting in accurate and efficient extraction of attribute values from semi-structured data records.


Presentation Transcript


  1. Joint Unsupervised Structure Discovery and Information Extraction • Eli Cortez, Daniel Oliveira, Altigran S. da Silva, Edleno S. de Moura, Alberto H. F. Laender • Univ. Fed. de Minas Gerais (UFMG), Brazil; Univ. Fed. do Amazonas (UFAM), Brazil • Presented by Eli Cortez • ACM SIGMOD Conference, Athens, Greece - June 2011

  2. The IETS Problem • Information Extraction by Text Segmentation • Goal: • To extract attribute values occurring in implicit semi-structured data records • Current IETS methods are able to accurately predict a sequence of labels to be assigned to a sequence of text segments corresponding to attribute values • HMM – Borkar et al. (SIGMOD01) • CRF – Lafferty et al. (ICML01) • ONDUX – Cortez et al. (SIGMOD10)

  3. Examples – Delimited Records Product Descriptions Apple iPad 2 Wi-Fi + 3G 64 GB - Apple iOS 4 1 GHz - Black $589 LG - 32LE5300 - 32" LED-backlit LCD TV - 1080p (FullHD) - $400 Samsung - UN55D7000 - 55" Class ( 54.6" viewable ) LED-backlit LCD ... $2,048 Mixter Max Accessory Plasma TV Rack Tilt Bracket 248-A05 $65 HP Deskjet 3050 All-in-One Color Ink-jet - Printer / copier / scanner $50 Bibliographic Citations L. Barbosa and J. Freire. Using Latent-structure to Detect … In Proc. of the 13th WeDB, pages 1–6, 2010. A. Doan et. al. Information Extraction Challenges in Managing .. SIGMOD Record, 37(4):14–20, 2008. J. Pearl and G. Shafer. Probabilistic reasoning in intelligent systems: Morgan Kaufmann, 1988. Classified Ads $1106 / 2br - Luxury 2 BR, 1 BA apartment loaded with amenities - (Bothell) $1945 / 2br - Beautiful HighPoint Community "Built Green" 2 BR 2.5 Bth Town Home! - (West Seattle) $735 / 1br - Top floor 1 bedroom apt available just minutes from downtown!! - (Seattle,Burien,Highline) $820 / 1br - Lovely 1 bedroom 1k sq ft! Nearly a 2 bdrm! - (Federal Way,Edgewood,Milton, Tacoma) $895 / 2br - ****Lovely 2-Bedroom/2-Bathroom Condo with a View! FREE RENT!!!**** - (Monroe)

  4. Example Non-delimited Records Chocolate Cake Recipe 1/2 cup butter 2 eggs 4 cups white sugar ground cinnamon 2 tablespoons dark rum 6 chopped pecans 1/2 cup milk 1 1/2 cups applesauce 2 cups all-purpose flour 1/4 cup cocoa powder 2 teaspoons baking soda 1/8 teaspoon salt 1 cup raisins 1/4 cup dark rum

  5. Current IETS Methods • Assume input records are already separated • e.g., manually by a user or using HTML-based heuristics • Unfeasible in fully automatic settings 1/2 cup butter 2 eggs 4 cups white sugar ground cinnamon 2 tablespoons dark rum 6 chopped pecans 1/2 cup milk 1 1/2 cups applesauce … 1/2 cup butter 2 eggs 4 cups white sugar ground cinnamon 2 tablespoons dark rum 6 chopped pecans 1/2 cup milk 1 1/2 cups applesauce …

  6. JUDIE • Structure Discovery + Information Extraction • Jointly carried out in an unsupervised way • Suitable for fully automatic settings: raw text streaming, crawler output, micro-blogs, etc 1/2 cup butter 2 eggs 4 cups white sugar ground cinnamon 2 tablespoons dark rum 6 chopped pecans 1/2 cup milk 1 1/2 cups applesauce … JUDIE

  7. JUDIE • Joint Unsupervised Structure Discovery and Information Extraction • Introduces a new Structure Discovery Algorithm • Detects the structure of each individual record being extracted without any user intervention • Looks for frequent patterns of label repetitions or cycles • Integrates this algorithm in the IE process • Accomplished by successive refinement steps that alternate information extraction and structure discovery

  8. Related Work – IETS Approaches/Methods • Probabilistic – Supervised • Hidden Markov Models (HMM) • Borkar et al.@SIGMOD’01; McCallum et al.@AAAI’00 • Conditional Random Fields (CRF) • Lafferty et al.@ICML’01; McCallum et al.@IPM’06 • Require training instances labeled on each input text <Neighborhood>Regent Square </Neighborhood> <Price> $228,900 </Price> <No>1028 </No><Street>Mifflin Ave, </Street> <Bed>6 Bedrooms </Bed> <Bath> 2 Bathrooms </Bath> <Phone>412-638-7273 </Phone>

  9. Related Work - IETS Approaches / Methods • Probabilistic – Unsupervised • Rely on previously built datasets • Unsup. HMM (Agichtein et al.@SIGKDD ‘04) • Rely on records in reference tables • Batches of fixed-order records as input • Unsup. CRF (Zhao et al. @SIAM ICDM’08) • Also reference tables • Batches of fixed-order records as input • ONDUX (Cortez et al. @SIGMOD’10) • Knowledge-base: sets of typical values per attribute – no records • All of them require one input record at a time • No structure discovery

  10. JUDIE Overview 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla

  11. JUDIE Overview 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla 1st IE Step: Structure-free Labeling

  12. JUDIE Overview 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla 1st SD Step: Structure Sketching

  13. JUDIE Overview 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U U I U I Q U I Q I U I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla 2nd IE Step: Structure-aware Labeling

  14. JUDIE Overview 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U U I U I Q U I Q I U I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U U I U I Q U I Q I Q I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla 2nd SD Step: Structure Refinement

  15. JUDIE – Structure-free Labeling • What is the best label for each segment? • No structural information is available • Initially labels potential values with attribute names • No information on the structure of the data records • Resort only to content-related features • Learned from the pre-existing KB 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla

  16. Features – Content Related • Features Considered: KB Bayes. Noisy OR A1 Attribute Vocabulary A2 Ingredient Value Range White sugar A3 Value Format
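
The slide names a Bayesian noisy-OR as the way the content-related features are combined. A minimal sketch follows, assuming each feature yields an independent score in [0, 1] for a given segment and attribute. The feature names and score values are hypothetical, and the exact formula used by JUDIE may differ (for example, by a normalization term).

```python
from math import prod


def noisy_or(scores):
    """Combine independent feature scores in [0, 1] with a noisy-OR.

    The result is high as soon as any single feature is confident.
    """
    return 1.0 - prod(1.0 - s for s in scores)


# Hypothetical scores for the segment "White sugar" and the attribute Ingredient.
scores = {"vocabulary": 0.7, "value_range": 0.0, "value_format": 0.5}
combined = noisy_or(scores.values())
print(round(combined, 2))
```

A feature that does not apply (score 0.0) leaves the combined score unchanged, which is why a noisy-OR suits features that are only informative for some attributes.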

  17. JUDIE – Structure-free Labeling • Initially labels potential values with attribute names • No information on the structure of the data records • Resort only to content-related features • Learned from the pre-existing KB 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Limitations: Label Fault : “Tbsp” Misassignment : “a little”

  18. JUDIE – Structure Sketching • Organizes the labeled candidate values into records • Induces a structure on the unstructured text input • Outputs labeled values grouped into records • Uses a novel algorithm called Structure Discovery (SD) Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla

  19. The SD Algorithm • Uncovers the structure of implicit records from the input text. • Used in the Structure Sketching and Structure Refinement steps • Takes as input a sequence of labels and generates the structure of each record • Assumption: It is possible to identify patterns of sequences by looking for cycles in a graph (Adjacency Graph) that models the ordering of labels
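
The adjacency graph can be pictured as a directed graph with one node per label and one edge per observed "label A is immediately followed by label B" pair. The sketch below is an illustration under that assumption, not the authors' implementation; the function name and the edge-count representation are hypothetical.

```python
from collections import defaultdict


def build_adjacency_graph(labels):
    """Return {label: {next_label: count}} for consecutive label pairs."""
    graph = defaultdict(lambda: defaultdict(int))
    for current, following in zip(labels, labels[1:]):
        graph[current][following] += 1
    # Convert to plain dicts so the result is easy to inspect and compare.
    return {a: dict(successors) for a, successors in graph.items()}


# Label sequence for two citations placed back to back (illustrative input).
labels = [
    "Author", "Title", "Conference", "Year",
    "Author", "Title", "Journal", "Issue", "Year",
]
graph = build_adjacency_graph(labels)
for label, successors in graph.items():
    print(label, "->", successors)
```

The edge from Year back to Author is what closes a cycle: it marks the point where one record ends and the next begins.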

  20. The SD Algorithm Title Conference Year Author Author Title Conference Year Author Title Conference Year … Author Title Journal Issue Year Author Title Journal Issue Year Author Author Journal Issue Year Title Year … Author Title Conference Year Author Author Author Title Journal Issue Year Conference Title Year Author Journal Issue

  21. The SD Algorithm Exploits the occurrence of cycles in the adjacency graph [Author, Title, Conference, Year] [Author, Title, Journal, Issue, Year] [Title,Conference, Year] Conference Title Year Author Journal Issue
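
To make the idea of cycles in the adjacency graph concrete, the following sketch enumerates the simple cycles of a small directed graph with a depth-first search. The graph literal is an assumed reading of the slide's diagram, and plain cycle enumeration is a stand-in: the slides do not say how the SD algorithm searches for cycles.

```python
def simple_cycles(graph):
    """Enumerate simple cycles of a directed graph given as {node: [successors]}.

    Each cycle is reported once, starting at its alphabetically smallest node.
    """
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    rank = {node: i for i, node in enumerate(nodes)}
    cycles = []

    def dfs(start, node, path, on_path):
        for nxt in graph.get(node, ()):
            if nxt == start:
                cycles.append(list(path))
            elif rank[nxt] > rank[start] and nxt not in on_path:
                path.append(nxt)
                on_path.add(nxt)
                dfs(start, nxt, path, on_path)
                on_path.discard(nxt)
                path.pop()

    for start in nodes:
        dfs(start, start, [start], {start})
    return cycles


# Assumed edges for the bibliographic example.
graph = {
    "Author": ["Title"],
    "Title": ["Conference", "Journal"],
    "Conference": ["Year"],
    "Journal": ["Issue"],
    "Issue": ["Year"],
    "Year": ["Author", "Title"],
}
cycles = simple_cycles(graph)
for cycle in cycles:
    print(cycle)
```

With these edges the search also reports a Title, Journal, Issue, Year cycle (in rotated order) that the slide does not list. A graph alone can admit cycles that never occur as records in the input, so some further filtering against the input is needed.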

  22. The SD Algorithm Coincident Cycles Title Conference Year Author Author Title Conference Year Author Title Conference Year … Author Title Journal Issue Year Author Title Journal Issue Year Author Author Journal Issue Year Title Year … Author Title Conference Year Author Author Author Title Journal Issue Year Viable Cycle Conference Title Year Author Journal Issue

  23. The SD Algorithm • Dominant Cycles • Given the set of coincident cycles that are also viable, the dominant cycles are the most frequent ones in the input • Finally, the algorithm works by first identifying all dominant cycles in the adjacency graph and then processing each of these cycles • In our given examples, the dominant cycles are: • [Author, Title, Journal, Issue, Year] • [Author, Title, Conference, Year] • [Author, Journal, Issue, Year] • [Title, Conference, Year] • [Title, Year]

  24. JUDIE – Structure Sketching • Organizes the labeled candidate values into records • Induces a structure on the unstructured text input • Outputs labeled values grouped into records • Uses a novel algorithm called Structure Discovery (SD) Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla

  25. JUDIE – Structure-aware Labeling • Now, what is the best label for each segment? • We already know some structural information • Re-labels segments considering content-related features and structure-based features • Structure-based features learned using a graphical model (PSM) Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla

  26. Positioning and Sequencing Model (PSM) • Built from the Structure Sketching output • States: attribute labels • Likelihood of: • absolute position of labels within text segments • relative position considering other labels 5% 80% UNIT 90% 10% 95% START END QUANTITY INGREDIENT 20% 100%
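
The percentages in the diagram read as transition likelihoods between attribute labels, with START and END as extra states. The sketch below estimates such transitions from already-grouped records. It covers only the sequencing (relative position) part of the model, the record data is invented for illustration, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(records):
    """Estimate P(next label | current label) from label sequences.

    Each record is a list of attribute labels; START and END are added
    so that the model also captures how records begin and finish.
    """
    counts = defaultdict(Counter)
    for labels in records:
        path = ["START", *labels, "END"]
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: c / sum(successors.values()) for nxt, c in successors.items()}
        for state, successors in counts.items()
    }


# Hypothetical output of the structure sketching step for a recipe.
records = [
    ["QUANTITY", "UNIT", "INGREDIENT"],
    ["QUANTITY", "INGREDIENT"],
    ["QUANTITY", "UNIT", "INGREDIENT"],
    ["INGREDIENT"],
]
probs = transition_probabilities(records)
print(probs["START"])
print(probs["QUANTITY"])
```

Because the counts come from the input being processed rather than from hand-labeled training data, the model is rebuilt for every input text.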

  27. JUDIE – Structure-aware Labeling KB Content-related features Bayes. Noisy OR Quantity A little

  28. JUDIE – Structure-aware Labeling • Labels textual values considering both content-related features and structure-based features • Uses a graphical model representing the likelihood of attribute transitions within the input text Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U U I U I Q U I Q I Q I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla

  29. JUDIE – Structure Refinement • Applies the SD algorithm again • Considers the output of the structure-aware labeling • Fixes structural problems • Structure-aware labeling produces more precise results Q U I Q U U I U I Q U I Q I Q I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U U I U I Q U I Q I Q I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla

  30. JUDIE Overview Phase 1 Structure-free Labeling Structure Sketching Phase 2 Structure-aware Labeling Structure Refinement

  31. Experiments • Datasets previously used in other papers • Only 3 of the domains are discussed in this presentation. More results in the paper.

  32. Metrics • F-Measure • Harmonic mean of precision and recall • Attribute-Level • Results considering values of a single attribute in all output records • Record-Level • Results considering all attributes in a single record • Average of the results over all records • T-Test for the statistical validation of the results
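
As a worked example of the F-measure, the sketch below computes precision, recall and their harmonic mean for one attribute. The extracted and reference value sets are made up for illustration and are not taken from the experiments.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical attribute-level evaluation for the Ingredient attribute.
extracted = {"butter", "eggs", "flour", "salt"}
reference = {"butter", "eggs", "flour", "sugar", "rum"}

correct = len(extracted & reference)
precision = correct / len(extracted)   # 3 of 4 extracted values are right
recall = correct / len(reference)      # 3 of 5 reference values were found
score = f_measure(precision, recall)
print(round(precision, 2), round(recall, 2), round(score, 3))
```

For the record-level figure described on the slide, the same computation would be done per record over all of its attributes and the per-record scores averaged.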

  33. Evaluation – Attribute Level - Recipes • High-quality results for all attributes even in Phase 1 • Structural information in Phase 2 led to gains above 5% on average

  34. Evaluation – Attribute Level - CORA • Title and Journal have a large term overlap • Phase 2 was able to correct the mismatches from Phase 1

  35. Evaluation – Attribute Level – Web Ads • Input strings from several websites • Still, F = 0.84 on average • Value range feature was useful for Phone, etc.

  36. Evaluation – Record Level • Phase 1: acceptable (F≈0.7) • Phase 2: positive impact (Gains>9%) • In CORA, gains higher than 19% • Structural information led to significant improvements

  37. Structure Diversity Impact • How our method deals with a dataset that is heterogeneous in terms of structure • In CORA, 33 distinct styles were identified L. Barbosa and J. Freire. Using Latent-structure to Detect … In Proc. of the 13th WeDB, pages 1–6, 2010. A. Doan et. al. Information Extraction Challenges in Managing .. SIGMOD Record, 37(4):14–20, 2008. J. Pearl and G. Shafer. Probabilistic reasoning in intelligent systems: Morgan Kaufmann, 1988.

  38. Structure Diversity Impact • Perfect Labeling: all segments are correctly labeled

  39. Comparison with baselines – Attribute Level • Results very close to ONDUX and even better than U-CRF • Recall: JUDIE faces a harder task CORA Web Ads

  40. Knowledge Base Impact Achieves results comparable with the baselines for a considerably harder task. JUDIE is more dependent on the KB: the input does not contain structural information. # of common terms between the KB and the input

  41. Conclusions • Novel method for extracting semi-structured data records in the form of continuous text • Detects the structure of records being extracted • Integrates information extraction and structure discovery • Achieved good results in comparison with state-of-the-art methods while demanding less user effort • Suitable for fully automatic settings: raw text streaming, crawler output, micro-blogs, etc.

  42. Conclusions • Content-related / Domain-dependent features • Learned from a previously existing KB on the domain • Used for executing a structure-free labeling step • Structure-related / Source-dependent features • Learned from the structure-free labeling over the input text • Content-related features are used to induce structure-based features through successive refinement steps • Thus, no manual training for each input is required

  43. Future Work • Develop methods for automatically generating knowledge bases • Extend the SD algorithm to deal with nested structures

  44. Acknowledgments UFMG

  45. Thank you!

  46. Summary: JUDIE x Previous IETS

  47. Attribute Vocabulary

  48. Value Range

  49. Value Format • Value Format (Style) • First, a Markov model is generated for each attribute. • Then, the probability that the input mask sequence represents a path in the Markov model of each attribute is computed. 1.0 [A-Z][a-z]+ 1.0 End Start 0.2 0.8 White sugar [a-z][a-z]+ [A-Z]. [A-Z][a-z]+ [a-z][a-z]+ 1.0
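
The value format feature can be sketched as follows: each token is mapped to a symbol mask, a Markov model over mask transitions is estimated from the known values of one attribute, and a candidate value is scored by the probability of its mask path. The mask set, the KB values and all function names here are assumptions for illustration; the slide does not list the masks JUDIE actually uses.

```python
import re
from collections import Counter, defaultdict

# Hypothetical mask inventory; the first matching pattern wins.
MASKS = [
    (re.compile(r"[A-Z][a-z]+"), "[A-Z][a-z]+"),
    (re.compile(r"[a-z]+"), "[a-z]+"),
    (re.compile(r"[0-9]+"), "[0-9]+"),
]


def mask(token):
    """Map a token to the name of the first mask that matches it fully."""
    for pattern, name in MASKS:
        if pattern.fullmatch(token):
            return name
    return "OTHER"


def mask_path(value):
    return ["Start", *(mask(tok) for tok in value.split()), "End"]


def train(values):
    """Estimate mask transition probabilities from one attribute's KB values."""
    counts = defaultdict(Counter)
    for value in values:
        path = mask_path(value)
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(cs.values()) for b, c in cs.items()}
            for a, cs in counts.items()}


def path_probability(model, value):
    """Probability that the value's mask sequence is a path in the model."""
    p = 1.0
    path = mask_path(value)
    for a, b in zip(path, path[1:]):
        p *= model.get(a, {}).get(b, 0.0)
    return p


# Hypothetical KB values for the Ingredient attribute.
kb_ingredients = ["White sugar", "brown sugar", "Cocoa powder", "milk", "dark rum"]
model = train(kb_ingredients)
print(round(path_probability(model, "White sugar"), 4))
print(path_probability(model, "12 eggs"))
```

A value whose mask sequence contains a transition never seen for the attribute scores zero in this sketch; a practical version would likely smooth the probabilities.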

  50. Positioning and Sequencing Model
