ONDUX: On-Demand Unsupervised Learning for Information Extraction

A proposal for an unsupervised extraction method based on information retrieval to perform various IETS tasks, eliminating the need for user involvement in source-specific training processes and offering flexibility in extraction styles.

Presentation Transcript


  1. ONDUX: On-Demand Unsupervised Learning for Information Extraction • Eli Cortez, Altigran da Silva and Edleno de Moura, Federal University of Amazonas (UFAM), Brazil • Marcos Gonçalves, Federal University of Minas Gerais (UFMG), Brazil

  2. Agenda • Introduction • Information Extraction by Text Segmentation • Challenges • Related Work • ONDUX • Experiments • Conclusions and Future Work

  3. Introduction (1) • Abundance of on-line sources of text documents containing implicit semi-structured data records • Addresses • Bibliographic References • Classified Ads • Product Descriptions

  4. Introduction (II) • Classified Ad: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273 • Address: Dr. Robert A. Jacobson, 8109 Harford Road, Baltimore, MD 21214 • Bibliographic Reference: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006

  5. Introduction (III) • Why extract information? • Database Storage, Query… • Data Mining • Record Linkage • Example: the classified ad “Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273” yields <Neighborhood>: Regent Square; <Price>: $228,900; <No.>: 1028; <Street>: Mifflin Ave.; <Bed.>: 6 Bedrooms; <Bath.>: 2 Bathrooms; <Phone>: 412-638-7273

  6. IETS – Challenges (I) • Information Extraction by Text Segmentation (IETS) • Borkar@SIGMOD'01, McCallum@ICML'01, Agichtein@SIGKDD'04, Mansuri@ICDE'06, Zhao@SICDM'08, Cortez@JASIST'09 • Diversity of templates and styles • Attribute Ordering • Capitalization • Abbreviations • Different applications share similar domains • Ex.: Addresses and Ads • Records from both domains contain address information

  7. IETS – Challenges (II) • Diversity of templates and styles • Attribute Ordering; Capitalization; Abbreviations • The same reference as rendered by three sources: • HomePage: Link-based similarity measures for the classification of Web documents. Pável Calado. Journal of the American Society for the Information Science and Technology – 57(2) 2006 • DBLP: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno Silva de Moura, Berthier A. Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST 57 (2) 208-221(2006) • ACM: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006

  8. IETS – Challenges (III) • Existing approaches to this problem use Machine Learning techniques • Hidden Markov Models (HMM) • Conditional Random Fields (CRF) • Structured Support Vector Machines (SSVM) • (Semi-)supervised approaches require a hand-labeled training set created by an expert • Each generated model is particular to a given application • High computational cost

  9. Related Work (I) • [Borkar et al. @ SIGMOD 2001] • Supervised extraction method based on Hidden Markov Models (HMM) • [McCallum et al. @ ICML 2001] • Proposed the use of Conditional Random Fields (CRF), a supervised model (S-CRF) • [Mansuri et al. @ ICDE 2006] • Semi-supervised approach based on CRF models • All of these approaches require an expert to create a hand-labeled training set for each application.

  10. Related Work (II) • [Agichtein et al. @ SIGKDD 2004] • Use of reference tables to create an unsupervised model using Hidden Markov Models (HMM) • [Zhao et al. @ SIAM ICDM 2008] • Use of reference tables to create unsupervised CRF models (U-CRF) • Both of these models assume a single positioning and ordering of attributes in all test instances (what about distinct orderings?) • [Cortez et al. @ JASIST 2009] • Unsupervised method to extract bibliographic information • Relies on domain-specific heuristics, so it is not generally applicable.

  11. Contributions • Proposal of an extraction method based on information retrieval to perform IETS tasks; • Eliminates the need for user involvement in any source-specific training process; • Flexible, in the sense that it does not rely on any particular style to perform the extraction • Unsupervised Reinforcement Phase • Attribute ordering and positioning learned On-Demand • Experimental comparison with the state-of-the-art information extraction approach (CRF).

  12. Basic Concepts (I) • Given an input string I representing an implicit textual record (e.g., a classified ad), the IETS task consists of: • Segmenting I • Assigning to each segment a label corresponding to an attribute

  13. Basic Concepts (II) • Knowledge Base • Set of pairs KB = {(a_1, O_1), …, (a_n, O_n)}, where each a_i is an attribute and O_i is a set of known values for it • Easily built from pre-existing sources • Bibliographic DBs, Freebase, Google Fusion Tables, etc. • Example: KB = {(Neighborhood, O_Neigh.), (Street, O_Street), (Phone, O_Phone)} • O_Neigh. = {“Regent Square”, “Milenight Park”} • O_Street = {“Regent St.”, “Morewood Ave.”, “Square Ave. Park”} • O_Phone = {“323 462-6252”, “(171) 289-7527”}
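
As a rough illustration of the knowledge base on this slide, the sketch below stores it as a mapping from attribute name to a set of known values and derives a per-term index. The names `KB`, `tokenize` and `build_term_index`, and the whitespace tokenizer, are assumptions made for this example; the slides do not specify a storage format.

```python
from collections import defaultdict

# Hypothetical in-memory knowledge base: attribute name -> set of known values.
# The values are the ones shown on the slide.
KB = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}


def tokenize(value):
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return value.lower().split()


def build_term_index(kb):
    """Map each term to the number of values of each attribute that contain it."""
    index = defaultdict(lambda: defaultdict(int))
    for attribute, values in kb.items():
        for value in values:
            for term in set(tokenize(value)):
                index[term][attribute] += 1
    return index


index = build_term_index(KB)
# "square" occurs in one Neighborhood value and one Street value,
# which is the kind of ambiguity a matching step has to resolve.
print(dict(index["square"]))
```

An index of this kind makes it cheap to ask which attributes a term has been seen under, which is the information the later steps rely on.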

  14. ONDUX (I) • Three main steps • Blocking • Matching • Reinforcement

  15. ONDUX (II) • General View (figure: architecture diagram with step 1 marked)

  16. ONDUX (III) • Blocking • Split the input text into substrings called blocks; • Consider the co-occurrence of consecutive terms based on the KB • Example input: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273 • Terms that co-occur in the KB (Neighborhood) are grouped into one block • Terms with no presence in the KB are left separated
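
The blocking idea can be sketched as follows: consecutive terms are merged into one block when they appear together in at least one known KB value, and are otherwise left as separate blocks. This is a minimal sketch under assumptions, not the authors' implementation: the toy `KB`, the punctuation-stripping `normalize` helper and the greedy left-to-right merge are all choices made for this example.

```python
# Toy knowledge base (attribute -> known values), taken from the slide example.
KB = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
}


def normalize(term):
    """Lowercase and strip surrounding punctuation (assumed normalization)."""
    return term.lower().strip(".,;:")


def cooccur(kb, left, right):
    """True if both terms appear together in at least one known KB value."""
    for values in kb.values():
        for value in values:
            terms = {normalize(t) for t in value.split()}
            if left in terms and right in terms:
                return True
    return False


def blocking(kb, text):
    """Greedily group consecutive terms that co-occur in the KB into blocks."""
    blocks = []
    for term in text.split():
        if blocks and cooccur(kb, normalize(blocks[-1][-1]), normalize(term)):
            blocks[-1].append(term)   # extend the current block
        else:
            blocks.append([term])     # start a new block
    return [" ".join(block) for block in blocks]


print(blocking(KB, "Regent Square $228,900 1028 Mifflin Ave."))
```

With this toy KB, "Regent" and "Square" end up in the same block because they co-occur in a Neighborhood value, while "Mifflin" stays on its own because it never appears in the KB.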

  17. ONDUX (IV) • General View (figure: architecture diagram with steps 1 and 2 marked)

  18. ONDUX (V) • Matching • Associate each block generated in the previous phase with an attribute according to the Knowledge Base • Use distinct functions to compute the similarity between a block and the known values of the attributes in the KB

  19. ONDUX (VI) • Matching • Textual Values: FF Function (Field Frequency) • Similarity between the terms in the block and the terms of a given attribute of the KB • Numeric Values: NM Function (Numeric Matching) [Agrawal @ CIDR 2003] • Similarity between the value in the block and the mean and standard deviation of a numeric attribute in the KB
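
The slide describes the two matching functions only informally, so the sketch below uses assumed forms rather than the authors' definitions. `field_frequency` averages, over the block's terms, the share of each term's KB occurrences that fall under the candidate attribute; `numeric_matching` applies a Gaussian kernel centred on the attribute's mean and scaled by its standard deviation. Both function names and the sample prices are hypothetical.

```python
import math
from statistics import mean, pstdev

# Toy knowledge base (attribute -> known textual values), from the slide example.
KB = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
}


def field_frequency(block, attribute, kb):
    """Assumed FF-style score in [0, 1] for a textual block and one attribute."""
    terms = block.lower().split()
    if not terms:
        return 0.0
    total = 0.0
    for term in terms:
        # Number of values of this attribute, and of any attribute, containing the term.
        in_attr = sum(term in v.lower().split() for v in kb[attribute])
        in_all = sum(term in v.lower().split() for vals in kb.values() for v in vals)
        if in_all:
            total += in_attr / in_all
    return total / len(terms)


def numeric_matching(value, known_values):
    """Assumed NM-style score: Gaussian kernel around the attribute's mean."""
    mu = mean(known_values)
    sigma = pstdev(known_values)
    if sigma == 0:
        return 1.0 if value == mu else 0.0
    return math.exp(-((value - mu) ** 2) / (2 * sigma ** 2))


prices = [90, 100, 110]  # illustrative numbers, not from the source
print(field_frequency("Morewood Ave.", "Street", KB))
print(field_frequency("Regent Square", "Neighborhood", KB))
print(numeric_matching(105, prices))
```

In this toy KB, "Regent Square" scores the same for Neighborhood and Street because both of its terms also occur in Street values; ties and errors of this kind are what a later correction step has to resolve.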

  20. ONDUX (VI) • Matching • Example input: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273 • Labels shown in the figure after Matching: Street, Price, No., ???, Street, Bed., Bath., Phone

  21. ONDUX (VII) • How can we deal with blocks that were incorrectly labeled or were not associated with any attribute? • Example input: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273 • Labels shown in the figure after Matching: Street, Price, No., ???, Street, Bed., Bath., Phone

  22. ONDUX (VIII) • Reinforcement • Review the labeling task performed in the Matching step • Unmatched blocks must receive the label of some attribute • Mismatched blocks must be correctly relabeled • How to handle these cases? • Using positioning and sequencing information that is obtained On-Demand.

  23. ONDUX (IX) • General View (figure: architecture diagram with steps 2 and 3 marked)

  24. ONDUX (X) • Reinforcement • Given the extraction output of the Matching step, • ONDUX automatically builds a graphical structure, the PSM • PSM: Positioning and Sequencing Model.

  25. ONDUX (XI) • Reinforcement – PSM • In the PSM, the states represent the attributes of the KB, plus the special states start and end • Edges represent transition probabilities • Ordering and Positioning Probabilities are learned On-Demand from the test instances through the Matching Phase
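
One way to picture the PSM is as two tables of relative frequencies estimated from the labels that Matching assigned to the test instances themselves. The sketch below is an assumption-laden illustration: `build_psm` is a hypothetical name, unmatched blocks are represented by `None` and simply skipped, and probabilities are plain normalized counts.

```python
from collections import Counter, defaultdict


def build_psm(labeled_records):
    """Estimate sequencing and positioning probabilities from label sequences.

    labeled_records: list of label lists produced by a matching step,
    with None standing for an unmatched block (ignored here, by assumption).
    """
    transitions = defaultdict(Counter)  # previous label -> next label -> count
    positions = defaultdict(Counter)    # position index -> label -> count
    for labels in labeled_records:
        path = ["start"] + list(labels) + ["end"]
        for prev, nxt in zip(path, path[1:]):
            if prev is not None and nxt is not None:
                transitions[prev][nxt] += 1
        for k, label in enumerate(labels):
            if label is not None:
                positions[k][label] += 1
    # Normalize counts into probabilities.
    seq = {a: {b: n / sum(c.values()) for b, n in c.items()}
           for a, c in transitions.items()}
    pos = {k: {a: n / sum(c.values()) for a, n in c.items()}
           for k, c in positions.items()}
    return seq, pos


records = [
    ["Neighborhood", "Price", "Street"],
    ["Neighborhood", "Price", "Street"],
    ["Street", "Price", None],  # last block left unmatched by Matching
]
seq, pos = build_psm(records)
print(seq["Price"])  # what tends to follow Price
print(pos[0])        # what tends to appear first
```

Because the counts come from the input being processed, no separate training set is involved, which is the sense in which the slides call the model learned On-Demand.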

  26. ONDUX (XII) • Reinforcement • Remarks • The PSM is automatically learned On-Demand from test instances • No a priori training required • No assumptions regarding a particular order of attribute values • Relies on the very effective strategies deployed in the Matching Step

  27. ONDUX (XIII) • Reinforcement • Once the PSM is built, we combine the matching, sequencing and positioning evidence using the Bayesian OR operator (figure labels: Matching, Sequence, Positioning)
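
The slide names only a Bayesian OR operator. Assuming this refers to the usual noisy-OR combination of independent evidence, a block's final score for an attribute could be computed as one minus the product of the complements of the three scores. The function name `noisy_or` and the sample values are hypothetical.

```python
def noisy_or(*scores):
    """Combine independent evidence scores in [0, 1] with a noisy-OR.

    Returns 1 - prod(1 - s): any single strong source of evidence is enough
    to push the combined score close to 1.
    """
    remaining = 1.0
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
        remaining *= 1.0 - s
    return 1.0 - remaining


# Hypothetical scores for one block and one candidate attribute.
matching_score = 0.5     # from the Matching step
sequencing_score = 0.5   # transition probability from the previous label
positioning_score = 0.0  # probability of the attribute at this position
print(noisy_or(matching_score, sequencing_score, positioning_score))
```

Under this assumption, a block that Matching left unlabeled (matching score 0) can still receive a label when the sequencing or positioning evidence for some attribute is strong.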

  28. ONDUX (XIV) • Reinforcement • Extraction Result • Example input: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273 • Labels shown in the figure: Street, Neighborhood, Street, Price, No., ???, Street, Bed., Bath., Phone

  29. ONDUX (XV) • Overview (figure: architecture diagram with steps 1, 2 and 3 marked)

  30. Experiments (I) • Setup • We tested our proposed approach with several sources from 3 distinct domains: • Addresses • BigBook, Restaurants [RISE] • Bibliographic Data • CORA [Peng@IPM’06], PersonalBib [Mansuri@ICDE’06] • Classified Ads • 7 distinct newspaper sites [Oliveira@SBBD’06] • We limited the presentation to one experiment per domain. More in the paper

  31. Experiments (II) • Evaluation • Metrics • Precision, Recall and F-Measure • T-Test for the statistical validation of the results • Baselines • Conditional Random Fields (CRF) • U-CRF (Unsupervised method) [Zhao@SICDM’ 08] • S-CRF (Classical supervised method) [Peng@IPM’ 06]
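
For reference, the three metrics can be computed per attribute as in the sketch below. It assumes exact-match scoring between the set of extracted values and the set of reference values; the slides do not say how matches were counted, and `precision_recall_f1` is a hypothetical helper name.

```python
def precision_recall_f1(predicted, reference):
    """Compute precision, recall and F-measure for one attribute.

    predicted: set of values the extractor assigned to the attribute.
    reference: set of values that truly belong to the attribute.
    Exact set membership is an assumed matching criterion.
    """
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical extraction output for one attribute.
predicted = {"a", "b", "c", "d"}
reference = {"a", "b", "e"}
p, r, f = precision_recall_f1(predicted, reference)
print(p, r, f)
```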

  32. Experiments (III) • Extraction Quality • U-CRF results are similar to those in Zhao@SICDM (validation) • The dataset follows the single-order assumption • After Reinforcement, ONDUX achieved similar quality

  33. Experiments (IV) • Extraction Quality • CORA includes a variety of citation styles (conference, journal, books, etc.) • S-CRF achieved higher results than U-CRF due to the hand-labeled training • In general, ONDUX outperformed the CRF models

  34. Experiments (V) • Extraction Quality • U-CRF performed poorly (very heterogeneous dataset) • Due to the Matching Phase and the PSM that is learned On-Demand, ONDUX achieves very high quality results

  35. Experiments (VI) • Varying the number of terms common to test instances and the KB • Determine how much the quality of the results depends on the overlap between the previously known data and the text input • These experiments were conducted with the BigBook dataset.

  36. Experiments (VII) • Varying the number of shared terms • With a batch of 500 input strings, ONDUX achieved high quality results once the overlap reached 500 terms • Even when the Matching Phase yields poor quality, the PSM is able to increase ONDUX’s quality in the Reinforcement Step

  37. Experiments (VIII) • Varying the number of shared terms • As the number of shared terms increases, the quality achieved by the Matching Phase improves

  38. Conclusions and Future Work (I) • New approach for information extraction independent of the style of the data records • ONDUX • Flexible: does not rely on any particular style • Unsupervised: does not require any human effort to create a training set • On-Demand: Ordering and Positioning Information are learned through the Matching Phase

  39. Conclusions and Future Work (II) • The proposed strategy achieves good precision and recall results • Small size of the Knowledge Base • Comparison with the state of the art • As Future Work • Investigate different matching functions; • Nested structures?

  40. Acknowledgements UFMG

  41. Questions?

  42. Experiments • Setup

  43. Experiments

  44. Experiments
