
On Using Wikipedia to Build Knowledge Bases for Information Extraction by Text Segmentation

On Using Wikipedia to Build Knowledge Bases for Information Extraction by Text Segmentation. Elton Serra, Eli Cortez, Altigran S. da Silva, Edleno S. de Moura. Universidade Federal do Amazonas (UFAM) - Brazil. Presented by Elton Serra and Eli Cortez. SBBD 2011 Florianópolis, Brazil.



Presentation Transcript


  1. On Using Wikipedia to Build Knowledge Bases for Information Extraction by Text Segmentation Elton Serra, Eli Cortez, Altigran S. da Silva, Edleno S. de Moura Universidade Federal do Amazonas (UFAM) - Brazil Presented by Elton Serra and Eli Cortez SBBD 2011 Florianópolis, Brazil

  2. Introduction • Information extraction by text segmentation (IETS) • Extracting semi-structured data records by identifying attribute values in continuous text • bibliographic citations, product descriptions, classified ads, etc.

  3. Introduction • Current IETS methods use probabilistic frameworks such as HMM or CRF • Learn a model for extracting data related to a domain • Supervised IETS methods • Require training data from each source <Neighborhood>Regent Square </Neighborhood> <Price> $228,900 </Price> <No>1028 </No><Street>Mifflin Ave, </Street> <Bed>6 Bedrooms </Bed> <Bath> 2 Bathrooms </Bath> <Phone>412-638-7273 </Phone>

  4. Supervised IETS [Pipeline diagram: labeled segments (training) feed the learning of content features f1, f2, ..., fk and structure features g1, g2, ..., gl, producing a model; the model is applied to unlabeled input strings to extract output labeled segments]

  5. Supervised IETS [Diagram: three distinct input text sources (Text Source 1, 2 and 3)]

  6. Supervised IETS [Diagram: manual labeling is required for each input source (Text Source 1, 2 and 3)]

  7. Introduction • Unsupervised IETS methods • Learn from datasets • Dictionaries, knowledge bases, reference tables, etc. • No need for manual training for each input • Source independent • State-of-the-art IETS methods • Unsup. CRF (Zhao et al. @SIAM ICDM’08) • ONDUX (Cortez et al. @SIGMOD’10) • JUDIE (Cortez et al. @SIGMOD’11)

  8. Unsupervised IETS [Pipeline diagram: a dataset provides content features f1, f2, ..., fk, which bootstrap the learning of structure features g1, g2, ..., gl from the input itself; the resulting models drive extraction and produce output labeled segments]

  9. Unsupervised IETS [Diagram: content features f1, f2, ..., fk computed from the dataset are shared by Source 1, Source 2 and Source 3]

  10. Unsupervised IETS [Diagram: a single dataset serves several input sources of a same domain (Source 1, Source 2 and Source 3)]

  11. Introduction • These datasets are very important for Unsupervised IETS • Currently, in the literature: • no proper discussion on how such datasets can be obtained • no principled methods for generating them • Experiments in the literature use datasets obtained in different ways: • Personal files, IE benchmarks, etc.

  12. Introduction • We consider Wikipedia as a source for these datasets • High volumes of structured information • Covers a huge diversity of topics and domains • Has been used before as a source of encoded knowledge • e.g., open information extraction • Not for IETS [Figure: example KBs — Universities, Medicines, Books, Soccer Teams, Electronics, Addresses]

  13. Introduction • Our contributions • We propose a novel strategy to generate KBs from Wikipedia to support state-of-the-art IETS methods • We show that this strategy is feasible in practice for obtaining many KBs related to real IETS tasks • We show that the KBs generated using this strategy lead to high-quality extraction results

  14. Agenda • Exploiting Domain Knowledge • Generating Knowledge Bases from Wikipedia • Example Sources • Experiments • Conclusion and Future Work

  15. Exploiting Domain Knowledge • IETS methods are based on learning sequential models such as HMM or CRF • Rely on two types of features: • State or content features: Related to the contents of the tokens/strings • Transition or structure features: Related to the location of tokens/strings in a sequence

  16. Content Features • Can be computed from a previously available KB • They are domain-dependent but input-independent • Examples • Vocabulary: similarity between strings in the input and values of an attribute from the KB • Value Range: how close a numeric string in the input is to the mean value of a set of numeric values of an attribute in the KB • Format: common styles often used to represent values of some attributes • URLs, e-mails, telephone numbers, etc.
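To make these three feature types concrete, here is a minimal Python sketch, not the paper's implementation; the function names and the phone-number pattern are illustrative assumptions:

```python
import re

def vocabulary_similarity(candidate, kb_values):
    """Vocabulary feature: token overlap (Jaccard) between a candidate
    string and the tokens of an attribute's values in the KB."""
    cand = set(candidate.lower().split())
    kb = {t for v in kb_values for t in v.lower().split()}
    if not cand or not kb:
        return 0.0
    return len(cand & kb) / len(cand | kb)

def value_range_score(candidate, kb_numbers):
    """Value-range feature: how close a numeric candidate is to the mean
    of the attribute's numeric values in the KB (1.0 = at the mean)."""
    try:
        x = float(candidate)
    except ValueError:
        return 0.0
    if not kb_numbers:
        return 0.0
    mean = sum(kb_numbers) / len(kb_numbers)
    spread = (max(kb_numbers) - min(kb_numbers)) or 1.0
    return max(0.0, 1.0 - abs(x - mean) / spread)

def format_score(candidate, pattern=r"\(?\d{3}\)?[-. ]?\d{3}-\d{4}"):
    """Format feature: 1.0 if the candidate matches a common style,
    here a hypothetical US phone-number pattern."""
    return 1.0 if re.fullmatch(pattern, candidate.strip()) else 0.0
```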

  17. Structure Features • Related to the organization of values in the input • They are input-dependent • Examples • Positioning: position of the values of an attribute within the input • Sequencing: relative order of attribute values within the input • Unsupervised IETS: use content features learned from a KB to bootstrap the learning of structure features
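A minimal sketch (illustrative, not the actual method) of how positioning and sequencing statistics could be estimated from records already labeled by the content features alone, i.e., the bootstrap step:

```python
from collections import Counter, defaultdict

def positioning_stats(labeled_records):
    """Estimate P(attribute | position) from records already labeled by
    content features. labeled_records: lists of attribute names in input order."""
    counts = defaultdict(Counter)          # position -> attribute -> count
    for record in labeled_records:
        for pos, attr in enumerate(record):
            counts[pos][attr] += 1
    return {pos: {a: c / sum(ctr.values()) for a, c in ctr.items()}
            for pos, ctr in counts.items()}

def sequencing_stats(labeled_records):
    """Estimate P(next attribute | current attribute) from the same records."""
    counts = defaultdict(Counter)
    for record in labeled_records:
        for cur, nxt in zip(record, record[1:]):
            counts[cur][nxt] += 1
    return {a: {b: c / sum(ctr.values()) for b, c in ctr.items()}
            for a, ctr in counts.items()}
```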

  18. Wikipedia as a source • How can KBs be generated at a reasonable cost and, preferably, automatically? • Wikipedia: • High volumes of structured information • Articles, Categories, Infoboxes, Citations • Diversity of topics and domains • Extensively used as a source of knowledge for many tasks: Searching, Text Classification, Clustering, Semantic Enrichment and Named Entity Recognition (NER)

  19. Generating Knowledge Bases from Wikipedia

  20. Generating Knowledge Bases • K = {(A1, O1), …, (An, On) } • Ai : an attribute • Oi = {oi,1,…,oi,n } is a set of known values from domain of Ai • Example
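As a concrete illustration, such a KB can be represented as a plain mapping from each attribute Ai to its set Oi of known values; the values below reuse the slide-3 example, and anything beyond it is hypothetical:

```python
# K = {(A1, O1), ..., (An, On)}: each attribute Ai paired with its
# set Oi of known values. Extra values are hypothetical.
kb = {
    "Neighborhood": {"Regent Square", "Shadyside"},
    "Street":       {"Mifflin Ave", "Forbes Ave"},
    "Phone":        {"412-638-7273"},
}
```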

  21. Generating Knowledge Bases • Let W be an XML dump of Wikipedia • We define a source S for an attribute A as an XPath expression that generates a set of atomic values in the same domain as A

  22. Generating Knowledge Bases • Three types of sources • Categories: titles of articles in a category C • /mediawiki/page[category=C]/title • Infobox Fields: values of a field F found in infoboxes of pages in a category C • /mediawiki/page[category=C]/infobox/F • Citation Fields: values of a field F in citations • /mediawiki/page/citation/F
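A hedged sketch of how these source expressions could be evaluated with Python and lxml. It assumes a preprocessed dump in which categories, infoboxes and citations have already been materialized as XML elements (the raw dump keeps them inside wikitext); the file name and the specific fields not shown on the slides are hypothetical:

```python
from lxml import etree

# Hypothetical preprocessed dump with <category>, <infobox> and <citation> elements.
tree = etree.parse("wikipedia_preprocessed.xml")

def source_values(xpath_expr):
    """Evaluate a source expression and return its set of atomic values."""
    return {node.text.strip() for node in tree.xpath(xpath_expr) if node.text}

# Category source: titles of articles in the category "Japanese ingredients"
ingredients = source_values(
    "/mediawiki/page[category='Japanese ingredients']/title")

# Infobox-field source: a hypothetical 'author' field of infoboxes in category "Novels"
authors = source_values("/mediawiki/page[category='Novels']/infobox/author")

# Citation-field source: a hypothetical 'journal' field of all citations
journals = source_values("/mediawiki/page/citation/journal")
```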

  23. Category Source

  24. Infobox Source

  25. Citation Source

  26. Citation Source

  27. Mapping sources to KBs • Given a KB, several sources can be mapped to the same attribute • Deciding which source should be mapped to each attribute is an instance of the schema mapping problem.

  28. Mapping sources to KBs [Diagram: candidate Wikipedia sources such as Cat:Food ingredients, Cat:Japanese ingredients, Info:Hotel/Number Restaurants, Info:Religious/Dome and Cat:Cooking weights measures, linked to the attributes Ingredient, Quantity and Unit]

  29. Mapping sources to KBs • We do not tackle this problem here; it is left as future work [Same diagram as the previous slide]

  30. Example • We generated KBs for three distinct domains • Bibliographic Data • Cooking Recipes • Product Offers • Candidate sources were manually found on Wikipedia.

  31. Bibliographic References - Sources

  32. Cooking Recipes - Sources

  33. Product Offers - Sources

  34. Experimental Results

  35. Experiments • Goal: to evaluate the quality of extraction results obtained with KBs generated with data from Wikipedia sources • Tests with three state-of-the-art IETS methods • U-CRF [Zhao et al. @ICDM 2008] • ONDUX [Cortez et al. @SIGMOD’10] • JUDIE [Cortez et al. @SIGMOD’11]

  36. Experiments • Methodology: • Carry out the same extraction tasks with KBs generated with our method and with reference KBs, i.e., those used in the original experiments with each method • Evaluate the extraction quality in terms of F-measure

  37. Experiments [Slide figure: input texts and reference KBs used in the experiments]

  38. Experiments – Generated KB • For each domain, we generated several KBs • We used fixed configurations for mapping candidate sources to the attributes of the KB • We call them basic mappings: • Maximal: candidate source containing the highest number of values • Minimal: candidate source containing the lowest number of values • Full: union of values from all candidate sources
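A minimal sketch (illustrative, not the original tooling) of the three basic mappings over the candidate sources found for each attribute:

```python
def basic_mapping(candidate_sources, strategy="Full"):
    """Build a KB from candidate sources.
    candidate_sources: dict mapping each attribute to a list of value sets,
    one set per candidate Wikipedia source found for that attribute."""
    kb = {}
    for attr, sources in candidate_sources.items():
        if strategy == "Maximal":
            kb[attr] = max(sources, key=len)    # source with the most values
        elif strategy == "Minimal":
            kb[attr] = min(sources, key=len)    # source with the fewest values
        else:                                   # "Full": union of all candidate sources
            kb[attr] = set().union(*sources)
    return kb
```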

  39. Experiments - Details on the KB

  40. Experiments – Metrics • For all performed experiments, we evaluated the extraction results for each individual attribute. • F-Measure • Harmonic mean between precision and recall
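For reference, a small sketch of the per-attribute F-measure computation; representing extractions as (record, value) pairs is an assumption made for illustration:

```python
def f_measure(extracted, correct):
    """Per-attribute F-measure: harmonic mean of precision and recall.
    extracted: set of (record_id, value) pairs produced by the method;
    correct:   the gold-standard set for the same attribute."""
    tp = len(extracted & correct)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(correct) if correct else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```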

  41. Results – Bibliographic Domain

  42. Results – Bibliographic Domain When using a Full mapping, all methods achieved quality results comparable to the results obtained with the reference KB

  43. Results – Bibliographic Domain There were cases where the Maximal mapping led to better results than the Full mapping. This can be explained by the fact that some sources may contain incorrect information, which can negatively impact the IETS methods.

  44. Results – Recipes and Products

  45. Results – Recipes and Products Minimal mappings led to the worst results, since the knowledge bases built using this mapping provide less data to support the learning of content-related features.

  46. Experiments – Random mappings • In the previous experiment, we ran the three IETS methods using the basic mappings. • However, in practice, there could be other approaches for establishing such mappings. • Many other mapping configurations might be used. • One candidate source is randomly selected for each attribute to compose a knowledge base.

  47. Experiments – Random mappings • We call them random mappings • Evaluate the performance of each IETS method when using noisy KBs • 5 different KBs (R1 to R5) were generated for each dataset using random mappings
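A small sketch of how such random mappings could be drawn; the candidate sources shown are hypothetical and serve only to make the example runnable:

```python
import random

def random_mapping(candidate_sources, seed=None):
    """Pick one candidate source at random for each attribute; each draw
    yields one noisy KB (R1..R5 in the experiments)."""
    rng = random.Random(seed)
    return {attr: rng.choice(sources)
            for attr, sources in candidate_sources.items()}

# Hypothetical candidate sources: attribute -> list of value sets
candidate_sources = {
    "Ingredient": [{"soy sauce", "tofu"}, {"miso", "soy sauce", "rice"}],
    "Unit":       [{"cup", "tablespoon"}, {"gram", "cup", "ounce"}],
}

# Five noisy KBs, one per random mapping (R1 to R5)
random_kbs = [random_mapping(candidate_sources, seed=i) for i in range(5)]
```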

  48. Experiments - Quality of extraction

  49. Experiments - Quality of extraction Extracting from CORA is harder than extracting from the other two datasets. CORA records have 3 to 7 attributes and 33 different citation styles.

  50. Experiments - Quality of extraction Good extraction quality with ONDUX and JUDIE.
