
Fine-Grained Geographical Relation Extraction from Wikipedia

André Blessing, Hinrich Schütze. University of Stuttgart, Institute for Natural Language Processing (IMS).


Presentation Transcript


  1. Fine-Grained Geographical Relation Extraction from Wikipedia • André Blessing, Hinrich Schütze • University of Stuttgart, Institute for Natural Language Processing (IMS)

  2. Overview • motivation • why are fine-grained relations important? • self-annotation • automatic annotation using structured data • use this annotation to train a classifier • extraction framework • evaluation and conclusion

  3. Geographical data provider • GeoNames • gazetteer • names, type, coordinates • 8 million entries • 2.6 million populated places • community-based • Creative Commons Attribution 3.0 License • Free to share

  4. GeoNames

  5. GeoNames – hierarchical types

  6. GeoNames – missing hierarchical relations

  7. Task Definition • relation definition • R1_2 • ADM3-ADM4 • Landkreis (county) - Gemeinde (municipality) • R0_1 • ADM4-PPL • Gemeinde (municipality) - Ortsteil (suburb) • task • classify all possible binary relations between named entities (NEs) within one sentence

  8. Example - binary relations between all NEs • Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach in Rheinland-Pfalz (Deutschland). • Gebroth is a municipality in the county Bad Kreuznach in Rheinland-Pfalz (Germany). • binary relations between NEs • (Gebroth, Bad Kreuznach) element of R1_2 • (Gebroth, Rheinland-Pfalz) • (Gebroth, Deutschland) • (Bad Kreuznach, Rheinland-Pfalz) • (Bad Kreuznach, Deutschland) • (Rheinland-Pfalz, Deutschland)
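The candidate pairs on the slide above are simply all unordered pairs of the named entities found in the sentence. A minimal sketch in Python, assuming the entities have already been recognized and are given in sentence order (the function name `candidate_pairs` is hypothetical, not part of the presented system):

```python
from itertools import combinations


def candidate_pairs(entities):
    """Return every unordered pair of named entities, keeping sentence order."""
    return list(combinations(entities, 2))


# Named entities from the example sentence about Gebroth.
entities = ["Gebroth", "Bad Kreuznach", "Rheinland-Pfalz", "Deutschland"]
pairs = candidate_pairs(entities)

for pair in pairs:
    print(pair)
```

With four entities this yields six candidate pairs, matching the six pairs listed on the slide; the classifier then decides which of them stand in a relation such as R1_2.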

  9. Requirements for extraction system • fast to develop • requested relation types can change • avoid expensive manual annotation • fine-grained relation types • e.g. a simple part-of relation is not sufficient • the trained system needs no structured data • several input sources (Wikipedia, blogs, Twitter, news) • German data

  10. Wikipedia as resource • structured data • templates (e.g. infoboxes), links, categories, tables, lists • unstructured data • written text • high quality • many users • WikiBots • structured data can be used to annotate unstructured data → self-annotation

  11. Self-Annotation - example • unstructured data (article text): Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach in Rheinland-Pfalz (Deutschland). • structured data (article Gebroth): Landkreis (county) = Bad Kreuznach • resulting annotation: R1_2(Gebroth, Bad Kreuznach)

  12. Self-annotation - challenges • infoboxes are not always completely, correctly, or coherently filled in • matching with unstructured data • pattern matching is not sufficient • orthographic variants • morphology • multi-word expressions • matching needs some manual adjustment • only one relation per article
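The self-annotation idea can be sketched as follows: if both the article title and a value from the article's structured data occur in a sentence, that sentence is labeled with the corresponding relation. This is only an illustration. The infobox is assumed to be a plain dict, the key `"Landkreis"` and the function names are hypothetical, and the normalized substring match used here is exactly the kind of simple matching the slide says is not sufficient on its own (it ignores morphology and most orthographic variants).

```python
import re


def normalize(name):
    """Lowercase and collapse whitespace/hyphens; a crude stand-in for real matching."""
    return re.sub(r"[\s\-]+", " ", name).strip().casefold()


def self_annotate(title, infobox, sentence, relation_key="Landkreis", label="R1_2"):
    """Return (title, value, label) if title and infobox value both occur in the sentence."""
    value = infobox.get(relation_key)
    if value is None:
        return None  # incomplete infobox: nothing to annotate
    text = normalize(sentence)
    if normalize(title) in text and normalize(value) in text:
        return (title, value, label)
    return None


sentence = ("Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach "
            "in Rheinland-Pfalz (Deutschland).")
annotation = self_annotate("Gebroth", {"Landkreis": "Bad Kreuznach"}, sentence)
print(annotation)
```

Sentences annotated this way can serve as positive training examples, so no manual labeling is required.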

  13. Extraction framework • UIMA (Unstructured Information Management Architecture) • pipeline architecture • easy exchange of components • fast development • extended components • CollectionReader for Wikipedia • linguistic annotation • supervised classifier

  14. Extraction pipeline • [Figure: extraction pipeline. Labels: German Wikipedia, JWPL, structured data, unstructured text, GeoNames, text, CollectionReader (twice), UIMA Pipeline, FSPar-Annotator, FSPar-Engine, Self-Annotation, ClearTK Consumer, MaxEnt-Classifier]
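The pipeline architecture with exchangeable components can be illustrated by a small sketch. It does not use the UIMA API (UIMA is a Java framework); it only mimics the idea in Python, assuming each component is a function that takes a document dict and returns an enriched copy. All names here (`run_pipeline`, `tokenize`, `count_tokens`) are hypothetical.

```python
def run_pipeline(document, components):
    """Pass the document through each component in order."""
    for component in components:
        document = component(document)
    return document


def tokenize(doc):
    """Toy annotator: whitespace tokenization (a real system would use FSPar)."""
    return {**doc, "tokens": doc["text"].split()}


def count_tokens(doc):
    """Toy consumer: record how many tokens were produced."""
    return {**doc, "n_tokens": len(doc["tokens"])}


result = run_pipeline({"text": "Gebroth ist eine Ortsgemeinde"},
                      [tokenize, count_tokens])
print(result["tokens"], result["n_tokens"])
```

Because every component shares the same interface, one annotator can be swapped for another without touching the rest of the pipeline, which is the property the slides highlight.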

  15. Linguistic processing • FSPar engine (Schiehlen 2003) • tokenizer • PoS tagger (based on TreeTagger) • chunker • partial dependency parser

  16. Supervised classification • extended ClearTK annotator • feature sets • F0: NE distance (baseline) • F1: window-based (PoS, lemma, window size = 2) • F2: chunks (parent chunks of NEs) • F3: dependency parse (paths between NEs) • MaxEnt classifier
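To make the first two feature sets concrete, here is a minimal sketch of F0 (NE distance) and F1 (lemma and PoS in a window of size 2) for one candidate pair. Several things are assumptions rather than details from the slides: the distance is measured in tokens, each NE is treated as a single token, and the lemmas and tags in the example are illustrative. The feature names are hypothetical.

```python
def ne_distance(i, j):
    """F0: number of token positions between the two named entities."""
    return abs(j - i)


def window_features(tokens, index, prefix, size=2):
    """F1: lemma and PoS of up to `size` tokens on each side of position `index`."""
    feats = {}
    for offset in range(-size, size + 1):
        if offset == 0:
            continue  # skip the entity itself
        pos = index + offset
        if 0 <= pos < len(tokens):
            lemma, tag = tokens[pos]
            feats[f"{prefix}_lemma_{offset:+d}"] = lemma
            feats[f"{prefix}_pos_{offset:+d}"] = tag
    return feats


def extract_features(tokens, i, j):
    """Combine F0 and F1 for the entity pair at token positions i and j."""
    feats = {"distance": ne_distance(i, j)}
    feats.update(window_features(tokens, i, "e1"))
    feats.update(window_features(tokens, j, "e2"))
    return feats


# (lemma, PoS) pairs; values are illustrative, not FSPar output.
tokens = [("Gebroth", "NE"), ("sein", "VAFIN"), ("ein", "ART"),
          ("Ortsgemeinde", "NN"), ("in", "APPRART"), ("Landkreis", "NN"),
          ("Bad Kreuznach", "NE"), ("in", "APPR"), ("Rheinland-Pfalz", "NE")]
feats = extract_features(tokens, 0, 6)
print(feats)
```

A dict of this shape is what a maximum entropy classifier would typically consume after the string-valued features are one-hot encoded.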

  17. Evaluation • 9000 articles about German municipalities and suburbs • 5300 articles for training • 1800 articles for development • 1800 articles for final evaluation • the R1_2 relation is also available from the Federal Statistical Office of Germany • used to evaluate the self-annotation • 99.9 % (1 error in 1304 sentences)
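The numbers on this slide can be checked with a few lines of arithmetic. The sketch below only recomputes figures already given; note that the three listed splits sum to 8900, so the 9000 total is presumably a rounded figure.

```python
# Article counts as stated on the slide.
train, dev, test = 5300, 1800, 1800
split_total = train + dev + test

# Self-annotation quality: 1 error in 1304 sentences.
errors, sentences = 1, 1304
accuracy = (sentences - errors) / sentences

print(split_total)
print(f"{accuracy:.1%}")
```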

  18. Results

  19. Conclusion • text is an important resource for context-aware systems • self-annotation • automatic annotation using structured data • Wikipedia is a valuable resource • structured and unstructured data • contains fine-grained relations • UIMA-based implementation • fine-grained geographical relation extraction is possible

  20. Questions: ?! www.nexus.uni-stuttgart.de
