Fine-Grained Geographical Relation Extraction from Wikipedia


    Presentation Transcript
    1. Fine-Grained Geographical Relation Extraction from Wikipedia • André Blessing, Hinrich Schütze • University of Stuttgart, Institute for Natural Language Processing (IMS)

    2. Overview • motivation • why are fine-grained relations important? • self-annotation • automatic annotation using structured data • use this annotation for training classifier • extraction framework • evaluation and conclusion

    3. Geographical data provider • GeoNames • gazetteer • names, type, coordinates • 8 million entries • 2.6 million populated places • community-based • Creative Commons Attribution 3.0 License • Free to share

    4. GeoNames

    5. GeoNames – hierarchical types

    6. GeoNames – missing hierarchical relations

    7. Task Definition • relation definition • R1_2 • ADM3-ADM4 • Landkreis (county) and Gemeinde (municipality) • R0_1 • ADM4-PPL • Gemeinde (municipality) and Ortsteil (suburb) • task • classify all possible binary relations between the named entities (NEs) in one sentence

    8. Example - binary relations between all NEs • Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach in Rheinland-Pfalz (Deutschland). • Gebroth is a municipality in the county Bad Kreuznach in Rheinland-Pfalz (Germany). • binary relations between NEs • (Gebroth, Bad Kreuznach) element of R1_2 • (Gebroth, Rheinland-Pfalz) • (Gebroth, Deutschland) • (Bad Kreuznach, Rheinland-Pfalz) • (Bad Kreuznach, Deutschland) • (Rheinland-Pfalz, Deutschland)
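The candidate pairs on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `candidate_pairs` is an assumption and is not taken from the presentation, and the entity list is typed in by hand rather than produced by a named-entity recognizer.

```python
from itertools import combinations


def candidate_pairs(entities):
    """Return every unordered pair of named entities, keeping sentence order."""
    return list(combinations(entities, 2))


# Named entities of the example sentence, in order of appearance.
entities = ["Gebroth", "Bad Kreuznach", "Rheinland-Pfalz", "Deutschland"]
pairs = candidate_pairs(entities)

for pair in pairs:
    print(pair)
```

Four entities yield six pairs, matching the six pairs listed on the slide; the classifier then has to decide for each pair whether it belongs to a relation such as R1_2.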

    9. Requirements for extraction system • fast to develop • requested relation types can change • avoid expensive manual annotation • fine-grained relation types • e.g. a simple part-of relation is not sufficient • the trained system needs no structured data • several input sources (Wikipedia, blogs, Twitter, news) • German data

    10. Wikipedia as resource • structured data • templates (e.g. infoboxes), links, categories, tables, lists • unstructured data • written text • high quality • many users • WikiBots • structured data can be used to annotate unstructured data → self-annotation

    11. Self-Annotation - example • article: Gebroth • structured data: Landkreis (county): Bad Kreuznach • unstructured data: Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach in Rheinland-Pfalz (Deutschland). • resulting annotation: R1_2(Gebroth, Bad Kreuznach)

    12. Self-annotation - challenges • infoboxes are not always completely, correctly, or coherently filled in • matching with unstructured data • pattern matching is not sufficient • orthographic variation • morphology • multi-word expressions • matching needs some manual adjustment • only one relation per article
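A minimal sketch of the self-annotation idea, assuming the infobox is available as a plain dictionary with a hypothetical `Landkreis` field. The names `normalize` and `self_annotate` are illustrative and not from the presentation. The matching here is normalized substring matching only; it handles simple orthographic variation (case, hyphens, spacing) but not the morphology the slide also lists as a challenge.

```python
import re


def normalize(text):
    """Casefold and unify hyphens and whitespace so simple variants compare equal."""
    text = text.casefold().replace("-", " ").replace("\u2013", " ")
    return re.sub(r"\s+", " ", text).strip()


def self_annotate(title, infobox, sentence, field="Landkreis", label="R1_2"):
    """Label (title, infobox value) with `label` if both occur in the sentence."""
    value = infobox.get(field)
    if value is None:
        return None  # incomplete infobox: nothing to annotate
    norm_sentence = normalize(sentence)
    if normalize(title) in norm_sentence and normalize(value) in norm_sentence:
        return (label, title, value)
    return None


sentence = ("Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach "
            "in Rheinland-Pfalz (Deutschland).")
annotation = self_annotate("Gebroth", {"Landkreis": "Bad Kreuznach"}, sentence)
print(annotation)
```

Sentences annotated this way can serve as training examples, so the trained classifier itself no longer needs the structured data.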

    13. Extraction framework • UIMA (Unstructured Information Management Architecture) • pipeline architecture • easy exchange of components • fast development • extended components • CollectionReader for Wikipedia • linguistic annotation • supervised classifier
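The pipeline idea can be illustrated in plain Python. This is not the UIMA API: it is a simplified analogue in which each component is a function that reads a shared document and adds its own annotations, so components can be swapped without touching the rest. All names (`tokenizer`, `token_counter`, `run_pipeline`) are hypothetical.

```python
def tokenizer(doc):
    """Add a whitespace tokenization to the document."""
    doc["tokens"] = doc["text"].split()
    return doc


def token_counter(doc):
    """Add a token count, relying on the annotation of an earlier component."""
    doc["n_tokens"] = len(doc["tokens"])
    return doc


def run_pipeline(doc, components):
    """Pass the document through each component in order."""
    for component in components:
        doc = component(doc)
    return doc


doc = run_pipeline({"text": "Gebroth ist eine Ortsgemeinde"},
                   [tokenizer, token_counter])
print(doc["tokens"], doc["n_tokens"])
```

Replacing `tokenizer` with a different implementation only requires that the new component also writes a `tokens` entry, which mirrors the "easy exchange of components" point on the slide.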

    14. Extraction pipeline (diagram; labels only) • German Wikipedia • JWPL • CollectionReader (structured data, unstructured text) • GeoNames • CollectionReader (text) • UIMA Pipeline • Self-Annotation • FSPar-Annotator (FSPar-Engine) • ClearTK Consumer (MaxEnt-Classifier)

    15. Linguistic processing • FSPar engine (Schiehlen 2003) • tokenizer • PoS-tagger (based on TreeTagger) • chunker • partial dependency parser

    16. Supervised classification • extended ClearTK-Annotator • feature sets • F0: NE distance (baseline) • F1: Window-based (pos, lemma, size=2) • F2: chunks (parent chunks of NEs) • F3: dependency parse (paths between NEs) • MaxEntClassifier
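The two simplest feature sets can be sketched as follows. This is an assumption-laden illustration, not the ClearTK-based extractor from the presentation: entity spans are given by hand as (start, end) token offsets with an exclusive end, and the window uses surface tokens as a stand-in for the PoS tags and lemmas named on the slide. The function names are hypothetical.

```python
def ne_distance(span_a, span_b):
    """F0-style feature: number of tokens between two entity spans."""
    first, second = sorted([span_a, span_b])
    return max(0, second[0] - first[1])


def window_features(tokens, span, size=2):
    """F1-style feature: up to `size` tokens on each side of an entity span."""
    start, end = span
    return {
        "left": tokens[max(0, start - size):start],
        "right": tokens[end:end + size],
    }


tokens = ["Gebroth", "ist", "eine", "Ortsgemeinde", "im", "Landkreis",
          "Bad", "Kreuznach", "in", "Rheinland-Pfalz"]
gebroth = (0, 1)        # token offsets of "Gebroth"
bad_kreuznach = (6, 8)  # token offsets of "Bad Kreuznach"

features = {
    "distance": ne_distance(gebroth, bad_kreuznach),
    "window_a": window_features(tokens, gebroth),
    "window_b": window_features(tokens, bad_kreuznach),
}
print(features)
```

For this pair the distance is 5 tokens, and the left window of "Bad Kreuznach" contains "im Landkreis", which is the kind of context cue a window-based feature set can pick up.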

    17. Evaluation • 9000 articles about German municipalities and suburbs • 5300 articles for training • 1800 articles for development • 1800 articles for final evaluation • the R1_2 relation is also available from the Federal Statistical Office of Germany • used to evaluate the self-annotation • 99.9% correct (1 error in 1304 sentences)
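The figures on this slide can be checked with simple arithmetic. The sketch below only restates the numbers given above; reading "9000" as a rounded total is an assumption.

```python
# Article counts per split, as stated on the slide.
splits = {"training": 5300, "development": 1800, "evaluation": 1800}
total = sum(splits.values())
print(total)  # 8900, close to the stated 9000 (assumed to be rounded)

# Quality of the self-annotation: 1 error in 1304 sentences.
sentences, errors = 1304, 1
accuracy = (sentences - errors) / sentences
print(f"{accuracy:.1%}")  # 99.9%
```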

    18. Results

    19. Conclusion • text is an important resource for context-aware systems • self-annotation • automatic annotation using structured data • Wikipedia is a valuable resource • structured and unstructured data • contains fine-grained relations • UIMA-based implementation • fine-grained geographical relation extraction is possible

    20. Questions: ?! www.nexus.uni-stuttgart.de