Fine-Grained Geographical Relation Extraction from Wikipedia

André Blessing, Hinrich Schütze • University of Stuttgart, Institute for Natural Language Processing (IMS)

Overview • motivation • why are fine-grained relations important? • self-annotation • automatic annotation using structured data • use this annotation for training classifier • extraction framework • evaluation and conclusion

Geographical data provider • GeoNames • gazetteer • names, type, coordinates • 8 million entries • 2.6 million populated places • community-based • Creative Commons Attribution 3.0 License • Free to share

GeoNames

GeoNames – hierarchical types

GeoNames – missing hierarchical relations

Task Definition • relation definition • R1_2 • ADM3-ADM4 • Landkreis (county) - Gemeinde (municipality) • R0_1 • ADM4-PPL • Gemeinde (municipality) - Ortsteil (suburb) • task • classify all possible binary relations between the named entities in one sentence

Example - binary relations between all NEs • Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach in Rheinland-Pfalz (Deutschland). • Gebroth is a municipality in the county Bad Kreuznach in Rheinland-Pfalz (Germany). • binary relations between NEs • (Gebroth, Bad Kreuznach) ∈ R1_2 • (Gebroth, Rheinland-Pfalz) • (Gebroth, Deutschland) • (Bad Kreuznach, Rheinland-Pfalz) • (Bad Kreuznach, Deutschland) • (Rheinland-Pfalz, Deutschland)
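The candidate pairs in the example can be enumerated mechanically. The following is a minimal sketch, assuming the named entities of the sentence are already available as a list in sentence order; the function name `candidate_pairs` is hypothetical and not part of the presented system.

```python
from itertools import combinations

# Named entities in the order they appear in the example sentence.
entities = ["Gebroth", "Bad Kreuznach", "Rheinland-Pfalz", "Deutschland"]

def candidate_pairs(named_entities):
    """Return every unordered pair of NEs from one sentence, in sentence order."""
    return list(combinations(named_entities, 2))

pairs = candidate_pairs(entities)
for left, right in pairs:
    print(f"({left}, {right})")
```

Four named entities yield six candidate pairs, and the classifier then has to decide for each pair whether it belongs to a relation such as R1_2.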

Requirements for extraction system • fast to develop • requested relation types can change • avoid expensive manual annotation • fine-grained relation types • e.g. a simple part-of relation is not sufficient • the trained system needs no structured data • several input sources (Wikipedia, blogs, Twitter, news) • German data

Wikipedia as resource • structured data • templates (e.g. infoboxes), links, categories, tables, lists • unstructured data • written text • high quality • many users • WikiBots • structured data can be used to annotate unstructured data → self-annotation

Self-Annotation - example • unstructured data (text of the article Gebroth): Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach in Rheinland-Pfalz (Deutschland). • structured data: Landkreis (county) = Bad Kreuznach • resulting annotation: R1_2(Gebroth, Bad Kreuznach)
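The self-annotation idea can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the infobox is modelled as a plain dictionary, the field name `Landkreis` and the label `NONE` are assumptions, and matching is by exact string equality.

```python
from itertools import combinations

def self_annotate(article_title, infobox, entities, field="Landkreis", relation="R1_2"):
    """Label NE pairs of one sentence using the article's structured data.

    A pair is a positive example if it consists of the article's own entity
    and the value stored in the given infobox field (hypothetical field name).
    """
    target = infobox.get(field)
    labelled = []
    for left, right in combinations(entities, 2):
        is_positive = target is not None and {left, right} == {article_title, target}
        labelled.append((left, right, relation if is_positive else "NONE"))
    return labelled

sentence_entities = ["Gebroth", "Bad Kreuznach", "Rheinland-Pfalz", "Deutschland"]
labelled = self_annotate("Gebroth", {"Landkreis": "Bad Kreuznach"}, sentence_entities)
for row in labelled:
    print(row)
```

Only the pair (Gebroth, Bad Kreuznach) receives the R1_2 label; the other five pairs become negative training examples.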

Self-annotation - challenges • infoboxes are not always filled in completely, correctly, or coherently • matching with unstructured data • pattern matching is not sufficient • orthographic variants • morphology • multi-word expressions • matching needs some manual adjustment • only one relation per article
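One way to soften the matching problem is to normalize both the structured value and the text mention before comparing them. The sketch below is an assumption about how such a step could look, not the adjustment used in the presented system; the list `TYPE_WORDS` and the function names are hypothetical.

```python
import re
import unicodedata

# Hypothetical list of administrative type words that may precede a name.
TYPE_WORDS = ("landkreis", "landkreises")

def normalize(name):
    """Lower-case, unify hyphens and whitespace, and drop leading type words."""
    text = unicodedata.normalize("NFKC", name).casefold()
    text = re.sub(r"[-–/]", " ", text)          # "Bad-Kreuznach" -> "bad kreuznach"
    text = re.sub(r"\s+", " ", text).strip()
    tokens = [t for t in text.split(" ") if t not in TYPE_WORDS]
    return " ".join(tokens)

def same_place(infobox_value, mention):
    """True if both strings refer to the same name after normalization."""
    return normalize(infobox_value) == normalize(mention)

print(same_place("Landkreis Bad Kreuznach", "Bad Kreuznach"))
print(same_place("Bad Kreuznach", "Bad Dürkheim"))
```

A rule set like this covers only simple orthographic variants and one inflected type word; real morphological variation would need the lemmas produced by the linguistic processing step.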

Extraction framework • UIMA (Unstructured Information Management Architecture) • pipeline architecture • easy exchange of components • fast development • extended components • CollectionReader for Wikipedia • linguistic annotation • supervised classifier
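The pipeline idea, where components share one document object and can be exchanged freely, can be illustrated without UIMA itself. The following Python sketch is only an analogy: it does not use the UIMA API (which is Java-based), and all function names are hypothetical.

```python
def tokenize(doc):
    """Toy component: split the text on whitespace."""
    doc["tokens"] = doc["text"].split()
    return doc

def count_tokens(doc):
    """Toy component: depends only on the 'tokens' annotation, not on who produced it."""
    doc["n_tokens"] = len(doc["tokens"])
    return doc

def run_pipeline(doc, components):
    """Pass one shared document through each component in order."""
    for component in components:
        doc = component(doc)
    return doc

result = run_pipeline({"text": "Gebroth ist eine Ortsgemeinde"}, [tokenize, count_tokens])
print(result["n_tokens"])
```

Because each component reads and writes annotations on the shared document only, the toy tokenizer could be replaced by a different one without touching the later steps.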

Extraction pipeline [diagram] • components shown: German Wikipedia, JWPL, structured data, unstructured text, UIMA Pipeline, CollectionReader, FSPar-Annotator, FSPar-Engine, Self-Annotation, ClearTK Consumer, MaxEnt-Classifier, GeoNames, CollectionReader (text)

Linguistic processing • FSPar engine (Schiehlen 2003) • tokenizer • PoS-tagger (based on TreeTagger) • chunker • partial dependency parser

Supervised classification • extended ClearTK-Annotator • feature sets • F0: NE distance (baseline) • F1: Window-based (pos, lemma, size=2) • F2: chunks (parent chunks of NEs) • F3: dependency parse (paths between NEs) • MaxEntClassifier
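The two simplest feature sets, F0 and F1, can be sketched as follows. This is a minimal illustration, not the ClearTK annotator from the presentation: tokens are hand-written `(form, lemma, pos)` tuples with illustrative STTS-style tags, entity spans are `(start, end)` token indices with an exclusive end, and all feature names are hypothetical.

```python
# Hand-written illustrative analysis of the start of the example sentence.
tokens = [
    ("Gebroth", "Gebroth", "NE"), ("ist", "sein", "VAFIN"), ("eine", "eine", "ART"),
    ("Ortsgemeinde", "Ortsgemeinde", "NN"), ("im", "in", "APPRART"),
    ("Landkreis", "Landkreis", "NN"), ("Bad", "Bad", "NE"), ("Kreuznach", "Kreuznach", "NE"),
    ("in", "in", "APPR"), ("Rheinland-Pfalz", "Rheinland-Pfalz", "NE"),
]

def ne_distance(span1, span2):
    """F0: number of tokens between the first and the second entity."""
    return max(0, span2[0] - span1[1])

def window_features(tokens, span, prefix, size=2):
    """F1: lemma and PoS of up to `size` tokens left and right of an entity."""
    features = {}
    start, end = span
    for offset in range(1, size + 1):
        left, right = start - offset, end - 1 + offset
        if left >= 0:
            features[f"{prefix}_L{offset}_lemma"] = tokens[left][1]
            features[f"{prefix}_L{offset}_pos"] = tokens[left][2]
        if right < len(tokens):
            features[f"{prefix}_R{offset}_lemma"] = tokens[right][1]
            features[f"{prefix}_R{offset}_pos"] = tokens[right][2]
    return features

def pair_features(tokens, span1, span2):
    """Combine F0 and F1 into one feature dictionary for an entity pair."""
    features = {"F0_distance": ne_distance(span1, span2)}
    features.update(window_features(tokens, span1, "e1"))
    features.update(window_features(tokens, span2, "e2"))
    return features

feats = pair_features(tokens, (0, 1), (6, 8))  # Gebroth, Bad Kreuznach
print(feats)
```

Feature dictionaries of this kind are what a maximum-entropy classifier would be trained on; the chunk (F2) and dependency-path (F3) features need parser output and are not shown here.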

Evaluation • 9000 articles about German municipalities and suburbs • 5300 articles for training • 1800 articles for development • 1800 articles for final evaluation • R1_2 relation is also available from the Federal Statistical Office of Germany • used to evaluate the self-annotation • 99.9 % (1 error in 1304 sentences)
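The reported figures can be checked with a short calculation. The snippet below only reproduces the arithmetic from the numbers on the slide; the variable names are illustrative.

```python
# Self-annotation quality: 1 error in 1304 sentences.
errors, sentences = 1, 1304
accuracy = 1 - errors / sentences
print(f"{accuracy:.2%}")

# Article counts of the three splits as given on the slide.
splits = {"training": 5300, "development": 1800, "evaluation": 1800}
total = sum(splits.values())
print(total)
```

The accuracy comes out at about 99.92 %, which matches the rounded 99.9 % on the slide. The splits sum to 8900 articles, so the 9000 figure is presumably a rounded total.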

Results

Conclusion • text is an important resource for context-aware systems • self-annotation • automatic annotation using structured data • Wikipedia is a valuable resource • structured and unstructured data • containing fine-grained relations • UIMA-based implementation • fine-grained geographical relation extraction is possible

Questions: ?! www.nexus.uni-stuttgart.de
