
Extracting Geographical Gazetteers from the Internet

Olga Uryupina, 30.05.03. Overview: Named Entity Recognition & Gazetteers; Data; Initial Algorithm; Bootstrapping approach; Evaluation; ToDo.


Presentation Transcript


  1. Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03

  2. Overview • Named Entity Recognition & Gazetteers • Data • Initial Algorithm • Bootstrapping approach • Evaluation • ToDo

  3. NE Recognition National Gallery of Scotland – The nucleus of the Gallery was formed by the Royal Institution's collection, later expanded by bequests and purchasing. Playfair designed (1850-57) the imposing classical building to house the works.

  4. State-of-the-art systems Standard approaches usually combine • Rules • Statistics • Gazetteers Classes distinguished: • Person • Organisation • Location

  5. NE Recognition – with and without gazetteers (Mikheev, Moens, and Grover, 1999) ran their system in different modes

  6. Fine-grained NER Washington wants protection for its peacekeepers. Until it gets its way the Administration is holding up renewal of the U.N. peacekeeping mandate in Bosnia.

  10. Manually created gazetteers Available resources: • Word lists from the Web • Atlases & maps • Digital gazetteers (e.g. Alexandria Digital Library)

  11. Manually created gazetteers – drawbacks • Only positive data (no way to find out whether Mainau island does not exist or is simply not listed) • Difficult to adjust when new classes are required • Not available for most languages (e.g. Aquisgrana, the Italian name for Aachen)

  12. Task We can get rid of manually compiled gazetteers by using the Internet. Task: subclassify locations using Internet counts (obtained from the AltaVista search engine). Offline vs. online processing

  13. Data Manually created gazetteer (1260 items) Classes: • COUNTRY Pitcairn • REGION Bavaria/Bayern • RIVER Oder • ISLAND Savai‘i • MOUNTAIN Ohmberge • CITY Nancy Washington: 11xCITY, 1xMOUNTAIN, 2xISLAND, (31+1+1)xREGION

  14. Data Gazetteer example

  15. Data For each class we sample 100 items from the gazetteer. As the lists overlap, this results in 520 different items (TRAINING data). The rest was used for TESTING. CITY: ... REGION: ... COUNTRY: ... RIVER: ..., Victoria, ... ISLAND: ..., Victoria, ... MOUNTAIN: ..., Victoria, ... • TRAINING: Victoria [+CITY, +REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
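A minimal sketch of the sampling described on this slide, assuming the gazetteer is held as a mapping from class name to a set of names. The function name `split_gazetteer`, the fixed random seed, and the toy data are illustrative assumptions, not part of the original system.

```python
import random

def split_gazetteer(gazetteer, per_class=100, seed=0):
    """Sample up to `per_class` names per class; return (samples, training, testing).

    `gazetteer` maps a class name to a set of names. Because a name such as
    "Victoria" can be drawn for several classes, the union of the samples is
    usually smaller than per_class * number_of_classes.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    samples = {}
    for cls, names in gazetteer.items():
        pool = sorted(names)
        samples[cls] = set(rng.sample(pool, min(per_class, len(pool))))
    training = set().union(*samples.values())
    testing = set().union(*gazetteer.values()) - training
    return samples, training, testing

# Toy data (hypothetical, far smaller than the 1260-item gazetteer).
toy = {
    "RIVER": {"Oder", "Victoria", "Rhine"},
    "ISLAND": {"Savai'i", "Victoria", "Mainau"},
}
samples, training, testing = split_gazetteer(toy, per_class=2)
```

With 100 items per class and six classes, overlapping draws of this kind are what reduce the 600 sampled slots to the 520 distinct training items reported on the slide.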

  16. Initial system For each class a set of keywords was created. ISLAND island islands archipelago

  17. Initial system For each item X to be classified, queries of the form "X KEYWORD" and "KEYWORD of X" are sent to the AltaVista search engine.

  18. Initial system Machine learners use the counts to induce classifications. Learners tested for this task: • C4.5 • TiMBL • Ripper
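The query construction of the initial system could be sketched as follows. The ISLAND keyword set is the one listed on slide 16; `count_hits` is a hypothetical stand-in for whatever returns a search-engine hit count, since the slides do not show how AltaVista was accessed.

```python
# Keyword set for ISLAND as listed on slide 16; other classes need their own sets.
KEYWORDS = {"ISLAND": ["island", "islands", "archipelago"]}

def build_queries(item, keywords):
    """Build the two phrase queries per keyword: "X KEYWORD" and "KEYWORD of X"."""
    queries = []
    for kw in keywords:
        queries.append(f'"{item} {kw}"')
        queries.append(f'"{kw} of {item}"')
    return queries

def feature_vector(item, keywords, count_hits):
    """Turn hit counts into a feature vector for a learner (C4.5, TiMBL, Ripper).

    `count_hits` is an assumed callable: query string -> number of hits.
    """
    return [count_hits(q) for q in build_queries(item, keywords)]

# Usage with made-up counts instead of a live search engine.
fake_counts = {'"Mainau island"': 120, '"island of Mainau"': 45}
vector = feature_vector("Mainau", KEYWORDS["ISLAND"], lambda q: fake_counts.get(q, 0))
```

Passing the counter in as a parameter keeps the sketch independent of any particular search API.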

  19. Initial system – drawbacks Still needs manually created resources: • Set of patterns • Initial gazetteer (TRAINING) Only online (slow) processing – the system can only classify items provided by the user; it cannot extract new names itself

  20. Bootstrapping (Riloff & Jones, 1999): bootstrapping for an IE task (diagram: ITEMS ⇄ PATTERNS)

  21. Bootstrapping Main problem – noise: the pattern set can get infected Remedies: • Vaccine (external algorithm for evaluating patterns) • Stop lists • Human experts

  22. Bootstrapping loop (diagram): Initial gazetteer → Collecting patterns → Discarding most general patterns → Learning classifiers → Extraction patterns → Collecting items → Discarding common names → Classifying items (with the learned high-precision classifier) → Extraction items → back to Collecting patterns

  25. Collecting patterns (step 1) • Go to AltaVista • ask for an item • download first n pages • match with a simple regexp • patterns
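The "simple regexp" is not shown on the slides. The sketch below assumes one possible reading: take the single word immediately to the left or right of the item as a pattern and count how often each occurs. The function name and the one-word context window are assumptions; the patterns on later slides (e.g. "island of X") show that the real system also used longer contexts.

```python
import re
from collections import Counter

def collect_patterns(item, pages):
    """Count one-word left/right contexts of `item` in already downloaded pages."""
    counts = Counter()
    name = re.escape(item.lower())
    left = re.compile(r"\b(\w+)\s+" + name + r"\b")   # e.g. "of X"
    right = re.compile(r"\b" + name + r"\s+(\w+)\b")  # e.g. "X island"
    for page in pages:
        text = page.lower()
        for m in left.finditer(text):
            counts[f"{m.group(1)} X"] += 1
        for m in right.finditer(text):
            counts[f"X {m.group(1)}"] += 1
    return counts

# Usage on a made-up page text.
pages = ["The island of Mainau is small. Mainau island lies in Lake Constance."]
patterns = collect_patterns("Mainau", pages)
```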

  26. Example – step 1 10 best patterns for ISLAND: of X (70), the X (60), X and (58), X the (55), to X (53), in X (52), and X (47), X is (45), X in (45), on X (45)

  28. Rescoring (step 2) Goal: discard overly general patterns. Each pattern p gets a score for class c, reduced by a penalty for appearing in more than one class.
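The rescoring formula itself did not survive in the transcript. Purely as an illustration of the idea, the sketch below assumes a simple penalty: subtract from a pattern's count for class c its highest count in any other class. This is a hypothetical stand-in, not the formula used in the talk.

```python
def rescore(counts):
    """counts: {class: {pattern: raw_count}} -> {class: {pattern: rescored}}.

    ASSUMED penalty: the pattern's highest raw count in any other class.
    """
    rescored = {}
    for cls, patterns in counts.items():
        rescored[cls] = {}
        for pattern, n in patterns.items():
            penalty = max(
                (other.get(pattern, 0) for oc, other in counts.items() if oc != cls),
                default=0,
            )
            rescored[cls][pattern] = n - penalty
    return rescored

# Made-up counts: "of X" is frequent everywhere, "X island" only for ISLAND.
raw = {
    "ISLAND": {"of X": 70, "X island": 17},
    "RIVER": {"of X": 65, "X river": 20},
}
scores = rescore(raw)
```

Under this assumed penalty a general pattern such as "of X" drops below a class-specific one such as "X island", which is the effect the step 2 example illustrates.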

  29. Example – step 2 10 best patterns for ISLAND: X island (17), island of X (9), X islands (8), island X (7), islands X (7), insel X (7), the island X (6), X elects (5), of X islands (5), zealand X (4)

  31. Learning classifiers (step 3) The 20 best patterns are used to train Ripper (as in the initial system) Produced classifiers: • high-recall • high-accuracy • high-precision

  32. Example – step 3 • High-recall classifier for ISLAND:
        if #("X island")/#X >= 0.003879 classify X as +ISLAND
        if #("and X islands")/#X >= 0.000002 classify X as +ISLAND
        if #("insel X")/#X >= 0.017099 classify X as +ISLAND
        otherwise classify X as -ISLAND
      • Extraction patterns: "X island", "and X islands", "insel X"
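The high-recall rule set on this slide translates directly into code. In the sketch below the thresholds are the ones shown on the slide; the function name and the way counts are passed in (a dict of pattern hit counts plus the hit count of the bare item) are assumptions.

```python
# (pattern, threshold) pairs copied from the high-recall ISLAND classifier.
HIGH_RECALL_ISLAND = [
    ("X island", 0.003879),
    ("and X islands", 0.000002),
    ("insel X", 0.017099),
]

def is_island(pattern_counts, item_count, rules=HIGH_RECALL_ISLAND):
    """Return True (+ISLAND) if any rule fires, else False (-ISLAND).

    pattern_counts: hits for each pattern with X replaced by the item.
    item_count:     hits for the item alone, written #X on the slide.
    """
    if item_count <= 0:
        return False  # no evidence at all; assumed to mean -ISLAND
    for pattern, threshold in rules:
        if pattern_counts.get(pattern, 0) / item_count >= threshold:
            return True
    return False

# Made-up counts: 500 hits for "X island" out of 100000 for the bare name.
likely = is_island({"X island": 500}, 100_000)
unlikely = is_island({"X island": 1}, 100_000)
```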

  33. One more example – step 3 • High-accuracy classifier for ISLAND:
        if #("X island")/#X >= 0.000636 classify X as +ISLAND
        if #("and X islands")/#X >= 0.000002 and #("X sea")/#X >= 0.000013 and #("X geography") < 13 classify X as +ISLAND
        if #("X islands")/#X >= 0.000056 and #("pacific islands X")/#X >= 0.000006 classify X as +ISLAND
        otherwise classify X as -ISLAND

  36. Collecting and discarding items (steps 4&5) The same procedure as in step 1: go to AltaVista, ask for the extraction patterns (cf. step 3), ... Discarding: common names (beginning with lowercase letters) and stop words (not necessary, but saves time)
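The discarding step might look like the following filter. The stop-word list is a made-up placeholder, as the slides do not say which words were used.

```python
# Hypothetical stop list; the original list is not given on the slides.
STOP_WORDS = {"The", "This", "That", "Our", "Your"}

def keep_candidate(name, stop_words=STOP_WORDS):
    """Keep a candidate unless it is empty, starts lowercase, or is a stop word."""
    if not name or name[0].islower():
        return False  # common names begin with a lowercase letter
    return name not in stop_words

candidates = ["Mainau", "beautiful", "The", "Elba", ""]
kept = [c for c in candidates if keep_candidate(c)]
```

As the slide notes, this filter is optional: the classifier in step 6 would reject such candidates anyway, so filtering early only saves search-engine queries.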

  37. Example – steps 4 and 5 Extracted islands (alphabetically):

  39. Classifying (step 6) High-precision classifier (cf. step 3) is run on collected items • rejected items are discarded • accepted items used for extraction at the next loop
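Steps 1 to 6 can be tied together in a loop skeleton. In this sketch every step is passed in as a callable, so all four parameter names are hypothetical. It also assumes that only the accepted items seed the next loop and that results accumulate across loops; the slides do not settle either detail.

```python
def bootstrap(seed, collect_patterns, learn, collect_items, loops=2):
    """Skeleton of the bootstrapping loop; the individual steps are supplied by the caller.

    collect_patterns(items)  -> rescored patterns            (steps 1-2)
    learn(patterns)          -> (extraction_patterns, high_precision_classifier)  (step 3)
    collect_items(patterns)  -> candidate names              (step 4)
    """
    current = set(seed)
    found = set()
    for _ in range(loops):
        patterns = collect_patterns(current)
        extraction_patterns, accepts = learn(patterns)
        candidates = collect_items(extraction_patterns)
        # Step 5: drop common names (lowercase-initial) before classifying.
        candidates = {c for c in candidates if c and not c[0].islower()}
        # Step 6: keep only what the high-precision classifier accepts.
        accepted = {c for c in candidates if accepts(c)}
        found |= accepted
        current = accepted or current  # assumption: fall back if nothing was accepted
    return found

# Usage with trivial stand-ins for the real steps.
result = bootstrap(
    seed={"Mainau"},
    collect_patterns=lambda items: ["X island"],
    learn=lambda patterns: (patterns, lambda name: name.endswith("a")),
    collect_items=lambda patterns: {"Elba", "Java", "the", "Rhine"},
)
```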

  40. Example – step 6 Extracted islands (alphabetically):

  41. Evaluation Classifiers: • initial system • bootstrapping from the seed gazetteer • bootstrapping from positive examples only Item lists: • bootstrapping from the seed gazetteer

  42. Initial system – evaluation

  43. Bootstrapping – evaluation

  44. Comparing the performance RIVER, MOUNTAIN, COUNTRY – the new system is better! ISLAND – the new system improved and became better after the 2nd loop. REGION – infected category ("departments of X"); however, the system is improving. CITY – very heterogeneous class (homonymy); 1st loop – "streets of X", 2nd loop – "km from X", "ort X".

  45. Comparing the systems Bootstrapping (vs. the initial system): + patterns learned automatically + word lists produced • cheap seed gazetteer Problem: it's easy to download huge lists of islands etc., but very difficult to check them and classify them properly

  46. Learning from positives CITY: ... REGION: ... COUNTRY: ... RIVER: ..., Victoria, ... ISLAND: ..., Victoria, ... MOUNTAIN: ..., Victoria, ... Before: => TRAINING: Victoria [+CITY, +REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY] Now: => TRAINING: Victoria [-CITY, -REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
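One reading of the Before/Now contrast: before, labels came from looking an item up in the full gazetteer; now, an item is positive only for the classes in whose sampled list it appears, and negative everywhere else. The sketch below reproduces the Victoria example under that reading; the function name and the toy lists are assumptions.

```python
CLASSES = ["CITY", "REGION", "RIVER", "ISLAND", "MOUNTAIN", "COUNTRY"]

def label(item, lists):
    """Format +/- class labels for `item`, given a mapping class -> set of names."""
    marks = [("+" if item in lists.get(c, set()) else "-") + c for c in CLASSES]
    return "[" + ", ".join(marks) + "]"

# Before: the full gazetteer knows Victoria in five classes.
full_gazetteer = {c: {"Victoria"} for c in CLASSES if c != "COUNTRY"}
# Now: Victoria was only sampled for RIVER, ISLAND and MOUNTAIN.
sampled_lists = {c: {"Victoria"} for c in ("RIVER", "ISLAND", "MOUNTAIN")}

before = label("Victoria", full_gazetteer)
now = label("Victoria", sampled_lists)
```

The positives-only setting is cheaper to seed but introduces false negatives such as -CITY for Victoria, which is the trade-off the following evaluation slides address.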

  47. Initial system – evaluation

  48. Bootstrapping with positives only – evaluation

  49. New items New ISLANDs:
        true islands             121 (90.3%)
          found in the atlases    93
          not found               28
        descriptions               5 (3.7%)
        parts of names             3 (2.2%)
        mistakes                   5 (3.7%)
        all                      134
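The percentages on this slide can be checked with a few lines of arithmetic. The counts are taken from the slide; the variable names are illustrative.

```python
# Counts as reported on the slide.
counts = {"true islands": 121, "descriptions": 5, "parts of names": 3, "mistakes": 5}
true_island_breakdown = {"found in the atlases": 93, "not found": 28}

total = sum(counts.values())
shares = {kind: round(100 * n / total, 1) for kind, n in counts.items()}
breakdown_total = sum(true_island_breakdown.values())
```

The four categories sum to the 134 extracted items, and the atlas breakdown sums to the 121 true islands.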

  50. Conclusion Advantages of our approach: • very little manually collected data required (seed gazetteer) • no sophisticated engineering – patterns produced automatically • on-line classifiers provide negative information and are applicable to any entity • new items (off-line gazetteer) collected automatically