
Ad-hoc Georeferencing of Web-pages Using Street-name Prefix Trees

http://cs.joensuu.fi/mopsi/. Ad-hoc Georeferencing of Web-pages Using Street-name Prefix Trees. Andrei Tabarcea, Ville Hautamäki, Pasi Fränti, University of Eastern Finland. Introduction. Our goal is to find services and points of interest close to the user’s location


Presentation Transcript


  1. http://cs.joensuu.fi/mopsi/ Ad-hoc Georeferencing of Web-pages Using Street-name Prefix Trees Andrei Tabarcea, Ville Hautamäki, Pasi Fränti, University of Eastern Finland

  2. Introduction • Our goal is to find services and points of interest close to the user’s location • We call this “location-based search” • We try to find location information in web-pages

  3. Ad-Hoc Georeferencing
     Example page header with explicit geo-tags:
     <HTML>
     <HEAD profile="http://geotags.com/geo">
     <META name="geo.position" content="62.35;29.44">
     <META name="geo.region" content="FI">
     <META name="geo.placename" content="Joensuu">
     <META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
     <link rel="stylesheet" href="http://www.joensuu.fi/tkt/sivutyyli.css" type="text/css">
     <TITLE>Pages of Pasi Fränti</TITLE>
     </HEAD>
     • The problem is how to extract and validate location data from free-form text • Most web pages don’t contain explicit georeferencing (e.g. geo-tags) • Postal address is the most common location data found • Our goal is to give geographical coordinates to services mentioned in web-pages • We call this method ad-hoc georeferencing
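For the minority of pages that do carry explicit geo-tags, reading them is straightforward. The following is a minimal sketch, not part of the presentation, that pulls a `geo.position` value out of a page header using only Python's standard library. The class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class GeoMetaParser(HTMLParser):
    """Collects <meta name="geo.*" content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.geo = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name.startswith("geo.") and attr.get("content"):
            self.geo[name] = attr["content"]


def extract_geo_position(html):
    """Return (lat, lon) from a geo.position tag, or None if absent or malformed."""
    parser = GeoMetaParser()
    parser.feed(html)
    position = parser.geo.get("geo.position")
    if position is None:
        return None
    try:
        lat, lon = (float(part) for part in position.split(";"))
    except ValueError:
        return None
    return lat, lon


page = '<html><head><META name="geo.position" content="62.35;29.44"></head></html>'
print(extract_geo_position(page))  # (62.35, 29.44)
```

Pages without such tags return `None`, which is the case the rest of the presentation addresses.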

  4. MOPSI Location-Based Search MOPSI = Mobiilit paikkatieto-sovellukset ja Internet (Mobile location-based applications and Internet) Available on http://cs.joensuu.fi/mopsi/ Main focus areas: • Mobile search engine • How to collect & present location-based data • Other location-related topics

  5. Mobile search engine • How can you find services: asking directions, advertisements, wandering around, yellow pages, Internet • Query consists of: keyword, location

  6. Mobile Search engine structure • The search engine consists of: user interface, core server software, geocoded street-name database [Diagram. Components: Mobile application, Web user interface, Core server software, Geocoded street-name database. Data labels: Keyword, Address, Coordinates, Search results]

  7. Core Server software [Diagram. Components: Page parser, Relevant municipalities detector, Georeferencing module, Address and description detector, Address validator, Geocoded database. Data labels: <keyword, municipality> query, Result links, Word list, Keyword, Municipalities, Municipalities list, Addresses, Coordinates, Results list (Keyword, Address, Coordinates), Sorted results list]

  8. Core Server software [Same diagram as slide 7]

  9. Our Solution • A rule-based solution that detects address-based locations using a gazetteer and street-name prefix trees created from the gazetteer • We compare this approach against: • a method that doesn’t require a gazetteer (a heuristic method that assumes that the street-name has a certain structure) • a method that also uses data structures created from the gazetteer, in the form of street-name arrays
     StreetNameDetection(words) {
       i = 0;
       WHILE i < count(words) DO {
         IF words[i] is a street name THEN {
           Search for street number, postal code and other address elements near words[i].
           IF address elements found THEN {
             Create address block
             Get coordinates using Geocoded Database
             IF coordinates found THEN
               Add address block to address list
           }
         }
         i = i + 1;
       }
     }
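The StreetNameDetection pseudocode above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the MOPSI implementation: the tiny `GAZETTEER` dictionary, its coordinates, the four-word search window and the regular expressions are all hypothetical. The street-name test is a plain set lookup here, where the presentation would use a prefix tree.

```python
import re

# Hypothetical miniature gazetteer: (street name, house number) -> (lat, lon).
# Names are lowercased; the coordinates are illustrative, not real data.
GAZETTEER = {
    ("kauppakatu", "10"): (62.60, 29.76),
}
STREET_NAMES = {street for street, _ in GAZETTEER}

HOUSE_NUMBER = re.compile(r"^\d{1,4}$")
POSTAL_CODE = re.compile(r"^\d{5}$")  # Finnish postal codes have five digits


def detect_addresses(words, window=4):
    """Return validated address blocks found in a list of page words."""
    addresses = []
    for i, word in enumerate(words):
        street = word.lower().strip(",.")
        if street not in STREET_NAMES:
            continue
        # Look for other address elements in the words right after the street name.
        nearby = [w.strip(",.") for w in words[i + 1 : i + 1 + window]]
        number = next((w for w in nearby if HOUSE_NUMBER.match(w)), None)
        postal = next((w for w in nearby if POSTAL_CODE.match(w)), None)
        if number is None:
            continue
        # Validate the candidate block against the geocoded data.
        coords = GAZETTEER.get((street, number))
        if coords is not None:
            addresses.append({"street": street, "number": number,
                              "postal_code": postal, "coords": coords})
    return addresses


words = "Visit us at Kauppakatu 10, 80100 Joensuu today".split()
print(detect_addresses(words))
```

Only candidates that resolve to coordinates are kept, mirroring the final IF in the pseudocode.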

  10. Street-address Detection • We use a rule-based pattern matching algorithm • The detection of street-names is the starting point of the algorithm • An address-block candidate is constructed by detecting typical address elements (street names, numbers, postal codes, telephone numbers and municipal names) • Address block candidates are validated using the gazetteer

  11. Street-name Detection • Street-name detection is the starting point of the address detection • Heuristic and brute-force methods are compared against our prefix-tree solution • Our application uses a commercial gazetteer for Finland and, for Singapore, street data from the free map project OpenStreetMap

  12. Prefix Trees • Invented by Fredkin (1960) • The prefix tree (or trie) is a fast ordered tree data structure used for retrieval • Root is associated with an empty string • All the descendants of a node have a common prefix of the string associated with that node • Some nodes can have associated values (usually they mark the end of a word)
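The properties listed above map directly onto a small data structure. Below is a minimal, generic trie sketch in Python. The class and method names are chosen for illustration and the example street names are assumptions, not data from the presentation.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> child node
        self.is_word = False  # True if a stored string ends at this node


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()  # the root stands for the empty string
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _walk(self, prefix):
        """Follow prefix from the root; return the node reached or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None


trie = Trie(["kauppakatu", "kauppatori", "koulukatu"])
print(trie.contains("kauppakatu"))  # True
print(trie.contains("kauppa"))      # False: a prefix, but not a stored word
print(trie.has_prefix("kauppa"))    # True
```

A lookup costs one step per character of the query word, independent of how many names are stored.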

  13. Street-name Prefix Trees • Our solution is to detect street-names using prefix trees constructed from the gazetteer • A street-name prefix tree is built for each municipality used in the search • The user’s location and area of interest are known, therefore the prefix trees can be limited to the relevant municipalities

  14. Other solutions • Heuristic solution: relies on regular expression matching; street names usually have similar endings or similar prefixes; a gazetteer is not needed (except for validation); can be fast but not precise • Brute-force solution: every word is checked against the gazetteer; an optimized version is used (the gazetteer is locally limited and preloaded into arrays)
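The two baseline approaches can be contrasted with a short sketch. The suffix list in the regular expression and the `GAZETTEER_NAMES` array are assumptions made for illustration. The presentation does not give the actual rules or data.

```python
import re

# Assumed heuristic: many Finnish street names end in one of these suffixes.
STREET_SUFFIX = re.compile(r"^\w+(katu|tie|kuja|polku)$", re.IGNORECASE)

# Hypothetical preloaded gazetteer extract for the area of interest.
GAZETTEER_NAMES = ["kauppakatu", "koulukatu", "siltakatu"]


def heuristic_is_street(word):
    """Guess from the word's shape alone; no gazetteer lookup."""
    return STREET_SUFFIX.match(word) is not None


def brute_force_is_street(word):
    """Compare the word against every preloaded gazetteer entry."""
    target = word.lower()
    for name in GAZETTEER_NAMES:
        if name == target:
            return True
    return False


for word in ["Kauppakatu", "necktie"]:
    print(word, heuristic_is_street(word), brute_force_is_street(word))
# Kauppakatu True True
# necktie True False
```

The word "necktie" shows why the heuristic is fast but imprecise: it merely ends in "tie". The brute-force check is exact, but its cost grows with the number of gazetteer entries scanned per word.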

  15. Experiments • 10 urban locations (blue) and 10 rural locations (orange) were used for testing • Testing was done using the MOPSI prototype for Finland and Singapore • Both commercial and non-commercial keywords were used:

  16. Results • Average processing times for every solution were calculated • The prefix tree solution proved to be on average 57% faster and 10% more accurate than the heuristic solution and 10 times faster than the brute-force solution • The resulting solution improves the speed and quality of web-page georeferencing

  17. Open Problems • Support approximate matching to avoid problems caused by misspellings • Improve the flexibility of the address detection algorithm • Implement a way to learn rules automatically from a hand-tagged example corpus
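As one possible direction for the approximate-matching problem, and not something the presentation proposes, a misspelled word could be compared against known street names by string similarity. The sketch below uses `difflib.get_close_matches` from Python's standard library. The name list and the 0.85 cutoff are assumptions.

```python
import difflib

# Hypothetical gazetteer extract.
STREET_NAMES = ["kauppakatu", "koulukatu", "siltakatu"]


def closest_street(word, cutoff=0.85):
    """Return the most similar known street name, or None if nothing is close enough."""
    matches = difflib.get_close_matches(word.lower(), STREET_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(closest_street("Kaupakatu"))  # one "p" missing
print(closest_street("pizzeria"))
```

This compares the word against every candidate, so it would more likely serve as a fallback after an exact prefix-tree lookup fails than as a replacement for it.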

  18. http://cs.joensuu.fi/mopsi Thank you!
