Automated Georeferencing of Natural History Museum Data
Nelson E. Rios
Design: Natural Language Processing
A locality description, along with country, state and county information, is input into GEOLocate. Georeferencing begins by standardizing the locality description string into a format of common terms. For example, distances mentioned in a locality string are converted to miles. Once standardized, the locality string is parsed into key geographic identifiers. Examples of the identifiers GEOLocate uses include named places, navigable river miles, highway names, water body names, legal locations and displacement patterns. These identifiers are used to determine geographic coordinates through database lookups and geographic calculations. The resulting coordinates are ranked according to the type of information found within the string and plotted on the digital map display for user verification, correction and error determination.
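The standardize, parse and calculate steps described above can be illustrated with a minimal sketch. This is not GEOLocate's actual code: the function names, the regular expressions, the one-entry gazetteer and the place name "Exampleville" are all assumptions made for illustration, and only a single displacement pattern ("<distance> mi <direction> of <place>") is handled.

```python
import math
import re

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles
KM_TO_MI = 0.621371

# Compass bearings in degrees clockwise from north.
BEARINGS = {"N": 0, "NE": 45, "E": 90, "SE": 135,
            "S": 180, "SW": 225, "W": 270, "NW": 315}

# Hypothetical patterns; a real system would need many more variants.
DISTANCE_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(kilometers?|km|miles?|mi)\b\.?", re.IGNORECASE)
DISPLACEMENT_RE = re.compile(
    r"(?P<dist>\d+(?:\.\d+)?) mi (?P<dir>NE|NW|SE|SW|N|S|E|W) of (?P<place>[^,;]+)")


def standardize(locality):
    """Rewrite every distance in the string in miles, e.g. '10 km' -> '6.21 mi'."""
    def to_miles(match):
        value = float(match.group(1))
        unit = match.group(2).lower()
        miles = value * KM_TO_MI if unit.startswith("k") else value
        return f"{miles:.2f} mi"
    return DISTANCE_RE.sub(to_miles, locality)


def parse_displacement(locality):
    """Return (miles, direction, place) for a displacement pattern, or None."""
    match = DISPLACEMENT_RE.search(locality)
    if match is None:
        return None
    return float(match["dist"]), match["dir"], match["place"].strip()


def offset(lat, lon, miles, direction):
    """Shift a point by a distance and compass direction (small-distance approximation)."""
    bearing = math.radians(BEARINGS[direction])
    dlat = miles * math.cos(bearing) / EARTH_RADIUS_MI
    dlon = miles * math.sin(bearing) / (EARTH_RADIUS_MI * math.cos(math.radians(lat)))
    return lat + math.degrees(dlat), lon + math.degrees(dlon)


def georeference(locality, gazetteer):
    """Standardize, parse, look up the named place, then apply the displacement."""
    parsed = parse_displacement(standardize(locality))
    if parsed is None:
        return None
    miles, direction, place = parsed
    anchor = gazetteer.get(place.lower())
    if anchor is None:
        return None
    return offset(anchor[0], anchor[1], miles, direction)


# Made-up gazetteer entry, used only to exercise the sketch.
GAZETTEER = {"exampleville": (30.0, -90.0)}
result = georeference("10 km N of Exampleville", GAZETTEER)
```

Here `result` is a point slightly north of the made-up anchor at the same longitude. The ranking and map verification steps are not shown.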
It is estimated that the number of biological specimens in US museums and herbaria exceeds 750 million. In the vast majority of instances, the collection location is recorded as a string of text and lacks geographic coordinates. We have developed a tool that interprets the descriptive locality text associated with natural history collections data, determines geographic coordinates, and allows the user to verify and correct those coordinates. Traditional methods for georeferencing collection data from text descriptions are tedious and time consuming, typically involving finding the locality on hardcopy or digital maps, plotting the locality and determining the coordinates. Using our tool, GEOLocate, considerably reduces the time required to georeference locality information. It took one staff member approximately 1.5 years to georeference the 15,000 unique locality descriptions within the Tulane fish collection. Time trials with GEOLocate suggest that this job could have been accomplished in under 6 months.
Using GEOLocate can significantly reduce the time required to georeference natural history data. GEOLocate was able to assign coordinates to over 98% of the locality data tested. This initial assignment of coordinates should be considered only a "rough" pass at the data, and each record should be visually inspected and corrected as necessary.
Locality records with incorrect or missing county information typically have greater error in the resulting coordinates, because a larger area must be searched when county information is absent. Depending on the quality of the original locality data, georeferencing results can be improved by first checking the locality dataset for misspelled, missing, incorrect, and/or ambiguous information.
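The effect of missing county information on the search area can be shown with a small sketch. The gazetteer rows, county names and the `candidates` function below are hypothetical and do not reflect GEOLocate's database layout. The point is only that a place name shared by several counties yields more candidate coordinates when no county is available to filter on.

```python
# Hypothetical gazetteer rows: (place name, county, latitude, longitude).
GAZETTEER_ROWS = [
    ("Mill Creek", "County A", 30.10, -90.20),
    ("Mill Creek", "County B", 31.45, -91.05),
    ("Mill Creek", "County C", 32.80, -89.60),
    ("Bear Lake", "County A", 30.25, -90.40),
]


def candidates(place, county=None):
    """Return all coordinates matching a place name, optionally limited to one county."""
    return [
        (lat, lon)
        for name, row_county, lat, lon in GAZETTEER_ROWS
        if name.lower() == place.lower()
        and (county is None or row_county.lower() == county.lower())
    ]


without_county = candidates("Mill Creek")
with_county = candidates("Mill Creek", "County A")
```

With the county supplied, a single candidate remains. Without it, all three "Mill Creek" entries must be considered, so a wrong pick can land far from the true locality.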
Application: Test Bed Results
A total of 11,521 unique locality descriptions containing geographic coordinates were extracted from the TUMNH database and imported into GEOLocate. Of these, 11,295 records were auto-assigned coordinates by GEOLocate within a 3-hour period. 36% of the georeferenced records were within 1 mile of the original coordinates, and 83% were within 15 miles, permitting easy verification and correction on the map display. In time trials with GEOLocate, georeferencing, verifying and correcting a locality record took 45-60 seconds on average.
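One way to compute comparison figures of this kind is sketched below. The source does not state how the distances were measured. The haversine great-circle formula is assumed here, and the function names and sample coordinate pairs are invented for illustration. Only the two record counts come from the text.

```python
import math

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))


def fraction_within(pairs, threshold_miles):
    """Share of (original, auto-assigned) coordinate pairs no farther apart than the threshold."""
    distances = [haversine_miles(o[0], o[1], a[0], a[1]) for o, a in pairs]
    return sum(d <= threshold_miles for d in distances) / len(distances)


# Invented sample pairs: (original coordinates, auto-assigned coordinates).
sample_pairs = [
    ((30.0, -90.0), (30.0, -90.0)),  # identical
    ((30.0, -90.0), (30.1, -90.0)),  # about 7 miles apart
    ((30.0, -90.0), (31.0, -90.0)),  # about 69 miles apart
]
within_1_mile = fraction_within(sample_pairs, 1)
within_15_miles = fraction_within(sample_pairs, 15)

# Assignment rate from the counts reported in the text.
assignment_rate = 11295 / 11521
```

The assignment rate computed from the reported counts is just above 0.98, which matches the "over 98%" figure given elsewhere in the poster.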
I would like to thank the following for reviewing early versions of GEOLocate: James S. Albert, Jonathan Armbruster, Jeremy Bartley, Andy Bentley, Stephanie Coste, Paul David, Bud Freeman, John Friel, Tom Giermakowski, Robert Glaubitz, Sara J. Gottlieb, Brendan Haley, Chad Hargrave, Dean Hendrickson, Mikaela Howie, Denny Hugg, Janeen Jones, Edie Marsh, Kris McNyset, Jonathon Rothman, Barbara Scudder, Steph Smith and John Wieczorek.
This research was supported by
a grant from the National
The Tulane University Fish Collection, with 7.1 million fluid-preserved specimens in over 190,000 lots, collected from over 15,000 locations worldwide, is one of the largest collections in the world and is recognized as a "National Center of Ichthyology Resource Collection". During the early 1990s, the entire collection was computerized and georeferenced. Georeferencing the collection took nearly 2 years and required labor-intensive lookups in both paper and digital maps. This experience, along with the resultant dataset of georeferenced information, became a test bed for the development of an automated georeferencing system for natural history information called GEOLocate. GEOLocate is a software tool that enables researchers to easily assign geographic coordinates to a descriptive string of locality information, visualize the location, and make corrections as necessary.