Mining Gazetteer Data from Digital Library Collections
David Smith, Perseus Project, Tufts University
Corpus Preview Perseus Project, JCDL 2002
Preview: 1400-1600
What DLs can do for gazetteers
• Directly manage gazetteers
• Raw materials for gazetteers
• Reference works
• Monolingual and parallel corpora
• Testbeds for improving these technologies
• E.g. alignment helps name tagging, and name tagging helps alignment
Lexicographical parallels
• Original “slipping” process
• First, get a madman ...
• Creation of Brown and other corpora
• Kučera and Francis
• Cobuild dictionary and friends
• But names “get no respect” in lexicography (McDonald, 1996)
Cultural dependencies
Toponym Results
Projection principles
• Exploits asymmetry in human language technologies (Yarowsky, HLT 2001)
• English, French, Chinese, Czech (!) have
• POS taggers, morphological analyzers
• Named entity identifiers
• Parsers and bracketers
• Parallel corpus alignment allows projection of these resources
Projection principles
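The projection idea can be sketched in a few lines. This is a minimal illustration, not the method from the talk or from Yarowsky's work: the function name `project_tags`, the tag labels, and the toy word alignment are all hypothetical, and a real system would get its alignment from a statistical aligner over a parallel corpus.

```python
def project_tags(source_tags, alignment, target_length, default="O"):
    """Copy tags from source tokens onto the target tokens aligned to them.

    source_tags: one tag per source token (e.g. "PLACE", or "O" for none).
    alignment: iterable of (source_index, target_index) pairs (assumed given).
    target_length: number of tokens in the target sentence.
    """
    target_tags = [default] * target_length
    for s, t in alignment:
        # Only non-default tags are projected; unaligned targets keep the default.
        if source_tags[s] != default:
            target_tags[t] = source_tags[s]
    return target_tags


# Toy example: a three-token source sentence whose last token is a place name,
# aligned to a target sentence with the word order reversed.
print(project_tags(["O", "O", "PLACE"], [(0, 2), (2, 0)], 3))
# -> ['PLACE', 'O', 'O']
```

The point of the sketch is that the target language needs no tagger of its own: the annotation comes entirely from the source side plus the alignment.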
Projection on the cheap
• Align texts at coarse structural level
• Geocode source text (English)
• Optionally winnow target text (e.g. drop non-capitalized words where applicable)
• Calculate mutual information (Church & Hanks, 1990)
• Transliteration may be too ad hoc
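The steps on this slide can be sketched as follows. This is a rough illustration under stated assumptions, not the Perseus implementation: the input format, the function name `pmi_candidates`, the `min_count` threshold, and the toy transliterated tokens are all hypothetical. Mutual information is computed here as pointwise mutual information over chunk-level co-occurrence, in the spirit of Church & Hanks (1990).

```python
import math
from collections import Counter


def pmi_candidates(aligned_chunks, source_toponyms, min_count=2):
    """Score (source toponym, target word) pairs by pointwise mutual information.

    aligned_chunks: list of (source_tokens, target_tokens) pairs, aligned at a
        coarse structural level such as a chapter or section (assumed format).
    source_toponyms: source-language tokens already geocoded as places.
    min_count: pairs co-occurring in fewer chunks than this are ignored.
    """
    n = len(aligned_chunks)
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in aligned_chunks:
        src = set(src_tokens) & source_toponyms
        # Winnowing: keep only capitalized target words (where the script has case).
        tgt = {w for w in tgt_tokens if w[:1].isupper()}
        src_count.update(src)
        tgt_count.update(tgt)
        pair_count.update((s, t) for s in src for t in tgt)

    scores = {}
    for (s, t), c in pair_count.items():
        if c < min_count:
            continue
        # PMI = log2( P(s, t) / (P(s) * P(t)) ), with probabilities per chunk.
        scores[(s, t)] = math.log2(c * n / (src_count[s] * tgt_count[t]))
    return scores


# Toy data only: four aligned chunks with made-up transliterated target tokens.
chunks = [
    (["Athens", "sent", "ships"], ["Athenai", "naus", "epempse"]),
    (["Athens", "and", "Sparta"], ["Athenai", "kai", "Sparte"]),
    (["Sparta", "marched"], ["Sparte", "estrateuse"]),
    (["the", "king", "spoke"], ["basileus", "eipe"]),
]
print(pmi_candidates(chunks, {"Athens", "Sparta"}))
# -> {('Athens', 'Athenai'): 1.0, ('Sparta', 'Sparte'): 1.0}
```

Because the scoring relies only on co-occurrence counts, it needs no transliteration rules, which is what makes it "cheap"; the `min_count` cutoff trades recall for precision on rare names.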
Preliminary results
• Greek/English testbed
• 98% precision
• 70.8% recall (Why?)
• Ethnic designations present interesting problems
• “Stephanus of Byzantium”
• Morphology outside of English
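For readers unfamiliar with the two measures reported above, the following sketch shows how precision and recall are computed from a set of extracted toponyms and a hand-checked gold set. The function name and the toy sets are hypothetical and are not the Greek/English testbed data.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for a set of predictions against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of predictions that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold items that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data: both predictions are correct, but half of the gold names are missed.
print(precision_recall({"Athenai", "Sparte"},
                       {"Athenai", "Sparte", "Thebai", "Argos"}))
# -> (1.0, 0.5)
```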
Proposals
• Preservation of gazetteer source materials
• DLs as home for gazetteer “slips”
• Parallel texts as key resource
• (also cf. Berkeley TIDES work)
• Persistent documents as training sets for automatic methods
• http://www.perseus.tufts.edu