Indexing Methods for Faster and More Effective Person Name Search

Mark Arehart

MITRE Corporation

[email protected]


Goals

  • Not about NER per se.

  • Assume NER is already done.

  • Make output useful to users

    • Searchable with approximate matching

    • Not an offline process: fast response time

  • Balance search effectiveness and speed.


Context: DARPA TIGR system


Person Names in TIGR

  • Entered by soldiers in reports.

  • Users lack linguistic expertise.

  • Spelling/transliteration variation.

  • Data entry errors.

  • Generic text search provided by IR system does not compensate.

  • Name index created by NER (Miller et al 10).


Approximate Name Matching

  • Research community:

    • phonetic keys

    • n-gram matching

    • edit-based measures (with fixed, variable, or learned edit costs)

    • frequency-based measures

    • string-based and token-based measures

    • Refs: Winkler 90, Zobel and Dart 95, Ristad and Yianilos 98, Bilenko and Mooney 03, Cohen et al 03, Christen 06.

  • Commercial systems (expensive)


Performance Problem

  • Fuzzy matching is slow.

  • 2000 comparisons/sec sounds fast, right?

  • Match query to every database name:

    query_time = size_db * avg_match_time

  • 0.5 ms per comparison times a db size of 100,000 = 50 seconds per query.

  • Not fast.
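
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name and the choice of seconds as the unit are illustrative, not from the slides.

```python
def brute_force_query_time(size_db, avg_match_time_s):
    """Time to compare one query against every name in the database."""
    return size_db * avg_match_time_s

# 2000 comparisons per second is 0.5 ms per comparison.
avg_match_time_s = 1 / 2000

query_time_s = brute_force_query_time(100_000, avg_match_time_s)
print(f"{query_time_s:.0f} seconds per query")
```

The cost grows linearly with the database size, which is why a faster comparison function alone cannot fix the problem.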


Solution Part 1

  • Make comparison function faster.

  • Say you more than double the speed through code optimization.

  • 0.18 ms * 100,000 records = 18 seconds per query.

  • Much better, but…


Solution Part 2

  • Pass 1: blocking

    • developed in record linkage (see Winkler 06 for an overview)

    • quick (dumb) retrieval of candidates.

  • Pass 2: matching

    • slow (smart) comparison function.

  • Blocking function must:

    • Retrieve a small subset of the db.

    • Do so quickly.

    • Include all the true matches.


Two-Pass Matching

  • Create text index of database names.

  • Each name is indexed by one or more keys.

  • At query time, generate keys for query name.

  • Retrieve candidates using direct key lookup.

  • Apply comparison function to candidates.
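
The five steps above could be sketched roughly as follows. Everything here is an assumption made for illustration: the keys are four-letter prefixes of each name part, `difflib.SequenceMatcher` stands in for the slow comparison function, and the 0.8 threshold is arbitrary. None of these choices is claimed to be what the TIGR system uses.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def make_keys(name, prefix_len=4):
    """Blocking keys: the first few letters of each name part (an assumption)."""
    return {part[:prefix_len] for part in name.upper().split()}

def build_index(names):
    """Index every database name under each of its keys."""
    index = defaultdict(set)
    for name in names:
        for key in make_keys(name):
            index[key].add(name)
    return index

def retrieve_candidates(query, index):
    """Pass 1 (blocking): direct key lookup, no string comparison."""
    candidates = set()
    for key in make_keys(query):
        candidates |= index.get(key, set())
    return candidates

def slow_match(a, b):
    """Stand-in for the expensive comparison function."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

def search(query, index, threshold=0.8):
    """Pass 2 (matching): score only the retrieved candidates."""
    scored = [(slow_match(query, c), c) for c in retrieve_candidates(query, index)]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)

names = ["Saddam Hussein Al Tikriti", "Saddam Husein", "Ahmed Hassan", "John Smith"]
index = build_index(names)
results = search("Saddam Hussain", index)
```

Only names sharing at least one key with the query ever reach `slow_match`, so the expensive function runs on a small subset of the database.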


Ways to Make Keys

Original name = Saddam Hussein Al Tikriti

Exact      [SADDAM, HUSSEIN, (AL), TIKRITI]

Substring  [SADD, HUSS, (AL), TIKR]

Phonetic   [STM, HSN, (AL), TKRT]

Better not to index particles such as AL, ABU, BIN.
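
A minimal sketch of the three key types follows. The slides do not define the custom phonetic key, so `toy_phonetic_key` is a hypothetical rule set (merge a few voiced/unvoiced consonants, collapse doubled letters, drop vowels) chosen only because it reproduces the three example keys above. The stoplist contains just the particles named on the slide.

```python
import re

PARTICLES = {"AL", "ABU", "BIN"}  # particles named on the slide; not a full list

def name_parts(name):
    """Upper-case name parts with particles removed."""
    return [p for p in name.upper().split() if p not in PARTICLES]

def exact_keys(name):
    return name_parts(name)

def substring_keys(name, prefix_len=4):
    """Prefix of each part; the length 4 matches the slide's examples."""
    return [p[:prefix_len] for p in name_parts(name)]

def toy_phonetic_key(part):
    """Hypothetical phonetic key, NOT the author's algorithm."""
    part = part.upper().translate(str.maketrans("DBGZ", "TPKS"))
    part = re.sub(r"(.)\1+", r"\1", part)  # collapse doubled letters
    return re.sub(r"[AEIOU]", "", part)    # drop vowels

def phonetic_keys(name):
    return [toy_phonetic_key(p) for p in name_parts(name)]

name = "Saddam Hussein Al Tikriti"
print(exact_keys(name))
print(substring_keys(name))
print(phonetic_keys(name))
```

Coarser keys group more spelling variants together (for example, `toy_phonetic_key` maps Hussein, Husein and Hassan to the same key) at the cost of retrieving more candidates.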


Key-based Index

STM   [Saddam Hussein Al Tikriti, Saddam Husein, …]

HSN   [Saddam Hussein Al Tikriti, Hosein Mohamed, Ahmed Hassan, …]

TKRT  [Saddam Hussein Al Tikriti, Uday Hussein Al Tikriti, …]
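
An index of this shape is an inverted index from key to names, which might be built as sketched below. The key function is a hypothetical stand-in (merge a few consonants, collapse doubled letters, drop vowels), not the custom phonetic key from the talk. The five names are the ones shown on the slide.

```python
import re
from collections import defaultdict

PARTICLES = {"AL", "ABU", "BIN"}  # not indexed

def toy_phonetic_key(part):
    """Hypothetical phonetic key, used here only for illustration."""
    part = part.upper().translate(str.maketrans("DBGZ", "TPKS"))
    part = re.sub(r"(.)\1+", r"\1", part)  # collapse doubled letters
    return re.sub(r"[AEIOU]", "", part)    # drop vowels

def build_index(names, key_fn=toy_phonetic_key):
    """Map each key to the list of names that contain a part with that key."""
    index = defaultdict(list)
    for name in names:
        parts = [p for p in name.upper().split() if p not in PARTICLES]
        for key in {key_fn(p) for p in parts}:
            index[key].append(name)
    return index

names = [
    "Saddam Hussein Al Tikriti",
    "Saddam Husein",
    "Hosein Mohamed",
    "Ahmed Hassan",
    "Uday Hussein Al Tikriti",
]
index = build_index(names)
for key in ("STM", "HSN", "TKRT"):
    print(key, index[key])
```

Because lookup is a single dictionary access per key, candidate retrieval time does not depend on how many names are stored under other keys.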


Retrieval Using Keys

  • Generate keys from query name.

    • Refinement: don’t index particles (using stoplist).

  • Return names associated with each key.

    • Refinement: for longer names, require more than one key match.

  • Do fuzzy matching on the retrieved candidates.
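
The two refinements can be sketched as below. The slides do not say what counts as a "longer" name or how many keys must match, so the cut-offs here (`long_name_parts=3`, `min_hits_long=2`, judged on the query) are assumptions, as is the use of four-letter prefix keys.

```python
from collections import Counter, defaultdict

STOPLIST = {"AL", "ABU", "BIN"}  # particles named in the slides

def make_keys(name, prefix_len=4):
    """Prefix keys for every name part that is not a particle."""
    return {p[:prefix_len] for p in name.upper().split() if p not in STOPLIST}

def build_index(names):
    index = defaultdict(set)
    for name in names:
        for key in make_keys(name):
            index[key].add(name)
    return index

def retrieve(query, index, long_name_parts=3, min_hits_long=2):
    """Return candidates; long queries must share more than one key."""
    keys = make_keys(query)
    hits = Counter()
    for key in keys:
        for name in index.get(key, ()):
            hits[name] += 1
    required = min_hits_long if len(keys) >= long_name_parts else 1
    return {name for name, count in hits.items() if count >= required}

names = [
    "Saddam Hussein Al Tikriti",
    "Uday Hussein Al Tikriti",
    "Saddam Husein",
    "Ahmed Hassan",
]
index = build_index(names)
long_query = retrieve("Saddam Hussein Al Tikriti", index)
short_query = retrieve("Saddam Hussein", index)
```

The returned set is what the fuzzy matcher would then score. Requiring several key matches shrinks the candidate set for long names but can drop a true match that shares only one key, so the cut-offs would need tuning against test data.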


Evaluation

  • Existing datasets not appropriate.

    • String matching research: too small, or lacking the right kinds of variation (Pfeifer 95, Zobel and Dart 95, Cohen et al 03, Bilenko and Mooney 03)

    • Record linkage: datasets have multiple data fields (Winkler 06)

  • Our test set (previously developed): approx. 700 queries run against 70,000 names.

    • Test data is noisy and multicultural.

    • Contains many kinds of Arabic name variants.

  • Runs evaluated for accuracy and speed.
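
One way such a run could be scored is sketched below. The slides say only "accuracy and speed"; treating accuracy as precision and recall over a gold standard is an assumption, and the tiny database, gold set and first-name matcher are made up purely to exercise the function.

```python
import time

def evaluate(search_fn, gold):
    """gold maps each query to the set of names that should be returned."""
    tp = fp = fn = 0
    start = time.perf_counter()
    for query, expected in gold.items():
        returned = set(search_fn(query))
        tp += len(returned & expected)   # correct matches
        fp += len(returned - expected)   # spurious matches
        fn += len(expected - returned)   # missed matches
    elapsed = time.perf_counter() - start
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "sec_per_query": elapsed / len(gold),
    }

# Made-up data: a four-name database and a matcher that compares first names only.
db = ["Saddam Husein", "Saddam Hussein Al Tikriti", "Ahmed Hassan", "Ahmed Ali"]

def first_name_search(query):
    return [n for n in db if n.split()[0] == query.split()[0]]

gold = {
    "Saddam Hussein": {"Saddam Husein", "Saddam Hussein Al Tikriti"},
    "Ahmed Hasan": {"Ahmed Hassan"},
}
report = evaluate(first_name_search, gold)
```

The same gold standard can score the blocking pass on its own: its recall is an upper bound on the recall of the full two-pass search.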


Matching Functions

  • JaroWinkler: generic string matching baseline

  • Level 2 JaroWinkler: tokenized

  • Romarabic: custom algorithm (Freeman 06)

    • dictionary of common variants

    • name part similarity backs off to edit distance

    • aware of multi-segment name parts

    • finds optimal alignment
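
For reference, the two baselines can be sketched in plain Python. `jaro` and `jaro_winkler` follow the standard definitions (prefix scale 0.1, prefix capped at 4 characters). The slides describe Level 2 only as "tokenized", so `level2_jaro_winkler` assumes one common reading: average, over the tokens of the first name, of the best score against any token of the second. Romarabic is not sketched because the slides give no algorithmic detail.

```python
def jaro(s, t):
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        # Look for an unused equal character within the match window.
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up out of order count as half-transpositions.
    s_seq = [c for c, m in zip(s, s_matched) if m]
    t_seq = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s, t, prefix_scale=0.1, max_prefix=4):
    """Jaro score boosted for a shared prefix."""
    score = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)

def level2_jaro_winkler(name_a, name_b):
    """Token-level variant (assumed definition, see lead-in)."""
    a_tokens, b_tokens = name_a.upper().split(), name_b.upper().split()
    if not a_tokens or not b_tokens:
        return 0.0
    best = [max(jaro_winkler(a, b) for b in b_tokens) for a in a_tokens]
    return sum(best) / len(best)
```

The token-level variant is insensitive to name-part order, which whole-string JaroWinkler is not; under this definition it is also asymmetric in its two arguments.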


JaroWinkler


Level 2 JaroWinkler


Romarabic


Conclusion

  • For NER to be useful, system performance must be considered.

    • Most accurate matcher may be impractical

  • Multiple-pass algorithm

    • Speed and accuracy are not a tradeoff here.

  • Very simple methods are often the best.

    • custom phonetic key did worse than prefix

  • Important to use large and realistic test set.

