Indexing Methods for Faster and More Effective Person Name Search


Mark Arehart, MITRE Corporation


Goals

  • Not about NER per se.

  • Assume NER is already done.

  • Make output useful to users

    • Searchable with approximate matching

    • Not an offline process: fast response time

  • Balance search effectiveness and speed.

Person Names in TIGR

  • Entered by soldiers in reports.

  • Users lack linguistic expertise.

  • Spelling/transliteration variation.

  • Data entry errors.

  • Generic text search provided by IR system does not compensate.

  • Name index created by NER (Miller et al 10).

Approximate Name Matching

  • Research community:

    • phonetic keys

    • n-gram matching

    • edit-based measures (with fixed, variable, or learned edit costs)

    • frequency-based measures

    • string-based and token-based measures

    • Refs: Winkler 90, Zobel and Dart 95, Ristad and Yianilos 98, Bilenko and Mooney 03, Cohen et al 03, Christen 06.

  • Commercial systems (expensive)

Performance Problem

  • Fuzzy matching is slow.

  • 2,000 comparisons/sec (0.5 ms each) sounds fast, right?

  • Match query to every database name:

    query_time = size_db * avg_match_time

  • 0.5 ms times db size of 100,000 = 50 seconds per query.

  • Not fast.
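The arithmetic on this slide can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; the function name `brute_force_query_seconds` is made up for illustration, and the figures are the ones quoted on the slide.

```python
def brute_force_query_seconds(db_size: int, avg_match_ms: float) -> float:
    """Estimated time for one query when it is compared against every name.

    Implements query_time = size_db * avg_match_time, converting ms to seconds.
    """
    return db_size * avg_match_ms / 1000.0


# Figures from the slide: 0.5 ms per comparison, 100,000 names in the database.
comparisons_per_second = 1000.0 / 0.5
query_seconds = brute_force_query_seconds(100_000, 0.5)

print(f"{comparisons_per_second:.0f} comparisons/sec")
print(f"{query_seconds:.0f} seconds per query")
```

The cost grows linearly with database size, which is why a faster comparison function alone does not solve the problem.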

Solution Part 1

  • Make comparison function faster.

  • Say you more than double the speed through code optimization.

  • 0.18ms * 100,000 records = 18 seconds.

  • Much better, but…

Solution Part 2

  • Pass 1: blocking

    • developed in record linkage (Winkler 06 for overview)

    • quick (dumb) retrieval of candidates.

  • Pass 2: matching

    • slow (smart) comparison function.

  • Blocking function must:

    • Retrieve a small subset of the db.

    • Do so quickly.

    • Include all the true matches.
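Two of these requirements (small subset, all true matches included) can be measured per query. The sketch below uses reduction ratio and recall, which are standard measures in the record linkage literature; the slides do not say which measures were actually used, so the function and its name `blocking_quality` are assumptions for illustration.

```python
def blocking_quality(candidates: set, true_matches: set, db_size: int) -> tuple:
    """Return (reduction_ratio, recall) for one query's candidate set.

    reduction_ratio: fraction of the database the slow matcher can skip.
    recall: fraction of the true matches that survived blocking.
    """
    reduction_ratio = 1.0 - len(candidates) / db_size
    if not true_matches:
        recall = 1.0  # nothing to find, so nothing was missed
    else:
        recall = len(candidates & true_matches) / len(true_matches)
    return reduction_ratio, recall


# Hypothetical query: 4 candidates retrieved from 100 names, 1 of 2 true matches kept.
rr, rec = blocking_quality(
    candidates={"Saddam Husein", "Hosein Mohamed", "Ahmed Hassan", "Ali Hasan"},
    true_matches={"Saddam Husein", "Sadam Hussain"},
    db_size=100,
)
print(rr, rec)
```

A good blocking function pushes the reduction ratio toward 1 while keeping recall at or near 1; any true match lost in pass 1 cannot be recovered in pass 2.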

Two-Pass Matching

  • Create text index of database names.

  • Each name is indexed by one or more keys.

  • At query time, generate keys for query name.

  • Retrieve candidates using direct key lookup.

  • Apply comparison function to candidates.
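The five steps above can be sketched end to end as follows. This is a minimal illustration, not the system described in the talk: `make_keys` is a placeholder key function (first four letters of each token), and `difflib.SequenceMatcher` from the Python standard library stands in for the slow comparison function.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def make_keys(name):
    """Placeholder key function: first four letters of each token."""
    return {token[:4].upper() for token in name.split()}


def build_index(names):
    """Pass 0 (offline): map every key to the set of names that produce it."""
    index = defaultdict(set)
    for name in names:
        for key in make_keys(name):
            index[key].add(name)
    return index


def slow_similarity(a, b):
    """Stand-in for the expensive comparison function."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search(query, index, threshold=0.7):
    """Pass 1: direct key lookup. Pass 2: score only the retrieved candidates."""
    candidates = set()
    for key in make_keys(query):
        candidates |= index.get(key, set())
    scored = [(slow_similarity(query, name), name) for name in candidates]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


names = ["Saddam Hussein Al Tikriti", "Saddam Husein", "Ahmed Hassan"]
index = build_index(names)
for score, name in search("Saddam Hussain", index, threshold=0.0):
    print(f"{score:.2f}  {name}")
```

With this key function "Ahmed Hassan" is never scored for the query above, because it shares no key with it; the comparison function runs on two names instead of three.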

Ways to Make Keys

Original name = Saddam Hussein Al Tikriti


Substring [SADD, HUSS, (AL), TIKR]

Phonetic  [STM, HSN, (AL), TKRT]

Better to not index particles like AL, ABU, BIN
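A rough sketch of both key types follows. The substring key is taken to be a four-letter prefix, which matches the example above. The phonetic rules (map D to T, drop vowels, collapse doubled letters) are a toy stand-in chosen only because they reproduce the keys shown on this slide; the actual phonetic algorithm used in the talk is not given in the transcript.

```python
PARTICLES = {"AL", "ABU", "BIN"}  # stoplist from the slide
VOWELS = set("AEIOU")


def name_tokens(name):
    """Uppercase tokens of a name, with particles removed."""
    return [t for t in name.upper().split() if t not in PARTICLES]


def substring_keys(name, length=4):
    """Prefix key for each non-particle token."""
    return [t[:length] for t in name_tokens(name)]


def phonetic_key(token):
    """Toy phonetic key (assumed rules): D->T, collapse doubles, drop vowels."""
    mapped = token.upper().replace("D", "T")
    out = []
    prev = ""
    for ch in mapped:
        if ch != prev and ch not in VOWELS:
            out.append(ch)
        prev = ch
    return "".join(out)


def phonetic_keys(name):
    """Toy phonetic key for each non-particle token."""
    return [phonetic_key(t) for t in name_tokens(name)]


print(substring_keys("Saddam Hussein Al Tikriti"))
print(phonetic_keys("Saddam Hussein Al Tikriti"))
```

Under these toy rules the spelling variants Husein, Hosein, and Hassan all collapse to the same phonetic key as Hussein, whereas their four-letter prefixes differ.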

Key-based Index

STM  [Saddam Hussein Al Tikriti,

Saddam Husein, …]

HSM  [Saddam Hussein Al Tikriti,

Hosein Mohamed,

Ahmed Hassan, …]

TKRT  [Saddam Hussein Al Tikriti,

Uday Hussein Al Tikriti, …]

Retrieval Using Keys

  • Generate keys from query name.

    • Refinement: don’t index particles (using stoplist).

  • Return names associated with each key.

    • Refinement: for longer names, require more than one key match.

  • Do fuzzy matching on the retrieved candidates.
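The two refinements can be sketched with a hit counter over a small hand-built index. Everything here is illustrative: the index holds only three phonetic keys for five example names, and the rule "three or more query keys means at least two must match" is an assumed threshold, since the slide does not state the exact cutoff.

```python
from collections import Counter

STOPLIST = {"AL", "ABU", "BIN"}

# Toy index: three phonetic keys only, filled in for five example names.
INDEX = {
    "STM": ["Saddam Hussein Al Tikriti", "Saddam Husein"],
    "HSN": ["Saddam Hussein Al Tikriti", "Saddam Husein", "Hosein Mohamed",
            "Ahmed Hassan", "Uday Hussein Al Tikriti"],
    "TKRT": ["Saddam Hussein Al Tikriti", "Uday Hussein Al Tikriti"],
}


def retrieve(query_keys, index):
    """Return candidate names, requiring more key matches for longer queries."""
    keys = {k for k in query_keys if k not in STOPLIST}  # refinement 1: stoplist
    hits = Counter()
    for key in keys:
        for name in index.get(key, []):
            hits[name] += 1
    # Refinement 2 (assumed threshold): long names need at least two key matches.
    required = 2 if len(keys) >= 3 else 1
    return sorted(name for name, count in hits.items() if count >= required)


print(retrieve(["STM", "HSN", "AL", "TKRT"], INDEX))
print(retrieve(["HSN"], INDEX))
```

For the four-token query, names that share only the common key HSN are dropped before the expensive comparison runs, while the single-key query still returns every name under that key.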

Evaluation

  • Existing datasets not appropriate.

    • String matching research: too small, or lacking the right kinds of variation (Pfeifer 95, Zobel and Dart 95, Cohen et al 03, Bilenko and Mooney 03)

    • Record linkage: multiple data fields (Winkler 06)

  • Our previously developed test set: approximately 700 queries run against 70,000 names.

    • Test data is noisy and multicultural.

    • Contains many kinds of Arabic name variants.

  • Runs evaluated for accuracy and speed.

Matching Functions

  • JaroWinkler: generic string matching baseline

  • Level 2 JaroWinkler: tokenized

  • Romarabic: custom algorithm (Freeman 06)

    • dictionary of common variants

    • name part similarity backs off to edit distance

    • aware of multi-segment name parts

    • finds optimal alignment
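For reference, the two baselines can be sketched in plain Python. The Jaro-Winkler code follows the standard published definition (prefix length capped at 4, scaling factor 0.1). The tokenized "Level 2" variant is written here as one common reading of that scheme: each token of the first name is paired with its best-scoring token in the second name and the scores are averaged. Whether this matches the exact implementation used in the talk is an assumption. Romarabic is not sketched because its details are not given in the transcript.

```python
def jaro(s, t):
    """Standard Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_flags = [False] * len(s)
    t_flags = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_flags[j] and t[j] == ch:
                s_flags[i] = t_flags[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_matched = [c for c, f in zip(s, s_flags) if f]
    t_matched = [c for c, f in zip(t, t_flags) if f]
    transpositions = sum(a != b for a, b in zip(s_matched, t_matched)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s, t, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    score = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


def level2_jaro_winkler(a, b):
    """Tokenized variant: average best per-token match of a's tokens in b."""
    a_tokens, b_tokens = a.lower().split(), b.lower().split()
    if not a_tokens or not b_tokens:
        return 0.0
    best = [max(jaro_winkler(x, y) for y in b_tokens) for x in a_tokens]
    return sum(best) / len(best)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
print(level2_jaro_winkler("Hussein Saddam", "Saddam Hussein"))
```

The tokenized variant is insensitive to name-part order, which the whole-string version penalizes heavily.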

JaroWinkler

[Slide content not preserved in transcript]

Romarabic

[Slide content not preserved in transcript]

Conclusion

  • For NER to be useful, system performance must be considered.

    • Most accurate matcher may be impractical

  • Multiple pass algorithm

    • Speed/accuracy not a tradeoff here.

  • Very simple methods are often the best.

    • custom phonetic key did worse than prefix

  • Important to use large and realistic test set.