Adjudicator Agreement and System Rankings for Person Name Search


Presentation Transcript


  1. Adjudicator Agreement and System Rankings for Person Name Search Mark Arehart, Chris Wolf, Keith Miller The MITRE Corporation {marehart, cwolf, keith}@mitre.org

  2. Summary Matching multicultural name variants is knowledge intensive, and building a ground truth dataset requires tedious adjudication. Guidelines are not comprehensive, so adjudicators often disagree. Previous evaluations used multiple adjudication with voting. Results of this study: agreement is high, multiple adjudication is not needed, and “nearly” the same payoff is obtained for much less effort.

  3. Dataset Watchlist: ~71K records drawn from deceased-persons lists, mixed cultures; 1.1K variants for 404 base names (avg. 2.8 variants per base record). Queries: 700 total, the 404 base names plus 296 randomly selected from the watchlist; a subset of 100 was randomly selected for this study.

  4. Method Adjudication pools as in TREC: pool the output of 13 algorithms. Four judges adjudicated the complete pools (1712 pairs, excluding exact matches). Compare system rankings under different versions of the ground truth.
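
A minimal sketch of TREC-style pool construction, under assumptions the slides do not spell out (a fixed pool depth per system, and "exact match" meaning a candidate identical to the query name):

```python
# Sketch of TREC-style pooling (assumed details such as a fixed pool depth;
# the slides only say the pool combines output from 13 algorithms).
def build_pool(results_by_system, depth=50):
    """results_by_system: dict system -> {query: ranked list of candidate names}."""
    pool = {}
    for ranked_lists in results_by_system.values():
        for query, candidates in ranked_lists.items():
            pool.setdefault(query, set()).update(candidates[:depth])
    # Exclude exact matches, which need no adjudication
    return {q: {c for c in cands if c != q} for q, cands in pool.items()}
```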

  5. Adjudicator Agreement Measures, computed from the 2x2 table of paired judgments (a = pairs both judges call a match, d = pairs both call a non-match, b and c = pairs on which they disagree): overlap = a / (a + b + c); p+ = 2a / (2a + b + c); p- = 2d / (2d + b + c)
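
A minimal sketch (not from the paper) of how these three measures can be computed from two adjudicators' binary match/non-match judgments over the same candidate pairs:

```python
# Pairwise agreement measures for two adjudicators, assuming each gives a
# binary match/non-match label for every pair in the pool.
def agreement_measures(judge1, judge2):
    a = sum(1 for x, y in zip(judge1, judge2) if x and y)          # both say match
    b = sum(1 for x, y in zip(judge1, judge2) if x and not y)      # judge1 only
    c = sum(1 for x, y in zip(judge1, judge2) if not x and y)      # judge2 only
    d = sum(1 for x, y in zip(judge1, judge2) if not x and not y)  # both say non-match
    overlap = a / (a + b + c)
    p_pos = 2 * a / (2 * a + b + c)
    p_neg = 2 * d / (2 * d + b + c)
    return overlap, p_pos, p_neg
```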

  6. Adjudicator Agreement Lowest is A~B (kappa 0.57); highest is C~D (kappa 0.78).
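
For reference, a sketch of Cohen's kappa computed from the same 2x2 counts; treating it as Cohen's kappa is an assumption, since the slides do not state which kappa variant was used:

```python
# Sketch of Cohen's kappa from the 2x2 agreement counts (a, b, c, d as above).
def cohens_kappa(a, b, c, d):
    n = a + b + c + d
    p_observed = (a + d) / n
    # Chance agreement from each judge's marginal label probabilities
    p_yes = ((a + b) / n) * ((a + c) / n)
    p_no = ((c + d) / n) * ((b + d) / n)
    p_chance = p_yes + p_no
    return (p_observed - p_chance) / (1 - p_chance)
```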

  7. So far… Test watchlist and query list; results from 13 algorithms; adjudications by 4 volunteers; ways of compiling alternate ground truth sets. Still need…

  8. Comparing System Rankings Given two complete rankings of the same systems (in the slide's example, A B C D E vs. A C B E D), how similar are they? Kendall’s tau; Spearman’s rank correlation.
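
A minimal sketch of the two rank-correlation measures using SciPy, with the slide's example rankings encoded as per-system ranks (illustrative values only):

```python
# Rank correlation between two complete system rankings.
from scipy.stats import kendalltau, spearmanr

# Ranks of systems A..E under two versions of the ground truth
ranks_gt1 = [1, 2, 3, 4, 5]   # ranking 1: A B C D E
ranks_gt2 = [1, 3, 2, 5, 4]   # ranking 2: A C B E D

tau, _ = kendalltau(ranks_gt1, ranks_gt2)
rho, _ = spearmanr(ranks_gt1, ranks_gt2)
print(f"Kendall's tau = {tau:.2f}, Spearman's rho = {rho:.2f}")
```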

  9. Significance Testing Not all differences are significant (duh). The F1-measure is the harmonic mean of precision and recall; it is not a proportion or a mean of independent observations, so it is not amenable to traditional significance tests (like other IR measures, e.g. MAP). Bootstrap resampling: sample with replacement from the data and compute the difference over many trials, producing a distribution of differences.
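
A sketch of bootstrap resampling over queries for the F1 difference between two systems; the per-query representation and micro-averaged F1 are assumptions for illustration, not the authors' exact procedure:

```python
import random

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def bootstrap_f1_diff(per_query_counts_a, per_query_counts_b, trials=10000):
    """per_query_counts_* : list of (tp, fp, fn) tuples, one per query."""
    n = len(per_query_counts_a)
    diffs = []
    for _ in range(trials):
        idx = [random.randrange(n) for _ in range(n)]  # resample queries with replacement
        tp_a = sum(per_query_counts_a[i][0] for i in idx)
        fp_a = sum(per_query_counts_a[i][1] for i in idx)
        fn_a = sum(per_query_counts_a[i][2] for i in idx)
        tp_b = sum(per_query_counts_b[i][0] for i in idx)
        fp_b = sum(per_query_counts_b[i][1] for i in idx)
        fn_b = sum(per_query_counts_b[i][2] for i in idx)
        diffs.append(f1(tp_a, fp_a, fn_a) - f1(tp_b, fp_b, fn_b))
    return diffs  # distribution of F1 differences; check how often it crosses zero
```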

  10. Incomplete Ranking Not all differences are significant → partial ordering. How similar are two partial orderings of the same systems? [slide figure: two partially ordered rankings of systems A–E]

  11. Evaluation Statements First ordering: A>B A>C A>D A>E B=C B>D B>E C>D C>E D=E. Second ordering: A<B A>C A>D A>E B>C B>D B>E C>D C>E D=E.

  12. Similarity n systems → n(n-1)/2 evaluation statements. Reversal rate: proportion of reversed relations; here 10% (A>B vs. A<B). Total disagreement: 20% (the reversal plus B=C vs. B>C). Sensitivity: proportion of relations with a significant difference; 80% for the first ordering (two equalities out of ten relations) and 90% for the second (one equality).
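
A sketch of these similarity measures, with an assumed representation of the evaluation statements as pairwise relations ('>', '<', '='); the example data reproduce the slide's 10% / 20% / 80% / 90% figures:

```python
from itertools import combinations

def compare_relations(rel1, rel2, systems):
    pairs = list(combinations(systems, 2))          # n(n-1)/2 system pairs
    reversed_ = sum(1 for p in pairs if {rel1[p], rel2[p]} == {'>', '<'})
    disagree = sum(1 for p in pairs if rel1[p] != rel2[p])
    sens1 = sum(1 for p in pairs if rel1[p] != '=') / len(pairs)
    sens2 = sum(1 for p in pairs if rel2[p] != '=') / len(pairs)
    return {
        'reversal_rate': reversed_ / len(pairs),
        'disagreement': disagree / len(pairs),
        'sensitivity': (sens1, sens2),
    }

systems = ['A', 'B', 'C', 'D', 'E']
rel1 = {('A','B'): '>', ('A','C'): '>', ('A','D'): '>', ('A','E'): '>',
        ('B','C'): '=', ('B','D'): '>', ('B','E'): '>',
        ('C','D'): '>', ('C','E'): '>', ('D','E'): '='}
rel2 = dict(rel1)
rel2[('A','B')] = '<'
rel2[('B','C')] = '>'
print(compare_relations(rel1, rel2, systems))
# reversal_rate 0.1, disagreement 0.2, sensitivity (0.8, 0.9)
```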

  13. Comparisons With Baseline No reversals except with the intersection GT (one algorithm). Highest and lowest agreement with the consensus are shown in the slide's table; the lowest is notably low.

  14. GT Comparisons

  15. Comparison With Random 1000 GT versions created by randomly selecting a judge. Consensus sensitivity = 74.4%; average random sensitivity = 72.9% (significant difference at 0.05). Average disagreement with consensus = 7.3%: roughly 5% disagreement is expected (the actual figure is more), leaving a 2.3% remainder (actually less) attributable to the GT method. No reversals in any of the 1000 sets.
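
A sketch of building one such random ground-truth version; whether a judge is drawn per pair or per pool is not stated in the slides, so drawing per pair here is an assumption:

```python
import random

def random_ground_truth(judgments_by_judge, pairs):
    """judgments_by_judge: dict judge -> {pair: bool match judgment}.
    Returns one ground-truth version with a randomly chosen judge per pair."""
    judges = list(judgments_by_judge)
    return {pair: judgments_by_judge[random.choice(judges)][pair] for pair in pairs}

# Repeat 1000 times, score the 13 systems against each random GT, and compare
# the resulting rankings with the consensus ranking.
```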

  16. Conclusion Multiple adjudicators judging everything → expensive. A single adjudicator → variability in sensitivity. Multiple adjudicators randomly dividing the pool: slightly less sensitivity, no reversals of results, much less labor. Differences wash out, approximating the consensus: practically the same result for less effort.
