Sifter

Sifter: for classifying books dense (or not) in family history information, and for selecting 3-page sequences for evaluation. Deryle Lonsdale, 1 Oct. 2013. The task: develop a data-rich family history text range recognizer (Perl, machine learning, mostly OTS components, fully automatic).



  1. Sifter: for classifying books dense (or not) in family history information, and for selecting 3-page sequences for evaluation
     Deryle Lonsdale, 1 Oct. 2013

  2. The task
     • Develop a data-rich family history text range recognizer
     • Perl
     • Machine learning
     • Mostly OTS components
     • Fully automatic
     • Arbitrary text chunk size
     • Evaluate performance

  3. Method
     • Document features
        • Language identifier (and confidence)
           • We only want English (for now)
           • Used a pre-existing Perl module (Simões)
        • Type/token ratio
           • We want narrow-domain
        • % FH lexical items
           • We want to prefer FH vocabulary
           • Hand-coded, 49 words (died, married, cremation, etc.)
        • % integer words, % person words, % date words, % organization words, % location words
           • We want it to be data-rich
           • Used Stanford named entity engine
        • Average sentence length
           • Maybe sentences are shorter in FH text?
     • One vector (floating-point features) per text chunk (e.g. document)
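A minimal sketch of how a few of these features could be computed for one text chunk. It is written in Python for illustration (the slides say the actual system is in Perl). The tokenizer, the sentence splitter, the function name `chunk_features`, and the feature names are all assumptions. The lexicon holds only the three words the slide names out of the 49, and the named-entity features (person, date, organization, location) are omitted because they would need the Stanford named entity engine.

```python
import re

# Hypothetical, partial lexicon: the slide names only these 3 of the 49 words.
FH_WORDS = {"died", "married", "cremation"}

def chunk_features(text):
    """Return a dict of floating-point features for one text chunk."""
    # Assumed tokenization: runs of letters or runs of digits, lowercased.
    tokens = re.findall(r"[A-Za-z]+|\d+", text.lower())
    # Assumed sentence splitting: break on ., ! or ? and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = len(tokens)
    if n == 0:
        return {"type_token_ratio": 0.0, "pct_fh": 0.0,
                "pct_integer": 0.0, "avg_sentence_len": 0.0}
    return {
        # Distinct word forms divided by total words.
        "type_token_ratio": len(set(tokens)) / n,
        # Share of tokens found in the family history lexicon.
        "pct_fh": sum(t in FH_WORDS for t in tokens) / n,
        # Share of tokens that are integers (years, ages, counts).
        "pct_integer": sum(t.isdigit() for t in tokens) / n,
        # Mean number of tokens per sentence.
        "avg_sentence_len": n / len(sentences),
    }

features = chunk_features("John Smith died in 1901. He married Mary in 1875.")
```

Each chunk yields one such vector, so the same function would apply whether the chunk is a whole document or a shorter page range.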

  4. Evaluation
     • Gigaword corpus newswire
        • Associated Press Worldstream articles (Nov. 1994-May 1995)
        • 585 obituaries (192,000 words)
        • 649 non-obituaries (221,000 words, randomly selected from 85,000 articles)
     • TiMBL machine learning
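TiMBL is a memory-based (nearest-neighbor) learner: it stores labeled training vectors and classifies a new vector by its closest stored neighbors. The sketch below is a plain 1-nearest-neighbor stand-in using Euclidean distance, not TiMBL's own distance metric or feature weighting. The function name `nearest_label` and the two-feature training vectors are made up for illustration and are not data from the evaluation.

```python
import math

def nearest_label(train, query):
    """Return the label of the training vector closest to the query."""
    # train is a list of (feature_vector, label) pairs.
    closest = min(train, key=lambda item: math.dist(item[0], query))
    return closest[1]

# Hypothetical vectors: (% FH lexical items, % integer words).
train = [
    ((0.10, 0.05), "obit"),
    ((0.01, 0.00), "nonobit"),
]
label = nearest_label(train, (0.08, 0.04))
```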

  5. Results
     F-Score beta=1, microav: 0.939263
     F-Score beta=1, macroav: 0.939184
     AUC, microav: 0.940449
     AUC, macroav: 0.940449
     overall accuracy: 0.939222 (1159/1234), of which 128 exact matches

     Confusion Matrix:
                 nonobit   obit
               ----------------
     nonobit |     595     54
        obit |      21    564
         -*- |       0      0
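The reported scores can be recomputed from the confusion matrix. The sketch below reads rows as true classes and columns as predicted classes, which fits the row sums (649 non-obituaries, 585 obituaries). Two interpretations are assumptions, adopted because they reproduce the reported figures: "microav" is taken as the per-class F-score weighted by class frequency, and AUC for this binary accept/reject output is taken as the mean of the two per-class recalls.

```python
# Confusion matrix from the slide: matrix[true_class][predicted_class].
matrix = {
    "nonobit": {"nonobit": 595, "obit": 54},
    "obit": {"nonobit": 21, "obit": 564},
}
classes = list(matrix)
support = {c: sum(matrix[c].values()) for c in classes}  # articles per true class
total = sum(support.values())

# Overall accuracy: correctly classified articles over all articles.
accuracy = sum(matrix[c][c] for c in classes) / total

def f1(cls):
    """F-score (beta=1) for one class."""
    tp = matrix[cls][cls]
    fn = support[cls] - tp
    fp = sum(matrix[other][cls] for other in classes) - tp
    return 2 * tp / (2 * tp + fp + fn)

# Unweighted mean of per-class F-scores.
macro_f1 = sum(f1(c) for c in classes) / len(classes)
# Assumption: "microav" is the per-class F-score weighted by class frequency.
weighted_f1 = sum(support[c] * f1(c) for c in classes) / total
# Assumption: AUC for a hard binary decision is the mean per-class recall.
mean_recall = sum(matrix[c][c] / support[c] for c in classes) / len(classes)

print(f"accuracy={accuracy:.6f} macro={macro_f1:.6f} "
      f"weighted={weighted_f1:.6f} mean_recall={mean_recall:.6f}")
```

Read this way, the 54 errors in the nonobit row are false positives (non-obituaries accepted as obituaries) and the 21 in the obit row are false negatives.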

  6. Feature ranking
     • % FH lexical items
     • % integers
     • % person names
     • % dates
     • Average sentence length
     • Type/token ratio
     • % locations
     • % organizations

  7. Errors
     False positives
     • Articles about people perishing in concentration camps
     • Crime stories (murders, serial killers, murder trial, terrorist acts)
     • Accident stories
     False negatives
     • Lists of creative works
        "Credits from George Abbott's stage career, compiled by his office and from theater reference books: The Misleading Lady, 1913, actor. Yeoman of the Guard, 1915, actor. The Queens Enemies, 1916, actor. Lightnin', 1918, rewrote scenes. …"
     • Tagging errors
        "EDITORS: Two versions of Yugoslavia-Obit-Djilas moved on circuits. Please disregard the second, shorter, unbylined version. The AP"

  8. Caveats
     • Obituaries, not FH data per se
     • Newswire, not books
     • One source
     • Will it scale?
     • Can it port to FSL?
     • Didn't do any ML tuning
     • Binary acceptor; continuous values possible?
     • Effect of OCR errors?
