
A Highly Legible CAPTCHA that Resists Segmentation Attacks

Henry S. Baird, Michael A. Moll, Sui-Yu Wang

Presentation Transcript


  1. A Highly Legible CAPTCHA that Resists Segmentation Attacks. Henry S. Baird, Michael A. Moll, Sui-Yu Wang

  2. Some Typical CAPTCHAs: AltaVista, eBay/PayPal, Yahoo!, PARC’s PessimalPrint

  3. All These Are Vulnerable to Segment-then-Recognize Attack
     Effective strategy of attack:
     • Segment image into characters
     • Apply aggressive OCR to isolated chars
     • If it’s known (or guessed) that the word is ‘spellable’ (e.g. legal English), use the lexicon to constrain interpretations
     Patrice Simard (MS Research) et al. report that this breaks many widely used CAPTCHAs
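
The attack strategy on this slide can be illustrated with a minimal sketch. It assumes a binary word image given as rows of text ('#' for ink) and segments it wherever a column contains no ink, then filters combinations of per-character OCR guesses against a lexicon. The function names, the image encoding, and the candidate lists are all hypothetical; the slides do not describe how any real attack is implemented.

```python
from itertools import product

def segment_columns(image):
    """Split a binary word image into character spans at ink-free columns.

    image: list of equal-length strings, '#' = ink (an assumed encoding).
    Returns a list of (start, end) column ranges, end exclusive.
    """
    width = len(image[0])
    has_ink = [any(row[x] == "#" for row in image) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a character begins here
        elif not ink and start is not None:
            spans.append((start, x))       # a blank column ends the character
            start = None
    if start is not None:
        spans.append((start, width))
    return spans

def constrain_with_lexicon(candidates, lexicon):
    """Keep only the combinations of per-character guesses found in the lexicon.

    candidates: one list of guessed letters per segmented character.
    """
    return [w for w in ("".join(c) for c in product(*candidates)) if w in lexicon]
```

For example, `segment_columns(["##..#.###", "#...#...#"])` yields three spans, and `constrain_with_lexicon([["c", "e"], ["a", "o"], ["t"]], {"cat", "cot", "eat"})` keeps only the three spellable readings. The first step depends on blank columns between characters, which is the dependency ScatterType tries to remove.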

  4. We try to generate word-images that will be hard to segment into characters
     • Slice characters up: vertical cuts, then horizontal cuts
     • Set size of cuts to constant within a word
     • Choose positions of cuts randomly
     • Force pieces to drift apart: ‘scatter’ horiz. & vert.
     • Change intercharacter space

  5. Character fragments can interpenetrate. Not only is it hard to segment the word into characters, … it can be hard to recombine characters’ fragments into characters
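
A rough sketch of the cut-and-scatter idea follows. It is a simplification under stated assumptions, not the authors' generator: the glyph is a small text bitmap ('#' for ink), cuts have zero width, and each piece is displaced by a uniform random offset rather than by the scatter mean and variance parameters listed later in the talk. The function name and all parameters are hypothetical.

```python
import random

def cut_and_scatter(glyph, n_vcuts, n_hcuts, scatter_h, scatter_v, rng):
    """Cut a glyph bitmap into rectangular pieces and displace each piece.

    glyph: list of equal-length strings, '#' = ink (assumed encoding).
    n_vcuts, n_hcuts: number of vertical and horizontal cuts (hypothetical knobs).
    scatter_h, scatter_v: maximum absolute pixel offset per piece in x and y.
    Returns the set of (x, y) ink pixels after scattering.
    """
    height, width = len(glyph), len(glyph[0])
    # Choose cut positions randomly, strictly inside the glyph.
    xs = [0] + sorted(rng.sample(range(1, width), n_vcuts)) + [width]
    ys = [0] + sorted(rng.sample(range(1, height), n_hcuts)) + [height]
    pixels = set()
    for y0, y1 in zip(ys, ys[1:]):
        for x0, x1 in zip(xs, xs[1:]):
            # Every piece gets its own displacement (uniform here, an assumption).
            dx = rng.randint(-scatter_h, scatter_h)
            dy = rng.randint(-scatter_v, scatter_v)
            for y in range(y0, y1):
                for x in range(x0, x1):
                    if glyph[y][x] == "#":
                        pixels.add((x + dx, y + dy))
    return pixels
```

With both scatter arguments set to 0 the function returns the original ink pixels unchanged. With nonzero scatter, pieces may land on top of one another, which loosely mirrors fragments interpenetrating.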

  6. Nonsense Words
     • We use nonsense (but English-like) words (as in BaffleText):
       • generated pseudorandomly by a stochastic variable-length character n-gram model
       • trained on the Brown corpus
       … this protects against lexicon-driven attacks
     • Why not use random strings?
       • We want to help human readers feel confident they have made a plausible choice, so they’ll put up with severe image degradations (cf. research in psychophysics of reading)
     M. Chew & H. S. Baird, “BaffleText: a Human Interactive Proof,” Proc., 10th SPIE/IS&T Document Recognition and Retrieval Conf. (DRR2003), Santa Clara, CA, January 23-24, 2003.
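
As an illustration of generating English-like nonsense words, here is a minimal character n-gram sketch. Note the differences from the slide: it uses a fixed-order model rather than a variable-length one, and it would be trained on whatever word list the caller supplies rather than on the Brown corpus. The function names and the `^`/`$` boundary markers are assumptions made for this example.

```python
import random
from collections import defaultdict

def train_char_ngrams(words, order=2):
    """Record, for each context of `order` characters, the characters that follow it.

    '^' pads the start of a word and '$' marks its end (assumed markers).
    `order` must be at least 1.
    """
    model = defaultdict(list)
    for word in words:
        padded = "^" * order + word.lower() + "$"
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model

def generate_word(model, order, rng, max_len=10):
    """Sample one pseudorandom word by walking the n-gram model."""
    context, out = "^" * order, []
    while len(out) < max_len:
        nxt = rng.choice(model[context])   # sample in proportion to training counts
        if nxt == "$":                     # end-of-word marker reached
            break
        out.append(nxt)
        context = (context + nxt)[-order:]
    return "".join(out)
```

A call such as `generate_word(train_char_ngrams(word_list, 2), 2, random.Random(1))` returns a string built only from letter transitions seen in `word_list`, so the result tends to look pronounceable without necessarily being a dictionary word.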

  7. How Well Can People Read These? We carried out a human legibility trial with the help of ~60 volunteers: students, faculty, & staff at Lehigh Univ. plus colleagues at Avaya Labs Research

  8. Subjects were told whether they got it right or wrong, but only after they rated its ‘difficulty’

  9. Subjective difficulty ratings were correlated with objective difficulty
     • People often know when they’ve done well
     • This can be used to ensure that challenges aren’t too hard (frustrating, angering)
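
One way to check a relationship like this is to tabulate accuracy per subjective rating and compute a correlation between rating and correctness. The sketch below assumes each trial is recorded as a (rating, correct) pair with ratings from 1 (Easy) to 5 (Impossible); the record format and function names are hypothetical, and the slides do not say which statistic the authors used.

```python
from collections import defaultdict
from math import sqrt

def accuracy_by_rating(trials):
    """Return {rating: fraction answered correctly} from (rating, correct) pairs."""
    counts = defaultdict(lambda: [0, 0])   # rating -> [number correct, total]
    for rating, correct in trials:
        counts[rating][0] += int(correct)
        counts[rating][1] += 1
    return {r: c / n for r, (c, n) in sorted(counts.items())}

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Passing the ratings and the 0/1 correctness values to `pearson` gives a negative coefficient when harder-rated challenges are answered correctly less often, which is the pattern the slide describes.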

  10. The same data, graphically [Figure; legend: subjective difficulty 1 (Easy) to 5 (Impossible); Right vs. Wrong answers]

  11. People Rated These “Easy” (1/5) aferatic memmari heiwho nampaign

  12. Rated “Medium Hard” (3/5) overch / ovorch wouwould atlager / adager weland / wejund

  13. Rated “Impossible” (5/5) acchown / echaeva gualing / gealthas bothere / beadave caquired / engaberse

  14. Why is ScatterType legible?
     • Does it surprise you that this is legible…?
     • I speculate that we can read it because:
       • we exploit typeface consistency … the evidence is small details of local shape
       • this ability seems largely unconscious

  15. Ensuring that ScatterType is Legible
     We mapped the domain of legibility as a function of engineering choices:
     • typefaces
     • characters in the alphabet
     • cutting & scattering parameters:
       • cut fraction
       • expansion fraction
       • horizontal scatter mean
       • vertical scatter mean
       • h & v scatter variance
       • character separation

  16. Some typefaces remain legible while others degrade quickly

  17. Some Characters Quickly Become Confusable. Example: overch (‘o’, ‘e’, ‘c’ confusions)

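  18. Mean Horizontal Scatter vs Mean Vertical Scatter [Figure; legend: subjective difficulty 1 (Easy) to 5 (Impossible); Right vs. Wrong answers] Mirage: data analysis tool, Tin Kam Ho, Bell Labs.

  19. Cut Fraction Histogram [Figure; legend: subjective difficulty 1 (Easy) to 5 (Impossible); Right vs. Wrong answers]

  20. Character Separation Histogram [Figure; legend: subjective difficulty 1 (Easy) to 5 (Impossible); Right vs. Wrong answers]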

  21. Finding Parameter Ranges for High Legibility. d = Euclidean distance from the origin in the plot of Mean Horiz. Scatter vs. Mean Vertical Scatter
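
The distance d on this slide is the usual Euclidean norm of the pair (mean horizontal scatter, mean vertical scatter). The sketch below computes it and groups trial outcomes into bins of d, so that accuracy can be compared across distance ranges. The binning step, the trial record format, and the function names are assumptions for illustration; the slide only defines d.

```python
from collections import defaultdict
from math import hypot

def scatter_distance(mean_h, mean_v):
    """d = sqrt(mean_h**2 + mean_v**2), the distance from the origin."""
    return hypot(mean_h, mean_v)

def accuracy_by_distance(trials, bin_width):
    """Group (mean_h, mean_v, correct) trials into bins of d and report accuracy.

    Returns {bin lower edge: fraction answered correctly}.
    """
    counts = defaultdict(lambda: [0, 0])   # bin index -> [number correct, total]
    for mean_h, mean_v, correct in trials:
        b = int(scatter_distance(mean_h, mean_v) // bin_width)
        counts[b][0] += int(correct)
        counts[b][1] += 1
    return {b * bin_width: c / n for b, (c, n) in sorted(counts.items())}
```

A parameter range for high legibility could then be read off as the set of bins whose accuracy stays above a chosen threshold; which threshold is appropriate is left open here.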

  22. Guided by this Analysis, We Can Define Legibility Regimes
     • Trivial: large cut fraction and small expansion
     • Simple: character separation also decreases
     • Easy: in original trial, correct 81% of time
     • Medium Hard: larger scatter distances degrade legibility noticeably

  23. Other Examples - “Easy”
     • “wexped” - difficult to segment ‘e’, ‘x’ and ‘p’. Shows difficulty of achieving 100% legibility
     • “veral” - same parameters as above but different font. Not as difficult to segment

  24. Other Examples - “Too Hard”
     • “thern” - difficult to read, but easier than most with the same parameter values. Font makes a big difference.
     • “wezre” - satisfactorily illegible, though probably segmentable

  25. Future Work
     • We have exhausted the experimental data from the 1st trial
     • How can we automatically create images with given difficulty?
     • We have generated many images that seem difficult to segment automatically, but we don’t understand how to guarantee this
     • We need to understand the effects of typefaces on ScatterType legibility
     • We want to study character-confusion pairs more
     • Attacking ScatterType
       • Testing on best OCR systems
       • Invite attacks from other researchers
       • Is it credible if we attack it ourselves, and fail?

  26. Contacts: Henry S. Baird, baird@cse.lehigh.edu; Michael Moll, mam7@lehigh.edu
