Understanding the more data effect: A closer look at learning curves

Presentation Transcript


  1. Understanding the more data effect: A closer look at learning curves. Antal van den Bosch, Tilburg University, http://ilk.uvt.nl

  2. Overview • The More Data effect • Case study 1: learning curves and feature representations • Case study 2: continuing learning curves with more data

  3. The More Data effect • There’s no data like more data (speech recognition motto) • Banko and Brill (2001): confusibles • Differences between algorithms flip or disappear • Differences between representations disappear • Growth of curve seems log-linear (constant improvement with exponentially more data) • Explanation sought in “Zipf’s tail”

  4. Banko and Brill (2001) • Demonstrated on {to, two, too} using 1M to 1G training examples: • Initial range between 3 classifiers at • 1M: 83-85% • 1G: 96-97% • Extremely simple memory-based classifier (one word left, one word right): • 86% at 1M, 93% at 1G • apparently constant improvement per order of magnitude of additional data (log-linear growth)
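As a rough illustration of how little machinery such a context-only classifier needs, the Python sketch below memorizes one word of left and one word of right context for every confusible occurrence and backs off to the majority class on unseen contexts. It is a toy reconstruction under those assumptions, not Banko and Brill's actual system.

    # Toy memory-based confusible classifier: features are one word to the
    # left and one to the right, as on this slide. Illustrative only.
    from collections import Counter, defaultdict

    CONFUSIBLES = {"to", "two", "too"}

    def train(sentences):
        """Store context -> class counts for every confusible occurrence."""
        memory, majority = defaultdict(Counter), Counter()
        for sent in sentences:
            for i, word in enumerate(sent):
                if word.lower() in CONFUSIBLES:
                    left = sent[i - 1].lower() if i > 0 else "<s>"
                    right = sent[i + 1].lower() if i + 1 < len(sent) else "</s>"
                    memory[(left, right)][word.lower()] += 1
                    majority[word.lower()] += 1
        return memory, majority

    def classify(memory, majority, left, right):
        """Exact-match lookup with back-off to the overall majority class."""
        context = (left.lower(), right.lower())
        if context in memory:
            return memory[context].most_common(1)[0][0]
        return majority.most_common(1)[0][0]

    # Example: after training on tokenized sentences,
    # classify(memory, majority, "want", "go") should return "to".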

  5. Zipf • The frequency of the nth most frequent word is inversely proportional to n • This implies a roughly log-linear relation between token frequency and the number of types occurring at that frequency
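The rank-frequency relation can be checked with a few lines of Python; the sketch below (an illustration, not part of the original experiments) counts token frequencies, ranks them, and log-transforms both axes, which under Zipf's law yields an approximately straight line. This is what the WSJ plots summarized on the next slide show for growing samples.

    # Rank-frequency sketch: under Zipf's law, log(rank) vs. log(frequency)
    # is roughly linear. Illustrative only.
    import math
    from collections import Counter

    def rank_frequency(tokens):
        """Return (rank, frequency) pairs, most frequent type first."""
        counts = sorted(Counter(tokens).values(), reverse=True)
        return list(enumerate(counts, start=1))

    def log_log(points):
        """Log-transform both axes for plotting or fitting a line."""
        return [(math.log(rank), math.log(freq)) for rank, freq in points]

    # Example: rank_frequency("the the the cat cat sat".split())
    # -> [(1, 3), (2, 2), (3, 1)]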

  6.-15. WSJ: first 1,000 words; first 2,000 words; first 5,000 words; first 10,000 words; first 20,000 words; first 50,000 words; first 100,000 words; first 200,000 words; first 500,000 words; all 1,126,389 words

  16. Chasing Zipf’s tail • More data brings two benefits: • More observations of words already seen. • More new words become known (the tail) • This effect persists, no matter how often the data is doubled.
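A small sketch of the doubling argument, under the assumption of a plain token list: each time the sample size doubles, the vocabulary keeps growing, so the tail keeps delivering new types.

    # Sketch: vocabulary growth at each doubling of the sample. Illustrative
    # only; with Zipfian text the number of new types per doubling stays large.
    def type_growth(tokens, start=1000):
        """Return (tokens read, distinct types seen) at each doubling."""
        report, size = [], start
        while size <= len(tokens):
            report.append((size, len(set(tokens[:size]))))
            size *= 2
        return report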

  17. Case study 1 • Learning curves vs feature representations • Van den Bosch & Buchholz, ACL 2002 • Perspective: how important are PoS features in shallow parsing? • Idea: • PoS features are robust • Robustness effect may decrease when more data is available

  18. Words, PoS, shallow parsing • “Assign limited syntactic structure to text” • Input: words and/or relevant clues from computed PoS • Most systems assume PoS • HPSG (Pollard & Sag 87) • Abney (91) • Collins (96), Ratnaparkhi (97): interleaved • Charniak (00): back-off

  19. Could words replace PoS? Simple intuition: • PoS disambiguate explicitly: suspect-N vs suspect-V • words disambiguate implicitly: … the suspect … vs … we suspect …

  20. Could words replace PoS? Words could provide PoS info implicitly • Pro: • No intermediary computation • No spurious PoS errors • Contra: • PoS offers back-off; PoS data is not sparse • PoS does resolve relevant ambiguity • What happens when there is more data?

  21. Case study: Overall setup • “chunking-function tagging”, English • Select input: • Gold-standard or predicted PoS • Words only • Both • Learn with increasing amounts of training data • Which learning curve grows faster? • Do they meet or cross? Where?

  22. Data (1): Get tree from PTB ((S (ADVP-TMP Once) (NP-SBJ-1 he) (VP was (VP held (NP *-1) (PP-TMP for (NP three months)) (PP without (S-NOM (NP-SBJ *-1) (VP being (VP charged) ))))) .))

  23. Data (2): Shallow parse [ADVP Once ADVP-TMP] [NP he NP-SBJ] [VP was held VP/S] [PP for PP-TMP] [NP three months NP] [PP without PP] [VP being charged VP/S-NOM]

  24. Data (3): Make instances • … _ Once he … I-ADVP - ADVP-TMP • … Once he was … I-NP - NP-SBJ • … he was held … I-VP - NOFUNC • … was held for … I-VP - VP/S • … held for three … I-PP - PP-TMP • … for three months … I-NP - NOFUNC • … three months without … I-NP - NP • … months without being … I-PP - PP • … without being charged … I-VP - NOFUNC • … being charged . … I-VP - VP/S-NOM • … charged . _ … O - NOFUNC
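The windowing step can be sketched as follows, assuming the sentence is available as (word, chunk tag, function tag) triples; the padding symbol "_" mirrors the slide. This is an illustrative reconstruction, not the original preprocessing code.

    # Sketch of instance creation: a one-word window left and right of the
    # focus word, labeled with its IOB chunk tag and function tag.
    def make_instances(sentence):
        """sentence: list of (word, chunk_tag, function_tag) triples."""
        padded = ["_"] + [w for w, _, _ in sentence] + ["_"]
        instances = []
        for i, (word, chunk, func) in enumerate(sentence):
            left, right = padded[i], padded[i + 2]
            instances.append(((left, word, right), chunk, func))
        return instances

    # make_instances([("Once", "I-ADVP", "ADVP-TMP"), ("he", "I-NP", "NP-SBJ")])
    # -> [(("_", "Once", "he"), "I-ADVP", "ADVP-TMP"),
    #     (("Once", "he", "_"), "I-NP", "NP-SBJ")]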

  25. Case study: Details • experiments based on Penn Treebank III (WSJ, Brown, ATIS) • 74K sentences, 1,637,268 tokens (instances) • 62,472 unique words, 874 chunk-tag codes • 10-fold cross-validation experiments: • Split the data 10 times into 90% train and 10% test • Grow every training set stepwise (sketched below) • precision-recall on correctly chunked and typed chunks with correct function tags • memory-based learning (TiMBL) • MVDM, k=7, gain ratio feature weights, inverse distance class voting • TRIBL level 2 (approximate k-NN)
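The learning-curve protocol itself is simple to express; the sketch below assumes the instances have been shuffled in advance and uses a placeholder train_and_score function instead of the actual TiMBL runs and chunk-level precision/recall scoring. The doubling step sizes are also an assumption for illustration.

    # Schematic learning-curve protocol: 10-fold cross-validation, growing
    # each 90% training portion stepwise. train_and_score is a placeholder.
    def learning_curve(instances, train_and_score, n_folds=10):
        fold_size = len(instances) // n_folds
        sizes = [1000 * 2 ** i for i in range(20)]   # assumed growth steps
        curve = {size: [] for size in sizes}
        for fold in range(n_folds):
            test = instances[fold * fold_size:(fold + 1) * fold_size]
            train = instances[:fold * fold_size] + instances[(fold + 1) * fold_size:]
            for size in sizes:
                if size > len(train):
                    break
                curve[size].append(train_and_score(train[:size], test))
        return {size: sum(s) / len(s) for size, s in curve.items() if s}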

  26. Case study: Extension (1) • Word attenuation (after Eisner 96): • Distrust low-frequency information (frequency < 10) • But keep whatever is informative (back-off) • Convert to MORPH-[CAP|NUM|SHORT|ss] • Original: A Daikin executive in charge of exports when the high-purity halogenated hydrocarbon was sold to the Soviets in 1986 received a suspended 10-month jail sentence . • Attenuated: A MORPH-CAP executive in charge of exports when the MORPH-ty MORPH-ed MORPH-on was sold to the Soviets in 1986 received a suspended MORPH-th jail sentence .
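A sketch of the attenuation step, assuming a frequency threshold of 10 and the four MORPH classes named on the slide; the order in which the classes are tested here is an assumption, not necessarily the original one.

    # Word attenuation sketch: replace low-frequency tokens by coarse
    # MORPH classes while keeping frequent words intact.
    from collections import Counter

    def attenuate(tokens, threshold=10):
        counts = Counter(tokens)
        out = []
        for word in tokens:
            if counts[word] >= threshold:
                out.append(word)                      # frequent: keep as is
            elif word[0].isupper():
                out.append("MORPH-CAP")               # capitalized token
            elif any(ch.isdigit() for ch in word):
                out.append("MORPH-NUM")               # contains a digit
            elif len(word) <= 3:
                out.append("MORPH-SHORT")             # very short token
            else:
                out.append("MORPH-" + word[-2:])      # keep final two letters
        return out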

  27. Results: learning curves

  28. Case study: Extension (2) • In contrast with gold-standard PoS, use automatically generated PoS • Memory-based tagger (MBT, Daelemans et al., 1996) • Separate optimized modules for known and unknown words • Generate tagger on training set • Apply generated tagger to test set • With all training data: 96.7% correct tags

  29. Case study: Results (2)

  30. Case study: Observations • Word curve grows roughly log-linearly • PoS curve flattens more • The merit of words vs. PoS for the current task depends on the amount of training material • Extensions: • Attenuation improves performance • Adding (real) PoS improves performance • Both effects become smaller with more training material

  31. Case study 2 • Continuing learning curves with more data • Work in progress • Idea: • Add data from • the same annotated source, • a different annotated source, • unlabeled data • and measure the learning curve on test data from • the same annotated source, • a different annotated source

  32. Data • PTB II (red = test) • Wall Street Journal financial news articles • CoNLL shared task training set (211,737 words) • CoNLL shared task test set (47,377 words) • Rest of WSJ (914,662 words) • Brown (459,148 words): written English mix • ATIS (4,353 words): spoken English, questions, first-person sentences • Reuters Corpus (3.7 GB XML): newswire • Tagged by MBL trained on the CoNLL shared task training set (with paramsearch)

  33. Tasks • CoNLL 2000 shared task (chunking) • (Tjong Kim Sang and Buchholz, 2000) • Kudo & Matsumoto, pairwise classif. SVM, 93.5 F-score • Later improvements over 94 • Function tagging on same data • (Van den Bosch and Buchholz, 2002) • MBL, 78 F-score • 3-1-3 word windows for both (no PoS) • Paramsearch and attenuation on both

  34. Use unlabeled data • Why not classify unlabeled data and add that? Well, • “One classifier does not work” • Negative effects outweigh positive (from % correct) • Adds more of the same • Imports errors • What does? (M$ question) • Co-training (2 interleaved classifiers) • Active learning (n classifiers plus 1 human) • Yarowsky bootstrapping (1 iterated classifier) • Cf. Abney (2002)

  35. Yarowsky bootstrapping (1995) • Given labeled data and unlabeled data • Train rule inducer on labeled data, • Loop: • Relabel labeled data with rules • Remove examples below labeling confidence threshold • Label unlabeled data with rules • Add labeled examples above confidence threshold to labeled set • Train rule inducer again. • Demonstrated to work for WSD.
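The loop above can be written down schematically as follows; induce, label, and the confidence threshold are placeholders standing in for the rule inducer and its confidence scores, so this is a sketch of the procedure rather than Yarowsky's implementation.

    # Schematic Yarowsky-style bootstrapping loop, following the steps on
    # this slide. induce and label are placeholder functions.
    def bootstrap(labeled, unlabeled, induce, label, threshold, iterations=10):
        for _ in range(iterations):
            rules = induce(labeled)
            # Relabel the labeled set; drop examples below the threshold.
            labeled = [(x, y) for x, y, conf in label(rules, [x for x, _ in labeled])
                       if conf >= threshold]
            # Label the unlabeled pool; move confident examples over.
            newly = [(x, y) for x, y, conf in label(rules, unlabeled)
                     if conf >= threshold]
            labeled.extend(newly)
            added = {x for x, _ in newly}
            unlabeled = [x for x in unlabeled if x not in added]
        return induce(labeled)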

  36. Example boosting • Given labeled data and unlabeled data • Train MBL on labeled data, and use it to label unlabeled data • Reverse the roles: • Train MBL on the automatically labeled data • Test on the original labeled data • For each training instance, measure class prediction strength (Aha et al., Salzberg) • CPS = # times correct as nearest neighbour / # times used as nearest neighbour • Select examples with top CPS, add them to labeled data • Here: 4M examples of Reuters labeled
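Class prediction strength as defined on this slide can be computed as below; nearest_neighbour is a placeholder for the memory-based classifier's 1-NN lookup, so the sketch only illustrates the bookkeeping.

    # CPS sketch: for each training instance, how often it is the nearest
    # neighbour of another instance, and how often it then predicts the
    # correct class. nearest_neighbour is a placeholder.
    from collections import defaultdict

    def class_prediction_strength(instances, nearest_neighbour):
        """instances: list of (features, label); returns {index: CPS}."""
        used, correct = defaultdict(int), defaultdict(int)
        for i, (feats, label) in enumerate(instances):
            j = nearest_neighbour(i, instances)   # index of i's nearest neighbour
            used[j] += 1
            if instances[j][1] == label:
                correct[j] += 1
        return {j: correct[j] / used[j] for j in used}

    # Examples with the highest CPS are the ones added in example boosting.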

  37. CPS: examples • High CPS: • zone with far less options , " I-NP-NOFUNC • would allow it to bypass its requirements I-VP-NOFUNC • surprise after he faded from the picture I-VP-VP/S • remain under pressure in the next quarter I-PP-PP-TMP • Low CPS: • ago and are not renewable . _ O-NOFUNC • six world championship points and is just I-NP-NP • is competitive , influenced mainly by the I-VP-VP/S • of key events in the rebellion : I-PP-PP-LOC

  38. Results: chunking

  39. Results + Brown

  40. Results + random classified

  41. Results + example boosting

  42. Observations, chunking • Testing on WSJ: • Curve flattens • Example boosting (50k and 100k) in line with others • Adding random data does not lower curve • Testing on ATIS • Curve does not appear to flatten • Adding Brown or example boosting works just as well • Adding random data shows negative effect • Note lower scores on ATIS

  43. Results: function-tagging

  44. Results + Brown

  45. Results + random classified

  46. Results + example boosting

  47. Observations, function tagging • Testing on WSJ: • Curve still going up • Brown yields flat curve • Example boosting (100k) unclear • Testing on ATIS: • Curve steeper at later stage • Adding Brown or example boosting works just as well

  48. Summary of results • Adding data from same source is generally good (testing on same data or other data) • Adding data from other source may only be effective when testing on other data • Learning curves testing on other data may go up later • Vocabulary effects

  49. Summary of results (2) • Adding random data labeled with 1 classifier shows the predicted negative effect • Negative effect presumably outweighs positive • Except when the curve is already flat: negative effects are muted because there is sufficient ‘positive’ data? • Example boosting is promising, but • Threshold issue: why 100k of 4M (2.5%)? Higher percentages make the curve approach the random curve, of course • CPS is a weak estimate; smoothing (e.g. Laplace correction) could help • More work and comparisons needed
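One way the Laplace correction mentioned above could be applied is to smooth the CPS ratio itself; the add-one form below is an assumption, shown only to make the idea concrete.

    # Laplace-smoothed CPS (assumed add-one form): instances seen only once
    # or twice as a nearest neighbour no longer get extreme scores.
    def cps_laplace(correct, used):
        return (correct + 1.0) / (used + 2.0)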

  50. Discussion • ‘More data’ tends to hold when source of data added is not changed, except that • Flattened curves appear to remain flat with more data • Vocabulary effects when testing on data from other sources • May produce delayed upward learning curves • Positive effect from adding data from other sources to training • Learning curves are a useful instrument • Comparisons between algorithms and between parameter settings (here: algorithm fixed = bias) • Comparisons between representations • Predictions for annotation projects
