
Using Web Text Sources for Conversational Speech Language Modeling


Presentation Transcript


  1. Using Web Text Sources for Conversational Speech Language Modeling Ivan Bulyko, Mari Ostendorf & Andreas Stolcke (University of Washington / SRI International)

  2. Problem: LMs for conversational speech • Language models need a lot of training data that matches the task both in terms of style and topic • Conversational speech transcripts are expensive to collect • Easily obtained news & web text isn’t conversational

  3. Status Update -- Summary/Outline • Where we were in May (reminder+) • Basic approach: web data + text normalization + class-dependent mixtures • Perplexity & WER reductions on both CTS and meeting data • What we’ve done since then • Further exploring class-dependent mixtures • First steps at application to Mandarin • Adding disfluencies to web data

  4. Review of Approach • Collect text data from the web (filtering for topic and style) • Clean up and transform to spoken form (text normalization + new work on disfluencies) • Use class-dependent interpolation for handling source mismatch

  5. Obtaining Data • Use Google to search for… • Exact matches of conversational n-grams: “I never thought I would”, “I would think so”, “but I don’t know” • Topic-related data (preferably conversational): “wireless mikes like”, “kilohertz sampling rate”, “I know that recognizer” • Optionally filter based on the likelihood of the data under a Switchboard LM (after clean up)
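To make the querying step concrete, here is a minimal Python sketch of how exact-match queries could be assembled from in-domain transcripts. The function name, the n-gram order, and the cutoff are illustrative assumptions rather than details from the slides, and submission to the search engine is left abstract.

```python
# Hypothetical sketch: turn frequent in-domain n-grams into
# exact-phrase web queries. Names and thresholds are assumptions.
from collections import Counter

def conversational_queries(transcripts, n=4, top_k=1000):
    """Return the top_k most frequent word n-grams from in-domain
    transcripts, quoted for exact-match web search."""
    counts = Counter()
    for sentence in transcripts:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return ['"{}"'.format(gram) for gram, _ in counts.most_common(top_k)]

# Queries like "i never thought i would" or "but i don't know" would
# then be submitted to a search engine and the returned pages
# downloaded for cleanup and filtering.
```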

  6. Examples • Conversational We were friends but we don’t actually have a relationship. • Topic-related (for ICSI meetings) For our experiments we used the Bellman-Ford algorithm... • Very conversational Well I actually I I really haven’t seen her for years … from transcripts of Friends

  7. Cleaning up data • Strip HTML tags and headers/footers • Ignore documents containing 8-bit characters or where OOV rate is >50% • Automatic sentence boundary detection • Text normalization (written → spoken): 123 St. Mary’s St. → one twenty three Saint Mary’s Street
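A minimal sketch of the document-level filters just described, assuming a simple regex tokenization; the 8-bit check and the 50% OOV cutoff come from the slide, while the crude HTML stripping stands in for the real cleanup pipeline.

```python
# Sketch of the cleanup filters above. Only the 8-bit-character check
# and the 50% OOV threshold are from the slides; the rest is simplified.
import re

def keep_document(text, vocabulary):
    """Return cleaned text, or None if the document should be discarded."""
    if any(ord(ch) > 127 for ch in text):       # ignore 8-bit characters
        return None
    text = re.sub(r"<[^>]+>", " ", text)        # crude HTML tag stripping
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return None
    oov_rate = sum(w not in vocabulary for w in words) / len(words)
    return None if oov_rate > 0.5 else " ".join(words)
```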

  8. Combining Sources • Use class-dependent mixture weights, conditioned on the class of the previous word: c(w_{i-1}) = part-of-speech classes (35) + 100 most frequent words from Switchboard • On held-out data from target task: • Estimate mixture weights • Prune LM (remove n-grams with probabilities below a threshold)
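The interpolation itself can be pictured with a short sketch of class-dependent linear mixing. Each source LM is assumed to expose a prob(word, history) method and the per-class weights are assumed to have been estimated on held-out data; this interface is an assumption for illustration, not the SRI toolkit API.

```python
# Minimal sketch of class-dependent interpolation:
#   p(w | h) = sum_s lambda[c(w_{i-1}), s] * p_s(w | h)
# The lm.prob() interface and weight tables are illustrative assumptions.

def word_class(prev_word, top_words, pos_tag):
    """Map the previous word to its own class if it is one of the ~100
    most frequent Switchboard words, otherwise to its POS class."""
    return prev_word if prev_word in top_words else pos_tag.get(prev_word, "OTHER")

def mixture_prob(word, history, sources, weights, top_words, pos_tag):
    """sources: {name: lm}; weights: {class: {name: lambda}}."""
    c = word_class(history[-1], top_words, pos_tag)
    return sum(weights[c][name] * lm.prob(word, history)
               for name, lm in sources.items())
```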

  9. Jan-May Experiments • Task domains & test data • CTS (HUB5) eval2001 (swbd1+swbd2+cell) • Meeting recorder test set • LM training data sources • LDC conversational speech sources (3M words) • LDC broadcast news text (150M words) • General meeting transcripts (200K words) • Web text: general conversational (191M words), meeting topics (28M), Fisher conv (102M) • Both tasks use SRI HUB5 recognizer in rescoring mode, intermediate stage

  10. Class-based Mixture Weights on CTS [Chart: web-data mixture weights by word class: no class vs. noun vs. backchannel] • Weights for web data are higher for content words, lower for conversational speech phenomena • Higher-order n-grams have higher weight on web data

  11. Main Results: Meetings • Lots of web data is better than a little target data • Class-dependent mixture increases benefit in both cases • Pruning expts show that the benefit of class-dependent weights is not simply due to increased # of params

  12. Old CTS Results

  13. CTS Experiments • Initial expts (Jan report, on Eval01, old AM) • Web data helps (38.9 -> 37.5) • Class mixture helps (37.5 -> 37.2) • More development (May report, new AM) • Eval01: 30.4 -> 29.9 (all sources help a little) • Eval03: 33.8 -> 33.0 (no gain from Fisher web data) • Class mix gives small gain on Eval01 but not Eval03 Note: these results do not use interpolation with class (or other) n-grams.

  14. Recent Work: Learning from Eval03… • Text normalization fixes from IBM help them, but not us (in last data release from UW) • WER gains from class-dependent mixtures have disappeared, maybe because … • it’s mainly important when there is little in-domain data (e.g. meetings) • recent expts are with an improved acoustic model (though not the latest & greatest) but not because … • limited training for class-dependent mixture weights • But, web data is useful for almost-parsing LM (see Stolcke talk)

  15. Why do we think weight training is OK? • Increasing to 200 top words for classes doesn’t help, increasing much further hurts • No improvement from constraining class weights in how much they can deviate from class-independent weights, based on • Pre-defined priors, or • Number of observations in heldout data • No gain from order-independent vs. order-dependent mixture weights

  16. Mandarin LM – Preliminary Results • Web text normalization • Use punctuation for sentence segmentation • Word segmentation with ICSI word tokenization tools • Convert digits into spoken form (more to come…) • Classes = top 100 words + 30 categories for other words, either: • POS from LDC lexicon, OR • Automatically learned w/ SRI LM tools • Both class definitions give the same performance
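As one concrete piece of that normalization, a toy sketch of digit conversion is shown below. It reads digit strings character by character, which is a deliberate simplification: it ignores place-value readings (十/百/千), dates, and other patterns the real tools would handle.

```python
# Toy sketch: rewrite ASCII digit strings as Mandarin characters, read
# digit by digit. A simplification of the real normalization step.
DIGITS = dict(zip("0123456789", "零一二三四五六七八九"))

def spell_digits(token):
    """'2003' -> '二零零三' (digit-by-digit reading)."""
    return "".join(DIGITS.get(ch, ch) for ch in token)
```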

  17. Inserting Disfluencies • Use SWBD class n-grams (POS + top 100 words) as a generative model to insert: • um, uh and fragments • Repetitions of I, and, the • Sentence-initial and, but, so, well • Randomly generate according to a linear combination of standard and reverse n-grams: P(DF before w_i) = λ · P(DF | w_{i-1}, w_{i-2}, w_{i-3}) + (1 − λ) · P(DF | w_i, w_{i+1}, w_{i+2})
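The sampling rule above can be sketched as follows, assuming the forward and reverse class n-gram models (trained on Switchboard) are given as functions that handle short contexts; the single df_token stands in for the full inventory of insertions (fillers, repetitions, sentence-initial words) listed on the slide.

```python
# Sketch of the insertion rule: interpolate a forward model over the
# left context with a reverse model over the right context, then sample.
import random

def insert_disfluencies(words, p_fwd, p_rev, lam=0.5, df_token="uh"):
    """p_fwd/p_rev: assumed P(DF | context) functions; lam is the
    interpolation weight (lambda on the slide)."""
    out = []
    for i, w in enumerate(words):
        left = tuple(words[max(0, i - 3):i])    # w_{i-3} .. w_{i-1}
        right = tuple(words[i:i + 3])           # w_i .. w_{i+2}
        p = lam * p_fwd(left) + (1 - lam) * p_rev(right)
        if random.random() < p:                 # insert a DF before w_i
            out.append(df_token)
        out.append(w)
    return out
```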

  18. Examples • Well I I don’t know how uh she puts up with this speech impairment of mine • And I I think that’s really important instead of always doing um the hard work • Well monitoring for acid rain where uh the the primary components are sulphates and nitrates was conducted in twenty nine parks

  19. Inserting Disfluencies -- Results [Table: class-independent mixture weights, before vs. after disfluency insertion] • Weights of web data increase with added disfluencies • Small PP reduction, but no WER reduction… yet.

  20. Summary Observations • Findings that generalize across tasks (so far): • Web data is useful, but is better leveraged with Google “filtering” (+ text normalization) • Additional perplexity-based filtering is not useful • No gain, but no loss with automatic classes IF top 100 words are included • Results that vary with task: • Class-dependent mixture weights are mainly useful when there is less in-domain data • The jury is still out on the usefulness of disfluency insertion.

  21. Other Observations (in response to Rich) • Pruning LMs doesn’t hurt that much (but maybe our LMs didn’t get big enough for pruning to matter) • High-order n-gram hit rate is not as good a predictor of WER as perplexity (bigram hit rate is not bad), based on correlation with WER.

  22. Questions • Will web data still be useful for English CTS once we have the new data? (Note: web data will still be 10x in-domain data or more.) • Should we collect (and give out) more English CTS-oriented web data in the next couple months? • If so, should we switch focus to more topic-driven collections? • Research challenge: how to model style differences in a principled way?
