
Using Web Text Sources for Conversational Speech Language Modeling


Presentation Transcript


  1. Using Web Text Sources for Conversational Speech Language Modeling Ivan Bulyko, Mari Ostendorf & Andreas Stolcke (University of Washington / SRI International)

  2. Problem: LMs for conversational speech • Language models need a lot of training data that matches the task both in terms of style and topic • Conversational speech transcripts are expensive to collect • Easily obtained news & web text isn’t conversational

  3. Status Update -- Summary/Outline • Where we were in May (reminder+) • Basic approach: web data + text normalization + class-dependent mixtures • Perplexity & WER reductions on both CTS and meeting data • What we’ve done since then • Further exploring class-dependent mixtures • First steps at application to Mandarin • Adding disfluencies to web data

  4. Review of Approach • Collect text data from the web (filtering for topic and style) • Clean up and transform to spoken form (text normalization + new work on disfluencies) • Use class-dependent interpolation for handling source mismatch

  5. Obtaining Data • Use Google to search for… • Exact matches of conversational n-grams: “I never thought I would”, “I would think so”, “but I don’t know” • Topic-related data (preferably conversational): “wireless mikes like”, “kilohertz sampling rate”, “I know that recognizer” • Optionally filter based on the likelihood of the data under a Switchboard LM (after clean up)
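To make the querying step concrete, here is a minimal Python sketch of how exact-match queries could be assembled from in-domain transcripts. The function name, the n-gram order, and the cutoff are illustrative assumptions rather than details from the slides, and submission to the search engine is left abstract.

```python
# Hypothetical sketch: turn frequent in-domain n-grams into
# exact-phrase web queries. Names and thresholds are assumptions.
from collections import Counter

def conversational_queries(transcripts, n=4, top_k=1000):
    """Return the top_k most frequent word n-grams from in-domain
    transcripts, quoted for exact-match web search."""
    counts = Counter()
    for sentence in transcripts:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return ['"{}"'.format(gram) for gram, _ in counts.most_common(top_k)]

# Queries like "i never thought i would" or "but i don't know" would
# then be submitted to a search engine and the returned pages
# downloaded for cleanup and filtering.
```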

  6. Examples • Conversational We were friends but we don’t actually have a relationship. • Topic-related (for ICSI meetings) For our experiments we used the Bellman-Ford algorithm... • Very conversational Well I actually I I really haven’t seen her for years … from transcripts of Friends

  7. Cleaning up data • Strip HTML tags and headers/footers • Ignore documents containing 8-bit characters or where OOV rate is >50% • Automatic sentence boundary detection • Text normalization (written → spoken): 123 St. Mary’s St. → one twenty three Saint Mary’s Street
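A minimal sketch of the document-level filters just described, assuming a simple regex tokenization; the 8-bit check and the 50% OOV cutoff come from the slide, while the crude HTML stripping stands in for the real cleanup pipeline.

```python
# Sketch of the cleanup filters above. Only the 8-bit-character check
# and the 50% OOV threshold are from the slides; the rest is simplified.
import re

def keep_document(text, vocabulary):
    """Return cleaned text, or None if the document should be discarded."""
    if any(ord(ch) > 127 for ch in text):       # ignore 8-bit characters
        return None
    text = re.sub(r"<[^>]+>", " ", text)        # crude HTML tag stripping
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return None
    oov_rate = sum(w not in vocabulary for w in words) / len(words)
    return None if oov_rate > 0.5 else " ".join(words)
```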

  8. Combining Sources • Use class-dependent mixture weights, conditioned on the class of the previous word: c(w_{i-1}) = part-of-speech classes (35) + 100 most frequent words from Switchboard • On held-out data from target task: • Estimate mixture weights • Prune LM (remove n-grams with probabilities below a threshold)
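The interpolation itself can be pictured with a short sketch of class-dependent linear mixing. Each source LM is assumed to expose a prob(word, history) method and the per-class weights are assumed to have been estimated on held-out data; this interface is an assumption for illustration, not the SRI toolkit API.

```python
# Minimal sketch of class-dependent interpolation:
#   p(w | h) = sum_s lambda[c(w_{i-1}), s] * p_s(w | h)
# The lm.prob() interface and weight tables are illustrative assumptions.

def word_class(prev_word, top_words, pos_tag):
    """Map the previous word to its own class if it is one of the ~100
    most frequent Switchboard words, otherwise to its POS class."""
    return prev_word if prev_word in top_words else pos_tag.get(prev_word, "OTHER")

def mixture_prob(word, history, sources, weights, top_words, pos_tag):
    """sources: {name: lm}; weights: {class: {name: lambda}}."""
    c = word_class(history[-1], top_words, pos_tag)
    return sum(weights[c][name] * lm.prob(word, history)
               for name, lm in sources.items())
```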

  9. Jan-May Experiments • Task domains & test data • CTS (HUB5) eval2001 (swbd1+swbd2+cell) • Meeting recorder test set • LM training data sources • LDC conversational speech sources (3M words) • LDC broadcast news text (150M words) • General meeting transcripts (200K words) • Web text: general conversational (191M words), meeting topics (28M), Fisher conv (102M) • Both tasks use SRI HUB5 recognizer in rescoring mode, intermediate stage

  10. Class-based Mixture Weights on CTS [Chart: web-data mixture weights by word class: no class vs. noun vs. backchannel] • Weights for web data are higher for content words, lower for conversational speech phenomena • Higher-order n-grams have higher weight on web data

  11. Main Results: Meetings • Lots of web data is better than a little target data • Class-dependent mixture increases benefit in both cases • Pruning expts show that the benefit of class-dependent weights is not simply due to increased # of params

  12. Old CTS Results

  13. CTS Experiments • Initial expts (Jan report, on Eval01, old AM) • Web data helps (38.9 -> 37.5) • Class mixture helps (37.5 -> 37.2) • More development (May report, new AM) • Eval01: 30.4 -> 29.9 (all sources help a little) • Eval03: 33.8 -> 33.0 (no gain from Fisher web data) • Class mix gives small gain on Eval01 but not Eval03 Note: these results do not use interpolation with class (or other) n-grams.

  14. Recent Work: Learning from Eval03… • Text normalization fixes from IBM help them, but not us (in last data release from UW) • WER gains from class-dependent mixtures have disappeared, maybe because … • it’s mainly important when there is little in-domain data (e.g. meetings) • recent expts are with an improved acoustic model (though not the latest & greatest) but not because … • limited training for class-dependent mixture weights • But, web data is useful for almost-parsing LM (see Stolcke talk)

  15. Why do we think weight training is OK? • Increasing to 200 top words for classes doesn’t help, increasing much further hurts • No improvement from constraining class weights in how much they can deviate from class-independent weights, based on • Pre-defined priors, or • Number of observations in heldout data • No gain from order-independent vs. order-dependent mixture weights

  16. Mandarin LM – Preliminary Results • Web text normalization • Use punctuation for sentence segmentation • Word segmentation with ICSI word tokenization tools • Convert digits into spoken form (more to come…) • Classes = top 100 words + 30 categories for other words, either: • POS from LDC lexicon, OR • Automatically learned w/ SRI LM tools • Both class definitions give the same performance
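As one concrete piece of that normalization, a toy sketch of digit conversion is shown below. It reads digit strings character by character, which is a deliberate simplification: it ignores place-value readings (十/百/千), dates, and other patterns the real tools would handle.

```python
# Toy sketch: rewrite ASCII digit strings as Mandarin characters, read
# digit by digit. A simplification of the real normalization step.
DIGITS = dict(zip("0123456789", "零一二三四五六七八九"))

def spell_digits(token):
    """'2003' -> '二零零三' (digit-by-digit reading)."""
    return "".join(DIGITS.get(ch, ch) for ch in token)
```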

  17. Inserting Disfluencies • Use SWBD class n-grams (POS + top 100 words) as a generative model to insert: • um, uh and fragments • Repetitions of I, and, the • Sentence-initial and, but, so, well • Randomly generate according to a linear combination of standard and reverse n-grams: P(DF before w_i) = λ · P(DF | w_{i-1}, w_{i-2}, w_{i-3}) + (1 − λ) · P(DF | w_i, w_{i+1}, w_{i+2})
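The sampling rule above can be sketched as follows, assuming the forward and reverse class n-gram models (trained on Switchboard) are given as functions that handle short contexts; the single df_token stands in for the full inventory of insertions (fillers, repetitions, sentence-initial words) listed on the slide.

```python
# Sketch of the insertion rule: interpolate a forward model over the
# left context with a reverse model over the right context, then sample.
import random

def insert_disfluencies(words, p_fwd, p_rev, lam=0.5, df_token="uh"):
    """p_fwd/p_rev: assumed P(DF | context) functions; lam is the
    interpolation weight (lambda on the slide)."""
    out = []
    for i, w in enumerate(words):
        left = tuple(words[max(0, i - 3):i])    # w_{i-3} .. w_{i-1}
        right = tuple(words[i:i + 3])           # w_i .. w_{i+2}
        p = lam * p_fwd(left) + (1 - lam) * p_rev(right)
        if random.random() < p:                 # insert a DF before w_i
            out.append(df_token)
        out.append(w)
    return out
```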

  18. Examples • Well I I don’t know how uh she puts up with this speech impairment of mine • And I I think that’s really important instead of always doing um the hard work • Well monitoring for acid rain where uh the the primary components are sulphates and nitrates was conducted in twenty nine parks

  19. Inserting Disfluencies -- Results [Table: class-independent mixture weights, before vs. after disfluency insertion] • Weights of web data increase with added disfluencies • Small PP reduction, but no WER reduction… yet.

  20. Summary Observations • Findings that generalize across tasks (so far): • Web data is useful, but is better leveraged with Google “filtering” (+ text normalization) • Additional perplexity-based filtering is not useful • No gain, but no loss with automatic classes IF top 100 words are included • Results that vary with task: • Class-dependent mixture weights are mainly useful when there is less in-domain data • The jury is still out on the usefulness of disfluency insertion.

  21. Other Observations (in response to Rich) • Pruning LMs doesn’t hurt that much (but maybe our LMs didn’t get big enough for pruning to matter) • High-order n-gram hit rate is not as good a predictor of WER as perplexity (bigram hit rate is not bad), based on correlation with WER.

  22. Questions • Will web data still be useful for English CTS once we have the new data? (Note: web data will still be 10x in-domain data or more.) • Should we collect (and give out) more English CTS-oriented web data in the next couple months? • If so, should we switch focus to more topic-driven collections? • Research challenge: how to model style differences in a principled way?
