This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.

CS 479, section 1: Natural Language Processing

Lecture #16: Speech Recognition Overview (cont.)

Thanks to Alex Acero (Microsoft Research), Jeff Adams (Nuance), Simon Arnfield (Sheffield), Dan Klein (UC Berkeley), Mazin Rahim (AT&T Research) for many of the materials used in this lecture.

Announcements
  • Reading Report #6 on Young’s Overview
    • Due: now
  • Reading Report #7 on M&S 7
    • Due: Friday
  • Review Questions
    • Typed list of 5 questions for Mid-term exam review
    • Due next Wednesday
Objectives
  • Continue our overview of an approach to speech recognition, picking up at acoustic modeling
  • See other examples of the source / channel (noisy channel) paradigm for modeling interesting processes
  • Apply language models
Recall: Front End
  • We want to predict a sentence given a feature vector

[Figure: source/channel view of recognition: the source Text passes through a noisy channel to become Speech; the front end (FE) converts Speech into Features, which the ASR decoder maps back to Text]
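For reference, the decision rule behind this picture (the standard noisy-channel formulation; here W denotes the word sequence and X the acoustic feature vectors):

\hat{W} = \arg\max_W P(W \mid X) = \arg\max_W \underbrace{P(X \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}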
[Figure: ASR architecture: Feature Extraction, Decoder/Search, Language Model, Acoustic Model, Word Lexicon; this slide focuses on the Acoustic Model]

Acoustic Modeling
  • Goal:
    • Map acoustic feature vectors into distinct linguistic units
    • Such as phones, syllables, words, etc.
Acoustic Trajectories

[Figure: scatter plot of acoustic feature vectors in a two-dimensional projection, with regions labeled by phoneme (vowels such as AA, AE, AY, EE, EH, IH, OW, OY, UH; consonants such as s, sh, f, k, t, p, b, d, g, m, n, ng, l, r, w, y, etc.), illustrating how frames from different phonemes cluster and overlap in feature space]
Acoustic Models: Neighborhoods are not Points
  • How do we describe what points in our “feature space” are likely to come from a given phoneme?
  • It’s clearly more complicated than just identifying a single point.
  • Also, the boundaries are not “clean”.
  • Use the normal distribution:
    • Points are likely to lie near the center.
    • We describe the distribution with the mean & variance.
    • Easy to compute with.
Acoustic Models: Neighborhoods are not Points (2)
  • Normal distributions in M dimensions are analogous
  • A.k.a. “Gaussians”
  • Specify the mean point in M dimensions
    • Like an M-dimensional “hill” centered around the mean point
  • Specify the variances (as a covariance matrix)
    • The diagonal gives the “widths” of the distribution in each direction
    • Off-diagonal values describe the “orientation”
    • “Full covariance”: possibly “tilted”
    • “Diagonal covariance”: not “tilted”
AMs: Gaussians don’t really cut it
  • Consider the “AY” frames in our example. How can we describe these with an (elliptical) Gaussian?
  • A single (diagonal) Gaussian is too big to be helpful.
  • Full-covariance Gaussians are hard to train.
  • We often use multiple Gaussians (a.k.a. Gaussian mixture models)
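Building on the previous sketch, a Gaussian mixture model scores a frame as a weighted sum of component densities. A hedged illustration with K diagonal-covariance components (shapes and names are assumptions):

import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of frame x under a K-component diagonal-covariance GMM.
    weights: (K,) mixture weights summing to 1; means, variances: (K, M)."""
    M = x.shape[0]
    # Per-component log densities, vectorized over the K components.
    comp = -0.5 * (M * np.log(2 * np.pi)
                   + np.sum(np.log(variances), axis=1)
                   + np.sum((x - means) ** 2 / variances, axis=1))
    a = np.log(weights) + comp
    # Stable log-sum-exp over components: log sum_k w_k N(x; mu_k, Sigma_k).
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))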
AMs: Phonemes are a path, not a destination
  • Phonemes, like stories, have beginnings, middles, and ends.
  • This might be clear if you think of how the “AY” sound moves from a sort of “EH” to an “EE”.
  • Even non-diphthongs show these properties.
  • We often represent a phoneme with multiple “states”.
  • E.g. in our AY model, we might have 4 states.
  • And each of these states is modeled by a mixture of Gaussians.

[Figure: four HMM states, STATE 1 → STATE 2 → STATE 3 → STATE 4, laid out left to right as the path through one phoneme]

AMs: Whence & Whither
  • It matters where you come from (whence) and where you are going (whither).
  • Phonetic contextual effects
  • A way to model this is to use triphones
    • I.e. Depend on the previous & following phonemes
    • E.g. Our “AY” model should really be a silence-AY-S model

(… or pentaphones: use 2 phonemes before & after)

  • So what we really need for our “AY” model is a:
    • Mixture of Gaussians
    • For each of multiple states
    • For each possible set of predecessor & successor phonemes
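As a small illustration of the bookkeeping (the helper is hypothetical; the naming follows the sil-AY+S style that appears in the HMM diagrams below):

def to_triphones(phones):
    """Turn a phone sequence into context-dependent triphone names,
    e.g. ['sil', 'AY', 'S', 'sil'] -> ['sil-AY+S', 'AY-S+sil']."""
    triphones = []
    for prev, cur, nxt in zip(phones, phones[1:], phones[2:]):
        triphones.append(f"{prev}-{cur}+{nxt}")
    return triphones

print(to_triphones(["sil", "AY", "S", "sil"]))  # ['sil-AY+S', 'AY-S+sil']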
Hidden Markov Model (HMM)
  • Captures:
    • Transitions between hidden states
    • Feature emissions as mixtures of Gaussians
  • Spectral properties modeled bya parametric random process
    • i.e., a directed graphical model!
  • Advantages:
    • Powerful statistical method for a wide range of data and conditions
    • Highly reliable for recognizing speech
  • A collection of HMMs for each:
    • sub-word unit type
    • extraneous event: cough, um, sneeze, …
  • More on HMMs coming up in the course after classification!
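A rough sketch of the data structure this implies (my own naming; a real system stores log-probabilities and ties parameters across triphone contexts):

import numpy as np

class PhoneHMM:
    """Left-to-right phone HMM: a GMM emission model per hidden state plus
    state-transition probabilities (cf. the anatomy diagram on the next slide)."""

    def __init__(self, name, state_gmms, trans):
        self.name = name                  # e.g. "sil-AY+S"
        self.state_gmms = state_gmms      # one emission scorer per state, e.g. gmm_loglik closures
        self.trans = np.asarray(trans)    # trans[i, j] = P(state j at t+1 | state i at t)

    def emission_logprob(self, state, frame):
        # Score one acoustic frame under the GMM attached to this state.
        return self.state_gmms[state](frame)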
Anatomy of an HMM
  • HMM for /AY/ in context of preceding silence, followed by /S/

[Figure: three emitting states, sil-AY+S[1], sil-AY+S[2], sil-AY+S[3], connected left to right, with transition probabilities (0.2, 0.3, 0.2, 0.8, 0.7, 0.8, 0.5) labeled on the self-loops and arcs]
HMMs as Phone Models

[Figure: the same three-state sil-AY+S HMM with its transition probabilities, shown as a phone model]
Words and Phones

How do we know how to segment words into phones?

[Figure: ASR architecture: Feature Extraction, Decoder/Search, Language Model, Acoustic Model, Word Lexicon; this slide focuses on the Word Lexicon]

Word Lexicon
  • Goal:
    • Map sub-word units into words
    • Usual sub-word units are phone(me)s
  • Lexicon: (CMUDict, ARPABET)

        Phoneme   Example   Translation
        AA        odd       AA D
        AE        at        AE T
        AH        hut       HH AH T
        AO        ought     AO T
        AW        cow       K AW
        AY        hide      HH AY D
        B         be        B IY
        CH        cheese    CH IY Z
  • Properties:
    • Simple
    • Typically knowledge-engineered (not learned – shock!)
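A minimal sketch of loading and querying such a lexicon (the parsing follows CMUdict's plain-text layout of one "WORD PH1 PH2 ..." entry per line; the file path and usage are hypothetical):

def load_lexicon(path):
    """Read a CMUdict-style file: one 'WORD  PH1 PH2 ...' entry per line."""
    lexicon = {}
    with open(path, encoding="latin-1") as f:
        for line in f:
            if line.startswith(";;;") or not line.strip():
                continue  # skip comments and blank lines
            word, *phones = line.split()
            lexicon.setdefault(word, []).append(phones)  # allow alternate pronunciations
    return lexicon

# Hypothetical usage:
# lex = load_lexicon("cmudict.dict")
# print(lex["HIDE"])  # e.g. [['HH', 'AY', 'D']]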
Decoder
  • Predict a sentence given a feature vector

[Figure: the same source/channel pipeline as before: Text → noisy channel → Speech → FE → Features → ASR → Text]
[Figure: ASR architecture: Feature Extraction, Pattern Classification (Decoder/Search), Language Model, Acoustic Model, Word Lexicon]

Decoding as State-Space Search
Decoding as Search
  • Viterbi – Dynamic Programming
  • Multi-pass
  • A* (“stack decoding”)
  • N-best
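A compact sketch of the Viterbi dynamic program over HMM states (log domain; in a real recognizer the emission and transition scores come from the acoustic model, lexicon, and language model, so the inputs here are just illustrative arrays):

import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Most likely state sequence for T frames over N states.
    log_emit: (T, N) per-frame emission log-probs
    log_trans: (N, N) transition log-probs; log_init: (N,) initial log-probs."""
    T, N = log_emit.shape
    delta = np.full((T, N), -np.inf)    # best score of any path ending in state j at time t
    back = np.zeros((T, N), dtype=int)  # backpointers
    delta[0] = log_init + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans      # scores[i, j]: best path into j via i
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(N)] + log_emit[t]
    # Trace back the best path from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(delta[-1]))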
Noisy Channel Applications
  • Speech recognition (dictation, commands, etc.)
    • text → neurons, acoustic signal, transmission → acoustic waveforms → text
  • OCR
    • text → print, smudge, scan → image → text
  • Handwriting recognition
    • text → neurons, muscles, ink, smudge, scan → image → text
  • Spelling correction
    • text → your spelling → mis-spelled text → text
  • Machine Translation (?)
    • text in target language → translation in head → text in source language → text in target language
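Each application instantiates the same decision rule. A toy illustration for spelling correction (the candidate words, probabilities, and channel model are made up for the example):

import math

# Hypothetical toy models: P(word) from a unigram LM, P(observed | word) from an error model.
prior = {"the": 0.07, "thee": 0.0001}                      # language model P(text)
channel = {("teh", "the"): 0.01, ("teh", "thee"): 0.001}   # channel model P(observed | text)

def correct(observed, candidates):
    # argmax over candidates of log P(candidate) + log P(observed | candidate)
    return max(candidates,
               key=lambda c: math.log(prior[c]) + math.log(channel[(observed, c)]))

print(correct("teh", ["the", "thee"]))  # -> "the"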
Noisy-Channel Models
  • OCR
  • Handwriting recognition
  • Spelling Correction
  • Translation?
What’s Next
  • Upcoming lectures:
    • Classification / categorization
    • Naïve-Bayes models
    • Class-conditional language models
Milestones in Speech Recognition

[Figure: timeline of speech recognition milestones, with years marked at 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2003]
  • Isolated Words (Small Vocabulary, Acoustic Phonetics-based): filter-bank analysis, time-normalization, dynamic programming
  • Connected Words, Continuous Speech (Medium Vocabulary, Template-based): pattern recognition, LPC analysis, clustering algorithms, level building
  • Isolated Words, Connected Digits, Continuous Speech (Large Vocabulary, Statistical-based): hidden Markov models, stochastic language modeling
  • Continuous Speech, Speech Understanding (Large Vocabulary; Syntax, Semantics): stochastic language understanding, finite-state machines, statistical learning
  • Spoken dialog, Multiple modalities (Very Large Vocabulary; Semantics, Multimodal Dialog): concatenative synthesis, machine learning, mixed-initiative dialog

Dragon Dictate Progress
  • WERR* from Dragon NaturallySpeaking version 7 to version 8 to version 9:

        Domain        v7→v8   v8→v9
        US English     27%     23%
        UK English     21%     10%
        German         16%     10%
        French         24%     14%
        Dutch          27%     18%
        Italian        22%     14%
        Spanish        26%     17%

* WERR means relative word error rate reduction on an in-house evaluation set.

Results from Jeff Adams, ca. 2006
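For reference, relative word error rate reduction relates the old and new absolute error rates; for example (the 10% starting WER below is made up, since the slide reports only relative changes):

\text{WERR} = \frac{\text{WER}_{\text{old}} - \text{WER}_{\text{new}}}{\text{WER}_{\text{old}}}, \qquad \text{WER}_{\text{old}} = 10\%,\ \text{WERR} = 27\% \;\Rightarrow\; \text{WER}_{\text{new}} = 10\% \times (1 - 0.27) = 7.3\%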

Crazy Speech Marketplace

[Figure: timeline (ca. 1980 to ca. 2004) of speech companies merging and consolidating: Philips, IBM, Inso, Articulate, MedRemote, Kurzweil, ScanSoft, Nuance, L&H, Dragon, Dictaphone, Speechworks, Voice Signal, Tegic, and others]

Speech vs. text: tokens vs. characters
  • Speech recognition recognizes a sequence of “tokens” taken from a discrete & finite set, called the lexicon.
  • Informally, tokens correspond to words, but the correspondence is inexact. In dictation applications, where we have to worry about converting between speech & text, we need to sort out a “token philosophy”:
    • Do we recognize “forty-two” or “forty two” or “42” or “40 2”?
    • Do we recognize “millimeters” or “mms” or “mm”?
    • What about common words which can also be names, e.g. “Brown” and “brown”?
    • What about capitalized phrases like “Nuance Communications” or “The White House” or “Main Street”?
    • What multi-word tokens should be in the lexicon, like “of_the”?
    • What do we do with complex morphologies or compounding?
Converting between tokens & text

[Figure: the token philosophy / lexicon defines the mapping between the two forms:
    TEXT:   Profits rose to $28 million. See fig. 1a on p. 124.
    TOKENS: profits rose to twenty eight million dollars .\period see figure one a\a on page one twenty four .\period
  with TOKENIZATION converting text into tokens and ITN converting tokens back into text]
Three examples (Tokenization)

TEXT

  • P.J. O’Rourke said, "Giving money and power to government is like giving whiskey and car keys to teenage boys."
  • The 18-speed I bought sold on www.eBay.com for $611.00, including 8.5% sales tax.
  • From 1832 until August 15, 1838 they lived at No. 235 Main Street, "opposite the Academy," and from there they could see it all.

TOKENS

  • PJ O\'Rourke said ,\comma"\open-quotes giving money and power to government is like giving whiskey and car keys to teenage boys .\period "\close-quotes
  • the eighteen speed I bought sold on www.\WWW_dot eBay .com\dot_com for six hundred and eleven dollars zero cents ,\comma including eight .\point five percent sales tax .\period
  • from one eight three two until the fifteenth of August eighteen thirty eight they lived at number two thirty five Main_Street ,\comma "\open-quotes opposite the Academy ,\comma "\close-quotes and from there they could see it all .\period
Missing from speech: punctuation
  • When people speak they don’t explicitly indicate phrase and section boundaries instead listeners rely on prosody and syntax to know where these boundaries belong in dictation applications we normally rely on speakers to speak punctuation explicitly how can we remove that requirement
  • When people speak, they don’t explicitly indicate phrase and section boundaries.
    • Instead, listeners rely on prosody and syntax to know where these boundaries belong.
    • In dictation applications, we normally rely on speakers to speak punctuation explicitly.
    • How can we remove that requirement?
Punctuation Guessing Example
  • Punctuation Guessing
    • As currently shipping in Dragon
    • Targeted towards free, unpunctuated speech