Acoustics of speech

Acoustics of Speech. Julia Hirschberg, CS 4706. Goal 1: Distinguishing One Phoneme from Another, Automatically. ASR: Did the caller say ‘I want to fly to Newark’ or ‘I want to fly to New York’? Forensic Linguistics: Did the accused say ‘Kill him’ or ‘Bill him’?

Acoustics of Speech

Julia Hirschberg

CS 4706

Goal 1: Distinguishing One Phoneme from Another, Automatically

  • ASR: Did the caller say ‘I want to fly to Newark’ or ‘I want to fly to New York’?

  • Forensic Linguistics: Did the accused say ‘Kill him’ or ‘Bill him’?

  • What evidence is there in the speech signal?

    • How accurately and reliably can we extract it?

Goal 2: Determining How Things are Said is Sometimes Critical to Understanding

  • Forensic Linguistics: ‘Kill him!’ or ‘Kill him?’

  • Call Center: ‘That amount is incorrect.’

  • What information do we need to extract from the speech signal?

  • What tools do we have to do this?

Today and Next Class

  • Acoustic features to extract

    • Fundamental frequency (pitch)

    • Amplitude/energy (loudness)

    • Spectrum

    • Timing (pauses, rate)

  • Tools for extraction

    • Praat

    • Wavesurfer

    • Xwaves

Sound Production

  • Pressure fluctuations in the air caused by a musical instrument, a car horn, a voice

    • Cause eardrum to move

    • Auditory system translates into neural impulses

    • Brain interprets as sound

    • Plot sound as change in air pressure over time

  • From a speech-centric point of view, when sound is not produced by the human voice, we may term it noise

    • Ratio of speech-generated sound to other simultaneous sound: signal-to-noise ratio

      • Higher SNRs are better
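The signal-to-noise ratio above can be sketched numerically. The slide defines it only as a ratio; expressing it in decibels from RMS amplitudes is a common convention assumed here, and the toy `speech` and `noise` sample lists are hypothetical.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(speech, noise):
    """SNR in decibels, computed from the ratio of RMS amplitudes (assumed convention)."""
    return 20.0 * math.log10(rms(speech) / rms(noise))

# Hypothetical toy signals: the speech has 10 times the amplitude of the noise.
speech = [10.0, -10.0, 10.0, -10.0]
noise = [1.0, -1.0, 1.0, -1.0]
ratio_db = snr_db(speech, noise)  # a 10x amplitude ratio corresponds to 20 dB
```

A larger value means the speech stands out more clearly from the background, which is why higher SNRs are better.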

How ‘Loud’ are Common Sounds – How Much Pressure Generated?

Event               Pressure (µPa)   dB

Absolute threshold  20               0
Whisper             200              20
Quiet office        2,000            40
Conversation        20,000           60
Bus                 200,000          80
Subway              2,000,000        100
Thunder             20,000,000       120
*DAMAGE*            200,000,000      140

Some Sounds are Periodic

  • Simple Periodic Waves (sine waves) defined by

    • Frequency: how often does pattern repeat per time unit

      • Cycle: one repetition

      • Period: duration of cycle

      • Frequency=# cycles per time unit, e.g.

        • Frequency in Hz = cycles per second or 1/period

        • E.g. 400 Hz pitch = 1/.0025 sec (1 cycle has a period of .0025 sec; 400 cycles complete in 1 sec)

    • Amplitude: peak deviation of pressure from normal atmospheric pressure

    • Phase: timing of waveform relative to a reference point
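The three defining properties of a simple periodic wave (frequency, amplitude, phase) are enough to generate one. The following is a minimal sketch; the function name `sine_wave` and the 16 kHz sampling rate are illustrative assumptions, not part of the slides.

```python
import math

def sine_wave(freq_hz, amplitude, phase_rad, sample_rate, duration_s):
    """Sample amplitude * sin(2*pi*f*t + phase) at evenly spaced times."""
    n = int(round(sample_rate * duration_s))
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * i / sample_rate + phase_rad)
        for i in range(n)
    ]

# The 400 Hz example from the slide: one cycle lasts 1/400 = 0.0025 sec.
period_s = 1.0 / 400.0

# 10 ms of a 400 Hz tone at an assumed 16 kHz sampling rate: 4 full cycles.
wave = sine_wave(400.0, 1.0, 0.0, 16000, 0.01)
samples_per_cycle = 16000 // 400  # 40 samples per cycle

# Shifting the phase by a quarter cycle turns the sine into a cosine.
shifted = sine_wave(400.0, 1.0, math.pi / 2.0, 16000, 0.01)
```

With zero phase the wave starts at 0 and reaches its peak a quarter cycle later (sample 10 here); with a phase of pi/2 it starts at its peak.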

Complex Periodic Waves

  • Cyclic but composed of multiple sine waves

  • Fundamental frequency (F0): rate at which largest pattern repeats (also GCD of component freqs)

  • Components not always easily identifiable: power spectrum graphs amplitude vs. frequency

  • Any complex waveform can be analyzed into a set of sine waves with their own frequencies, amplitudes, and phases (Fourier’s theorem)
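To make the points above concrete, here is a small sketch that builds a complex periodic wave from two sine components, finds F0 as the GCD of the component frequencies, and recovers the components with a naive discrete Fourier transform. The component frequencies (200 and 300 Hz), the 1000 Hz sampling rate, and all function names are assumptions chosen for illustration; real tools use a fast Fourier transform rather than this slow direct sum.

```python
import cmath
import math
from functools import reduce

def fundamental_hz(component_freqs):
    """F0 of a complex periodic wave: the GCD of its (integer) component frequencies."""
    return reduce(math.gcd, component_freqs)

def dft_amplitudes(samples):
    """Naive DFT; returns one amplitude per frequency bin from 0 up to the Nyquist bin.

    The 2/n scaling recovers sine amplitudes for bins strictly between 0 and Nyquist.
    """
    n = len(samples)
    amps = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        amps.append(abs(acc) * 2.0 / n)
    return amps

sample_rate = 1000   # assumed sampling rate in Hz
n = 100              # 0.1 sec of signal, so each bin is 10 Hz wide
signal = [
    1.0 * math.sin(2.0 * math.pi * 200 * t / sample_rate)
    + 0.5 * math.sin(2.0 * math.pi * 300 * t / sample_rate)
    for t in range(n)
]

f0 = fundamental_hz([200, 300])   # the pattern repeats 100 times per second
amps = dft_amplitudes(signal)
peaks_hz = [k * sample_rate / n for k, a in enumerate(amps) if a > 0.1]
```

Here `peaks_hz` contains the two component frequencies, and the amplitudes at those bins match the 1.0 and 0.5 used to build the signal. Note that F0 (100 Hz) is not itself one of the components in this example: it is the rate at which the combined pattern repeats.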

Some Sounds are Aperiodic

  • Waveforms with random or non-repeating patterns

    • Random aperiodic waveforms: white noise

      • Flat spectrum: equal amplitude for all frequency components

    • Transients: sudden bursts of pressure (clicks, pops, door slams)

      • Waveform shows a single impulse (click.wav)

      • Fourier analysis shows a flat spectrum

  • Some speech sounds, e.g. many consonants (e.g. cat.wav)
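The claim that a transient has a flat spectrum can be checked directly. This sketch stands in for click.wav with a hypothetical 64-sample signal containing a single impulse and computes its magnitude spectrum with a naive DFT; the signal length and impulse position are arbitrary assumptions.

```python
import cmath
import math

def magnitude_spectrum(samples):
    """Magnitude of each DFT bin from 0 up to the Nyquist bin (naive direct sum)."""
    n = len(samples)
    return [
        abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

# Hypothetical click: silence everywhere except one sample.
click = [0.0] * 64
click[5] = 1.0

spectrum = magnitude_spectrum(click)
is_flat = all(abs(m - 1.0) < 1e-9 for m in spectrum)  # every bin has the same magnitude
```

Moving the impulse to a different sample changes only the phases of the components, not their magnitudes, so the spectrum stays flat.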

Speech Production

  • Voiced and voiceless sounds

  • Vocal fold vibration filtered by the Vocal tract produces complex periodic waveform

    • Cycles per sec of lowest frequency component of signal = fundamental frequency (F0)

    • Fourier analysis yields power spectrum with component frequencies and amplitudes

      • F0 is first (lowest frequency) peak

      • Harmonics are the component frequencies at integer multiples of F0; the vocal tract amplifies those near its resonances

Vocal Fold Vibration

[UCLA Phonetics Lab demo]

Places of Articulation

[Figure: places of articulation; surviving label: alveolar]

How do we capture speech for analysis?

  • Recording conditions

    • A quiet office, a sound booth, an anechoic chamber

  • Microphones

  • Analog devices (e.g. tape recorders) store and analyze continuous air pressure variations (speech) as a continuous signal

  • Digital devices (e.g. computers, DAT) first convert continuous signals into discrete signals (A-to-D conversion)

  • File format:

    • .wav, .aiff, .ds, .au, .sph,…

    • Conversion programs, e.g. sox

  • Storage

    • Function of how much information we store about speech in digitization

      • Higher quality, closer to original

      • More space (1000s of hours of speech take up a lot of space)

Sampling

  • Sampling rate: how often do we need to sample?

    • At least 2 samples per cycle to capture periodicity of a waveform component at a given frequency

      • 100 Hz waveform needs 200 samples per sec

      • Nyquist frequency: highest-frequency component captured with a given sampling rate (half the sampling rate)

Sampling/storage tradeoff

  • Human hearing: top frequency ~20 kHz

    • Do we really need to store 40K samples per second of speech?

  • Telephone speech: 300 Hz-4 kHz (8 kHz sampling)

    • But some speech sounds (e.g. fricatives such as /f/ and /s/, and the bursts of stops such as /p/, /t/, /d/) have energy above 4 kHz!

    • Peter/teeter/Dieter

  • 44.1 kHz (CD-quality audio) vs. 16-22 kHz (usually good enough to study pitch, amplitude, duration, …)

Sampling Errors

  • Aliasing:

    • Signal’s frequency is higher than half the sampling rate, so it is recorded as a spurious lower frequency

    • Solutions:

      • Increase the sampling rate

      • Filter out frequencies above half the sampling rate (anti-aliasing filter)
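A short sketch of why aliasing happens: once sampled, a sinusoid above half the sampling rate produces exactly the same sample values as a lower-frequency one. The helper names `nyquist_hz` and `alias_hz` are illustrative, and the 8 kHz rate is the telephone-speech rate mentioned earlier.

```python
import math

def nyquist_hz(sample_rate):
    """Highest frequency that a given sampling rate can capture."""
    return sample_rate / 2

def alias_hz(freq_hz, sample_rate):
    """Apparent frequency of a sampled sinusoid, folded into the range 0..Nyquist."""
    folded = freq_hz % sample_rate
    return min(folded, sample_rate - folded)

sample_rate = 8000
apparent = alias_hz(5000, sample_rate)   # 5000 Hz is above the 4000 Hz limit

# The sampled 5000 Hz cosine is indistinguishable from a 3000 Hz cosine.
high = [math.cos(2.0 * math.pi * 5000 * t / sample_rate) for t in range(16)]
low = [math.cos(2.0 * math.pi * 3000 * t / sample_rate) for t in range(16)]
identical = all(abs(a - b) < 1e-9 for a, b in zip(high, low))
```

Because the two sets of samples are identical, nothing can undo the confusion after sampling, which is why the anti-aliasing filter has to be applied before A-to-D conversion.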

Quantization

  • Measuring the amplitude at sampling points: what resolution to choose?

    • Integer representation

    • 8, 12 or 16 bits per sample

  • Noise due to quantization steps is reduced by higher resolution, but that requires more storage

    • How many different amplitude levels do we need to distinguish?

    • Choice depends on data and application (44.1 kHz 16-bit stereo requires ~10 MB of storage per minute)

  • But clipping occurs when input volume is greater than range representable in digitized waveform

    • Increase the resolution

    • Decrease the amplitude
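Quantization, clipping, and the storage cost can be sketched together. This is a minimal illustration assuming amplitudes normalized to the range -1.0 to 1.0 and signed integer samples; the function names are hypothetical. The storage calculation assumes the ~10 MB figure quoted above refers to roughly one minute of 44.1 kHz 16-bit stereo audio.

```python
def quantize(x, bits):
    """Map an amplitude (nominally -1.0..1.0) to a signed integer of `bits` bits.

    Values outside the representable range are clipped to the nearest limit.
    """
    scale = 2 ** (bits - 1)
    level = round(x * scale)
    return max(-scale, min(scale - 1, level))

def storage_bytes(sample_rate, bits, channels, seconds):
    """Uncompressed storage needed for the given recording parameters."""
    return sample_rate * bits * channels * seconds // 8

levels_8bit = 2 ** 8      # 256 distinguishable amplitude levels
levels_16bit = 2 ** 16    # 65536 distinguishable amplitude levels

in_range = quantize(0.5, 8)     # representable: half of full scale
clipped = quantize(1.5, 8)      # too loud: pinned to the largest 8-bit level
one_minute = storage_bytes(44100, 16, 2, 60)
```

Clipping loses the true peak value entirely, so lowering the input level before digitizing is the usual remedy; more bits per sample mainly reduce the size of the quantization steps.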

What can we do if our data is ‘noisy’?

  • Acoustic filters block out certain frequencies of sounds

    • Low-pass filter blocks high frequency components of a waveform

    • High-pass filter blocks low frequencies

    • Reject band (what to block) vs. pass band (what to let through)

  • But if the frequencies of two sounds overlap, filtering cannot pull them apart: that calls for source separation
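As a rough illustration of low-pass and high-pass filtering, here is a one-pole (first-order) filter sketch. This is one of the simplest possible designs, chosen for brevity rather than quality; the 300 Hz cutoff, the 50 Hz and 3000 Hz test tones, and all names are assumptions for the example.

```python
import math

def low_pass(samples, sample_rate, cutoff_hz):
    """One-pole low-pass filter: each output moves a fraction alpha toward the input."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)
        out.append(y)
    return out

def high_pass(samples, sample_rate, cutoff_hz):
    """Complementary high-pass: the part of the signal the low-pass filter removed."""
    low = low_pass(samples, sample_rate, cutoff_hz)
    return [x - l for x, l in zip(samples, low)]

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def tone(freq_hz, sample_rate=16000, n=1600):
    return [math.sin(2.0 * math.pi * freq_hz * t / sample_rate) for t in range(n)]

hum, hiss = tone(50.0), tone(3000.0)

# Fraction of each tone's RMS amplitude that survives each filter.
low_keeps_hum = rms(low_pass(hum, 16000, 300.0)) / rms(hum)
low_keeps_hiss = rms(low_pass(hiss, 16000, 300.0)) / rms(hiss)
high_keeps_hum = rms(high_pass(hum, 16000, 300.0)) / rms(hum)
high_keeps_hiss = rms(high_pass(hiss, 16000, 300.0)) / rms(hiss)
```

The low-pass filter passes the 50 Hz tone almost untouched and strongly attenuates the 3000 Hz tone; the high-pass filter does the reverse. A first-order filter has a gentle slope, so practical tools use steeper designs.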

How can we capture pitch contours, pitch range?

  • What is the pitch contour of this utterance? Is the pitch range of X greater than that of Y?

  • Pitch tracking: Estimate F0 over time as a function of vocal fold vibration

  • A periodic waveform is correlated with itself

    • One period looks much like another (cat.wav)

    • Find the period by finding the ‘lag’ (offset) between two windows on the signal for which the correlation of the windows is highest

    • Lag duration (T) is 1 period of waveform

    • Inverse is F0 (1/T)
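The lag-search idea above can be sketched as follows. This is a simplified illustration for a single analysis frame, not the algorithm of any particular pitch tracker; the search range (100 to 500 Hz), the `tolerance` parameter, and the synthetic test signal are assumptions. Because a periodic signal correlates equally well at every multiple of its period, the sketch returns the shortest lag whose score is close to the best one.

```python
import math

def normalized_correlation(a, b):
    """Correlation of two equal-length windows, scaled to the range -1..1."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0

def estimate_f0(samples, sample_rate, min_hz=100.0, max_hz=500.0, tolerance=0.99):
    """Return (F0 in Hz, correlation score) for one frame, or (0.0, score) if unvoiced."""
    min_lag = int(sample_rate / max_hz)
    max_lag = int(sample_rate / min_hz)
    window = len(samples) - max_lag
    scores = [
        (lag, normalized_correlation(samples[:window], samples[lag:lag + window]))
        for lag in range(min_lag, max_lag + 1)
    ]
    best = max(score for _, score in scores)
    if best <= 0:
        return 0.0, best
    # Shortest lag that scores nearly as well as the best: guards against
    # picking a multiple of the true period.
    for lag, score in scores:
        if score >= tolerance * best:
            return sample_rate / lag, score

# Synthetic voiced frame: 200 Hz fundamental plus a weaker 400 Hz harmonic.
sample_rate = 8000
frame = [
    math.sin(2.0 * math.pi * 200 * t / sample_rate)
    + 0.5 * math.sin(2.0 * math.pi * 400 * t / sample_rate)
    for t in range(400)
]
f0, score = estimate_f0(frame, sample_rate)
```

For this frame the best lag is 40 samples (one period at 8 kHz), so the estimate is 200 Hz. Running the same function on successive frames of an utterance would give a pitch contour.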

  • Errors to watch for:

    • Halving: shortest lag calculated is too long (underestimate pitch)

    • Doubling: shortest lag too short (overestimate pitch)

    • Microprosody effects (e.g. /v/)

Sample Analysis File: Pitch Track Header

  • version 1

  • type_code 4

  • frequency 12000.000000

  • samples 160768

  • start_time 0.000000

  • end_time 13.397333

  • bandwidth 6000.000000

  • dimensions 1

  • maximum 9660.000000

  • minimum -17384.000000

  • time Sat Nov 2 15:55:50 1991

  • operation record: padding xxxxxxxxxxxx

Sample Analysis File: Pitch Track Data

F0        Pvoicing  Energy    A/C Score

147.896   1         2154.07   0.902643
140.894   1         1544.93   0.967008
138.05    1         1080.55   0.92588
130.399   1         745.262   0.595265
0         0         567.153   0.504029
0         0         638.037   0.222939
0         0         670.936   0.370024
0         0         790.751   0.357141
141.215   1         1281.1    0.904345
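A pitch track like this is easy to summarize in a few lines. The sketch below parses the nine frames shown above and computes the mean F0 and pitch range over voiced frames only. It assumes the four columns are F0, a voicing flag, energy, and an autocorrelation score, in that order; the variable names are illustrative.

```python
# The nine frames from the sample file, one frame per line.
rows = """
147.896 1 2154.07 0.902643
140.894 1 1544.93 0.967008
138.05 1 1080.55 0.92588
130.399 1 745.262 0.595265
0 0 567.153 0.504029
0 0 638.037 0.222939
0 0 670.936 0.370024
0 0 790.751 0.357141
141.215 1 1281.1 0.904345
"""

frames = [tuple(float(v) for v in line.split()) for line in rows.strip().splitlines()]

# Unvoiced frames carry F0 = 0, so they must be excluded from pitch statistics.
voiced_f0 = [f0 for f0, voiced, energy, ac_score in frames if voiced == 1.0]

mean_f0 = sum(voiced_f0) / len(voiced_f0)
pitch_range = (min(voiced_f0), max(voiced_f0))
```

Including the unvoiced zeros in the average would drag the mean far below any pitch the speaker actually produced, which is why the voicing column matters.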

Pitch Perception

  • But do pitch trackers capture what humans perceive?

  • Auditory system’s perception of pitch is non-linear

    • Sounds at lower frequencies with same difference in absolute frequency sound more different than those at higher frequencies (male vs. female speech)

    • Bark scale (Zwicker) and other models of perceived difference
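The non-linearity can be illustrated with one widely used analytic approximation of the Bark scale, usually attributed to Zwicker and Terhardt. The slides do not say which formula the course uses, so treat the choice of formula and the function name `hz_to_bark` as assumptions.

```python
import math

def hz_to_bark(freq_hz):
    """Approximate Bark value for a frequency in Hz (Zwicker and Terhardt style formula)."""
    return 13.0 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500.0) ** 2)

# The same 100 Hz difference at a low and at a high frequency.
low_step = hz_to_bark(200.0) - hz_to_bark(100.0)
high_step = hz_to_bark(5100.0) - hz_to_bark(5000.0)
```

On this scale the step from 100 to 200 Hz is several times larger than the step from 5000 to 5100 Hz, matching the observation that equal differences in Hz sound more different at low frequencies.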

How do we capture loudness/intensity?

  • Is one utterance louder than another?

  • Energy closely correlated experimentally with perceived loudness

  • For each window, square the amplitude values of the samples, take their mean, and take the root of that mean (RMS energy)

    • What size window?

    • Longer windows produce smoother amplitude traces but miss sudden acoustic events
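The windowed RMS computation and the window-size tradeoff can be sketched as follows. The function name, the non-overlapping hop, and the toy signal (a short burst surrounded by silence) are assumptions for illustration.

```python
import math

def rms_energy(samples, window_size, hop_size):
    """RMS of each window: square the samples, take the mean, take the square root."""
    values = []
    for start in range(0, len(samples) - window_size + 1, hop_size):
        window = samples[start:start + window_size]
        values.append(math.sqrt(sum(s * s for s in window) / window_size))
    return values

# Toy signal: 8 samples of silence, an 8-sample burst, 8 samples of silence.
signal = [0.0] * 8 + [1.0, -1.0] * 4 + [0.0] * 8

short_trace = rms_energy(signal, window_size=4, hop_size=4)
long_trace = rms_energy(signal, window_size=12, hop_size=12)
```

The short windows give a trace that jumps from 0 to 1 and back, locating the burst precisely. The long windows each mix silence with part of the burst, so the trace is smoother and never reaches the burst's true level.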

Perception of Loudness

  • But the relation is non-linear: sones or decibels (dB)

    • Differences in soft sounds more salient than loud

    • Intensity is proportional to the square of amplitude, so the intensity of a sound with pressure x relative to a reference sound with pressure r is $x^2/r^2$

    • bel: base 10 log of that ratio

    • decibel: one tenth of a bel

    • $\mathrm{dB} = 10 \log_{10}(x^2/r^2) = 20 \log_{10}(x/r)$

    • Absolute threshold (20 µPa, lowest audible pressure fluctuation of a 1000 Hz tone): typical threshold level for a tone at that frequency

How do we capture…

  • For utterances X and Y

  • Pitch contour: Same or different?

  • Pitch range: Is X larger than Y?

  • Duration: Is utterance X longer than utterance Y?

  • Speaker rate: Is the speaker of X speaking faster than the speaker of Y?

  • Voice quality….

Next Class

  • Tools for the Masses: Read the Praat tutorial

  • Download Praat from the course syllabus page and play with a speech file (e.g. or record your own)

  • Bring a laptop and headphones to class if you have them