Explore how speech variations convey meaning and the tools, data, and techniques used in acoustics of speech analysis.
Acoustics of Speech Julia Hirschberg CS 4706
Claim: How things are said can be critical to understanding • I.e., varying phrasing, prominence, pitch range, speaking rate, pitch contour, voice quality… conveys meaning • What is our evidence? How do we prove it? • Observation • Hypotheses • Experimentation (perception, production) • Speech analysis (independent variables) • Correlation with dependent variable
What does our data look like? • What tools do we have for analysis?
What is sound? • Pressure fluctuations in the air caused by a musical instrument, a car horn, a voice • Cause eardrum to move • Auditory system translates into neural impulses • Brain interprets as sound • Can we tell one sound from another? • Can we distinguish one particular sound in ‘noise’?
From a speech-centric point of view, when sound is not produced by the human voice, we may term it noise • Ratio of speech-generated sound to other simultaneous sound: signal-to-noise ratio
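The ratio above can be sketched in a few lines. This is only an illustration: measuring each sound by its mean-square amplitude (power) and reporting the result in decibels are common conventions assumed here, and the function names are hypothetical.

```python
import math

def mean_power(samples):
    """Mean of the squared sample values (assumed measure of signal power)."""
    return sum(s * s for s in samples) / len(samples)

def snr(speech, noise):
    """Signal-to-noise ratio as a plain power ratio."""
    return mean_power(speech) / mean_power(noise)

def snr_db(speech, noise):
    """The same ratio expressed in decibels (10 * log10 of the power ratio)."""
    return 10 * math.log10(snr(speech, noise))

# Toy data: "speech" alternates at amplitude 1.0, "noise" at amplitude 0.1.
speech = [1.0, -1.0] * 50
noise = [0.1, -0.1] * 50
ratio = snr(speech, noise)        # about 100
ratio_db = snr_db(speech, noise)  # about 20 dB
```

A tenfold difference in amplitude gives a hundredfold difference in power, which is why the ratio comes out near 100.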
How ‘Loud’ are Common Sounds?
Event                 Pressure (µPa)   dB
Absolute threshold    20               0
Whisper               200              20
Quiet office          2K               40
Conversation          20K              60
Bus                   200K             80
Subway                2M               100
Thunder               20M              120
*DAMAGE*              200M             140
Some Sounds are Periodic • Simple Periodic Waves (sine waves) defined by • Frequency: how often does pattern repeat per time unit • Cycle: one repetition • Period: duration of cycle • Frequency = # cycles per time unit, e.g. • Frequency in Hz = 1 / period in sec • Time: horizontal axis of waveform • Amplitude: peak deviation of pressure from normal atmospheric pressure
Phase: timing of waveform relative to a reference point • Complex periodic waves • Cyclic but composed of two or more sine waves • Fundamental frequency (F0): rate at which largest pattern repeats (also GCD of component freqs) • Components not always easily identifiable: power spectrum graphs amplitude vs. frequency • Any complex waveform can be analyzed into a set of sine waves with their own frequencies, amplitudes, and phases (Fourier’s theorem) • E.g. some speech sounds (mostly vowels) cat.wav
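As a rough illustration of the points above, the sketch below builds a complex periodic wave from two sine waves, recovers its components with a naive discrete Fourier transform, and takes F0 as the GCD of the component frequencies. The 200 Hz and 300 Hz components, the 2000 Hz sampling rate, and all function names are assumptions chosen for the example.

```python
import math
from functools import reduce

def fundamental_frequency(component_hz):
    """F0 of a complex periodic wave: GCD of its integer component frequencies."""
    return reduce(math.gcd, component_hz)

def complex_wave(components, sample_rate, n_samples):
    """Sum of sine waves; components is a list of (freq_hz, amplitude, phase_rad)."""
    return [
        sum(a * math.sin(2 * math.pi * f * i / sample_rate + p)
            for f, a, p in components)
        for i in range(n_samples)
    ]

def amplitude_spectrum(samples, sample_rate):
    """Naive DFT: maps each frequency bin (Hz) below half the sampling rate to its amplitude."""
    n = len(samples)
    spectrum = {}
    for k in range(n // 2):
        re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        im = sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        spectrum[k * sample_rate / n] = 2 * math.hypot(re, im) / n
    return spectrum

# Assumed example: 200 Hz (amplitude 1.0) plus 300 Hz (amplitude 0.5).
components = [(200, 1.0, 0.0), (300, 0.5, 0.0)]
wave = complex_wave(components, sample_rate=2000, n_samples=100)
spectrum = amplitude_spectrum(wave, sample_rate=2000)
peaks = sorted(f for f, a in spectrum.items() if a > 0.1)
f0 = fundamental_frequency([f for f, _, _ in components])
```

Here the spectrum has peaks only at 200 Hz and 300 Hz, yet the overall pattern repeats 100 times per second, so F0 is 100 Hz even though no component sits at that frequency.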
Some Sounds are Aperiodic • Waveforms with random or non-repeating patterns • Random aperiodic waveforms: white noise • Flat spectrum: equal amplitude for all frequency components • Transients: sudden bursts of pressure (clicks, pops, door slams) • Waveform shows a single impulse (click.wav) • Fourier analysis shows a flat spectrum • Some speech sounds, e.g. many consonants (e.g. cat.wav)
Speech Production • Voiced and voiceless sounds • Vocal fold vibration filtered by the vocal tract produces complex periodic waveform • Cycles per sec of lowest frequency component of signal = fundamental frequency (F0) • Fourier analysis yields power spectrum with component frequencies and amplitudes • F0 is first (lowest frequency) peak • Harmonics are integer multiples of F0; their amplitudes are shaped by the resonances of the vocal tract
Vocal fold vibration [UCLA Phonetics Lab demo]
alveolar post-alveolar/palatal dental velar uvular labial pharyngeal laryngeal/glottal Places of articulation http://www.chass.utoronto.ca/~danhall/phonetics/sammy.html
How do we capture speech for analysis? • Recording conditions • A quiet office, a sound booth, an anechoic chamber • Microphones • Analog devices (e.g. tape recorders) store and analyze continuous air pressure variations (speech) as a continuous signal • Digital devices (e.g. computers, DAT) first convert continuous signals into discrete signals (A-to-D conversion)
File format: • .wav, .aiff, .ds, .au, .sph,… • Conversion programs, e.g. sox • Storage • Function of how much information we store about speech in digitization • Higher quality, closer to original • More space (1000s of hours of speech take up a lot of space)
Sampling • Sampling rate: how often do we need to sample? • At least 2 samples per cycle to capture periodicity of a waveform component at a given frequency • 100 Hz waveform needs 200 samples per sec • Nyquist frequency: highest-frequency component captured with a given sampling rate (half the sampling rate)
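The arithmetic on this slide can be written out directly. This is a minimal sketch with hypothetical function names; it states the two-samples-per-cycle rule and nothing more.

```python
def min_sampling_rate(max_frequency_hz):
    """Lowest sampling rate giving 2 samples per cycle at the given frequency."""
    return 2 * max_frequency_hz

def nyquist_frequency(sampling_rate_hz):
    """Highest frequency component captured at the given sampling rate."""
    return sampling_rate_hz / 2

def samples_per_cycle(frequency_hz, sampling_rate_hz):
    """How many samples fall within one cycle of a component."""
    return sampling_rate_hz / frequency_hz

rate_for_100hz = min_sampling_rate(100)       # 200 samples per second
telephone_nyquist = nyquist_frequency(8000)   # 4000.0 Hz
```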
Sampling/storage tradeoff • Human hearing: ~20K Hz top frequency • Do we really need to store 40K samples per second of speech? • Telephone speech: 300-4K Hz (8K sampling) • But some speech sounds (e.g. fricatives /f/, /s/ and stop bursts /p/, /t/, /d/) have energy above 4K! • Peter/teeter/Dieter • 44.1K (CD quality audio) vs. 16-22K (usually good enough to study pitch, amplitude, duration, …)
Sampling Errors • Aliasing: • Signal’s frequency higher than half the sampling rate • Solutions: • Increase the sampling rate • Filter out frequencies above half the sampling rate (anti-aliasing filter)
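To see why aliasing is an error and not just a loss, the sketch below samples a 5000 Hz tone at 8000 Hz and compares it with a 3000 Hz tone sampled at the same rate. The tone frequencies and function names are assumptions chosen for illustration.

```python
import math

def alias_frequency(freq_hz, sampling_rate_hz):
    """Apparent frequency of a sampled tone, folded into 0..sampling_rate/2."""
    folded = freq_hz % sampling_rate_hz
    return min(folded, sampling_rate_hz - folded)

def sample_cosine(freq_hz, sampling_rate_hz, n_samples):
    """Sample a unit-amplitude cosine at the given rate."""
    return [math.cos(2 * math.pi * freq_hz * i / sampling_rate_hz)
            for i in range(n_samples)]

# 5000 Hz is above half of 8000 Hz, so it cannot be represented faithfully.
apparent = alias_frequency(5000, 8000)
too_high = sample_cosine(5000, 8000, 16)
alias = sample_cosine(apparent, 8000, 16)
indistinguishable = all(abs(a - b) < 1e-9 for a, b in zip(too_high, alias))
```

Once sampled, the two tones yield the same sample values, which is why the filtering has to happen before digitization.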
Quantization • Measuring the amplitude at sampling points: what resolution to choose? • Integer representation • 8, 12 or 16 bits per sample • Noise due to quantization steps reduced by higher resolution -- but requires more storage • How many different amplitude levels do we need to distinguish? • Choice depends on data and application (44.1K 16-bit stereo requires ~10 MB of storage per minute)
But clipping occurs when input volume is greater than range representable in digitized waveform • Increase the resolution • Decrease the amplitude
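A small sketch of quantization, its storage cost, and clipping follows. It assumes amplitudes normalized to the range -1.0 to 1.0 and signed integer codes; the function names are hypothetical.

```python
import math

def storage_bytes(sampling_rate_hz, bits_per_sample, channels, seconds):
    """Uncompressed storage needed for the given recording parameters."""
    return sampling_rate_hz * bits_per_sample * channels * seconds / 8

def quantize(x, bits):
    """Round x (nominally in -1.0..1.0) to the nearest signed integer level.

    Values outside the representable range are clamped, which is clipping.
    """
    levels = 2 ** (bits - 1)
    code = max(-levels, min(levels - 1, round(x * levels)))
    return code / levels

def max_quantization_error(samples, bits):
    """Largest difference between a sample and its quantized value."""
    return max(abs(s - quantize(s, bits)) for s in samples)

one_minute_cd = storage_bytes(44100, 16, 2, 60)   # about 10.6 million bytes

sine = [0.9 * math.sin(2 * math.pi * i / 100) for i in range(100)]
error_8bit = max_quantization_error(sine, 8)
error_16bit = max_quantization_error(sine, 16)

clipped = quantize(1.5, 16)   # input louder than the representable range
```

With 16 bits the worst-case rounding error is far smaller than with 8 bits, at twice the storage per sample. The clamped value for an input of 1.5 shows clipping: the waveform is flattened at the top of the range.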
What can we do if our data is ‘noisy’? • Acoustic filters block out certain frequencies of sounds • Low-pass filter blocks high frequency components of a waveform • High-pass filter blocks low frequencies • Rejectband (what to block) vs. passband (what to let through) • But if frequencies of two sounds overlap….source separation
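As a hedged illustration of a low-pass filter, the sketch below uses a plain moving average, which is a crude filter chosen for brevity and not one the slides name. The 100 Hz and 3000 Hz tones, the 8000 Hz sampling rate, and the window width of 8 are assumptions.

```python
import math

def moving_average_lowpass(samples, width):
    """Crude low-pass filter: replace each sample by the mean of the last `width` samples."""
    return [
        sum(samples[i - width + 1 : i + 1]) / width
        for i in range(width - 1, len(samples))
    ]

def rms(samples):
    """Root-mean-square amplitude."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def tone(freq_hz, sampling_rate_hz, n_samples):
    """Unit-amplitude sine wave."""
    return [math.sin(2 * math.pi * freq_hz * i / sampling_rate_hz)
            for i in range(n_samples)]

low = tone(100, 8000, 800)     # stands in for the sound we want to keep
high = tone(3000, 8000, 800)   # stands in for high-frequency noise

low_kept = rms(moving_average_lowpass(low, 8)) / rms(low)
high_kept = rms(moving_average_lowpass(high, 8)) / rms(high)
```

The low tone passes almost unchanged while the high tone is removed. A moving average has a poorly controlled passband, so real analyses would normally use a properly designed filter.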
How can we capture pitch contours, pitch range? • What is the pitch contour of this utterance? Is the pitch range of X greater than that of Y? • Pitch tracking: Estimate F0 over time as fn of vocal fold vibration • A periodic waveform is correlated with itself • One period looks much like another (cat.wav) • Find the period by finding the ‘lag’ (offset) between two windows on the signal for which the correlation of the windows is highest • Lag duration (T) is 1 period of waveform • Inverse is F0 (1/T)
Errors to watch for: • Halving: shortest lag calculated is too long (underestimate pitch) • Doubling: shortest lag too short (overestimate pitch) • Microprosody errors (e.g. /v/)
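The lag search described above can be sketched as follows. This is a simplified autocorrelation estimator for a single frame, not the algorithm of any particular pitch tracker. The 75-400 Hz search range, the 0.5 voicing threshold, and the function name are assumptions.

```python
import math

def estimate_f0(frame, sampling_rate_hz, f0_min=75.0, f0_max=400.0, threshold=0.5):
    """Return the estimated F0 in Hz, or 0.0 if the frame looks unvoiced."""
    min_lag = int(sampling_rate_hz / f0_max)
    max_lag = min(int(sampling_rate_hz / f0_min), len(frame) - 1)
    best_lag, best_corr = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        a, b = frame[:-lag], frame[lag:]   # two windows offset by `lag` samples
        energy = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        corr = sum(x * y for x, y in zip(a, b)) / energy if energy > 0 else 0.0
        # The small margin keeps the shortest of several near-equal peaks,
        # which guards against picking a multiple of the true period (halving).
        if corr > best_corr + 1e-9:
            best_lag, best_corr = lag, corr
    if best_lag == 0 or best_corr < threshold:
        return 0.0
    return sampling_rate_hz / best_lag   # F0 = 1 / T

# Synthetic "voiced" frame: 125 Hz fundamental plus two harmonics, 50 ms at 8000 Hz.
voiced = [
    sum(amp * math.sin(2 * math.pi * 125 * h * i / 8000)
        for h, amp in ((1, 1.0), (2, 0.5), (3, 0.25)))
    for i in range(400)
]
f0 = estimate_f0(voiced, 8000)
silence_f0 = estimate_f0([0.0] * 400, 8000)
```

Restricting the lag range to plausible F0 values is one simple way to limit halving and doubling errors, at the cost of failing on speakers outside that range.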
Sample Analysis File: Pitch Track Header • version 1 • type_code 4 • frequency 12000.000000 • samples 160768 • start_time 0.000000 • end_time 13.397333 • bandwidth 6000.000000 • dimensions 1 • maximum 9660.000000 • minimum -17384.000000 • time Sat Nov 2 15:55:50 1991 • operation record: padding xxxxxxxxxxxx
Sample Analysis File: Pitch Track Data
F0        Pvoicing   Energy    A/C Score
147.896   1          2154.07   0.902643
140.894   1          1544.93   0.967008
138.05    1          1080.55   0.92588
130.399   1          745.262   0.595265
0         0          567.153   0.504029
0         0          638.037   0.222939
0         0          670.936   0.370024
0         0          790.751   0.357141
141.215   1          1281.1    0.904345
Pitch Perception • But do pitch trackers capture what humans perceive? • Auditory system’s perception of pitch is non-linear • Sounds at lower frequencies with same difference in absolute frequency sound more different than those at higher frequencies (male vs. female speech) • Bark scale (Zwicker) and other models of perceived difference
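One way to see the non-linearity is to convert frequencies to Bark and compare equal steps in Hz. The sketch assumes the Zwicker and Terhardt (1980) approximation of the Bark scale, which is one of several published formulas.

```python
import math

def hz_to_bark(freq_hz):
    """Zwicker & Terhardt (1980) approximation of the Bark scale."""
    return 13 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500) ** 2)

# The same 100 Hz difference, low versus high in the frequency range.
low_step = hz_to_bark(200) - hz_to_bark(100)
high_step = hz_to_bark(5100) - hz_to_bark(5000)
```

Under this formula the 100 Hz step at the low end spans close to one Bark, while the same step around 5000 Hz spans only about a tenth of a Bark.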
How do we capture loudness/intensity? • Is one utterance louder than another? • Energy closely correlated experimentally with perceived loudness • For each window, square the amplitude values of the samples, take their mean, and take the root of that mean (RMS energy) • What size window? • Longer windows produce smoother amplitude traces but miss sudden acoustic events
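The windowed computation above can be sketched as below. The window and hop sizes are assumptions picked for the toy example, and the function name is hypothetical.

```python
import math

def rms_energy(samples, window_size, hop_size):
    """RMS value of each window: square the samples, average, take the square root."""
    return [
        math.sqrt(sum(s * s for s in samples[start:start + window_size]) / window_size)
        for start in range(0, len(samples) - window_size + 1, hop_size)
    ]

# Toy signal: a quiet stretch followed by a loud stretch.
signal = [0.1, -0.1] * 50 + [1.0, -1.0] * 50
trace = rms_energy(signal, window_size=20, hop_size=20)    # 10 values
smooth = rms_energy(signal, window_size=200, hop_size=200)  # 1 value
```

With 20-sample windows the trace jumps from about 0.1 to 1.0 halfway through. A single 200-sample window averages the two stretches together and hides the change.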
Perception of Loudness • But the relation is non-linear: sones or decibels (dB) • Differences in soft sounds more salient than loud • Intensity proportional to square of amplitude so…intensity of sound with pressure x vs. reference sound with pressure r = x²/r² • bel: base 10 log of ratio • decibel: 1/10 of a bel (10 decibels = 1 bel) • dB = 10 log10(x²/r²) = 20 log10(x/r) • Absolute: relative to 20 µPa, the lowest audible pressure fluctuation of a 1000 Hz tone (typical threshold level for a tone at that frequency)
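The decibel calculation can be checked with a few lines. The sketch assumes pressures in micropascals and a reference of 20 µPa; the function name is hypothetical.

```python
import math

REFERENCE_UPA = 20.0  # assumed reference: threshold of hearing for a 1000 Hz tone

def pressure_to_db(pressure_upa, reference_upa=REFERENCE_UPA):
    """Level in dB: 10 * log10 of the intensity ratio (pressure ratio squared)."""
    intensity_ratio = (pressure_upa ** 2) / (reference_upa ** 2)
    return 10 * math.log10(intensity_ratio)

threshold_db = pressure_to_db(20)             # 0 dB
whisper_db = pressure_to_db(200)              # 20 dB
conversation_db = pressure_to_db(20_000)      # 60 dB
damage_db = pressure_to_db(200_000_000)       # 140 dB
```

Each tenfold increase in pressure adds 20 dB, because the intensity ratio grows a hundredfold.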
How do we capture…. • For utterances X and Y • Pitch contour: Same or different? • Pitch range: Is X larger than Y? • Duration: Is utterance X longer than utterance Y? • Speaking rate: Is the speaker of X speaking faster than the speaker of Y? • Voice quality….
Next Class • Tools for the Masses: Read the Praat tutorial • Download Praat from the course syllabus page and play with a speech file (e.g. http://www.cs.columbia.edu/~julia/cs4706/cc_001_sadness_1669.04_August-second-.wav or record your own)