
Pitch-synchronous overlap add (TD-PSOLA)


kiral




Presentation Transcript


  1. Pitch-synchronous overlap add (TD-PSOLA) Purpose: Modify the pitch or timing of a signal • PSOLA is a time domain algorithm • Pseudo code • Find the pitch points of the signal • Apply a Hanning window centered on each pitch point and extending to the previous and next pitch points • Overlap-add the windowed segments back together • To slow down speech, duplicate frames • To speed up, remove frames • Hanning windowing preserves signal energy • Undetectable if epochs are accurately found. Why? We are not altering the vocal filter, only changing the spacing of the pitch periods
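
The window-and-add step can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: it assumes the epochs are already known, and the names `hann`, `extract_frames`, and `overlap_add` are hypothetical helpers introduced here.

```python
import math

def hann(length):
    """Symmetric Hann window with `length` samples."""
    if length < 2:
        return [1.0] * length
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (length - 1)) for i in range(length)]

def extract_frames(signal, epochs):
    """Cut one windowed frame per interior epoch, spanning previous to next epoch."""
    frames = []
    for k in range(1, len(epochs) - 1):
        start, end = epochs[k - 1], epochs[k + 1]
        window = hann(end - start + 1)
        samples = [signal[start + i] * window[i] for i in range(end - start + 1)]
        frames.append((epochs[k] - start, samples))  # (epoch offset inside frame, samples)
    return frames

def overlap_add(frames, centers, length):
    """Place each frame so its epoch lands on the matching center, summing overlaps."""
    out = [0.0] * length
    for (offset, samples), center in zip(frames, centers):
        for i, value in enumerate(samples):
            j = center - offset + i
            if 0 <= j < length:
                out[j] += value
    return out

# Example: a tone with a 10-sample period and (assumed) epochs every 10 samples.
signal = [math.sin(2 * math.pi * n / 10) for n in range(51)]
epochs = [0, 10, 20, 30, 40, 50]
frames = extract_frames(signal, epochs)
same = overlap_add(frames, epochs[1:-1], len(signal))     # original spacing
higher = overlap_add(frames, [10, 18, 26, 34], len(signal))  # closer spacing raises pitch
```

Re-adding the frames at their original epochs reproduces the signal between the first and last interior epoch, because adjacent Hann windows sum to one. Placing the same frames closer together or farther apart changes the pitch, and repeating or dropping frames changes the duration.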

  2. TD-PSOLA Illustrations Pitch (window and add) Duration (insert or remove)

  3. TD-PSOLA Pitch Points (Epochs) • TD-PSOLA requires an exact marking of pitch points in a time domain signal • Pitch mark • Marking any point within a pitch period is okay as long as the algorithm marks the same point in every frame (pitch period) • The most common marking point is the instant of glottal closure, which shows up as a quick time domain descent • Create an array of sample numbers that comprises the analysis epoch sequence $P = \{p_1, p_2, \ldots, p_n\}$ • Estimate the local pitch period at epoch $p_k$ as $(p_{k+1} - p_{k-1})/2$
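
As a small worked example, the local pitch period can be computed directly from an epoch array. The sketch below assumes the period at an epoch is taken as half the distance between its two neighboring epochs; the function names and the 16 kHz sample rate are illustrative assumptions.

```python
def local_pitch_periods(epochs):
    """Return (epoch, period_in_samples) for every epoch that has two neighbors."""
    return [(epochs[k], (epochs[k + 1] - epochs[k - 1]) / 2)
            for k in range(1, len(epochs) - 1)]

def f0_from_period(period_samples, sample_rate):
    """Convert a pitch period in samples to a fundamental frequency in Hz."""
    return sample_rate / period_samples

periods = local_pitch_periods([0, 100, 202, 300])
pitch_hz = f0_from_period(periods[1][1], 16000)
```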

  4. TD-PSOLA Evaluation • Advantages • As a time domain algorithm, it is unlikely that any other approach will be more efficient (O(N)) • Listeners cannot perceive signal alterations of up to 50% • Disadvantages • Epoch marking must be exact • Only pitch and timing changes are possible

  5. Time Domain Pitch Detection • Auto Correlation • Correlate a window of speech with a previous window • Find the best match • Issue: too many false peaks • Peak and center clipping • Algorithm to reduce false peaks • Clip the top/bottom of a signal • Center the remainder around 0 • Other alternatives • Researchers propose many other pitch detection algorithms • There is much debate as to which is the best
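
Center clipping is simple enough to show directly. The following is a sketch under assumptions: the clipping level is set to a fraction of the frame's peak (the `ratio` parameter is a hypothetical knob, 0.3 chosen only for illustration), and the frame is assumed to be non-empty.

```python
def center_clip(frame, ratio=0.3):
    """Zero out samples within +/- ratio*peak and shift the rest toward 0."""
    limit = ratio * max(abs(x) for x in frame)
    clipped = []
    for x in frame:
        if x > limit:
            clipped.append(x - limit)
        elif x < -limit:
            clipped.append(x + limit)
        else:
            clipped.append(0.0)
    return clipped

clipped = center_clip([1.0, 0.2, -0.5, -1.0])
```

Only the strongest excursions survive, which tends to leave fewer spurious peaks for a later autocorrelation pass.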

  6. Auto Correlation • Auto Correlation: $R(k) = \frac{1}{M}\sum_{n=0}^{M-1} x_n x_{n-k}$, where $x_{n-k} = 0$ if $n-k < 0$. Find the lag $k$ that maximizes the sum • Difference Function: $D(k) = \frac{1}{M}\sum_{n=0}^{M-1} |x_n - x_{n-k}|$, where $x_{n-k} = 0$ if $n-k < 0$. Find the lag $k$ that minimizes the sum • Considerations • Difference approach is faster • Both can get false positives • The YIN algorithm combines both techniques
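
Both measures can be written almost verbatim from their definitions. This is an illustrative sketch, with samples before the start of the frame treated as zero; the lag search range passed to `best_lag` is an assumption the caller must supply.

```python
import math

def autocorrelation(x, k):
    """(1/M) * sum of x[n] * x[n-k]; terms with n-k < 0 contribute zero."""
    m = len(x)
    return sum(x[n] * x[n - k] for n in range(k, m)) / m

def difference(x, k):
    """(1/M) * sum of |x[n] - x[n-k]|, with x[n-k] = 0 when n-k < 0."""
    m = len(x)
    return sum(abs(x[n] - (x[n - k] if n >= k else 0.0)) for n in range(m)) / m

def best_lag(x, min_lag, max_lag, use_difference=False):
    """Pick the lag that maximizes autocorrelation or minimizes the difference."""
    lags = range(min_lag, max_lag + 1)
    if use_difference:
        return min(lags, key=lambda k: difference(x, k))
    return max(lags, key=lambda k: autocorrelation(x, k))

# A tone with a 20-sample period: both measures should pick a lag of 20.
tone = [math.sin(2 * math.pi * n / 20) for n in range(200)]
lag_ac = best_lag(tone, 10, 30)
lag_diff = best_lag(tone, 10, 30, use_difference=True)
```

Restricting the lag range matters: a lag of 0 trivially wins both searches, and multiples of the true period are the usual source of false positives.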

  7. Harmonic Product Spectrum Pseudo Code • Divide the signal into frames (20-30 ms long) • Perform an FFT • Downsample the FFT magnitude spectrum by factors of 2, 3, 4 (taking every 2nd, 3rd, 4th value) • Multiply the original and downsampled spectra together (equivalently, add their logarithms) • The pitch harmonics will line up (the spectrum will "spike" at the pitch value) • Find the spike: return fsample / fftSize * index
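
A compact sketch of the harmonic product spectrum follows. It assumes the downsampled spectra are combined by multiplication, as the name suggests, and it uses a slow direct DFT so the example needs only the standard library; a real implementation would use an FFT routine.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Magnitudes of the first half of a direct (O(N^2)) DFT."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2)]

def hps_pitch(frame, sample_rate, harmonics=4):
    """Multiply the spectrum by its copies downsampled by 2..harmonics; return the peak in Hz."""
    spectrum = magnitude_spectrum(frame)
    limit = len(spectrum) // harmonics
    product = []
    for i in range(limit):
        value = 1.0
        for h in range(1, harmonics + 1):
            value *= spectrum[i * h]  # every h-th bin is the spectrum downsampled by h
        product.append(value)
    index = max(range(1, limit), key=lambda i: product[i])  # skip the DC bin
    return sample_rate / len(frame) * index

# 250 Hz fundamental plus three weaker harmonics, sampled at 8 kHz.
frame = [sum(math.sin(2 * math.pi * h * 8 * t / 256) / h for h in range(1, 5))
         for t in range(256)]
pitch = hps_pitch(frame, 8000)
```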

  8. Frequency Spectrum

  9. Background Noise • Definition: an unwanted sound or an unwanted perturbation to a wanted signal • Examples: • Clicks from microphone synchronization • Ambient noise level: background noise • Roadway noise • Machinery • Additional speakers • Background activities: TV, radio, dog barks, etc. • Classifications • Stationary: doesn't change with time (e.g., a fan) • Non-stationary: changes with time (e.g., a door closing, a TV)

  10. Noise Spectra Power measured relative to frequency f • White Noise: constant over the range of f • Pink Noise: decreases by 3 dB per octave (proportional to $1/f$); perceived as equal across f • Brown(ian): decreases proportional to $1/f^2$ (6 dB per octave) • Red: decreases with f (either pink or brown) • Blue: increases proportional to f • Violet: increases proportional to $f^2$ • Gray: proportional to a psycho-acoustical curve • Orange: bands of 0 around musical notes • Green: noise of the world; pink, with a bump near 500 Hz • Black: 0 everywhere except in spikes, where it follows $1/f^\beta$ with $\beta > 2$ • Colored: any noise that is not white Audio samples: http://en.wikipedia.org/wiki/Colors_of_noise Signal Processing Information Base: http://spib.rice.edu/spib.html
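
The per-octave figures follow from the power law itself. As a worked check, assuming power proportional to $f^{e}$ for some exponent $e$, doubling the frequency changes the level by $10 \log_{10}(2^{e})$ dB; the function name below is illustrative.

```python
import math

def db_change_per_octave(exponent):
    """Level change in dB when frequency doubles, for power proportional to f**exponent."""
    return 10 * math.log10(2 ** exponent)

pink = db_change_per_octave(-1)    # power ~ 1/f
brown = db_change_per_octave(-2)   # power ~ 1/f^2
blue = db_change_per_octave(1)     # power ~ f
violet = db_change_per_octave(2)   # power ~ f^2
```

Pink noise therefore falls by about 3 dB per octave and brown noise by about 6 dB, while blue and violet rise by the same amounts.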

  11. Applications • ASR: Prevent significant degradation in noisy environments Goal: Minimize recognition degradation with noise present • Sound Editing and Archival: • Improve intelligibility of audio recordings • Goals: Eliminate perceptible noise; recover audio from wax recordings • Mobile Telephony: • Transmission of audio in high noise environments • Goal: Reduce transmission requirements • Comparing audio signals • A variety of digital signal processing applications • Goal: Normalize audio signals for ease of comparison

  12. Signal to Noise Ratio (SNR) • Definition: Power ratio between a signal and the noise that interferes with it. • Standard equation in decibels: $SNR_{dB} = 10 \log_{10}\left((A_{signal}/A_{noise})^2\right) = 20 \log_{10}(A_{signal}/A_{noise})$ • For digitized speech: $SNR_f = 10 \log_{10}\left(P(signal)/P(noise)\right) = 10 \log_{10}\left(\sum_{n=0}^{N-1} s_f(n)^2 \Big/ \sum_{n=0}^{N-1} n_f(n)^2\right)$ • $s_f$ is an array holding samples from a frame • $n_f$ is an array of noise samples • Note: if $s_f(n) = n_f(n)$ for all n, $SNR_f = 0$
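
Both forms of the SNR are one-liners in code. The sketch below assumes the signal and noise frames have the same length and that the noise frame is not all zeros; the function names are illustrative.

```python
import math

def snr_db(signal_frame, noise_frame):
    """Frame SNR in dB from the ratio of summed squared samples."""
    signal_power = sum(s * s for s in signal_frame)
    noise_power = sum(v * v for v in noise_frame)
    return 10 * math.log10(signal_power / noise_power)

def snr_db_from_amplitudes(signal_amplitude, noise_amplitude):
    """SNR in dB from an amplitude ratio."""
    return 20 * math.log10(signal_amplitude / noise_amplitude)

doubled = snr_db([2.0, 2.0], [1.0, 1.0])   # twice the amplitude, about 6 dB
equal = snr_db([0.5, -0.5], [0.5, -0.5])   # identical frames give 0 dB
tenfold = snr_db_from_amplitudes(10.0, 1.0)
```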

  13. Stationary Noise Suppression • Requirements • Maximize the amount of noise removed • Minimize signal distortion • Efficient algorithm with low big-Oh complexity • Problems • Tradeoff between removing noise and distorting the signal • More noise removal tends to distort the signal • Popular approaches • Time domain: Moving average filter (distorts frequency domain) • Frequency domain: Spectral Subtraction • Time domain: Wiener filter (using LPC)

  14. Autoregression Noise Removal • Definition: An autoregressive process is one where a value can be determined by a linear combination of previous values • Formula: $X_t = c + \sum_{i=1}^{P} a_i X_{t-i} + n_t$, where c is a constant, $n_t$ is the noise, and the summation is the pure signal • This is none other than linear prediction; the noise is the residual. • Applying the LPC filter to the signal separates noise from signal (Wiener Filter)
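
To make the "noise is the residual" idea concrete, the sketch below subtracts a linear prediction from each sample. It assumes the coefficients are already known (estimating them, as LPC analysis does, is not shown), and `ar_residual` is a hypothetical helper name.

```python
import math

def ar_residual(x, coeffs, c=0.0):
    """Residual x[t] - (c + sum_i coeffs[i-1] * x[t-i]) for t >= len(coeffs)."""
    order = len(coeffs)
    residual = []
    for t in range(order, len(x)):
        prediction = c + sum(coeffs[i] * x[t - 1 - i] for i in range(order))
        residual.append(x[t] - prediction)
    return residual

# A pure cosine is predicted exactly by an order-2 model, so nothing is left over.
tone = [math.cos(0.3 * t) for t in range(50)]
residual = ar_residual(tone, [2 * math.cos(0.3), -1.0])
```

For a noisy input the residual would carry whatever the predictor cannot explain, which is the part this approach treats as noise.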

  15. Spectral Subtraction Assumption: Noisy signal: $y_t = s_t + n_t$, where $s_t$ is the clean signal and $n_t$ is additive noise • Perform FFT on all windowed frames • IF speech not present: update the estimate of the noise spectrum { $\sigma n_t + (1 - \sigma) n_{t-1}$, $0 \le \sigma \le 1$ } • ELSE: subtract the estimated noise spectrum • Perform an inverse FFT • S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoustics, Speech, Signal Processing, vol. ASSP-27, Apr. 1979.
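
The per-frame steps might look like the following sketch. It is a simplified illustration rather than Boll's full method: it works on magnitudes, keeps the noisy phase, floors negative results at zero, and uses a slow direct DFT to stay within the standard library. The smoothing weight `sigma=0.1` is an arbitrary illustrative default.

```python
import cmath
import math

def dft(frame):
    """Direct (O(N^2)) discrete Fourier transform."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def update_noise(noise_mag, frame_mag, sigma=0.1):
    """Blend the current non-speech frame into the running noise estimate."""
    return [sigma * m + (1 - sigma) * n for m, n in zip(frame_mag, noise_mag)]

def subtract(frame, noise_mag):
    """Subtract the noise magnitude per bin, keep the noisy phase, floor at zero."""
    cleaned = []
    for value, noise in zip(dft(frame), noise_mag):
        magnitude = max(abs(value) - noise, 0.0)
        cleaned.append(cmath.rect(magnitude, cmath.phase(value)))
    return idft(cleaned)

# A tone in bin 4 plus a weaker interfering tone in bin 10 standing in for noise.
clean = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
noisy = [c + 0.2 * math.sin(2 * math.pi * 10 * t / 32) for t, c in enumerate(clean)]
noise_mag = [0.0] * 32
noise_mag[10] = noise_mag[22] = 0.2 * 16  # magnitude of the interfering tone and its mirror bin
restored = subtract(noisy, noise_mag)
```

The example is deliberately idealized, since the noise estimate matches the interference exactly. With real noise the estimate is only an average, which is why residual artifacts appear.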

  16. Implementation Issues • Question: How do we estimate the noise? Answer: Use the frequency distribution during times when no voice is present • Question: How do we know when voice is present? Answer: Use Voice Activity Detection (VAD) algorithms • Question: Even if we know the noise amplitudes, what about phase differences between the clean and noisy signals? Answer: Human hearing largely ignores phase differences • Question: Is the noise independent of the signal? Answer: We assume that the noise is linear (additive) and does not interact with the signal. • Question: Are noise distributions really stationary? Answer: We assume yes.

  17. Phase Distortions • Problem: We don't know how much of the phase in an FFT comes from the noise and how much from the speech. • Assumption: The algorithm assumes the phase of both is the same (that of the noisy signal). • Result: When the SNR approaches 0 dB, the audio has a hoarse-sounding voice. • Why? The phase assumption means that the expected noise magnitude is incorrectly calculated. • Conclusion: There is a limit to the utility of spectral subtraction when the SNR is close to 0 dB

  18. Evaluation • Advantage: Easy to understand and implement • Disadvantages • The noise estimate is not exact • When too high, speech portions will be lost • When too low, some noise remains • When the noise estimate in a frequency bin exceeds the noisy signal's magnitude in that bin, the subtraction yields a negative magnitude, which causes musical tone artifacts • Non-linear or interacting noise • Negligible with large SNR values • Significant impact when SNR is small

  19. Musical noise Definition: Random isolated tone bursts across the frequency spectrum. Why? Most implementations set frequency bin magnitudes to zero if noise reduction would cause them to become negative • Figure legend: green dashes: noisy signal; solid line: noise estimate; black dots: projected clean signal

  20. Spectral Subtraction Enhancements • Eliminate negative magnitudes • Reduce the noise estimates by some factor • Vary the noise estimate factor in different frequency bands • Larger in regions outside of the human speech range • Apply psycho-acoustical methods • Only attempt to remove perceived noise, not all noise • Human hearing masks sounds of adjacent frequencies • A loud sound masks sounds even after it ceases • Adaptive noise estimation: Nt(f) = λFGt(p-1)+(1-λF)Nt-1(f)
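
The first three enhancements can be combined in one small routine. This is a hedged sketch, not a published algorithm: `factors` is an assumed per-bin scale on the noise estimate, and `floor` is an assumed fraction of the noisy magnitude used as a lower bound in place of a hard zero.

```python
def subtract_scaled(noisy_mag, noise_mag, factors, floor=0.0):
    """Per-bin subtraction with a scaled noise estimate and a lower bound."""
    cleaned = []
    for y, n, a in zip(noisy_mag, noise_mag, factors):
        cleaned.append(max(y - a * n, floor * y))  # never drop below floor * noisy magnitude
    return cleaned

noisy_mag = [10.0, 2.0, 5.0]
noise_mag = [4.0, 4.0, 1.0]
factors = [1.0, 1.0, 2.0]  # e.g. a larger factor for a bin outside the speech range
hard = subtract_scaled(noisy_mag, noise_mag, factors)
soft = subtract_scaled(noisy_mag, noise_mag, factors, floor=0.1)
```

Leaving a small floor instead of zero keeps isolated bins from switching fully on and off between frames, which is one way to make residual tone bursts less noticeable.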

  21. Threshold of Hearing

  22. Masking

  23. Acoustical Effects • Characteristic Frequency (CF): The frequency that causes the maximum response at a point of the cochlea's basilar membrane • Neurons exhibit a maximum response for 20 ms and then decrease to a steady state, which ends shortly after the stimulus is removed • Masking effects can be simultaneous or temporal • Simultaneous: one signal drowns out another • Temporal: one signal masks others that occur close to it in time • Forward: the masking persists after the masker is removed (5 ms–150 ms) • Backward: a weak signal is masked by a strong one that follows it (5 ms)

  24. Voice Activity Detector (VAD) • Many VAD algorithms exist • Possible approaches to consider • Energy above background noise • Low Zero crossing rate • Determine if pitch is present • Low fractal dimensions compared to pure noise • Low LPC residual • General principle: It is better to misclassify noise as speech than to misclassify speech as noise • Standard algorithms: telephone/cell phone environments

  25. Possible VAD algorithm Note: the energy and 0-crossing statistics of the noise are estimated from the initial ¼ second
      boolean vad(double[] frame)  // returns true if speech is present
        IF frame energy < low noise threshold (standard deviation units) RETURN false
        IF frame energy > high noise threshold RETURN true
        FOR forward frames
          IF forward frame energy < low noise threshold RETURN false
          IF forward frame energy > high noise threshold
            FOR previous ¼ second of frames
              COUNT previous frames having a large 0-crossing rate
            IF count > 0-crossing threshold (standard deviation units)
              IF this frame index > index of first frame with 0-crossing rate > threshold RETURN true
        RETURN false
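
A runnable approximation of the energy part of this detector is sketched below. It is deliberately simplified: thresholds are set a fixed number of standard deviations above the mean noise energy (`low_k`, `high_k`, and `lookahead` are assumed parameters), and the zero-crossing look-back is left out, with `zero_crossing_rate` shown only as the feature that step would count.

```python
def frame_energy(frame):
    """Mean squared amplitude of a frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)

def make_vad(noise_frames, low_k=2.0, high_k=5.0):
    """Build a detector from frames assumed to contain only background noise."""
    energies = [frame_energy(f) for f in noise_frames]
    mean = sum(energies) / len(energies)
    std = (sum((e - mean) ** 2 for e in energies) / len(energies)) ** 0.5
    low, high = mean + low_k * std, mean + high_k * std

    def vad(frames, index, lookahead=5):
        """Return True if frames[index] is judged to contain speech."""
        energy = frame_energy(frames[index])
        if energy < low:
            return False
        if energy > high:
            return True
        # Ambiguous energy: decide from the frames that follow.
        for frame in frames[index + 1:index + 1 + lookahead]:
            forward = frame_energy(frame)
            if forward < low:
                return False
            if forward > high:
                return True
        return False

    return vad

noise = [[0.01, -0.01] * 8, [0.02, -0.02] * 8, [0.01, -0.02] * 8]
quiet, medium, loud = [0.01, -0.01] * 8, [0.025, -0.025] * 8, [0.5, -0.5] * 8
frames = [quiet, medium, loud, medium, quiet]
vad = make_vad(noise)
decisions = [vad(frames, i) for i in range(len(frames))]
```

The medium-energy frame is classified by its context: it counts as speech when a loud frame follows and as noise when the signal drops back toward the background level.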
